Chapter 1: Introduction

Published: 2021
Crispin Gardiner, "Introduction", Elements of Stochastic Methods.
The everyday world around us is a mixture of predictability and chance. The “exact sciences,” such as physics or chemistry, are built on the idea that there are definite laws governing observable phenomena, and these sciences have had extraordinary success in elucidating the relevant laws, and in using these laws to describe the world.
Life sciences and social sciences are in a different position. In describing the growth and decline of biological populations, or working out how money makes the world go round, there is always the tantalizingly obvious fact that there is an element of regularity in the phenomena observed, but at the same time nothing is exactly predictable. Economists are willing to say that a financial crisis will happen, but can’t say when—and even they will admit they may be wrong. A reliable medical test always has some false positives and negatives, and even something as certain as death comes at a time that can never be determined precisely.
Even in the case of the exact sciences, measured data are only connected with theoretical predictions with a certain margin of error, which in practice can usually be made very small, but cannot be entirely eliminated. And when looked at on the level of individual atoms, all processes become probabilistic, and quantum mechanics provides the appropriate description.
Probability theory is almost unquestioned as the correct way to handle the reality that the information we have about the world is never exact. In some sense, probability theory is the form that logic takes when nothing is absolutely certain. We will therefore start with an outline of the fundamentals of probability theory.
1.1 Events and their Probabilities
For our purposes, an event is a general concept, which covers ideas such as:
There are n bacteria in a sample.
A particle is within a small volume $d^3x$ centered on the position $\mathbf{x}$.
There were $n_1$ viruses in a cell at time $t_1$ and $n_2$ viruses in the cell at time $t_2$.
These three possibilities exemplify the kinds of situation with which we would want to associate the idea of a probability.
1.1.1 Sets of Events
Mostly, events fall into two categories, those that are specified by sets of integers, and those characterized by sets of continuous variables. The first kind is specified by a vector of integers,
$\mathbf{n} = (n_1, n_2, \ldots).$
These are countable events.
The second kind of event is not countable, and is specified by a vector of real numbers,
$\mathbf{x} = (x_1, x_2, \ldots).$
More generally, it is useful to consider sets of events, which we can label as A, B, C…, etc. For countable events, a set can be specified by listing the events contained in it. The second kind of set can be specified in terms of unions and intersections of volumes ΔV in the space to which x belongs.
1.1.2 Notations
We will use the notations $A \cup B$ for the union of two sets, $A \cap B$ for their intersection, $\bar{A}$ for the complement of $A$, $\emptyset$ for the empty set, and $\Omega$ for the space of all events under consideration.
1.1.3 Probabilities
Probability is most simply defined in terms of sets of events, A, within the space of all events of the kind we wish to consider. We introduce the quantity P(A) as the probability that an arbitrary event is contained in A.
Axioms: The probability must satisfy the following probability axioms for all sets:
Positivity: This is a formalization of the intuitive belief that a probability is proportional to the number of times that something happens, which is clearly either positive or zero:
$P(A) \geq 0. \quad (1.6)$
Completeness: This is the expression of the fact that every event is certain to be contained within $\Omega$:
$P(\Omega) = 1. \quad (1.7)$
Mutually Exclusive Events: If $A_i\ (i = 1, 2, 3, \ldots)$ is a countable (but possibly infinite) collection of nonoverlapping sets, that is,
$A_i \cap A_j = \emptyset \quad \text{for all } i \neq j, \quad (1.8)$
then
$P(\cup_i A_i) = \sum_i P(A_i). \quad (1.9)$
The condition that the sets are nonoverlapping is the formal statement that events in the various sets are mutually exclusive, and the axiom states that their probabilities simply add.
The number of sets must be countable because of the existence of sets labeled by a continuous index, for example $\mathbf{x}$, the position in space. The probability of a molecule being in the set whose only element is $\mathbf{x}$ is zero, but the probability of it being in a region $R$ of finite volume, or an infinitesimal region such as $d^3x$, is nonzero. The regions $R$ and $d^3x$ can both be expressed as a union of sets of the form $\{\mathbf{x}\}$, but not a countable union. Thus axiom (iii) is not applicable, and the probability of being in $R$ or $d^3x$ cannot be expressed as the sum of the probabilities of being in $\{\mathbf{x}\}$.
Corollaries: Two further facts follow from the axioms.
If $\bar{A}$ is the complement of $A$, i.e., the set of all events not contained in $A$, then $A \cup \bar{A} = \Omega$, and hence from (ii),
$P(\bar{A}) = 1 - P(A). \quad (1.10)$
As a special case, since $\emptyset = \bar{\Omega}$, it follows that
$P(\emptyset) = 0. \quad (1.11)$
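As a quick illustration, the axioms and the two corollaries can be checked on a finite event space. The sketch below assumes a fair six-sided die with the uniform measure $P(A) = |A|/|\Omega|$ (the die example is discussed further in Sec. 1.1.4):

```python
from fractions import Fraction

# A sketch of the probability axioms on a finite event space: a fair
# six-sided die with the uniform measure P(A) = |A| / |Omega|.
Omega = frozenset({1, 2, 3, 4, 5, 6})

def P(A):
    return Fraction(len(Omega & set(A)), len(Omega))

# Axiom (i), positivity, and axiom (ii), completeness:
assert P({2, 4}) >= 0
assert P(Omega) == 1
# Axiom (iii): the sets {2, 4} and {6} are nonoverlapping, so the
# probabilities simply add.
assert P({2, 4} | {6}) == P({2, 4}) + P({6})
# Corollaries (1.10) and (1.11):
A = {1, 2, 3}
assert P(Omega - A) == 1 - P(A)
assert P(set()) == 0
```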
1.1.4 Relating Probability to the Real World
The probabilities that we have introduced cannot be directly and rigorously related to the real world. The classic example of tossing dice illustrates this immediately. Intuitively, we expect each of the values 1 to 6 will have the same probability of occurring. Obviously, it is possible to weight the dice to favor a particular number, or perhaps to use some sleight of hand to toss the dice to achieve the same end. We exclude this: the dice must be constructed "fairly" and tossed "fairly." This means that we must construct and toss the dice so that each outcome is uncertain and equally likely. The reasoning is, of course, circular.
By eliminating what we now think of as intuitive ideas and axiomatizing probability, Kolmogorov [1.1] cleared the road for a rigorous development of mathematical probability. His insight was to recognize that the definition of what we mean by probability in the real world is not a mathematical question, and that the above axioms are both in correspondence with reality, and sufficient to formulate probability as a branch of mathematics.
The simplest way of looking at axiomatic probability is as a formal method of manipulating probabilities using the axioms. In order to apply the theory, the probability space must be defined and the probability measure P assigned. These are a priori probabilities, which are assigned on grounds appropriate to the system under study. The task of applying probability is:
To assume some set of a priori probabilities that seem reasonable and to deduce results from this and from the structure of the probability space.
To measure experimental results with some apparatus that is constructed to measure quantities in accordance with these a priori probabilities.
1.2 Joint and Conditional Probabilities
1.2.1 Joint Probabilities
We explained in Sec. 1.1.3 how the occurrence of mutually exclusive events is related to the concept of nonintersecting sets. We now consider the concept $P(A \cap B)$, where the intersection $A \cap B$ is nonempty. An event $a$ within $A$ will only be within $A \cap B$ if it is also within $B$ as well, hence
$(a \in A \cap B) \iff (a \in A) \text{ and } (a \in B),$
and $P(A \cap B)$ is called the joint probability that the event $a$ is contained in both classes, that is, that both the event $a \in A$ and the event $a \in B$ occur.
1.2.2 Relationship Between Joint Probabilities of Different Orders
Suppose that we have a collection of sets $B_i$ such that
$\cup_i B_i = \Omega \quad \text{and} \quad B_i \cap B_j = \emptyset \ \text{ for } i \neq j,$
so that the sets divide up the space $\Omega$ into nonoverlapping subsets.
Then
$A = A \cap \Omega = \cup_i (A \cap B_i).$
Using now the probability axiom (iii), we see that the $A \cap B_i$ are a countable collection of nonoverlapping sets, and therefore satisfy the conditions on the $A_i$ used there. Hence
$P(A) = \sum_i P(A \cap B_i),$
and more generally, for any further set $C$,
$P(A \cap C) = \sum_i P(A \cap B_i \cap C).$
Thus, summing over all mutually exclusive possibilities of B in the joint probability eliminates that variable.
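The elimination of a variable by summing over a partition can be seen in a small discrete example. The joint table below is a hypothetical distribution over two variables; summing over the mutually exclusive values of the second variable recovers the distribution of the first:

```python
# Summing a joint probability over all mutually exclusive values of one
# variable eliminates that variable.  The joint table P(a, b) is a
# hypothetical example distribution over two discrete variables.
joint = {
    ('rain', 'cold'): 0.3,
    ('rain', 'warm'): 0.1,
    ('dry',  'cold'): 0.2,
    ('dry',  'warm'): 0.4,
}

# Marginal distribution of the first variable: P(a) = sum over b of P(a, b).
p_a = {}
for (a, b), p in joint.items():
    p_a[a] = p_a.get(a, 0.0) + p

assert abs(p_a['rain'] - 0.4) < 1e-12
assert abs(sum(p_a.values()) - 1.0) < 1e-12
```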
1.2.3 Conditional Probabilities
We now introduce conditional probabilities, which are defined only on the collection of all sets contained in $B$.
We define the conditional probability as
$P(A \mid B) \equiv \frac{P(A \cap B)}{P(B)},$
and this satisfies our intuitive conception that the conditional probability that $a \in A$ (given that we know $a \in B$) is given by dividing the probability of joint occurrence by the probability that $a \in B$.
This kind of result has very significant consequences in the development of the theory of Markov processes, which will be considered in detail in Chap. 4.
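A minimal numerical sketch of the conditional probability, again using a fair die; the events $A$ and $B$ are arbitrary illustrative choices:

```python
from fractions import Fraction

# Conditional probability P(A|B) = P(A ∩ B) / P(B) on a fair die.
# The events A ("even") and B ("at least 4") are illustrative choices.
Omega = frozenset({1, 2, 3, 4, 5, 6})

def P(S):
    return Fraction(len(S), len(Omega))

A = {2, 4, 6}
B = {4, 5, 6}

P_A_given_B = P(A & B) / P(B)   # = (2/6) / (3/6)
assert P_A_given_B == Fraction(2, 3)
# Knowing the roll is at least 4 raises the probability of "even":
assert P_A_given_B > P(A)
```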
1.2.4 Independence
We need a probabilistic way of specifying what we mean by independent events. Two sets of events $A$ and $B$ should represent independent sets of events if the specification that a particular event is contained in $B$ has no influence on the probability of that event belonging to $A$. Thus, the conditional probability $P(A \mid B)$ should be independent of $B$, and hence
$P(A \mid B) = P(A), \quad \text{or equivalently} \quad P(A \cap B) = P(A)\,P(B).$
In the case of several events, we need a somewhat stronger specification.
Definition of Independent Events: Events are considered to be independent if their joint probabilities factorize. More precisely, the events $a_i \in A_i\ (i = 1, 2, \ldots, n)$ will be considered to be independent if for any subset $(i_1, i_2, \ldots, i_k)$ of the set $(1, 2, \ldots, n)$,
$P(A_{i_1} \cap A_{i_2} \cap \cdots \cap A_{i_k}) = P(A_{i_1})\,P(A_{i_2})\cdots P(A_{i_k}). \quad (1.20)$
All Possible Factorizations are Necessary: It is important to require factorization for all possible combinations, as in (1.20).
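A standard counterexample (a Bernstein-type construction, not taken from the text) shows why pairwise factorization alone is not enough: with two fair coin flips, the events "first flip is heads", "second flip is heads", and "the two flips agree" factorize in every pair but not as a triple.

```python
from fractions import Fraction
from itertools import product

# Bernstein-type counterexample: two fair coin flips; A1 = "first is
# heads", A2 = "second is heads", A3 = "the flips agree".  Every pair
# factorizes, but the triple intersection does not.
Omega = set(product('HT', repeat=2))

def P(A):
    return Fraction(len(A), len(Omega))

A1 = {w for w in Omega if w[0] == 'H'}
A2 = {w for w in Omega if w[1] == 'H'}
A3 = {w for w in Omega if w[0] == w[1]}

assert P(A1 & A2) == P(A1) * P(A2)
assert P(A1 & A3) == P(A1) * P(A3)
assert P(A2 & A3) == P(A2) * P(A3)
# The full factorization of (1.20) fails: 1/4 on the left, 1/8 on the right.
assert P(A1 & A2 & A3) != P(A1) * P(A2) * P(A3)
```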
1.3 Probability Notations
The set-theoretic formulation of probability used in Sec. 1.2 is powerful, and convenient for discussing general issues. When used in practice, it is more natural to use notations that are more specific to the actual situations under consideration. In particular, we will introduce the ideas of probability distributions and probability densities, as well as the idea of random variables.
1.3.1 Probability Distribution Function
For applications, the probability of occurrence of a single value $n$ is the most convenient quantity to use. This corresponds to considering sets of only one event, such as $\{n\}$, and it is convenient to use the notation
$P(n) \equiv P(\{n\}).$
$P(n)$, the probability of occurrence of the value $n$, is then called the probability distribution function.
1.3.2 Probability Density
For a probability space with members taking on a continuous range of values $\mathbf{x}$ in a space of $r$ dimensions, we consider the probability associated with a set of points in an infinitesimal volume $d^r x$ around the value $\mathbf{x}$, and write this in the form
$P(\mathbf{x} \in d^r x) = p(\mathbf{x})\, d^r x.$
This defines $p(\mathbf{x})$ as the probability density function for this system.
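The defining relation, probability of an infinitesimal interval equals density times interval width, can be checked numerically. The one-dimensional exponential density $p(x) = e^{-x}$ used here is an assumed example, chosen because the exact probability of any interval is available in closed form:

```python
import math

# For a density p(x), the probability of a small interval of width dx
# around x is approximately p(x) dx.  The exponential density
# p(x) = exp(-x) on x >= 0 is an assumed example.
def p(x):
    return math.exp(-x)

dx = 1e-4
# Exact probability of [1, 1 + dx] from the exponential CDF 1 - exp(-x):
exact = math.exp(-1.0) - math.exp(-(1.0 + dx))
approx = p(1.0) * dx

# The relative error shrinks with dx (it is of order dx/2 here):
assert abs(exact - approx) / exact < 1e-3
```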
1.3.3 Random Variables
The idea of a random variable is a way of talking about a probability space and the associated probabilities using a single symbol. For example, if we have a probability distribution $P(n)$, where $n = 0, 1, 2, \ldots$, we can define the random variable $N$ as a quantity that takes on the values $n$ with probability $P(n)$. This notation means that any function $f(N)$ of $N$ is itself a random variable, which takes on the values $f(n)$ with probability $P(n)$. For example, $N$ might mean the number of molecules in a small volume $\Delta V$. This is a quantity which we do not know exactly, but which can be reasonably described as taking on the value $n$ with probability $P(n)$.
1.3.4 Independent Random Variables
Random variables $N_1, N_2, N_3, \ldots$ will be said to be independent random variables if their joint probability distribution function factorizes as in Sec. 1.2.4, that is,
$P(n_1, n_2, n_3, \ldots) = P_1(n_1)\,P_2(n_2)\,P_3(n_3)\cdots.$
For all sets of the form $A_i = \{x : a_i \leq x \leq b_i\}$, the events $N_1 \in A_1, N_2 \in A_2, N_3 \in A_3, \ldots$ are then independent events. This means that the values of each $N_i$ are assumed independently of those of the remaining $N_j$.
1.4 Mean Values of Random Variables
1.4.1 Definitions
Countable Events: The mean value (or expectation) of a discrete random variable $N$ is given by
$\langle N \rangle = \sum_n n P(n). \quad (1.24)$
Notation for the Mean Value: The notation 〈N〉 for the expectation used in this book is a physicist’s notation. The more common mathematical notation is E(N).
Events Described by a Probability Density: In this case, the mean value of a random variable is given by integration,
$\langle \mathbf{X} \rangle = \int \mathbf{x}\, p(\mathbf{x})\, d^r x. \quad (1.25)$
The Variance: The variance $\mathrm{var}[X]$ of the random variable $X$ is given by
$\mathrm{var}[X] \equiv (\sigma[X])^2 \equiv \langle (X - \langle X \rangle)^2 \rangle. \quad (1.26)$
As is well known, $\mathrm{var}[X]$, or its square root the standard deviation $\sigma[X]$, is a measure of the degree to which the values of $X$ deviate from the mean value $\langle X \rangle$.
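The definitions (1.24) and (1.26) translate directly into code. A sketch for a fair six-sided die, an assumed example:

```python
# Mean (1.24) and variance (1.26) computed directly from a probability
# distribution function P(n); a fair six-sided die is the assumed example.
P = {n: 1.0 / 6.0 for n in range(1, 7)}

mean = sum(n * p for n, p in P.items())
var = sum((n - mean) ** 2 * p for n, p in P.items())

assert abs(mean - 3.5) < 1e-12
assert abs(var - 35.0 / 12.0) < 1e-12
```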
1.4.2 Some History
The now almost universal acceptance of the mean as a representative of the "true value" of a random variable took some time to develop. In his very accessible article, Stahl [1.2] notes that it was Galilei [1.3] who first considered the properties of random errors inherent in the observations of celestial phenomena, but he did not come to any conclusion as to how a "true" or "best" value corresponding to a set of observations should be estimated. Only at the beginning of the 19th century did the mean become accepted as the most practical measure of the "true value" of a random variable.
1.4.3 The Law of Large Numbers
Let us consider taking a finite number $M$ of samples $x_i$ of the random variable $X$. We intuitively expect that as $M$ becomes very large, the average
$\bar{x}_M = \frac{1}{M} \sum_{i=1}^{M} x_i$
approaches the mean of the random variable 〈X〉. Under the condition that the mean and variance exist, this can be proved. This result, called the law of large numbers, establishes the mean as the preferred estimator of a random variable under these conditions.
The law of large numbers is proved by showing that the variance of the average $\bar{x}_M$ approaches zero as the number of observations $M$ becomes very large. This is done by constructing the random variable corresponding to the average. We do this by considering the $M$ measurements to be independent samples of the same probability distribution. This effectively constructs a set of independent random variables $X_i$, all with the same probability distribution. The $X_i$ all have the same mean and variance, which we can write as $\langle X \rangle$ and $\mathrm{var}[X]$.
From these we construct the random variable
$\bar{X}_M = \frac{1}{M} \sum_{i=1}^{M} X_i.$
The mean and variance of $\bar{X}_M$ are of course given by the standard results
$\langle \bar{X}_M \rangle = \langle X \rangle, \qquad \mathrm{var}[\bar{X}_M] = \frac{\mathrm{var}[X]}{M}.$
This means that in the limit of large $M$, the only observable value of the average $\bar{X}_M$ is the mean $\langle X \rangle$ of the random variable $X$.
This is the law of large numbers—a result that relates the abstract probability concepts to reality.
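The mechanism of the proof, the $1/M$ shrinkage of the variance of the average, is easy to see in a simulation. The sketch below uses uniform samples on $[0, 1]$, with $\mathrm{var}[X] = 1/12$, as an assumed example:

```python
import random
import statistics

# Law of large numbers in simulation: the variance of the M-sample
# average is var[X]/M, so the average concentrates on the mean.
# Uniform samples on [0, 1] (mean 1/2, variance 1/12) are the example.
random.seed(0)

def average_of_M(M):
    return sum(random.random() for _ in range(M)) / M

reps = 2000
v1 = statistics.pvariance([average_of_M(1) for _ in range(reps)])
v100 = statistics.pvariance([average_of_M(100) for _ in range(reps)])

# The variance of the 100-sample average is about 100 times smaller:
assert v100 < v1 / 50
# And the averages cluster tightly around the mean 0.5:
assert abs(statistics.mean([average_of_M(100) for _ in range(reps)]) - 0.5) < 0.01
```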
1.4.4 Applicability of the Law of Large Numbers
It is important to note that the validity of the law of large numbers requires that the variance does exist, and that this condition is not always satisfied. In practice, one would expect that for a very wide range of measurable quantities, the relevant random variable would have a finite range. For example, the ages of members of a human population can be expected to be confined to a finite range of about 130 years at most. In such a case the variance must exist.
1.4.5 Heavy-Tailed Distributions
Situations in which the variance does not exist are not only possible, but in fact are quite important. They are characterized by a slow falloff of the probability density as $x \to \infty$; such a distribution has come to be named a heavy-tailed distribution. These are treated in detail in Chap. 3.
As an example, consider the Cauchy distribution, whose probability density is given by
$p_{\mathrm{Cauchy}}(x) = \frac{a}{\pi (x^2 + a^2)}. \quad (1.31)$
This is a well-defined probability density, and is correctly normalized to 1. However, it is clear that $\int_{-\infty}^{\infty} x^2\, p_{\mathrm{Cauchy}}(x)\, dx$ diverges, so that the variance cannot exist.
Even the mean, which by symmetry one might expect to be zero, can only be defined as the principal value integral
$\langle X \rangle = \lim_{R \to \infty} \int_{-R}^{R} x\, p_{\mathrm{Cauchy}}(x)\, dx = 0.$
We will discuss the interpretation of the Cauchy distribution in more detail in Sec. 3.4.1.
The Cauchy distribution provides a very accurate description of the frequency distribution of photons emitted during a transition between atomic energy levels, where it is normally called the Lorentzian distribution. In practice, spectral measurements are made by measuring the frequency distribution with a spectrometer. Instead of a mean value and a variance, the distribution is characterized by the position of the maximum and the full width at half maximum (FWHM), which for the distribution (1.31) has the value $2a$.
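Two quoted facts about the distribution (1.31), the FWHM of $2a$ and the slow $1/x^2$ tail decay, can be verified directly:

```python
import math

# Two properties of the Cauchy (Lorentzian) density with scale a:
# the FWHM is 2a, and the tails decay so slowly that even the range
# |x| <= 10a leaves several percent of the probability outside.
a = 2.0

def p(x):
    return a / (math.pi * (x * x + a * a))

# Half of the maximum p(0) is reached at x = ±a, so the FWHM is 2a:
assert abs(p(a) - 0.5 * p(0.0)) < 1e-12
# Mass inside [-10a, 10a], from the CDF 1/2 + atan(x/a)/pi:
mass = 2.0 * math.atan(10.0) / math.pi
assert 0.90 < mass < 0.95
```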
1.4.6 Moments, Correlations, and Covariances
The moments $\langle X^n \rangle$ are often seen as quantities by which a probability distribution can be characterized. The mean and variance involve the first two moments, and provide the most elementary way of characterizing a probability distribution. To fully characterize a probability distribution requires the knowledge of all moments. However, because a probability density must always vanish as $x \to \pm\infty$, the higher moments tell us only about the properties of unlikely large values of $X$.
Existence of Moments: There is no requirement that the moments of any order actually exist. This is demonstrated by the example of the Cauchy distribution (1.31) above, and by other heavy-tailed distributions.
Several Random Variables: In the case of several variables, we define the covariance matrix as
$\langle X_i, X_j \rangle \equiv \langle (X_i - \langle X_i \rangle)(X_j - \langle X_j \rangle) \rangle \equiv \langle X_i X_j \rangle - \langle X_i \rangle \langle X_j \rangle. \quad (1.33)$
Obviously,
$\langle X_i, X_i \rangle = \mathrm{var}[X_i]. \quad (1.34)$
If the variables are independent in pairs, the covariance matrix is diagonal.
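The second form in (1.33), covariance as $\langle X_i X_j \rangle - \langle X_i \rangle \langle X_j \rangle$, in a small discrete sketch; the joint table over two 0/1-valued variables is a hypothetical example with positive correlation:

```python
# Covariance via the second form of (1.33): <X,Y> = <XY> - <X><Y>.
# The joint table is a hypothetical example: the variables tend to
# take equal values, so the covariance comes out positive.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

EX = sum(x * p for (x, y), p in joint.items())
EY = sum(y * p for (x, y), p in joint.items())
EXY = sum(x * y * p for (x, y), p in joint.items())
cov = EXY - EX * EY

assert abs(EX - 0.5) < 1e-12
assert abs(cov - 0.15) < 1e-12   # nonzero: the variables are correlated
```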
1.5 The Characteristic Function
If $\mathbf{s}$ is the vector $(s_1, s_2, \ldots, s_n)$, and $\mathbf{X} = (X_1, X_2, \ldots, X_n)$ is a vector of random variables, then the characteristic function (or moment generating function) is defined by the Fourier transform
$\varphi(\mathbf{s}) = \langle \exp(i\mathbf{s} \cdot \mathbf{X}) \rangle = \int \exp(i\mathbf{s} \cdot \mathbf{x})\, p(\mathbf{x})\, d^n x.$
Because of the Fourier inversion formula,
$p(\mathbf{x}) = \frac{1}{(2\pi)^n} \int \exp(-i\mathbf{s} \cdot \mathbf{x})\, \varphi(\mathbf{s})\, d^n s,$
$\varphi(\mathbf{s})$ determines $p(\mathbf{x})$ with probability 1. Hence, the characteristic function does truly characterize the probability density.
1.5.1 Properties of the Characteristic Function
The characteristic function has the following properties:
The most important property is that $\varphi (s)$ exists for any probability density function. It therefore provides a much more useful tool than the moments, which as we have seen do not always exist.
$\varphi (s)$ is a uniformly continuous function of its arguments for all finite real s.
$\varphi (0)=1$.
$|\varphi(\mathbf{s})| \leq 1$.
If the moments $\langle \prod_i X_i^{m_i} \rangle$ exist, then they are given in terms of the characteristic function by the derivatives
$\langle \prod_i X_i^{m_i} \rangle = \left[ \prod_i \left( -i \frac{\partial}{\partial s_i} \right)^{m_i} \varphi(\mathbf{s}) \right]_{\mathbf{s} = 0}. \quad (1.37)$
Conversely, when moments do not exist, the corresponding derivative of the characteristic function does not exist at $\mathbf{s} = 0$. For example, the characteristic function of the Cauchy distribution (1.31) is $\exp(-a|s|)$, for which no derivatives exist at $s = 0$.
A sequence of probability densities converges to a limiting probability density if and only if the corresponding characteristic functions converge to the corresponding characteristic function of the limiting probability density.
Independent random variables $X_1, X_2, \ldots, X_n$: The definition of independence in Sec. 1.2.4 shows that the set of variables $X_1, X_2, \ldots, X_n$ are independent if and only if
$p(x_1, x_2, \ldots, x_n) = p_1(x_1)\,p_2(x_2)\cdots p_n(x_n), \quad (1.38)$
in which case,
$\varphi(s_1, s_2, \ldots, s_n) = \varphi_1(s_1)\,\varphi_2(s_2)\cdots\varphi_n(s_n). \quad (1.39)$
Sum of independent random variables: If $X_1, X_2, \ldots$ are independent random variables, $u_i$ and $v_i$ are constants, and if
$Y = \sum_{i=1}^{n} (u_i X_i + v_i), \quad (1.40)$
then the characteristic function of $Y$,
$\varphi_Y(s) = \langle \exp(isY) \rangle, \quad (1.41)$
is given by
$\varphi_Y(s) = \prod_{i=1}^{n} e^{i v_i s}\, \varphi_i(u_i s). \quad (1.42)$
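Property (1.42) specializes, for $u_i = 1$ and $v_i = 0$, to the statement that the characteristic function of a sum of independent variables is the product of their characteristic functions. An exact check for two independent fair 0/1 coins, whose sum has the binomial distribution $(1/4, 1/2, 1/4)$:

```python
import cmath

# Check of the product rule (1.42) with u_i = 1, v_i = 0: for the sum
# of two independent fair 0/1 coins, the characteristic function of the
# sum equals the square of the single-coin characteristic function.
P = {0: 0.5, 1: 0.5}                    # one coin
P_sum = {0: 0.25, 1: 0.5, 2: 0.25}      # sum of two independent coins

def cf(dist):
    # Characteristic function of a discrete distribution.
    return lambda s: sum(p * cmath.exp(1j * s * n) for n, p in dist.items())

phi, phi_sum = cf(P), cf(P_sum)

for s in (0.0, 0.3, 1.7, -2.5):
    assert abs(phi_sum(s) - phi(s) * phi(s)) < 1e-12
```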
1.5.2 Role and Significance of the Characteristic Function
The characteristic function plays an important role, which arises from the convergence property iv). This allows us to perform limiting processes on the characteristic function rather than the probability distribution itself, and often makes proofs easier. As well as this, the straightforward derivation of the moments by (1.37) makes any determination of the characteristic function directly relevant to measurable quantities.
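The extraction of moments via (1.37) can also be sketched numerically: a central finite difference approximates $-i\,\partial\varphi/\partial s$ at $s = 0$, recovering the mean. A fair six-sided die is the assumed example:

```python
import cmath

# Recovering a moment from the characteristic function, Eq. (1.37):
# <N> = -i dφ/ds at s = 0, approximated by a central finite difference
# for a fair six-sided die (an assumed example).
P = {n: 1.0 / 6.0 for n in range(1, 7)}

def phi(s):
    return sum(p * cmath.exp(1j * s * n) for n, p in P.items())

h = 1e-6
dphi = (phi(h) - phi(-h)) / (2.0 * h)
mean = (-1j * dphi).real

assert abs(mean - 3.5) < 1e-6   # the mean of a fair die is 3.5
```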