We construct an elementary equation *f*_{θ}(*x*) with a single real-valued parameter *θ* ∈ [0, 1] that, as *θ* varies, is capable of fitting any scatter plot on any number of points to within a fixed precision. Specifically, given *ϵ* > 0, we may construct *f*_{θ} so that for any collection of ordered pairs $\{(x_j, y_j)\}_{j=0}^{n}$ with $n, x_j \in \mathbb{N}$ and *y*_{j} ∈ (0, 1), there exists a *θ* ∈ [0, 1] giving |*f*_{θ}(*x*_{j}) − *y*_{j}| < *ϵ* for all *j* simultaneously. To achieve this, we apply results about the logistic map, an iterated map in dynamical systems theory that can be solved exactly. The existence of an equation *f*_{θ} with this property highlights that “parameter counting” fails as a measure of model complexity when the class of models under consideration is even slightly broad.

## I. INTRODUCTION

The mathematician John von Neumann famously admonished that with four free parameters he could make an elephant, and with five he could make it wiggle its trunk.^{1} Indeed, the number of free parameters is often taken as a proxy for model complexity, both intuitively and in quantitative model comparison measures like AIC^{2} and BIC.^{3} While these measures can be shown to be statistically principled or optimal for certain classes of models,^{4} they are often used by practitioners in a given field to evaluate arbitrary models. The aim of this short note is to show that, in fact, very simple, elementary models exist that are capable of fitting arbitrarily many points to an arbitrary precision using only a single real-valued parameter *θ*. This does not require severe pathologies—one such model, studied here, is infinitely continuously differentiable as a function of *θ* and *x*. The existence of this model has implications for statistical model comparison, and shows that great care must be taken in machine learning efforts to discover equations from data^{5–7} since some simple models can fit any data set arbitrarily well.

We will consider the simple setting of a scatter plot of *x*-values at natural numbers 0, 1, 2, 3, …, *n* and *y*-values in (0, 1). We show how to construct an elementary function *f*_{θ} that, as *θ* varies, may fit any collection of ordered pairs $\{(x_j, y_j)\}_{j=0}^{n}$ to an arbitrary precision *ϵ* > 0. The existence of a solution for *x* = 0, 1, 2, …, *n* implies the existence of a solution for any subset of these integers. The general approach taken here will be to first find an initial condition of a chaotic dynamical system whose orbit comes close to values related to each *y*_{j}. Then, an exact solution of this dynamical system yields a friendly and simple equation *y* = *f*_{θ}(*x*) that, as *x* varies, recovers the system’s dynamics with initial condition *θ*. This approach is related to attempts to encode computations into chaotic dynamical systems.^{8,9} The techniques deployed here are not novel mathematically, but this lesson from dynamical systems theory has not been explicitly articulated in the literature on statistics and model comparison.

## II. A DERIVATION OF $f_\theta$

We will make use of the logistic map *m*(*z*) = 4*z*(1 − *z*), whose iterated application can be solved exactly^{10} for a given initial value *θ* as

$$m^{x}(\theta) = \sin^{2}\!\left(2^{x} \arcsin \sqrt{\theta}\right). \tag{1}$$

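As a quick sanity check (our own illustration, not part of the original derivation), the closed-form solution (1) can be compared against direct iteration of *m* for small *x*, before floating-point error is amplified by the chaos:

```python
import math

def m_iterated(theta, x):
    """Apply the logistic map m(z) = 4z(1 - z) to theta, x times."""
    z = theta
    for _ in range(x):
        z = 4 * z * (1 - z)
    return z

def m_closed_form(theta, x):
    """Exact solution: m^x(theta) = sin^2(2^x * arcsin(sqrt(theta)))."""
    return math.sin(2 ** x * math.asin(math.sqrt(theta))) ** 2

theta = 0.3
for x in range(6):
    assert abs(m_iterated(theta, x) - m_closed_form(theta, x)) < 1e-9
```

Because *m* roughly doubles small perturbations on each application, double precision only suffices for a handful of iterations; this is why an arbitrary-precision library is needed for the fits shown later.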
This solution follows from the double angle identity,

$$\sin^{2}(2z) = 4 \sin^{2}(z)\left(1 - \sin^{2}(z)\right), \tag{2}$$

and the requirement that *m*^{0}(*θ*) = *θ*. The map *m* is chaotic^{11} and it is well established that *m* may be viewed as a shift map on *θ* through its conjugacy via *φ*(*z*) = sin^{2}(2*πz*) to the Bernoulli map,

$$S(z) = 2z \bmod 1. \tag{3}$$

*S* has the effect of removing the first bit of a binary expansion 0.*z*_{1}*z*_{2}*z*_{3}⋯ of *z*, so that

$$S(0.z_1 z_2 z_3 \cdots) = 0.z_2 z_3 z_4 \cdots. \tag{4}$$

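A minimal illustration of this bit-shift behavior, using exact rational arithmetic (our own example, not code from the paper):

```python
from fractions import Fraction

def S(z):
    """Bernoulli (doubling) map: drops the leading bit of z's binary expansion."""
    return (2 * z) % 1

z = Fraction(0b1011, 16)           # binary 0.1011
assert S(z) == Fraction(0b011, 8)  # binary 0.011: the leading 1 is gone
```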
This property of *S* means that we may construct a real number *ω* ∈ (0, 1) whose orbit under *S* will bring it arbitrarily close to each member of any collection of points. Specifically, let us fix *ϵ* > 0 and choose $r \in \mathbb{N}$ so that 2^{−r} < *ϵ*/2. We will define $y'_j = \varphi^{-1}(y_j)$ and denote the binary expansion of $y'_j$ as $0.y'_{j1} y'_{j2} y'_{j3} \cdots$. Define the parameter value *ω* ∈ (0, 1) by concatenating the first *r* binary digits of each $y'_j$,

$$\omega = 0.\, y'_{01} y'_{02} \cdots y'_{0r} \; y'_{11} y'_{12} \cdots y'_{1r} \; \cdots \; y'_{n1} y'_{n2} \cdots y'_{nr}. \tag{5}$$

Due to the construction of *ω* and the ability to interpret *S* as removing the leftmost bit, *S*^{rj}(*ω*) agrees with $y'_j$ on its first *r* bits, so

$$\left|S^{rj}(\omega) - y'_j\right| \le 2^{-r} < \epsilon/2. \tag{6}$$

The ability to construct such an orbit relies ultimately on the fact that *S* is continuous and topologically mixing. Since *φ* conjugates *S* to *m*, we have *m*^{rj}○*φ* = *φ*○*S*^{rj}. Moreover, *φ* is Lipschitz continuous, and in particular 2 |*x* − *y*| > |*φ*(*x*) − *φ*(*y*)| for all *x*, *y* ∈ (0, 1). Putting these two facts together with (6) yields, for all *j*,

$$\left|m^{rj}(\varphi(\omega)) - y_j\right| = \left|\varphi\!\left(S^{rj}(\omega)\right) - \varphi(y'_j)\right| < 2\left|S^{rj}(\omega) - y'_j\right| < \epsilon, \tag{7}$$

where the last inequality follows from the Lipschitz condition and application of *φ* to each term inside the absolute value. This presentation has elided one technical detail, which is that *φ* is not one-to-one on (0, 1), and so *φ*^{−1} has two possible values. As a consequence, in (7), $m^{rj}(\varphi(\omega))$ may be close to *either y*_{j} or 1 − *y*_{j}, since *φ* is symmetrical about 1/2. To address this, we may always use the lower value for *φ*^{−1} and scale the *y*_{j} so that they are always below 0.5. This scaling may then be inverted in the output of the final equation if desired.
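The conjugacy relation used above, *m*○*φ* = *φ*○*S*, is easy to spot-check numerically (our own illustration):

```python
import math

def m(z):
    return 4 * z * (1 - z)

def S(z):
    return (2 * z) % 1

def phi(z):
    return math.sin(2 * math.pi * z) ** 2

# m(phi(z)) and phi(S(z)) agree for any z in (0, 1)
for z in [0.1, 0.23, 0.4, 0.77]:
    assert abs(m(phi(z)) - phi(S(z))) < 1e-12
```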

Equation (7) shows that *m*^{rj} will come *ϵ*-close to each of the *y*_{j} when started on value *θ* = *φ*(*ω*). Thus, we may define a single-parameter equation,

$$f_\theta(x) = \sin^{2}\!\left(2^{x r} \arcsin \sqrt{\theta}\right), \tag{8}$$

where choosing *θ* = *φ*(*ω*) yields that |*f*_{θ}(*x*_{j}) − *y*_{j}| < *ϵ* for all *j*. Of course, the *y*_{j} were freely chosen, showing that *f*_{θ} can approximate any data set $\{(x_j, y_j)\}_{j=0}^{n}$ as *θ* varies in [0, 1]. Note that for a fixed *ϵ*, the number of data points *n* that can be fit is not bounded, although the number of bits of precision required of *θ* scales linearly with *n* (and also with *r*). In addition, this *f*_{θ} is continuous and differentiable—indeed infinitely continuously differentiable—and so it will satisfy nearly all regularity conditions that would normally weed out such pathological functions in the context of parameter estimation.
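The whole construction can be sketched end-to-end in a few lines of arbitrary-precision Python. This is our own minimal sketch with hypothetical helper names (`phi_inv`, `encode`), not the paper's released implementation; it follows the recipe above: map each *y*_{j} through the lower branch of *φ*^{−1}, concatenate *r* bits of each result into *ω*, and set *θ* = *φ*(*ω*):

```python
from mpmath import mp, mpf, asin, sin, sqrt, pi, floor

mp.dps = 80  # ample working precision for the (n + 1) * r = 100 bits of omega below

def phi(z):
    """Conjugacy phi(z) = sin^2(2*pi*z)."""
    return sin(2 * pi * z) ** 2

def phi_inv(y):
    """Lower branch of phi^{-1}, landing in (0, 1/4]."""
    return asin(sqrt(y)) / (2 * pi)

def f(theta, x, r):
    """Eq. (8): f_theta(x) = sin^2(2^{x r} * arcsin(sqrt(theta)))."""
    return sin(2 ** (x * r) * asin(sqrt(theta))) ** 2

def encode(ys, r):
    """Concatenate the first r binary digits of each y'_j into omega; return theta."""
    omega = mpf(0)
    k = 0
    for y in ys:
        z = phi_inv(mpf(y))
        for _ in range(r):
            z *= 2
            bit = int(floor(z))  # next binary digit of y'_j
            z -= bit
            k += 1
            omega += mpf(bit) / mpf(2) ** k
    return phi(omega)

ys = [0.13, 0.72, 0.55, 0.91]  # arbitrary targets in (0, 1)
r = 25
theta = encode(ys, r)
for j, y in enumerate(ys):
    assert abs(f(theta, j, r) - y) < 1e-4  # f_theta hits every target
```

With *r* = 25 bits per point, the recovered values agree with the targets to within about 2^{−25} (times the Lipschitz factor), comfortably inside the 10^{−4} tolerance; tightening *ϵ* just means raising *r* and the working precision.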

To illustrate that (8) can fit an arbitrary data set as *θ* varies, Figure 1 shows the values *f*_{θ}(0), *f*_{θ}(1), *f*_{θ}(2), …, for two values of *θ*. Each parameter value was created by following the construction above using target *y*_{j} chosen at each *x* value from the black pixels of a line drawing of either an elephant (left) or signature (right). The implementation used the arbitrary-precision library mpmath in Python^{12} and is made freely available.^{13} Both data sets can be fit well by *f*_{θ}(*x*) when *θ* is appropriately tuned, as the figures show. This single-parameter model provides a large improvement over the prior state of the art in fitting an elephant.^{14,15} Note that the only shown *x* values are integers, even though *f*_{θ} is defined for all real numbers. Between the shown integers are the rapidly oscillating sinusoidal patterns implied by (8).

## III. DISCUSSION

The equation *f*_{θ}(*x*) is extraordinarily sensitive to its single parameter *θ* and in fact will generalize to *x* > *n* in ways that depend only on the digits of *θ* that come after the last digit of $y'_n$. Thus, while *f*_{θ} fits the data, its generalization behavior is completely determined by the free parameter’s less significant digits. This implies that there can be no guarantees about the performance of *f*_{θ} in extrapolation, despite its good fit. Thus, the construction shows that *even a single parameter* can overfit the data, and therefore it is not always preferable to use a model with fewer parameters. This fact is related to the observation, in a setting of classification, that *f*(*x*) = sin(*x*) has an infinite VC-dimension.^{16}

The existence of such a simple equation with such freedom in behavior illustrates a more basic problem: model complexity cannot be determined by counting parameters. More generally, uncritical use of a “parameter counting” approach ignores the fact that a single real-valued parameter potentially contains an unboundedly large amount of information, since a real number requires an infinite number of bits to specify—a fact that is inherently problematic.^{17} Indeed, the set of real numbers that can even be described with finitely many bits (e.g., by a Turing machine) is countable and thus has measure zero. Given the existence of injective maps between $\mathbb{R}^n$ and $\mathbb{R}$,^{18} the number of parameters in a model cannot be a meaningful measure of its complexity once the class of models is large enough to implement these maps and effectively decode one single number into many. However, such embeddings are neither continuous nor likely to take the form of an ordinary-looking equation that a scientist would encounter.
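As a toy illustration of such decoding (our own example, with made-up helper names), two reals can be packed into one by interleaving their decimal digits—an injective but discontinuous map from [0, 1)² into [0, 1):

```python
def interleave(a, b, digits=6):
    """Pack a, b in [0, 1) into one number by alternating their decimal digits."""
    da = f"{a:.{digits}f}"[2:]  # the digits after "0."
    db = f"{b:.{digits}f}"[2:]
    return float("0." + "".join(x + y for x, y in zip(da, db)))

def deinterleave(c, digits=6):
    """Recover a and b from the interleaved number c."""
    dc = f"{c:.{2 * digits}f}"[2:]
    return float("0." + dc[0::2]), float("0." + dc[1::2])

a, b = 0.123456, 0.654321
c = interleave(a, b)  # 0.162534435261
assert deinterleave(c) == (a, b)
```

Two "parameters" have been losslessly hidden inside one—but the decoding map jumps discontinuously as *c* varies, unlike the smooth *f*_{θ} constructed above.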

The example provided in this paper shows that the infinite amount of information in a real valued parameter can be *decoded* quite simply, using just sin and exponentiation. The existence of such a simple yet problematic equation shows that attempts both at broad model comparison and automatic discovery of equations from data may often be ill-posed. Quantitatively, parameter-counting and maximum-likelihood methods should be dispreferred relative to model comparisons based on measures that incorporate the precision of real-valued parameters, such as Minimum Description Length.^{19} Alternatively, such problems can be avoided by comparison methods like cross-validation that implicitly penalize over-fitting. The result also emphasizes the importance of constraints on scientific theories that are enforced independently from the measured data set, with a focus on careful a priori consideration of the class of models that should be compared.^{4}