Human memory is an incredibly complex system with vast capacity, yet it is often unreliable. Measuring memory for realistic material, such as narratives, is quantitatively challenging because people rarely remember narratives verbatim. Cognitive psychologists therefore developed experimental paradigms involving randomly assembled lists of items, which make possible quantitative measures of performance in memory tasks such as recall and recognition. Here, we describe a set of mathematical models designed to predict the results of these experiments. The models are based on simple underlying assumptions and agree with experimental results surprisingly well. In addition, they exhibit interesting mathematical behavior that can be partially understood analytically.
Our memory is often unreliable. One usually remembers where one parked the car in the morning when leaving work in the evening, but sometimes this fails and, what is worse, one can never know when this will happen and why. Everyday observations of this kind seem to indicate that human memory cannot be predicted, let alone described by mathematical equations. However, quantitative measures of memory performance have been collected over years of study, in particular, with recognition and recall tasks utilizing random lists of words (see, e.g., Refs. 1–4), and a set of powerful and complex models has been proposed to capture these measures (Refs. 4–7). These models are characterized by a large number of parameters that have to be tuned to data, and they cannot be analyzed mathematically. We recently proposed a different type of model that is based on simple fundamental principles and, as a result, can be analyzed mathematically (Refs. 8 and 9). Moreover, while these models necessarily provide a highly simplified description of data, they predict the important performance measures in recall and recognition experiments quite well and, most importantly, display a certain degree of universality. In this paper, we review these models and the precision with which they predict memory performance, and we focus on their mathematical aspects, in particular, on the unsolved issues.
II. EXPERIMENTS AND MODELS
Here, we introduce the experiments that we performed on Amazon’s Mechanical Turk platform with a relatively large number of participants of unknown age and education level. All experiments involved randomly assembled lists of common nouns (e.g., “table” and “car”) or common facts (e.g., “Earth is round” and “birds fly”). We assembled multiple word lists of seven lengths, L = 8, 16, 32, 64, 128, 256, and 512, presented consecutively at two speeds (1 and 1.5 s per word), and fact lists of four lengths, L = 8, 16, 32, and 64, presented at the speed of 3 s per fact, 18 conditions altogether. Each participant performed one recall experiment with a list of a particular presentation condition (words or facts with a particular length and presentation speed) and one recognition test with a different list of the same condition. Each recognition trial consisted of a pair of items, one of them from the presented list and one not, with the participant being required to point to the item from the list. More experimental details can be found in Refs. 9 and 10. For each presentation condition separately, we obtained the empirical distribution P(R) of the number of items recalled, R, and the fraction of correct recognitions, C, both across the group of participants performing experiments with this condition. In Fig. 1, we plot the average R and C as a function of list length L separately for words at the two presentation speeds and for facts. As in numerous other publications (see, e.g., Refs. 3 and 11), we observe that memory performance declines with increasing L and with faster presentation, which is quite intuitive. We now introduce a novel performance measure that we call “memory invariant” (X) defined as
As shown in Fig. 2, the value for X is approximately the same for all conditions, converging to an asymptotic value toward longer lists. The main goal of the recall model that we present next is to account for this behavior (in fact, the invariant was derived from model predictions as will be seen shortly). The black horizontal line in Fig. 2 is the theoretical prediction for the invariant, given by
In another set of experiments performed with random lists of words, we measured how the probability of recognizing a word from the list declines as a function of the lag between its presentation time and the test, which is called the retention curve (RC) (Ref. 9). There is a wide literature in psychology where mathematical forms of RCs were evaluated, with the power-law form emerging as one of the best candidates (Ref. 12). The hope is that the shape of the RC will shed some light on the mechanisms of forgetting, but this hope has not yet borne fruit (see, e.g., Ref. 13). The most accepted mechanism in the literature is “retrograde interference,” according to which memories are not erased passively but rather due to the acquisition of new memories (Ref. 14). In our experiments, we present to a participant a list of 500 common nouns at the speed of 1.5 s per word, interspersed with recognition trials of three types: one including a word presented two time steps before the test (2-back recognition); another one including a word presented ten time steps before the test (10-back recognition); and finally, the third type with one of the first 25 words of the list tested at various later time points during presentation. Recognition trials of all three types were presented at random times throughout the list presentation. The purpose of the 2-back trials was to select the participants who were focused on the task; the 10-back trials were performed to check for the possible effects of fatigue and/or proactive interference from previously presented words; finally, the third type of trials was the principal one, used to find the shape of the decay of recognition performance with the number of intervening items since presentation, i.e., the RC. Results of these experiments averaged over participants with perfect 2-back recognition are shown in Fig. 3.
For participants with perfect 2-back performance, 10-back performance does not depend much on the position of the test, indicating that the effects of proactive interference are negligible. On the other hand, recognition of early items steadily declines toward the chance level of 50% with increasing lag between presentation and test. We also show a theoretical prediction for the shape of the RC, resulting from the model that will be described in detail below.
III. MATHEMATICAL MODEL OF FREE MEMORY RECALL
This model, introduced in a series of our previous publications, is based on three basic principles: (i) We assume that encoding and forgetting of items in memory is a binary process, i.e., at each moment, a given item is either present in memory or not, and all items that remain in memory after presentation of the list are candidates for recall; (ii) items of a given type are encoded in dedicated memory networks as sparse random groups of neurons, i.e., each neuron encodes a given memory with some small probability, independently for all neurons and items; and (iii) the recall trajectory is determined by the matrix of encoding overlaps between items, which we call the “similarity matrix” (SM), computed as the number of common neurons for each pair of items. The trajectory is generated as follows: the first item is chosen randomly; at each step of the process, the next item is the one with the largest similarity to the current item, chosen out of all items except for the one that was visited at the previous step. Mathematically, each element of the SM is defined as a scalar product of the binary index vectors for the corresponding items,

$$S_{kl} = \sum_{i=1}^{N} \xi_i^k \xi_i^l, \tag{3}$$

where N is the number of neurons in the memory network and $\xi_i^k = 1$ ($\xi_i^k = 0$) if the neuron with index i participates (does not participate) in the encoding of the item with index k, which happens with probability f (1 − f), where f is the sparseness of memory representations.
Principle (i) is a simplifying assumption that allows us to consider, at each moment, the number of items from the list that are encoded in memory, which we denote as M. This, in turn, defines the probability of giving a correct answer in a recognition trial as $C = \frac{M}{L} + \frac{1}{2}\left(1 - \frac{M}{L}\right) = \frac{1}{2} + \frac{M}{2L}$, which is derived by assuming that if the word is in memory, the participant gives a correct answer; otherwise, the participant guesses. Inverting this relation, and taking into account that different participants remember different numbers of items after list presentation, results in the following expression for the average number of items in memory, ⟨M⟩:

$$\langle M \rangle = L\,(2C - 1). \tag{4}$$
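The inversion described above can be sketched in a few lines. This is a minimal illustration, assuming the guessing-corrected relation C = 1/2 + M/(2L) that follows from the stated assumption (correct answer if the item is in memory, a coin flip otherwise); the function name `items_in_memory` is hypothetical and not from the original work.

```python
def items_in_memory(list_length: int, fraction_correct: float) -> float:
    """Estimate the average number of items in memory from recognition accuracy.

    Assumes two-alternative forced choice with pure guessing for forgotten
    items, so that C = 1/2 + M / (2 L) and therefore M = L (2 C - 1).
    """
    if not 0.0 <= fraction_correct <= 1.0:
        raise ValueError("fraction_correct must lie between 0 and 1")
    return list_length * (2.0 * fraction_correct - 1.0)
```

For instance, a hypothetical accuracy of 75% on a 64-item list would correspond to about half of the list being retained, while chance performance (50%) corresponds to no items in memory.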
Principles (ii) and (iii) result in a simple recall algorithm illustrated in Fig. 4. For each row of the similarity matrix, we mark the positions of the maximal and second-maximal elements [black and red spots, respectively, in Fig. 4(a)] and construct a graph with M nodes, where each node emits two arrows of the corresponding colors pointing to the nodes given by the positions of the black and red spots in the corresponding row. Beginning from a random node, the recall trajectory follows black arrows, unless a black arrow leads back to the previous node, producing a 2-node loop, in which case the red arrow is chosen instead; see Fig. 4(b) for an example trajectory. Following this trajectory, one can see the “collision” where a previously visited node is reached for the second time (node 10), after which the trajectory traverses the original path in the opposite order for several steps, eventually breaking into new nodes, until it finally converges to a cycle after the same transition is taken for the second time (12 → 16). As illustrated in this example, the recall model is mathematically quite involved, and it is not currently clear how it can be solved to find the distribution of the trajectory lengths, corresponding to the number of words recalled, for an arbitrary M, over realizations of the SM. In Ref. 10, we found a good asymptotic solution to this problem in the limit of large M by connecting it to a much simpler model with a fully random SM and no restriction on avoiding 2-node loops (i.e., the trajectory always follows black arrows on the corresponding graph). This model is then equivalent to a random map problem (also called the “birthday paradox”), for which the trajectory enters a cycle after the first collision. Since all transitions in this model are equally probable, the probability of a collision with any one of the previously visited nodes is given by

$$p_0 = \frac{1}{M}. \tag{5}$$
The probability of having a trajectory of length R in a graph of M nodes can then be written down as

$$P(R) = \frac{R}{M} \prod_{k=1}^{R-1} \left(1 - \frac{k}{M}\right). \tag{6}$$
The first moment of this distribution, m1 = ⟨R⟩, can be expressed via the Ramanujan function θ as
and all higher moments can also be computed, e.g.,
In the limit M → ∞, the first moment quickly converges to its asymptotic behavior,

$$m_1 \approx \sqrt{\frac{\pi M}{2}}. \tag{10}$$
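The convergence of the first moment can be checked numerically. The sketch below assumes that the random-map distribution has the standard birthday-problem form P(R) = (R/M) ∏_{k<R} (1 − k/M), with each previously visited node hit with probability 1/M; the function names are illustrative only.

```python
import math


def trajectory_length_pmf(m: int) -> list[float]:
    """Return [P(R = 1), ..., P(R = m)] for the fully random SM model.

    Assumes P(R = r) = (r / m) * prod_{k=1}^{r-1} (1 - k / m), i.e., no
    collision during the first r - 1 transitions and a collision at the next.
    """
    pmf = []
    no_collision = 1.0  # probability that the first r - 1 transitions hit new nodes
    for r in range(1, m + 1):
        pmf.append(no_collision * r / m)
        no_collision *= 1.0 - r / m
    return pmf


def moment(pmf: list[float], order: int) -> float:
    """Moment of the given order of the trajectory length R."""
    return sum((r ** order) * p for r, p in enumerate(pmf, start=1))


if __name__ == "__main__":
    for m in (10, 100, 1000):
        pmf = trajectory_length_pmf(m)
        print(m, moment(pmf, 1), math.sqrt(math.pi * m / 2))
```

Under this assumed form, the normalization condition implies the exact identity ⟨R²⟩ = 2M − ⟨R⟩, which is a convenient consistency check on any numerical evaluation of the moments.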
In fact, one can derive the asymptotic expression for the distribution of R directly from Eq. (6) by replacing each bracket factor by the corresponding exponential, e.g., $1 - k/M \approx e^{-k/M}$, resulting in

$$P(R) \approx \frac{R}{M}\, e^{-R^2/(2M)}, \tag{11}$$
from which all the moments can be computed. Our model with the SM defined as the overlaps between random item encodings [Eq. (3)] is much more complex, and we currently do not have a precise formula for the probability distribution of R analogous to Eq. (6), for reasons that will become apparent shortly. Here, we only consider the limit of very sparse encoding, f → 0 [see Eq. (3)], in which case the correlation between different elements of the SM can be neglected, and one can approximate the matrix of encoding overlaps as a random symmetric SM of size M by M (see Ref. 16 for the analysis of the more general case of finite f). This model still differs from the one considered above in two important ways. First, because of the symmetry of the SM, the probability of a collision with any one of the previously visited nodes, which is given by p0 ≈ 1/M in the model with random SM, is now given by p0 ≈ 1/(2M), i.e., approximately two times smaller (we are considering the asymptotic limit of M → ∞ in this analysis). The reason for this is that if, say, the process is currently at node k and we want to estimate the chance that it returns to the previously visited node l, we need to take into account that when the process was at node l, it did not choose the transition to node k, i.e., the SM element Skl is not the largest out of all M − 2 relevant elements of the lth row of the SM. With this constraint, the chance that it will be the largest in the kth row is ∼1/(2M), as can be easily estimated (Ref. 16). This estimate does not take into account other constraints, namely, that the current node k was not chosen at any of the previous steps of the process, but these constraints can be shown to be negligible in the asymptotic limit of large M. The second difference from the model solved above is the possibility of continuing the recall trajectory after the collision, as illustrated in Fig. 4(b). This happens if and only if the previous transition from the node on which the collision happens followed the red arrow [10 → 7 in Fig. 4(b)], i.e., the largest element of the corresponding row of the SM would bring the process back to the previous node, 14 in this case. One can estimate the probability of this event to be 1/3 asymptotically (Ref. 16). Taking these two estimates together, the probability of a collision that results in the recall process entering a cycle is given by

$$p \approx \frac{2}{3} \cdot \frac{1}{2M} = \frac{1}{3M}. \tag{12}$$
This estimate ignores the cases in which the collision happens at the initial node of the process or at a node that was already traversed twice in opposite directions because in both of these cases, the process always enters a cycle; these cases, however, can be neglected in the asymptotic limit. Comparing Eqs. (5) and (12), we conclude that for large M, the statistics of recall trajectories in the model with symmetric SM asymptotically approach those of the model with fully random SM with the substitution M → 3M, i.e., the probability distribution of the number of recalled items can be obtained from Eq. (11) as

$$P(R) \approx \frac{R}{3M}\, e^{-R^2/(6M)}, \tag{13}$$

with the corresponding moments being

$$\langle R \rangle \approx \sqrt{\frac{3\pi M}{2}}, \tag{14}$$

$$\langle R^2 \rangle \approx 6M. \tag{15}$$
Going back to Eq. (4), we can substitute this expression of M in terms of C in Eq. (1) and obtain the theoretical expression for the memory invariant mentioned above in Eq. (2) that is shown in Fig. 2.
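As a worked illustration of how the recognition and recall results combine, the sketch below assumes the asymptotic mean of the symmetric-SM model, ⟨R⟩ ≈ √(3πM/2), which corresponds to a cycle-forming collision probability of about 1/(3M), together with the guessing-corrected estimate M = L(2C − 1). The numerical values L = 64 and C = 0.75 are hypothetical and are not taken from the data.

```latex
\begin{align*}
\langle R \rangle &\approx \sqrt{\tfrac{3\pi}{2}\, M}, \qquad M = L\,(2C - 1) \\
\Rightarrow\quad \langle R \rangle &\approx \sqrt{\tfrac{3\pi}{2}\, L\,(2C - 1)}, \qquad
\frac{\langle R \rangle}{\sqrt{L\,(2C - 1)}} \approx \sqrt{\tfrac{3\pi}{2}} \approx 2.17 \\
\text{Example:}\quad L &= 64,\; C = 0.75 \;\Rightarrow\; M = 32,\;
\langle R \rangle \approx \sqrt{1.5\,\pi \cdot 32} \approx 12.3
\end{align*}
```

The ratio in the second line no longer depends on the list length or on the presentation speed, which is the sense in which a combination of recall and recognition measures can be condition-independent.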
Figure 5 contains the results of numerical simulations that illustrate the convergence of the distribution of the number of recalled items to its asymptotic form with increasing M, as well as the convergence of the first and second moments. Interestingly, the first moment converges to its asymptotic value of Eq. (14) much faster than the second moment, but we have not yet been able to estimate analytically the finite-M corrections to the distribution function of R and to the moments. Another interesting open feature of the model that can potentially be observed experimentally is the number of recall cycles for a given SM when choosing different initial items for recall. In the random SM model, the number of cycles was shown to grow very slowly with the size of the matrix (Refs. 17 and 19), but we have not managed to generalize this result to the symmetric SM model.
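A simulation of this kind can be sketched as follows. This is not the code used for Fig. 5; it is a minimal illustration that assumes the sparse-encoding limit, in which the SM is replaced by a random symmetric matrix with independent uniform entries, and it counts the number of distinct items visited before a transition repeats. All function names are hypothetical.

```python
import math

import numpy as np


def random_symmetric_sm(m: int, rng: np.random.Generator) -> np.ndarray:
    """Random symmetric similarity matrix (assumed sparse-encoding limit)."""
    upper = np.triu(rng.random((m, m)), k=1)
    return upper + upper.T


def recall_count(similarity: np.ndarray, start: int) -> int:
    """Number of distinct items visited by the recall process from `start`.

    At each step the process moves to the most similar item, unless that is
    the item visited at the previous step, in which case it moves to the
    second most similar item. It stops when a transition is about to repeat.
    """
    m = similarity.shape[0]
    if m < 3:
        raise ValueError("need at least three items")
    s = similarity.astype(float).copy()
    np.fill_diagonal(s, -np.inf)  # an item is never its own successor
    order = np.argsort(s, axis=1)
    best, second = order[:, -1], order[:, -2]

    visited = {start}
    seen_transitions = set()
    prev, cur = -1, start
    while True:
        nxt = int(best[cur])
        if nxt == prev:
            nxt = int(second[cur])
        if (cur, nxt) in seen_transitions:  # the trajectory has entered a cycle
            return len(visited)
        seen_transitions.add((cur, nxt))
        visited.add(nxt)
        prev, cur = cur, nxt


def mean_recall(m: int, trials: int, seed: int = 0) -> float:
    """Average number of recalled items over random matrices and start items."""
    rng = np.random.default_rng(seed)
    counts = [
        recall_count(random_symmetric_sm(m, rng), int(rng.integers(m)))
        for _ in range(trials)
    ]
    return float(np.mean(counts))


if __name__ == "__main__":
    for m in (16, 64, 256):
        print(m, mean_recall(m, trials=500), math.sqrt(1.5 * math.pi * m))
```

The printed pairs allow a direct comparison of the simulated mean with √(3πM/2); how closely they agree at a given M depends on the finite-M corrections discussed above.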
IV. MATHEMATICAL MODELS OF FORGETTING
In this section, we introduce a family of models of forgetting that are based on the idea of “retrograde interference,” according to which memories are erased due to the acquisition of new memories rather than passively by the passage of time (Ref. 14). The simplest way to realize this process is to assume that each acquired memory item is characterized by a scalar “valence” measure and that at every time step a new item is presented to the system with a valence randomly sampled from some distribution. Each time a new memory of valence V is acquired, either the whole set of existing memories with valences smaller than V is erased (model I) or only the one memory of this set with the smallest valence is erased, if the set is not empty (model II). Finally, the third model (model III) that we proposed generalizes model I to multidimensional valences such that each time a new memory with an n-dimensional valence is acquired, the existing memories whose valences are smaller than that of the new memory along every dimension are erased. Figure 6 illustrates these three models. Model I is very easy to solve: the probability that an item stays in memory for at least t steps after its acquisition, which we call the retention curve RC, is the same as the probability that its valence is higher than those of all of the t subsequently presented items,

$$RC(t) = \frac{1}{t+1}, \tag{16}$$
i.e., it has the power-law shape compatible with a variety of psychology studies on forgetting. The items that remain in memory after many steps of acquisition occupy the tail of the valence distribution, and moreover, at each moment, the valence of the retained memories is monotonically increasing with their “age” (time since acquisition). The average number of items accumulated in memory, N, grows with time as

$$\langle N(T) \rangle = \sum_{t=1}^{T} \frac{1}{t} \approx \ln T. \tag{17}$$
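Model I is simple enough to simulate directly. The sketch below assumes valences drawn independently and uniformly from [0, 1) and exploits the monotonic ordering noted above, so that memory can be kept as a stack; the function names are hypothetical.

```python
import random


def retained_items(valences):
    """Model I: indices of the items still in memory after the whole list.

    Each new item erases every stored item with a smaller valence. Stored
    valences decrease from oldest to newest, so a stack is sufficient.
    """
    memory = []  # (index, valence) pairs, oldest first
    for index, valence in enumerate(valences):
        while memory and memory[-1][1] < valence:
            memory.pop()
        memory.append((index, valence))
    return [index for index, _ in memory]


def mean_retained(num_items, trials, seed=0):
    """Average number of retained items over random lists (uniform valences)."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        total += len(retained_items([rng.random() for _ in range(num_items)]))
    return total / trials


if __name__ == "__main__":
    t = 1000
    harmonic = sum(1.0 / k for k in range(1, t + 1))
    print(mean_retained(t, trials=2000), harmonic)
```

Because the last item is always retained and an item presented t steps before the end survives with probability 1/(t + 1), the simulated mean should track the harmonic number of the list length.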
The distribution of N(T) can also be computed with the observation that if one considers the valences of presented items in the backward order, from last to first, the items that are retained correspond to the running maxima of the valences, also called “high water marks.” The distribution of the number of retained items is, therefore, the same as the distribution of the number of high water marks over the permutations of a list of T numbers, which is given by the unsigned Stirling numbers of the first kind s(T, k). This result can be obtained from a bijection between permutations with a given number of high water marks and permutations with the same number of cycles (Ref. 22). Another way to compute it is through a relation to dominance in random games (see Ref. 19 for an explicit derivation of the distribution), yielding

$$P\big(N(T) = k\big) = \frac{s(T, k)}{T!}, \tag{18}$$

where the unsigned Stirling numbers of the first kind are defined algebraically,

$$x(x+1)\cdots(x+T-1) = \sum_{k=0}^{T} s(T, k)\, x^k. \tag{19}$$
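The distribution of the number of retained items can be tabulated exactly for moderate T. The sketch below assumes the unsigned Stirling numbers of the first kind, computed with the standard recurrence c(n + 1, k) = n c(n, k) + c(n, k − 1), and normalizes them by T!; the function names are illustrative.

```python
from fractions import Fraction
from math import factorial


def unsigned_stirling_first_row(t: int) -> list[int]:
    """Return [c(t, 0), ..., c(t, t)], unsigned Stirling numbers of the first kind."""
    row = [1]  # c(0, 0) = 1
    for n in range(t):
        new = [0] * (len(row) + 1)
        for k, value in enumerate(row):
            new[k] += n * value  # element n + 1 joins an existing cycle
            new[k + 1] += value  # element n + 1 forms a new cycle
        row = new
    return row


def retained_count_pmf(t: int) -> list[Fraction]:
    """P(N(T) = k) for k = 0..t, assuming all orderings of valences are equally likely."""
    total = factorial(t)
    return [Fraction(c, total) for c in unsigned_stirling_first_row(t)]


if __name__ == "__main__":
    pmf = retained_count_pmf(10)
    mean = sum(k * p for k, p in enumerate(pmf))
    print(float(mean), sum(1.0 / j for j in range(1, 11)))
```

Exact rational arithmetic is used so that the mean of the tabulated distribution can be compared with the harmonic number without rounding error.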
Model I cannot be considered realistic since the average number of items in memory only grows as a logarithm of the number of time steps, i.e., it remains low over a lifetime even if one assumes acquisition of a new memory every second. Models II and III are two possible ways to correct for this deficiency, which we now consider one by one. Simulations of model II show an interesting behavior when the number of presented items is large: after a brief transient, items with valences above a certain threshold remain in memory indefinitely, while items below this threshold are eventually erased (see Fig. 7, which shows the probability for an item to remain in memory as a function of its valence for different total numbers of presented items). Showing mathematically that this is indeed the case requires some advanced techniques in probability theory (Ref. 23). If one assumes this behavior, the value of the threshold can be calculated in the following way (thanks to Friedgut for help with this derivation). Without loss of generality, assume that item valences are uniformly distributed in the interval between 0 and 1. Denote the threshold as θ (0 < θ < 1). Each item below threshold (IBT) is eventually erased upon presentation of another item, which itself could be either an item above threshold (IAT) or an IBT. Denote by p the fraction of IBTs that are erased by the presentation of one of the IATs. Since each IAT erases exactly one IBT and all IATs remain in memory, we get

$$p = \frac{1-\theta}{\theta}. \tag{20}$$

On the other hand, for each IBT with strength x (x < θ), the probability that it is erased by one of the IATs is $(1-\theta)/(1-x)$, since the item that erases it is uniformly distributed between x and 1; hence, p can also be obtained by averaging this probability over all x between 0 and θ,

$$p = \frac{1}{\theta} \int_0^{\theta} \frac{1-\theta}{1-x}\, dx = -\frac{1-\theta}{\theta} \ln(1-\theta). \tag{21}$$

From the last two equations, we obtain $\theta = 1 - 1/e$, which, in turn, implies that the number of items that are retained in memory after time T is approximately $T/e$. Hence, model II predicts that the number of items remaining in memory grows linearly with time and each memory has a finite probability 1/e of being available after an arbitrary time since acquisition, i.e., the retention curve is flat. We conclude that while the model is very interesting from a mathematical point of view, it does not provide a good account of the experimental properties of forgetting. We, therefore, turn to model III, which is another, more successful attempt to fix model I by increasing the speed of accumulation of items in memory. Model III, as opposed to the other two models, is characterized by one free parameter, namely, the dimensionality of the valence measure for memory items, n.
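The linear growth predicted for model II can be checked with a short simulation. The sketch below assumes uniform valences on [0, 1) and implements the erasure rule with a min-heap: a new item removes the stored item with the smallest valence whenever that valence is below its own. The function name is hypothetical.

```python
import heapq
import math
import random


def model_two_retained_fraction(num_items: int, seed: int = 0) -> float:
    """Fraction of presented items still in memory under model II.

    Among stored items with valence below the new one, the one with the
    smallest valence is the global minimum of memory, so it suffices to
    inspect the top of a min-heap.
    """
    rng = random.Random(seed)
    memory = []  # min-heap of stored valences
    for _ in range(num_items):
        valence = rng.random()
        if memory and memory[0] < valence:
            heapq.heappop(memory)  # erase the weakest dominated memory
        heapq.heappush(memory, valence)
    return len(memory) / num_items


if __name__ == "__main__":
    print(model_two_retained_fraction(100_000), 1.0 / math.e)
```

For long lists the retained fraction should settle near 1/e if the threshold picture described above holds; inspecting the smallest stored valences at the end of a run gives a rough view of where the threshold lies.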
The closed-form analytical solution for the retention curve in model III can be obtained by noting that if valence components are distributed uniformly between 0 and 1, then for an item with valence $\mathbf{v} = (v_1, \ldots, v_n)$, the probability for a subsequent item to erase it is given by $\prod_{i=1}^{n} (1 - v_i)$; hence, the probability for it to survive t consecutive items is $\left[1 - \prod_{i=1}^{n} (1 - v_i)\right]^t$. Averaging this expression over $\mathbf{v}$ results in the following expression for the retention curve:

$$RC_n(t) = \int_0^1 \!\cdots\! \int_0^1 \Big[1 - \prod_{i=1}^{n} (1 - v_i)\Big]^t dv_1 \cdots dv_n = \sum_{k=0}^{t} \binom{t}{k} \frac{(-1)^k}{(k+1)^n}. \tag{22}$$

Alternatively, one can use the following recursive equation for this function (Ref. 9):

$$RC_n(t) = \frac{1}{t+1} \sum_{k=1}^{t+1} RC_{n-1}(k-1). \tag{23}$$

To derive this equation inductively, consider an item acquired at time 0 followed by t other items. Let k be the rank of the original item among the group of t + 1 items along the last valence dimension, i.e., k − 1 of the subsequently acquired t items have higher valence along this dimension, while the rest have lower valence and, hence, cannot erase the original item regardless of the other dimensions. In order for the original item to survive for t time steps, it has to survive the k − 1 potentially “dangerous” items thanks to the first n − 1 dimensions, the probability of which is $RC_{n-1}(k-1)$ by the definition of the retention curve. Since all values of k from 1 to t + 1 are equally likely and, hence, have a probability of $1/(t+1)$, the total retention probability, averaged over possible values of k, is given by Eq. (23). One can use this equation to calculate the asymptotic behavior of the RC for very large t (Refs. 9 and 20),

$$RC_n(t) \approx \frac{(\ln t)^{n-1}}{(n-1)!\; t}, \tag{24}$$

which has the same scaling as in the one-dimensional case [Eq. (16)] with a logarithmic correction.
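The recursion and the closed form can be compared numerically. The sketch below assumes uniform valence components, the alternating-sum form obtained by expanding [1 − ∏(1 − v_i)]^t and integrating term by term, and the base case RC_0(t) = 1 for t = 0 and 0 otherwise (with no valence dimensions, any later item erases the original one). The base case and the function names are assumptions made for this illustration.

```python
from fractions import Fraction
from math import comb


def retention_curve_recursive(n: int, t_max: int) -> list[Fraction]:
    """RC_n(t) for t = 0..t_max via RC_n(t) = (1/(t+1)) * sum_{j=0}^{t} RC_{n-1}(j)."""
    rc = [Fraction(1)] + [Fraction(0)] * t_max  # assumed base case n = 0
    for _ in range(n):
        running = Fraction(0)
        new = []
        for t in range(t_max + 1):
            running += rc[t]
            new.append(running / (t + 1))
        rc = new
    return rc


def retention_curve_closed_form(n: int, t: int) -> Fraction:
    """RC_n(t) = sum_{k=0}^{t} (-1)^k C(t, k) / (k + 1)^n."""
    return sum(
        (Fraction((-1) ** k * comb(t, k), (k + 1) ** n) for k in range(t + 1)),
        Fraction(0),
    )


if __name__ == "__main__":
    curve = retention_curve_recursive(5, 500)
    for t in (1, 10, 100, 500):
        print(t, float(curve[t]))
```

Exact fractions avoid the cancellation errors that the alternating sum would suffer in floating point, at the cost of speed for large t; for long lags the recursion is the more practical route.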
As shown in Fig. 3 above, our experimental data with random lists of common nouns support the five-dimensional version of model III. Interestingly, our later experiments showed that this behavior is not universal: performing the same experiments with different types of items (verbs, short sentences, and sketches) results in retention curves better described by four-dimensional (for verbs) and seven-dimensional (sketches and sentences) variants of model III (Ref. 24).
Many properties of model III that could have experimental relevance beyond the retention curve remain open. Here, we illustrate some of them with numerical simulations. A set of items in model III can be viewed as a particular instance of a partially ordered set (poset) (Ref. 25) with a product order, if we say that two items are ordered if and only if one of them has higher valence in all dimensions (i.e., one item erases the other one if presented at a later time). Since any partial order can be extended to a total one (Ref. 20), one can always arrange the items such that all of them will remain in memory. The distribution of the smallest number of retained items over different realizations of valences for lists of 500 items, generated by simulating model III with n = 5, is shown in Fig. 8. This distribution can be compared with the distribution (over the realizations of valences) of the average number of retained items over different orderings of items in the list, shown in the same figure.
Another interesting feature of the model is the possible dependence between the retention of items in different positions. In Fig. 9, we show several examples of simulated matrices of item-to-item correlations after the list presentation, obtained with n = 1, 3, 5, and 7. As mentioned above, different values of n correspond to different types of memory items used in the experiments (namely, verbs, nouns, and visual sketches), so these correlations could potentially be measured in future experiments. While most of the correlations are quite weak, one can see interesting patterns developing, especially for n = 3.
We presented a number of mathematical models of human memory that are based on a small set of clear and intuitive principles and provide a good account of some of the experimental results involving recall and recognition of lists of randomly assembled words or sentences. The models have a simple formulation and almost no free parameters; nevertheless, their mathematics is nontrivial and only partially explored. All of the models are formulated as deterministic discrete processes driven by certain intrinsic characteristics of memory items, such as measures of their valence and inter-item similarities. Limiting the study to experiments with random lists of items allowed us to consider these measures as coming from simple statistical ensembles, which is why the mathematical results are statistical in nature, expressed in terms of probability distributions and moments of the memory performance in question. This statistical nature of the results should not hide the deterministic nature of the hypothesized memory processes, which is crucial for understanding our results. Since the principles that govern the underlying processes assumed in our analysis are quite general and intuitive, they could be relevant in different contexts, such as random games (Ref. 19). Whether the models considered in this study can be extended to describe memory for more natural types of information, such as meaningful narratives, is an open question for future studies.
We would like to thank Dr. Noga Alon for helpful discussions and Dr. Andrei Kupavski and Dr. Ehud Friedgut for help with mathematical derivations. This research has received funding from the European Union’s Horizon 2020 Framework Programme for Research and Innovation under Specific Grant Agreement No. 785907 (Human Brain Project SGA2), the Israeli Science Foundation (Grant No. 1657/19), EU-M-GATE 765549, and Foundation Adelis.
Conflict of Interest
The authors have no conflicts to disclose.
Data sharing is not applicable to this article as no new data were created or analyzed in this study.
APPENDIX: MOMENTS OF LENGTH OF RECALL TRAJECTORY FOR ASYMMETRIC RANDOM MATRIX
We proceed by direct calculation. First, we transform the probability distribution of Eq. (6),
Using this form, we can express the statistic as follows:
We can immediately see that (6) is a distribution, since [f(k) = 1 and f(M − n) − f(M − n − 1) = 0]. The mean value is
Using results on Ramanujan’s Question 294 (Refs. 15 and 21),
we can write the exact formula for the mean given in Eq. (7),
Next, we can compute
from which Eq. (9) follows,