We propose a model for the social flow of information in the form of text data, which simulates the posting and sharing of short social media posts. Nodes in a graph representing a social network take turns generating words, leading to a symbolic time series associated with each node. Information propagates over the graph via a quoting mechanism, where nodes randomly copy short segments of text from each other. We characterize information flows from these texts via information-theoretic estimators, and we derive analytic relationships between model parameters and the values of these estimators. We explore and validate the model with simulations on small network motifs and larger random graphs. Tractable models such as ours, which generate symbolic data while controlling the information flow, allow us to test and compare measures of information flow applicable to real social media data. In particular, by choosing different network structures, we can develop test scenarios to determine whether or not measures of information flow can distinguish between true and spurious interactions, and how topological network properties relate to information flow.
Rich datasets on human activity and behavior are now available, thanks to the widespread adoption of online platforms such as social media. The primary artifact generated by users of these platforms is text in the form of written communication. These symbolic data are invaluable for research on information flow between individuals and across large-scale social networks, but working with and modeling natural language data is challenging. While most models of social information flow are compartmental, contagion, or cascade models, the richness of the text data available to researchers underscores the importance of incorporating the full information present in text into modeling efforts. In this paper, we propose a model for how groups of individuals embedded in a social network can generate streams of text data based on their own interests and the interests of their neighbors in the network. The goal is to more explicitly capture the dynamics inherent to human discourse. We show how to relate parameters in the model to quantities underlying information-theoretic estimators specifically aimed at understanding information flow between sources of text. By controlling the graph topology and model parameters, we can benchmark how information flow measures applied to text deal with spurious interactions and confounds.
Recently, considerable effort has been devoted to better understanding information flow in dynamical systems and real datasets.1 On one hand, new measures and algorithms have been developed to better understand information flow interactions and related phenomena, including transfer entropy,2 symbolic transfer entropy,3 convergent cross-mapping,4 and causation entropy.5,6 On the other hand, new large-scale datasets have allowed researchers to better understand at scale the spread of information in a complex system, especially those involving online social networks and social media such as Twitter.7,8 Especially interesting are studies applying information-theoretic tools to large-scale social media data, such as Ver Steeg and Galstyan, who consider the shared information present in the timings of tweets posted by social ties on Twitter,9 and Borge-Holthoefer et al., who use symbolic transfer entropy to investigate predictive signals of collective action such as protests in the time series of the numbers of tweets posted in different geographic regions.10 These recent studies show that tools developed from information theory and dynamical systems theory can successfully be applied to human dynamics data captured from online platforms such as Twitter.
Most research on information flow within online media either considers proxies of information flow, such as tracking the spread of particular keywords, or uses information-theoretic tools focused on the timing of social media posts.9,10 Yet the posts themselves are packed with potentially useful data: the text generated by users of online platforms is their primary artifact and, when available for study, should be the focus of research. Fortunately for the study of information flow, information theory has a rich history of working with symbolic data such as text.
Despite the importance of focusing on the text data, there is currently a lack of models for studying information flow as measured from the text generated by users in a social network. Most work focuses on modeling information flow as a type of contagion, cascade, or diffusion process.7,11–13 These works are invaluable for studying information flow, but by compartmentalizing nodes into groups that have or have not adopted an innovation, been “infected,” etc., they generally neglect the full richness of the text generated by users in this setting.
Our goal here is to propose and analyze a simple model of the discourse underlying the text generation process online. Nodes within a given graph (representing individuals within a social network) generate symbolic time series (the time-ordered text) based on what they and their neighbors in the network say, and we relate this to information-theoretic estimators of information flow between the texts of different individuals. Doing so provides insights into how well these estimators can distinguish true versus spurious interactions, detect confounding effects, and help us relate network topological properties to the features of information flow.
The rest of this paper is organized as follows. In Sec. I, we discuss background material on entropy estimators for written text and how they may be used to measure information flow. In Sec. II, we introduce the quoter model and discuss its different components. In Sec. III, we analyze the quoter model between two individuals and compare our analytic predictions with simulations. Section IV extends these simulations to a number of network structures and investigates the interplay between network topology and information flow. We conclude with a discussion of our results and potential future directions in Sec. V.
I. Background
A. Entropy and information flow in text
The information content in a written text can be quantified with its entropy rate $h$, the number of additional bits (or other units of information) needed on average to determine the next word14 of the text given past words.15 The entropy rate is maximized for a text that is completely random, such that preceding words give no useful information for determining a subsequent word. Conversely, the entropy rate is zero for a deterministic sequence of words, such that knowledge of previous words gives all the information necessary to specify the subsequent word.
There is a rich history of practical entropy estimators for text.16–18 The challenge when working with real text is that there is information in the ordering of words, not just their relative frequencies—shuffling a text preserves the (unigram) Shannon entropy but destroys much of the information in the text. To account for the ordering of words, one needs to evaluate the complete joint (or conditional) distribution of word occurrences, and estimating these probabilities requires enormous amounts of data.
Kontoyiannis et al.19 proved that the estimator
$$\hat{h} = \frac{N \log_2 N}{\sum_{i=1}^{N} \Lambda_i} \qquad (1)$$
converges to the true entropy rate $h$ of a text, where $N$ is the length of the sequence of words and $\Lambda_i$ is the match length of the prefix at position $i$: it is the length of the shortest substring (of words) starting at position $i$ that has not previously appeared in the text. (For simplicity, we now omit the hat symbol distinguishing the estimator from the true quantity.) Theorems underlying nonparametric estimators such as Eq. (1) play an important role in the mathematics of data compression. Indeed, some authors have even used compression software to estimate the entropy of text. However, using compression software risks introducing bias, as specific compression implementations (such as gzip) trade off optimal compression rates in order to run much more efficiently. Due to these trade-offs, one should instead work directly with the theoretical estimator [Eq. (1)] to more accurately estimate $h$.
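As a concrete illustration of Eq. (1), the following minimal Python sketch computes the match lengths $\Lambda_i$ with a naive search and returns the entropy-rate estimate in bits per word. It is a sketch rather than the implementation used for the results in this paper; function and variable names are ours, and a suffix-tree search would be preferable for long texts.

```python
import numpy as np

def match_length(seq, i):
    """Longest prefix of seq[i:] that also appears contiguously in seq[:i]."""
    past = seq[:i]
    L = 0
    while i + L < len(seq):
        window = seq[i:i + L + 1]
        if not any(past[j:j + L + 1] == window for j in range(len(past) - L)):
            break
        L += 1
    return L

def entropy_rate(seq):
    """Nonparametric entropy-rate estimate of Eq. (1), in bits per word."""
    seq = list(seq)
    N = len(seq)
    lambdas = [match_length(seq, i) + 1 for i in range(N)]  # Lambda_i = match + 1
    return N * np.log2(N) / sum(lambdas)

# Example: an i.i.d. uniform text over 10 words; for long texts the estimate
# approaches log2(10) ~ 3.3 bits per word, with some finite-size bias.
rng = np.random.default_rng(42)
print(entropy_rate(rng.integers(0, 10, size=1000).tolist()))
```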
Equation (1) generalizes naturally to a cross-entropy between two sequences $A$ and $B$.20,21 To do so, define the cross-parsed match length $\Lambda_i(A \mid B)$ as the length of the shortest substring starting at position $i$ of sequence $A$ not previously seen in sequence $B$. If sequences $A$ and $B$ are time-aligned, as in a written conversation unfolding over time, then “previously” refers to all the words of $B$ written prior to the time when the $i$th word of $A$ was written. The estimator for the cross-entropy rate is then
$$\hat{h}_\times(A \mid B) = \frac{N_A \log_2 N_B}{\sum_{i=1}^{N_A} \Lambda_i(A \mid B)}, \qquad (2)$$
where $N_A$ and $N_B$ are the lengths of $A$ and $B$, respectively. The log term in Eq. (2) has changed to $\log_2 N_B$ because $B$ is now the “database” we are searching over to compute the match lengths, and the factor $N_A$ is due to the average of the $\Lambda$'s taking place over $A$. The cross-entropy tells us how many bits on average we need to encode the next word of $A$ given the information previously seen in $B$. Further, $\hat{h}_\times(A \mid A) = \hat{h}(A)$. Despite a similarity in notation, the cross-entropy is distinct from the conditional entropy (which requires estimating a joint probability distribution of $A$ and $B$, something that is not easy to estimate from social media text data, for example). The cross-entropy can be applied directly to the text of a pair of individuals by choosing $A$ to be the text stream of one individual and $B$ the text stream of the other.
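Analogously, a small sketch of the cross-entropy estimator of Eq. (2), again with a naive search for the cross-parsed match lengths. We assume here that each word stream carries per-word timestamps so that “previously” can be resolved; the names and the timestamp representation are illustrative only.

```python
import numpy as np

def cross_entropy(A, B, t_A, t_B):
    """Cross-entropy estimate h_x(A|B) of Eq. (2), in bits per word.

    A, B are word lists; t_A, t_B are per-word timestamps. Lambda_i(A|B) is
    one plus the length of the longest prefix of A[i:] found contiguously in
    the portion of B written before word i of A.
    """
    N_A, N_B = len(A), len(B)
    lambdas = []
    for i in range(N_A):
        past = [w for w, t in zip(B, t_B) if t < t_A[i]]  # B's text "so far"
        L = 0
        while i + L < N_A:
            window = A[i:i + L + 1]
            if not any(past[j:j + L + 1] == window for j in range(len(past) - L)):
                break
            L += 1
        lambdas.append(L + 1)
    return N_A * np.log2(N_B) / sum(lambdas)
```

With strictly increasing timestamps, setting B equal to A recovers the entropy-rate sketch above, consistent with the property $\hat{h}_\times(A \mid A) = \hat{h}(A)$.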
While our focus in this work is on the cross-entropy between pairs of individuals, $\hat{h}_\times$ can be generalized further to $\hat{h}_\times(A \mid \{B_1, B_2, \ldots\})$, quantifying the predictive information regarding the text in string $A$ contained within a set of strings $\{B_1, B_2, \ldots\}$.21 This lets us understand the information flow from multiple social ties to a single individual. It also allows us to construct transfer entropy-like measures: comparing $\hat{h}_\times(A \mid \{A, B\})$ with $\hat{h}(A)$ measures how much, if any, extra information is present on average in the past text of $B$ about the future text of $A$, beyond the information already present in the past text of $A$. Doing so is important when inferring information flow from data, as it is important to determine whether or not the information in $B$ is redundant if one already has the information in $A$.2,5,6
B. Social information flow
In a previous work, we showed how to use the cross-entropy [Eq. (2)] as a measure of information flow between individuals posting to the Twitter.com social media platform.21 We concatenated the texts of all public tweets for a given Twitter user into a long stream of text and then applied the aforementioned entropy and cross-entropy measures to users, pairs of users, and ego-centric networks consisting of users and their most frequent contacts. Measuring information flow with the cross-entropy naturally incorporates the temporal ordering of the tweet text and uses all the available information in the texts of the individuals, whereas other measurement methods limit themselves to proxies of information flow, such as tracking the spread of keywords like hashtags or URLs.
The focus of that work was on measuring information flow from text data. When developing and applying estimators in such scenarios, it is useful to have plausible models with which to build examples and test cases. However, most work modeling information flow has focused on the study of information as “packets” spreading between individuals, typically represented in Twitter's case by hashtags or URLs. This allows researchers to apply contagion models, such as Susceptible-Infected or other compartmental models, complex contagion models, and more.11,22–24 Contagion models are very well studied on network topologies, but in this case they neglect the dynamical processes governing written communication. The back-and-forth nature of discussions, for example, may generate far more information flow within the text than would be measurable from the spread of keywords alone.
II. The quoter model
We propose the “quoter model” as a simplified way to capture the dynamics governing the written conversations taking place between individuals in a social network. The model consists of $N$ individuals embedded as the nodes of a social network $G = (V, E)$, where $N = |V|$ and there are $|E|$ edges connecting those nodes. For generality, we take the graph to be directed, such that an edge $i \to j$ represents communication from node $i$ to node $j$ (node $j$ may quote node $i$) via the quoting process described below.
Each member $i$ of the graph generates written text over time, represented as a symbolic time series or “word stream.” At timestep $t$, individual $i$ adds a number of new words to his or her word stream. The number of new words at timestep $t$ is $\lambda_t(i)$, drawn from an integer-valued length distribution $L_i(\lambda)$. This probability distribution may be time-independent or evolve as a function of time, and it may vary across users ($L_i \neq L_j$) or not ($L_i = L$ for all $i$). After choosing the number of words to generate, the actual words are generated according to one of two mechanisms:
1. With probability $1 - q$, node $i$ draws $\lambda_t(i)$ words with replacement from a vocabulary distribution $W_i(w)$.
2. With probability $q$, a contiguous sequence of $\lambda_t(i)$ words is copied from a random position within the previously written text of a neighbor of node $i$.
This process is then repeated for all individuals in the network until their text streams have reached a desired length or a desired number of timesteps has elapsed. The first mechanism is intended to represent the creation of new content, while the second mechanism is the quoter action of the model. The quote probabilities $q_{ij}$ tune the relative strengths of the two mechanisms by setting how often node $j$ quotes from the past text of node $i$. We illustrate one step of the model for a pair of individuals in Fig. 1.
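The following is a minimal sketch of these dynamics for a single directed ego-alter pair (the configuration analyzed in Sec. III), assuming a uniform vocabulary of $z$ words and a constant length distribution; the parameter names and default values are illustrative only.

```python
import numpy as np

def simulate_quoter_pair(T=1000, lam=3, q=0.5, z=100, seed=0):
    """Two-node quoter model: the ego quotes the alter with probability q.

    Each timestep both nodes post lam words. The alter always draws new
    words uniformly from a z-word vocabulary; the ego either quotes a random
    contiguous lam-word passage from the alter's past text (prob. q) or
    draws lam new words itself (prob. 1 - q). Returns both word streams.
    """
    rng = np.random.default_rng(seed)
    alter, ego = [], []
    for _ in range(T):
        alter.extend(rng.integers(0, z, size=lam).tolist())
        if rng.random() < q and len(alter) > lam:
            start = rng.integers(0, len(alter) - lam)          # quote mechanism
            ego.extend(alter[start:start + lam])
        else:
            ego.extend(rng.integers(0, z, size=lam).tolist())   # new-content mechanism
    return alter, ego
```

Because the two streams grow at the same rate, the first $i$ words of the alter are the ones available when the ego writes its $i$th word, so word indices can stand in for timestamps when applying the cross-entropy sketch above.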
The idea underlying the second mechanism is that when two individuals are discussing a topic verbally or in writing, and they are listening to one another, then there will be a back and forth of small sequences of common words. The quotes generated by the second mechanism are not meant to capture full-length, long-form quotations such as retweets, but instead short shared sequences of text. Alice: “That's the right way to go”; Bob: “No, this is the right way.” In this example, the exchange between Alice and Bob leads to a short quotation of Alice by Bob (“the right way”), and from this exchange alone we can at least surmise that Bob is probably receiving and reacting to Alice's text. Of course, Bob could have responded in an equivalent way without that short quote. However, over the course of very long conversations we expect more such quotations to occur on average, and they will likely occur more often in conversations where there is more information flow than in conversations where there is little information flow.
A. Model components
The main components of the quoter model are (i) the graph topology, which may be as simple as a single directed link between two individuals, (ii) the quote probabilities $q_{ij}$, (iii) the length distributions $L_i(\lambda)$, and (iv) the vocabulary distributions $W_i(w)$. We study several graph topologies in this work. The quote probabilities can be considered as edge weights on the social network, and there is considerable flexibility in assigning those weights.
The length distributions $L_i(\lambda)$ govern the amount of text generated per timestep and the total length of the text: the expected length after $T$ timesteps will be $T\langle\lambda\rangle$. We primarily focus on two cases here: the constant length distribution $L(\lambda) = \delta_{\lambda,\lambda_0}$, where $\delta_{ij}$ is the Kronecker delta, and a Poisson distribution with mean $\bar{\lambda}$.
The vocabulary distribution $W_i(w)$ gives the relative frequencies of words $w$ for individual $i$. In this work, we consider two example $W$'s. The first is a uniform distribution over a fixed number $z$ of unique words: $W(w) = 1/z$. The binary case corresponds to $z = 2$. The second vocabulary distribution is a basic Zipf's law that incorporates the skewed distributions typically observed in real text corpora.25 Here, the probability of a word depends on its rank $r$, with the most probable word having rank $r = 1$. Zipf's law then defines word probabilities that obey a power-law form, $W(r) = r^{-\alpha} / H_{z,\alpha}$, where $\alpha$ is a power-law exponent. This distribution is normalized by $H_{z,\alpha} = \sum_{r=1}^{z} r^{-\alpha}$, the generalized harmonic number. This distribution also holds for infinite vocabularies ($z \to \infty$) so long as $\alpha > 1$, in which case the normalization constant converges to the Riemann zeta function $\zeta(\alpha)$.
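For concreteness, a short sketch of the truncated Zipf vocabulary and of its Simpson index $q_v$, a quantity that appears in the analysis of Sec. III; variable names and parameter values are illustrative.

```python
import numpy as np

def zipf_vocabulary(z=1000, alpha=1.5):
    """Truncated Zipf distribution W(r) = r^(-alpha) / H_{z,alpha}, r = 1..z."""
    ranks = np.arange(1, z + 1)
    weights = ranks ** (-float(alpha))
    return weights / weights.sum()        # normalization by the harmonic number

rng = np.random.default_rng(1)
W = zipf_vocabulary()
words = rng.choice(np.arange(1, len(W) + 1), size=5, p=W)  # draw 5 word ranks
print(words)
print("Simpson index q_v =", np.sum(W ** 2))
```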
III. Model analysis
Here, we study the basic quoter model between two individuals (referred to as the “ego” and the “alter”), where the ego copies the alter but the alter does not copy the ego. We focus on the case of a uniform vocabulary distribution $W(w) = 1/z$, and we assume both individuals draw from the same $W$, although our analysis is not specific to these assumptions.
To quantify the flow of information from the alter to the ego via the cross-entropy $h_\times(\mathrm{ego} \mid \mathrm{alter})$, we need to compute the mean $\langle\Lambda\rangle = \frac{1}{N}\sum_i \Lambda_i$, where $\Lambda_i$ is the length of the shortest substring of words beginning at position $i$ in the ego's text which has not been observed in the text of the alter prior to “time” $i$ (Sec. I), and $N$ is the total length of the text. To model $\langle\Lambda\rangle$, we assume that (i) two terms contribute to $\langle\Lambda\rangle$: the mean when a quote occurs (call it $\langle\Lambda_q\rangle$) and the mean when no quote occurs (call it $\langle\Lambda_{nq}\rangle$) and (ii) the quote probability $q$ weights these two possibilities:
$$\langle\Lambda\rangle = q\,\langle\Lambda_q\rangle + (1 - q)\,\langle\Lambda_{nq}\rangle, \qquad (3)$$
where we have suppressed the dependence on position in $\langle\Lambda_q\rangle$ and $\langle\Lambda_{nq}\rangle$. We need to determine both $\langle\Lambda_q\rangle$ and $\langle\Lambda_{nq}\rangle$ as functions of the vocabulary distribution and the current amounts of text generated.
A. Prefix matches when not quoting
It is possible, as the ego draws words from the vocabulary distribution, that due to chance a string of words will be generated that previously appeared in the past text of the alter. This depends on the vocabulary distribution and the length of the alter's past text.
Suppose the alter has posted a total of $N_a$ words so far and the ego has just posted $\lambda$ new words. The probability that one of the new words posted by the ego matches a random word previously posted by the alter is $q_v = \sum_w W(w)^2$. This is the probability that two draws from the vocabulary distribution give the same word, irrespective of the particular word, and is the Simpson index (also known as the Herfindahl–Hirschman index) of the vocabulary distribution.26,27 The probability of at least $\ell$ new ego words matching with prior alter words at a particular location in the alter's past text is $q_v^{\ell}$. Since there are approximately $N_a$ locations in the alter's text at which a match may occur (assuming $N_a \gg \lambda$), the expected number of matches of length $\ell$ or more is $N_a q_v^{\ell}$. The expected length of the longest match then occurs at the value of $\ell$ for which $N_a q_v^{\ell} = 1$. Solving for $\ell$ gives an expected longest match length of $\ell^* = \ln N_a / \ln(1/q_v)$, or
$$\langle\Lambda_{nq}\rangle = \frac{\ln N_a}{\ln(1/q_v)} + 1, \qquad (4)$$
since $\Lambda$ is always one more than the match length.
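A quick Monte Carlo check of this argument: draw an alter text of $N_a$ i.i.d. words, repeatedly draw fresh ego words, and compare the average longest match against $\ln N_a / \ln(1/q_v)$. The parameter values below are illustrative; since the approximation sets the expected count of matches to one, it tracks the scaling with $N_a$ rather than giving an exact expectation.

```python
import numpy as np

rng = np.random.default_rng(7)
z, N_a, trials = 100, 5000, 200              # illustrative values
q_v = 1.0 / z                                 # Simpson index of a uniform vocabulary
alter = rng.integers(0, z, size=N_a).tolist()

def longest_match(ego, past):
    """Longest prefix of `ego` that appears contiguously in `past`."""
    L = 0
    while L < len(ego) and any(past[j:j + L + 1] == ego[:L + 1]
                               for j in range(len(past) - L)):
        L += 1
    return L

sims = [longest_match(rng.integers(0, z, size=20).tolist(), alter) for _ in range(trials)]
print("simulated mean longest match:", np.mean(sims))
print("ln(N_a)/ln(1/q_v) prediction:", np.log(N_a) / np.log(1.0 / q_v))
```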
B. Prefix matches when quoting
If a quote of length $\lambda$ occurs at position $i$, then $\Lambda_i = \lambda + 1$ only if the words of the ego subsequent to the quoted words do not happen to match the words of the alter subsequent to the original quoted passage. In other words, even though a match of length $\lambda$ occurs deterministically, $\Lambda_i$ may be longer due to chance. Specifically, the probability that the match length equals $\lambda + n$, $n \geq 0$, is $q_v^{n}(1 - q_v)$, as a value of $\lambda + n$ requires that the next $n$ words match and the $(n+1)$-th word does not. Note that, unlike the previous calculations, this probability does not involve the total text length of the alter because these post-quote matches cannot occur anywhere in the alter's text except in the positions following the quoted passage (neglecting duplicate passages). From this probability, the expected number of extra matching words is
$$\langle n \rangle = \sum_{n=0}^{\infty} n\, q_v^{n}(1 - q_v) = \frac{q_v}{1 - q_v}, \qquad (5)$$
meaning that, on average, random chance increases the match length by an amount $q_v/(1 - q_v)$.
However, it is not necessarily reasonable to neglect duplicate passages. Indeed, the number of duplicate passages may be significant for certain combinations of parameters: the probability that a different location in the alter's past text is the start of a passage of length $\lambda$ equal to the randomly chosen quoted passage is $q_v^{\lambda}$, and the expected number of such duplicate passages within the alter's text (including the original passage) is $1 + (N_a - \lambda)q_v^{\lambda} \approx 1 + N_a q_v^{\lambda}$. For a binary vocabulary ($q_v = 1/2$) with $\lambda = 5$ and $N_a = 512$, for example, the expected number of duplicates is 17.
The probability for at least $n$ words of the ego's text subsequent to the newly quoted passage to also match words following the original passage in the alter is $q_v^{n}$, so the expected number of times matches of length $\lambda + n$ or longer will occur following any of the duplicate passages in the alter is $\left(1 + N_a q_v^{\lambda}\right) q_v^{n}$. The longest match length occurs at the value of $n$ for which the number of these matches is 1, or $n^* = \ln\!\left(1 + N_a q_v^{\lambda}\right) / \ln(1/q_v)$. Lastly, the expected total match length when quoting is then $\lambda + n^*$.
However, unlike with $\langle\Lambda_{nq}\rangle$, adding 1 to this expected total match length is not an accurate expression for the average $\langle\Lambda_q\rangle$. When $\lambda + n^*$ is much larger than $\langle\Lambda_{nq}\rangle$, the match length at that text position will almost certainly be due only to the single quoted passage. This means that the subsequent $\Lambda_{i+1}$ will likely be 1 fewer than $\Lambda_i$, because a random match that would extend the match is unlikely. Likewise, $\Lambda_{i+2}$ will likely be 1 fewer than $\Lambda_{i+1}$, and so forth, until the match lengths are short enough that random matching is again probable. Accounting for this, we expect the average $\langle\Lambda_q\rangle$ to be roughly equal to
$$\langle\Lambda_q\rangle \approx \frac{1}{K+1}\sum_{k=0}^{K}\left(\lambda + n^* + 1 - k\right), \qquad (6)$$
where $K = \lambda + n^* + 1 - \langle\Lambda_{nq}\rangle$. Equivalently, this is the average of the two endpoints, $\langle\Lambda_{nq}\rangle$ and $\lambda + n^* + 1$, and therefore
$$\langle\Lambda_q\rangle \approx \frac{1}{2}\left(\langle\Lambda_{nq}\rangle + \lambda + n^* + 1\right). \qquad (7)$$
We illustrate the relationship between $\Lambda_i$, $\langle\Lambda_{nq}\rangle$, and $\langle\Lambda_q\rangle$ in Fig. 2, showing a single simulation of the model and highlighting a spike in $\Lambda_i$ above $\langle\Lambda_{nq}\rangle$ and how it decays back down to $\langle\Lambda_{nq}\rangle$.
With these expressions for $\langle\Lambda_{nq}\rangle$ and $\langle\Lambda_q\rangle$, we can now compute $\langle\Lambda\rangle$ and, from it, the cross-entropy.
C. Cross-entropy
Computing the cross-entropy between the ego and alter requires computing the total match length $\sum_i \Lambda_i$ summed over all positions in the ego's text where matches can occur and then dividing by it: $h_\times = N \log_2 N / \sum_i \Lambda_i$, where $N$ is the length of the ego's text (equal to the length of the alter's text in this setup). Using the previously derived expected contributions to $\Lambda_i$ for the two mechanisms and approximating the sum over the text positions with an integral give the following expression:
$$\sum_{i=1}^{N}\Lambda_i \approx \int_{1}^{N}\left[q\,\langle\Lambda_q(t)\rangle + (1 - q)\,\langle\Lambda_{nq}(t)\rangle\right] dt, \qquad (8)$$
where $\langle\Lambda_{nq}(t)\rangle$ and $\langle\Lambda_q(t)\rangle$ are given by Eqs. (4) and (7) with the alter's text length evaluated at position $t$. This can be substituted into $h_\times = N \log_2 N / \sum_i \Lambda_i$ to compute the cross-entropy as a function of $q$, $q_v$, $\lambda$, and $N$.
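To make the prediction pipeline concrete, the sketch below numerically evaluates the integral form of Eq. (8) and converts the result to a cross-entropy in bits per word, illustrating how Eqs. (3), (4), (7), and (8) fit together; parameter values are illustrative, and this is a sketch rather than the code used to produce Fig. 3.

```python
import numpy as np

def predicted_cross_entropy(N, q, q_v, lam):
    """Predicted cross-entropy (bits/word) from Eqs. (3), (4), (7), and (8)."""
    log_inv = np.log(1.0 / q_v)

    def lam_nq(t):                     # Eq. (4): no-quote match lengths
        return np.log(t) / log_inv + 1.0

    def lam_q(t):                      # Eq. (7): match lengths when quoting
        n_star = np.log(1.0 + t * q_v ** lam) / log_inv
        return 0.5 * (lam_nq(t) + lam + n_star + 1.0)

    def mean_lam(t):                   # Eq. (3): q-weighted mixture
        return q * lam_q(t) + (1.0 - q) * lam_nq(t)

    ts = np.linspace(1.0, N, 5000)     # Eq. (8): integral over text positions
    total = np.sum(mean_lam(ts)) * (ts[1] - ts[0])
    return N * np.log2(N) / total

print(predicted_cross_entropy(N=3000, q=0.5, q_v=1 / 100, lam=3))
```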
The limit of large text lengths using Eq. (8) gives
$$\lim_{N \to \infty} h_\times = \log_2 \frac{1}{q_v}, \qquad (9)$$
which is the Rényi entropy of the vocabulary distribution,
$$H_\beta = \frac{1}{1 - \beta} \log_2 \sum_w W(w)^{\beta}, \qquad (10)$$
with order $\beta = 2$. Note also that $q$ has dropped out of this limit, implying that, given sufficient text, the entropy of the model will be that of the underlying vocabulary distribution only. However, as we shall see, for finite text lengths, even quite large ones, $q$ still plays an important role in the overall cross-entropy.
D. Comparison with simulations
To test our theoretical predictions, we simulate the quoter model and compare our predicted cross-entropy [substituting Eq. (8) into $h_\times = N \log_2 N / \sum_i \Lambda_i$ and converting to bits] with that computed directly from the simulations [Eq. (2) applied to the simulated text sequences]. We simulate the one-link, two-node model for a shorter and a longer number of timesteps, giving correspondingly shorter and longer expected text lengths. Both nodes use the same length and vocabulary distributions, and only the ego quotes the alter. Overall, we find reasonable qualitative agreement between our predictions and the simulations, as shown in Fig. 3. However, there are some systematic discrepancies. While the absolute difference in entropies between predictions and simulations is small, often less than 0.1-0.2 bits, these discrepancies indicate that the treatment above does not capture everything present in the model.
Beyond Fig. 3, which explores the cross-entropy as a function of the quote probability $q$ for different length and vocabulary parameters, it is also useful to inspect the two limiting cases of no quotes ($q = 0$) and all quotes ($q = 1$). Figure 4 explores how the cross-entropy depends on the vocabulary size $z$ when $q = 0$. Since there are no quotes, we expect no dependence on $\lambda$, and we indeed see strong collapse across the simulations and the theory (there is a slight difference between the curves only because the total length of the generated text depends on $\lambda$). Further, there is good agreement with predictions (solid lines) except at very low values of $z$ (equivalently, high $q_v$). Agreement improves considerably at higher $z$, although predicted values are still below those of the simulations. In this case, the cross-entropy depends entirely on $\langle\Lambda_{nq}\rangle$, and the expression for $\langle\Lambda_{nq}\rangle$ [Eq. (4)] appears to capture the scaling of the match lengths more accurately than their exact values.
The all-quote case ($q = 1$) is explored in Fig. 5. In this case, we expect a strong dependence on $\lambda$, and indeed we see a change of more than two bits of cross-entropy at the lower vocabulary-diversity values when moving from the smallest to the largest $\lambda$. We also see good agreement between predictions and simulations except at low $z$, although in this case agreement improves considerably at low $z$ for the longer text length.
Overall, we find that our treatment of the model captures the basic qualitative links between $q$, $q_v$, $\lambda$, and the total text length. Agreement is not perfect, indicating that either more is going on in the model than is being captured, particularly at low $z$, or the match-length-based entropy estimators are biased for finite text, or some combination thereof. A more rigorous treatment of the model may be able to distinguish between these possibilities and can extend the analysis to more complex arrangements than a single link between a pair of individuals.
IV. The quoter model on networks
Moving beyond our treatment of a single pair of individuals (Sec. III), here we numerically investigate the quoter model on four simple network topologies (see Fig. 6): (i) a chain of nodes where each node copies from the previous node, (ii) a fork where one node influences two nodes, (iii) a collider where a node is influenced by two nodes simultaneously, and (iv) larger Erdős-Rényi and Barabási-Albert random graphs (not shown in Fig. 6). These topologies allow us to better understand, in a simplified context, the interplay between network topology and the dynamics of information flow as measured via the cross-entropy. The chain allows us to understand the attenuation of information flow with distance, the fork and the collider provide simple motifs to investigate confounds and spurious links, and the larger graph models can shed light on how global network properties such as density can affect information flow.
A. (i) Chain of quoters
We investigate the attenuation of information by simulating the quoter model over a unidirectional chain of nodes $0 \to 1 \to 2 \to \cdots$, where each node has probability $q$ of quoting the node directly before it in the chain, except for the first node (node 0), which only draws new words from its vocabulary distribution $W(w)$. At each timestep, each node in the chain writes or quotes $\lambda$ words; new words are drawn from a 1000-word truncated Zipf distribution with a fixed power-law exponent. (Results were found to be very similar when using a uniform distribution with the same number of words.) We simulate the model on the chain for 10,000 timesteps, so each node's word stream reaches an expected length of $10{,}000\,\lambda$ words.
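A minimal sketch of the chain simulation just described; the chain length, quote length, and Zipf exponent used here are placeholders rather than the values used to produce Fig. 7.

```python
import numpy as np

def simulate_chain(n_nodes=10, T=10_000, q=0.5, lam=3, z=1000, alpha=1.5, seed=0):
    """Quoter model on a unidirectional chain 0 -> 1 -> ... -> n_nodes-1.

    Node 0 only writes new words; every other node quotes its upstream
    neighbor with probability q, otherwise it draws lam new words from a
    z-word truncated Zipf distribution with exponent alpha.
    """
    rng = np.random.default_rng(seed)
    ranks = np.arange(1, z + 1)
    W = ranks ** (-float(alpha))
    W /= W.sum()
    streams = [[] for _ in range(n_nodes)]
    for _ in range(T):
        for i in range(n_nodes):
            upstream = streams[i - 1] if i > 0 else []
            if i > 0 and len(upstream) > lam and rng.random() < q:
                start = rng.integers(0, len(upstream) - lam)   # quote upstream neighbor
                streams[i].extend(upstream[start:start + lam])
            else:
                streams[i].extend(rng.choice(ranks, size=lam, p=W).tolist())
    return streams
```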
Figure 7 shows the cross-entropy of node $k$ in the chain from the first node (node 0), which generates original text. For moderate values of the quote probability, information attenuates quickly, with the cross-entropy having saturated by approximately the third link in the chain. Only at very high quoting probabilities ($q$ close to 1) do we observe greater information flow (lower cross-entropy) for nodes further along the chain.
B. (ii) Fork and (iii) Collider
To investigate how the cross-entropy distinguishes between information flow from different sources, we simulate the quoter model on the three-node “fork” and “collider” graphs shown in Fig. 6. First, for the fork graph [Fig. 6(ii)], using the same length and vocabulary parameters as above, we vary the probabilities $q_{A\to B}$ and $q_{A\to C}$ with which nodes B and C, respectively, copy the source node A, which generates original content (drawing words from $W(w)$ only). The top and bottom panels of Fig. 8 show the cross-entropy of C from A and of C from B, respectively, averaged over 1000 realizations of the model. As expected, $h_\times(C \mid A)$ shows no dependence on $q_{A\to B}$ and decreases approximately linearly as the quote probability $q_{A\to C}$ grows (Fig. 8; top).
The dependence of C upon B in the fork is more complex, however, with the cross-entropy $h_\times(C \mid B)$ of the nonexistent link between B and C decreasing with both increasing $q_{A\to B}$ and increasing $q_{A\to C}$ (Fig. 8; bottom). However, there exists a clear separation in the values of cross-entropy between the two cases, with $h_\times(C \mid B)$ being significantly larger than $h_\times(C \mid A)$ for most quote probabilities, except the region where both $q_{A\to B}$ and $q_{A\to C}$ are close to 1. The cross-entropy therefore effectively identifies the direction of real information flow for this model graph.
Due to the fork's symmetry, the results for $h_\times(B \mid A)$ and $h_\times(B \mid C)$ are identical to those shown in Fig. 8. Likewise, the analogous $h_\times(C \mid A)$ and $h_\times(C \mid B)$ for the collider network topology [Fig. 6(iii)] appear similar to the top panel of Fig. 8: with no dependence between A and B in the collider, $h_\times(C \mid A)$ decreases linearly with $q_{A\to C}$ and shows no dependence on $q_{B\to C}$ (not shown).
C. (iv) Random networks
Finally, we investigate the quoter model on larger networks, modeled as random graphs. We simulate the quoter model on Erdős-Rényi (ER)28,29 and Barabási-Albert (BA)30 random graphs. ER graphs are simple models that capture only the overall density of a network but are a useful starting point; BA graphs capture the “scale-free” property observed in real-life social networks. Using graphs with a fixed number of nodes, we create directed, weighted networks of varying average node degree.31 To create directed ER networks, we chose pairs of nodes $i$ and $j$ and created an edge from $i$ to $j$ with a fixed probability $p$. For the BA networks, we used the standard preferential attachment method with edges pointing in both directions. This construction means that quoting is always bidirectional in the BA networks, but not necessarily in the ER networks. Other options are possible for the BA networks, e.g., creating directed links only from newer nodes to older nodes through the preferential attachment process; however, this would have rendered these networks directed trees rather than the more general graphs desired here.
Quote probabilities are assigned at random to the edges, along with a probability for each node to generate new content, normalized so that each node's probabilities of quoting its neighbors and of generating new content sum to 1 (weighted according to the adjacency matrix of the graph). The quoter model is then run over the network, updating a randomly chosen node at each timestep, and using the same vocabulary and quote-length distributions as above. At the end of the simulation, each node has generated a text stream of fixed expected length. We simulate 100 realizations of the network and quoter model dynamics on both the ER and BA networks.
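As one way to set up such an experiment, the sketch below builds a directed ER graph and gives each node a new-content probability plus incoming-edge quote probabilities that sum to one. This is a plausible reading of the normalization described above rather than necessarily the exact scheme used for Fig. 9, and the values of N and p are illustrative.

```python
import numpy as np

def directed_er_with_quote_probs(N=100, p=0.05, seed=0):
    """Directed ER graph plus per-node quote probabilities.

    Each ordered pair (i, j), i != j, receives an edge i -> j with probability
    p, meaning node j may quote node i. Each node j is then given random
    weights for "generate new content" and for quoting each in-neighbor,
    normalized to sum to 1.
    """
    rng = np.random.default_rng(seed)
    in_nbrs = [[] for _ in range(N)]
    for i in range(N):
        for j in range(N):
            if i != j and rng.random() < p:
                in_nbrs[j].append(i)
    probs = []
    for j in range(N):
        w = rng.random(len(in_nbrs[j]) + 1)      # last entry: new-content action
        w /= w.sum()
        probs.append({"new": w[-1], "quote": dict(zip(in_nbrs[j], w[:-1]))})
    return in_nbrs, probs
```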
Information flow on these graphs as a function of the graph's average node degree is shown in Fig. 9. As the average degree increases, the average cross-entropy of a node from its neighbors also increases, meaning that a node becomes less predictable from its neighbors with increasing density. The BA graphs show slightly lower median cross-entropy, however, with larger variation across realizations. The presence of high-degree hubs in BA graphs means that the cross-entropy can exhibit a larger range of variation, with the probability of hub nodes generating new content driving much of the information flow on the network. The increasing trend of cross-entropy with average node degree indicates that information “sources” and “sinks” become increasingly difficult to identify in a network as the density of connections increases.
V. Discussion
In this paper, we introduced the quoter model as a simple, paradigmatic model of the flow of information. Considerable effort has been put into measuring information flow in online social media, both from proxies such as tracking keywords and from information-theoretic tools. Models of the dynamics underlying these processes are invaluable for better understanding information flow, and the goal of our work is to introduce a model that more directly relates to information flow in text data than traditional contagion-style models, but without being overly complicated. Our model mimics at a basic level the overall dynamics of text streams posted online, and here we showed that one can derive expressions for the information flow between written texts as measured via the cross-entropy.
The analysis we performed here showed good qualitative agreement with simulations in general, but there remains room for improvement. Nevertheless, the ability to find tractable expressions for information-theoretic quantities highlights how the basic quoter model can provide better insights into information flow over social networks. Indeed, we proposed this model because empirical benchmarks for information flow over social networks are difficult to find. However, as many dynamic processes can be represented by symbolic time series, models like the quoter model may even be useful when studying information flow in more general contexts.
The language generator we studied here is a relatively simplistic bag-of-words model: individuals simply draw words from a given vocabulary distribution $W(w)$. More realistic models should be explored. One possibility would be a time-dependent $W$. For example, one could endow $W$ with a latent context $c$, $W(w \mid c)$, and allow the context to vary (slowly) over a space of contexts. A Markov chain over this context space would be one way to introduce dynamic context shifts. Such a context dependence can then be used to model topical shifts over the length of a discourse. If two users exhibit the same context shifts, their vocabulary distributions will tend to “sync up” with each other, and this should lead to a lower cross-entropy than if contexts were not shared.
This dynamic context shift in quoted discourse suggests a natural time-based generalization of the model as well. With quoting behavior likely to occur within a short “attention span” of the time of the original message, it makes sense to incorporate into the model a probability of quoting that decays over time. While the form of this probability would likely introduce an extra parameter, it is plausible that this parameter could be estimated from real data. Future work will explore the possibility of fitting the quoter model to real datasets.
Lastly, there is much room for future exploration of network topology and its relationship to information flow. As the quoter model allows us to design “planted” interactions, we can implement the quoter dynamics on constructed networks and then test whether algorithms can successfully infer true interactions and reject spurious interactions. We did this here with the fork and collider graphs. Moving beyond those small motifs, one area of network structure worth exploring in future work is that of network topologies exhibiting clustering, to investigate the effect of community structure32 on information flow.
ACKNOWLEDGMENTS
We gratefully acknowledge the resources provided by the Vermont Advanced Computing Core at the University of Vermont and the Phoenix HPC service at the University of Adelaide. This material is based upon work supported by the National Science Foundation (Grant No. IIS-1447634). L.M. acknowledges support from the Data To Decisions Cooperative Research Centre (D2D CRC) and the ARC Centre of Excellence for Mathematical and Statistical Frontiers (ACEMS).