Evaluating any model underlying the control of speech requires segmenting the continuous flow of speech effectors into sequences of movements. A virtually universal practice in this segmentation is to use a velocity-based threshold which identifies a movement onset or offset as the time at which the velocity of the relevant effector breaches some threshold percentage of the maximal velocity. Depending on the threshold choice, more or less of the movement's trajectory is left in for model regression. This paper makes explicit how the choice of this threshold modulates the regression performance of a dynamical model hypothesized to govern speech movements.
Modern linguistic theory and all models of speech agree that linguistic messages such as syllables or whole sentences are constructed out of combinations or sequences of discrete units (Chomsky, 2000; von Humboldt, 1836) or, in spoken languages, phonemes actuated as vocal tract gestures. Even though a sequence of units suggests that all properties of these units are located at the same point in time, the sequences generated in biological movements, be it of the vocal tract, the face, or the limbs in other activities, have an important characteristic. The units composing these sequences are coarticulated (Öhman, 1966) or interwoven in a continuous flow of smooth transitions from one unit to the next. This continuity burdens experimenters and modelers concerned with recovering the dynamics of individual units with the task of segmenting the continuous flow of effectors into sequences of movements (corresponding to the hypothesized units).
Whether speaking, pointing, running, or dancing, the continuous flow of effectors is typically captured by a number of sensors attached to the human body whose position is tracked over time by an external device. In the resulting data, the movement of a single effector is said to initiate, when the velocity (i.e., the rate of change of position) of the sensor attached to that effector exceeds some minimal value, and likewise the movement of the effector is said to end when the sensor's velocity falls below some (potentially different) minimal value (Hogan and Sternad, 2007). Accordingly, it has become standard to consider velocity as the primary source of information in carrying out the segmentation, with the virtually universal practice of using a velocity-based threshold parameter for identifying the onset or offset of a movement as the time at which the instantaneous velocity of a relevant effector breaches some threshold percentage of its maximal velocity. Notwithstanding the universality of this practice, a review of the literature indicates widespread quantitative differences regarding the value of the threshold parameter, with the consequences of these differences for model evaluation remaining, to our knowledge, so far unexplored.
In speech, an often used value for the segmentation threshold is 20% (e.g., Bombien et al., 2013; Hoole et al., 1994; Parrell and Narayanan, 2014; Shaw and Chen, 2019; Shaw et al., 2011), with deviations from this value not uncommon. To give some examples: Pouplier (2020) uses 20 ± 5% (as appropriate) for tongue movements in German and Georgian consonant-consonant-vowel sequences, Chitoran (2008) makes use of 15% or 20% for tongue and lip movements in Georgian stop-stop sequences, and Lee (2015) reduces the threshold to 10% for tongue movements in Korean liquids. Yet others (e.g., Kuberski and Gafos, 2019; Munhall et al., 1985; Mücke et al., 2012) use a threshold of 0% (no threshold), conceivably so as not to cut away relevant information contained in the regions near the movement endpoints. For a cogent discussion of potential problems with the threshold method and a proposal for using sources of information other than velocity for segmentation, see Liu (2022).
A similar variety in the choice of thresholds is evident in the field of general motor control. For example, Hug (2011) calls a range of 15%–25% (or 1, 2, or 3 standard deviations) typical for onset detection in kinematic muscle activity data (EMG), and Baum and Li (2003) make use of 10% (alternatively, 20% if more appropriate after visual inspection) in EMG data of the lower extremities in cycling. Coderre (2010) mentions 5% as typical for arm movements in reaching tasks, and Buchanan (2003) likewise uses 5% for data of hand tapping and elbow flexion-extension motions. In a study explicitly focusing on transitions between individual point-to-point arm movements, Sternad (2013) considered threshold values of 1%, 3%, and 5%.
Despite the variety in segmentation practices, with threshold values ranging from 0% to 25% within and across fields, what is undoubtedly common in all these studies is the ultimate aim of understanding the underlying model controlling the so-segmented movements.1 It is to this aim that the present study is devoted. More specifically, our aim is to make explicit how the performance of the dynamical model in Task Dynamics (Saltzman, 1986; Saltzman and Kelso, 1987; Saltzman and Munhall, 1989) depends on the threshold value used in segmentation. Among other dynamical model candidates (see, e.g., Parrell et al., 2019, for an overview), we chose here the model proposed in Task Dynamics, which is that of the damped linear oscillator, $\ddot{x} + b\dot{x} + kx = 0$ (with stiffness $k$ and damping $b$), because of its explicitness on how to relate movement data to model parameters and because of its time-honored presence in the field of speech. As the first explicit dynamical formulation seeking to characterize some of the most important general aspects of the control and coordination of movements (e.g., task-specific invariance of effector trajectories, equifinality of movements, self-adjustment under perturbation; Fowler et al., 1980; Kelso et al., 1983; Saltzman and Kelso, 1987), the linear model is still considered a plausible candidate for increasing our understanding of speech. Recently, for example, on the basis of high-density intracranial electrocorticography (ECoG) signals, Chartier et al. (2018, p. 1048) estimated speech effector positions from the recorded neural activity in the human sensorimotor cortex and, by considering characteristics of the estimated data in the position-velocity plane (also known as the phase plane, see Sec. 4) as well as the linearity in the peak velocity-displacement relation, concluded that the so-estimated movements exhibited damped oscillatory dynamics in accordance with the linear model in Task Dynamics.
In the following three sections, we first describe our approach of registering articulatory movement data (Sec. 2) to be segmented by six different threshold values (0%, 5%, 10%, 15%, 20%, and 25%) drawn from the entire range of values found in the literature. We then turn to a statistical assessment (Sec. 3) of the linear model's fit to the so-segmented data. Finally, we explain (Sec. 4) why thresholding modulates model performance in the ways revealed in our results, by considering dynamical tools of analysis, and outline implications for future work.
Five native speakers of German (three female, two male) and five native speakers of English (three female, two male) were recruited at the authors' institution to participate in an experiment of repeated syllable production. The speakers produced repeated /ka/ and /ta/ syllables at a rate indicated by the beats of an audible metronome. In blocks of four trials for each rate, the speakers were exposed to the beat of a metronome (with a duration allowing for the production of 30 consecutive syllables) and began articulation at a point of their choice. The rate of the metronome was set to 90, 150, 210, 300, 390, and 480 beats per minute (bpm), covering slow, normal, and fast speech rates. Data registration was conducted using Electromagnetic Articulography (EMA; Carstens AG501, Carstens Medizinelektronik GmbH, Bovenden, Germany). Relevant for the purposes of this work were the movements (after head movement correction) of a sensor placed on the tongue body and a sensor placed on the tip of the tongue, the two primary effectors involved in the formation and release of constrictions in /ka/ and /ta/ syllables. The registered raw data were processed using a heptic-order spline smoothing approach with a fixed predicted mean square error, obtained by a Fortran-to-MATLAB software port of Woltring's classical spline smoothing code (Kuberski, 2023; Woltring, 1986).
The resulting tongue body (for /ka/) and tongue tip (for /ta/) signals were then segmented into individual movements by means of local tangential velocity extrema identified using the full three dimensions. As per its definition, tangential velocity was computed by the square root of the sum of squares of the horizontal, vertical, and lateral velocity components in three dimensions. Timestamps at which the tangential velocity of a relevant effector was minimal served as (pre-threshold) locations for a movement's endpoints. The maximal value of tangential velocity between any two consecutive minima was next used to apply the segmentation threshold and adjust the threshold-specific onset and offset locations accordingly. In particular, the difference between the maximal and minimal value multiplied by the threshold value was used as the offset in velocity an effector had to breach (exceed or fall below) to be regarded as the onset or offset of movement. Overall, for each considered threshold value, the segmentation process yielded speech movement data of about 5000 /ka/ and 5000 /ta/ syllables, across all speakers and metronome rates.
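The segmentation steps just described can be sketched in a few lines of Python. This is a minimal illustration rather than the procedure actually used in the study: it works on sample indices instead of the continuous spline representation, the function names are hypothetical, and it assumes that the onset level is computed relative to the minimum preceding the movement and the offset level relative to the minimum following it.

```python
import math

def tangential_velocity(vx, vy, vz):
    """Speed: square root of the sum of the squared velocity components."""
    return [math.sqrt(a * a + b * b + c * c) for a, b, c in zip(vx, vy, vz)]

def local_minima(v):
    """Indices of interior local minima of a sampled signal."""
    return [i for i in range(1, len(v) - 1) if v[i - 1] > v[i] <= v[i + 1]]

def segment(v, threshold):
    """Return one (onset, offset) index pair per movement.

    `threshold` is a fraction (0.2 for 20%). Between two consecutive
    minima, the level to breach is v_min + threshold * (v_max - v_min).
    Assumption: v_min is the adjacent minimum on the respective side.
    """
    minima = local_minima(v)
    movements = []
    for start, end in zip(minima, minima[1:]):
        # Sample with maximal tangential velocity between the two minima.
        peak = max(range(start, end + 1), key=lambda i: v[i])
        on_level = v[start] + threshold * (v[peak] - v[start])
        off_level = v[end] + threshold * (v[peak] - v[end])
        # Onset: first sample from the starting minimum at or above the level.
        onset = next(i for i in range(start, peak + 1) if v[i] >= on_level)
        # Offset: last sample before the ending minimum at or above the level.
        offset = next(i for i in range(end, peak - 1, -1) if v[i] >= off_level)
        movements.append((onset, offset))
    return movements
```

With a threshold of 0, the onset and offset coincide with the velocity minima themselves; larger thresholds move both landmarks inward, toward the velocity peak, so that less of the trajectory remains.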
In the context of our ultimate aim, which is dynamical model evaluation, it is important to note that the so-obtained movement data represented motion in three-dimensional space. Yet the formulation of the linear model takes place in a lower-dimensional so-called task space, consisting of only one dimension along the major line of action, referred to as constriction degree (Saltzman, 1986; Saltzman and Munhall, 1989) or reaching axis (Saltzman and Kelso, 1987).2 Thus, to evaluate the performance of the model, we projected the segmented three-dimensional effector trajectories onto their individual constriction degree axes as indicated by the large gray arrows in Fig. 1. In accordance with the framework of Task Dynamics in which the linear model is formulated, we determined the constriction degree axes (i.e., the major lines of action) by the end point-to-end point vectors of the movements (ibidem).
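As an illustration of this projection step, the following sketch maps a segmented three-dimensional trajectory onto the axis spanned by its first and last sample. The function name is hypothetical, and placing the origin of the one-dimensional coordinate at the movement's starting point is an assumption made here for simplicity.

```python
import math

def project_onto_axis(points):
    """Project 3-D positions onto the end point-to-end point axis.

    `points` is a sequence of (x, y, z) tuples for one segmented movement.
    Returns signed distances along the unit vector pointing from the
    first to the last sample (origin at the first sample, by assumption).
    """
    start, end = points[0], points[-1]
    axis = [e - s for s, e in zip(start, end)]
    length = math.sqrt(sum(a * a for a in axis))
    if length == 0.0:
        raise ValueError("movement has zero net displacement")
    unit = [a / length for a in axis]
    return [
        sum((pk - sk) * uk for pk, sk, uk in zip(p, start, unit))
        for p in points
    ]
```

For example, a movement passing through (0, 0, 0), (1, 1, 0), and (2, 0, 0) has its axis along the first coordinate, so the excursion in the second coordinate is discarded by the projection.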
We now turn to evaluate the linear model, $\ddot{x} + b\dot{x} + kx = 0$, where $k$ and $b$ denote the stiffness and damping coefficients, respectively, with an eye to assessing how different segmentation thresholds affect the regression performance of that model. To do so, we first obtained time series for each movement in position $x_i$, velocity $\dot{x}_i$, and acceleration $\ddot{x}_i$, with $i = 1, \ldots, N$, using N = 250 Chebyshev nodes in time. Chebyshev sampling (e.g., Mason and Handscomb, 2002; Rivlin, 2020) is typically used in polynomial regressions to minimize the effect of Runge's phenomenon because of its characteristic, nonuniform spacing of samples with an increased resolution at the edges of the sampling interval. In addition, the use of Chebyshev nodes yields time series approximately evenly spaced along the phase space trajectories of movements to be regressed.
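To make the sampling scheme concrete, the sketch below generates Chebyshev nodes on an arbitrary time interval. The text does not state which kind of Chebyshev nodes was used; the sketch assumes nodes of the first kind (the roots of the Chebyshev polynomial of degree N), and the function name is hypothetical.

```python
import math

def chebyshev_nodes(t_start, t_end, n=250):
    """Chebyshev nodes of the first kind, mapped to [t_start, t_end].

    Nodes are returned in ascending order. They cluster toward both ends
    of the interval, i.e., near the movement onset and offset.
    """
    mid = 0.5 * (t_start + t_end)
    half = 0.5 * (t_end - t_start)
    nodes = [
        mid + half * math.cos((2 * j - 1) * math.pi / (2 * n))
        for j in range(1, n + 1)
    ]
    return nodes[::-1]  # the cosine decreases with j, so reverse
```

Because effector velocity is low near the movement endpoints, the denser sampling in time at the edges is what spreads the samples more evenly along the trajectory in phase space.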
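In applying a conventional statistical measure for the performance of a regression, we computed R-squared values of the regression results via $R^2 = 1 - SS_{\mathrm{res}}/SS_{\mathrm{tot}}$, with the residual sum of squares $SS_{\mathrm{res}} = \sum_i (\ddot{x}_i - \hat{\ddot{x}}_i)^2$ and the total sum of squares $SS_{\mathrm{tot}} = \sum_i (\ddot{x}_i - \bar{\ddot{x}})^2$, where $\hat{\ddot{x}}_i$ was the acceleration predicted by the regression and $\bar{\ddot{x}}$ was the average of the response variable (in our case, acceleration). Grand mean performances of the ten speakers were computed by first averaging the different experimental conditions for each speaker, then averaging across the entire group of speakers.3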
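The regression and its R-squared value can be sketched as follows for a single movement. The sketch assumes an ordinary least-squares fit of acceleration on position and velocity without an intercept term, which is one plausible reading of the linear model rather than a confirmed detail of the analysis; the function name is hypothetical.

```python
def fit_linear_model(x, v, a):
    """Least-squares fit of a = -k*x - b*v; returns (k, b, r_squared).

    x, v, a are equally long sequences of position, velocity, and
    acceleration samples of one movement. No intercept (assumption).
    """
    sxx = sum(xi * xi for xi in x)
    svv = sum(vi * vi for vi in v)
    sxv = sum(xi * vi for xi, vi in zip(x, v))
    sxa = sum(xi * ai for xi, ai in zip(x, a))
    sva = sum(vi * ai for vi, ai in zip(v, a))
    det = sxx * svv - sxv * sxv
    if det == 0.0:
        raise ValueError("position and velocity samples are collinear")
    # Solve the 2x2 normal equations for a ~ c1*x + c2*v.
    c1 = (sxa * svv - sva * sxv) / det
    c2 = (sva * sxx - sxa * sxv) / det
    fitted = [c1 * xi + c2 * vi for xi, vi in zip(x, v)]
    mean_a = sum(a) / len(a)
    ss_res = sum((ai - fi) ** 2 for ai, fi in zip(a, fitted))
    ss_tot = sum((ai - mean_a) ** 2 for ai in a)
    return -c1, -c2, 1.0 - ss_res / ss_tot
```

Data generated exactly by the linear model yield an R-squared of 1 (up to rounding), whereas any additional non-linear contribution to acceleration, such as a cubic term in position, lowers it.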
Let us first evaluate the overall regression performance of the linear model by inspection of the so-obtained grand mean performance values given in Table 1. The table lists average R-squared values for both syllables /ka/ and /ta/, separated by segmentation thresholds (columns) and experimental conditions (metronome rates, rows). It turns out that, across all experimental conditions, regressions of the linear model performed fairly well, with R-squared values in the range of 0.682 to 0.841 for /ka/ and between 0.713 and 0.856 for /ta/ syllables (see the final row in Table 1). Interpretation of these performance ranges as proportions of the explained variance yields ranges of about 68%–84% for /ka/ and 71%–86% for /ta/ syllables, both indicating that a considerable amount of variance in the empirical data is explained by the linear model.
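Table 1. Grand mean R-squared values of the linear model for /ka/ and /ta/ syllables, by segmentation threshold (columns: 0%, 5%, 10%, 15%, 20%, 25%) and metronome rate (rows), with the final row covering all experimental conditions.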
We focus next on the individual segmentation thresholds expressed by the different columns of Table 1. Inspection of the R-squared values in the table makes clear that, for any fixed experimental condition (metronome rate, rows), increases in the threshold value (i.e., moving rightwards in the table) come along with increases in regression performance. In other words, there is a systematic gain in the regression performance of the linear model as a function of increasing the segmentation threshold. In a way that would speak to researchers (ultimately) interested in assessing their dynamical model assumptions given speech movement data, we thus proceeded to evaluate this gain by comparing R-squared values of the different thresholds in relation to the pairwise increments in their values. Specifically, we aimed at providing an answer to the question: to what extent, and with what significance, does increasing the segmentation threshold by 5% (e.g., using 25% instead of 20%, or 20% instead of 15%, etc.), or any other such percentage, affect the performance of the hypothesized underlying model? Figure 2 visualizes the answers to this question by showing the two characteristics of concern as a function of all possible threshold increments in our study (0%, 5%, 10%, 15%, 20%, and 25%, color-coded) as well as all experimental conditions (metronome rate, abscissae). In particular, these two characteristics are, first, the amount of performance gain in percentage points of the explained variance (right ordinates) and, second, the statistical significance of the gain in standard z-score (left ordinates).4 To give an example of the observed effect of thresholding on model performance, let us choose a critical z-score of 1.96 for an alpha-level of 0.05 (i.e., 95% confidence), indicated by the dotted lines in Fig. 2.
For this conventional choice, increments in the segmentation threshold of about 15% or more (e.g., choosing a threshold of 15% over one of 0%, but also 20% over 0%, or 25% over 0%) generally lead to a significant difference in the regression performance. This can be inferred from the figure by inspecting which of the differently color-coded threshold contours tend to reside above the critical z-score, which are the three darkest contour lines associated with the three highest threshold increments of 15%, 20%, and 25%. Likewise, the performance gains of these three highest threshold increments reach substantial values: expressed in proportions of the explained variance, they range from about 3 to 24 percentage points. That is, a regression performance of (for example) 50% explained variance, determined by a 0%-threshold segmentation, can be significantly boosted to about 75% explained variance just by increasing the threshold to 25%. This observation holds true for both syllables (/ka/ and /ta/) and all experimental conditions (metronome rates). Thus, overall, there is no question that changes from a lower to a higher threshold come with an increase in regression performance. Moreover, the larger the threshold increment, the larger the gain in model performance as well as its statistical significance.
Any evaluation of a dynamical model underlying the control and organization of movements requires segmenting a continuous flow of some effector into a sequence of movements. Since the beginnings of the availability of techniques like x-ray, magnetometer, etc., for obtaining quantitative records of articulatory motion, a virtually universal practice in speech is to use a velocity-based threshold which enables the researcher to declare the occurrence of a movement onset or offset as the first time at which the velocity of the relevant effector breaches some threshold percentage of the maximal velocity. Depending on the value of the threshold, more or less of the movement's trajectory is left in for fitting to the model, which is assumed to be underlying. Here, using one dynamical model, the linear model, hypothesized to govern speech, we have made explicit how the choice of this threshold modulates the resulting regression performance.
The systematicity in the patterning of our results (significant gains in the regression performance when using higher and higher segmentation thresholds) demands an explanation. An understanding of this systematicity can be obtained by considering what information in the movement trajectories is eliminated as one progressively increases the segmentation threshold. Figure 3 shows an example of unsegmented data of the tongue body effector during production of three consecutive /ka/ syllables at normal speech rate. The left panel of the figure shows the so-called phase plane of the data, that is, the trajectory of the effector's velocity $\dot{x}$ as a function of its position x. In this representation, positive peak positions of the trajectory correspond to the consonant closure /k/ and negative peaks to the maximum opening for the vowel /a/. The arrows in the panel denote the direction of motion (time) in the plane. Thus, one cycle in the phase plane corresponds to one syllable in the repetition of many, with the closing movements (from /a/ to /k/) in the top and the opening movements (from /k/ to /a/) in the bottom half plane. In the same panel, we superimposed indications of how the trajectory data are affected by the different thresholds: regions shaded in gray cover those parts of the trajectory which are cut off during the segmentation process, with darker shades corresponding to higher thresholds (bigger cuts) and fainter shades to lower thresholds (smaller cuts). This illustration seems to suggest that the different segmentation thresholds can have only a marginal effect on the performance of the linear model: no matter the amount of data excised by a particular threshold, the overall shape of the movements remains intact, which appears to predict comparable regression performances across different thresholds.
However, this picture changes entirely if one considers another diagnostic regularly entertained in the field of dynamics. The right panel of Fig. 3 shows data of the same three /ka/ syllables in the so-called Hooke plane, a representation which renders the effector's acceleration $\ddot{x}$ as a function of its position x. In this panel, as in the phase plane panel before, positive peak positions of the trajectory correspond to the consonant closure /k/ and negative peaks to the maximum opening for the vowel /a/, with arrows indicating the direction of time. Representing the data in the Hooke plane makes apparent a clear non-linear relation between the values of acceleration and position: the shape of the Hooke trajectory is similar to that of the letter N, thus strongly indicating the presence of a cubic term (in x) in the underlying dynamics. [Note that the linear model, $\ddot{x} + b\dot{x} + kx = 0$, by definition, renders acceleration as a linear function in position x.] Comparable N-shaped Hooke trajectories have been repeatedly observed in the past, in speech and non-speech movements, and brought into focus in, for example, Kelso (1985), Kelso (1986), Mottet and Bootsma (1999), Buchanan (2003), and Sorensen and Gafos (2016), all of which eventually considered additional, non-linear terms in the dynamical model governing these movements. Crucially here, the indicators of non-linearity (the two bends in the Hooke trajectory, one close to the /k/ and another one close to the /a/ target), brought out in the Hooke plane but not in the phase plane, are located precisely in those (gray-shaded) regions eliminated by segmentation.
This, in turn, explains the significant gain in the regression performance of the linear model when using higher and higher segmentation thresholds: the higher the threshold (i.e., the darker the shade in the figure), the greater the non-linear portion of the Hooke trajectory excluded by segmentation, yielding an almost exact linear relation between acceleration and position at the highest threshold of 25%. It is precisely at this value of threshold where the linear model reaches maximal performance (cf. Table 1, final column).
Overall, our results and their attendant explanation disclose the extent to which thresholding has an effect on the regression performance of the linear model hypothesized to govern speech in Task Dynamics. The present approach can be applied in principle to other experimental speech paradigms as well as to other models that are sufficiently explicit to allow for evaluation using movement data. In the analysis of our repetitive speech movement data, we applied a segmentation procedure that is conventional for non-repetitive speech; thus, it is likely that the results presented here would generalize to data from non-repetitive tasks. Regarding extending our approach beyond the linear model, other promising models for the control and organization of human movements (e.g., Birkholz et al., 2011; Sorensen and Gafos, 2016, in speech; Beek et al., 1995; Huys et al., 2008; Jirsa and Kelso, 2005; Kay et al., 1987; Mottet and Bootsma, 1999; Schöner, 1990, in non-speech motor control) should be considered. Common to these alternative models is that they extend the linear model equation in Task Dynamics to more advanced equations by inclusion of additional, non-linear terms. For example, in speech, Sorensen and Gafos (2016) propose to add a cubic term in x to the model equation, improving conformity between model predictions and kinematic properties of observed speech movements and resulting in a more accurate modeling of trajectory shapes in the Hooke plane. In general motor control, Schöner (1990) includes several non-linear terms (in both x and $\dot{x}$) which likewise improve model predictions but also allow one to generate movements governed by qualitatively distinct dynamics (not just fixed point but also limit cycle dynamics). It remains to be seen how such revised models that admit non-linearities perform under different segmentation thresholds.
This work was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Project ID 317633480, SFB 1287, Project C04. Data collection has been partially funded by the European Research Council (ERC) under Grant agreement ID 249440.
Though the aforementioned studies are not necessarily concerned with evaluating a specific dynamical model, any study of gestures and their spatio-temporal organization employs movement data assumed to be the surface consequences of an underlying dynamical model giving rise to that data. This means that quantified landmarks (e.g., gestural onset or target) or intervals (e.g., time to peak velocity, closing phase duration), as widely employed in all these studies, depend crucially on the nature of the model assumed to govern the observed movements. Hence, even though not explicitly stated or assessed in these works, the model underlying the quantified aspects of the movements in these studies is essential. It is in this sense that we consider an evaluation of the relation between segmentation practice and underlying model performance to be of key importance and ultimately of relevance to these studies.
The linear model is hypothesized to control change also in another task space dimension (orthogonal to that of the constriction degree as depicted in Fig. 1) and referred to as constriction location. This paper assesses the effect of thresholding only for the constriction degree task space variable.
Averaging correlations is not as straightforward as averaging other statistical quantities. This is because of the skewed sample distribution of R's, generally leading to an underestimation of the ground truth population mean (the higher the R-values, the more so). Taking into account the skewed statistics of correlations, statistical advice recommends using Fisher's z-transformation when averaging correlations (e.g., Corey et al., 1998; Silver and Dunlap, 1987). The z-transformation is given by $z = \operatorname{atanh}(R)$, with a back-transformation of $R = \tanh(z)$ (Fisher, 1921). Following this practice, we transformed R-values of the individual regressions into z-coordinates before computing their average and back-transformed the result to obtain an unbiased estimate of the mean.
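A minimal sketch of this averaging procedure, with a hypothetical function name, reads:

```python
import math

def fisher_mean(r_values):
    """Average correlations via Fisher's z-transformation.

    Each R is mapped to z = atanh(R), the z-values are averaged, and the
    mean is mapped back with tanh. Requires -1 < R < 1 for every value.
    """
    z_values = [math.atanh(r) for r in r_values]
    return math.tanh(sum(z_values) / len(z_values))
```

For two correlations of 0.7 and 0.9, the function returns a value slightly above their arithmetic mean of 0.8, which illustrates the underestimation that plain averaging incurs at high R-values.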
In comparing correlation performance, Fisher (1992) recommends a significance test for the difference of two correlations, R and S, of the same underlying sample size N, by means of the z-score $z = (\operatorname{atanh} R - \operatorname{atanh} S)/\sqrt{2/(N-3)}$. This z-score is approximately normally distributed with unit variance and can be read as the result of a standard z-test: e.g., z-score values larger than 1.96 indicate significance at the 0.05 level (95% confidence), etc. Similarly, using the same transformation, one can obtain an unbiased estimate of the difference between two correlations via $\tanh(\operatorname{atanh} R - \operatorname{atanh} S)$. The square of this expression quantifies the amount of difference in the proportions of the explained variance of the two correlations, that is, the amount of performance gain between R and S.
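The comparison can be sketched as below. The denominator of the z-score is not legible in the source text; the sketch assumes the standard error of the difference of two independent Fisher-transformed correlations with equal sample size, sqrt(2/(N − 3)). The function name is hypothetical.

```python
import math

def compare_correlations(r, s, n):
    """Compare two correlations r and s of equal sample size n.

    Returns (z_score, gain), where z_score tests the difference of the
    Fisher-transformed correlations and gain is the squared
    back-transformed difference. The standard error sqrt(2 / (n - 3))
    is an assumption (textbook form for two independent samples).
    """
    if n <= 3:
        raise ValueError("sample size must exceed 3")
    diff = math.atanh(r) - math.atanh(s)
    z_score = diff / math.sqrt(2.0 / (n - 3))
    gain = math.tanh(diff) ** 2
    return z_score, gain
```

As a usage example, calling `compare_correlations(0.9, 0.8, 250)` gives a z-score above the critical value of 1.96, so under the stated assumption the difference between the two correlations would count as significant at the 0.05 level.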