This article presents a different experiment examining the impact of feedback timing on its perception. Dialog sequences, featuring a main speaker's utterance followed by a listener's feedback, were extracted from spontaneous conversations. The original feedback instances were manipulated to be produced earlier, up to 1.5 s in advance, or to be delayed, up to 2 s later. Participants evaluated the feedback acceptability and engagement level of the listener. The findings reveal that 76% of the time feedback remains acceptable regardless of the delay. However, engagement decreases after a 1-s delay while no consistent effect is observed for feedback anticipation.

During conversations, listeners produce vocal, visual, and multimodal responses or reactions known as feedback, which serve as explicit markers of attention, interest, and understanding (Allwood et al., 1992; Bunt, 2012; Schegloff, 1982) and guide conversational flow (Gandolfi et al., 2023). Various cues within the main speaker's speech, such as intonation patterns, pauses, or eye gazes, may serve as triggers for the listener's feedback. Building on studies of transition-relevance places (Sacks et al., 1974), which denote moments when it is relevant for a speaker to take a turn, Heldner et al. (2013) and Howes and Eshghi (2017) proposed investigating the potential space for feedback realization, termed feedback relevance spaces (or backchannel relevance spaces). However, there is no strict temporal alignment among the main speaker's cues, and the conditions for feedback production are met gradually, which blurs the boundary of the potential feedback position. Although studies have shown that the gap between turns is around 250 ms (Stivers et al., 2009), there has been no investigation into the optimal timing of feedback. Our goal is to determine the temporal limits beyond which feedback is no longer acceptable. Moreover, the acceptability of feedback is not the only factor to consider when evaluating the quality of listening. Feedback also serves as a means to demonstrate engagement (Dermouche and Pelachaud, 2019; Ishii et al., 2013; Leite et al., 2015; Sidner and Dzikovska, 2002). Engagement is characterized as the perceived connection between speakers (Sidner and Dzikovska, 2002). According to Pellet-Rostaing et al. (2023), engagement is defined as a “state of attentional and emotional investment in contributing to the conversation by processing partner's multimodal behaviors and grounding new information.”

We propose to evaluate, for the first time, the optimal window for feedback production, distinguishing between generic feedback (i.e., reactions that show understanding) and specific feedback (i.e., reactions that involve some form of evaluation or display a certain attitude toward the main speaker's discourse; Bavelas et al., 2000). Our study explores the impact of timing on feedback acceptability as well as on the perceived level of engagement of the listener. From spontaneous conversations, we extracted original sequences featuring a main speaker's production and the listener's subsequent feedback. We manipulated these sequences by anticipating and delaying the feedback. Participants were asked to assess the acceptability of the feedback and the level of engagement of the listener. We found that participants generally consider original feedback as the most acceptable response for both generic and specific types. Notably, increased anticipation or delay in feedback leads to a decline in acceptability rates. Additionally, specific feedback conveys higher engagement than generic feedback, whereas delays in feedback negatively impact perceived listener engagement. These findings contribute valuable insights into our understanding of feedback timing and its consequences on listeners' engagement.

Research has extensively explored the functions, types, and forms of feedback (Bavelas et al., 2000; Schegloff, 1982; Tolins and Fox Tree, 2014). Simultaneously, numerous studies have investigated cues in the main speaker's signals that precede feedback, also known as feedback-inviting features (Allwood et al., 2007; Bertrand et al., 2007; Brusco et al., 2020; Gravano and Hirschberg, 2011; Koiso et al., 1998). Feedback-inviting features encompass various modalities, including prosodic features (e.g., a rising intonation followed by a pause), mimo-gestural features (e.g., gazing or nodding), and morpho-syntactic features (e.g., determinant-adverb-noun trigram; Brusco et al., 2020; Gravano and Hirschberg, 2011; Poppe et al., 2010). These features have been leveraged in computational models designed to predict vocal, visual, and multimodal feedback in human-human and human-machine interactions (Cathcart et al., 2003; de Kok et al., 2010; Morency et al., 2010; Mueller et al., 2015; Ozkan and Morency, 2013; Ruede et al., 2019; Truong et al., 2010; Ward and Tsukahara, 2000). At small intervals (usually 40 or 50 ms), these models typically predict whether feedback should be produced based on preceding main speaker features extracted within a given window (e.g., 2 s). Evaluation of continuous feedback predictive models often involves comparing model predictions with observed feedback in corpora. One common approach is to assess whether the prediction falls within a brief window around the observed feedback onset, typically ±500 ms, as proposed in the seminal work by Ward and Tsukahara (2000). This evaluation window (also called margin of error) has been reused and adapted in various studies (Poppe et al., 2010; Ruede et al., 2019; Truong et al., 2010). For a comprehensive review, see de Kok and Heylen (2012). However, it is important to note that, as far as we know, the validity of this 500 ms error window has never been experimentally confirmed, either by Ward and Tsukahara (2000) (see quote on p. 1192, “The decision to tolerate misalignments of up to 500 milliseconds was based on informal judgments of ‘how much earlier or later a back-channel could appear and still sound appropriate’ in various contexts.”) or by the subsequent studies mentioned above. The problem is that the choice of evaluation window can significantly influence the assessment of model performance. A wider evaluation window may capture more predicted feedback instances, consequently inflating the number of correct predictions and, therefore, the overall performance score (e.g., F-score; Boudin et al., 2024). Moreover, most of these models have focused on a limited set of feedback types (e.g., nods or vocalizations). In Boudin et al. (2024), we proposed a predictive model of feedback position that considers two main types of feedback in order to be as comprehensive as possible.
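To make the effect of the evaluation window concrete, the following minimal sketch scores a set of predicted feedback onsets against observed onsets under two margins of error. It is an illustration only: the function name `f_score`, the greedy one-to-one matching rule, and the onset values are assumptions made for this example and are not taken from any of the cited models.

```python
def f_score(predicted, observed, margin_ms):
    """F-score of predicted onsets against observed onsets (all in ms).

    Assumption for this sketch: each observed onset can validate at most
    one prediction, and a prediction is matched greedily to the closest
    still-unmatched observed onset lying within +/- margin_ms.
    """
    unmatched = sorted(observed)
    hits = 0
    for p in sorted(predicted):
        candidates = [o for o in unmatched if abs(p - o) <= margin_ms]
        if candidates:
            closest = min(candidates, key=lambda o: abs(p - o))
            unmatched.remove(closest)
            hits += 1
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(observed) if observed else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical onsets, chosen only to illustrate the effect of the margin.
observed = [1000, 4000, 9000]
predicted = [1300, 4800, 7000, 9100]

narrow = f_score(predicted, observed, margin_ms=500)
wide = f_score(predicted, observed, margin_ms=1000)
print(f"F-score at +/-500 ms: {narrow:.3f}, at +/-1000 ms: {wide:.3f}")
```

With the same predictions, the wider margin accepts the prediction at 4800 ms as a hit and the score rises, which is the inflation described above.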

Following Bavelas et al. (2000), we distinguished between generic and specific feedback. Generic feedback expresses understanding. It plays a role in encouraging the main speaker to continue his/her speech and is conveyed by different components such as nods, vocalizations (“mhm, yeah, ok”), and/or smiles. In contrast, specific feedback deals with the semantic and pragmatic context of the main speaker's discourse, providing a form of assessment and displaying various attitudes (e.g., happiness, surprise, etc.). Different feedback components can be used, such as eyebrow movements, laughter, lexicalization, etc. Specific feedback is highly context dependent, involving the evaluation of the semantic and pragmatic content of the main speaker's discourse, as opposed to generic feedback, which may simply demonstrate an update of the common ground or show understanding and can fit into a multitude of contexts (Tolins and Fox Tree, 2014). In this study, we introduce an original behavioral experiment aimed at gaining a deeper understanding of the variability in feedback production timing. To achieve this goal, short sequences from the Cheese! (Priego-Valverde et al., 2020) and PACO (Amoyal et al., 2020) corpora were extracted to create our utterance-feedback material. Through video editing, the original feedback, generic and specific, was artificially anticipated (up to 1500 ms) or delayed (up to 2000 ms) in steps of 500 ms. Participants evaluated the response produced by the listener. We test four hypotheses. The first hypothesis is that feedback can be delayed or anticipated by more than 500 ms and remain acceptable. The second hypothesis is that the maximum acceptable delay is longer for generic feedback than for specific feedback. The third hypothesis is that the perceived engagement of the listener gradually decreases with delay until the feedback is ultimately rejected.
For example, feedback with a delay of 1000 ms may still be considered acceptable in the conversation, but the listener's perceived level of engagement decreases significantly. Delayed feedback can imply disinterest or distraction, giving the impression of reduced engagement from the listener. The fourth and final hypothesis posits that when feedback is anticipated, the listener will be perceived as equally engaged as with the original feedback. Indeed, we believe that, as a result of predictive mechanisms (Gandolfi et al., 2023; Pickering and Garrod, 2021), feedback can be anticipated and produced with a short reaction time relative to the feedback target without being misperceived, demonstrating a significant investment in the interaction and strong collaboration.

A total of 128 participants took part in the experiment [mean age = 24 years old, standard deviation (sd) = 4.6, min = 18, max = 49], of whom 108 identified themselves as women and 20 identified themselves as men. All participants reported being native speakers of French. All were recruited from various student Facebook groups (Meta Platforms, Inc., Cambridge, MA) in different regions of France (Strasbourg, Bordeaux, Lyon, Toulouse, Aix-en-Provence, and Montpellier) and through the mailing lists of Laboratoire Parole et Langage. The experiment was conducted online via the FindingFive platform, and participants received a compensation of 7€ on PayPal (PayPal, Inc., San Jose, CA). One participant was excluded because their response time exceeded 30 min.

For this experiment, conversation excerpts from the Cheese! (Priego-Valverde et al., 2020) and PACO (Amoyal et al., 2020) corpora were used to construct the stimuli. These corpora involved participants seated face-to-face in a soundproof room, engaging in free conversation for 15 min. Each participant was recorded by a front-facing camera. We used ten dyads, selecting sequences consisting of an utterance from one interlocutor followed by feedback from the other. We used Sony Vegas Pro software (Sony Creative Software, Inc., New York, NY) to artificially anticipate or delay the feedback relative to its original production. We tested eight temporal conditions (separated by 500 ms steps): three feedback anticipation steps (−1500, −1000, and −500 ms), four feedback delay steps (+500, +1000, +1500, and +2000 ms), and the original time of production. Feedback was thus delayed by up to 2000 ms and anticipated by up to 1500 ms. We chose not to go beyond 1500 ms of anticipation as, typically, beyond this threshold, the feedback is produced either simultaneously with or before the main speaker's utterance. To test generic and specific feedback, we selected 32 feedback instances per type. Our final set of stimuli is composed of 512 video clips (64 original sequences, each manipulated in every temporal condition) with an average duration of 5.66 s (sd = 1.85, min = 1, max = 12). Among specific feedback, we exclusively retained the most prevalent type observed in our dataset: positive-new feedback, which responds to a positive stance expressed by the main speaker and pertains to newly introduced information. The selection of utterances consistently ensured syntactic saturation. To streamline our experimental design, we opted to avoid testing various combinations of verbal, gestural, and multimodal feedback, which could introduce unnecessary complexity. It is anticipated that perceived engagement may vary between unimodal verbal, gestural, and multimodal feedback.
Furthermore, multimodal feedback is prevalent in our dataset, constituting 68.55% of feedback instances among the 26 annotated participants (Boudin et al., 2024). Therefore, only multimodal feedback instances were selected for generic and specific types. Examples of utterance-feedback sequences are provided in Table 1. To avoid speaker and dyad effects, we created six or seven stimuli per dyad. We balanced speakers' roles (main speaker vs listener) and the types of feedback (generic vs specific) within each dyad. Each speaker provided both types of feedback and took on the main speaker role at least once. The main speaker always appeared on the left of the screen and the listener on the right. In a few cases, when feedback was anticipated or delayed, non-feedback-related gestural or verbal components (e.g., the listener's previous turn) could become visible in the video. Through video editing techniques, we ensured that these extraneous components were removed from the final stimuli. We accomplished this by replacing them with sequences in which the listener remained still and silent (either by duplicating a video frame multiple times or by inserting a sequence of the same duration without any gestures or speech).

Table 1.

Examples of utterance-feedback sequences. The first three lines show examples of generic (Gen) feedback while the next three lines display examples of specific (Spe) feedback. The “ ” symbol indicates a rising intonation or rising eyebrows.

| Main speaker speech | Feedback |
| --- | --- |
| Non moi j'avais fait un master de linguistique un master recherche | Gen: “Ah d'accord” + nod |
| No I'd done a master's in linguistics a master in research | “Oh ok” |
| Hum à Paris y'a un truc qui s'appelle la Cité de la musique | Gen: “Ouais” + nod |
| Hum In Paris there's a thing called the Cité de la musique | “Yeah” |
| J'attends de finir l'année pour partir à l'armée | Spe: “Allez” + eyebrows + smile |
| I'm waiting until the end of the year to go to the army | “Really” |
| Ca fait 6 h par jour si tu t'inscris à tous les créneaux donc c'est pas mal quoi | Spe: “Ah ouais c'est cool hein” + nod + eyebrows |
| That's 6 hours a day if you subscribe to all the slots so it's not bad at all | “Oh yeah it's cool huh” |

Eight experimental lists were constructed such that a participant evaluated all sequences and all temporal conditions (−1500 ms, −1000 ms, −500 ms, 0 ms, +500 ms, +1000 ms, +1500 ms, and +2000 ms), but never saw the same sequence in more than one temporal condition. Each list comprised 64 stimuli, divided into two blocks of 32 each. Each block consisted of 16 generic and 16 specific stimuli, with each type presented twice in every temporal condition. In summary, each participant evaluated a total of 64 items, including 32 distinct generic feedback and 32 distinct specific feedback instances. A participant evaluated each temporal condition eight times, including four times for each type of feedback.
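The counterbalancing described above corresponds to a Latin-square rotation of timing conditions across lists. The sketch below shows one way such lists could be generated. The rotation rule, the sequence indexing, and the function name `build_lists` are assumptions made for illustration; the actual assignment of sequences to lists is not specified in the text.

```python
from collections import Counter

OFFSETS_MS = [-1500, -1000, -500, 0, 500, 1000, 1500, 2000]
TYPES = ("generic", "specific")
N_PER_TYPE = 32   # distinct sequences per feedback type
BLOCK_SIZE = 16   # sequences of each type per block


def build_lists():
    """Return 8 lists of (type, sequence, offset_ms, block) tuples.

    Assumed rule: the timing condition of a sequence is rotated by one
    step from one list to the next, so that across the 8 lists every
    sequence appears once in every condition.
    """
    n_cond = len(OFFSETS_MS)
    lists = []
    for list_id in range(n_cond):
        items = []
        for ftype in TYPES:
            for seq in range(N_PER_TYPE):
                offset = OFFSETS_MS[(seq + list_id) % n_cond]
                block = seq // BLOCK_SIZE + 1
                items.append((ftype, seq, offset, block))
        lists.append(items)
    return lists


lists = build_lists()
# Number of items per (type, timing) cell in the first list.
per_cell = Counter((ftype, offset) for ftype, _, offset, _ in lists[0])
print(len(lists), len(lists[0]), set(per_cell.values()))
```

Under this rotation, each list contains 64 items, each timing condition occurs four times per feedback type (twice per block), and no sequence is repeated within a list.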

Participants were first informed of their rights and signed a consent form. They were given a personal link and password to access the experiment on FindingFive (FindingFiveTeam, 2023) from their home computer. Each participant was informed that the purpose of the study is to better understand spontaneous conversation. They were instructed that they would be watching short video clips of conversations between two interlocutors, where the person on the left was speaking while the person on the right was listening. They were asked to focus on the person on the right of the screen and answer two questions for each video: (1) Does the reaction of the participant on the right of the screen seem strange to you?—yes, the reaction seems strange, inappropriate, or unnatural or no, the reaction seems normal and appropriate. (2) Does the participant on the right of the screen seem involved/interested by the conversation?—1, not at all involved/interested; 2, not very involved/interested; 3, somewhat involved/interested; 4, interested/involved; and 5, very involved/interested. They were asked to respond as quickly and accurately as possible. After reading the instructions, participants began the experiment with a training block containing 11 trials not used in the blocks. The stimuli were separated by 1 s of white screen. The first question appeared on the screen 300 ms after the video ended and the second question appeared 300 ms after the participant answered the first question. The experiment was divided into 2 blocks each containing 32 trials. Blocks were separated by a maximum break of 2 min. The order of the blocks remained consistent, whereas the presentation of stimuli within each block was randomized. At the end of the blocks, we asked them to make comments on the experience if desired and answer two questions to find out if they perceived the editing of the videos. 
The first question was “Did you feel that some of the videos were buggy?” and the second was “Most of the videos you just saw were edited. Did you realize that?”. The average total duration of the experiment was 18.72 min (sd = 3.72, min = 12.81, max = 30.13).

The duration and reaction times of the trials were automatically recorded by FindingFive. After manually reviewing all responses and reaction times, trials with abnormal duration were removed (greater than 110 000 ms). In a second step, all trials whose duration deviated by more than 2.5σ from the logarithmic mean reaction time were removed. Thus, 2% of the data was deleted. From the responses to the first question, participants who always responded in the same way (always “no” or always “yes”) were removed. Seven participants were removed on this basis. In the same vein, participants who showed an overly small variability in their responses were discarded. In practice, participants whose standard deviation of responses was too small (at the 2.5σ level) compared to the mean standard deviation over the participants were excluded. This criterion affected only one participant. Finally, participants who noticed the video editing were removed, corresponding to 49 participants. In total, 57 participants were removed from the analyses reported below.
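The two-step trial trimming can be sketched as follows. This is a minimal illustration under stated assumptions: the 2.5σ criterion is applied on both tails of the log-transformed durations, and the mean and standard deviation are pooled over all trials passed in. The text does not say whether trimming was two-sided or computed per participant, and the function name `trim_trials` and the example durations are hypothetical.

```python
import math
import statistics


def trim_trials(durations_ms, hard_cap_ms=110_000, n_sigma=2.5):
    """Drop abnormal trials, then trim outliers on the log scale.

    Step 1: remove trials longer than hard_cap_ms.
    Step 2: remove trials whose log-duration lies more than n_sigma
            standard deviations from the mean log-duration
            (assumed two-sided and pooled over the input).
    """
    kept = [d for d in durations_ms if d <= hard_cap_ms]
    logs = [math.log(d) for d in kept]
    mu = statistics.mean(logs)
    sigma = statistics.stdev(logs)
    return [d for d, log_d in zip(kept, logs) if abs(log_d - mu) <= n_sigma * sigma]


# Hypothetical durations in ms: 20 ordinary trials, one very slow trial,
# and one trial above the hard cap.
durations = [1800, 1900, 2000, 2100, 2200] * 4 + [60_000, 200_000]
cleaned = trim_trials(durations)
print(len(durations), "->", len(cleaned))
```

Working on the log scale reduces the influence of the long right tail that reaction-time distributions typically show, so that a single slow trial does not inflate the threshold for all others.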

All of the following analyses were performed with RStudio (R version 4.2.2; RStudio Team, 2020). Figure 1 (available in the supplementary material) shows the average proportion of “yes” responses (i.e., the reaction was judged strange, hereafter “weird”) to the question “Does the reaction of the participant on the right of the screen seem strange to you?” with generic feedback in yellow and specific feedback in blue. These average proportions and associated 1σ error bars are obtained from the distribution of the individual proportions of each participant for a given timing condition and feedback type. The original feedback timing (0 ms) obtained a proportion of “weird” responses of 9.27% for specific feedback and 10.92% for generic feedback. The proportion of feedback rated as weird increases as the feedback is anticipated or delayed. However, even for the minimum and maximum timing, the proportion of feedback rated as weird never exceeds 30%. These trends, visible in Fig. 1 (available in the supplementary material), need to be confirmed by assessing their statistical significance. To assess the increase in the proportion of weird responses as the timing condition moves away from zero, we analysed the responses to the first question by applying a generalized linear mixed-effects model (glmer function from the R lme4 package; Bates et al., 2015) using the binomial family and the bobyqa optimizer. The original feedback timing was defined as the reference level and all other timings were defined as contrast levels. The variable type was treated as a categorical predictor in the model (using dummy coding with generic type as reference level). The type of feedback and its interaction with timing were defined as fixed effects. The model also incorporates participants as random effects. Results of the model are presented in Table 2. The model revealed a significant effect of the feedback timing conditions beyond ±500 ms on the perceived feedback acceptability.
Additionally, we found an interaction effect between type and timing at −500 ms.
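As a reading aid for Table 2, the short sketch below recomputes the Wald z-value and two-sided p-value from a reported estimate and standard error, and converts the estimate into an odds ratio. The helper `wald_summary` is introduced here for illustration only and is not part of lme4. The odds-ratio interpretation assumes that the modelled outcome is an “acceptable” judgement, as the caption of Table 2 suggests, and that estimates are on the log-odds scale.

```python
import math


def wald_summary(estimate, se):
    """Return (z, two-sided p, odds ratio) for a logit-scale coefficient."""
    z = estimate / se
    # Two-sided p-value under the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2))
    odds_ratio = math.exp(estimate)
    return z, p, odds_ratio


# Estimate and SE of the +2000 ms condition as reported in Table 2.
z, p, odds_ratio = wald_summary(-1.23980, 0.24900)
print(f"z = {z:.3f}, p = {p:.2e}, odds ratio = {odds_ratio:.2f}")
```

For the +2000 ms condition, this gives z ≈ −4.98 and p ≈ 6.4e-07, in line with the table, and an odds ratio of about 0.29 relative to the original timing for the reference (generic) feedback type.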

Figure 2 (available in the supplementary material) presents the average response score to the question “Does the participant on the right of the screen seem involved/interested by the conversation?” depending on the timing condition and the type of feedback. The score varies from 1, corresponding to the response “not at all involved/interested,” to 5 for the “very involved/interested” response. As in Sec. 4.1, we ran a mixed model with feedback type and timing as predictor variables. Given that the response variable exhibits a Gaussian distribution, we opted for a linear mixed-effects model using the lmer R function. The results of the model are presented in Table 3. As a first result, the model reveals an effect of feedback type on the perceived level of engagement. Concerning the relationship between engagement and timing, listeners' perceived engagement is significantly reduced for delays from 1000 to 2000 ms. In terms of anticipated feedback, the perceived engagement of listeners is not significantly affected, except for an anticipation of −1000 ms. Regarding interaction effects, a significant effect was found between feedback delayed by 1500 ms and type. These findings suggest that the timing and type of feedback production significantly influence the perceived engagement of the listeners.

Table 2.

Estimate, standard error (SE), z-value, and p-value obtained by the generalized linear mixed-effects model, which was conducted to test the impact of feedback timing and feedback type on the feedback acceptability rates. The significance levels (p-levels) are defined as follows: “***” indicates a p-value below 0.001, “**” indicates a p-value below 0.01, and “*” indicates a p-value below 0.05.

| Fixed effects | Estimate | SE | z-value | p-value | p-level |
| --- | --- | --- | --- | --- | --- |
| Type at 0 ms | 0.24610 | 0.29518 | 0.834 | 0.404429 | |
| −1500 ms | −1.15416 | 0.25260 | −4.569 | 4.90e-06 | *** |
| −1000 ms | −0.76916 | 0.25712 | −2.991 | 0.002777 | ** |
| −500 ms | 0.03817 | 0.28364 | 0.135 | 0.892962 | |
| +500 ms | −0.33270 | 0.26886 | −1.237 | 0.215924 | |
| +1000 ms | −0.91186 | 0.25468 | −3.580 | 0.000343 | *** |
| +1500 ms | −0.84594 | 0.25265 | −3.348 | 0.000813 | *** |
| +2000 ms | −1.23980 | 0.24900 | −4.979 | 6.39e-07 | *** |
| Interaction terms | | | | | |
| −1500:type | −0.04394 | 0.36805 | −0.119 | 0.904966 | |
| −1000:type | −0.18809 | 0.37420 | −0.503 | 0.615207 | |
| −500:type | −0.82825 | 0.39691 | −2.087 | 0.036913 | * |
| +500:type | −0.27516 | 0.38952 | −0.706 | 0.479940 | |
| +1000:type | 0.09963 | 0.37583 | 0.265 | 0.790944 | |
| +1500:type | −0.46310 | 0.36645 | −1.264 | 0.206326 | |
| +2000:type | 0.06239 | 0.36607 | 0.170 | 0.864680 | |
Table 3.

Estimate, SE, t-value, and p-value obtained by the linear mixed-effects model, which was conducted to test the impact of feedback timing and feedback type on the perceived level of engagement. The significance levels (p-levels) are defined as follows: “***” indicates a p-value below 0.001, “**” indicates a p-value below 0.01, and “*” indicates a p-value below 0.05.

| Fixed effects | Estimate | SE | t-value | p-value | p-level |
| --- | --- | --- | --- | --- | --- |
| Type at 0 ms | 0.59782 | 0.08805 | 6.789 | 1.28e-11 | *** |
| −1500 ms | −0.09810 | 0.08933 | −1.098 | 0.27219 | |
| −1000 ms | −0.20102 | 0.08797 | −2.285 | 0.02235 | * |
| −500 ms | 0.02010 | 0.08757 | 0.230 | 0.81849 | |
| +500 ms | −0.02182 | 0.08757 | −0.249 | 0.80328 | |
| +1000 ms | −0.21736 | 0.08806 | −2.468 | 0.01361 | * |
| +1500 ms | −0.17995 | 0.08662 | −2.077 | 0.03782 | * |
| +2000 ms | −0.35954 | 0.08813 | −4.079 | 4.59e-05 | *** |
| Interaction terms | | | | | |
| −1500:type | −0.05266 | 0.12583 | −0.419 | 0.67560 | |
| −1000:type | 0.17878 | 0.12436 | 1.438 | 0.15059 | |
| −500:type | −0.14228 | 0.12441 | −1.144 | 0.25281 | |
| +500:type | −0.10531 | 0.12424 | −0.848 | 0.39668 | |
| +1000:type | 0.15441 | 0.12469 | −1.238 | 0.21563 | |
| +1500:type | −0.32113 | 0.12352 | −2.600 | 0.00936 | ** |
| +2000:type | −0.01447 | 0.12504 | −0.116 | 0.90789 | |

In this study, our objective was to investigate the optimal window for the occurrence of conversational feedback, which has never been experimentally validated. For this purpose, we designed an online behavioral experiment in which the time taken for feedback to appear was manipulated. Participants were asked to evaluate the level of feedback acceptability (Q1) and the level of engagement of the listener (Q2). Participants were unaware of the timing manipulation and of the precise purpose of this experiment.

Original generic feedback was judged acceptable 89.08% of the time, whereas original specific feedback was judged acceptable 90.73% of the time. The findings suggest that timing shifts between −500 and +500 ms are not perceived by participants. However, we found that the acceptability rate decreased significantly when feedback was anticipated or delayed by 1 s or more. For the maximum feedback anticipation of −1.5 s, generic feedback is judged acceptable 74.76% of the time and specific feedback is judged acceptable 78.05% of the time. For the maximum feedback delay of +2 s, generic feedback is judged acceptable 73.01% of the time and specific feedback is judged acceptable 78.29% of the time. Therefore, the unacceptability of these shifts in feedback production is not clear cut, as the rejection rate remains low even in the most extreme cases. These results tend to support our first hypothesis that feedback can be anticipated and delayed by more than 500 ms without becoming unacceptable. This also seems to validate the notion that feedback should be apprehended within a time window rather than at a specific point in time. However, it is essential to note that this temporal apprehension is not arbitrary as this window of occurrence depends on necessary conditions (feedback-inviting features). Finally, the analysis of responses to the feedback acceptability question with respect to timing conditions does not reveal consistent differences between the two types of feedback. However, interaction effects were observed: the acceptability at −500 ms and perceived engagement at +1500 ms varied significantly between the two types of feedback. Despite these findings, the present study does not allow us to validate our second hypothesis, which posited that the window of acceptability for feedback realization is larger for generic feedback than for specific feedback.

The results of the second question about listener engagement show slightly different outcomes. Specifically, the model identifies a significant difference in the level of engagement between generic and specific feedback, where specific feedback elicits higher perceived engagement. The original generic feedback obtained an average engagement score of 3.30, which decreased to 2.91 with a delay of +2 s. In contrast, original specific feedback obtained an average score of 3.89, reaching its lowest point at a delay of +1.5 s with a score of 3.39 (with no significant difference observed compared to the 2-s delay). This finding is unsurprising given that specific feedback typically includes more salient components such as laughter, eyebrow movement, and larger intonational span. However, it offers valuable insights into how listener engagement is expressed. Additionally, the perceived level of listeners' engagement significantly decreases when feedback is delayed by 1 s or more, supporting our third hypothesis. As a third observation, we noticed that listeners' engagement is not significantly affected by anticipated feedback, except for the timing of −1.0 s. However, this is not a consistent effect as the more extreme anticipation of −1.5 s does not show a significant effect. This finding provides support for our fourth hypothesis, which states that anticipated feedback does not impact perceived engagement, at least not consistently.

The first contribution of this work lies in the design of an online experiment with a third-party analysis, which is necessary because it is not possible to ask a person to anticipate or delay naturally his/her feedback production. This method is relevant to study the impact of different conversational behaviors that are not consciously manageable in spontaneous and natural conversations. Nevertheless, given the baseline error rate of 10% identified in Sec. 4.1, it might be worth considering using an equal number of original sequences and manipulated sequences for subsequent studies.

The experiment presented in this paper serves a dual purpose. First, it seeks to deepen our understanding of how the timing of feedback delivery impacts its acceptability in conversation and the listener's perceived level of engagement. Second, we aim to validate the window of evaluation (margin of error) used to assess the performance of feedback predictive models. In the existing literature, it has been claimed that an acceptable delay for generating feedback typically falls within approximately 500 ms relative to the onset time of the original feedback produced by a listener (Ward and Tsukahara, 2000). However, various studies have employed different windows. For example, Ruede et al. (2019) used a window of 1 s after the feedback onset based on the assumption that anticipated feedback may not be acceptable, whereas delayed feedback is acceptable with up to a 1000-ms delay. Mueller et al. (2015) used a window of ±200 ms. Nevertheless, these windows are based on arbitrary choices. With the insights gained from these results, our goal is to propose an objective metric that provides a more nuanced evaluation of predictions, considering the temporal distance of the prediction from the feedback onset. This approach aims to refine the assessment beyond binary classifications of good or bad predictions. Our results suggest that participants treat the issues of acceptability and engagement as distinct concepts in conversation. In essence, evaluating acceptability requires semantic interpretation and inference from context, whereas engagement is influenced directly by the type of feedback and its position. Participants indicated that it was easier to answer the second question. We plan to conduct two follow-up experiments. The first will reproduce this procedure by not manipulating the timing of the feedback but only its type and content.
The second experiment will evaluate the cumulative effect of timing on participant responses, and instead of displaying only one feedback instance at a time, participants will be presented with longer sequences containing several feedback instances. This approach will provide more context to the participants. Finally, it is important to note that this experiment exclusively focuses on manipulation of multimodal feedback. However, investigating the individual roles of visual and vocal modalities is essential for gaining a deeper understanding of their respective contributions to feedback perception. Furthermore, the exclusion of a significant number of participants who noticed the video manipulation could call for caution in interpreting our results. However, we would like to nuance this point by specifying that none of the participants reported perceiving any delays or anticipations in the feedback (which was our main variable of interest). Instead, some participants who responded “yes” to the question related to video editing believed that the editing was performed by compiling individuals who never interacted together, possibly resulting from the most anticipated or delayed feedback. A potential approach to mitigate this problem could be to test only audio feedback or adopt a cumulative experimental design as mentioned above.
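
To make the role of the evaluation window concrete, the following minimal Python sketch scores the same set of predicted feedback onsets under the three windows cited above. It is an illustration only, not the metric we intend to propose: the function name `score_predictions`, the greedy one-to-one matching rule, the symmetric reading of the 500-ms window, and the onset times are all assumptions introduced for this example.

```python
from typing import Dict, Sequence, Tuple


def score_predictions(
    predicted: Sequence[float],
    reference: Sequence[float],
    before: float,
    after: float,
) -> Tuple[float, float, float]:
    """Score predicted feedback onsets against reference onsets (seconds).

    A prediction p counts as a hit for a reference onset r when
    r - before <= p <= r + after. Matching is greedy and one-to-one:
    each reference onset takes the closest still-unused prediction
    inside its window (an assumption made for this sketch).
    Returns (precision, recall, F1).
    """
    remaining = sorted(predicted)
    hits = 0
    for r in sorted(reference):
        candidates = [p for p in remaining if r - before <= p <= r + after]
        if candidates:
            best = min(candidates, key=lambda p: abs(p - r))
            remaining.remove(best)
            hits += 1
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(reference) if reference else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Windows as (seconds allowed before onset, seconds allowed after onset).
# The symmetric 0.5-s reading of the 500-ms window is an assumption.
WINDOWS: Dict[str, Tuple[float, float]] = {
    "plus/minus 200 ms": (0.2, 0.2),
    "0 to +1 s": (0.0, 1.0),
    "plus/minus 500 ms (assumed symmetric)": (0.5, 0.5),
}

if __name__ == "__main__":
    # Hypothetical onset times, invented for illustration only.
    reference_onsets = [2.0, 5.0, 9.0]
    predicted_onsets = [2.15, 4.6, 9.8, 12.0]
    for name, (before, after) in WINDOWS.items():
        p, r, f1 = score_predictions(
            predicted_onsets, reference_onsets, before, after
        )
        print(f"{name}: precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

With these invented onsets, the same predictions receive different scores depending on the window alone, which is why an arbitrary window choice makes reported performance difficult to compare across studies.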

See supplementary material for figures illustrating the average responses per condition obtained for Question 1 (feedback acceptability) and Question 2 (listener engagement).

This work, carried out within the Institute Convergence ILCB (ANR-16-CONV-0002) and Laboratoire Parole et Langage (UMR 7309), has benefited from support from the French government. Our warmest thanks go to Sophie Dufour for her help with design and analysis and Amandine Michelas for her valuable advice. A.B. sincerely thanks Morgane Peirolo and Lydia Dorokhova for their help with the FindingFive platform. We would also like to thank the reviewers for their extremely valuable suggestions, which have helped to considerably improve the reporting of our results.

The authors have no conflicts to disclose.

The data that support the findings of this study are available from the corresponding author upon reasonable request.

1. Allwood, J., Cerrato, L., Jokinen, K., Navarretta, C., and Paggio, P. (2007). “The MUMIN coding scheme for the annotation of feedback, turn management and sequencing phenomena,” Lang. Resour. Eval. 41, 273–287.
2. Allwood, J., Nivre, J., and Ahlsén, E. (1992). “On the semantics and pragmatics of linguistic feedback,” J. Semantics 9(1), 1–26.
3. Amoyal, M., Priego-Valverde, B., and Rauzy, S. (2020). “Paco: A corpus to analyze the impact of common ground in spontaneous face-to-face interaction,” in Proceedings of the 12th Language Resources and Evaluation Conference, May 11–16, Marseille, France (European Language Resources Association, Paris), pp. 628–633.
4. Bates, D., Mächler, M., Bolker, B., and Walker, S. (2015). “Fitting linear mixed-effects models using lme4,” J. Stat. Softw. 67(1), 1–48.
5. Bavelas, J. B., Coates, L., and Johnson, T. (2000). “Listeners as co-narrators,” J. Pers. Social Psychol. 79(6), 941–952.
6. Bertrand, R., Ferré, G., Blache, P., Espesser, R., and Rauzy, S. (2007). “Backchannels revisited from a multimodal perspective,” in Auditory-Visual Speech Processing (ISCA, Hilvarenbeek), paper P09.
7. Boudin, A., Bertrand, R., Rauzy, S., Ochs, M., and Blache, P. (2024). “A multimodal model for predicting feedback position and type during conversation,” Speech Commun. 159, 103066.
8. Brusco, P., Vidal, J., Beňuš, Š., and Gravano, A. (2020). “A cross-linguistic analysis of the temporal dynamics of turn-taking cues using machine learning as a descriptive tool,” Speech Commun. 125, 24–40.
9. Bunt, H. (2012). “The semantics of feedback,” in Proceedings of the 16th Workshop on the Semantics and Pragmatics of Dialogue (SEMDIAL 2012), September 19–21, Paris, France, edited by S. Brown-Schmidt, J. Ginzburg, and S. Larsson (University Paris-Diderot, Paris Sorbonne-Cite), pp. 118–127.
10. Cathcart, N., Carletta, J., and Klein, E. (2003). “A shallow model of backchannel continuers in spoken dialogue,” in Proceedings of the Tenth Conference on European Chapter of the Association for Computational Linguistics—EACL '03, April 12–17, Budapest, Hungary (Association for Computational Linguistics, Stroudsburg, PA), Vol. 1, pp. 51–58.
11. de Kok, I., and Heylen, D. (2012). “A survey on evaluation metrics for backchannel prediction models,” in Proceedings of the Interdisciplinary Workshop on Feedback Behaviors in Dialog, September 7, Stevenson, WA (University of Texas, El Paso, TX), pp. 15–18.
12. de Kok, I., Ozkan, D., Heylen, D., and Morency, L.-P. (2010). “Learning and evaluating response prediction models using parallel listener consensus,” in International Conference on Multimodal Interfaces and the Workshop on Machine Learning for Multimodal Interaction, ICMI-MLMI '10, Chania, Crete, Greece, October 20–22, 2008 (Association for Computing Machinery, New York).
13. Dermouche, S., and Pelachaud, C. (2019). “Engagement modeling in dyadic interaction,” in 2019 International Conference on Multimodal Interaction, ICMI '19, Suzhou, China, October 14–18 (Association for Computing Machinery, New York), pp. 440–445.
14. FindingFive Team (2023). “FindingFive: An online platform for creating, running, and managing your experiments,” available at https://www.findingfive.com/ (Last viewed March 6, 2023).
15. Gandolfi, G., Pickering, M. J., and Garrod, S. (2023). “Mechanisms of alignment: Shared control, social cognition and metacognition,” Phil. Trans. R. Soc. B 378(1870), 20210362.
16. Gravano, A., and Hirschberg, J. (2011). “Turn-taking cues in task-oriented dialogue,” Comput. Speech Lang. 25(3), 601–634.
17. Heldner, M., Hjalmarsson, A., and Edlund, J. (2013). “Backchannel relevance spaces,” in Nordic Prosody: Proceedings of XIth Conference, Tartu 2012, Tartu, Estonia, August 15–17 (Peter Lang Publishing Group, Frankfurt, Germany), pp. 137–146.
18. Howes, C., and Eshghi, A. (2017). “Feedback relevance spaces: The organisation of increments in conversation,” in Proceedings of the 12th International Conference on Computational Semantics (IWCS)—Short papers, Montpellier, France, September 19–22, edited by C. Gardent and C. Retoré (Curran Associates, Inc., New York), available at https://aclanthology.org/W17-6913 (Last viewed June 25, 2024).
19. Ishii, R., Nakano, Y. I., and Nishida, T. (2013). “Gaze awareness in conversational agents: Estimating a user's conversational engagement from eye gaze,” ACM Trans. Interact. Intell. Syst. 3(2), 1–25.
20. Koiso, H., Horiuchi, Y., Tutiya, S., Ichikawa, A., and Den, Y. (1998). “An analysis of turn-taking and backchannels based on prosodic and syntactic features in Japanese map task dialogs,” Lang. Speech 41(3-4), 295–321.
21. Leite, I., McCoy, M., Ullman, D., Salomons, N., and Scassellati, B. (2015). “Comparing models of disengagement in individual and group interactions,” in Proceedings of the Tenth Annual ACM/IEEE International Conference on Human-Robot Interaction, HRI '15, March 2–5, Portland, OR (Association for Computing Machinery, New York), pp. 99–105.
22. Morency, L.-P., Kok, I., and Gratch, J. (2010). “A probabilistic multimodal approach for predicting listener backchannels,” Auton. Agent Multi-Agent Syst. 20(1), 70–84.
23. Mueller, M., Leuschner, D., Briem, L., Schmidt, M., Kilgour, K., Stueker, S., and Waibel, A. (2015). “Using neural networks for data-driven backchannel prediction: A survey on input features and training techniques,” in Human-Computer Interaction: Interaction Technologies, edited by M. Kurosu (Springer International, Cham), pp. 329–340.
24. Ozkan, D., and Morency, L.-P. (2013). “Latent mixture of discriminative experts,” IEEE Trans. Multimedia 15(2), 326–338.
25. Pellet-Rostaing, A., Bertrand, R., Boudin, A., Rauzy, S., and Blache, P. (2023). “A multimodal approach for modeling engagement in conversation,” Front. Comput. Sci. 5, 1062342.
26. Pickering, M. J., and Garrod, S. (2021). Understanding Dialogue: Language Use and Social Interaction (Cambridge University Press, Cambridge, UK).
27. Poppe, R., Truong, K. P., Reidsma, D., and Heylen, D. (2010). “Backchannel strategies for artificial listeners,” in Proceedings of the 10th International Conference on Intelligent Virtual Agents, IVA'10, Philadelphia, PA, September 20–22, 2008 (Springer, Berlin), pp. 146–158.
28. Priego-Valverde, B., Bigi, B., and Amoyal, M. (2020). “‘Cheese!’: A corpus of face-to-face French interactions. A case study for analyzing smiling and conversational humor,” in Proceedings of The 12th Language Resources and Evaluation Conference, May 11–16, Marseille, France (European Language Resources Association, Paris), pp. 467–475.
29. RStudio Team (2020). “RStudio: Integrated development environment for R” (RStudio, PBC, Boston, MA), available at http://www.rstudio.com/ (Last viewed May 25, 2024).
30. Ruede, R., Müller, M., Stüker, S., and Waibel, A. (2019). Yeah, Right, Uh-Huh: A Deep Learning Backchannel Predictor (Springer International, Cham), pp. 247–258.
31. Sacks, H., Schegloff, E. A., and Jefferson, G. (1974). “A simplest systematics for the organization of turn-taking for conversation,” Language 50(4), 696–735.
32. Schegloff, E. A. (1982). “Discourse as an interactional achievement: Some uses of ‘uh huh’ and other things that come between sentences,” in Analyzing Discourse: Text and Talk, edited by D. Tannen (Georgetown University Press, Washington, DC), Vol. 71, pp. 71–93.
33. Sidner, C., and Dzikovska, M. (2002). “Human-robot interaction: Engagement between humans and robots for hosting activities,” in Proceedings. Fourth IEEE International Conference on Multimodal Interfaces, 16 October, Pittsburgh, PA (IEEE, New York), pp. 123–128.
34. Stivers, T., Enfield, N. J., Brown, P., Englert, C., Hayashi, M., Heinemann, T., Hoymann, G., Rossano, F., de Ruiter, J. P., Yoon, K.-E., and Levinson, S. C. (2009). “Universals and cultural variation in turn-taking in conversation,” Proc. Natl. Acad. Sci. U. S. A. 106(26), 10587–10592.
35. Tolins, J., and Fox Tree, J. E. (2014). “Addressee backchannels steer narrative development,” J. Pragmatics 70, 152–164.
36. Truong, K., Poppe, R., and Heylen, D. (2010). “A rule-based backchannel prediction model using pitch and pause information,” in Proceedings of Interspeech 2010, September 26–30, Makuhari, Chiba, Japan (International Speech Communication Association), pp. 3058–3061.
37. Ward, N., and Tsukahara, W. (2000). “Prosodic features which cue back-channel responses in English and Japanese,” J. Pragmatics 32(8), 1177–1207.
