A wide variety of research and clinical assessments involve presenting speech stimuli in the presence of some kind of noise. Here, I selectively review two theoretical perspectives and discuss ways in which these perspectives may help researchers understand the consequences for listeners of adding noise to a speech signal. I argue that adding noise changes more about the listening task than merely making the signal more difficult to perceive. To fully understand the effects of an added noise on speech perception, we must consider not just how much the noise affects task difficulty, but also how it affects all of the systems involved in understanding speech: increasing message uncertainty, modifying attentional demand, altering affective response, and changing motivation to perform the task.
I. INTRODUCTION
In the study of speech perception, when there is a need to make a listening task more difficult, a nearly ubiquitous method among researchers is simply to add noise. The added sound may be white noise (Sommers, 1996), speech-shaped noise (Banks et al., 2015; Holt et al., 2011; Surprenant and Watson, 2001), environmental noise (Klatte et al., 2007; Meyer et al., 2013), or "cafeteria" noise (Tajima et al., 1997), but it could also be multitalker babble of various sorts (Imai et al., 2005; Sommers et al., 2005) or even the speech of a single competing talker (Freyman et al., 2001). In these examples, we already see some potential problems with the use of the term "noise," as it seems to cover a huge variety of acoustically distinct sounds. From a casual, mostly vernacular perspective, we might expect noise to have certain specific acoustic properties: perhaps high intensity, a broad frequency distribution, and little or no harmonicity. It might also have some affective qualities: It is probably annoying or aversive, and it certainly interferes with whatever it is we're trying to do. In speech perception research, there might be an even more general understanding of noise as any added sound that distorts, obscures, or physically alters a target signal. The ubiquitous use of the term "masker" for the added signal when discussing experimental methods reflects this perspective clearly: a noise is then any signal that masks or obscures properties of a target sound. In addition, however, we can include added sounds that interfere with speech perception even when they do not mask a target sound (Freyman et al., 2001; Ihlefeld and Shinn-Cunningham, 2008a, 2008b), as well as cases in which two sounds that provide the same amount of masking nevertheless have different effects on listeners (Helfer et al., 2010; Tun et al., 2002). Thus, there are many things that might constitute noise. Similarly, there are many reasons to add noise to a signal.
In many cases, the goal of studies involving added noise is to investigate how different properties of the noise itself may make speech perception or other behaviors more difficult. Such studies include research investigating informational masking (Dai et al., 2017; Francis et al., 2016; Freyman et al., 2001, 2007; Helfer and Freyman, 2014; Kidd et al., 2016; Kidd and Colburn, 2017; Koelewijn et al., 2015) and the role of glimpsing in perception of speech in noise (Culling and Stone, 2017; Stone and Canavan, 2016; Summers and Molis, 2004), studies of listening effort (Francis et al., 2016, 2021; Krueger et al., 2017; Ohlenforst et al., 2018; Sarampalis et al., 2009), simulations of hearing loss in typically hearing listeners (Bernarding et al., 2011; Humes et al., 1987; Richie et al., 2005; Sommers and Humes, 1993), and studies of noise sensitivity and annoyance, especially in realistic auditory environments (Lee et al., 2017; Lindvall and Västfjäll, 2013; Love et al., 2021).
In general, however, the primary consideration when adding noise seems to be that doing so makes the task more difficult, even when the research question is focused on determining how such a noise makes the task more difficult. Often, noise level, or signal-to-noise ratio (SNR), is used as a proxy for difficulty, on the assumption that the main determinant of difficulty is how much the noise masks the target sound. Here, I argue that adding noise means more than masking and much more than just increasing task difficulty for listeners. I do not mean to suggest that adding sound to a target speech signal does not make it more difficult to understand. Nor do I mean to suggest that one should not add noise to target speech or that the nature of the noise should not matter. Rather, I wish to highlight some of the ways in which adding sound to a target signal can fundamentally change how listeners process what they are hearing, in ways that are not fully captured by considering only changes to the difficulty of the listening task itself, and to suggest that these factors require us to think more deeply about how we interpret the results of studies on the perception of speech in noise.
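To make concrete what this proxy does and does not control, consider how speech-in-noise stimuli are typically constructed: the masker is simply rescaled so that its average power stands in a fixed ratio to that of the target. The sketch below (in Python, with illustrative names; a minimal version of the standard procedure, not any particular lab's code) shows that a chosen SNR fixes exactly one number about the auditory scene, leaving everything else discussed in this paper, such as the masker's objecthood, pleasantness, and meaning, unconstrained.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db, eps=1e-12):
    """Scale `noise` so the speech-to-noise power ratio of the mixture
    equals `snr_db`, then return speech + scaled noise."""
    # Match lengths by tiling/trimming the noise (an illustrative choice).
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[:len(speech)]

    p_speech = np.mean(speech ** 2)        # mean power of the target
    p_noise = np.mean(noise ** 2) + eps    # mean power of the masker
    # Required gain g: SNR_dB = 10*log10(p_speech / (g**2 * p_noise))
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + gain * noise

# Example: a white-noise stand-in for a masker, mixed at -3 dB SNR.
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)   # placeholder for a recorded sentence
noise = rng.standard_normal(16000)
mixture = mix_at_snr(speech, noise, snr_db=-3.0)
```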
II. BASIC MEANINGS OF “NOISE”
Historically speaking, the use of the term noise in studies of speech perception seems to build on two potentially distinct concepts: noise as a source of uncertainty and noise as an unwanted sound. The more vernacular conceptualization of noise as a masking sound draws on both of these traditions to some degree, or, perhaps more properly, both traditions emphasize different aspects of the idea of noise as a masker (maskers obscure or distort the signal, making it less certain, but they are also annoying and unwanted). In any case, modern researchers likely do not draw these distinctions nearly as sharply as I will here for illustrative purposes, but the distinction is a useful starting point for reconsidering how we currently think about noise.
A. Noise is a source of increased uncertainty
Within the study of speech perception, the historically dominant conceptualization of noise is found in the information-theoretic sense of a source of increased uncertainty regarding the identity of a transmitted signal (Fano, 1950; Shannon, 1949). Researchers initially working in this area were largely concerned with the problem of determining how reliable an information transmission channel could be, with particular focus on radio and telephony (Pierce, 1973). Thus, in the most basic sense, the “noise” that Shannon and colleagues were concerned with was not even necessarily acoustic. Indeed, given that the transmission channel was initially thought of as perhaps “… a pair of wires, a coaxial cable, a band of radio frequencies, etc.” (Shannon, 1949), the location of noise in the transmission channel as shown in Fig. 1 [after Shannon (1949), Fig. 1] suggests that acoustic noise was not necessarily a major consideration at the time. The addition of noise in this sense, then, refers simply to an increase in uncertainty of signal reception of some sort and could in principle include any source of uncertainty, even occurring elsewhere in the transmission chain, from source to receiver (inclusive).
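In modern notation (my gloss, not Shannon's original formulation), this conception of noise can be stated compactly: whatever leaves the receiver uncertain about the transmitted message after the received signal has been observed is noise, wherever it arises.

```latex
% Let X be the transmitted message and Y the received signal.
% The receiver's residual uncertainty about X is the equivocation
\[
  H(X \mid Y) = -\sum_{x,\,y} p(x,y)\,\log_2 p(x \mid y),
\]
% and the information actually conveyed is the mutual information
\[
  I(X;Y) = H(X) - H(X \mid Y).
\]
% "Noise" is then anything that raises H(X|Y): nothing in these
% definitions requires that the uncertainty arise in the transmission
% channel rather than at the source (talker) or receiver (listener).
```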
1. Additional uncertainty elsewhere in the speech chain
Perhaps due to the focus on transmission problems, there does not seem to have been strong consideration of what it might mean for uncertainty to arise at the transmitter (talker) or receiver level (i.e., entirely within the ear and brain of the listener). Although Shannon (1949) does mention the possibility of destination noise, early theorists seem not to have been as concerned as modern researchers with how perceptual uncertainty might arise from properties of the talker (information source) or within the mind or brain of a listener (receiver). For example, although Peterson (1952) considers the effects of transmission channel noise on the perception of the most basic acoustic properties of speech, his characterization of the process for perception of higher-level properties (phonetic features, phonemes, words, etc.) makes no mention of the possibility of increased uncertainty in these receiver-based domains (Peterson, 1952). Nevertheless, there is nothing in the information-theoretic conceptualization of noise that requires that uncertainty (noise) arise in the transmission channel alone. I highlight this point because a considerable variety of recent studies have begun to investigate challenges to speech perception that arise either prior to or later than the transmission channel, and there seems to be some trend toward treating these seemingly disparate challenges as representative of a single overarching problem (i.e., adverse conditions) (Mattys et al., 2012). This broader focus makes very good sense, as these conditions could all quite plausibly be considered simply sources of increased uncertainty, also known as noise.
For example, properties of a particular talker's speech may increase the uncertainty of message reception, as in the case of a talker with dysarthria or other motor-speech difficulties (Bent et al., 2016; Borrie et al., 2017) or a talker who is using an unfamiliar (to the listener) dialect (Clopper, 2021) or accent (Baese-Berk et al., 2020; Van Engen and Peelle, 2014). All of these serve to increase the listener's uncertainty about what speech sounds the talker is intending to produce.
Similarly, much of what makes speech perception difficult for individuals with hearing impairment could be considered in terms of increased receiver-located uncertainty. Changes in peripheral auditory function due to age (presbycusis), such as attenuation of higher frequencies, loudness compression, broadening of critical bands, and temporal jitter (among other consequences), all increase uncertainty in speech perception, though probably to different degrees at different stages of auditory/cognitive processing or with respect to different aspects of speech understanding (Gordon-Salant et al., 2006; Humes and Dubno, 2010; Pichora-Fuller and Souza, 2003). Thus, we might also consider uncertainty arising at either the talker or listener stages as a kind of "added noise" in the information-theoretic sense, even though it does not typically involve the addition of acoustic noise as such.
Taking this perspective would enable us to shift the emphasis of investigation from the origin of the challenge (i.e., source, transmission channel, or receiver) to the cognitive/linguistic/neural mechanisms that are affected by the challenge. For example, a number of recent studies have attempted to explicitly distinguish mechanisms of speech perception that might be differently engaged in response to different sources of interference, especially talker-related vs transmission channel-related (acoustic) uncertainty (Adank et al., 2012; Alain et al., 2018; Francis et al., 2021; McLaughlin et al., 2018). Such studies have found significant and potentially meaningful differences in patterns of listeners' responses. However, as I have argued previously (Francis et al., 2021) and will argue in more detail later in this paper, although these findings suggest that listeners engage different processing mechanisms to cope with each challenge (Calandruccio et al., 2010; McLaughlin et al., 2018; Viswanathan, 2021), these differences may depend more on which stage(s) of mental processing are affected by the increased uncertainty than on the location or origin of the interference in the transmission channel.
In summary, there are significant benefits to considering within a single theoretical framework such seemingly disparate challenges to speech understanding as those arising at the source, transmission channel, and receiver of the spoken message, and the information-theoretic concept of noise as increased uncertainty in message recognition serves that purpose well. Researchers investigating the mechanisms that listeners use to cope with uncertainty have identified similarities as well as differences in the cognitive and neural processes involved in dealing with various sorts of "adverse conditions" (Borrie et al., 2017; Francis et al., 2021; Guediche et al., 2014; Mattys et al., 2012; McLaughlin et al., 2018), suggesting that listeners process the consequences of some sources of increased uncertainty independently of the stage of transmission at which they arise. In Sec. IV, I will argue that what likely matters most is not the nature of the source of uncertainty itself, but rather how any given source of uncertainty interferes with the mental processes that must be engaged to understand a spoken message. First, however, we must consider the other commonly understood meaning of the term "noise."
B. Noise is unwanted sound
In addition to the information-theoretic concept of noise as uncertainty, one can also find definitions of noise as unwanted sound (Basner et al., 2014; Fink, 2019; Richards, 1935). This characterization becomes particularly useful and important when relating the study of speech perception in noise to other considerations of the effects of noise on humans (Babisch et al., 2013; Basner et al., 2014; Kryter, 1972, 2013; Pretzsch et al., 2021). Here, I will argue that adopting this conceptualization of noise ultimately highlights the importance of a wide range of factors that must be considered carefully when thinking about how listeners accomplish speech perception in noise.
To begin with, though, it is important to note that the characterization of noise as unwanted sound is fully compatible with the information-theoretic one precisely (though perhaps only) in the idealized experimental case in which a single listener is instructed to attend to and repeat as accurately as possible the speech of a single talker and in which the listener is assumed to be trying to do so to the best of their ability. In such a context, the goal of listening is solely to accurately receive and repeat a specific message, so any sound whose presence increases the uncertainty of that reception is, by definition, unwanted: it conflicts with the only goal the listener has (correctly repeating the target speech). However, as I will argue in Sec. II B 1, "unwantedness" and "interference" can be considered distinctly, and doing so sets the stage for developing a much richer understanding of what is involved in listening to speech in the presence of noise.
1. Interference is unwanted
I start by separating the question of how a signal might interfere with reception of a target signal from the question of why a signal might be unwanted. The first is, obviously, one of the primary concerns of research involving speech perception in the presence of other sounds, and considerable work has already been done to understand the different ways in which properties of non-target sounds interfere with the perception of target speech. Beyond the obvious effects of acoustic interference ("energetic masking"), there is also "informational masking," which, if anything, is an even more complex and less completely understood phenomenon (Durlach et al., 2003; Kidd et al., 2008; Kidd and Colburn, 2017; Shinn-Cunningham, 2008; Watson, 2005; Yost, 2007). Rather than going into further detail on this very large and sophisticated body of work, I will simply adopt the premise that there are many ways for one sound to interfere with the reception of another, and, as we shall see, some of these different manners of interference may have different consequences for listeners' responses to the presence of interfering sounds.
With respect to why a signal might be unwanted, the general assumption seems to be that, if a sound interferes with accomplishing a listening task, then it will be unwanted. There are two reasonably common versions of this: one in which the sound actually reduces the likelihood of accomplishing the task (i.e., interference affects performance) and one in which the sound simply forces the listener to apply more mental computation to accomplish the goal (i.e., interference affects effort). This is an important distinction in recent research on listening effort, in which researchers have determined that even when two listening conditions do not differentially affect performance (i.e., target signal reception), they can nevertheless differ strongly in how they affect listeners' assessment of how hard it is to accomplish the listening task. Such different conditions can also induce different patterns of physiological markers associated with attention and effort (Alhanbali et al., 2019; Brown and Strand, 2019; Francis and Love, 2020; McGarrigle et al., 2014; Peelle, 2018; Sarampalis et al., 2009). For present purposes, though, it is sufficient to note that the unwantedness of a noise is broadly presumed to be directly (though perhaps not linearly) related to the degree to which it interferes with recognizing the target speech. While this is, again, consistent with the information-theoretic idea of noise as a source of uncertainty and is therefore quite reasonable in the context of the default experimental paradigm in which the listener's goal is assumed to be entirely and only to accurately recognize the target speech, it is by no means the only reason that a sound may be unwanted. And, as I will describe below, the fact that a sound may be unwanted has consequences for how listeners behave, whether or not it introduces additional uncertainty in target speech recognition.
2. Unpleasant sounds are unwanted
In addition to being unwanted because it causes interference, a sound may be unwanted because it is intrinsically unpleasant. A wide variety of sounds are considered unpleasant by most people, though even here there is considerable variability across individuals (Cox, 2008). Psychoacoustic and noise quality research suggests that some acoustic properties may contribute to a sound's unpleasantness. For example, "griding" sounds (a metal tool scraped across slate, for example, or fingernails on a chalkboard) tend to have predominant energy in the 2500–5500 Hz range with a low (1–16 Hz) fluctuation (Halpern et al., 1986; Kumar et al., 2008). These sounds may be specifically unpleasant because they engage primitive alerting systems (Kumar et al., 2008) or evoke unpleasant haptic sensations (Cox, 2008; Ely, 1975). More generally, though, unpleasant sounds tend to have higher loudness (Zwicker and Fastl, 1999), sharpness (high frequency emphasis; von Bismarck, 1974), roughness (amplitude and frequency fluctuations in the 15–300 Hz range; Daniel and Weber, 1997; Vitale et al., 2020), fluctuation (quasi-periodic variability of amplitude and frequency in the 1–20 Hz range; Zwicker and Fastl, 1999), and tonality (related to harmonic-to-noise ratio; Lee et al., 2017).
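These dimensions are routinely combined into single-number annoyance indices in the sound quality literature. One widely cited composite is the psychoacoustic annoyance model of Zwicker and Fastl (1999); the sketch below gives its usual textbook form as I understand it (treat the exact constants and weights as an assumption to check against the original), and it assumes the component metrics have already been computed by standard psychoacoustic analyses.

```python
import math

def psychoacoustic_annoyance(n5_sone, sharpness_acum,
                             fluctuation_vacil, roughness_asper):
    """Zwicker-style psychoacoustic annoyance: a composite of percentile
    loudness (N5, sone), sharpness (acum), fluctuation strength (vacil),
    and roughness (asper). This function only combines already-computed
    metrics; it does not analyze audio."""
    # Sharpness contributes only above 1.75 acum.
    w_s = 0.0
    if sharpness_acum > 1.75:
        w_s = 0.25 * (sharpness_acum - 1.75) * math.log10(n5_sone + 10.0)
    # Fluctuation strength and roughness share one loudness-weighted term.
    w_fr = (2.18 / n5_sone ** 0.4) * (0.4 * fluctuation_vacil
                                      + 0.6 * roughness_asper)
    return n5_sone * (1.0 + math.sqrt(w_s ** 2 + w_fr ** 2))

# A loud, sharp, rough sound scores far higher than a soft, smooth one.
print(psychoacoustic_annoyance(30.0, 2.5, 0.5, 1.0))  # roughly 46
print(psychoacoustic_annoyance(5.0, 1.0, 0.1, 0.1))   # roughly 5.6
```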
In summary, the unpleasantness of many kinds of sounds has been attributed to basic psychophysiological responses either to their acoustic properties themselves (Kumar et al., 2008) or to physical properties of actions associated with the generation of those acoustic properties (Cox, 2008), as well as to cultural and learned associations (Cox, 2008; Kumar et al., 2008). Importantly for present purposes, however, these sounds are considered to be unpleasant (and hence unwanted) irrespective of the degree to which they might interfere with a speech perception task. While this supposition warrants future research, especially because many of the acoustic properties identified as more annoying may also make a sound more similar to speech, it suggests that unpleasantness alone might cause sounds to interfere with a listening task even when there is no direct interference with the acoustic signal of the target speech. That is, rather than becoming unwanted because they interfere with achieving a goal, intrinsically unpleasant sounds might interfere with achieving a goal precisely because they are unpleasant and therefore distracting. Although, to my knowledge, this hypothesis has yet to be tested directly, such distraction-based unwantedness may also occur with other sounds (De Coensel et al., 2009), not just intrinsically unpleasant ones (e.g., the barely heard sounds of an exciting movie playing in a neighboring apartment while one is trying to study), as I will discuss in Sec. III.
III. ADDING SOUND CHANGES THE AUDITORY SCENE
Here, I will use the term "auditory scene" to refer to the sonic environment a listener is exposed to in a given instance, including signals that are to be attended to (targets) and those that should be ignored (maskers or distractors). In principle, this term is similar to that of "soundscape," though soundscapes are often considered over longer spans of time (minutes or hours) and often include some sense of social and environmental context as well as more subjective qualities than are intended to be included here (Botteldooren et al., 2018). In this section, I will argue that, once we introduce into the auditory scene a sound that can be distinguished from the target speech, many additional factors must be considered before we can draw strong conclusions about how the noise is interfering with recognition of the speech or why the noise may be unwanted.
A. The ecology of audition
The auditory system, like all sensory systems, fulfills a specific range of ecological functions (Ramsier and Rauschecker, 2017). Of particular note for present purposes is the function of early warning (alerting). In humans and other primates, audition is sensitive over relatively large distances (unlike taste, touch, and internally directed senses, such as interoception and proprioception) and functions relatively effectively over an extremely wide angular range (unlike vision). It is also good at identifying the spatial origin of a signal (unlike olfaction), though not as good as vision, and it is always operating, even during sleep (like olfaction, but unlike vision). It thus arguably serves as an important "early warning system," sensitive to the occurrence of events in the environment over large distances and in essentially all directions (Murphy et al., 2017), evaluating their relevance for action (Asutay and Västfjäll, 2016), and guiding action appropriately (Arnott and Alain, 2011). Therefore, when we change the properties of an auditory scene, we must take into account how an auditory/perceptual system that is optimized for acquiring information about the world will treat that new scene. When we add noise to a target signal, we cannot simply assume that a listener will treat the new auditory scene as consisting of the same sound source, now inexplicably more difficult to recognize.
B. Noise as an auditory object
The concept of an auditory object has been discussed at length in previous work (Griffiths and Warren, 2004; Kubovy and Van Valkenburg, 2001; Shinn-Cunningham et al., 2017; Shinn-Cunningham, 2008). For present purposes, we can take this term to mean something like the mental representation of a set of acoustic phenomena that share sufficient spectrotemporal properties to be attributable to a single distal source and toward which attention may be directed (Alain and Arnott, 2000; Bizley and Cohen, 2013; Shinn-Cunningham et al., 2017). While the formation of an auditory object likely involves both pre-attentive and attentive processes (Backer and Alain, 2012; Fritz et al., 2007; Shamma et al., 2011; Shinn-Cunningham, 2008), once formed, auditory objects likely must be attended to some degree, at least while spare capacity remains available (Fairnie et al., 2016; Francis, 2010; Lavie, 2005) and perhaps even irrespective of the availability of perceptual capacity (Murphy et al., 2017). Therefore, an added noise can, simply by virtue of being represented as a distinct auditory object, cause the listener to engage cognitively with the auditory scene in a fundamentally different manner than when the noise is absent. To the extent that an acoustic signal added to that of a target speech stream is spectrotemporally structured in a way that allows it to be treated as a separate auditory object, it becomes something to which attention may, or even must, be paid.
1. Auditory object formation
In the analysis of auditory scenes (Bregman, 1990), a variety of principles appear to govern, or at least strongly guide, the process by which the spectrotemporal properties that reach the ear are grouped together into separate objects (i.e., are attributed to separate distal sources) [see Shinn-Cunningham et al. (2017) for a brief summary]. The formation of auditory objects is not necessarily perfect or complete, such that, for example, some spectrotemporal properties may be perceived both as distinct objects and as parts of a more complex object (e.g., the phenomenon of "duplex perception"; cf. Ciocca and Bregman, 1989), and it is possible that some spectrotemporal properties may remain unassigned to specific objects [see discussion by Shinn-Cunningham et al. (2017) and Shinn-Cunningham (2007)]. However, when noise is present in a signal, two crucial questions arise: (1) Is the added noise structured in such a way as to engage the pre-attentive and attentional mechanisms underlying auditory object formation, and, if so, (2) how might the listener respond to the presence of an additional auditory object that is not the target?
To address the first question, I consider work on the perception of auditory scenes (e.g., Bregman, 1990; McDermott, 2009; Shamma et al., 2011; Shinn-Cunningham et al., 2017) as well as more recent work characterizing the perception of auditory objects from an information-theoretic perspective (Kluender et al., 2019; Stilp et al., 2018). A thorough discussion of auditory scene analysis is beyond the scope of the present article, but a few basic generalizations are still possible. Frequency components that are temporally similar (i.e., have similar onset and/or offset timing or are amplitude modulated at the same rate), that are related in frequency (e.g., harmonics of the same fundamental), or that are perceived as sharing a spatial location all tend to be grouped perceptually and perceived as a single stream or object (Darwin, 1997; Shamma et al., 2011; Shinn-Cunningham et al., 2017). In general, then, acoustic phenomena that are more similar to one another and/or that are more predictable from one another will tend to be perceived as belonging to the same auditory object.
The process of auditory object formation can therefore be seen as a particular instance of predictive processing, the extraction of regularities in the environment in the service of generating a hypothesis about the current and future state of the sensory environment (Kluender et al., 2019; Stilp et al., 2018; Winkler et al., 2009). As described by Stilp et al. (2018), from this perspective we can consider the perception of speech in noise as a task requiring the detection of auditory objects with greater spectrotemporal regularity or structure (speech) within a context of sound(s) with a lower degree of structure (noise). To consider a few commonly used types of interfering signals: white noise has minimal structure in either time or frequency, a single competing talker has essentially the same degree of spectrotemporal structure as the target, and other sorts of maskers (e.g., multitalker babble, speech-shaped and -modulated noise) fall between the two [see discussion by Stilp (2022)]. This means, however, that the degree to which a noise is perceived as a distinct auditory object depends on its spectrotemporal regularity. If we want to minimize the "objectness" of a noise, we must use sound that is high in entropy (low in information, or spectrotemporal regularity) (Kluender et al., 2019). In Sec. IV, I address the question of how listeners might respond to a task in which there is more than one object detectable in the auditory scene, but it is important to note that future research should also consider the consequences of listening to poorly formed or incomplete objects.
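As a rough illustration of how such structure might be quantified, the sketch below computes a normalized short-time spectral entropy: white noise scores near the maximum, while a strongly harmonic signal scores much lower. This is only a crude, illustrative proxy (the function and parameter names are mine) for the richer entropy-based measures discussed by Kluender et al. (2019) and Stilp et al. (2018).

```python
import numpy as np

def mean_spectral_entropy(x, frame_len=512, hop=256):
    """Average normalized spectral entropy across short-time frames:
    values near 1 indicate a flat, unstructured spectrum (white-noise-like);
    lower values indicate more spectral regularity."""
    window = np.hanning(frame_len)
    entropies = []
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = x[start:start + frame_len] * window
        power = np.abs(np.fft.rfft(frame)) ** 2
        p = power / (power.sum() + 1e-12)          # spectral "pmf"
        h = -np.sum(p * np.log2(p + 1e-12))        # Shannon entropy (bits)
        entropies.append(h / np.log2(len(p)))      # normalize to [0, 1]
    return float(np.mean(entropies))

fs = 16000
t = np.arange(fs) / fs
rng = np.random.default_rng(1)
white = rng.standard_normal(fs)                    # no structure
harmonic = sum(np.sin(2 * np.pi * 200 * k * t) for k in range(1, 6))
print(mean_spectral_entropy(white))     # high, near 1
print(mean_spectral_entropy(harmonic))  # much lower: strong regularity
```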
IV. LISTENERS ARE NOT PASSIVE RECEIVERS
To understand all of the effects of an additional auditory object on listeners, it is first necessary to consider what we know about how the auditory scene is processed by listeners. A useful figure derived from work by Hari Bharadwaj (2022) is shown in Fig. 2, elaborating on work by Shinn-Cunningham et al. (2017). This schematic shows a series of stages of processing from initial acquisition of the signal at the cochlea (bottom) through to some abstract cognitive-linguistic stage of message understanding (top). The earlier (lower) stages correspond roughly to those referred to by Shinn-Cunningham (2008) as (1) object formation and (2) object selection, while the topmost stage represents (3) the "coping processes" or "repair strategies." A similar mapping may be made to the processes described by Edwards (2016) based on the model of Rönnberg et al. (2013), with stages (1) and (2) corresponding to mechanisms of external attention and stage (3) to internal attention [see discussion by Francis and Love (2020) and Strauss and Francis (2017); see also Heald and Nusbaum (2014)].
What is important here is that there exist multiple stages of processing at which the presence of an added noise may interact with speech perception, that these interactions may have different consequences for perception, and that those consequences are what determine measurable outcomes [i.e., recognition performance, various indices of listening effort, etc. (Francis et al., 2016; Francis and Love, 2020; Francis and Oliver, 2018)].
For example, in the prototypical case in which a steady-state broadband speech-shaped noise is presented from the same loudspeaker as the target signal, the noise may not be sufficiently distinguishable from the target to be treated as a separate auditory object in the object formation stage (point 1). Indeed, it seems likely that even perceived spatial separation alone is not sufficient to enable the formation of a completely distinct auditory object out of broadband noise (Freyman et al., 1999, 2001). If an added noise exhibits properties that help listeners segregate it from the auditory scene as a distinct object, for example, an identifiable onset or offset, spatial origin, or amplitude modulation (Bregman, 1990; Darwin, 1997; Shinn-Cunningham et al., 2017), or perhaps lower entropy than the background it appears in (Stilp et al., 2018), then it will be more likely to be treated as a separate object. The more object-like a concurrent sound is, the more likely it is to enable low-level processes of object formation and selection to improve message reception by allowing the listener to segregate or "stream off" (Bregman, 1990) the target speech from the noise more effectively (Gordon, 2000). If, however, the added noise cannot be successfully segregated from the target, as seems likely in the case of broadband steady-state noise, the representation of the target signal will contain a mixture of target and masker, leading to greater uncertainty in recognizing the linguistic elements of the signal at point 3. Such a mixed signal would support a larger number of possible, plausibly valid interpretations, which in turn would increase cognitive demand, reduce recognition accuracy, and/or introduce greater need for the application of postperceptual repair strategies to correct misperceptions [see discussion by Francis and Love (2020)].
In contrast, if the added noise can be successfully streamed off, as for example in the case of a target speech signal produced by one talker presented in the context of masking speech produced by a single, different, easily distinguished talker emanating from a distinct spatial location, the listener may be quite successful at separating the two sound sources into distinct auditory objects. The formation of separate auditory objects for target and masker in turn would facilitate the application of selective attention to suppress the representation of the distracting speech, causing the target signal to be represented relatively unambiguously and therefore recognized with little error or need for repair strategies. The relative difficulty that listeners have with treating target and masker signals as separate auditory objects likely underlies much of the distinction between energetic and informational masking (Brungart, 2005). In addition, however, because in this case the masking speech constitutes an intelligible speech signal, it plausibly demands the allocation of some attention as well (Lavie, 2005; Wöstmann and Obleser, 2016) and is likely to be processed and remembered if it is sufficiently clear (Wöstmann and Obleser, 2016) and the listener has sufficient attentional capacity (Tun et al., 2002).
Thus, a well-segregated single-talker masker and a poorly segregated broadband noise masker are both likely to interfere with speech perception, but in very different ways and with potentially dissociable consequences for listeners. The broadband noise may primarily increase demand for the application of postperceptual repair strategies, while highly intelligible competing speech may place greater demand on the allocation of spatial selective attention and possibly on late-occurring linguistic/memory processes [see, for example, discussion of two types of attentional effort by Strauss and Francis (2017) and discussion of post-perceptual processing by Winn and Teece (2021)].
Considering the consequences of different additive noise scenarios can also explain seemingly paradoxical cases, such as the benefit to performance that is sometimes observed with added low-level noise (Zeng et al., 2000) and noise masking in the workplace (Haapakangas et al., 2011). In the first case, benefits seem to accrue because of nonlinear changes to low-level neural responsivity in the presence of noisy input (Alain et al., 2009; Moss et al., 2004), and such effects are even seen when noise is added in other perceptual modalities, including somatosensation (Wuehr et al., 2018), though it is possible that these benefits only accrue in threshold or near-threshold perception. In the second case, the presence of noise seems to help to obscure task-irrelevant acoustic properties of the distractors, which may consist of footsteps, others' conversations, and even telephones. All of these may constitute "notice events" in the sense of De Coensel et al. (2009) [see Love (2018)], yet they are either not recognized as a separate auditory object or are recognized as benign or even necessary ones [Haapakangas et al., 2011; Loewen and Suedfeld, 1992; Veitch et al., 2002; though see Lenne (2020)]. By raising the noise floor without enabling the perception of the noise as a distinct object, a noise masker may reduce the ability of task-irrelevant sounds to capture attention in a task-detrimental manner without incurring an emotional response to the presence of the masker itself as an unwanted sound [though cf. Lenne (2020)].
A. Attention and effort
To understand how listeners might respond to the presence of a non-target auditory object during a speech perception task, we must consider the listener as a fully functioning organism. I have already introduced this idea in discussing the ecology of audition, but here I extend that discussion to consider the function of perception more broadly. From an ethological perspective, the role of a nervous system is to facilitate action, to enable the organism to efficiently move toward desirable conditions and away from undesirable ones (Lang, 2000; Yost, 2007). The role of attention, then, is to orient toward properties of the environment that are relevant for making decisions about action (Bradley, 2009; Raymond, 2009). Thus, the fact that listeners can, or even must, direct attention toward auditory objects suggests that such objects engage relatively high-level decision-making processes, processes that are integrated with affective mechanisms associated with emotion and motivation.
Attention is typically conceived of as a limited capacity mechanism for selecting phenomena for further cognitive processing (Driver, 2001; Kahneman, 1973), modulating their relevance in pursuit of a particular goal (Bradley, 2009; Eckert et al., 2016; Raymond, 2009), and thereby contributing to how well the selected information can be processed (Chun et al., 2011; Wild et al., 2012). As such, the typical assumption is that exerting attention is perceived as effortful (Kahneman, 1973). In fact, it is quite likely that many contexts in which attention is engaged are not perceived as effortful [for example, puzzles, games, and activities involving flow (Nakamura and Csikszentmihalyi, 2016); see Bruya and Tang (2018) and Inzlicht et al. (2018) for discussion], and this has important implications for understanding the role of motivation in speech perception in adverse conditions, as discussed below. However, within the context of the kind of high-effort, high-attention tasks typically employed in speech perception research, increasing the complexity of the auditory scene increases demand on attentional processes, as indicated by decreased susceptibility to interference from auditory distractors (Bertoli and Bodmer, 2014; Fairnie et al., 2016; Francis, 2010) and by physiological markers associated with effort, especially the pupil dilation response.
The pupil dilation response is a momentary, illumination-independent, event-related increase in the diameter of the pupil of the eye. It is an autonomic response originally associated with general arousal related to task demand (Kahneman, 1973) and engagement (Kahneman et al., 1968). Subsequent research supports a connection to the allocation of selective attention (Wierda et al., 2012) and cognitive effort (Beatty, 1982; Granholm et al., 1996; Kahneman and Beatty, 1966), including listening effort (Kramer et al., 1997; Winn et al., 2018; Zekveld et al., 2018). Listening effort-related pupil dilation is strongly associated with activation in the locus ceruleus–norepinephrine (LC-NE) system (Koelewijn et al., 2015; Peelle, 2018; Wang et al., 2018). The LC-NE system, in turn, is associated with the deployment of cognitive resources (Aston-Jones and Cohen, 2005; Gilzenrat et al., 2010), further supporting the idea that increased pupil dilation under adverse listening conditions reflects greater mobilization of limited cognitive resources. Thus, once a noise is perceived as an auditory object, even before its relevance is evaluated, it becomes a target for attention and therefore also a potential source of effort.
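In practice, task-evoked pupil dilation is usually quantified by epoching the pupil trace around stimulus onsets, subtracting a pre-onset baseline, and averaging across trials, with the peak or mean of the resulting curve compared across listening conditions [as in, e.g., Winn et al. (2018)]. The sketch below is a minimal illustration of that analysis; the function and parameter names are mine, not those of any published toolbox.

```python
import numpy as np

def event_related_dilation(pupil_trace, fs, onsets_s,
                           baseline_s=(-0.5, 0.0), window_s=(0.0, 3.0)):
    """Baseline-corrected event-related pupil dilation: for each trial,
    subtract the mean pre-onset pupil size, then average the resulting
    dilation traces across trials. `pupil_trace` is a 1-D array of pupil
    diameters sampled at `fs` Hz; `onsets_s` are stimulus onsets (s)."""
    b0, b1 = (int(round(s * fs)) for s in baseline_s)
    w0, w1 = (int(round(s * fs)) for s in window_s)
    epochs = []
    for onset in onsets_s:
        i = int(round(onset * fs))
        if i + b0 < 0 or i + w1 > len(pupil_trace):
            continue                                # skip truncated trials
        baseline = pupil_trace[i + b0:i + b1].mean()
        epochs.append(pupil_trace[i + w0:i + w1] - baseline)
    return np.mean(epochs, axis=0)                  # mean dilation curve

# The peak or mean of the returned curve is then compared across
# listening conditions (e.g., quiet vs each masker type).
```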
1. Divided attention
In addition to the likelihood that even the mere existence of a non-target object in the auditory scene increases listening effort, we must consider the value ascribed to that extraneous object by the listener within the ethological context of making decisions about moving toward positive stimuli and away from negative ones. Again, in the traditional case, any signal that is not the target that the listener is instructed to attend to is considered unwanted because, as we have established, it diverts attention from goal performance. However, I would argue that the goal of a listener in daily life is rarely to attend to a single message to the exclusion of all other percepts (Winn and Teece, 2021; Zhang et al., 2021). Even in conditions in which one chooses to listen attentively to a single talker, for example, when attending a lecture or watching a movie or a play, one very likely remains alert for other sounds: the muttered comment of a skeptical colleague, for example, or a sound cue from off stage signaling the entrance of a new character.
Moreover, many instances of speech communication occur in complex auditory environments, such as while walking down a busy street or in locations such as restaurants, coffee shops, or the proverbial "cocktail party," environments in which listeners may want or even need to attend to "noise" as well as to target speech. Although the original "cocktail party" research by Colin Cherry (Cherry, 1953) focused on the ability to attend to one stream to the exclusion of another, subsequent research quickly showed that listeners were often incapable of completely shutting out an irrelevant signal (Broadbent, 1952; Moray, 1959), leading to decades of research and debate on the nature of selective attention (Driver, 2001; Lavie, 2005; Price and Moncrieff, 2021; Shinn-Cunningham and Best, 2015). Currently, there is a general consensus that information can intrude from "unattended" channels (Broadbent, 1982) and that the degree to which this happens depends on a wide variety of properties of the signal(s) and the listener (Aydelott et al., 2015; Bargh, 1982; Vachon et al., 2020). In particular, it appears that signals that have a particular significance to the listener [e.g., their name or ringtone (Moray, 1959; Röer et al., 2013, 2014)] are more likely to draw attention to themselves and thereby incur greater processing demands. While some research suggests this could be the case for emotional or unpleasant stimuli as well (Broadbent, 1977), there is also evidence that listeners tend to habituate to intrusive sounds (Banbury and Berry, 1997; Martin-Soelch et al., 2006), and even the sight and sound of an individual scraping their nails down a chalkboard may go unnoticed if attention is sufficiently occupied by another task (Wayand et al., 2005). Thus, it seems very likely that listeners will devote at least some attentional processing to sounds that can be perceived as distinct from the target speech, even when doing so incurs a greater demand on cognitive processing, though other factors, such as habituation and overall attentional capacity demand, are likely to be important as well.
2. Involuntary capture of attention
Auditory objects that attract attention and demand cognitive processing resources during the accomplishment of other tasks may also cause aversive responses, such as annoyance and distress, even when the primary task does not involve audition at all (Keus van de Poll and Sörqvist, 2016; Marsh et al., 2018), suggesting that listeners also evaluate non-target stimuli in terms of their bearing on the task. For example, in a workplace context, sounds that listeners feel are not necessary tend to be identified as more annoying, while those that listeners feel they have less control over are perceived as more distracting (Kjellberg et al., 1996). Thus, the capture of attention by irrelevant sounds not only reduces the cognitive capacity available for processing task-relevant information (Fairnie et al., 2016; Francis, 2010), but also engages emotional/evaluative processes that can result in an aversive emotional response due to frustration with being unnecessarily or uncontrollably distracted from a primary task (Haapakangas et al., 2011; Kjellberg et al., 1996; Röer et al., 2014; Sörqvist, 2014).
In the case of intrinsically unpleasant sounds, i.e., sounds that in themselves cause an aversive response in the listener (whether for physiological or learned reasons), the involuntary capture of attention may introduce an added level of negative response, as the unavoidable sound is perceived as intolerable. In the extreme case of misophonia (Jastreboff and Jastreboff, 2015; Kumar et al., 2017), triggering sounds seem to elicit an involuntary autonomic response that is interpreted as emotionally meaningful, e.g., disgusting or enraging. The heightened negative affective response then increases attention toward triggering sounds, because emotion heightens sensory attention and predictions about the sensory environment (Smout et al., 2019). Increased attention to triggering sounds, in turn, leads to stronger autonomic responses, resulting in a vicious cycle that makes triggering sounds increasingly intolerable (Dozier, 2015; Jastreboff and Jastreboff, 2015). Such an effect of anticipation may also arise in less pathological contexts. For example, Ely (1975) found that listeners who knew they would hear nails on a chalkboard showed increasing autonomic responses across repeated presentations, while uninformed listeners exhibited no change in autonomic response over time, suggesting that awareness, and perhaps anticipation, of the nature of the unpleasant sound increased the strength of the aversive response over time.
Thus, when a non-target auditory object is present in an auditory scene, even when it does not interact acoustically with the signals relevant to the primary task, we must consider that it will likely attract attention to itself, especially if it exhibits some of the attention-capturing acoustic properties discussed above. By attracting attention, it may interrupt performance of the primary task, even if this task is not auditory in nature, causing irritation or displeasure and increasing task demands (Sörqvist, 2014). Even just pulling attention away from the primary task, therefore, increases the overall cognitive demand on the listener, potentially reducing the availability of resources needed to accomplish the speech perception task and thereby resulting in poorer performance, introducing a further sense of increased listening effort, and increasing frustration. In addition, engaging with the irrelevant sound as a separate object seems likely to engage evaluative processes that may result in a strong affective response to the presence of the sound, a response that may or may not be related to the degree to which the sound interferes with the target speech perception task (as in the cycle described for misophonia, though presumably to a less extreme degree). All of these potential factors should be considered when adding noise to a target signal.
B. Emotion and motivation
Just as different sounds may evoke different kinds and degrees of emotional response, so may the sense of exerting effort, and there is a growing awareness that in many tasks motivation is at least as significant as cognitive effort itself (Bruya and Tang, 2018; Kurzban, 2016; Pichora-Fuller et al., 2016). Following the argument outlined in previous work (Francis and Love, 2020), the willingness to exert effort depends on motivation (Richter, 2016; Richter et al., 2016), and motivation, especially for action, in turn depends on mechanisms that regulate the expenditure of limited resources to maintain homeostasis (Barrett and Simmons, 2015; Kleckner et al., 2017; Touroutoglou et al., 2019). Simply put, motivation to accomplish a task depends on the ongoing assessment of whether accomplishing that task is worth the resources that must be expended to accomplish it, relative to the current availability of and predicted future demand for those same resources (Eckert et al., 2016; Kurzban, 2016; McLaughlin et al., 2021; Schneider et al., 2019; Westbrook and Braver, 2015). The ongoing, moment-by-moment assessment of resource availability vs demand (current and projected) is reflected in the physiological property known as core affect (Barrett, 2006; Duncan and Barrett, 2007), which also provides the physiological basis for the internal states that we identify as emotions (Barrett and Bliss-Moreau, 2009). Thus, effort, emotion, and motivation are linked in that expended effort that is perceived as failing to achieve a desired goal feels bad and is demotivating (Shenhav et al., 2021; Venables and Fairclough, 2009).
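This cost-benefit logic can be sketched in the spirit of expected-value-of-control and effort-discounting accounts [e.g., Shenhav et al. (2021); Westbrook and Braver (2015)]; the notation below is purely illustrative, not a model proposed by those authors or defended here.

```latex
% Illustrative sketch: a listener continues with the task while its
% subjective value remains positive,
\[
  \mathrm{SV}(\text{task}) =
      \underbrace{p(\text{success} \mid e)\, R}_{\text{expected benefit}}
      \;-\;
      \underbrace{C(e, r)}_{\text{expected cost}},
\]
% where e is the effort allocated, R the value of understanding the
% message, and C(e, r) a cost that grows with effort e and shrinks with
% currently available resources r. Added noise lowers p(success | e)
% and raises the e required, pushing SV toward zero or below.
```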
When the presence of noise increases demand on processing resources, listeners must decide whether it is still worth it to continue with the task. If the task is sufficiently important, they may continue performing it even while assessing that it is no longer worth the cost. Following Alhanbali et al. (2018), I argue that it is this demotivating emotional response to the fruitless expenditure of effort, rather than the expenditure of effort as such, that underlies the kind of negative sense that discussions of listening effort typically evoke (Alhanbali et al., 2018; Francis and Love, 2020; Hornsby, 2013; Kramer et al., 2006; Pichora-Fuller et al., 2016; Winn and Teece, 2021). Nevertheless, if the increased effort is attributable to the presence of an identifiable additional noise (a distinct auditory object, e.g., someone else talking or a noisy air conditioning system), then the listener may be annoyed specifically at that noise.
C. Summary
In summary, how much a non-target auditory object is unwanted depends not just on how much it interferes with the formation and/or interpretation of the target signal (interference), but also on whether or not it is intrinsically unpleasant or otherwise aversive to the listener (aversion), whether and how much it interferes with some outcome that the listener desires (distraction), and how much that outcome is actually wanted (motivation). All of these factors may affect the listener's performance on a listening task as well as their evaluation of whether it is worthwhile to continue performing the task, and they do so through distinct but not mutually exclusive mechanisms.
V. CONCLUSION
In 1981, Anne Cutler published a paper entitled "Making up materials is a confounded nuisance, or: Will we able to run any psycholinguistic experiments at all in 1990?" (Cutler, 1981). This short paper drew attention to the myriad ways in which new discoveries were complicating the development of stimulus sets for spoken word recognition tasks and argued for a deeper consideration of these complex interactions in future research. In a similar way, I hope I have shown here that adding noise to speech perception tasks is thoroughly confounded: certainly with auditory processing, but also with attention, effort, affect, and motivation. Nevertheless, research investigating speech perception in noise remains possible and even desirable. The move toward studying speech perception in different kinds of noise is at least partly motivated by the desire to investigate behavior in listening contexts that are more ecologically valid, in the sense of being more like those found in everyday life, and I think this is a laudable and necessary goal (Beechey, 2022; Keidser et al., 2020). However, in doing so, we must also move away from considering added noise as simply a way to dial up the difficulty of a listening task.
The first step is to recognize that adding a new signal to the auditory scene is just one of many ways to increase uncertainty and that, even if we merely wish to add uncertainty to the process of recognizing a target signal, we must nevertheless ask, “What kind of uncertainty is appropriate for my research question?” Speech is composed of different acoustic events, some of which will be more or less affected by different properties of the source or receiver and more or less obscured by specific patterns of acoustic noise or other sorts of signal manipulation. These events are further bound together into phonetic features and higher-level units that, themselves, depend on combinations of spectral and temporal information and thus are likely to be made differentially uncertain by different kinds of noise or manipulation unfolding over different intervals of time. Uncertainty at the phonetic feature level may be compensated for in a different way than lexical uncertainty, for example, and listeners may react differently to the need for different sorts of compensatory processes. The locus of interference within the process of understanding speech (Fig. 2) is, therefore, an important consideration in understanding the effect of added uncertainty and the strategies that listeners may adopt to cope with it.
In addition, however, if the source of uncertainty is an acoustic signal, we must ask whether it can be perceived as an auditory object distinct from that of the target speech. If so, we must consider the possibility of additional demands imposed on selective attention and on the processing of the auditory scene, as well as the possibility that the noise itself may be considered unpleasant, either due to its own inherent properties or because it is perceived as causing difficulties in achieving a desired goal. And in all cases, we must consider the potential for an impact on motivation from emotional responses both to noise-related reductions in performance or increases in listening effort and to the awareness of the presence of the noise itself. Whether we prefer a concept of noise closer to the "source of increased uncertainty" or one closer to "unwanted sound," ultimately we must recognize that uncertainty in speech perception may arise at many levels of processing simultaneously and that the unwanted nature of a sound may have significant implications for task performance that extend far beyond simple errors in the perception of speech.