In-Channel Cancellation

A model of early auditory processing is proposed in which each peripheral channel is processed by a delay-and-subtract cancellation filter, tuned independently for each channel with a criterion of minimum power. For a channel dominated by a pure tone or a resolved partial of a complex tone, the optimal delay is its period. For a channel responding to harmonically-related partials, the optimal delay is their common fundamental period. Each peripheral channel is thus split into two subchannels, one cancellation-filtered and the other not. Attention can be directed to either depending on the task. The model is applied to explain the masking asymmetry between pure tones and noise: a noise target masked by a tone is more easily detectable than a tone target masked by noise. The model is one of a wider class of models, monaural or binaural, that cancel irrelevant stimulus dimensions so as to attain invariance to competing sources. Similar to occlusion in the visual domain, cancellation yields sensory evidence that is incomplete, thus requiring Bayesian inference on an internal model of the world along the lines of Helmholtz's doctrine of unconscious inference.


A pure tone masks a narrowband noise probe less effectively than a narrowband noise masks a tone probe (Hellman 1972; Hall 1997). The difference in masking can reach 20 dB or more. This implies a stark deviation from the standard "power spectrum model" of masking, according to which a sound is detectable if its power exceeds that of the masker within at least one peripheral frequency channel (Moore 1995). Similar masking differences have been observed between harmonic and inharmonic (or noise-like) maskers, as reviewed by de Cheveigné (2021): a harmonic masker is less effective than an inharmonic or noise-like masker.

However, the in-channel cancellation model goes beyond merely extending the harmonic cancellation model to allow for a different "local fundamental period" in each frequency band. It is also applicable to maskers that are spectrally sparse, such that some individual channels are dominated by a single sinusoidal component of the masker. The simplest example is a pure tone, a focus of this paper.

The hypothesis of an automatically-tuned filter in each peripheral channel opens some interesting perspectives. First, it highlights the role of time-domain signal processing in the auditory brainstem, and implies that auditory frequency selectivity is not entirely determined by cochlear frequency selectivity. The idea of a "second filter" dates back to Huggins and Licklider (1951), more recent incarnations being the lateral inhibitory network (LIN) of Shamma (1985) and the phase-opponency model of Carney et al. (2002).

Second, it emphasizes the role of invariance as a goal of auditory processing, and of cancellation as a fundamental operation to achieve that goal. While cancellation might ensure invariance with respect to the background, the integrity of the target is not guaranteed: certain features may be absent or distorted.
Thus, cancellation needs to work hand-in-hand with a Helmholtzian process that can fit an incomplete or distorted target representation to an internal model (de Cheveigné 2021).

Third, a by-product of the fitting process is an estimate of the period that dominates each channel. Whereas the original harmonic cancellation model offered a single period estimate, a potential cue to pitch (see de Cheveigné 1998), the in-channel version offers multiple local estimates. A stimulus for which period estimates differ across channels (e.g. an inharmonic complex) typically lacks a clear overall pitch. However, listeners may still be sensitive to pitch change (Demany and Ramos 2005; Popham et al. 2018), which this model might help explain.

An earlier explanation of the tone/noise or harmonic/inharmonic masking difference is that it is easier to detect a probe on the background of a smooth (unmodulated) pure tone or harmonic complex than on a rough (modulated) noisy or inharmonic background. The in-channel model can be seen either as an alternative to that explanation, or as a sensitive mechanism to detect departure from smoothness, as required to implement that explanation.

The model

A simplified linear model of cochlear filtering is used to derive qualitative predictions. For simplicity and clarity, effects of non-linear transduction, compression, and stochastic neural coding are not considered. The initial stage of auditory processing is modeled as a linear filter bank, followed by a delay-and-subtract cancellation filter, with a delay estimated automatically and independently in each channel. The auditory brain is assumed to have access to the original signal in each peripheral channel, the cancellation filter output, and the delay estimate. Details of the simulation are as follows.

Filters

Cochlear filtering is modeled using a gammatone filter bank (Holdsworth et al. 1988; Slaney 1993) with characteristic frequencies distributed uniformly on a scale of equivalent rectangular bandwidth (ERB), with the bandwidth of each channel set to one ERB according to the estimates of Moore et al. (1983). Transfer functions of selected channels are plotted in Fig. 1, scaled so that their peak gain is one (0 dB). Plotted on a linear frequency scale, filters appear wider at high than at low CF: bandwidth is roughly proportional to CF above 1 kHz, and roughly uniform over the lowest CFs. Each channel attenuates all but a narrow frequency region, so its output is relatively insensitive to the presence of signal features outside that region. However, attenuation is not infinite at any frequency.

This selectivity can serve to "protect" a narrowband target sound from a competing masker. This is illustrated in Fig. 2 (top) for a 1 kHz probe mixed with a wideband masker of equal RMS. Channels near 1 kHz are dominated by the probe, so the presence or absence of the probe can be detected despite the masker. In another example (Fig. 2, middle), individual low-rank partials of a 200 Hz harmonic complex tone are isolated by channels with CFs near each partial's frequency.
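The ERB scale on which the channels are spaced can be sketched numerically. The snippet below uses the Glasberg and Moore (1990) approximation as a convenient stand-in for the Moore et al. (1983) estimates cited above; the function name is ours.

```python
def erb_hz(f_hz):
    """Equivalent rectangular bandwidth (Hz) at centre frequency f_hz,
    using the Glasberg & Moore (1990) approximation (a stand-in here for
    the Moore et al. 1983 estimates used in the text)."""
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)

# Bandwidth is roughly uniform at low CF, roughly proportional to CF above 1 kHz:
for cf in (125, 250, 500, 1000, 2000, 4000, 8000):
    print(f"CF {cf:5d} Hz   ERB {erb_hz(cf):6.1f} Hz   ERB/CF {erb_hz(cf) / cf:.3f}")
```

The printed ERB/CF ratio falls with CF, consistent with the description of Fig. 1: roughly constant bandwidth at low CFs, bandwidth roughly proportional to CF at high CFs.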
Two ranges of CFs can be distinguished. Below ~1 kHz (5th harmonic), each partial appears to be perfectly isolated within a subset of channels close to its frequency (intermediate channels respond to at most two neighboring partials). Above ~1 kHz, individual partials are less perfectly isolated (the peaks of the red curve do not reach one). All channels tend to respond to more than two partials (the dotted line represents the power of the third-strongest partial).

This parallels the classic distinction between "resolvable" and "unresolvable" partials of a complex (Moore and Gockel 2011), which is often attributed to the presence or absence of interaction between partials within a cochlear filter. Fig. 2 (middle) of the present paper suggests an alternative account: a partial is resolvable if it dominates at least one peripheral channel (i.e. the power of all other components is below some threshold). This might, for example, be a condition for the estimation of that partial's frequency based on time-domain neural patterns (Srulovicz and Goldstein 1983).

Fig. 2 (bottom) illustrates a situation where peripheral selectivity is of little avail. The stimulus here is a mixture of a pure tone at 1 kHz and a narrowband noise centered at 1 kHz with 0.5 ERB bandwidth (~60 Hz) and equal RMS amplitude. No channel is clearly dominated by the pure tone, which leads us to expect high thresholds for detecting a tone probe in a narrowband noise (or vice-versa). Here, peripheral filtering offers little benefit.

The second ingredient of the model is a cancellation filter, modeled as a simple delay-and-subtract filter with impulse response h(t) = δ(t) − δ(t − τ) (Fig. 3, left). Its transfer function has zeros at all multiples of f = 1/τ (Fig. 3, right), implying that attenuation is infinite at those frequencies.

The in-channel cancellation filter requires an estimate of the delay τc in each channel c. Here, we make the important assumption that τc is determined automatically within each channel, on the basis of the signal within that channel. This is achieved by scanning candidate delays for a minimum of the filter's output-to-input power ratio, computed over a window long enough to smooth out
This is achieved as schematized in 9  irrelevant fluctuations, yet short enough to track period changes in a non-stationary signal. In 1 the simulations the window size was set to 30 ms. Second, for a purely periodic signal (a pure 2 tone or harmonic complex), the output/input ratio is zero for the period and all its multiples, 3 implying an infinite number of equally valid candidates for ! . For definiteness, it may be 4 convenient to choose among candidates the first for which the power ratio is below some 5 threshold (see de Cheveigné and Kawahara 2002 for the rationale and more details). This 6 parameter may affect ! but it has little impact on the ability of the filter to suppress the 7 background, as reported below. 8 Figure 6 plots the inverse of the automatically-determined delay ! within each channel 9 as a function of CF for three stimuli: a 1 kHz pure tone (left), wideband noise (center), and 10 narrowband noise (right). The parameter was set to 0 in each case. For the pure tone, the 11 estimate is 1/ ! = 1 kHz within all channels. For white noise, it is close to CF for every channel, 12 suggesting that it reflects the noise-excited ringing pattern within each channel.   For each peripheral channel, the auditory system is assumed to have access to both the 25 cancellation-filtered and unfiltered output, and can attend to either depending on the task. 26 It also has access to the automatically-estimated delay parameter ! for each channel. For a pure tone, assuming the delay ! is accurately chosen, the output of the cancellation 4 filter should be zero. Actually, this is not quite correct for a stimulus of finite length: there is 5 a "glitch" at both onset and offset, but the part intermediate between glitches is zero (Fig. 7  6 top right). The same is not true for narrowband noise (Fig. 7 bottom right). For that stimulus, 7 the compound filter fails to cancel any portion. 
The narrowband noise was produced by filtering wideband noise with a gammatone filter centered at 1 kHz with bandwidth 0.5 ERB (roughly 60 Hz), and then shaping it temporally with a 500 ms window with 10 ms raised-cosine onset and offset ramps. In both examples, the delay τc was chosen automatically as described above.

This is qualitatively consistent with the observation, made by Hellman (1972) and others, that a pure-tone masker is much less potent than a narrowband noise masker of similar amplitude. The demonstration that this effect is an emergent property of in-channel cancellation is the foremost result of this paper. The rest of this section examines a wider range of examples.

Smoorenburg's sweep tone masker

Smoorenburg and Coninx (1980) found that a pure-tone masker was up to 20 dB less potent than a sweep-tone masker when the probe was a short pure tone with a frequency that matched the frequency of the pure-tone masker (or the instantaneous frequency of the sweep masker at the probe's position). If anything, the standard power-spectrum model of masking would predict less masking for the sweep, because it contributes only briefly to power within the channel occupied by the probe.

Figure 10 shows the excitation pattern across channels after cancellation filtering for a 1 kHz pure tone probe temporally centered on a 1 kHz pure tone masker (left) or a sweep-tone masker (right). The sweep tone frequency was swept logarithmically from 500 Hz to 2 kHz in 0.5 s (sweep rate 4 octaves/s). In contrast to the case of a pure tone masker (Fig. 9, left), the probe does not emerge after cancellation filtering (Fig. 9, right). This is qualitatively in agreement with the stronger masking observed by Smoorenburg and Coninx (1980) for a sweep relative to a pure-tone masker.

Fig. 10. Spectro-temporal excitation patterns before (left) and after (right) cancellation for a short pure tone probe added to a sweep-tone masker.
Cancellation allows the probe to emerge (visually) from the pure tone but not the sweep.

Below about 1 kHz, 1/τc appears to follow either the frequency of an individual partial, or a lower frequency that likely corresponds to a common subharmonic of neighbouring partials. For higher CFs, 1/τc tends to follow CF (with some glitches). These estimates were obtained with a threshold parameter θ = 0.1 (see Methods). For a smaller value of θ, the delay estimate is less likely to follow the period of individual partials, and more likely to follow the fundamental period of the harmonic tone, or a "local fundamental" of the inharmonic tone.

The results reported here are qualitative. Whether they translate into a quantitative fit with behavioural results awaits more detailed modelling, food for future studies. This section discusses some limits of the model, and speculates on its possible significance as one of a wider class of sound processing mechanisms available to the auditory brain.

The tone/noise asymmetry

The observation that a pure tone is a less potent masker than a narrowband noise of the same intensity (Hellman 1972) clashes with the widely accepted power spectrum model of masking, according to which a probe is heard as soon as its power reaches some proportion of the power of the masker within at least one peripheral channel (Moore 1995). As argued by Hall (1997), the mismatch with the power model suggests an additional unmasking mechanism, for which the in-channel cancellation model offers a putative account.

The tone/noise masking asymmetry has also been explained by arguing that the addition of a probe to a pure tone introduces modulation cues that are easier to detect on a smooth background (pure tone) than on an already-modulated background (narrowband noise) (Verhey 2002).
The in-channel cancellation model can be seen as an alternative to that explanation, or as an implementation of it: in-channel cancellation is arguably a sensitive and expedient way to detect period-to-period fluctuations within each channel. Hall (1997) made a similar argument. Using the in-channel cancellation model, we could likewise invoke the fact that a narrowband noise is more nearly periodic than a wideband noise. Moore et al. (1998) argue that additional detection cues may need to be taken into consideration, such as distortion products, beats, or dip listening.

Is the model applicable to speech?

Figure 13 (left) shows the spectro-temporal excitation pattern in response to a short speech phrase ("Wow Cool!"). Fig. 13 (right) shows the degree of suppression afforded by in-channel cancellation (yellow: no suppression; blue: strong suppression). Clear suppression is evident in a small number of time-frequency pixels, which luckily correspond to pixels of high amplitude in the excitation pattern (left). Within certain pixels the attenuation reaches 20 dB or more, which might make it easier to hear features of a weak target (e.g. another voice) that happen to fall within those pixels. Whether this can translate into a benefit in terms of intelligibility, for example as an explanation of the "cocktail party effect", is beyond the scope of this paper.

The in-channel cancellation model offers multiple delay estimates τc, each corresponding to the period of a partial, or of a group of harmonically-related partials, that dominates a channel. These can be variously interpreted as cues to a pitch local to a spectral region (accessible to consciousness via attention to that spectral region), or as a set of "partial pitches" with perceptual reality at a sub-attentive level (analogous to the "spectral pitches" invoked by Terhardt 1974).
Certain sounds that lack a clear period, such as inharmonic tones or bandpass noise, may also be judged to have a pitch-like quality (sometimes referred to as "tonality") that in-channel cancellation might help explain. Speculating, a condition for perceived tonality may be the existence of peripheral channels for which local cancellation is approximately successful.

Listeners are sensitive to small frequency shifts between successive complexes, harmonic or inharmonic (at least for short delays between sounds). These phenomena suggest the existence of a battery of "frequency shift detectors", as reviewed by Demany and Semal (2018). The in-channel cancellation model might help explain how such shift detectors are implemented. The period τc within a channel can be estimated by scanning the outputs of an array of cancellation filters for a minimum, as schematized in Fig. 14.

The availability of a mechanism that can reduce the masking power of certain sounds implies for those sounds a property of transparency, which may be beneficial in a musical context that involves building a complex sound structure with multiple elements (polyphony). The same property may be detrimental, for example for perceptual coding of audio signals, as it enhances the detectability of quantization noise (Johnston 1988).

How do in-channel cancellation and harmonic cancellation models relate to each other?

The two models are similar in principle and structure. In-channel cancellation might seem more complex because it involves more parameters (the value of τc in each channel), and thus less parsimonious. However, the τc are chosen automatically and thus do not constitute "free" parameters (they cannot be tweaked to improve the fit to a data set). One might argue instead that the in-channel model is simpler because it is implemented locally within each channel, with no need for a cross-channel mechanism to derive a common estimate across channels and distribute that estimate to each channel.

The two models are equally able to cancel a perfectly harmonic background (same period in all frequency regions), but the in-channel model can additionally attenuate backgrounds that are locally harmonic, as well as those that are spectrally sparse (e.g. Roberts ).

The possibility of implementing in-channel cancellation within tonotopically differentiated neural tissue may make it physiologically more plausible than harmonic cancellation, as all operations are tonotopically local and require no cross-channel communication.

Invariance, unconscious inference, and predictive coding

One can argue that a major goal of auditory processing is to ensure invariance, both to irrelevant sound dimensions for the purpose of classification, and to competing sound sources for the purpose of auditory scene analysis. Cochlear frequency resolution or temporal resolution can be interpreted in this light, and binaural, harmonic, or in-channel cancellation as complementing them. For example, cochlear filtering is effective to ensure invariance of a narrowband target to the presence of off-band energy (Fig. 2, top), and in-channel cancellation extends this ability to an on-frequency tonal masker for which cochlear filtering does not suffice (Fig. 2, bottom).

Cochlear filtering and cancellation both contribute to invariance by suppressing the masker, but this operation may affect the target: target features that fall within discarded channels or on zeros of the cancellation filter (Fig. 3, right) are lost. However, incomplete sensory patterns can be accommodated via a process of unconscious inference (Helmholtz 1867) by which an internal model of a perceptual object is constrained by the available (albeit incomplete) information (de Cheveigné 2021).

In-channel cancellation can also be seen as a form of predictive coding (Barlow and Rosenblith 1961; Friston 2018) operating at the very first stage of post-cochlear processing. The outcome of this coding is both a delay parameter (and possibly a goodness-of-fit measure) for each channel, which together help characterize the stimulus, and an "error signal" that may reveal a weaker concurrent target. In-channel cancellation is related to linear predictive coding.

As for neural implementation, there are multiple sites within the auditory brainstem where the necessary excitatory-inhibitory interactions could occur, from dendritic fields within the cochlear nucleus to dendritic fields within the inferior colliculus.
In such a circuit, cancellation would operate on instantaneous spike probabilities, and one might wonder whether this allows sufficient linearity and dynamic range for effective filtering. Settling this issue requires simulation with realistic models of transduction and neural processing, which is beyond the scope of this paper; however, previous modelling demonstrated effective processing for synthetic and speech stimuli using recorded auditory-nerve patterns (Palmer 1990) and simulated spike trains (de Cheveigné 1993; Guest and Oxenham 2019). Evidence for spike-based subtractive processing, as required by the EC model, has been reported in the auditory brainstem, e.g. by Franken et al. (2021). The neural cancellation filter has the same structure as the same-frequency inhibitory-excitatory (SFIE) circuit that Nelson and Carney (2004) proposed to explain amplitude-modulation tuning in the inferior colliculus.

Spectro-temporal fan-out in the brainstem

A remarkable feature of the auditory system is the progressive "fan-out" from the single channel of acoustic vibration at each ear to ~3000 inner hair cells, ~30000 auditory nerve fibers, ~100000 neurons in the cochlear nucleus (Hinojosa 2011), and millions of cells within the auditory cortex. Response properties are usually more complex at higher levels. In-channel cancellation could participate in such a fan-out by augmenting each peripheral channel with an additional cancellation-filtered channel. Multiple, diverse transforms help pattern recognition algorithms ensure invariance to irrelevant feature dimensions (Duda et al. 2012) or, as we suggest here, to background sounds. Random transforms are sufficient for pattern recognition as long as they include convolution and non-linearity and are sufficiently numerous and diverse (e.g. Gauthier 2021), but there is a benefit to selecting a priori useful transforms.
In-channel cancellation ensures invariance to a class of interfering stimuli (harmonic and/or spectrally sparse), which argues for including this transform within the bouquet. In-channel cancellation is arguably a "good thing to have" for the auditory brainstem.

A classic view of auditory processing is that cochlear filter outputs are demodulated and represented as a slowly-varying spike rate, in which case spectral resolution is determined entirely by that of the cochlea. Neural time-domain processing offers a wider range of possibilities. Using a sampled-signal notation for convenience, the cochlear filter bank can be approximated as an N-column matrix A of finite impulse responses of order n. Filtering amounts to multiplying an n-column matrix X of delayed stimulus signals (delays 1⋯n) by the matrix A to obtain the matrix Y = XA of cochlear-filtered signals (one channel per column). If the matrix A allows an inverse (i.e. if its rank equals the filter order n), multiplying by this inverse would reconstruct X = YA⁻¹, i.e. the acoustic waveform (a single column of X) could be reconstructed as a weighted sum of cochlear filter outputs. Furthermore, suppose that a task can be addressed by applying some filter of impulse response h to the acoustic waveform. The output of that filter is available by multiplying Y by A⁻¹h. In other words, any filter (of order at most n) applicable to the acoustic stimulus can be implemented within the auditory system by taking a weighted sum of cochlear filter outputs. The operations applied to Y (cochlear filter bank outputs) are purely scalar: no delays are required, although the availability of delays might ease the implementation.

This rosy perspective should be tempered because it assumes perfect linearity, whereas we know that transduction and neural processing have imperfect linearity and limited dynamic range.
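Setting aside those caveats, the linear-algebra argument itself can be checked numerically. In this sketch the "cochlear" impulse responses are random stand-ins (they need not be realistic for the algebra to hold), delays are circular for simplicity, and the Moore-Penrose pseudoinverse plays the role of the inverse A⁻¹; all names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 16                            # order of each "cochlear" impulse response
N = 24                            # number of channels (N >= n, so A has rank n)
A = rng.standard_normal((n, N))   # columns: random stand-in impulse responses

x = rng.standard_normal(1000)     # acoustic waveform
# n-column matrix of delayed copies of the stimulus (circular delays 0..n-1)
X = np.stack([np.roll(x, d) for d in range(n)], axis=1)
Y = X @ A                         # filter-bank outputs, one channel per column

# any FIR filter h of order <= n applied to the waveform can be recovered as
# a weighted sum of channel outputs, with scalar weights w = pinv(A) @ h
# (the pseudoinverse is a right inverse of A because A has full row rank n)
h = rng.standard_normal(n)
w = np.linalg.pinv(A) @ h
direct = X @ h                    # filter applied to the acoustic waveform
via_channels = Y @ w              # same signal, with no delays applied to Y
print(np.max(np.abs(direct - via_channels)))  # ~ 0 (numerical error only)
```

The key point of the text is visible in the last line: the weights w are purely scalar, so no post-cochlear delays are needed to realize the filter h.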
Measurements of binaural or harmonicity-based unmasking (reviewed in de Cheveigné 2021, p. 10) suggest that the benefit of post-cochlear filtering might be limited to about 3-15 dB. Cochlear filtering is required for greater attenuations, and thus the concepts of "critical band" and "resolvability" remain entirely relevant.

By drawing on both cochlear filtering and time-domain neural processing, in-channel cancellation resembles other hybrid models such as the lateral inhibitory network (LIN) of Shamma (1985) or the phase-opponency model of Carney et al. (2002), as well as earlier proposals of a "second filter" (Huggins and Licklider 1951) and the more recent idea of "synthetic delay lines" (de Cheveigné and Pressnitzer 2006). The interplay between cochlear filtering and time-domain processing is integral to all these models.

A partial of frequency f can be cancelled with a delay equal to any multiple τ = k/f of its period. There might be an advantage in choosing a larger multiple, as a cancellation filter tuned to a higher value of k imposes less attenuation on the spectral region immediately adjacent to the peak of the peripheral filter, as illustrated in Fig. 15.
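This point can be checked with a quick numerical sketch (our own, for a hypothetical 1 kHz partial): the filter's notch narrows as the delay lengthens, so frequencies adjacent to the cancelled partial are better preserved for larger k.

```python
import math

def cancel_gain(f_hz, delay_s):
    """Amplitude response of y(t) = x(t) - x(t - delay):
    |1 - exp(-2j*pi*f*delay)| = 2*|sin(pi*f*delay)|."""
    return 2.0 * abs(math.sin(math.pi * f_hz * delay_s))

f0 = 1000.0                          # hypothetical partial to cancel
for k in (1, 2, 4):                  # delay = k periods of the partial
    delay = k / f0
    # the partial itself is nulled for every multiple k ...
    assert cancel_gain(f0, delay) < 1e-9
    # ... but 50 Hz above the notch the gain, hence the preserved signal,
    # grows with k
    print(k, round(cancel_gain(f0 + 50.0, delay), 3))
```

Running this prints gains of roughly 0.31, 0.62, and 1.18 at 1050 Hz for k = 1, 2, 4: the larger the multiple, the less attenuation just beside the notch.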