Vowel length contrasts in quantity languages are typically realized primarily through duration. This study tested whether spectral cues contribute to the perceptual identification of the short-long monophthongal contrasts in two varieties of Czech. Results showed that listeners attend to spectrum as well as to duration, both for the high vowel-length pairs, which display consistent spectral differentiation in production, and for the remaining contrasts, whose spectral differences are subtle. Reliance on spectrum was generally higher for Bohemian than Moravian listeners. The findings reveal the utilization of spectrum for vowel length perception in Czech, which is described as a “true” quantity language.

Although many of the world's languages contrast phonologically short and long vowels, only in some can the contrast be straightforwardly analyzed in terms of a vowel duration difference. Such languages, including Estonian, Finnish, Czech, and Japanese, are traditionally termed “quantity languages.”1 Languages in which contrasts between pairs of vowels with shorter and longer durations are not primarily realized by acoustic duration differences, such as English, Dutch, German, or Swedish, rely greatly on other cues, especially formant frequency differences,2,3 but also formant dynamics4 and F0 dynamics.5 However, even in languages classified as true quantity languages, formant frequencies can serve as a partial cue to phonemic vowel length; spectral differences have been found to affect short–long category boundary locations along durational continua for Thai6 and Japanese.7

If a quantity language exhibits systematic formant-frequency differences between vowel quantity pairs in speech production, even if they are small, one may ask whether these differences can be interpreted as phonologically relevant, i.e., have consequences for the organization of the sound system. Further, if spectral cues are relevant to listeners, are they used in perceiving a particular short-long vowel contrast, or does the effect of spectrum generalize to other quantity pairs?

The current paper addresses these questions for Czech, a quantity language with a monophthongal system consisting of five short-long vowel pairs: /ɪ/-/iː/, /ɛ/-/ɛː/, /a/-/aː/, /o/-/oː/, and /u/-/uː/.8 Early impressionistic or small-scale acoustic studies described these pairs as differing primarily in duration, although the literature (as early as Frinta9) notes minor qualitative differences within the pairs, especially for high front /ɪ/-/iː/. Recent acoustic data10,11 showed that in Czech production, /ɪ/ and /iː/ are clearly spectrally differentiated, and that in the high back pair, too, short /u/ tends to be more centralized (have higher F1 and F2) than long /uː/. Recent studies also report a mean long/short duration ratio across all vowel pairs of about 1.7:1 (Refs. 11 and 12). Importantly, the long/short vowel ratio differs across the five vowel pairs: it is smallest for high vowels, intermediate for the low vowel, and largest for the mid vowels.11,12 Note that these observations about /ɪ/-/iː/ and /u/-/uː/ are based on the Bohemian Czech (BC) variety (western); the Moravian Czech (MC) varieties (eastern) probably show less spectral and more temporal differentiation in production.8

Since at least in BC a spectral difference between /ɪ/-/iː/ and also /u/-/uː/ exists (accompanied by a somewhat reduced durational difference), it is reasonable to ask whether spectral structure is utilized by listeners as a perceptual cue to these vowel contrasts. This should be the case if perceptual sound categories reflect statistical distributions of the acoustic properties available in the speech signal.13 Podlipský and colleagues12 tested the perceptual cue weighting for Czech /ɪ/-/iː/ and showed that spectrum did serve as a cue for these vowels, decreasing the perceptual weight of their reduced durational distinction; MC listeners attended to spectrum but weighted it as less important than duration, whereas for BC listeners the two cues were approximately equally important. A later study14 showed that this dialectal difference in cue weighting for Czech /ɪ/-/iː/ is reflected in Czech listeners' perception of second-language vowels: when distinguishing non-native (Dutch) vowel contrasts, BC listeners relied on spectral cues more than MC listeners, who attended more to duration. For Czech /u/-/uː/ no cue-weighting data have been published. Given that the /u/-/uː/ pair parallels /ɪ/-/iː/ in production (exhibiting a spectral differentiation and a somewhat reduced long/short ratio), we expect spectrum to be used as a perceptual cue to /u/-/uː/ as well, at least by BC listeners.

Apart from comparing the perception of the front and back high short/long vowel contrasts in Czech, this study also tests whether listeners attend to slight qualitative differences occurring between the members of the non-high Czech short/long vowel pairs. The non-high short/long vowel pairs are included in our experiment also for a methodological reason. Testing the perception of a single vowel contrast (as, e.g., Podlipský and colleagues12 have done), or of a subset of the vowel system, may distort responses due to a stimulus range effect.15 We aim to avoid this by including all Czech monophthongs in the stimulus set.

To sum up, the research questions for this paper are the following: (1) Will previous results for Czech front /ɪ/-/iː/,12 i.e., the use of spectrum as a perceptual cue to this contrast and for BC a similar weighting of spectrum and duration, be replicated if all Czech monophthongs are included in the stimulus set? (2) Will listeners attend to spectral quality when identifying Czech back /u/-/uː/ and if so, what will be the relative cue weighting of spectrum and duration? (3) Will listeners attend to slight spectral differences between members of the non-high short/long vowel pairs?

The experiment, conducted at Palacký University Olomouc, comprised a self-paced ten-alternative forced-choice vowel categorization task performed in Praat.16 The task used isolated nonsense [fVp] monosyllables as stimuli, which was preferred to isolated vowels (to make the task more natural) and to real words (to rule out lexical effects). The labial context was chosen to avoid contexts (coronal, dorsal) with transitions of the different formants having different directions, as shifting such formant contours could lead to formant collisions.

In the stimulus set, there were 17 different vowel qualities altogether. The formant values (except for those interpolated between /iː/ and /ɪ/ and between /uː/ and /u/) were obtained from selected [fVp] tokens that were produced naturally by a Czech male speaker. The F1 and F2 values used (see Table 1) were within one standard deviation from the means reported for Czech males by Skarnitzl and Volín.10 A naturally-produced [fiːp] token served as the basis for the [fiːp] to [fɪp] set. This set contained six qualities: the original [fiːp] at one endpoint and the remaining points produced by resynthesis manipulating the values of the first four formants (with formant transitions included in the manipulated portion). The steps were psychoacoustically approximately equal, with interpolation along the equivalent rectangular bandwidth (ERB) rate scale. The resynthesis was performed in Praat,16 using the procedure described in the Praat manual entry “Source-filter synthesis 4. Using existing sounds” with the following additions. First, the tracked formant contours were checked against a broadband spectrogram and manually corrected if necessary; second, the original intensity of the manipulated portion was preserved; and third, the original high-frequency resonances (lost due to the necessary downsampling) were added back into the signal. A [fuːp] to [fʊp] set, comprising five qualities, was created analogously. The consonants used in this set were taken from the original [fuːp] token, not [fiːp], and thus preserved potential coarticulation effects. Table 1 gives the resulting spans between the endpoint qualities of these two sets for each formant, converted to just-noticeable-difference (JND) units using 0.3 bark as the constant discrimination threshold.17
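The equal-step interpolation along the ERB-rate scale can be sketched as follows. This is a minimal illustration, not the authors' Praat script: the endpoint values are the F1 and F2 ERB ranges listed for [ɪ]–[iː] in Table 1, the function names are hypothetical, and the ERB-to-Hz conversion assumes the Glasberg and Moore formula (ERB-rate = 21.4 · log10(0.00437 · f + 1)), which the text does not specify and which may differ from the conversion built into Praat.

```python
def interpolate_equal_steps(start, end, n_points):
    """Return n_points values equally spaced between start and end (inclusive)."""
    step = (end - start) / (n_points - 1)
    return [start + i * step for i in range(n_points)]


def erb_rate_to_hz(erb):
    """Invert ERB-rate = 21.4 * log10(0.00437 * f + 1).

    ASSUMPTION: the source does not state which ERB formula was used.
    """
    return (10 ** (erb / 21.4) - 1) / 0.00437


# Endpoints from Table 1, ordered from [iː] (original token) to [ɪ].
N_QUALITIES = 6
f1_erb = interpolate_equal_steps(5.69, 8.90, N_QUALITIES)
f2_erb = interpolate_equal_steps(22.67, 20.97, N_QUALITIES)

# Convert each step back to Hz, e.g. as targets for formant resynthesis.
f1_hz = [erb_rate_to_hz(e) for e in f1_erb]
f2_hz = [erb_rate_to_hz(e) for e in f2_erb]

for step_index, (f1, f2) in enumerate(zip(f1_hz, f2_hz), start=1):
    print(f"step {step_index}: F1 = {f1:.0f} Hz, F2 = {f2:.0f} Hz")
```

Because the steps are equal in ERB rather than in Hz, the Hz differences between neighbouring steps grow toward the higher-frequency end of each continuum.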

Table 1.

Spectral spans between the endpoints of the short/long vowel pair stimuli. Spans smaller than 1 JND are italicized.

| Pair | F1 ERB range | F1 JND span | F2 ERB range | F2 JND span | F3 JND span | F4 JND span |
|---|---|---|---|---|---|---|
| [ɪ]–[iː] | 8.90–5.69 | 5.65 | 20.97–22.67 | 4.53 | 4.87 | 2.36 |
| [ʊ]–[uː] | 8.90–5.36 | 6.22 | 14.47–12.71 | 3.42 | 1.02 | 0.03 |
| [ɛ]–[ɛː] | 11.80–12.50 | 1.91 | 19.33–19.66 | 1.61 | 0.82 | 0.80 |
| [a]–[aː] | 12.91–14.13 | 3.08 | 16.39–16.60 | 0.39 | 0.28 | 0.02 |
| [o]–[oː] | 10.50–10.80 | 0.70 | 14.42–14.47 | 0.23 | 0.66 | 0.46 |

Six additional vowel qualities were obtained from naturally-produced [fVp] syllables where the vowels were [ɛ], [ɛː], [a], [aː], [o], and [oː] (the remaining Czech vowels). As in the high-vowel sets, each quality came with the original flanking consonants. Table 1 shows the differences between the qualities of each short/long vowel pair in JND for each formant, as well as the range of F1 and F2 in ERB. The distance between [ɛ] and [ɛː], between [a] and [aː], and between [o] and [oː] corresponded, respectively, to less than a third, a little over a third, and less than a tenth of the distance between the endpoints of the high-vowel continua. The differences in the slope of F0, which can affect the perception of vowel length,5,7 were small for [ɛ]-[ɛː] and [a]-[aː] ([ɛ] −12.68 vs [ɛː] −11.14, and [a] −15.22 vs [aː] −16.87 semitones per s) but larger for [o]-[oː] ([o] −11.83 vs [oː] −21.45 semitones per s). The 6 [fiːp] to [fɪp] qualities and the 5 [fuːp] to [fʊp] qualities were each combined with 5 durations equally spaced on a logarithmic scale: 90, 104, 120, 139, and 160 ms. In these syllables, stimuli with the endpoint vowel durations sounded like good exemplars of the short and long vowels to the authors (three native speakers of Czech). This duration span was converted to JND, assuming a discrimination threshold of 5 ms at the reference value of 90 ms (Ref. 18), which corresponds to a Weber fraction of approximately 0.054, using Eq. (1),

(1)

The span between the endpoint durations corresponded to about 10.64 JND and was thus larger than the spans between the formant values of the endpoints of the high-vowel continua (see Table 1). Using a smaller range of duration values in these isolated syllables, more comparable in JND to the spectral range, would have meant using durations neither short enough nor long enough to make good short and long vowels, as judged by the authors (native Czech speakers). We reasoned that if listeners categorize the spectrally-differing stimuli within each of the two high-vowel stimulus sets differently, despite the fact that the spectrum span is smaller than the duration span, it will reflect a robust reliance on this dimension as a perceptual cue. The 3 non-high vowel pairs had the 3 intermediate durations each, i.e., 104, 120, and 139 ms, corresponding to the JND span of about 5.37, which is again more than the spectral difference for these pairs (see Table 1). The reason why only the three intermediate durations were combined with these vowel qualities was to reduce the overall length of the task, and not to make the disproportion between the spectral and durational range for these pairs even larger, potentially causing listeners to rely on duration more strongly. In sum, there were thus (6 /iː/-/ɪ/ qualities + 5 /uː/-/u/ qualities) × 5 durations + 6 non-high vowel qualities × 3 durations = 73 different stimuli altogether. Throughout all the stimuli, the durations of the friction of [f], the closure of [p], and the burst noise of [p] were equalized by truncating a randomly selected portion of the necessary duration in the middle of these segments.
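The duration figures above can be checked numerically. The sketch below is an assumption rather than the authors' Eq. (1): it takes the JND span to be the log ratio of the two durations divided by the log of one Weber step, ln(1 + 5/90) ≈ 0.054. That form reproduces the reported spans of about 10.64 and 5.37 JND; the function and variable names are hypothetical.

```python
import math


def duration_span_jnd(short_ms, long_ms, threshold_ms=5.0, reference_ms=90.0):
    """Number of JND steps between two durations on a logarithmic scale.

    ASSUMPTION: one JND is a constant proportional step of
    (1 + threshold_ms / reference_ms); the source equation is not shown.
    """
    log_step = math.log(1 + threshold_ms / reference_ms)  # about 0.054
    return math.log(long_ms / short_ms) / log_step


# Five durations equally spaced on a log scale between 90 and 160 ms.
durations_ms = [round(90 * (160 / 90) ** (i / 4)) for i in range(5)]

full_span = duration_span_jnd(90, 160)    # high-vowel sets, all 5 durations
inner_span = duration_span_jnd(104, 139)  # non-high pairs, 3 middle durations

print(durations_ms)
print(f"full span:  {full_span:.2f} JND")
print(f"inner span: {inner_span:.2f} JND")
```

Under this assumed form, both duration spans exceed the largest single-formant spectral span in Table 1 (6.22 JND for F1 of [ʊ]–[uː]) only in the case of the full range; the inner range is still larger than every spectral span listed for the non-high pairs.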

On each trial, one of the 73 [fVp] stimuli, without a carrier phrase, was played once via circumaural headphones and the participant clicked on one of the ten pseudowords “fop fip fap fup fep fóp fíp fáp fúp fép” shown on a computer screen (the diacritics unambiguously denote long vowels in Czech spelling). Each of the 73 stimuli was presented 5 times in random order.
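The stimulus inventory and the trial list described above can be enumerated as in the following sketch. The labels (`i_step1`, `u_step1`, and so on) and the function `build_trials` are hypothetical, and the sketch assumes a single fully randomized list; the text does not say whether repetitions were blocked.

```python
import random
from itertools import product

DURATIONS_MS = [90, 104, 120, 139, 160]

# Hypothetical labels for the spectral steps of the two high-vowel continua.
high_front = [f"i_step{k}" for k in range(1, 7)]  # 6 qualities, [iː] to [ɪ]
high_back = [f"u_step{k}" for k in range(1, 6)]   # 5 qualities, [uː] to [ʊ]
non_high = ["ɛ", "ɛː", "a", "aː", "o", "oː"]      # 6 natural qualities

# High-vowel qualities take all 5 durations; non-high take the 3 middle ones.
stimuli = (
    list(product(high_front + high_back, DURATIONS_MS))
    + list(product(non_high, DURATIONS_MS[1:4]))
)


def build_trials(stimulus_list, repetitions=5, seed=None):
    """Repeat every stimulus and shuffle the whole list (assumed unblocked)."""
    trials = list(stimulus_list) * repetitions
    random.Random(seed).shuffle(trials)
    return trials


trials = build_trials(stimuli, repetitions=5, seed=1)
print(len(stimuli), "stimuli;", len(trials), "trials per participant")
```

With 73 stimuli and 5 repetitions, each participant completes 365 trials.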

The participants were 74 native Czech speakers, 34 from Bohemia (the west of the Czech Republic; 23 female), and 40 from Moravia (the east; 24 female). They were volunteers, aged between 19 and 27 yr, recruited among students of the university. None of them reported any hearing or language impairment.

Statistical analyses were performed in R (Ref. 19) using mixed-effects logistic regression models, package lme4, version 1.1.20 (Ref. 20). The data and script are available at https://tinyurl.com/y5v49suq (Ref. 23). The entire data set was split into 5 subsets by stimulus target vowel contrast, excluding occasional non-target contrast responses (namely, 3% of responses for /iː/-/ɪ/ stimuli, 5% for /uː/-/u/ stimuli, and <1% for the 3 other vowel pairs). Per stimulus contrast, we then ran a duration-only and a duration-and-spectrum model on the short versus long responses. Whether the response was short (versus long) within each contrast was the dependent variable; stimulus duration and, in the duration-and-spectrum model, spectrum were entered as continuous predictors, and dialect as a fixed factor (coded −0.5 for Bohemian and +0.5 for Moravian); the model also included an interaction of duration and dialect and an interaction of spectrum and dialect; participant was entered as a random factor, and the models for high vowels also contained per-participant random slopes for duration and spectrum (random slopes were not modelled for the non-high vowels to avoid convergence issues due to sparse data points). In order to obtain comparable scales for both dimensions, the durational and spectral values were entered in terms of JND steps from the initial points. The initial value on each scale was set to 1, representing the most peripheral and longest stimulus (i.e., a prototypical /iː/ and a prototypical /uː/ on both the durational and the spectral dimension), and every subsequent value was expressed in terms of JND distance from the preceding one. Spectral differences were operationalized as Euclidean distances in the F1 by F2 ERB-scaled space.
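The predictor coding described above can be illustrated with a short sketch. It is not the authors' R script: the function names are hypothetical, the duration scale reuses an assumed log-ratio JND conversion (5 ms threshold at 90 ms), and the spectral part only computes the Euclidean distance in ERB, because the text does not state how that distance was converted to JND steps.

```python
import math

# Dialect contrast coding as stated in the text.
DIALECT_CODE = {"Bohemian": -0.5, "Moravian": 0.5}


def duration_predictor(duration_ms, longest_ms=160.0,
                       threshold_ms=5.0, reference_ms=90.0):
    """JND-scaled duration value: 1 for the longest stimulus, growing as
    the vowel gets shorter.

    ASSUMPTION: JND distance = log ratio / log(1 + threshold / reference).
    """
    log_step = math.log(1 + threshold_ms / reference_ms)
    return 1.0 + math.log(longest_ms / duration_ms) / log_step


def erb_distance(point_a, point_b):
    """Euclidean distance between two (F1, F2) points given in ERB."""
    return math.hypot(point_a[0] - point_b[0], point_a[1] - point_b[1])


# Endpoints of the high front continuum from Table 1, as (F1, F2) in ERB.
long_i = (5.69, 22.67)
short_i = (8.90, 20.97)

print(duration_predictor(160), round(duration_predictor(90), 2))
print(round(erb_distance(long_i, short_i), 3), "ERB between [iː] and [ɪ]")
print(DIALECT_CODE["Bohemian"], DIALECT_CODE["Moravian"])
```

Coding dialect as −0.5 and +0.5 means the main effects of duration and spectrum are estimated at the midpoint between the two listener groups, and each interaction coefficient gives the Moravian-minus-Bohemian difference in slope.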

We adopted α = 0.01 to correct for the 5 individual tests done here. For each of the 5 vowel contrasts, the Akaike information criteria of the duration-only and the duration-and-spectrum fits were compared with the anova() function. For all vowel contrasts the duration-and-spectrum model provided a significantly better fit than the duration-only model, meaning that spectrum significantly affected short-long vowel categorization in all Czech short-long vowel pairs. The resulting logistic regression coefficients are given in Table 2. The duration-and-spectrum models for /iː/-/ɪ/ and /uː/-/u/ revealed significant interactions involving dialect. For /iː/-/ɪ/, dialect interacted with duration (est. = 0.164, z = 4.017, p = 6 × 10⁻⁵) and with spectrum (est. = −0.359, z = −3.567, p = 4 × 10⁻⁴). For /uː/-/u/, dialect marginally interacted with spectrum (est. = −0.136, z = −2.396, p = 0.017). These results indicate that duration was a stronger predictor of /iː/-/ɪ/ categorization for Moravian than for Bohemian listeners, while spectrum was a stronger predictor for /iː/-/ɪ/ as well as for /uː/-/u/ in Bohemians than in Moravians. Figure 1 visualizes the relative contribution of spectral and durational cues to the perceptual categorization of /iː/-/ɪ/ and /uː/-/u/ in each of the two dialects, smoothing the data using local polynomial regression (the loess function) in R.19

Table 2.

The glmer model coefficients for the spectral and the duration predictor whose scales were adjusted for JND spans. Per-dialect coefficients are from separate per-dialect glmer analyses.

| Group | Coefficient duration | 95% c.i. coef. dur. | Coefficient spectrum | 95% c.i. coef. spec. |
|---|---|---|---|---|
| /iː/-/ɪ/ | 0.536 | 0.495–0.578 | 1.328 | 1.226–1.431 |
| BC /iː/-/ɪ/ | 0.458 | 0.405–0.512 | 1.562 | 1.360–1.763 |
| MC /iː/-/ɪ/ | 0.614 | 0.552–0.675 | 1.125 | 1.022–1.228 |
| /uː/-/u/ | 0.614 | 0.568–0.661 | 0.572 | 0.515–0.628 |
| BC /uː/-/u/ | 0.600 | 0.542–0.659 | 0.638 | 0.567–0.730 |
| MC /uː/-/u/ | 0.625 | 0.558–0.693 | 0.508 | 0.437–0.578 |
| /ɛː/-/ɛ/ | 0.859 | 0.742–0.976 | 0.666 | 0.520–0.812 |
| /oː/-/o/ | 0.640 | 0.561–0.718 | 0.504 | 0.140–0.867 |
| /aː/-/a/ | 0.872 | 0.745–0.998 | 1.299 | 1.105–1.493 |

Fig. 1.

Contribution of spectrum and duration to short-long vowel categorization in BC and MC listeners. Equal scaling in JND is applied to both dimensions. Darkness indicates the probability of a short vowel response. Black dotted-dashed curves mark the short–long perceptual boundary; tilt of the curve roughly corresponds to cue prominence: horizontal = greater contribution of spectrum, vertical = greater contribution of duration.

We aimed to find out whether spectral information serves as a cue to short–long vowel contrasts in Czech. Previous research indicated that at least for one vowel pair, /iː/-/ɪ/, the length contrast is indeed partially cued by spectrum, and more strongly so for BC than for MC.12 On the basis of production data showing that the duration ratios of phonologically long to short vowels for the high front /iː/-/ɪ/ and high back /uː/-/u/ contrasts are slightly smaller than for the other contrasts,11,12 we reasoned that spectrum can serve as a cue to at least one other vowel pair, namely /uː/-/u/. We tested all five vowel pairs to avoid range effects as well as to examine potential contribution of the spectral cue across the entire monophthongal vowel inventory of Czech.

We found that /iː/-/ɪ/ is cued by spectrum more heavily than by duration in both varieties of Czech tested. The reliance on spectrum for /iː/-/ɪ/ is even stronger for Bohemian than for Moravian listeners, while the reverse holds for the reliance on duration in this vowel pair. For /uː/-/u/, the overall contributions of spectrum and duration are comparable, but as with /iː/-/ɪ/, the perceptual reliance on spectrum is greater in Bohemian than in Moravian listeners. With regard to the three non-high vowel pairs, spectral cues outweigh durational cues for the low vowel pair /aː/-/a/. For /o/-/oː/, the differences in F1, F2, F3, and F4 were each smaller than 1 JND, yet spectrum still affected categorization, suggesting that the F0-slope difference played a role in perception.
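One way to read these comparisons off Table 2 is to express each spectral coefficient as a share of the summed duration and spectrum coefficients. The sketch below does this for the values in Table 2; the normalized share is an illustrative summary introduced here, not a measure reported in the study, and it relies on both predictors being on the same JND-based scale.

```python
# (duration coefficient, spectrum coefficient) copied from Table 2.
TABLE2 = {
    "/iː/-/ɪ/": (0.536, 1.328),
    "BC /iː/-/ɪ/": (0.458, 1.562),
    "MC /iː/-/ɪ/": (0.614, 1.125),
    "/uː/-/u/": (0.614, 0.572),
    "BC /uː/-/u/": (0.600, 0.638),
    "MC /uː/-/u/": (0.625, 0.508),
    "/ɛː/-/ɛ/": (0.859, 0.666),
    "/oː/-/o/": (0.640, 0.504),
    "/aː/-/a/": (0.872, 1.299),
}


def spectral_share(duration_coef, spectrum_coef):
    """Spectrum coefficient as a proportion of the two coefficients' sum.

    ASSUMPTION: illustrative summary only; values above 0.5 mean the
    spectral coefficient is the larger of the two.
    """
    return spectrum_coef / (duration_coef + spectrum_coef)


shares = {group: spectral_share(*coefs) for group, coefs in TABLE2.items()}

for group, share in shares.items():
    print(f"{group:12s} spectral share = {share:.2f}")
```

On this summary, the share exceeds 0.5 for /iː/-/ɪ/ in both dialects and for /aː/-/a/, sits close to 0.5 for /uː/-/u/, and is higher for Bohemian than for Moravian listeners in both high-vowel pairs, in line with the description above.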

To summarize, we confirm that (1) Czech listeners weigh the spectral cue more heavily than the duration cue when identifying members of the short–long /ɪ/-/iː/ vowel pair; this previously reported finding is replicated here with all monophthongal vowels as stimuli. For the first time, we demonstrate that (2) Czech listeners attend to the spectral cues for the short-long distinction in the other high vowel pair, /u/-/uː/, placing comparable reliance on spectrum and duration. For both these vowel pairs, there is a dialectal difference showing that Bohemian listeners weigh spectral cues more heavily than Moravians. It is possible that for the two Czech high vowel quantity pairs, /ɪ/-/iː/ and /u/-/uː/, the reduced long-to-short vowel durational ratio12 (perhaps related to the universal tendency for high vowels to be shorter21) goes hand-in-hand with the high perceptual reliance on spectral differences that we found for these pairs, and the reliance on spectrum in turn reduces the need for speakers to maintain a clear durational differentiation of these vowels (cf. Lindblom's “hypoarticulation”22). Additionally, we found that (3) responses were influenced by the slight spectral differences within each of the non-high vowel pairs.

This study contributes to previous research6,7 showing that even a language described phonologically as a true quantity language may employ subtle phonetic cues, other than duration, for the quantity distinctions. Our findings, showing that Czech listeners utilize spectral properties (formants, F0) for vowel-quantity perception even for length pairs whose members' spectral differences in produced speech are small and possibly less consistent, underline the role of phonetic detail in human speech perception.

This research was supported by the Czech Science Foundation Grant No. 18-01799S.

1. I. Lehiste, “Prosodic change in progress: From quantity language to accent language,” in Development in Prosodic Systems, edited by P. Fikkert and H. Jacobs (Mouton de Gruyter, Berlin, New York, 2003), pp. 47–65.
2. R. Weiss, “Relationship of vowel length and quality in the perception of German vowels,” Linguistics 12(123), 59–70 (1974).
3. K. Hadding-Koch and A. S. Abramson, “Duration versus spectrum in Swedish vowels: Some perceptual experiments,” Stud. Linguistica 18(2), 94–107 (1964).
4. J. M. Hillenbrand, “Static and dynamic approaches to vowel perception,” in Vowel Inherent Spectral Change, edited by G. S. Morrison and P. F. Assmann (Springer, Berlin, Heidelberg, 2013), pp. 9–30.
5. W. A. van Dommelen, “Does dynamic F0 increase perceived duration? New light on an old issue,” J. Phonetics 21(4), 367–386 (1993), available at https://www.researchgate.net/publication/232477957.
6. A. S. Abramson and N. Ren, “Distinctive vowel length: Duration vs. spectrum in Thai,” J. Phonetics 18(2), 79–92 (1990).
7. H. Lehnert-LeHouillier, “A cross-linguistic investigation of cues to vowel length perception,” J. Phonetics 38(3), 472–482 (2010).
8. Š. Šimáčková, V. J. Podlipský, and K. Chládková, “Czech spoken in Bohemia and Moravia,” J. Int. Phon. Assoc. 42(2), 225–232 (2012).
9. A. Frinta, A Czech Phonetic Reader (University of London Press, London, 1925).
10. R. Skarnitzl and J. Volín, “Reference values of vowel formants for young adult speakers of Standard Czech,” Akustické listy 18, 7–11 (2012).
11. N. Paillereau and K. Chládková, “Spectral and temporal characteristics of Czech vowels in spontaneous speech,” Acta Universitatis Carolinae Philologica 2/2019, Phonetica Pragensia, pp. 77–95.
12. V. J. Podlipský, R. Skarnitzl, and J. Volín, “High front vowels in Czech: A contrast in quantity or quality?,” in Proceedings of Interspeech, Brighton, United Kingdom (2009), pp. 132–135.
13. K. Wanrooij and P. Boersma, “Distributional training of speech sounds can be done with continuous distributions,” J. Acoust. Soc. Am. 133(5), EL398–EL404 (2013).
14. K. Chládková and V. J. Podlipský, “Native dialect matters: Perceptual assimilation of Dutch vowels by Czech listeners,” J. Acoust. Soc. Am. 130(4), EL186–EL192 (2011).
15. T. Benders, P. Escudero, and M. J. Sjerps, “The interrelation between acoustic context effects and available response categories in speech sound categorization,” J. Acoust. Soc. Am. 131(4), 3079–3087 (2012).
16. P. Boersma and D. Weenink, “Praat: Doing phonetics by computer [Computer program],” Version 6.0.40, http://www.praat.org/ (Last viewed October 2, 2019).
17. D. Kewley-Port, “Vowel formant discrimination II: Effects of stimulus uncertainty, consonantal context, and training,” J. Acoust. Soc. Am. 110(4), 2141–2155 (2001).
18. S. G. Nooteboom and G. J. Doodeman, “Production and perception of vowel length in spoken sentences,” J. Acoust. Soc. Am. 67(1), 276–287 (1980).
19. R Core Team, R: A language and environment for statistical computing (Vienna, R Foundation for Statistical Computing, 2016), www.r-project.org (Last viewed October 2, 2019).
20. D. Bates, M. Maechler, B. Bolker, and S. Walker, “Fitting linear mixed-effects models using lme4,” J. Stat. Softw. 67(1), 1–48 (2015).
21. I. Lehiste, Suprasegmentals (MIT Press, Cambridge, MA, 1970).
22. B. Lindblom, “Explaining phonetic variation: A sketch of the H&H theory,” in Speech Production and Speech Modelling, edited by W. J. Hardcastle and A. Marchal (Springer, Dordrecht, 1990), pp. 403–439.
23. Online materials: data and analysis scripts, stored at the Open Science Framework, https://tinyurl.com/y5v49suq (Last viewed October 2, 2019).