This study proposes a method for analyzing sampling jitter in audio equipment based on time-domain analysis, using the temporal fluctuations of the zero-crossing points of recorded sinusoidal waves to characterize the jitter. The method enables the jitter of an audio player to be evaluated separately from that of audio recorders when the same playback signal is fed simultaneously into two audio recorders. The experiments were conducted using commercially available portable devices with a maximum sampling rate of 192 000 samples per second. The results demonstrated that jitter values on the order of a few tens of picoseconds can be identified in an audio player. Moreover, the proposed method enabled the separation of jitter from phase-independent noise by utilizing the left and right channels of the audio equipment. As such, this method is applicable to the performance evaluation of audio equipment, signal generators, and clock sources.

Sampling jitter in audio equipment is the deviation of the sampling instants from their ideal timing, which is set by the sampling frequency $f_S$ of the digital-to-analog converter (DAC) or analog-to-digital converter (ADC). Sampling jitter $j_S(t)$ changes the sampling instants from $t[i] = (i-1) f_S^{-1}$ to $t[i] = (i-1) f_S^{-1} + j_S(t[i])$. Hence, sampling jitter affects the performance of audio equipment. The sampling jitter has conventionally been analyzed in the frequency domain.1–4 In this method, one plays back a sinusoidal wave whose frequency is $f_P/4$, where $f_P$ is the sampling frequency of the audio player, records it, and then examines the frequency response using a window function with small sidelobe levels, such as a Blackman-Harris window. In addition to frequency-domain analysis (FDA), time-domain analysis (TDA) of sampling jitter has been conducted.5 The Hilbert transform has been employed to obtain the real jitter waveform $j_S(t)$. The advantage of TDA is that one can separately extract jitter and amplitude modulation (AM) from a recorded waveform, which is not possible with FDA.

In this study, we propose an efficient and powerful method to characterize sampling jitter in audio equipment. The proposed method comprises two key elements. The first is an improved TDA termed zero-crossing analysis (ZCA). To apply this method, the zero-crossing points (ZCPs) of a recorded waveform are analyzed, following which the ZCPs of an ideal sinusoidal wave are calculated. Time differences in ZCPs between the recorded waveform and an ideal sinusoidal wave contain the jitter information. We term these time differences “zero-crossing fluctuations (ZCFs).” The ZCA enables us to extract jitter information from a recorded waveform even when the input signal contains both jitter and AM. The second key element is the simultaneous recording of the same playback signal with two audio recorders to generate two independent waveforms. We term this “double recorder setup (DRS).” Because ZCA preserves absolute time information, we can exactly compare and calculate the positive and negative correlations of ZCFs between the two generated waveforms. Based on the addition rule of probability, the sampling jitter of the player and that of the recorders can be individually evaluated.

Note that the proposed method requires neither an optional output clock signal synchronized with the recorders' internal clock nor an external clock generator that is more precise than the internal clock. Thus, the proposed method can be carried out with low-cost recorders and is feasible at an end-user level. The proposed method can also be applied to high-frequency phase noise and jitter measurements. By replacing the two recorders with two digital oscilloscopes with higher sampling rates, we can use the proposed method to evaluate the performance of various signal and clock generators that output high-frequency sinusoidal waves.

The method is somewhat similar to the reciprocal calibration of microphones, which has been used since the 1940s;6,7 however, it does not require the bidirectional use of devices. The DRS is similar to the setup of the cross-spectrum method (CSM) for phase noise measurement.8 In CSM, repeating FDA with double instruments reduces the influence of the instruments to $1/\sqrt{m_{\mathrm{CS}}}$, where $m_{\mathrm{CS}}$ is the number of measurements. The proposed method can be regarded as a TDA version of CSM, in which the influence of the instruments is canceled by using the DRS.

In this study, we focus on the performance of audio equipment; however, the proposed method also contributes to the field of human audibility.9–11 Previous jitter studies9 demonstrated that the threshold of perceptual detection of random jitter in music signals is large, but the original jitter, i.e., the jitter already present before extra jitter is added, was not controlled in those studies. Researchers can quickly select instruments with minimal jitter using the method proposed herein. This provides an opportunity to examine how the detection threshold is reduced after listeners are well trained using audio players with lower levels of jitter and music signals to which slight artificial jitter has been added. Moreover, the proposed method helps to diagnose whether the player is operating appropriately in all audibility studies.

The remainder of this paper is organized as follows. Section II describes the principles on which the proposed method is based. Section III presents the experimental procedure. Section IV presents our results. In Sec. V, multiple perspectives are discussed, and Sec. VI provides a summary and outlook. Numerical calculations, including several instructive results, are presented in the supplementary material. A preprint of this study has been posted on a preprint server.12 

A pure sinusoidal wave $F_{\mathrm{pure}}(t)$ is expressed as follows:
$$F_{\mathrm{pure}}(t) = A_0 \cos(\omega t + \theta_0), \tag{1}$$
where $t$, $\omega$, $\theta_0$, and $A_0$ are the time, angular frequency, initial phase, and amplitude of the wave, respectively. When one plays a digital audio file that is expected to reproduce a pure sinusoidal wave, the resulting playback signal is not a pure sinusoidal wave. It contains (i) jitter, (ii) AM, and (iii) phase-independent (PI) noise. Here, we explain these three noise patterns individually. (i) Jitter is the deviation of the playback timing at each point. When jitter is present, $t$ is replaced by $t + j(t)$, where $j(t)$ is the jitter of the player. The pure sinusoidal wave $F_{\mathrm{pure}}(t)$ is thus changed to
$$F_1(t) = A_0 \cos\{\omega[t + j(t)] + \theta_0\} \tag{2}$$
$$\phantom{F_1(t)} = F_{\mathrm{pure}}(t) + n_{\mathrm{jitter}}(t). \tag{3}$$
When $\omega j(t) \ll 1$, the noise caused by jitter, $n_{\mathrm{jitter}}(t)$, is expressed as follows:
$$n_{\mathrm{jitter}}(t) \approx -\omega j(t) A_0 \sin(\omega t + \theta_0). \tag{4}$$
This shows that the noise caused by jitter is most prominent where the signal is near zero, i.e., $F_1(t) \approx 0$, and that it is easier to measure the jitter $j(t)$ when $A_0$ and $\omega$ are larger. A conceptual view of jitter is shown in Fig. 1(a). (ii) AM is the variation of the amplitude of a wave with time. If AM is present, $A_0$ is replaced by $A_0 + A_M(t)$, where $A_M(t)$ is a continuous function that represents AM at time $t$. The pure sinusoidal wave $F_{\mathrm{pure}}(t)$ is changed to
$$F_2(t) = [A_0 + A_M(t)] \cos(\omega t + \theta_0) \tag{5}$$
$$\phantom{F_2(t)} = F_{\mathrm{pure}}(t) + n_{\mathrm{AM}}(t). \tag{6}$$
FIG. 1.

Schematic of sinusoidal signals modulated by (a) jitter, (b) AM, and (c) PI noise. The horizontal axis represents the phase of a pure sinusoidal wave. Time $t = \bar{s}_1$ is the first zero-crossing time of the pure sinusoidal wave, and $m$ is the number of cycles. The filled bands represent the range of fluctuation during the repetition.

The noise caused by AM, $n_{\mathrm{AM}}(t)$, is expressed as follows:
$$n_{\mathrm{AM}}(t) = A_M(t) \cos(\omega t + \theta_0). \tag{7}$$
Consequently, the noise caused by AM grows with $|\cos(\omega t + \theta_0)|$, i.e., it is largest near the peaks of the wave and vanishes at the zero crossings, in contrast to the behavior of jitter. A conceptual view of AM is shown in Fig. 1(b). (iii) The actual wave contains not only jitter and AM but also PI noise, which represents all other types of noise that are not categorized as jitter or AM. If PI noise is present, the pure sinusoidal wave $F_{\mathrm{pure}}(t)$ is changed to
$$F_3(t) = F_{\mathrm{pure}}(t) + n_{\mathrm{PI}}(t), \tag{8}$$
where $n_{\mathrm{PI}}(t)$ denotes PI noise. A conceptual view of PI noise is shown in Fig. 1(c). Considering all three noise patterns, the total noise $n_{\mathrm{total}}(t)$ and the actual playback signal $c(t)$ become
$$n_{\mathrm{total}}(t) = n_{\mathrm{jitter}}(t) + n_{\mathrm{AM}}(t) + n_{\mathrm{PI}}(t), \tag{9}$$
$$c(t) = F_{\mathrm{pure}}(t) + n_{\mathrm{total}}(t). \tag{10}$$
In actual digital audio players, $n_{\mathrm{jitter}}(t)$ comes primarily from the internal clock module, and $n_{\mathrm{AM}}(t)$ from the DAC units. The output amplifier of the DAC contributes to $n_{\mathrm{PI}}(t)$.
In the proposed method, we focus on the ZCPs of impure sinusoidal waves. The ZCPs of $F_{\mathrm{pure}}(t)$, represented as $\bar{s}_k$, satisfy $F_{\mathrm{pure}}(\bar{s}_k) = 0$; thus, they are given by
$$\bar{s}_k = \frac{1}{\omega}\left(k\pi - \frac{\pi}{2} - \theta_0\right), \tag{11}$$
where $k$ is the index of ZCPs. At $t = \bar{s}_k$, the playback signal $c(t)$ is not equal to zero and is given by
$$c(\bar{s}_k) = (-1)^k \omega A_0 j(\bar{s}_k) + n_{\mathrm{PI}}(\bar{s}_k), \tag{12}$$
which can be derived from Eq. (10), because both $F_{\mathrm{pure}}(t)$ and $n_{\mathrm{AM}}(t)$ vanish at $t = \bar{s}_k$.
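
The small-jitter approximation of Eq. (4) and the zero-crossing relation of Eq. (12) can be checked numerically. The following is a minimal sketch rather than part of the original analysis: the 50 ps constant timing offset, the unit amplitude, and the phase of 0.3 rad are illustrative assumptions, and the function names are hypothetical.

```python
import math

A0 = 1.0                       # amplitude, arbitrary units (assumed)
omega = 2 * math.pi * 12_000   # angular frequency of a 12 kHz tone
theta0 = 0.3                   # initial phase in rad (assumed)
jit = 50e-12                   # constant 50 ps timing offset (assumed)


def f_pure(t):
    """Pure sinusoidal wave, Eq. (1)."""
    return A0 * math.cos(omega * t + theta0)


def n_jitter_exact(t, j):
    """Exact jitter noise F_1(t) - F_pure(t), from Eqs. (2) and (3)."""
    return A0 * math.cos(omega * (t + j) + theta0) - f_pure(t)


def n_jitter_approx(t, j):
    """First-order form of Eq. (4), valid when omega * j << 1."""
    return -omega * j * A0 * math.sin(omega * t + theta0)


def ideal_zcp(k):
    """k-th zero-crossing point of the pure wave, Eq. (11)."""
    return (k * math.pi - math.pi / 2 - theta0) / omega


for k in (1, 2, 3, 4):
    s_bar = ideal_zcp(k)
    # Columns: k, exact jitter noise, Eq. (4), jitter term of Eq. (12)
    print(k, n_jitter_exact(s_bar, jit), n_jitter_approx(s_bar, jit),
          (-1) ** k * omega * A0 * jit)
```

At each ideal zero crossing, the three printed values should agree closely, and their sign alternates with $k$ as Eq. (12) indicates.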

The aforementioned noise classification is also appropriate in the recording process. Although a data array is a set of discrete variables rather than a continuous function, we refer to it as a “waveform” in the following. When one records a pure sinusoidal playback signal, the recorded waveform is not equal to the sampled data points of a pure sinusoidal wave because it contains (i) jitter, (ii) AM, and (iii) PI noise. In real digital audio recorders, jitter originates primarily from the internal clock module, whereas AM originates from ADC units. The driver amplifier for ADC contributes to PI noise.

As demonstrated later in this study, PI noise becomes comparable to jitter in the case of recent audio equipment with small jitter values, typically less than 100 ps. Jitter and PI noises can be separated using the left and right channels of the player and recorder.

Here, we describe how the jitter, AM, and PI noise introduced in Sec. II A arise in a real playback process. To this end, we introduce a model of a digital audio player. A schematic of a single-channel digital audio player is shown in Fig. 2(a). The parameters are set to match the experimental conditions described in Sec. III, but they can be chosen freely to suit other experimental conditions. The DAC in the player, assumed to be ideal and noise-free, performs the conversion at $N_P$ bits with a sampling rate $f_P$. In this study, $N_P$ and $f_P$ are set to 24 bit and 48 kHz, respectively.

FIG. 2.

(a) Model diagram of a single-channel digital audio player. (b) Playback waveform; dots have been reduced to improve visibility. (c) Relationships among DAC output $v(t)$ (dotted line), signal after LPF $F_{\mathrm{pure}}(t)$ (solid line), and playback waveform $v[i]$ (black square). (d) Model diagram of a dual-channel digital audio player.

The playback waveform is represented as $v[i]$, where $i$ is a natural number and $v[i]$ is an $N_P$-bit signed integer. The length of $v[i]$ is $N_{\mathrm{total}}$. The waveform $v[i]$ used in this study is shown in Fig. 2(b). It can be separated into five parts, labeled (i), (ii), (iii), (iv), and (v) as depicted in the figure. Dots have been reduced to improve visibility in the figure. The horizontal axis represents index $i$. (i) Silent part, where the playback waveform is $v[i] = 0$ for $1 \le i \le i_{\mathrm{main}} - N_F - 1$. As shown below, $i_{\mathrm{main}}$ is the first index of the main part, and $N_F$ is the length of the fade part. We set $i_{\mathrm{main}} = 480\,000$ and $N_F = 240\,000$. Therefore, the temporal duration of the silent part becomes $(i_{\mathrm{main}} - N_F) f_P^{-1} \approx 5$ s. (ii) Fade-in part: the playback waveform has the form
$$v[i] = \left\{ v_{\min} + \left[ 1 + \cos\left( \pi \frac{i - i_{\mathrm{main}}}{N_F} \right) \right] \frac{v_{\max} - v_{\min}}{2} \right\} \times \cos\left( 2\pi \frac{\mathrm{mod}[i - i_{\mathrm{main}}, 4]}{4} \right) \tag{13}$$
in the region $i_{\mathrm{main}} - N_F \le i \le i_{\mathrm{main}} - 1$, where $v_{\max} := 2^{N_P - 1} - 1 = 8\,388\,607$ is the maximum value of an $N_P$-bit signed integer. The initial amplitude in the fade-in part, $v_{\min}$, is set to 256. The temporal duration of the fade-in part becomes $N_F f_P^{-1} \approx 5$ s. (iii) Main part: the playback waveform is a repetition of $(v_{\max}, 0, -v_{\max}, 0)$ for $i_{\mathrm{main}} \le i \le i_{\mathrm{main}} + N_{\mathrm{main}} - 1$, i.e., the playback waveform in the main part is expressed as follows:
$$v[i] = v_{\max} \cos\left( 2\pi \frac{\mathrm{mod}[i - i_{\mathrm{main}}, 4]}{4} \right). \tag{14}$$
In this study, we set $N_{\mathrm{main}} = 1\,440\,000$. Therefore, the main part of the waveform begins at $i_{\mathrm{main}} f_P^{-1} \approx 10$ s after playback is started and continues for $N_{\mathrm{main}} f_P^{-1} \approx 30$ s. (iv) Fade-out part: this part begins after the main part, and its length is the same as that of the fade-in part. The waveform of the fade-out part is the reversed sequence of the fade-in part, which is expressed as Eq. (13). (v) Second silent part: this part begins after the fade-out part, and its length is the same as that of the first silent part. Consequently, the total length of the playback signal, which is the sum of the lengths of its five parts, is $N_{\mathrm{total}} f_P^{-1} = (2 i_{\mathrm{main}} + N_{\mathrm{main}}) f_P^{-1} \approx 50$ s.
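
Parts (i) to (iii) of the playback waveform can be generated directly from Eqs. (13) and (14). The sketch below is illustrative only: the function name `playback_waveform` is hypothetical, rounding the fade-in envelope to the nearest integer is an assumption (the text does not state how integer values are obtained), and the fade-out and second silent parts are omitted.

```python
import math


def playback_waveform(n_bits=24, i_main=480_000, n_fade=240_000,
                      n_main=1_440_000, v_min=256):
    """Return parts (i)-(iii) of v[i] as a list; list index 0 holds v[1]."""
    v_max = 2 ** (n_bits - 1) - 1

    def carrier(i):
        # cos(2*pi*mod[i - i_main, 4]/4) takes the exact values 1, 0, -1, 0.
        return (1, 0, -1, 0)[(i - i_main) % 4]

    # (i) Silent part: indices 1 .. i_main - n_fade - 1.
    silent = [0] * (i_main - n_fade - 1)

    # (ii) Fade-in part, Eq. (13); envelope rounded to an integer (assumption).
    fade_in = []
    for i in range(i_main - n_fade, i_main):
        envelope = v_min + (1 + math.cos(math.pi * (i - i_main) / n_fade)) \
            * (v_max - v_min) / 2
        fade_in.append(round(envelope) * carrier(i))

    # (iii) Main part, Eq. (14): repetition of (v_max, 0, -v_max, 0).
    main = [v_max * carrier(i) for i in range(i_main, i_main + n_main)]

    return silent + fade_in + main


# Small illustrative parameters so that the result is easy to inspect.
demo = playback_waveform(i_main=41, n_fade=20, n_main=16)
print(len(demo), demo[20], demo[40:44])
```

With the default arguments, the function uses the parameter values quoted above, so the main part starts at list index $i_{\mathrm{main}} - 1$.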

The DAC output voltage, represented as $v(t)$, is a square wave. For $t_P + (i-1) f_P^{-1} \le t < t_P + i f_P^{-1}$, $v(t)$ is expressed as $v(t) = A_P v[i] / v_{\max}$. The time when playback is started is $t_P$. In this section, we set $t_P = -(i_{\mathrm{main}} + N_{\mathrm{main}}/6) f_P^{-1} \approx -15$ s.

A low-pass filter (LPF) is connected after the DAC. The cut-off frequency of the LPF is assumed to be $f_P/2$. Because the square wave $v(t)$ is smoothed by the LPF, the output voltage after the LPF becomes a pure sinusoidal wave expressed as Eq. (1). The frequency of the pure sinusoidal wave becomes $f_C := \omega / 2\pi = f_P / 4$ since the main part of the playback waveform is expressed as Eq. (14). The relationship between $v(t)$ and $F_{\mathrm{pure}}(t)$ is depicted in Fig. 2(c). The dotted and solid lines represent $v(t)$ and $F_{\mathrm{pure}}(t)$, respectively. For comparison with $v(t)$, the playback waveform $v[i]$ is plotted as black squares at the position $t = t_P + (i-1) f_P^{-1}$ and $v(t) = A_P v[i] / v_{\max}$. As mentioned above, the components at $3 f_C, 5 f_C, \ldots$ in $v(t)$ are perfectly attenuated by the LPF, and a pure sinusoidal wave of frequency $f_C$ remains.

The playback signal $c(t)$ is obtained by adding $n_{\mathrm{jitter}}(t)$, $n_{\mathrm{AM}}(t)$, and $n_{\mathrm{PI}}(t)$ to $F_{\mathrm{pure}}(t)$, as shown in Eqs. (9) and (10). In a real audio player, jitter is primarily caused by fluctuation of $f_P$. However, in this model, the DAC and LPF are ideal, and jitter noise is added after the LPF. The output is provided by a buffer amplifier with a direct current (DC)-blocking capacitor.

A model diagram of a dual-channel digital audio player is presented in Fig. 2(d). It comprises two single-channel digital audio players, the jitter inputs of which are assumed to be equipotential. This model diagram is used in Sec. II F.

To represent jitter, AM, and PI noises in a real recording process, we present a model diagram of a single-channel digital audio recorder in Fig. 3(a). As in the previous section, the parameters are fixed to be the same as the experimental conditions described in Sec. III. The playback signal $c(t)$ is fed into a buffer amplifier with input impedance $Z_R$. Subsequently, jitter $a_{\mathrm{jitter}}(t)$, AM $a_{\mathrm{AM}}(t)$, and PI noise $a_{\mathrm{PI}}(t)$ are added, and the high-frequency component is attenuated by an LPF. The cut-off frequency of the LPF is assumed to be $f_R/2$. The voltage signal before the ADC is represented as $x(t)$. The relationship among these quantities is similar to Eqs. (9) and (10) and can thus be expressed as follows:
$$a_{\mathrm{total}}(t) = a_{\mathrm{jitter}}(t) + a_{\mathrm{AM}}(t) + a_{\mathrm{PI}}(t), \tag{15}$$
$$x(t) = \mathrm{LF}\{ c(t) + a_{\mathrm{total}}(t) \}, \tag{16}$$
where $\mathrm{LF}\{\cdot\}$ denotes the low-frequency component.
FIG. 3.

(a) Model diagram of a single-channel digital audio recorder. (b) Relationship between signal before LPF $x(t)$ (solid line) and recorded waveform $x[i]$ (white square). (c) Model diagram of a dual-channel digital audio recorder. The method to utilize a dual-channel recorder as a single-channel recorder is depicted.

The ADC in the recorder, assumed to be ideal and noise-free, performs conversion at $N_R$ bits with a sampling rate of $f_R$. The recorded waveform is represented as $x[i]$. The $i$th value is expressed as $x[i] = \mathrm{floor}[ x_{\max} \{ x(t[i]) / A_R \} ]$, where $A_R$ is a constant with voltage dimensions and $x_{\max} := 2^{N_R - 1} - 1$ is the maximum value of an $N_R$-bit signed integer. In this study, $N_R$ and $f_R$ are set to 24 bit and 192 kHz, respectively. The analog-to-digital conversion timing is denoted as $t[i]$. The value $t[i]$ is expressed as
$$t[i] = t_R + (i-1) f_R^{-1}, \tag{17}$$
where $t_R$ represents the time at which the recording started. We assume that the ADC begins working before the playback starts and that the length of $t[i]$ is adequate to record the entire playback signal. When the voltage signal before the ADC is $x(t[i]) = A_R$, the recorded waveform becomes $x[i] = x_{\max}$. In a real audio recorder, jitter results from the fluctuation of $f_R$; however, in this model, the LPF and ADC are ideal, and jitter is added before the LPF.

The relationship between $x(t)$ and $x[i]$ is depicted in Fig. 3(b). The solid line represents $x(t)$. The recorded waveform $x[i]$ is plotted as white squares at $t = t[i]$ and $x(t) = x[i]$. The range of the horizontal axis is equal to that of Fig. 2(c). The fluctuation of the solid line in this figure shows artificial random noise. Because the ratio between the sampling rate of the ADC and the frequency of $F_{\mathrm{pure}}(t)$ is $f_R / f_C = 16$, 16 sampling points are present in each cycle.13 
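
The ideal ADC of this model, i.e., the floor quantization rule and the sampling instants of Eq. (17), can be sketched as follows. This is an illustration under stated assumptions: $A_R$ is normalized to 1, $t_R$ is set to 0, the input is a noise-free cosine, and the helper names are hypothetical.

```python
import math

F_R = 192_000.0             # recorder sampling rate f_R in Hz
N_R = 24                    # recorder resolution N_R in bits
X_MAX = 2 ** (N_R - 1) - 1  # maximum value of an N_R-bit signed integer


def sample_time(i, t_r=0.0):
    """Conversion instant t[i] of the i-th sample (1-based), Eq. (17)."""
    return t_r + (i - 1) / F_R


def quantize(voltage, a_r=1.0):
    """Ideal ADC: x[i] = floor(x_max * x(t[i]) / A_R)."""
    return math.floor(X_MAX * voltage / a_r)


f_c = 12_000.0                  # frequency of the playback signal in Hz
samples_per_cycle = F_R / f_c   # f_R / f_C

# Two cycles of a noise-free cosine as seen by the ideal ADC.
samples = [quantize(math.cos(2 * math.pi * f_c * sample_time(i)))
           for i in range(1, 33)]
print(samples_per_cycle, samples[:4])
```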

We show the model diagram of a dual-channel digital audio recorder in Fig. 3(c). The recorder comprises two single-channel digital audio recorders; the jitter inputs of the single-channel recorders are assumed to be equipotential. To obtain the experimental results described in Secs. IV A, IV B, and IV C, we used a dual-channel recorder as a single-channel recorder by connecting its two analog inputs together and averaging the two waveforms $x^{(L)}[i]$ and $x^{(R)}[i]$. We term this setup a “pseudo single-channel recorder.” As an exception, we analyzed waveforms $x^{(L)}[i]$ and $x^{(R)}[i]$ separately to estimate the jitter from the digital audio recorder. See Sec. V D for details.

In ZCA, we first seek the times at which the voltage signal in the recorder, i.e., $x(t)$, crosses the $t$-axis while $0 \le t \le T$. For this purpose, we reconstruct a continuous function $\tilde{x}(t)$ from the sampling data $x[i]$, which satisfies $\tilde{x}(t) \approx x(t)$ for $0 \le t \le T$. The reconstruction process comprises three steps. (i) To avoid the boundary effect of the sampling data, $x[i]$ is multiplied by a window function, giving $x[i]\, w(t[i])$, where $w(t)$ is of the Blackman type and is expressed as follows:
$$w(t) = \begin{cases} 0.42 + 0.5 \cos(\pi f_R t / N) + 0.08 \cos(2\pi f_R t / N) & (-N f_R^{-1} \le t < 0) \\ 1 & (0 \le t \le T) \\ w(T - t) & (T < t \le T + N f_R^{-1}). \end{cases} \tag{18}$$
We set $N = 48\,000$ and $T = 4 N f_R^{-1} \approx 1$ s. Consequently, the length of $w(t[i])$ becomes $6N$, and the domain of $w(t)$ becomes $-0.25\ \mathrm{s} \le t \le 1.25\ \mathrm{s}$. Thus, the length of $x[i]\, w(t[i])$ becomes $6N$. (ii) After the multiplication by the window function, the data points are interpolated using the fast Fourier transform (FFT) method with an oversampling factor of $N_{\mathrm{over}} = 64$. As a result, the number of data points increases to $6 N_{\mathrm{over}} N$. The value of $N_{\mathrm{over}}$ is adjusted depending on the required accuracy. We confirmed that $N_{\mathrm{over}} = 64$ is sufficiently large by performing numerical simulation.14 Thanks to the FFTW library used in MATLAB,15 the computation time is almost negligible. We also applied bandwidth limitations to the data to eliminate the DC component. (iii) The interpolated points are connected by straight line segments. After these three steps, a continuous function $\tilde{x}(t)$ is obtained from the discrete data $\{ x[i_{\mathrm{main}} - N], \ldots, x[i_{\mathrm{main}} + 5N] \}$.
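
Step (i) can be illustrated by evaluating the window of Eq. (18) directly. The sketch below uses the values of $N$, $f_R$, and $T$ quoted above; returning 0 outside the stated domain is an assumption added for convenience, and the function name is hypothetical.

```python
import math

F_R = 192_000.0   # recorder sampling rate f_R in Hz
N = 48_000        # length of each taper in samples
T = 4 * N / F_R   # length of the flat region in seconds


def window(t):
    """Blackman-type window w(t) of Eq. (18)."""
    ramp = N / F_R  # duration of each taper in seconds
    if -ramp <= t < 0.0:
        u = F_R * t / N
        return 0.42 + 0.5 * math.cos(math.pi * u) + 0.08 * math.cos(2 * math.pi * u)
    if 0.0 <= t <= T:
        return 1.0
    if T < t <= T + ramp:
        return window(T - t)  # mirror image of the rising taper
    return 0.0  # outside the domain of Eq. (18) (assumption)


# Number of samples covered by the window: two tapers plus the flat region.
n_samples = round((T + 2 * N / F_R) * F_R)
print(n_samples, window(-0.25), window(-0.125), window(0.0), window(1.25))
```

The rising taper starts at zero, reaches exactly 1 at $t = 0$, and the falling taper mirrors it, so the data inside $0 \le t \le T$ are left unweighted.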

Figure 4(a) shows the process to obtain $\tilde{x}(t)$. The range of the horizontal axis is equal to that of Fig. 3(b). The white squares are the recorded waveform $x[i]$, and the black circles are the points interpolated by the FFT method. Solid lines represent $\tilde{x}(t)$. Figure 4(b) presents a magnified view of Fig. 4(a) around the first and second ZCPs. For ease of viewing, the FFT interpolation was performed using an oversampling factor $N_{\mathrm{over}} = 4$ in Figs. 4(a) and 4(b). One can see that $N_{\mathrm{over}} - 1$ points are interpolated between two recorded data points. As shown in Fig. 4(b), the zero-crossing times are labeled as $t = s_1, s_2, \ldots, s_M$, where $M$ denotes the number of ZCPs in $0 \le t \le T$. The obtained sequence $s_1, s_2, \ldots, s_M$ is not equally spaced because of jitter.

FIG. 4.

(a) The recorded waveform $x[i]$ (white square), points obtained by FFT interpolation (black circle), and the continuous function $\tilde{x}(t)$ (solid line). For ease of viewing, the FFT interpolation was performed using an oversampling factor of $N_{\mathrm{over}} = 4$ rather than $N_{\mathrm{over}} = 64$. (b) Magnified graph for $20\ \mu\mathrm{s} \le t \le 80\ \mu\mathrm{s}$.

Second, we identify the equally spaced points $\bar{s}_k$ introduced in Eq. (11). For this purpose, a straight line is fitted to $s_k$ using the least squares method. A conceptual diagram is provided in Fig. 5. The fitting function is written as follows:
$$\bar{s}(k) = \frac{k - 1}{2 f_C} + \bar{s}_1, \tag{19}$$
where $f_C$ is the frequency of the playback signal measured by the recorder. The deviation of $s_k$ from the straight line is less than 100 ps; it is exaggerated in Fig. 5 for ease of visualization. From the fitting function of Eq. (19), the $k$th equidistant point $\bar{s}_k := \bar{s}(k)$ is obtained. The frequency $f_C$ is the averaged frequency during $0 \le t \le T$. Thus, this analysis is sensitive to short-term drift with a frequency of $f \ge 1/T$ but is not sensitive to long-term drift.
FIG. 5.

The zero-crossing time in ms versus zero-crossing index $k$. $s_k$ denotes the zero-crossing time of the recorded data, and $\bar{s}_k$ denotes that of the corresponding pure sinusoidal wave. Consecutive times $\bar{s}_k$ are equally spaced, whereas $s_k$ are not. The difference between $s_k$ and $\bar{s}_k$ is greatly magnified for ease of viewing.

In ZCA, we finally obtain the ZCF, which is the difference between $s_k$ and $\bar{s}_k$ and is expressed as follows:
$$\Delta s_k = s_k - \bar{s}_k. \tag{20}$$
The ZCF $\Delta s_k$ reduces to the jitter $j(\bar{s}_k)$ when both the PI noise $n_{\mathrm{PI}}(t)$ and the recorder noise $a_{\mathrm{total}}(t)$ are negligible. Because the reconstructed function $\tilde{x}(t)$ crosses the $t$-axis twice per cycle, one can obtain ZCF values at a repetition rate of $f_Z = 2 f_C$. In other words, the bandwidth of $j(t)$ reconstructed from $\Delta s_k$ becomes $f \le f_Z / 2 = f_C$. This is expected because jitter resembles frequency modulation, in which it is impossible to transmit a frequency higher than that of the carrier wave. If one observes only the rising or falling ZCPs, the bandwidth of $j(t)$ becomes restricted to $f \le f_C / 2$, which is insufficient to perfectly reconstruct $j(t)$.
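
The chain from zero-crossing detection to the ZCF of Eq. (20) can be sketched on synthetic data. The example below is a simplified illustration, not the analysis code used in this study: instead of FFT interpolation of recorded samples, it evaluates a jittered cosine on an already dense grid (1024 points per cycle), and the 1 ns sinusoidal jitter at 600 Hz, the phase of 0.3 rad, and all function names are assumptions.

```python
import math


def zero_crossings(t, x):
    """Zero-crossing times found by linear interpolation between samples."""
    s = []
    for i in range(len(x) - 1):
        x0, x1 = x[i], x[i + 1]
        if x0 == 0.0:
            s.append(t[i])
        elif x0 * x1 < 0.0:
            s.append(t[i] + (t[i + 1] - t[i]) * x0 / (x0 - x1))
    return s


def fit_line(s):
    """Least-squares fit s_k ~ slope * k + intercept for k = 1..M, cf. Eq. (19)."""
    m = len(s)
    k_mean = (m + 1) / 2
    s_mean = sum(s) / m
    num = sum((k - k_mean) * (sk - s_mean) for k, sk in enumerate(s, start=1))
    den = sum((k - k_mean) ** 2 for k in range(1, m + 1))
    slope = num / den
    return slope, s_mean - slope * k_mean


def zcf(s):
    """Zero-crossing fluctuations: deviation of s_k from the fitted line, Eq. (20)."""
    slope, intercept = fit_line(s)
    return [sk - (slope * k + intercept) for k, sk in enumerate(s, start=1)]


# Synthetic jittered cosine on a dense grid (all values are assumptions).
f_c, n_cycles, pts_per_cycle = 12_000.0, 100, 1024
j_amp, f_j, theta0 = 1e-9, 600.0, 0.3
dt = 1.0 / (f_c * pts_per_cycle)
omega = 2 * math.pi * f_c
t = [i * dt for i in range(n_cycles * pts_per_cycle + 1)]
x = [math.cos(omega * (ti + j_amp * math.sin(2 * math.pi * f_j * ti)) + theta0)
     for ti in t]

s = zero_crossings(t, x)
dz = zcf(s)
rms = math.sqrt(sum(d * d for d in dz) / len(dz))
f_measured = 1.0 / (2.0 * fit_line(s)[0])  # slope of Eq. (19) is 1 / (2 f_C)
print(len(s), f_measured, rms)
```

In this sketch, the RMS of the ZCF should be close to the RMS of the injected jitter, $j_{\mathrm{amp}}/\sqrt{2}$, with a small reduction because the line fit absorbs part of the slow jitter component.
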
First, we consider the case in which one player and one recorder are connected [Fig. 6(a)]. We term this setup a “single recorder setup (SRS).” In an SRS, both the player and recorder noises are included in the ZCFs, which are represented as $\Delta s_k$. The relationships among the player noise, recorder noise, and ZCFs are expressed as follows:
$$(-1)^k \omega A_0 \Delta s_k = n_{\mathrm{jitter}}(\bar{s}_k) + n_{\mathrm{PI}}(\bar{s}_k) + a_{\mathrm{jitter}}(\bar{s}_k) + a_{\mathrm{PI}}(\bar{s}_k). \tag{21}$$
FIG. 6.

(a) SRS. (b) DRS. (c) Setup for separating jitter from PI noise.

In the following, $V\{\cdot\}$ denotes the variance of data. From Eq. (21), the variance of ZCFs $V\{\Delta s_k\}$ becomes
$$V\{\Delta s_k\} = (\sigma_{n1})^2 + (\sigma_{a1})^2, \tag{22}$$
where $\sigma_{n1}$ and $\sigma_{a1}$ are the root mean squares (RMSs) of ZCFs for the player and the recorder, respectively. More explicitly, $\sigma_{n1}$ and $\sigma_{a1}$ can be expressed as follows:
$$(\sigma_{n1})^2 = V\{ j(\bar{s}_k) \} + \frac{V\{ n_{\mathrm{PI}}(\bar{s}_k) \}}{(\omega A_0)^2}, \tag{23}$$
$$(\sigma_{a1})^2 = \frac{V\{ a_{\mathrm{jitter}}(\bar{s}_k) \} + V\{ a_{\mathrm{PI}}(\bar{s}_k) \}}{(\omega A_0)^2}. \tag{24}$$
The left-hand side of Eq. (22) can be obtained using experimental data, the results of which are described in Sec. IV A.
Second, we consider the case in which one player and two recorders are connected [Fig. 6(b)]; this setup is termed a “DRS.” In the DRS, both player and recorder noises are included in each ZCF, which are represented as $\Delta s_k$ and $\Delta r_k$. The relation of Eq. (21) holds for the DRS. As in Eq. (21), $\Delta r_k$ is expressed as
$$(-1)^k \omega A_0 \Delta r_k = n_{\mathrm{jitter}}(\bar{r}_k) + n_{\mathrm{PI}}(\bar{r}_k) + b_{\mathrm{jitter}}(\bar{r}_k) + b_{\mathrm{PI}}(\bar{r}_k), \tag{25}$$
where $\bar{r}_k$ is the equally spaced time obtained by ZCA using the waveform acquired by recorder B (represented as $y[i]$), $b_{\mathrm{jitter}}(t)$ is the jitter of recorder B, and $b_{\mathrm{PI}}(t)$ is the PI noise of recorder B. Because the two recorders simultaneously sample the same playback signal output from one player, $\bar{s}_k = \bar{r}_k$ holds in the real world. Note that $\bar{s}_k$ and $\bar{r}_k$ have a common index $k$. Consequently, $\bar{r}_k$ in Eq. (25) can be replaced by $\bar{s}_k$, and we obtain the following:
$$(-1)^k \omega A_0 \Delta r_k = n_{\mathrm{jitter}}(\bar{s}_k) + n_{\mathrm{PI}}(\bar{s}_k) + b_{\mathrm{jitter}}(\bar{s}_k) + b_{\mathrm{PI}}(\bar{s}_k). \tag{26}$$
Using Eqs. (21) and (26), the following four equations are obtained:
$$V\{\Delta s_k\} = (\sigma_{n2})^2 + (\sigma_{a2})^2, \tag{27}$$
$$V\{\Delta r_k\} = (\sigma_{n2})^2 + (\sigma_{b2})^2, \tag{28}$$
$$V\{\Delta s_k - \Delta r_k\} = (\sigma_{a2})^2 + (\sigma_{b2})^2, \tag{29}$$
$$V\{\Delta s_k + \Delta r_k\} = 4 (\sigma_{n2})^2 + (\sigma_{a2})^2 + (\sigma_{b2})^2, \tag{30}$$
where $\sigma_{n2}$, $\sigma_{a2}$, and $\sigma_{b2}$ are the RMSs of ZCFs for the player, recorder A, and recorder B, respectively. It is important to note that the effect of the player is canceled in $\Delta s_k - \Delta r_k$. Moreover, the player makes a double contribution in $\Delta s_k + \Delta r_k$. This is because $\bar{s}_k$ and $\bar{r}_k$ are common to both instruments, even though $\Delta s_k$ and $\Delta r_k$ are measured on different instruments.

The left-hand sides of these equations can be obtained using experimental data. Using Eqs. (27)–(29), we can evaluate noise in the player ($\sigma_{n2}$) separately from that in the recorders ($\sigma_{a2}$ and $\sigma_{b2}$). Equation (30) can be used to verify the calculations. The experimental results are presented in Sec. IV B.

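These noises, $\sigma_{n2}$, $\sigma_{a2}$, and $\sigma_{b2}$, can be expressed as
$$(\sigma_{n2})^2 = V\{ j(\bar{s}_k) \} + \frac{V\{ n_{\mathrm{PI}}(\bar{s}_k) \}}{(\omega A_0)^2}, \tag{31}$$
$$(\sigma_{a2})^2 = \frac{V\{ a_{\mathrm{jitter}}(\bar{s}_k) \} + V\{ a_{\mathrm{PI}}(\bar{s}_k) \}}{(\omega A_0)^2}, \tag{32}$$
$$(\sigma_{b2})^2 = \frac{V\{ b_{\mathrm{jitter}}(\bar{s}_k) \} + V\{ b_{\mathrm{PI}}(\bar{s}_k) \}}{(\omega A_0)^2}. \tag{33}$$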
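
The separation implied by Eqs. (27)–(29) can be tried on synthetic ZCFs. The sketch below is an illustration under assumptions: the player and recorder contributions are modeled as independent zero-mean Gaussian sequences with RMS values of 40, 30, and 35 ps chosen arbitrarily, and the function names are hypothetical.

```python
import random


def variance(values):
    """Population variance V{.} of a sequence."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def separate(ds, dr):
    """Solve Eqs. (27)-(29) for (sigma_n, sigma_a, sigma_b)."""
    v_s = variance(ds)
    v_r = variance(dr)
    v_diff = variance([a - b for a, b in zip(ds, dr)])
    var_n = (v_s + v_r - v_diff) / 2   # player
    var_a = (v_s - v_r + v_diff) / 2   # recorder A
    var_b = (v_r - v_s + v_diff) / 2   # recorder B
    # Clamp tiny negative estimates that can arise from sampling error.
    return tuple(max(v, 0.0) ** 0.5 for v in (var_n, var_a, var_b))


rng = random.Random(0)   # fixed seed for reproducibility
M = 24_000               # number of ZCPs in 1 s at f_C = 12 kHz
player = [rng.gauss(0.0, 40.0) for _ in range(M)]    # common to both recorders
ds = [p + rng.gauss(0.0, 30.0) for p in player]      # ZCF seen by recorder A (ps)
dr = [p + rng.gauss(0.0, 35.0) for p in player]      # ZCF seen by recorder B (ps)

sigma_n, sigma_a, sigma_b = separate(ds, dr)
v_sum = variance([a + b for a, b in zip(ds, dr)])    # left-hand side of Eq. (30)
print(sigma_n, sigma_a, sigma_b)
print(v_sum, 4 * sigma_n ** 2 + sigma_a ** 2 + sigma_b ** 2)
```

For sample variances, Eq. (30) follows algebraically from Eqs. (27)–(29), so the last line checks the arithmetic of the separation rather than the independence assumption itself.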

Finally, we consider the case shown in Fig. 6(c). This setup enables us to separate jitter from PI noise for the player. It differs from the setup of Fig. 6(b) in that the L and R signals of the player are bundled together. The RMS of ZCFs for the player, represented as $\sigma_{n3}$, can be obtained as in the DRS.

As shown in Fig. 2(d), the dual-channel player comprises two single-channel players that share the same jitter. Moreover, the PI noises of the two single-channel players are independent. Therefore, under the assumptions of this model, the PI-noise contribution to $\sigma_{n3}$ is halved:
$$(\sigma_{n3})^2 = V\{ j(\bar{s}_k) \} + \frac{1}{2} \frac{V\{ n_{\mathrm{PI}}(\bar{s}_k) \}}{(\omega A_0)^2}. \tag{34}$$
Using Eqs. (31) and (34), we can obtain $V\{ j(\bar{s}_k) \}$. Therefore, the RMS of jitter can be determined by measuring $\sigma_{n2}$ and $\sigma_{n3}$. This relationship is expressed as
$$\mathrm{dev}\{ j(\bar{s}_k) \} = \sqrt{ 2 (\sigma_{n3})^2 - (\sigma_{n2})^2 }, \tag{35}$$
where $\mathrm{dev}\{\cdot\}$ denotes the standard deviation, i.e., $\mathrm{dev}\{\cdot\} := \sqrt{V\{\cdot\}}$.
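
Equation (35) can be applied as in the following sketch. The input values are hypothetical (a player with 20 ps of jitter and 40 ps of PI noise expressed in time units) and are not measured values from this study; the expression for the PI term is obtained by subtracting Eq. (34) from Eq. (31), and the function name is an assumption.

```python
import math


def split_jitter_and_pi(sigma_n2, sigma_n3):
    """Return (RMS jitter, RMS PI noise in time units) from Eqs. (31) and (34)."""
    var_j = 2 * sigma_n3 ** 2 - sigma_n2 ** 2         # Eq. (35), squared
    var_pi = 2 * (sigma_n2 ** 2 - sigma_n3 ** 2)      # Eq. (31) minus Eq. (34), doubled
    if var_j < 0 or var_pi < 0:
        raise ValueError("inputs are inconsistent with Eqs. (31) and (34)")
    return math.sqrt(var_j), math.sqrt(var_pi)


# Hypothetical player: 20 ps jitter and 40 ps PI noise (illustrative only).
sigma_n2 = math.sqrt(20.0 ** 2 + 40.0 ** 2)        # single channel, Eq. (31)
sigma_n3 = math.sqrt(20.0 ** 2 + 40.0 ** 2 / 2)    # L and R bundled, Eq. (34)

jitter_rms, pi_rms = split_jitter_and_pi(sigma_n2, sigma_n3)
print(jitter_rms, pi_rms)
```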

In our experiment, we used three identical portable audio devices (DR-100MKIII; TASCAM, Japan). These devices offer several advantages: they are unaffected by the quality of the alternating current power supply, which ensures the reproducibility and independence of the measurement, are inexpensive, and are easy to obtain.

One of the three devices (No. 1) was used as a player, and the others (Nos. 2 and 3) were used as recorders. Figures 7(a) and 7(b) correspond to the SRS and DRS, respectively, as described in Sec. II E. Figure 7(c) shows the setup required to separate jitter from PI noise as described in Sec. II F.

FIG. 7.

(a) SRS; the output, the left channel of device No. 1, was fed simultaneously to the left and right channels of device No. 2. A matching resistor of 200 Ω was inserted. (b) DRS; the output, the left channel of device No. 1, was fed simultaneously to the left and right channels of device No. 2 and No. 3. (c) The output is the sum of the left and right channels of device No. 1 and is fed simultaneously to the left and right channels of device No. 2 and No. 3.

The device settings for the recorders are summarized in Table I. The recording levels for the three setups are adjusted to be equal by inserting a matching resistor as shown in Figs. 7(a) and 7(c). This is necessary to prevent level changes in recordings that affect PI noise.

TABLE I.

Device settings for the recorders.

FILE FORMAT  WAV24 
SAMPLING RATE  192 kHz 
FILE TYPE  STEREO 
XRI  OFF 
DUAL REC  OFF 
SOURCE  EXT LINE 
A/D FILTER  FIR1 
DUAL ADC  ON 
LOW CUT  OFF 
RECORDING LEVEL  +3 dB 

Details of the playback file are described in Sec. II B. The length of the main part is $N_{\mathrm{main}} f_P^{-1} \approx 30$ s. The lengths of the fade-in and fade-out parts are both $N_F f_P^{-1} \approx 5$ s. Therefore, there are $(N_{\mathrm{main}} + 2 N_F)/4 = 480\,000$ cycles of sinusoidal waves in the playback signal, and the same is true for the recorded waveform. In our improved TDA, the analysis program counts the number of cycles in the two sinusoidal waves from two recorders and assigns a common zero-crossing index. This characteristic is of importance in Sec. IV B.

Using the SRS [Fig. 7(a)], we played back and recorded a sinusoidal wave of $f_C = 12$ kHz. To eliminate low-frequency noise that does not originate from the clock in the player, the recorded waveform was processed in a limited bandwidth range of $f_C - B_w \le f \le f_C + B_w$. Consequently, we analyzed jitter in the bandwidth of $1/T \le f \le B_w$. In this analysis, $B_w$ was set to 6 kHz. Then, the ZCF was obtained using MATLAB code. Figure 8(a) shows the obtained ZCF, $\Delta s_1, \Delta s_2, \ldots, \Delta s_M$. Figure 8(b) shows the distribution of the obtained ZCF, which resembles a Gaussian curve. As expressed in Eq. (22), this ZCF includes the effects of jitter and PI noise from both the player and recorder. The RMS of the ZCF is $\{ (\sigma_{n1})^2 + (\sigma_{a1})^2 \}^{1/2} = 55.3$ ps.

FIG. 8.

(a) ZCF obtained for 1 s. There are M = 24 000 ZCPs. (b) Histogram of the obtained ZCF.

Using the DRS [Fig. 7(b)], we obtained the ZCFs $\Delta s_k$ and $\Delta r_k$ from devices Nos. 2 and 3, respectively. Figure 9 shows the distributions of $\Delta s_k - \Delta r_k$ and $\Delta s_k + \Delta r_k$, wherein the former is clearly narrower than the latter. This is because the effect of the player is canceled out in $\Delta s_k - \Delta r_k$, whereas it makes a double contribution in $\Delta s_k + \Delta r_k$. From the experimental data, we can determine
$$E_1 = \mathrm{dev}\{\Delta s_k\}, \tag{36}$$
$$E_2 = \mathrm{dev}\{\Delta r_k\}, \tag{37}$$
$$E_3 = \mathrm{dev}\{\Delta s_k - \Delta r_k\}, \tag{38}$$
$$E_4 = \mathrm{dev}\{\Delta s_k + \Delta r_k\}, \tag{39}$$
where $E_1$, $E_2$, $E_3$, and $E_4$ are the standard deviations calculated from $\{\Delta s_1, \ldots, \Delta s_M\}$, $\{\Delta r_1, \ldots, \Delta r_M\}$, $\{\Delta s_1 - \Delta r_1, \ldots, \Delta s_M - \Delta r_M\}$, and $\{\Delta s_1 + \Delta r_1, \ldots, \Delta s_M + \Delta r_M\}$, respectively. The values of $E_1$, $E_2$, $E_3$, and $E_4$ are obtained as $E_1 = 56.0$ ps, $E_2 = 56.1$ ps, $E_3 = 50.6$ ps, and $E_4 = 100.0$ ps, respectively. Using Eqs. (27), (28), and (29), we obtain the following RMS value of the ZCF for the player:
$$\sigma_{n2} = 43.1\ \mathrm{ps}, \tag{40}$$
and those of the recorders as
$$\sigma_{a2} = 35.7\ \mathrm{ps}, \tag{41}$$
$$\sigma_{b2} = 35.9\ \mathrm{ps}. \tag{42}$$
FIG. 9.

Histograms of $\Delta s_k - \Delta r_k$ and $\Delta s_k + \Delta r_k$. The RMS values of the former and latter were 50.6 and 100.0 ps, respectively.


These values satisfy Eq. (30).
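As a numerical cross-check, the sketch below recomputes the values in Eqs. (40)–(42) from $E_1$ to $E_4$. Eqs. (27)–(30) are not reproduced in this section, so the relations used here are assumptions: the player and recorder contributions are taken to be mutually uncorrelated, giving $E_1^2 = \sigma_{n2}^2 + \sigma_{a2}^2$, $E_2^2 = \sigma_{n2}^2 + \sigma_{b2}^2$, $E_3^2 = \sigma_{a2}^2 + \sigma_{b2}^2$, and $E_4^2 = 4\sigma_{n2}^2 + \sigma_{a2}^2 + \sigma_{b2}^2$. The function name is hypothetical.

```python
import math

def separate_player_and_recorders(e1, e2, e3, e4):
    """Split ZCF deviations into player and recorder parts (all in ps).

    Assumed model (uncorrelated contributions):
      e1^2 = n^2 + a^2,  e2^2 = n^2 + b^2,
      e3^2 = a^2 + b^2,  e4^2 = 4 n^2 + a^2 + b^2.
    """
    n_sq = (e4**2 - e3**2) / 4.0   # player term survives only in the sum
    a_sq = e1**2 - n_sq            # recorder attached to device No. 2
    b_sq = e2**2 - n_sq            # recorder attached to device No. 3
    return math.sqrt(n_sq), math.sqrt(a_sq), math.sqrt(b_sq)

sigma_n2, sigma_a2, sigma_b2 = separate_player_and_recorders(56.0, 56.1, 50.6, 100.0)
# Consistency check: the recorder terms alone should reproduce E3.
e3_check = math.sqrt(sigma_a2**2 + sigma_b2**2)
print(f"player {sigma_n2:.1f} ps, recorders {sigma_a2:.1f} and {sigma_b2:.1f} ps, "
      f"E3 check {e3_check:.1f} ps")
```

Under these assumptions the results agree with the reported 43.1, 35.7, and 35.9 ps after rounding, and the recomputed $E_3$ agrees with the measured 50.6 ps.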

In this subsection, we obtain the jitter and PI noise of the player separately. We measured a ZCF using the setup shown in Fig. 7(c). Using the same method as in Sec. IV B, we determined the RMS values of the ZCF of the player, $\sigma_{n3}$. The obtained value is
$$\sigma_{n3} = 33.5\ \mathrm{ps}. \tag{43}$$
This is smaller than $\sigma_{n2} = 43.1$ ps calculated in Sec. IV B. As expressed by Eqs. (31)–(34), this difference can be interpreted as a decrease in PI noise due to averaging. From these, we obtain the jitter and PI noise of the player as
$$\mathrm{dev}\{j(s_k)\} = 19.7\ \mathrm{ps}, \tag{44}$$
$$\frac{\mathrm{dev}\{n_{\mathrm{PI}}(s_k)\}}{\omega A_0} = 38.4\ \mathrm{ps}. \tag{45}$$
These results indicate that PI noise must be considered when evaluating the jitter in recent audio equipment with small jitter values, typically less than 100 ps. As noted in Sec. IV A, jitter was calculated in the bandwidth of $1\ \mathrm{Hz} \le f \le 6\ \mathrm{kHz}$. Meanwhile, PI noise was analyzed in the bandwidth of $6\ \mathrm{kHz} \le f \le 18\ \mathrm{kHz}$.
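The separation in Eqs. (44) and (45) can be checked with a short sketch. Eqs. (31)–(34) are not shown in this section, so the model below is an assumption: the jitter is common to the L and R outputs, whereas their PI noises are independent with equal variance. Summing the two channels then halves the PI variance in time units, i.e., $\sigma_{n2}^2 = J^2 + P^2$ and $\sigma_{n3}^2 = J^2 + P^2/2$, where $J$ and $P$ denote the jitter and PI contributions. The function name is hypothetical.

```python
import math

def separate_jitter_and_pi(sigma_single, sigma_summed):
    """Return (jitter, PI noise) in the units of the inputs.

    Assumed model:
      sigma_single^2 = J^2 + P^2      (one output channel)
      sigma_summed^2 = J^2 + P^2 / 2  (L and R outputs summed)
    """
    pi_sq = 2.0 * (sigma_single**2 - sigma_summed**2)
    jitter_sq = sigma_single**2 - pi_sq
    if pi_sq < 0.0 or jitter_sq < 0.0:
        raise ValueError("inputs are inconsistent with the assumed model")
    return math.sqrt(jitter_sq), math.sqrt(pi_sq)

jitter_ps, pi_ps = separate_jitter_and_pi(43.1, 33.5)
print(f"jitter {jitter_ps:.1f} ps, PI noise {pi_ps:.1f} ps")
```

With $\sigma_{n2} = 43.1$ ps and $\sigma_{n3} = 33.5$ ps, this model gives values that round to the reported 19.7 and 38.4 ps.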
Using the proposed method, jitter can be measured with higher accuracy than when using existing methods.1–5 This is partly due to the recent improvement in the performance of ADCs. We measured waveforms at 192 kHz and 24 bit, whereas 16-bit DACs of 44.1 or 48 kHz were used in previous studies.5 The proposed method has a detection limit set by quantization noise, which is represented as $j_{\mathrm{LSB}}$ and can be obtained by solving the following equation:
$$x_{\max}\,\frac{A_0\,\omega\,j_{\mathrm{LSB}}}{A_R} = 1. \tag{46}$$
The result becomes $j_{\mathrm{LSB}} \approx 1.76$ ps.
The mean values of $\sigma_{n2}$ and $\mathrm{dev}\{j(s_k)\}$ are obtained as follows:
$$\sigma_{n2} = 42.34(14)\ \mathrm{ps}, \tag{47}$$
$$\mathrm{dev}\{j(s_k)\} = 21.1(6)\ \mathrm{ps}, \tag{48}$$
using the values of $\sigma_{n2}$ and $\mathrm{dev}\{j(s_k)\}$ for ten different time intervals. The numbers in parentheses represent the standard deviation of the mean. Therefore, the detection limit lies between $j_{\mathrm{LSB}}$ and 21.1 ps and is expected to be less than 10 ps. Furthermore, the detection limit depends on the recorders employed; more accurate measurements are possible when higher-performance recorders are used.

We now consider the phase dependence of the total playback noise, i.e., $n_{\mathrm{total}}(t)$. One might expect that phase dependence analysis enables the separation of jitter, AM, and PI noise; unfortunately, this approach is not promising, as shown below.

First, we demonstrate that the phase dependence of $n_{\mathrm{total}}(t)$ can always be expressed by two parameters, $A$ and $B$. For this purpose, we express time $t$ with phase $\theta$ and the number of cycles $m$ as follows:
$$t(\theta, m) = \frac{\theta - \theta_0 + 2\pi(m - 1)}{\omega}, \tag{49}$$
where the maximum value of $m$, represented as $m_{\max}$, is set to
$$m_{\max} = \mathrm{floor}(\omega T / 2\pi), \tag{50}$$
and the domain of $\theta$ is restricted to
$$0 \le \theta < 2\pi. \tag{51}$$
Notably, $t(\pi/2, m) = s_{2m-1}$ and $t(3\pi/2, m) = s_{2m}$. In the following, for simplicity, we replace the expression of $n_{\mathrm{total}}(t(\theta, m))$ with $n_{\mathrm{total}}(\theta, m)$. The same rule is also applied to $j(t)$, $\mathrm{AM}(t)$, and $n_{\mathrm{PI}}(t)$. As a result, the playback noise is expressed as
$$n_{\mathrm{total}}(\theta, m) = A_0\,\omega\,j(\theta, m)\sin\theta + \mathrm{AM}(\theta, m)\cos\theta + n_{\mathrm{PI}}(\theta, m). \tag{52}$$
In the following, $V\{n_{\mathrm{total}}(\theta, m)\}$ denotes the variance calculated from $n_{\mathrm{total}}(\theta, 2), n_{\mathrm{total}}(\theta, 3), \ldots, n_{\mathrm{total}}(\theta, m_{\max} - 1)$. From Eq. (52), we obtain
$$V\{n_{\mathrm{total}}(\theta, m)\} = \frac{1 - \cos(2\theta)}{2}(\omega A_0)^2 V\{j(\theta, m)\} + \frac{1 + \cos(2\theta)}{2} V\{\mathrm{AM}(\theta, m)\} + V\{n_{\mathrm{PI}}(\theta, m)\} \tag{53}$$
$$= A\cos(2\theta) + B, \tag{54}$$
where we set
$$A := \frac{V\{\mathrm{AM}(\theta, m)\} - (\omega A_0)^2 V\{j(\theta, m)\}}{2}, \tag{55}$$
$$B := V\{n_{\mathrm{PI}}(\theta, m)\} + \frac{(\omega A_0)^2 V\{j(\theta, m)\} + V\{\mathrm{AM}(\theta, m)\}}{2}. \tag{56}$$
We assume that $V\{j(\theta, m)\}$, $V\{\mathrm{AM}(\theta, m)\}$, and $V\{n_{\mathrm{PI}}(\theta, m)\}$ do not depend on $\theta$. Therefore, the phase dependence of $V\{n_{\mathrm{total}}(\theta, m)\}$ can always be expressed by two parameters $A$ and $B$ provided the assumptions adopted above are valid.

Consequently, the following behavior can be confirmed: (i) when jitter and AM are not negligible, the offset $B$ is not equal to PI noise; (ii) when jitter and AM are comparable, amplitude $A$ vanishes; (iii) when PI noise is negligible, one can obtain jitter and AM by calculating $B \pm A$; (iv) when PI noise is not negligible, one cannot obtain jitter, AM, and PI noise from $A$ and $B$. Behavior (iv) indicates that further considerations are necessary to separate $V\{j(\theta, m)\}$ from $B$. The procedure designed for this purpose is explained in Secs. II F and IV C.
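Behavior (iv) can be illustrated numerically. The sketch below assumes that the three noise sources are independent, zero-mean Gaussian sequences with phase-independent variances, and it works with the scaled jitter variance $(\omega A_0)^2 V\{j\}$ directly. The function names and the numerical values are hypothetical. Two different sets of variances are shown to yield the same $A$ and $B$, so the three unknowns cannot be recovered from the two measured parameters.

```python
import math
import random

def amplitude_and_offset(var_j_scaled, var_am, var_pi):
    """Return (A, B) of Eqs. (55) and (56); var_j_scaled = (omega*A0)^2 * V{j}."""
    a = (var_am - var_j_scaled) / 2.0
    b = var_pi + (var_j_scaled + var_am) / 2.0
    return a, b

def phase_variance(theta, var_j_scaled, var_am, var_pi, n_cycles=20000, seed=0):
    """Sample variance of the Eq. (52) model at phase theta over many cycles.

    Assumption: independent zero-mean Gaussian sources. The sign of the
    jitter term does not affect the variance.
    """
    rng = random.Random(seed)
    sd_j, sd_am, sd_pi = (math.sqrt(v) for v in (var_j_scaled, var_am, var_pi))
    samples = [
        rng.gauss(0.0, sd_j) * math.sin(theta)
        + rng.gauss(0.0, sd_am) * math.cos(theta)
        + rng.gauss(0.0, sd_pi)
        for _ in range(n_cycles)
    ]
    mean = sum(samples) / n_cycles
    return sum((s - mean) ** 2 for s in samples) / (n_cycles - 1)

# Two different noise budgets (arbitrary units) with identical A and B.
a, b = amplitude_and_offset(4.0, 2.0, 1.0)
a_alt, b_alt = amplitude_and_offset(3.0, 1.0, 2.0)
print(a, b, a_alt, b_alt)

# Simulated variance at theta = pi/2 should approach B - A.
var_quarter = phase_variance(math.pi / 2.0, 4.0, 2.0, 1.0)
print(f"simulated {var_quarter:.2f}, predicted {b - a:.2f}")
```

At $\theta = \pi/2$ the AM term drops out and the variance reduces to the sum of the scaled jitter variance and the PI variance, which is why $B - A$ alone cannot isolate the jitter when PI noise is present.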

As noted in the introduction, the DRS is similar to the setup of CSM.8 CSM can be regarded as a combination of FDA and DRS, whereas the proposed method is a combination of ZCA and DRS. In CSM, the noise contributions of the two instruments are reduced by averaging, so that the cross-spectrum converges to the power spectrum of the device under test.

Commercial products based on CSM are designed to evaluate clock generators with frequencies above 1 MHz.16,17 This is primarily because frequency conversion in the audio frequency range is technically challenging. Hence, audio signals have not been assessed with CSM so far. The proposed method, a combination of ZCA and DRS, can handle audio signals and appears to be a feasible substitute for CSM.

In this subsection, we separately obtain the jitter and PI noise of a recorder. As shown in Fig. 3(c), the dual-channel recorder comprises two single-channel recorders with common jitter. The PI noises of L and R inputs are independent and are represented as $a_{\mathrm{PI},L}(t)$ and $a_{\mathrm{PI},R}(t)$, respectively. The ZCFs of the recorded waveforms L and R are represented as $\Delta s_k^{(L)}$ and $\Delta s_k^{(R)}$, respectively. Similar to Eqs. (27)–(33), we obtain the following equations:
$$(E_5)^2 = V\{\Delta s_k^{(L)}\} = (\sigma_{n2})^2 + \frac{V\{a_{\mathrm{jitter}}(s_k)\}}{(\omega V_0)^2} + \frac{V\{a_{\mathrm{PI},L}(s_k)\}}{(\omega V_0)^2}, \tag{57}$$
$$(E_6)^2 = V\{\Delta s_k^{(R)}\} = (\sigma_{n2})^2 + \frac{V\{a_{\mathrm{jitter}}(s_k)\}}{(\omega V_0)^2} + \frac{V\{a_{\mathrm{PI},R}(s_k)\}}{(\omega V_0)^2}, \tag{58}$$
$$(E_7)^2 = V\{\Delta s_k^{(L)} - \Delta s_k^{(R)}\} = \frac{V\{a_{\mathrm{PI},L}(s_k)\} + V\{a_{\mathrm{PI},R}(s_k)\}}{(\omega V_0)^2}, \tag{59}$$
$$(E_8)^2 = V\{\Delta s_k^{(L)} + \Delta s_k^{(R)}\} = 4(\sigma_{n2})^2 + \frac{4V\{a_{\mathrm{jitter}}(s_k)\}}{(\omega V_0)^2} + \frac{V\{a_{\mathrm{PI},L}(s_k)\} + V\{a_{\mathrm{PI},R}(s_k)\}}{(\omega V_0)^2}, \tag{60}$$
where $E_5$, $E_6$, $E_7$, and $E_8$ are the standard deviations calculated from $\{\Delta s_1^{(L)}, \ldots, \Delta s_M^{(L)}\}$, $\{\Delta s_1^{(R)}, \ldots, \Delta s_M^{(R)}\}$, $\{\Delta s_1^{(L)} - \Delta s_1^{(R)}, \ldots, \Delta s_M^{(L)} - \Delta s_M^{(R)}\}$, and $\{\Delta s_1^{(L)} + \Delta s_1^{(R)}, \ldots, \Delta s_M^{(L)} + \Delta s_M^{(R)}\}$, respectively. Using the experimental data generated herein, $E_5$, $E_6$, $E_7$, and $E_8$ are obtained as 63.7, 63.1, 61.9, and 110.6 ps, respectively. Consequently, we obtain
$$\frac{\mathrm{dev}\{a_{\mathrm{PI},L}(s_k)\}}{\omega V_0} = 44.3\ \mathrm{ps}, \tag{61}$$
$$\frac{\mathrm{dev}\{a_{\mathrm{PI},R}(s_k)\}}{\omega V_0} = 43.3\ \mathrm{ps}, \tag{62}$$
$$(\sigma_{n2})^2 + \frac{V\{a_{\mathrm{jitter}}(s_k)\}}{(\omega V_0)^2} = (45.9\ \mathrm{ps})^2. \tag{63}$$
With $\sigma_{n2} = 43.1$ ps in Eq. (40), we obtain
$$\frac{\mathrm{dev}\{a_{\mathrm{jitter}}(s_k)\}}{\omega V_0} = 15.7\ \mathrm{ps}. \tag{64}$$
The player and recorders used in this experiment are the same product; consequently, we anticipate comparable jitters in the player and the recorders. The results of Eqs. (44) and (64) support this expectation.
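The recorder-side separation can be reproduced approximately with the sketch below. It is one possible way to solve Eqs. (57)–(59) for the three unknowns (the common term and the two PI terms), with Eq. (60) used only as a consistency check; the authors' exact procedure is not stated here, and the function name is hypothetical. All quantities are expressed in picoseconds, i.e., already divided by $\omega V_0$.

```python
import math

def separate_recorder_noise(e5, e6, e7, sigma_player):
    """Solve Eqs. (57)-(59) for the recorder's PI noises and jitter (ps).

    common^2 = sigma_player^2 + jitter^2 is shared by the L and R channels;
    pi_l^2 and pi_r^2 are the independent PI terms.
    """
    pi_l_sq = (e5**2 - e6**2 + e7**2) / 2.0
    pi_r_sq = e7**2 - pi_l_sq
    common_sq = e5**2 - pi_l_sq
    jitter_sq = common_sq - sigma_player**2
    return tuple(math.sqrt(v) for v in (pi_l_sq, pi_r_sq, common_sq, jitter_sq))

pi_l, pi_r, common, rec_jitter = separate_recorder_noise(63.7, 63.1, 61.9, 43.1)
# Eq. (60) as a cross-check: predicted deviation of the L + R sum.
e8_check = math.sqrt(4.0 * common**2 + pi_l**2 + pi_r**2)
print(f"PI L {pi_l:.1f} ps, PI R {pi_r:.1f} ps, common {common:.1f} ps, "
      f"recorder jitter {rec_jitter:.1f} ps, E8 check {e8_check:.1f} ps")
```

The results agree with Eqs. (61)–(64) to within about 0.1 ps; the small residual differences are presumably due to rounding of the reported $E_5$ to $E_8$.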

Herein, we proposed an efficient and powerful method for highly accurate jitter measurements. This method is based on two key elements: ZCA and DRS. The ZCA enables us to determine the zero-crossing times of the voltage signals in the recorders ($s_k$ and $r_k$) and those of the pure sinusoidal waves ( s k and r k) by analyzing the recorded waveforms ($x[i]$ and $y[i]$). Their respective differences, the ZCFs ($\Delta s_k$ and $\Delta r_k$), contain information about both player noise ($\sigma_{n2}$) and recorder noises ($\sigma_{a2}$ and $\sigma_{b2}$). If one measures ZCFs with a DRS, it is possible to eliminate recorder noise from ZCFs by calculating positive and negative correlations between ZCFs ($V\{\Delta s_k + \Delta r_k\}$ and $V\{\Delta s_k - \Delta r_k\}$). As a result, one can independently determine player noise. The player noise ($\sigma_{n2}$) results from the jitter ($\mathrm{dev}\{j(s_k)\}$) and PI noise ($\mathrm{dev}\{n_{\mathrm{PI}}(s_k)\}$). To separate them, some considerations are required. An example of such a procedure is to measure player noise when L and R outputs are bundled together ($\sigma_{n3}$).

We demonstrated the proposed method using commercial audio equipment. The RMS values of jitter and PI noise were determined as $\mathrm{dev}\{j(s_k)\} \approx 20$ ps and $\mathrm{dev}\{n_{\mathrm{PI}}(s_k)\}/(\omega A_0) \approx 40$ ps, respectively. These results show that the proposed method can evaluate values of jitter that are smaller than PI noise. The high accuracy of the proposed method makes it a powerful tool for developing ultrahigh-performance devices in the future. Using such devices, more definite and quantitative studies of real-life sounds, such as music, become possible. This will form the basis of future investigations.

1. J. Dunn, “Jitter: Specification and assessment in digital audio equipment,” in Proceedings of the 93rd AES Convention, San Francisco, CA (October 1–4, 1992), p. 3361.
2. J. Dunn and I. Dennis, “The diagnosis and solution of jitter-related problems in digital audio systems,” in Proceedings of the 96th AES Convention, Amsterdam, The Netherlands (February 26–March 1, 1994), p. 3868.
3. J. Dunn, “The diagnosis and solution of jitter-related problems in digital audio systems,” in Proceedings of the AES 9th UK Conference: Managing the Bit Budget (MBB), London, UK (May 16–17, 1994), pp. 148–166.
4. J. Dunn, “Jitter Theory,” Audio Precision TECHNOTE TN-23, Audio Precision, Beaverton, OR (2000).
5. A. Nishimura and N. Koizumi, “Measurement of sampling jitter in analog-to-digital and digital-to-analog converters using analytic signals,” Acoust. Sci. Technol. 31(2), 172–180 (2010).
6. W. R. MacLean, “Absolute measurement of sound without a primary standard,” J. Acoust. Soc. Am. 12, 140–146 (1940).
7. S. Barrera-Figueroa, “Free-field reciprocity calibration of measurement microphones at frequencies up to 150 kHz,” J. Acoust. Soc. Am. 144(4), 2575–2583 (2018).
8. E. Rubiola and F. Vernotte, “The cross-spectrum experimental method,” arXiv:1003.0113 (2010).
9. K. Ashihara, S. Kiryu, N. Koizumi, A. Nishimura, J. Ohga, M. Sawaguchi, and S. Yoshikawa, “Detection threshold for distortions due to jitter on digital audio,” Acoust. Sci. Technol. 26(1), 50–54 (2005).
10. V. R. Melchior, “High resolution audio: A history and perspective,” J. Audio Eng. Soc. 67(5), 246–257 (2019).
11. H. Nittono, “High-frequency sound components of high-resolution audio are not detected in auditory sensory memory,” Sci. Rep. 10, 21740 (2020).
12. M. Takeuchi and H. Saito, “A method for analyzing sampling jitter in audio equipment,” arXiv:2305.04531 (2023).
13. The player yields a sinusoidal wave of $f_C = f_P/4$; however, the frequency measured by the recorder is not exactly $f_P/4$.
14. See supplementary material at https://doi.org/10.1121/10.0020291 for the results of numerical simulation.
15. M. Frigo and S. G. Johnson, “FFTW (version 3.3.10) [computer program],” http://www.fftw.org (2005) (Last viewed June 2, 2023).
16. G. Feldhaus and A. Roth, “A 1 MHz to 50 GHz direct down-conversion phase noise analyzer with cross-correlation,” in 2016 European Frequency and Time Forum (EFTF) (2016), pp. 1–4.
17. G. Feldhaus, G. Roesel, A. Roth, and J. Wolle, “Measurement uncertainty analysis and traceability for phase noise,” Application Note No. 1EF95, Rohde & Schwarz, Munich, Germany (2016).

Supplementary Material