Music archives store vast amounts of audio data and serve different interests. Some of the most prominent archives belong to the commercial streaming services that deliver music on demand to worldwide audiences. Users of Pandora, SoundCloud, Spotify, and other streamers access music through curated or user-generated playlists. Alternatively, they can search for an artist, song, or genre and scroll through the results on their devices.

The streamers’ collaborative filtering algorithms also generate recommendations based on the preferences of other users. Motivated by commercial gain, major music companies promote their content by inducing the streaming services to suggest it. Their songs duly appear in heavy rotation. Combined, the filtering and the manipulation lead to so-called echo chambers, in which the recommendations circulating within a distinct group of people gain self-reinforcing momentum. Members end up listening to the same small subset of the music catalog.

Ethnomusicological archives are another type of audio repository. They aim to provide access to a wide variety of field recordings from different cultures around the world. The Smithsonian Folkways Recordings, the Berlin Phonogram Archive, and the Ethnographic Sound Recordings Archive (ESRA, esra.fbkultur.uni-hamburg.de), in which the three of us are involved, are just three examples. The reasons for maintaining such archives are various. Some are dedicated to preserving musical cultural heritage; others focus on education.

Colores Salsa (2018), Valerie Vescovi, www.valvescoviart.com

Repositories of a third type store the sounds of musical instruments. The institutions that maintain them focus on providing access to large sets of well-recorded sound samples. One use of such data is for the physical modeling of musical instruments. Researchers solve the differential equations of continuum dynamics to create realistic sounds that mimic the recordings. To make that possible, archives store the responses of instrument parts to impulsive forces and imaging data—such as CT scans of instruments or computer-aided design (CAD) models of them. Also stored are textual data from bibliographical research, such as instrument descriptions, provenance, and technical drawings. The databases are valuable for investigations of historic instruments whose surviving examples are unplayable.

The increasing demand in digital humanities for big data sets has led to another type of archive as cultural institutions rush to digitize their musical recordings and other holdings. Unfortunately, when such projects are complete, the result is often a vast trove of uncurated data.

The subject of our article, computational phonogram archiving, has emerged as a unified solution to tackling some of the generic problems of music archives, such as classifying recordings, while also addressing their particular shortcomings, such as echo-chamber playlists.1 Its primary approach is to analyze, organize, and eventually understand music by comparing large sets of musical pieces in an automated manner.

As we describe below, the first step in computational phonogram archiving involves using extraction algorithms to transform music into numerical representations of melody, timbre, rhythm, form, and other properties. In the second step, machine learning algorithms derive mood, genre, meaning, and other higher-level representations of music.

Music archives consist of audio files and accompanying metadata in text format. Metadata typically include extra-musical information, such as the artist’s name, genre, title, year of recording, and publisher. Archives of ethnographic field recordings also collect and store metadata about the origin of the audio. That information could include geographic region, GPS-derived latitude and longitude, the ethnic identity of the performers, and additional notes about the circumstances of the recording. If an archive focuses on preserving old recordings, metadata might also include information about the condition of the original media.
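How such metadata might be organized can be conveyed by a simple record structure. The field names below are hypothetical and merely illustrate the kinds of information described above; they are not ESRA’s actual schema.

from dataclasses import dataclass
from typing import Optional

@dataclass
class RecordingMetadata:
    # Extra-musical information (hypothetical field names)
    title: str
    artist: str
    genre: str
    year: Optional[int]                     # may be unknown for old field recordings
    publisher: Optional[str] = None
    # Provenance of ethnographic field recordings
    region: Optional[str] = None
    latitude: Optional[float] = None
    longitude: Optional[float] = None
    ethnic_group: Optional[str] = None
    notes: str = ""
    media_condition: Optional[str] = None   # state of the original recording medium

entry = RecordingMetadata(title="Lullaby", artist="unknown", genre="field recording",
                          year=None, region="Alps")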

By using a search engine, users can query an archive’s database through its metadata. But when the metadata are incomplete or absent, an alternative strategy comes into play.2 Algorithms applied to musical recordings compute numerical representations of the characteristics of the audio data. At the most basic level, the representations, called audio features, quantify specific signal properties, such as the number of times per second the waveform crosses the zero-level axis and the amount of energy in various frequency bands.
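A minimal sketch of those two low-level features, written in Python with NumPy; the frame length and band edges are arbitrary choices for illustration, and the sine wave stands in for a real recording.

import numpy as np

def zero_crossing_rate(x, sr, frame_len=2048):
    # Zero crossings per second, estimated frame by frame.
    rates = []
    for start in range(0, len(x) - frame_len, frame_len):
        frame = x[start:start + frame_len]
        signs = np.signbit(frame)
        crossings = np.count_nonzero(signs[1:] != signs[:-1])
        rates.append(crossings * sr / frame_len)
    return np.array(rates)

def band_energies(x, sr, edges=(0, 200, 800, 3200, 12800)):
    # Energy of the magnitude spectrum in a few frequency bands.
    power = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
    return np.array([power[(freqs >= lo) & (freqs < hi)].sum()
                     for lo, hi in zip(edges[:-1], edges[1:])])

sr = 44100
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t)              # one second of a 440 Hz tone
print(zero_crossing_rate(x, sr).mean())      # roughly 880 crossings per second
print(band_energies(x, sr))                  # energy concentrated in the 200-800 Hz band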

To compute audio features that correlate with the auditory perception of human listeners, researchers employ methods drawn from the field of psychoacoustics.3 Since Hermann von Helmholtz’s groundbreaking derivation in 1863 of the Western major scale from perceptions of musical roughness, psychoacoustical investigations have explored the diverse relationships between physical properties of sound and the effects they have on auditory perception. The perception of loudness, for example, does not depend on amplitude alone; it also depends strongly on frequency and on whether a sound’s component frequencies are integer multiples of each other.

Search and retrieval engines that interrogate audio files can directly compare data sets only when they are the same size. One way to circumvent that limitation is to collapse the natural time dependence of the audio features. To that end, several applications use low-order moments, such as the mean and variance, to aggregate over the time dimension. In subsequent stages of the interrogation, those values replace the original features.
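A sketch of that aggregation, assuming the per-frame features of a piece are stored as a NumPy array with one row per analysis frame:

import numpy as np

def aggregate(frames):
    # Collapse an (n_frames, n_features) time series into a fixed-size
    # vector of per-feature means and variances.
    return np.concatenate([frames.mean(axis=0), frames.var(axis=0)])

piece_a = np.random.rand(480, 12)    # 12 features over 480 frames
piece_b = np.random.rand(1500, 12)   # a longer piece, same 12 features
print(aggregate(piece_a).shape, aggregate(piece_b).shape)   # both (24,)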

Another approach is to model the distribution of an audio feature by using a mixture model. In such a case, the estimated model parameters stand in for the original time series. A similar but more specialized approach is the use of a hidden Markov model (HMM).4 The parameters of a trained HMM provide insight into the process that might have generated a particular instance at hand (see the box “Finding rhythm with a hidden Markov model”). Such a representation is especially useful in archives that feature music from different cultures, because its generality allows for an unbiased, extracultural viewpoint on the sound.
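A hedged sketch of the mixture-model variant, using scikit-learn’s GaussianMixture; the number of components and the random feature matrix are illustrative placeholders.

import numpy as np
from sklearn.mixture import GaussianMixture

frames = np.random.rand(1000, 12)    # stand-in for a feature time series

gmm = GaussianMixture(n_components=3, covariance_type="diag", random_state=0)
gmm.fit(frames)

# The fitted weights, means, and variances replace the original time series
# with a representation whose size is independent of the piece's duration.
compact = np.concatenate([gmm.weights_, gmm.means_.ravel(), gmm.covariances_.ravel()])
print(compact.shape)    # (75,) regardless of how many frames the piece has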

Finding rhythm with a hidden Markov model

Researchers at the Ethnographic Sound Recordings Archive use a hidden Markov model (HMM) to derive a representation of a piece’s rhythm in terms of the succession of timbres it includes. An onset detection algorithm estimates the time point of the beginning of each note. Feature extraction is performed only around those time points.

The training procedure considers the probabilities of changing from one timbre to another and then adapts the model’s transition probability matrix. The matrix therefore embodies the internal structure of a trained rhythm. As such, it functions as a prototype of the overall “global” rhythm of a piece of music. The original time series is one instance of the global rhythmical profile.

How the process plays out is represented by the accompanying figures. The top one illustrates a drum groove as a waveform diagram. Colored rectangles mark the regions subject to feature extraction. The colors correspond to the three instruments that played the groove. Two of them, bass drum (green) and snare drum (blue), are treated individually. The third instrument, the hi-hat, is split into two: hi-hat 2 (yellow), which is the cymbal’s pure sound, and hi-hat 1 (coral pink), which is a polyphonic timbre that combines the pure hi-hat sound with the decay of the bass drum.

The lower figure shows a graphical representation of the rhythmical profile. Each node corresponds to an HMM state. Edges represent the node transitions with probabilities greater than zero. The HMM quantifies a counterintuitive finding: Timbre influences how rhythm is perceived.
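The following sketch shows how such a rhythm profile might be estimated with the third-party hmmlearn package, assuming timbre feature vectors have already been extracted around the detected onsets; it is an illustration of the idea, not ESRA’s production code.

import numpy as np
from hmmlearn import hmm

# Stand-in for timbre features extracted at note onsets: one row per onset.
onset_features = np.random.rand(200, 8)

# Four hidden states, loosely corresponding to the four timbres in the example.
model = hmm.GaussianHMM(n_components=4, covariance_type="diag", n_iter=100,
                        random_state=0)
model.fit(onset_features)

# The trained transition matrix embodies the "global" rhythmical profile;
# the decoded state sequence is one instance of it.
print(np.round(model.transmat_, 2))
print(model.predict(onset_features)[:16])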

Audio features are vectors, or points, in a high-dimensional space. How can such a point cloud help users navigate through an archive’s contents? The key idea is that similar feature vectors occupy nearby regions of the space in which they reside. The vectors’ mutual distance is therefore a natural measure of their similarity. The same is true for the musical pieces that the vectors characterize: If their feature vectors are close together, their content is similar. Methods from artificial intelligence enable the exploration of enormous sets of feature vectors.5

A straightforward approach, at least for retrieving records that resemble a particular input, is to use the k-nearest neighbors algorithm, which assigns an object to a class based on its similarity to nearby objects in a neighborhood of iteratively adjustable radius. Another way to classify the feature space is to partition it such that the members of a particular partition are similar. Such cluster analyses are conducted with methods like k-means, the expectation–maximization family of algorithms, agglomerative clustering, and the DBSCAN algorithm.
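A minimal scikit-learn sketch of both strategies; the neighborhood size and cluster counts are arbitrary choices standing in for values tuned on a real archive.

import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import KMeans, DBSCAN

features = np.random.rand(500, 24)   # one aggregated feature vector per piece

# Retrieval: the five pieces most similar to a query recording.
nn = NearestNeighbors(n_neighbors=5).fit(features)
distances, indices = nn.kneighbors(features[:1])
print(indices[0])

# Partitioning: group similar pieces into clusters.
kmeans_labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(features)
dbscan_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(features)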

However, to truly explore the inherent similarities in the feature space, users need to somehow move through it. For that to be feasible, the complexity of the space has to be mitigated by reducing its dimensionality. Principal component analysis achieves that goal by projecting a data set onto a set of orthogonal vectors that account for the most variance in the data set.
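A sketch of that projection with scikit-learn; two components are kept so that the pieces can be plotted as points in a plane.

import numpy as np
from sklearn.decomposition import PCA

features = np.random.rand(500, 24)
pca = PCA(n_components=2)
projected = pca.fit_transform(features)   # each piece becomes a 2D point
print(pca.explained_variance_ratio_)      # variance captured by each component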

An alternative approach, the self-organizing map (SOM),6 is popular for reducing dimensionality in applications that entail retrieving musical information.7 The SOM estimates the similarity in the feature space by projecting it onto a regular, two-dimensional grid composed of artificial “neurons.” Users of the archive can browse the resulting grid to explore the original feature space.
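One way to build such a grid is with the third-party MiniSom package; the grid size and training length below are arbitrary choices for illustration.

import numpy as np
from minisom import MiniSom

features = np.random.rand(500, 24)

# A 10 x 10 grid of neurons, each holding a 24-dimensional weight vector.
som = MiniSom(10, 10, 24, sigma=1.5, learning_rate=0.5, random_seed=0)
som.train_random(features, 5000)

# Each piece is assigned to its best-matching neuron; the resulting grid
# positions are what an archive user would browse.
grid_positions = [som.winner(v) for v in features]
print(grid_positions[:5])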

Another popular approach for reducing dimensionality is to use an embedding method, which maps discrete variables to a vector of continuous numbers. By coloring data points according to the available metadata, users can visually identify clusters in the embedding. Widely used methods for doing this are t-distributed stochastic neighbor embedding (t-SNE) and uniform manifold approximation and projection (UMAP).
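A sketch of the embedding step with scikit-learn’s t-SNE implementation; the analogous call from the third-party umap-learn package is indicated in a comment.

import numpy as np
from sklearn.manifold import TSNE

features = np.random.rand(500, 24)
embedded = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)
print(embedded.shape)   # (500, 2): ready to be plotted and colored by metadata

# With umap-learn the call is analogous:
# import umap
# embedded = umap.UMAP(n_components=2, random_state=0).fit_transform(features)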

Figure 1 illustrates an example of the above approach. ESRA currently holds three collections of recordings from different time spans: 1910–48, 1932, and 2005–18. When a SOM clustered the assets of ESRA by timbre, it revealed that timbre features appear to be sensitive to the noise floor introduced by different recording equipment—that is, the SOM segregates input data by the shape of the noise. Because some pieces lack dates, one application of the SOM is to estimate the year a piece was recorded.

Figure 1.

A self-organizing map (SOM) trained with the timbre features of audio files in the Ethnographic Sound Recordings Archive (ESRA). Circles mark the position to which the SOM assigns the audio feature vectors of individual audio files. The circles’ color denotes which of three ESRA collections the file belongs to. The clustering of the collections reflects the audio files’ noise floor, which, in turn, is related to the recording technology. In particular, the oldest recordings (yellow dots) tend to be the noisiest (blue contours).


Physical modeling of musical instruments is a standard method in systematic musicology.8 The models themselves are typically based on two numerical methods: the finite-difference time-domain (FDTD) method and the finite element method (FEM). Computed on parallel CPUs, GPUs (graphics processing units), or FPGAs (field-programmable gate arrays), the two methods can solve multiple coupled differential equations with complex boundary conditions in real time.

Solving the differential equations that govern an instrument’s plates, membranes, strings, and moving air is an iterative process that aims to reproduce musical sound. The solutions can also deepen understanding of how instruments produce sound. Instrument builders use models to estimate how an instrument would sound with altered geometry or with materials of different properties without having to resort to physical prototypes.
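The flavor of such a model can be conveyed by a deliberately minimal one-dimensional example: a finite-difference time-domain update of the ideal wave equation for a plucked string with fixed ends. Real instrument models couple many such equations with far more intricate boundary conditions.

import numpy as np

L, c = 0.65, 200.0            # string length (m) and transverse wave speed (m/s)
nx = 200                      # number of spatial intervals
dx = L / nx
dt = dx / c                   # Courant number of 1 keeps the explicit scheme stable
x = np.linspace(0.0, L, nx + 1)

# Triangular "pluck" at one fifth of the string length, starting from rest.
u = np.where(x < 0.2 * L, x / (0.2 * L), (L - x) / (0.8 * L))
u[0] = u[-1] = 0.0
u_prev = u.copy()

r2 = (c * dt / dx) ** 2
pickup = []                   # displacement sampled at a "pickup" position
for n in range(5000):
    u_next = np.zeros_like(u)
    u_next[1:-1] = (2 * u[1:-1] - u_prev[1:-1]
                    + r2 * (u[2:] - 2 * u[1:-1] + u[:-2]))
    u_prev, u = u, u_next
    pickup.append(u[nx // 10])

waveform = np.array(pickup)   # a crude synthesized string sound, fundamental ~154 Hz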

Brazilian rosewood, cocobolo, and other traditional tonewoods have become scarce because of climate change and import restrictions. Another use of models is to identify substitute materials. Conversely, some historical instruments, notably organs and harpsichords, no longer sound as they used to because their wood has aged and their pipes, boards, frames, and other components have changed shape. Models can reproduce the instruments’ pristine sound.

A machine learning model can estimate both the range of possible historical instrument sounds and the performance of new materials by learning the parameter space of the physical model. Once the resulting synthesized sounds are added to an archive and the methods of computational phonogram archiving are applied to them, the essential sound characteristics of the instruments can be estimated. Although the field is still emerging, a scientifically robust estimation of historical instrument sounds is in sight for the first time.

As simulations proliferate, the need emerges to compare virtually created sounds with huge collections of sounds of real instruments. An example of such a comparison is the investigation of the tonal quality of an unplayable musical instrument: an incomplete 15th-century bone flute from the Swiss Alps.

Like its modern relative the recorder, a bone flute consists of a hollow cylinder with holes for different notes. The sound is produced when air is blown through a narrow windway over a sharp edge, the labium. The windway and mouthpiece are formed by a block that almost fills the flute’s top opening. The bone flute in our example is unplayable because the block, which was presumably made of beeswax, has not survived.

The first step in resurrecting the flute’s sounds is to image the object with high-resolution x-ray CT. The resulting 3D map characterizes both the object’s shape and its density. The data are then ingested into the archive, where they are subjected to segmentation. Widely used in medical imaging to identify and delineate tissue, segmentation allows substructures with homogeneous properties to be extracted from a heterogeneous object.

Each voxel (a pixel in 3D space) is allocated to exactly one of the defined subvolumes. Because each substructure is defined by its homogeneous properties, its surface data adequately represent the body. What’s more, replacing myriad voxels with a modest number of polygon meshes leads to a massive reduction of data.
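A toy version of that pipeline, assuming the CT volume already sits in memory as a NumPy array of densities and using scikit-image’s marching_cubes to turn a thresholded subvolume into a triangle mesh; the synthetic sphere and the thresholds are placeholders, not the values used for the actual flute.

import numpy as np
from skimage import measure

# Stand-in for a CT volume: a dense spherical "bone" in a less dense matrix.
zz, yy, xx = np.mgrid[:96, :96, :96]
volume = np.where((xx - 48)**2 + (yy - 48)**2 + (zz - 48)**2 < 30**2, 1.0, 0.2)
volume += 0.05 * np.random.rand(96, 96, 96)   # a little measurement noise

# Crude segmentation: allocate each voxel to "bone" or "background" by density.
bone_mask = volume > 0.6

# Extract a polygon surface mesh of the segmented substructure.
verts, faces, normals, values = measure.marching_cubes(bone_mask.astype(float), level=0.5)
print(verts.shape, faces.shape)   # the mesh is far more compact than the voxel grid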

A set of 3D meshes of instrument parts can be stored in the database for subsequent research tasks. The meshes are even small enough to be shared via email. The reconstructed polygon surface mesh is transformed into a parametric model and further modified using CAD. Once the virtual flute is made, it can be numerically fitted with a selection of differently shaped blocks that guide the air flow to strike the labium at varied angles.

And that’s just what we did. Our model considered three angles (0°, 10°, and 20°) at which the blown air left the windway and struck the labium. It also considered two speeds (5 m/s and 10 m/s) at which the blown air entered the mouthpiece. We focused on the initial speed because the main dynamics that characterizes the generation and formation of tone—what players of wind instruments call articulation—takes place in the first few milliseconds of the blowing process.

The sound optimization was carried out with OpenFOAM, an open-source toolbox for computational fluid dynamics and computational aeroacoustics. We set the tools to work solving the compressible Navier–Stokes equations and a suitable description of the turbulence. The six combinations of angle and blowing speed were calculated in parallel on the 256 cores of the Hamburg University computing cluster. Time series of the relevant physical properties were sampled from the simulation data and analyzed to find the optimal articulation of the instrument.
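The post-processing of those time series can be sketched as follows: given a pressure signal probed near the labium, estimate its sound-pressure-level spectrum with an FFT. The sampling rate and the random signal below are placeholders for the actual probe output of the OpenFOAM runs.

import numpy as np

def spl_spectrum(p, sample_rate, p_ref=20e-6):
    # Sound-pressure-level spectrum (dB re 20 uPa) of a pressure time series.
    p = p - p.mean()                         # remove the static pressure offset
    window = np.hanning(len(p))
    spectrum = np.fft.rfft(p * window)
    amplitude = 2 * np.abs(spectrum) / window.sum()   # window-compensated amplitude
    p_rms = amplitude / np.sqrt(2)
    freqs = np.fft.rfftfreq(len(p), d=1.0 / sample_rate)
    return freqs, 20 * np.log10(np.maximum(p_rms, 1e-12) / p_ref)

sample_rate = 96000
p = np.random.randn(sample_rate // 10)       # stand-in for one probed pressure signal
freqs, spl = spl_spectrum(p, sample_rate)
# One such spectrum per combination of jet angle and blowing speed allows
# the six simulated articulations to be compared.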

Figure 2 visualizes the pressure field in the generator region of the virtually intact bone flute. The simulated sound-pressure-level spectra of the six combinations of angle and blowing speed revealed that one combination, 20° and 5 m/s, produced the best and most stable sound, a finding that was corroborated by measurements of a 3D-printed replica of the bone flute. The five other combinations attenuated the initial transient and led to unstable or suboptimal sound production.

Figure 2.

A visualization of the pressure field around the mouthpiece of a simulated 15th-century bone flute 2.5 ms after the virtual player starts blowing.


Virtually generated instrument geometries can be subjected to physical modeling, just like a real instrument. In particular, we could investigate more complex dynamics such as overblowing, a technique that Albert Ayler, John Coltrane, and other avant-garde saxophonists used to extend the instrument’s sonic landscape.

Computational phonogram archiving can have political implications. Some multiethnic states define ethnic groups according to their languages and how or whether the languages are related to each other. Like language, music is an essential feature of ethnic identity. Using the methods presented in this article, ethnologists can investigate ethnic relations and definitions from the perspective of sound in ways less susceptible to human prejudice than traditional methods are.

It’s also possible with the tools of computational phonogram archiving to reconstruct historical migrations of music’s rhythms, timbres, and melodies. Another application is the investigation into the ways that individual musical expressions relate to universal cross-cultural musical properties such as the octave.

The doctrine of the affections is a Baroque-era theory that sought to connect aspects of painting, music, and other arts to human emotions. The descending minor second interval, for example, has long been associated with a feeling of sadness. Other, perhaps previously unrecognized associations—historical and contemporary—could be revealed through computational phonogram archiving. Composers and compilers of movie soundtracks might profit from that approach. Given a model trained on features of musical meaning, they could choose appropriate film music by matching a particular set of emotions and signatures.

The analysis of the underlying psychoacoustic features could illuminate the musical needs of historical eras. Whether a piece of music had a particular function is often subject to debate, which could be resolved through automated comparison with the contents of a music archive.

Indeed, the power and promise of computational phonogram archiving derives from its ability to address problems on different scales. It enables researchers to compare large-scale entities, like musical cultures, and small-scale entities, like differences in the use of distortion by lead guitarists.

1. R. Bader, ed., Computational Phonogram Archiving, Springer (2019).
2. I. Guyon et al., eds., Feature Extraction: Foundations and Applications, Springer (2006).
3. H. Fastl, E. Zwicker, Psychoacoustics: Facts and Models, 3rd ed., Springer (2007).
4. W. Zucchini, I. L. MacDonald, Hidden Markov Models for Time Series: An Introduction Using R, Chapman & Hall/CRC (2009).
5. J. Kacprzyk, W. Pedrycz, eds., Springer Handbook of Computational Intelligence, Springer (2015).
6. T. Kohonen, Self-Organizing Maps, 3rd ed., Springer (2001).
7. M. Leman, Music and Schema Theory: Cognitive Foundations of Systematic Musicology, Springer (1995).
8. R. Bader, ed., Springer Handbook of Systematic Musicology, Springer (2018).

Michael Blaß, Jost Leonhardt Fischer, and Niko Plath are researchers at the University of Hamburg’s Institute of Systematic Musicology in Hamburg, Germany.