There is always that moment when you hear a good song playing on the radio or in public and you just wish you could know the name of the song. If you have ever used the music recognition app, Shazam, then you are very much aware of the service it provides. With Shazam, you can record a short sample of a song, and the app will display the song title and artist and also include a link to buy the song. This seemingly uncomplicated user experience is the product of some fine level of brilliance which I plan to expound in this article.
A Summary: How it all works
“There is no such thing as magic; only things that are not understood by the observer”
– Toba Beta, author of Master of Stupidity
Being the curious mind that I am, I had a genuine interest in the inner workings of Shazam shortly after I had discovered it. Not only was it able to pick out the correct song title every time I used it, but it also worked well in the presence of background noise and poor audio.
Apparently, Shazam maintains a database comprising millions of songs. Each song record passes through a process that extracts unique identifying data about the song (more on this later). The mobile app on your phone performs the same process on the short sample you record, and the output is sent to Shazam’s servers. On the backend, Shazam looks up the sample from your phone against its vast collection of fingerprints.
With such a large database, chances are the song you recorded will find a match. The matching song’s metadata is returned to your mobile client, and you as the user can see the name of the song and the artist. Voilà!
But you might have a few questions. How is sound digitized? How is unique data extracted from an audio sample? How is Shazam able to quickly find a match from millions of songs?
Fear not dear reader, for I shall explain.
It all Starts with a Spectrogram
Think of a Spectrogram as a 3-dimensional representation of sound comprising the essential components: Amplitude, Frequency, and Time. Let’s go back to Physics 101.
Amplitude is the strength of a sound wave; basically what you perceive as loudness. Higher amplitude means louder sound and vice versa.
Frequency is the rate at which the vibration occurs. This represents the pitch of the sound. High frequency denotes high pitch and vice versa.
Time is well, time as you know it. In this context, the time component of sound represents the instant at which the sound possesses certain properties.
These three components are used to generate a spectrogram from which a fingerprint can be computed. A fingerprint is a kind of digital audio signature.
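To make this concrete, here is a minimal sketch (in Python, using only NumPy) of how a spectrogram can be computed with a short-time Fourier transform. The window size, hop length, and test tone are illustrative choices, not Shazam’s actual parameters:

```python
import numpy as np

def spectrogram(signal, window_size=1024, hop=512):
    """Slice the signal into overlapping windows and take the FFT of
    each one, giving amplitude as a function of frequency and time."""
    window = np.hanning(window_size)
    frames = []
    for start in range(0, len(signal) - window_size + 1, hop):
        frame = signal[start:start + window_size] * window
        # rfft gives the spectrum of a real-valued frame
        frames.append(np.abs(np.fft.rfft(frame)))
    # rows: time steps, columns: frequency bins, values: amplitude
    return np.array(frames)

# One second of a 440 Hz tone sampled at 8 kHz: the energy should
# concentrate in the frequency bin nearest 440 Hz
sr = 8000
t = np.arange(sr) / sr
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
peak_bin = spec[0].argmax()
print(peak_bin * sr / 1024)  # within one bin-width of 440 Hz
```

Each row of `spec` is one instant in time, each column one frequency band, and each value the amplitude at that point: the three components described above.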
How to generate an Audio Fingerprint
While tackling this problem, the people at Shazam had to devise an algorithm that would cater to a few guiding principles:
- Temporal Locality
- Translation Invariance
- Robustness
- Entropy
Stay with me…
Temporal Locality means that each fingerprint hash is computed from audio samples near a single point in time, so sounds far apart in the song do not influence the same hash.
Translation Invariance ensures that the same fingerprints can be reproduced regardless of where the sample falls within the audio file. We don’t know what part of the song the user will record, so we must be able to generate matching fingerprints from any continuous stretch of the music.
Robustness means that the algorithm should be capable of withstanding noise, background disturbance, or distortion. Even if the song is recorded from a noisy place, or in the presence of other sounds, we should be able to generate the same fingerprint.
Entropy is a measure of how much identifying information a fingerprint carries; it determines the likelihood that a fingerprint represents that sound exclusively. Low entropy raises the chance of collisions, where two different samples produce the same fingerprint.
The good people at Shazam beautifully addressed these concepts in developing their algorithm.
How did they do it? Let’s go back to the Spectrogram, which lets us create a 3D model of sound.
While generating a fingerprint, Shazam extracts the high-amplitude points (peaks) on the Spectrogram, effectively reducing the 3-dimensional graph to a 2-dimensional graph of frequency against time (imagine looking at the graph from above, where you are only interested in the peaks). The resulting set of points on the 2D plane is called a Constellation Map; each point marks a moment of high intensity within the sound.
For any song, these points of high intensity will be most resistant to background noise.
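A toy version of this peak-picking step might look like the sketch below: a point survives only if it is the loudest in a small time-frequency neighborhood. The neighborhood size and amplitude threshold are made-up parameters, and real implementations select peaks more carefully to spread them evenly across the map:

```python
import numpy as np

def constellation_map(spec, neighborhood=3, min_amplitude=10.0):
    """Keep only points that are the loudest in a small time-frequency
    neighborhood -- a crude stand-in for Shazam's peak picking."""
    peaks = []
    times, freqs = spec.shape
    for t in range(times):
        for f in range(freqs):
            amp = spec[t, f]
            if amp < min_amplitude:
                continue
            t0, t1 = max(0, t - neighborhood), min(times, t + neighborhood + 1)
            f0, f1 = max(0, f - neighborhood), min(freqs, f + neighborhood + 1)
            if amp >= spec[t0:t1, f0:f1].max():
                peaks.append((t, f))  # amplitude itself is discarded
    return peaks

toy_spec = np.zeros((10, 20))
toy_spec[2, 5] = 50.0  # a loud, isolated point survives
toy_spec[2, 6] = 30.0  # its quieter neighbour does not
print(constellation_map(toy_spec))  # [(2, 5)]
```

Notice that the output keeps only (time, frequency) coordinates: the amplitude axis has done its job and is thrown away, which is exactly what makes the map robust to volume changes and noise.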
Next, they select Anchor points, each of which is paired with neighboring peaks inside a region called the Target zone. Each Anchor-Target pair is described by the two points’ frequencies and the time offset between them, and this combination is fed into a hashing function that generates a 32-bit integer. The process is repeated throughout the length of the song, so a single song yields many fingerprint hashes.
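As a rough illustration, here is how anchor-target pairs could be packed into 32-bit hashes. The fan-out value and the bit layout (10 bits per frequency, 12 bits for the time delta) are hypothetical, not Shazam’s actual format:

```python
def fingerprint_hashes(peaks, fan_out=5):
    """Pair each anchor peak with the next few peaks in its target zone
    and pack (anchor_freq, target_freq, delta_t) into a 32-bit hash.
    The field widths here are illustrative only."""
    hashes = []
    peaks = sorted(peaks)  # order the constellation points by time
    for i, (t1, f1) in enumerate(peaks):
        for t2, f2 in peaks[i + 1:i + 1 + fan_out]:
            dt = t2 - t1
            # 10 bits per frequency + 12 bits of time delta = 32 bits
            h = (f1 & 0x3FF) << 22 | (f2 & 0x3FF) << 12 | (dt & 0xFFF)
            # keep the anchor time too: needed later to align matches
            hashes.append((h, t1))
    return hashes

peaks = [(0, 100), (3, 200), (5, 150)]  # (time, frequency) points
print(len(fingerprint_hashes(peaks)))   # 3 anchor-target pairs
```

Because each hash depends only on two nearby peaks, it satisfies temporal locality, and because it stores the time *delta* rather than absolute time, it is translation-invariant.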
These hashes are indexed in tables of combinatorial hashes, which ensures that hashes can be matched very quickly during a search. And the search is quick: O(n log n) in the average case and constant time, O(1), in the best case. Considering the size of their database, that’s pretty fast!
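To sketch the server side, here is a hypothetical in-memory index together with the offset-voting trick that makes matching work: hashes belonging to the true song all agree on a single time offset between the sample and the song, so the song with the most votes at one offset wins. The song names and hash values below are made up:

```python
from collections import defaultdict

def build_index(songs):
    """Index every (hash, anchor_time) of every song as
    hash -> list of (song_id, time). A toy stand-in for
    Shazam's indexed tables."""
    index = defaultdict(list)
    for song_id, hashes in songs.items():
        for h, t in hashes:
            index[h].append((song_id, t))
    return index

def best_match(index, sample_hashes):
    """Count, per (song, offset) pair, how many hashes agree:
    a true match produces one dominant time offset."""
    votes = defaultdict(int)
    for h, sample_t in sample_hashes:
        for song_id, song_t in index.get(h, []):
            votes[(song_id, song_t - sample_t)] += 1
    if not votes:
        return None
    (song_id, _), _ = max(votes.items(), key=lambda kv: kv[1])
    return song_id

songs = {"joro": [(111, 10), (222, 11), (333, 12)],
         "other": [(111, 40)]}
index = build_index(songs)
# the sample starts 10 time-steps into "joro"
print(best_match(index, [(111, 0), (222, 1), (333, 2)]))  # joro
```

The dictionary lookup per hash is what gives the near-constant-time behavior in the best case; noise hashes scatter their votes across random offsets and never outvote the true alignment.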
The hashing process described above is replayed on your phone when you record a sound clip on Shazam. The output is sent over the internet to Shazam’s servers which conduct a search against the already indexed DB. The response cascades back to your phone screen and you see the result. The song playing is Wizkid’s latest banger: Joro 🙂
What’s really beautiful about all this is how the engineers at Shazam were able to decompose a very interesting problem and devise an elegant solution that is easy to understand and replicate. The lesson here is that you should not be discouraged when confronted by a very daunting challenge. Boil down the problem to the rudimentary truths and work your way up while keeping your eyes on the goal.
So next time you hear a great song playing, you don’t need to guess the title. Simply whip out your phone. Shazam can match a sample of just a few seconds from a 3–4-minute song against a database of 10 million songs (that’s over 57 years of music!!!) … score one for human ingenuity.