Hey ho and away we go: Folk music shanties used for U of A's singing-to-text research
Effort to teach computer to recognize sung words among first of its kind in the world
What can you do with a drunken sailor? Turns out, one might be useful in teaching a computer to decipher lyrics as they're being sung.
A University of Alberta physics student is using folk music shanties — traditional sea or lumberjack tunes sung by sailors or woodsmen as they worked — in an artificial intelligence project teaching computers to recognize speech as it is being sung.
"If you've used speech-to-text … you'll know that it has improved pretty dramatically," Dallin Backstrom, a physics major at the U of A, told CBC's Edmonton AM this week.
"That's largely been thanks to deep neural networks, which is a new AI technology that's been implemented in speech recognition."
Backstrom wondered if the technology would work on singing and discovered that only one other research group had ventured into this territory. His work, supported by the university's Sound Studies Initiative, is among the first of its kind in the world.
Teaching a computer to recognize singing-speech is a painstaking task that starts with feeding the computer a massive number of tiny pieces of tagged data, in this case carefully clipped and labelled singing sounds derived from recordings of shanty songs, Backstrom explained.
Why were shanty songs chosen for this project?
"A lot of it is acapella, which is very helpful because that reduces some of the noise and makes it just a lot easier for the computer to be able to understand. It's a lot less confused," Backstrom said.
And while What Can You Do With a Drunken Sailor is a fine example of a sea shanty, that particular traditional tune from the early 1800s is not part of the computer's training repertoire.
Backstrom said the project is accessing Canadian music from the venerable Folkways collection hosted at the Smithsonian Institution as well as at the U of A's Sound Studies Initiative, and includes songs whimsically titled The Dog and the Gun, The Fair Maid on the Shore and The Barley Grain for Me.
Benjamin Tucker, a U of A professor specializing in speech science, explained that while speech-to-text technology works quite well, singing-to-text is complicated by factors such as pitch, intonation and longer-held vowels.
Backstrom said that by using the folk shanty music, he was able to teach a computer program to identify the speech sound produced by the singer in each quarter-second of audio with about 70 per cent accuracy.
He likens the training process to putting a sound wave through a funnel: the computer takes the complex information, breaks it down into sounds and tries to guess the correct speech sound from a set of 40 that it has learned.
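For readers curious what that funnel might look like in practice, here is a minimal sketch in Python. It is not Backstrom's code, which the article does not describe. It only illustrates the general idea: cut the audio into quarter-second frames, pick the most likely of 40 speech sounds for each frame, and measure how often the guesses match the labels. The scores would come from a trained neural network, which is left out here. All function names are hypothetical, and the 16,000-samples-per-second rate in the note below is an assumption.

```python
import math

FRAME_SECONDS = 0.25      # window length mentioned in the article
NUM_SPEECH_SOUNDS = 40    # size of the label set mentioned in the article


def split_into_frames(samples, sample_rate):
    """Cut a list of audio samples into consecutive quarter-second frames."""
    frame_len = int(sample_rate * FRAME_SECONDS)
    # Drop a trailing partial frame so every frame has the same length.
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the peak for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def guess_speech_sound(scores):
    """Return (index, probability) of the most likely speech sound.

    `scores` stands in for the output of a trained model (an assumption):
    one raw score per speech sound for a single frame.
    """
    if len(scores) != NUM_SPEECH_SOUNDS:
        raise ValueError("expected one score per speech sound")
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]


def frame_accuracy(guesses, labels):
    """Fraction of frames whose guessed sound matches the tagged sound."""
    if len(guesses) != len(labels) or not labels:
        raise ValueError("need equal, non-empty lists of guesses and labels")
    correct = sum(g == t for g, t in zip(guesses, labels))
    return correct / len(labels)
```

At an assumed 16,000 samples per second, each frame would hold 4,000 samples. An accuracy of about 70 per cent would mean that frame_accuracy returns roughly 0.7 over the tagged frames.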
"Remember back in the days when we calculated how much memory there was in an iPod by how many songs it could hold? Sound is a big thing."
He is now working on getting the program to perform that same task on songs it hasn't heard before.
"We're at that step right now, where we're trying to test it and see if its accuracy is as high as we want it to be on brand new data."