Speech-to-text software has gotten pretty mainstream these days. The most common use is voice commands, especially with mobile devices. Many smartphone owners have had the experience of articulating hostilely into a 1-millimeter smartphone microphone, trying — and expecting — to be understood. While there are still enough voice command errors to inspire sites like whysiriwhy.com (with bloopers like “sushi” becoming “slushy”), at the end of the day, you can probably still dictate what time you’re coming home for dinner without too much trouble.
Less charted territory in the world of automatic speech recognition is speech-to-text for media, like audio and video files. Unlike the controlled, single-speaker environment you get with Siri, the voices in media are complicated. With dictation software, you can train your device on your own voice, adding new custom pronunciations all the time. With pre-recorded media, you don’t have that luxury. Add to that rampant background noise, overlapping speech, and poor audio quality, and you can imagine why automatic captioning on YouTube often reads like Dada poetry.
At Pop Up Archive, we use speech-to-text technology to generate automatic transcripts and keywords for media, so we’ve seen our share of blooper-filled transcripts. For example, here’s one recording from the site: It’s Franklin D. Roosevelt’s 1941 Fourth of July address, recorded just months before Pearl Harbor. The specter of war on American soil hangs in the air. With gravity, Roosevelt intones: “We know that we cannot save freedom in our own midst, in our land — if all around us our neighbor nations have lost their freedom.” What does the speech recognition software hear?
Oh boy. And that’s just a safe-for-work example.
Publishers want the ease and speed of automatic transcripts and captions, but are loath to “publish” text that’s not perfect because their audiences expect (and deserve) readable transcripts and captions. As a result, until now, many have been slow to let speech-to-text software into their process: After all, that software literally just put the words “meth lab” into FDR’s mouth.
It’s important to understand that automatic speech recognition text often looks the way it does not because of a lack of data, but because of a surplus. General speech-to-text software is intentionally trained on a very broad range of text so that it can learn the most current and common words, how often they’re used, and in what order they appear. And what better place to search for current text than the Internet? In this way, after being fed an Internet’s worth of smut, tech lingo, and celebrity gossip, a general language model for speech-to-text software can easily end up thinking that a 1960s Nelson Mandela had a lot to say about Twitter and Nicki Minaj. It’s a simple case of input and output.
Customizing speech recognition for media
So what if you made speech recognition for media more customized? Pop Up Archive has made great strides in automatic speech technology by categorizing the content that comes into our site, and training our language models only on that kind of content. One example is our work for Melody Kramer’s NPRchives project, in which old stories are excavated from NPR’s archive, one year at a time, starting with the year 1984. For this project, we created a language model full of all things 1980s, from legwarmers to Reaganomics. Paired with other improvements in our speech technology, like speaker differentiation and punctuation, these automatic transcripts, though still not perfect, start to make a lot more sense.
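To make the input-and-output point concrete, here is a toy sketch in Python of how the training text alone can flip the recognizer's choice between two candidate transcriptions. It is not Pop Up Archive's actual system: the two tiny corpora, the candidate phrases, and the function names are all invented for illustration, and a real language model would look at word order across far more text rather than at simple word counts.

```python
from collections import Counter
import math


def train_unigram(corpus):
    """Count how often each word appears in a list of sentences."""
    counts = Counter()
    for sentence in corpus:
        counts.update(sentence.lower().split())
    return counts


def log_score(sentence, counts, vocab_size):
    """Add-one smoothed unigram log-probability of a sentence."""
    total = sum(counts.values())
    return sum(
        math.log((counts[word] + 1) / (total + vocab_size))
        for word in sentence.lower().split()
    )


def pick(candidates, counts, vocab_size):
    """Return the candidate transcription the model finds most likely."""
    return max(candidates, key=lambda s: log_score(s, counts, vocab_size))


# Hypothetical training text: a "general web" corpus and a "1980s archive" corpus.
web_corpus = [
    "she posted on twitter today",
    "twitter is trending again",
    "everyone is on twitter",
]
archive_corpus = [
    "reaganomics shaped the budget debate",
    "the budget debate in congress",
    "critics of reaganomics spoke today",
]

web_counts = train_unigram(web_corpus)
archive_counts = train_unigram(archive_corpus)
vocab_size = len(set(web_counts) | set(archive_counts))

# Two made-up candidates the recognizer must choose between.
candidates = ["critics of reaganomics", "critics of twitter"]

print("general model picks:", pick(candidates, web_counts, vocab_size))
print("archive model picks:", pick(candidates, archive_counts, vocab_size))
```

The scoring code is identical in both cases. Only the text behind the counts changes, and with it the word the model prefers.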
And after all, is perfection really that important? Search engines don’t need perfect transcripts. There are ways to be smart about interpreting messy output, and to use that data to extract relevant keywords and metadata. So media outlets, publishers, everybody: Listen up! To value automatic transcripts only at 100% accuracy is to misunderstand the way the Internet interprets text. It’s like magic: When you pair text content with timestamps, audio becomes browsable. Harnessed the right way, speech-to-text software means effortless drag-and-drop access to crucial keywords and moments hidden deep within hours of content.
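As a rough illustration of why timestamps matter more than perfect accuracy, the Python sketch below builds a keyword index from an imperfect transcript. The transcript format (a list of start-time and word pairs), the stopword list, the misheard words, and the timestamps are all assumptions made up for this example, not the output of any particular speech recognition service.

```python
from collections import defaultdict

# A tiny, hand-picked stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "of", "and", "to", "in", "that", "we", "our", "if"}


def build_index(words):
    """Map each non-stopword keyword to the start times where it was heard."""
    index = defaultdict(list)
    for start, word in words:
        token = word.lower().strip(".,;:!?\"'")
        if token and token not in STOPWORDS:
            index[token].append(start)
    return dict(index)


def find(index, keyword):
    """Return the timestamps (in seconds) at which a keyword occurs."""
    return index.get(keyword.lower(), [])


# Made-up recognizer output: some words are wrong ("hour", "mist"),
# but the important keyword still lands at the right moments.
transcript = [
    (0.0, "We"), (0.4, "know"), (0.9, "that"), (1.1, "we"),
    (1.3, "cannot"), (1.8, "save"), (2.2, "freedom"), (3.0, "in"),
    (3.2, "hour"), (3.5, "own"), (3.9, "mist"),
    (11.6, "lost"), (12.0, "their"), (12.4, "freedom."),
]

index = build_index(transcript)
print(find(index, "freedom"))  # start times to jump to in the audio
```

A player or search page could use those start times to jump straight to the matching moments in the recording, even though several of the surrounding words were transcribed incorrectly.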
What many producers really need is a quick way to see the structure and keywords in an interview, not necessarily perfect fidelity to an original recording. This is something the current state of speech technology is capable of. In a recent Lifehacker interview, Ira Glass (This American Life) said this of his transcription process:
“Getting every word right isn’t as important as having something on paper for each sentence that’s been said because to make radio stories, you edit by the sentence.”
Why should reporters still labor through logging tape like mid-century typists when they can use automatic transcripts to get what Pop Up Archive customer and WLRN-Miami Herald News reporter Kenny Malone calls “a kind of low-rez map” of their audio?
You’re missing huge opportunities if you’re not using speech technology to help you organize, and ultimately monetize, your media. Relevant timestamped keywords and phrases help you efficiently create and reuse media, as well as share that media with wider audiences. Audio is a data-rich medium, and we finally have the tools to interpret that data. We need speech technology to bring digital voices to the forefront of the web, even if it might say some bad things about your mother.
Emily Saltz is the Head of Content Strategy at Pop Up Archive, a “smart transcription” company that makes sound searchable. Previously, she got her hands dirty as a research assistant in the Phonetics and Psycholinguistics Labs at UC Santa Cruz. A public media fangirl, she once used This American Life to teach English to Russian students in Saint Petersburg, Russia. She is the tweeter and blogger in residence at Pop Up Archive, posting tidbits from its varied and ever-expanding collections. Contact her at email@example.com