I used the new OpenAI technology to transcribe audio directly on my laptop

OpenAI, the company behind the DALL-E image generation program and the powerful GPT-3 text completion engine, has released a new open source neural network aimed at transcribing audio into written text (via TechCrunch). It’s called Whisper, and the company says it “approaches human-level robustness and accuracy in English speech recognition” and can also automatically recognize, transcribe, and translate other languages like Spanish, Italian, and Japanese.

As someone who constantly records and transcribes interviews, I was immediately excited by this news: I thought I could write my own application to securely transcribe audio directly on my computer. While cloud-based services like Otter.ai and Trint work for most things and are relatively safe, there are a few interviews where I or my sources would feel more comfortable if the audio file were never uploaded to the internet.

Using it turned out to be even easier than I had imagined; I already have Python and various development tools set up on my computer, so installing Whisper was as easy as running a single Terminal command. Within 15 minutes, I was able to use Whisper to transcribe a test audio clip that I had recorded. For someone relatively tech-savvy who didn’t already have Python, FFmpeg, Xcode, and Homebrew set up, it would probably take about an hour or two. However, someone is already working to make the process much simpler and easier to use, which we’ll get to in a second.

Command line applications obviously aren't for everyone, but for something that's doing relatively complex work, Whisper is very easy to use.
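For the curious, here’s a minimal sketch of what that workflow looks like from Python, assuming you’ve installed the `openai-whisper` package with pip (along with FFmpeg). The `--model` and `--language` flags are real Whisper CLI options; the file name `interview.mp3` and the helper function are just my own illustration:

```python
# A minimal sketch of driving Whisper's command-line tool from Python.
# Assumes: pip install -U openai-whisper  (plus FFmpeg on your PATH).
import subprocess

def build_whisper_cmd(audio_path, model="small", language=None):
    # --model and --language are real Whisper CLI flags; smaller models
    # like "tiny" and "base" run faster at some cost in accuracy.
    cmd = ["whisper", audio_path, "--model", model]
    if language is not None:
        cmd += ["--language", language]
    return cmd

if __name__ == "__main__":
    # Transcribe a (hypothetical) interview recording with the small model.
    subprocess.run(build_whisper_cmd("interview.mp3", language="en"), check=True)
```

Whisper writes its transcript out as plain text (and subtitle formats) next to the audio file, which is all a workflow like mine needs.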

While OpenAI definitely saw this use case as a possibility, it’s pretty clear that the company is primarily targeting researchers and developers with this release. In the blog post announcing Whisper, the team said their code could “serve as a foundation for building useful applications and future research in robust speech processing” and that they hope “Whisper’s high accuracy and ease of use will allow developers to add voice interfaces to a much broader set of applications.” Even so, this open approach is notable: the company has limited access to its most popular machine learning projects, such as DALL-E and GPT-3, citing a desire to “learn more about real-world use and continue to iterate on our security systems.”

Image showing a text file with the transcribed lyrics of the Yung Gravy song

There’s also the fact that installing Whisper is not exactly a user-friendly process for most people. However, journalist Peter Sterne has teamed up with GitHub developer advocate Christina Warren to try to fix that, announcing that they are creating a “free, secure, and easy-to-use transcription app for journalists” based on Whisper’s machine learning model. I spoke with Sterne, and he told me he decided the program, called Stage Whisper, should exist after running some of his interviews through Whisper and determining that it was the best transcription he’d ever used, with the exception of human transcribers.

I compared a transcript generated by Whisper to what Otter.ai and Trint produced for the same file, and I would say it was relatively comparable. There were enough errors in all of them that I would never copy and paste quotes from them into an article without double-checking the audio (which is, of course, best practice anyway, no matter what service you’re using). But the Whisper version would absolutely do the job for me; I can search through it to find the sections I need and then manually check just those. In theory, Stage Whisper should perform exactly the same, since it will use the same model, just wrapped in a GUI.

Sterne admitted that technology from Apple and Google could make Stage Whisper obsolete in a few years: the Pixel’s voice recorder app has been able to do offline transcription for years, a version of that feature is starting to roll out to other Android devices, and Apple has offline dictation built into iOS (though there’s currently no good way to transcribe audio files with it). “But we can’t wait that long,” Sterne said. “Journalists like us need good automatic transcription apps today.” He hopes to have a basic version of the Whisper-based app ready in two weeks.

To be clear, Whisper likely won’t make cloud-based services like Otter.ai and Trint totally obsolete, no matter how easy it is to use. For one thing, the OpenAI model is missing one of the most important features of traditional transcription services: the ability to tag who said what. Sterne said that Stage Whisper probably wouldn’t support this feature either, since “we’re not developing our own machine learning model.”

The cloud is just someone else’s computer, which probably means it’s a bit faster

And while you get the benefits of local processing, you also get the drawbacks. The main one is that your laptop is almost certainly significantly less powerful than the computers a professional transcription service uses. For example, I fed a 24-minute interview into Whisper running on my M1 MacBook Pro; it took about 52 minutes to transcribe the entire file. (Yes, I made sure I was using the Apple Silicon version of Python instead of the Intel one.) Otter spat out a transcript in less than eight minutes.

However, OpenAI’s technology has one big advantage: price. Cloud-based subscription services will almost certainly cost you money if you use them professionally (Otter has a free tier, but upcoming changes will make it less useful for people who frequently transcribe things), and the transcription tools built into platforms like Microsoft Word or the Pixel require you to pay for the software or hardware separately. Stage Whisper, and Whisper itself, are free and can run on the computer you already have.

Again, OpenAI has bigger hopes for Whisper than serving as the foundation for a secure transcription app, and I’m very excited about what researchers will end up doing with it, and what they’ll learn by examining the machine learning model, which was trained on “680,000 hours of multilingual, multitasking supervised data collected from the web.” But the fact that it also has real, practical use today makes it all the more exciting.
