What could Adobe’s Project VoCo mean for pro audio?

Jerry Ibbotson reacts to Adobe’s recent reveal of Project VoCo, a new speech editing application it’s working on that has raised a number of technological and ethical questions.

Anyone who has worked with recorded voice will have had a moment when they heard someone's voice say something they didn't, or at least not quite in that way. For me, it was way back in the last century, as a trainee radio journalist working with quarter-inch tape. One of the exercises we were given was tidying up a classic piece of archive: a US radio reporter trying to report on "a fire at the Firestone tyre factory". Say it a few times yourself and you can imagine the mess he got himself into. It illustrated the power of a tidy voice edit.

Over the years some things have changed, but the basic principle is the same. Even with digital audio and a visual waveform on-screen, you still ultimately use your ears. But now software giant Adobe has demonstrated something that could shake that up – and it throws up some serious ethical questions.

At its recent Adobe MAX Sneaks event in San Diego it unveiled a tech demo of Project VoCo. Beyond a few videos online there's been no official announcement, but the event has previously produced technology that has worked its way into Adobe software such as Audition and Premiere.

Putting VoCo through its paces was developer Zeyu Jin. He described it as being like Photoshop for audio. I can understand where he’s coming from and it may make sense for his non-audio audience, but you could say that any audio editing package or DAW is already that. Adobe’s own Audition (my DAW/editor of choice) uses plenty of Photoshop-like tools.

Shaky start aside, what Zeyu Jin demonstrated left his audience stunned. He began by playing a clip of someone talking about his reaction to being nominated for an award.

“I kissed my dogs and my wife.”

VoCo automatically transcribed the words into text. Jin then copied the word “wife” and pasted it over the word “dogs” in the text window. VoCo played the audio back.

“I kissed my wife and my wife.”

The edit wasn’t 100% perfect – there was a slight clumsiness at the start of the first “wife” (the pasted one) – but it was still worthy of the applause it received from the audience. Bear in mind this is the original voice speaking – it sounded as it would if you edited the waveform but it’s been achieved by manipulating text.

There was more to come. Zeyu Jin wanted to put the dogs back in. He simply typed the word "dogs" over the second "wife", at the end of the sentence.

“I kissed my wife and my dogs.”

This time it sounded exactly right, with nothing to fault. And this was done by simply writing the word.

How does it work?

Well, it needs a decent sample of the original voice – around 20 minutes in this case. It then studies the building blocks of the individual's speech – the phonemes. These are all the different sounds that make up spoken language, the "f"s and "s"es and so on. VoCo then reassembles them on demand. It's like a form of speech-sampling with the interface being text instead of a waveform. The principle of using ears to make the edit has been broken.
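The principle described above – harvesting a speaker's phonemes from a sample recording, then concatenating them to render any word typed into the transcript – can be illustrated with a toy sketch. To be clear, this is my own simplified illustration, not Adobe's actual algorithm: the phoneme bank, the clip values, and the ARPAbet-style lexicon here are all invented for the example.

```python
# Toy concatenative speech editing (an illustration, NOT Adobe's implementation).
# Imagine ~20 minutes of one speaker's audio has been segmented into clips,
# each labelled with the phoneme it contains.

# Hypothetical phoneme bank: phoneme label -> audio samples from that speaker.
phoneme_bank = {
    "W": [0.10, 0.20], "AY1": [0.30, 0.40], "F": [0.05, 0.06],
    "D": [0.20, 0.10], "AO1": [0.40, 0.30], "G": [0.15, 0.25], "Z": [0.02, 0.03],
}

# Hypothetical pronunciation lexicon: word -> phoneme sequence (ARPAbet-style).
lexicon = {
    "wife": ["W", "AY1", "F"],
    "dogs": ["D", "AO1", "G", "Z"],
}

def synthesise(word):
    """Build a word by concatenating the speaker's own phoneme clips."""
    samples = []
    for phoneme in lexicon[word]:
        samples.extend(phoneme_bank[phoneme])
    return samples

# Editing becomes a text operation: change "wife" to "dogs" in the transcript
# and only the altered word is re-rendered from the same voice's building blocks.
audio = synthesise("dogs")
print(len(audio))  # 8 samples, from four two-sample phoneme clips
```

A real system has far harder problems to solve – choosing between many candidate clips per phoneme and smoothing pitch and timing at the joins, which is presumably why the first pasted "wife" in the demo sounded slightly clumsy – but the text-as-interface idea is the same.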

The demo in San Diego wasn't quite over. Zeyu Jin went back to VoCo and overwrote the word "wife" with "Jordan" (Jordan Steele on stage, presenting the MAX event). The audio now said, "I kissed Jordan and my dogs."

This was VoCo building a new word from scratch, assembling it using the phonemes it had collected from the sample speech. It hadn't just reordered words or copied one that was already there; it had made one up.

Let’s not pretend that a good audio editor couldn’t re-jig a recording to make someone say something they hadn’t. I can remember hearing a quarter-inch tape edit of the Queen giving a speech in which she described a member of her family in language I can’t imagine Her Majesty ever using. And that was more than 25 years ago. But VoCo takes it to another level. It can literally build words from thin air.

So what do other people think?

I asked BBC radio journalist and tech-head Nick Garnett.

“Potentially it’s the biggest change in editing since waveforms came to the screen of your computer. It is fraught with post-truth issues. It enables quotes to be corrected after the event. Or changed.

“Veracity will rely on time stamped recordings or multiple recordings by different reporters. It’s the equivalent of the reporter huddle where newspaper hacks gather round one another to ‘build’ the quote. Honourable hacks would only manipulate a tense or the odd word. Unscrupulous hacks would make the whole lot up.”

This is something Adobe is aware of, as VoCo developer Zeyu Jin told the audience in San Diego: “We have researched how to prevent a forgery, like watermarking. As we are getting the results much better and making it so people can’t distinguish between the fake and the real one, we’re working on how to make it detectable.”

But the reassurance doesn’t cut it with everyone. I showed the video of the Adobe demo to another BBC journalist, my former colleague Huw Williams. He was even more alarmed at the potential for misuse that the concept offers.

“I think it sounds alarm bells. I can’t understand why someone has developed it and I’d like to know who they think is going to use it? Why would you want to make someone say something they didn’t?

“At a time when Fake News is in the spotlight after the US election, this has the potential to make things worse. It’s clever and impressive but some questions should be asked about how it might be used.”

For now it’s just on a test-bed but given that previous MAX demos have made it out into the real world, we could be about to see a revolution in voice editing that goes way beyond swapping words around or cutting down an interview.

Jerry Ibbotson has worked in pro-audio for more than 20 years, first as a BBC radio journalist and then as a sound designer in the games industry. He’s now a freelance audio producer and writer.