speech synthesis

Bored Stiff

by Paul Strikwerda in Articles, Career, Freelancing, Personal

The author behind his microphone

I’ve been behind the mic since I was seventeen. By the look of my grey hair, you can tell that’s a pretty long time. Thirty-seven years to be exact. 

“Does it ever get old?” someone wanted to know. “This voice-over thing you do.”

“Well, ‘it’ doesn’t get old, but I certainly do,” I replied, not knowing that I had spoken too soon.

An hour later I got this really boring script about ladders, and I changed my mind. It was poorly written, poorly translated, and I had no idea why they had selected poor old me to narrate it. Yes, it was money in the bank, but in reality I would rather have gone back to bed.

Let me explain something to you. 

I have no particular fondness for ladders. Walking under them brings bad luck, and many of them wobble in a most disconcerting way. Ladders are ugly and dangerous. Just because they take you to the top doesn’t mean they’re special. They’re just a few steps up from step stools. One of the reasons I became a freelancer is that I wasn’t good at climbing the corporate ladder. So why should I, of all people, have to sing their praises?

It’s for the same reason they talked me into voicing videos about agricultural insurance, miracle car wax, and motorcycle repair. It’s part of the unavoidable, unglamorous, unexciting work voice-overs do every day in dimly lit chatter boxes. 

I must admit: that part of the job does get old and boring. Especially if one has to edit, separate, and name hundreds of files per specific client instructions that make it impossible to do this semi-automatically. Of course the client conveniently “forgot” to mention it at the time of the booking.

Come to think of it: that gets old too. You know, clients trying to take advantage. The other day one of them sent me a message saying that I had “forgotten” to read one paragraph. Of course they would need it right away. The thing is, that mystery paragraph was never in the original script. It was a last-minute addition. 

Now, I know that some colleagues would forgive the client for this “mistake,” and record the five or six extra lines pro bono. In my book, however, more words means more money. It’s not that I am greedy. I just happen to run a for-profit business. With the Arctic temperatures we’re experiencing, someone’s got to pay the heating bill!

If you were to ask a contractor to paint your kitchen as a courtesy, right after she’s finished with the living room, do you think she’d do it? Would an Uber driver take you to the town next to your agreed destination, and not charge you for it? Of course not. Then why do some people expect they can get a voice-over to record a few extra lines at no charge? 

“Well, the other guy we hired did it.” 

“Then why didn’t you ask him to do it?”

“Because he sucked.”

It’s the same old story, and it makes me yawn every time I hear it. 

If you’re getting your feet wet as a VO, trust me. There are parts of this job that are “just work.” Work you may hate. For instance, you’ve signed up to narrate a 400-page audiobook, and with every chapter you get this nagging feeling that it’s not getting better. In fact, it’s going nowhere. You start wondering how this piece of pulp ever got published. Then you find out this is a vanity project by someone who should have kept his job at the Department of Motor Vehicles.


One of the most boring jobs you can get in this business involves speech synthesis: the artificial production of human speech by computers. The text-to-speech software “runs” on thousands of snippets of sound (phonemes) recorded by voice-overs. Recording sessions can go on for months and are notoriously tedious (just ask Susan Bennett, the voice of Siri).

Once the engineers have what they need, they can use the program to simulate speech for apps, navigation systems, or virtual assistants such as Bixby and Alexa. Amazon now has a database of synthesized voices that is rented out to developers in need of voices for their applications. 

Here’s the kicker. As a voice-over you only get paid once for the database you helped create. That’s it. A colleague of mine has heard his artificially generated voice in at least twenty applications, ranging from computer games to language courses, and he’ll never see a penny.

Since he recorded his phonemes, technology has moved on even further.

Did you know that Adobe’s Voco (the Photoshop of speech) only needs about twenty minutes of recorded target speech to generate a sound-alike voice, producing sound patterns that were not even recorded?


Perhaps they should have Voco read that terrible self-published novel I mentioned earlier!

Anyway, thanks to modern technology, the most boring parts of voice-over jobs might be behind us. If we can get machines to say anything we want them to say, why use humans? Computers can work without a break, and don’t require a SAG-AFTRA contract. 

In a strange way, that’s music to my ears. 

I might lose a few dollars, but very soon people like me won’t have to talk about ladders anymore.

How exciting is that?!

Paul Strikwerda ©nethervoice



Are You Being Replaced by Text-to-Speech Software?

by Paul Strikwerda in Articles

Should voice-over artists be afraid of artificial unintelligence?

Will robots take over the role of narrator or do voice-over professionals still have a future?

The man who had lost his voice to thyroid cancer spoke again on The Oprah Winfrey Show. In 2010, the late film critic Roger Ebert gave his Oscar predictions with the help of text-to-speech (TTS) software that spoke whatever he typed.

The first computer-based speech synthesis systems were created in the late 1950s. They’ve come a long way, but a lot of TTS software still sounds rather robotic and unnatural. That’s why Ebert turned to Scottish firm CereProc for help.

CereProc uses a person’s audio recordings to create a digital voice that comes very close to the real thing. Usually, CereProc has people come into its studio and record about 15 hours of audio. This is used to re-create the original voice.

In Ebert’s case, they used audio commentary he had made for several DVD documentaries. The quality was poor and the recordings were not as long as they would have liked. Nevertheless, they did the impossible and gave Ebert his voice back.


TTS software is not only used for people who have lost the ability to speak. It’s used to capture accents and dialects that are on the verge of dying out. People also use it to learn a foreign language. There’s one other application you should be aware of: it could eventually be used to replace you and me! Poland-based Ivona Text-to-Speech advertises:

“Save money spent on voice talent recordings. You do not have to look for recording studios and speakers. You do not waste time concluding agreements and contacting the contractors and it’s accessible 24/7.”

If you want to get an idea of what this software is capable of, go to their website, type in a few words, and have a digital voice read them back to you. Rival NeoSpeech, headquartered in California, claims:

“Robotic voices are now history.”

NeoSpeech offers nine different voices that speak US English, Mexican Spanish, Korean, Japanese and Mandarin Chinese for a wide range of hand-held devices, desktop and network/server applications.


If it weren’t for a certain former president, Roger Ebert might never have found CereProc. Ebert came across the Bush-o-Matic talking head, a hilarious re-creation of the 43rd president. I must admit: Bush never sounded so articulate! You can make him say things that are intelligent, and even make him wink, squint or blink.

The CereProc engineers pieced the voice of Bush together from his weekly radio addresses. It’s kind of scary, but in a fun way. Just to be fair, they also added a virtual version of President Obama’s voice and the inimitable accent of the former governor of California, Arnold Schwarzenegger.

As you can tell from the audio samples, CereProc is getting close, but they’re not quite there yet. One of the biggest challenges any TTS provider needs to overcome is how to add some emotion to the speech. Most artificial voices still sound a bit flat and get very boring very quickly. And for ordinary mortals, it’s still too expensive to re-create their own voice with the help of this technology.


So, do you think it’s time for professional voice-overs to pack their bags and start looking for other work? Yes and no.

First of all, text-to-speech companies all over the world use voice talent to record different languages and accents for different applications. Secondly, if you’re a musician, you might find this technological development very interesting but non-threatening.

As you probably know, any musical instrument under the sun has been sampled, and entire symphony orchestras can come out of a can. Yet, people are still buying real Steinways and there are plenty of musicians who make a very decent living.

Do you think that we’ll ever see the time when Stravinsky’s “Rite of Spring,” as performed on virtual instruments, will win a Grammy? I don’t think so. Will a laboratory ever be able to produce a recording of Bach’s solo cello suites that rivals the depth of Yo-Yo Ma’s interpretation?

You see, there’s still hope for the most subtle, most flexible, most surprising and unique of all instruments: the human voice.  

Here’s the rub: robots have a hard time emoting. They can patiently and dispassionately guide you to the next exit, but they have a hard time expressing even the most basic of feelings such as fear, anger, hurt, guilt and… love.

However, give it a few years, and who knows what the industry will come up with!

Paul Strikwerda ©nethervoice
