Yeah, that bit about each phoneme sounding exactly the same every time really made a lot of sense. Even if the TTS phoneme sounds nothing like a human would say it, once you've heard it enough times, you just memorize it.
I guess sounding "natural" really just amounts to adding variation across the sentence, which destroys phoneme-level accuracy.
> When I listened to the voice sample in that section of the article, it sounds very choppy and almost like every phoneme isn't captured.
Every syllable is being captured, just sped up so that the pauses between them are much smaller than usual.
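As a toy illustration of "same syllables, smaller pauses": the sketch below (my own assumption about one naive way to do it, not how any real screen reader works) keeps loud samples untouched and caps each stretch of near-silence at a fixed length. The function name `shrink_pauses` and all thresholds are made up for the example.

```python
import numpy as np

def shrink_pauses(signal, rate, threshold=0.01, max_pause_s=0.02):
    """Naively compress audio by truncating long quiet stretches.

    Loud samples pass through unchanged; any run of samples whose
    amplitude is below `threshold` is capped at `max_pause_s` seconds.
    A toy sketch, not a real TTS or screen-reader technique.
    """
    max_pause = int(max_pause_s * rate)
    out, run = [], 0
    for sample in signal:
        if abs(sample) < threshold:
            run += 1
            if run > max_pause:
                continue  # drop the remainder of this pause
        else:
            run = 0
        out.append(sample)
    return np.array(out)

# Synthetic "speech": two 0.1 s tones separated by a 0.3 s pause.
rate = 8000
tone = np.sin(2 * np.pi * 440 * np.arange(int(0.1 * rate)) / rate)
pause = np.zeros(int(0.3 * rate))
speech = np.concatenate([tone, pause, tone])

fast = shrink_pauses(speech, rate)
print(len(speech) / rate, len(fast) / rate)  # the pause shrinks; the tones survive intact
```

Real engines do something smarter (time-stretching that also compresses the voiced parts without shifting pitch), but the effect on the pauses is the same.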
Sounds like the robotic voice is more important than we give it credit for, though - from the article's "Do You Really Understand What It’s Saying?" section:
> Unlike human speech, a screen reader’s synthetic voice reads a word in the same way every time. This makes it possible to get used to how it speaks. With years of practice, comprehension becomes automatic. This is just like learning a new language.
When I listened to the voice sample in that section of the article, it sounded very choppy, almost as if not every phoneme was captured. Now, maybe they (the phonemes) are all captured, or maybe they actually aren't - but the fact that the sound per word is _exactly_ the same, every time, suggests that each sound is a precise substitute for the 'full' or 'slow' word, meaning that any variation introduced by a "natural" voice could actually make the 8x speech unintelligible.
Hope the author can shed a bit of light on this - it's so neat! I remember ~20 years ago the Sidekick (or a similar phone) seemed to be popular in blind communities because it also had settings to significantly speed up TTS, which someone let me listen to once, and it sounded just as foreign as the recording in TFA.