For about a year, I was working on Listenly — an app to listen to text content with OpenAI's natural-sounding text-to-speech model.
At some point, I realized that it would be cool to take all the public domain e-books and create audio versions of them. So I did it... kind of.
It would cost an immense amount of money to generate all the audio right away (OpenAI TTS costs approximately $0.84/hour of audio; 11labs, for comparison, is 10 times more expensive). So, I took a more gradual approach.
I took all the metadata from the Project Gutenberg catalog (it's about 70GB of dirty XML), cleaned it, put it into my database, and created a browsable catalog. When the first user visits a book page on Listenly, I download the full text of the book, save it in my cloud storage, and calculate the price for audio generation based on the book's length. Then, if the user decides to purchase it, we generate the audio.
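A minimal sketch of the length-based price calculation described above, in Python. The $0.84 per audio hour is the figure quoted earlier in this post; the narration rate (`CHARS_PER_AUDIO_HOUR`), the function names, and the idea of estimating from the character count are assumptions for illustration, not Listenly's actual code.

```python
# Hypothetical sketch: estimate the audio-generation cost of a book from its text length.
API_COST_PER_AUDIO_HOUR = 0.84  # USD, figure quoted in the post
CHARS_PER_AUDIO_HOUR = 56_000   # ASSUMPTION: average narration rate, tune against real output


def estimate_audio_hours(text: str) -> float:
    """Estimate narration length in hours from the character count."""
    return len(text) / CHARS_PER_AUDIO_HOUR


def estimate_generation_cost(text: str) -> float:
    """Estimate the API cost in USD, rounded to cents."""
    return round(estimate_audio_hours(text) * API_COST_PER_AUDIO_HOUR, 2)


if __name__ == "__main__":
    book_text = "x" * 560_000  # stand-in for a downloaded book of 560,000 characters
    hours = estimate_audio_hours(book_text)
    cost = estimate_generation_cost(book_text)
    print(f"{hours:.1f} h of audio, about ${cost:.2f} to generate")
```

Because the estimate depends only on the stored text, it can be computed once on the first visit and saved alongside the book.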
I know it’s not perfect.
I've burned out a couple of times already while doing it.
But still, I need to show it to the world. And I’ll be glad to hear your feedback.
Peace.
Check out their voice samples: https://rhasspy.github.io/piper-samples/ (or make your own).
Happy to help you set it up locally...
https://github.com/rhasspy/piper
I'd find Piper too jarring for audio books because of the big quality difference, but I actually prefer it for things like AI assistants as I don't necessarily "want" AI assistants to sound perfectly human, and prefer the stylistic choice of having them sound more computer-generated.
Much less enjoyable than with OpenAI TTS.
Here is Pride and Prejudice, and up the thread you can see another web novel example:
https://twitter.com/HarrisonJackson/status/18109373574214537...
ElevenLabs has so many great voice models but is super expensive. I want to experiment with some OSS voice models and even train my own, but I'm not sure where a good starting point is. Play.ht has some good voices, too.
Seeing some of the results here with OpenAI TTS, I will probably switch at least the narrator to one of these voices to save some money.
I think you should try OpenAI's voices for characters too. They're really good at catching the emotions. They can even scream! https://x.com/ivryb/status/1780210661189992877
As a rabid audiobook consumer, I do have a couple of suggestions.
An easy one - currently you only use the Onyx voice from OpenAI. I'd recommend that at the very least you match the gender of the voice to the gender of the author. I find this is pretty common with published audiobooks, and I find it helps bring out the tone of the author more.
A harder one - most great audiobook narrators change their voice depending on the character speaking. If you really wanted to go in depth here, parsing the text by character and matching them to a voice would go a long way in making these more listenable. It would be fairly straightforward (albeit more expensive) to parse these books with an LLM and ask it to add inline markdown for the right voice options for each speaking character.
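One way to consume that kind of LLM-added markup, sketched in Python. The `[[Speaker]]` tag syntax, the speaker-to-voice table, and the function names are all hypothetical; the voice names are OpenAI TTS voices, but which voice suits which character is purely an assumption.

```python
import re

# HYPOTHETICAL markup: the LLM prefixes each passage with a [[Speaker]] tag.
TAG = re.compile(r"\[\[(?P<speaker>[^\]]+)\]\]")

# ASSUMPTION: an arbitrary mapping from speaker to voice name.
VOICES = {"Narrator": "onyx", "Elizabeth": "nova", "Darcy": "echo"}
DEFAULT_VOICE = "onyx"


def _append(segments: list, speaker: str, passage: str) -> None:
    """Add a (speaker, voice, passage) triple, skipping empty passages."""
    passage = passage.strip()
    if passage:
        segments.append((speaker, VOICES.get(speaker, DEFAULT_VOICE), passage))


def split_by_speaker(tagged_text: str) -> list:
    """Split tagged text into (speaker, voice, passage) triples in reading order.

    Text before the first tag is attributed to the narrator.
    """
    segments = []
    speaker, start = "Narrator", 0
    for match in TAG.finditer(tagged_text):
        _append(segments, speaker, tagged_text[start:match.start()])
        speaker, start = match.group("speaker").strip(), match.end()
    _append(segments, speaker, tagged_text[start:])
    return segments


if __name__ == "__main__":
    sample = '[[Narrator]] She smiled. [[Darcy]] "I am perfectly serious." [[Narrator]] He bowed.'
    for speaker, voice, passage in split_by_speaker(sample):
        print(f"{speaker:<9} {voice:<5} {passage}")
```

Each triple can then be sent to the TTS service as its own request, which is where the extra cost comes from.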
Given a great narration in one language, have a model annotate the tone and emotion of the narrator for each sentence, and re-apply these emotions to the voice synthesis for a target language, on the translated version.
Narration/recitation is such an orthogonal axis to the story and literary style, and an integral part of the experience.
I prefer the voice to match the protagonist. Or better yet an audio play with the narrator voice plus a voice matched to each speaker.
This is the kind of bikeshedding that AI text-to-voice can make moot. We can all have it our own way. That's an argument for generating the voice just in time rather than as a batch. But as long as such tools aren't ubiquitous this batch is a great public service.
Oh, please don't! I find this extremely disorienting! When I am listening to an audiobook, I am not listening to the voice. I am transported, envisioning another world, and changing voices often breaks the immersion by forcing me to re-calibrate to their cadence, tenor, accent, etc.
> We can all have it our own way.
Ah, well, yes. Of course. Never mind, then!
J.K. Rowling's books, as mentioned, are famously well read by Stephen Fry and Jim Dale, both male voices for an initialed (but female) author.
But what then of J.K. Rowling's Cormoran Strike series, written under the pseudonym Robert Galbraith?
Female or male voice for modern crime fiction male detective novels?
There are likely many factors that go into selecting the right voice here - my main point is the same voice shouldn't be used for all books. It's likely a simple heuristic is better than "male voice for all", though no approach will be perfect without the opinion of the author, which isn't available unfortunately.
My comment wasn't entirely clear, it was the "voice matched to author gender" part that prompted a response.
In the grand scheme of things, any voice reading aloud is an advance for people who require or like to hear books read out; improvements can come on a per-book basis.
The end goal is likely a mix of bespoke readings by gifted voice readers (Fry) and guided "selectable AI voice" readings that can do clear and correct pronunciation and pacing with the voice of Jamie Erl Jonas (totally not James Earl Jones), Skarlat Johnson, or that Chipmunk character.
> I don’t think people read the book in the tone of the author
People certainly do in some instances, but an interesting thing about generated content going forward is folks will likely have the ability to choose on demand.
Cos if so - cool, that’s a lovely model. And you should make more of it. There’s a definite feel good factor associated with this. You could probably also charge a bit more - $5 for a thing I get alone vs $10 for a thing that I get but everyone else gets for free too seems a no brainer incentive to me.
FWIW I find Omnivore[0] to be really compellingly realistic TTS. I don’t know what they use but it’s pretty great imo.
[0] https://omnivore.app/
You could also get some credits from these companies in return for advertising “this book is sponsored by blah company”
I like the idea of letting people donate the audio they purchased to the community.
Although I'm scared that I'll have no money.
Although I'm still concerned about the costs of maintaining all these MP3s.
They're using a previous generation of TTS models, which most of the reader apps are using. They're reasonable and cheap, but sound noticeably worse than OpenAI's or 11Labs'. I don't like them.
https://marhamilresearch4.blob.core.windows.net/gutenberg-pu...
Jason Cohen's blog posts and TechCrunch "Startup Weekly" newsletter are also great to listen to.
In terms of books, I'm not as active. But actually, I like Churchill's books much better with this AI narration than any that I've found on Audible. It looks like they're trying to narrate Churchill's books as if Churchill would, and it's not a good thing.
I think it's already very good in terms of sound quality. If not for fiction, then for professional literature, it's just great.
Some people have already purchased some books, finding them through Google (it is indexing all the pages right now, but it is taking some time, as there are 100,000+ pages for all the books, authors, and subjects).
Side note, this almost feels like something 2012 Google would have done, a la their scanning of the Library of Congress. Something to show off their text-to-speech.
A sample of the first chapter is available here:
https://fairpublishing.org/index.php/ebooks/sample-audiobook...
The voice quality and pronunciation are excellent. However, the system struggles with acting, so the tone and emotional expression are often wrong during dialogues. Additionally, I have to fragment the text into short paragraphs, making it challenging to set appropriate break durations, resulting in an unnatural rhythm.
Despite the technical quality and my appreciation for the reading voice, I won't continue in this direction.
ElevenLabs is quite expensive, but it would be worth it if the final result were good enough for listeners to purchase the audiobook.
I don't know if using OpenAI's API in English would yield better results. However, OpenAI's performance in non-English languages is not satisfactory.
Maybe generating a bunch of runs and then asking the users to vote could get us the best narrated book overall.
And yeah, OpenAI's model is bad for non-English languages. At least, for now...
I sadly found an AI audio project I don't support: This person was instead summarizing popular books into 10 minutes of audio. Basically trying to SEO better than the author and I know the authors aren't compensated. That just left me feeling sad. (I know book summaries for busy people have been a thing for a while, but this just all felt so opportunistic.)
As I search podcasts these days, I'm finding more and more of these low-effort, "doesn't take more than a few minutes to set up, why not" type AI-generated spam cannons. Been hard for a while but it's about to get REALLY hard to separate the wheat from the chaff.
When playback is about 1-2 minutes from the end of the current chunk, I start generating the next one, for a seamless transition.
One chunk takes about 20-40 seconds to generate (OpenAI API is 20-30s, Azure OpenAI API is ~40s).
I was planning to convert the whole book (just by queuing and parallelizing the requests) and concatenate it into a single MP3 (or an MP3 for each chapter), but it's not ready yet.
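The prefetch rule described above could be sketched like this in Python. The 90-second threshold is one value inside the 1-2 minute window, and the 40-second worst case is the slowest generation time quoted above; the function and parameter names are hypothetical, not the app's real code.

```python
# Hypothetical sketch of "start generating the next chunk shortly before this one ends".
PREFETCH_THRESHOLD_S = 90.0     # ASSUMPTION: a value inside the 1-2 minute window
WORST_CASE_GENERATION_S = 40.0  # slowest generation time quoted above (Azure OpenAI)


def should_prefetch(position_s: float, chunk_duration_s: float, next_requested: bool) -> bool:
    """Return True once playback is close enough to the end of the current chunk."""
    remaining_s = chunk_duration_s - position_s
    return not next_requested and remaining_s <= PREFETCH_THRESHOLD_S


def prefetch_margin_s(playback_speed: float = 1.0) -> float:
    """Seconds of slack left if generation takes the worst case.

    Faster playback uses up the remaining audio sooner, so the margin shrinks.
    """
    return PREFETCH_THRESHOLD_S / playback_speed - WORST_CASE_GENERATION_S


if __name__ == "__main__":
    print(should_prefetch(position_s=520, chunk_duration_s=600, next_requested=False))
    print(prefetch_margin_s(1.0), prefetch_margin_s(2.0))
```

The second helper shows why the threshold cannot be much shorter: at 2x playback, a 90-second head start leaves only a few seconds of slack against a 40-second generation.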
I also read summaries of books for research purposes or for dull school homework.
They both have a place, before or after AI.
What's the problem?
Maybe AI-generated books should also be a part of Librivox.
I tried to listen to some, but the quality of narration was bad.
It seems like you did a lot of good technical work, but I find this project entirely useless and a waste of resources.
I'm really enjoying listening to nonfiction – history, philosophy, biographies.
There are 70,000 audiobooks in the catalog, and people can listen to them. If audio is generated on-demand in the background, it does not make them "not-audiobooks", and it does not make my post a lie. "If it looks like a duck..."
It's just a technical implementation detail. And I'm not hiding it; I'm describing it in the post
You didn't.
That's a lie.
Most people don't like to be lied to.
And I don't see any lie there.
https://www.hackerneue.com/item?id=40964863
https://www.hackerneue.com/item?id=40963194
I explained what I did in detail. I'm open in the comment section and explained my reasoning regarding the pricing. I've made practically no money off of this project so far.
There is an option to cache, but there is also an option to crowd-source, which makes the price for the first person smaller.
Moreover, if you try to buy an 'hour plan' for $15 and listen to any PG book, you will not be billed for the converted chunk, so the caching works as you'd expect.
Flagging feels so extremely unfair.
It's just a technical implementation detail. And I'm not hiding it; I'm describing it in the post. I cannot describe the implementation detail in the short title.
It's just that you decided to believe that it's a lie, saying it very confidently, and taking down the post that was received generally very positively.
2. It's not a subscription, it's one-time purchase of hours.
I just wrote a catchy title (which can be a bit misleading, but not dramatically, as all the audiobooks I'm mentioning are really accessible to people; I developed all the infrastructure needed for that), and tried to clarify everything in the post itself.
How about "just-in-time generation" of 70k audiobooks?
Have you made any attempts at multiple narrators telling a story?
Microsoft's Azure has a great tool for doing this, but it's time-consuming as you have to take all the text & match it to the narrator by hand. OpenAI's last big demo kind of showed using voice chat to change narrator voices on the fly.
I think it would be awesome if you could submit a book, have a simple tool parse through & find all the speakers. Then let you sample how each one sounds with a brief description of what the person is like. Basically you get to have each voice do an audition & you pick your favorites. Then it goes through page by page generating audio based on the voices selected.
I'm not suggesting this feature for the app. I'm just throwing out this idea as one I've been thinking about. There have been a lot of books I've wanted to listen to but don't have time to sit down & read.
Right now, my paid users are listening mostly to non-fiction, so it seems like they don't need it.
But this whole Project Gutenberg saga is kinda diluting everything, and I need to think about which users/market to focus on.
Will see :)
Pricing: maybe try a mobile app with monthly subscription? Something for recurring revenue.
Features: can you generate at 1.5x speed? Might be more natural than the playback speed up options and be a nice differentiator.
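For reference, OpenAI's speech endpoint accepts a `speed` parameter (documented range 0.25 to 4.0), so generating at 1.5x is mostly a matter of the request body. A minimal Python sketch that only builds the payload and sends nothing; the helper name and the choice of `tts-1` and `onyx` as defaults are assumptions for illustration.

```python
import json

MIN_SPEED, MAX_SPEED = 0.25, 4.0  # range documented for the `speed` parameter


def build_speech_request(text: str, speed: float = 1.5, voice: str = "onyx") -> dict:
    """Build a JSON body for POST /v1/audio/speech (hypothetical helper, no network call)."""
    if not MIN_SPEED <= speed <= MAX_SPEED:
        raise ValueError(f"speed must be between {MIN_SPEED} and {MAX_SPEED}")
    return {"model": "tts-1", "voice": voice, "input": text, "speed": speed}


if __name__ == "__main__":
    print(json.dumps(build_speech_request("Call me Ishmael."), indent=2))
```

One trade-off worth noting: audio generated at a fixed speed would have to be cached per speed, whereas player-side speed-up reuses a single file.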
Regarding the subscription — I thought that no subscription was actually a competitive advantage, but now so many people are telling me to do it, that I'm really not sure anymore.
Re:subscription - maybe try both and see? Assuming most people listen to the same books, your generation costs should plummet pretty quickly.
I wish the OP well, and the project is nicely designed. But AI simply isn't there for this yet, not without a lot of individual hand holding and extra work.
It's already great for that purpose.
I've spent a fair amount of time listening to free audiobooks (https://archive.org/details/librivoxaudio) including many that are out of copyright like these, as opposed to modern but in the public domain.
After listening to a few minutes of "Frankenstein" on his site, I would say that these OpenAI-generated voices sound better than almost all of the human-read ones on LibriVox, both in audio and performance quality -- these are voices that are designed to sound good, and they succeed at that.
Plus, sometimes available human narrations are so bad that you really would like to listen to an AI one (I've experienced it with Churchill's audiobooks on Audible).
I don't know if it will work. It felt like it should work, at least for pSEO.
I got my first two audiobook purchases two weeks after I submitted the sitemap to Google. They were some romantic novels. But now it's flatlined again.
Will see...
https://docs.lemonsqueezy.com/help/checkout/payment-methods#...
I was thinking about it.
On the one hand, I want to make money. On the other hand, I understand that making everything available for free would be much more aligned with the Project Gutenberg philosophy.
I left my job and am living on savings, and in the last year Listenly made only $400, or roughly $35 MRR. Although I was not doing much marketing.
I'm dreaming of it making $1k, $3k, $5k MRR.
Right now, I set the price to be 50% of the API cost, so I would make a profit starting from the third purchase of the same book.
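The break-even arithmetic behind that, as a small Python sketch. The 50% price ratio is from the comment above; the function name and the example cost are made up, and the calculation assumes the audio is generated once and then served from cache with no further API cost.

```python
def first_profitable_purchase(api_cost: float, price_ratio: float = 0.5) -> int:
    """Return the purchase number at which cumulative revenue first exceeds the one-time API cost."""
    if api_cost <= 0 or price_ratio <= 0:
        raise ValueError("api_cost and price_ratio must be positive")
    price = api_cost * price_ratio
    purchases, revenue = 0, 0.0
    while revenue <= api_cost:  # not yet in profit
        purchases += 1
        revenue += price
    return purchases


if __name__ == "__main__":
    # Example: a book that costs $8.40 to generate and sells for $4.20.
    print(first_profitable_purchase(8.40))
```

With a 50% ratio, two purchases only recover the generation cost, so the third is the first one that produces a profit.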
But maybe I should make it a fully social project, get some donations, and treat it as a "lead magnet" to monetize something else. I'm open to your suggestions!
Monetizing is good but there is no value proposition in the product.
The chances I'll get something I'd like to listen to are low because:
- AI errors
- AI lack of emotion
- You picked a voice I've heard in thousands of automatically generated YouTube videos and that I came to hate.
There is no chance I'd buy this, I'd rather buy an audiobook made by a human.
Now, people may not understand that - but then they'll be disappointed, bother you for a refund (chargebacks are $15 a pop if you don't) or just speak badly about the project. Repeat sales potential is pretty bad imho.
I hope I don't come across as rude.
If you are really set on this idea, I'd recommend generating one book, polishing it until it reads like it should, and then selling it on as many platforms as you can (Amazon mainly, I guess). Maybe use a custom cloned voice so it will sound unique and consistent across all books. You don't need a website, but you have one, so you might as well use it for marketing and maybe to gauge interest for the next book to process.
An audiobook is a good product in itself.
Bark has potential but the voice quality is pretty off.
The tortoise fork, which improves the model and restores cloning (the author of tortoise decided it was too dangerous and crippled the project), is ok with some voices but it takes a lot of tries.
Voicebox from Meta is pretty good, comparable quality to ElevenLabs, but it's research-only for now.
Pretty sad overall.
Additional auth providers and UI theming were not a priority, and frankly, this is the first time I have received such a request.
But you're right, I definitely will do it.
The best books should already exist in audio and you can already show examples of the quality.
Has no one used this yet? Do you not store the generated result?
I mean it's fine to make money but you state it differently.
Nonetheless, I like the project, I'm impressed with the examples, and I also like the approach.
If you don't want to open source it, send me an email: anthony@chovy.com -- I'd like to collaborate with you privately if I can run my own instance.
I'm really not sure about fully open-sourcing it. It's generally good for developer-focused products, but for Listenly... I just can't see the benefits. But I might be very wrong.
Anyway, hit me up on https://t.me/chovy2 or https://fightclub.profullstack.com -- I'd like to help you out.
I expect your costs to go down over time, which is nice.
I was thinking about launching a Kickstarter campaign and making the whole library free for everyone. But I need more feedback. I don't know if it's viable.