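The major open-source engine is Coqui. Others to consider: Piper, and Whisper (a speech-recognition model rather than a TTS engine, so relevant only for the speech-to-text side).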
XTTS model weights from Coqui - coqui-ai/TTS: 🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
KoljaB/RealtimeTTS: Converts text to speech in realtime by identifying sentence fragments for immediate auditory feedback. Ideal for applications requiring instant audio responses.
yl4579/StyleTTS2: StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models
Related to above: StyleTTS 2 | llm-tracker and a shootoff ; via
vits – via “model is 40M parameter and 150MB in size, and works on-CPU runtime”
Kim, Jaehyeon, Jungil Kong, and Juhee Son. “Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech,” 2021. https://arxiv.org/abs/2106.06103 .
It does its job for most on-device use cases, like reading an article or practicing a language. Here’s how you can use it with Transformers:
Set up your environment:
pip install transformers accelerate phonemizer
Initialise the model:
    import torch
    from transformers import VitsModel, AutoTokenizer

    model = VitsModel.from_pretrained("kakao-enterprise/vits-vctk")
    tokenizer = AutoTokenizer.from_pretrained("kakao-enterprise/vits-vctk")

    # Pass the text you'd like to synthesise:
    text = "Hey, it's Max the best doggo speaking!"
    inputs = tokenizer(text, return_tensors="pt")

    # Generate audio (no gradients needed for inference)
    with torch.no_grad():
        output = model(**inputs).waveform
Bonus: you’ll soon be able to fine-tune them on your own voice/dataset too!
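The Transformers snippet above stops at the waveform tensor, so nothing is audible yet. A minimal sketch of one way to write it to disk follows, using only the Python standard library. The helper name save_waveform and the file name are hypothetical, and it assumes the waveform has already been flattened to a plain list of floats in [-1.0, 1.0], for example with output.squeeze().tolist(), with the sampling rate read from model.config.sampling_rate.

```python
import struct
import wave


def save_waveform(samples, path, sampling_rate):
    """Write mono float samples in [-1.0, 1.0] to a 16-bit PCM WAV file.

    Hypothetical helper for illustration; returns the number of samples written.
    """
    # Clip each sample to the valid range, then scale to signed 16-bit integers.
    pcm = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)              # single (mono) channel
        wav_file.setsampwidth(2)              # 2 bytes per sample = 16-bit
        wav_file.setframerate(sampling_rate)  # samples per second
        wav_file.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
    return len(pcm)


# Assumed usage with the objects from the snippet above:
# save_waveform(output.squeeze().tolist(), "max.wav", model.config.sampling_rate)
```

Converting the tensor to a list keeps the sketch free of extra dependencies; for long clips, a NumPy-based writer such as scipy.io.wavfile.write would likely be faster.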
Indian language TTS
… I’m especially interested in SOTA, open-source, widely available Kannada models. If none exist, there is an opportunity to develop one.
Indic TTS - Synthesis Docs
github - AI4Bharat/Indic-TTS: Text-to-Speech for languages of India
Kannada (ಕನ್ನಡ) Text To Speech (TTS) Demo
Bhashini Neural TTS