The major open-source TTS engine is Coqui. Others to consider: Piper, and Whisper (which handles speech recognition rather than synthesis).
XTTS model weights from Coqui - coqui-ai/TTS: 🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
Related to above: StyleTTS 2 | llm-tracker and a shootoff; via
VITS, via “the model has 40M parameters, is 150MB in size, and runs on CPU”
It does the job for most on-device use cases, such as reading an article aloud or practicing a language. Here’s how to use it with Transformers:
Set up your environment: pip install transformers accelerate phonemizer
Initialize the model:
import torch
from transformers import VitsModel, AutoTokenizer
model = VitsModel.from_pretrained("kakao-enterprise/vits-vctk")
tokenizer = AutoTokenizer.from_pretrained("kakao-enterprise/vits-vctk")
# Pass the text you'd like to synthesise:
text = "Hey, it's Max the best doggo speaking!"
inputs = tokenizer(text, return_tensors="pt")
# Generate audio
with torch.no_grad():
    output = model(**inputs).waveform[0]
Bonus: you’ll soon be able to fine-tune these models on your own voice or dataset too!
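The VITS snippet above stops with the waveform in memory, so here is a minimal sketch of one way to write such a waveform to a WAV file using only the Python standard library. The helper name `save_waveform`, the output file name, and the synthetic 440 Hz tone standing in for the model output are all assumptions made for illustration; they are not part of the quoted post or of Transformers.

```python
import math
import os
import struct
import tempfile
import wave


def save_waveform(samples, sampling_rate, path):
    """Write mono float samples in [-1.0, 1.0] to a 16-bit PCM WAV file.

    Hypothetical helper for illustration; returns the number of frames written.
    """
    # Clip each sample to the valid range, then scale to signed 16-bit integers.
    pcm = [int(max(-1.0, min(1.0, float(s))) * 32767) for s in samples]
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)            # mono
        wav_file.setsampwidth(2)            # 2 bytes per sample = 16-bit
        wav_file.setframerate(sampling_rate)
        wav_file.writeframes(struct.pack("<%dh" % len(pcm), *pcm))
    return len(pcm)


# Stand-in for the model output (assumption): one second of a 440 Hz tone.
rate = 22050
tone = [0.5 * math.sin(2 * math.pi * 440 * n / rate) for n in range(rate)]

out_path = os.path.join(tempfile.gettempdir(), "vits_demo.wav")
frames_written = save_waveform(tone, rate, out_path)
print(f"wrote {frames_written} frames to {out_path}")
```

With the model loaded as in the snippet above, the same helper could presumably be called as `save_waveform(output.tolist(), model.config.sampling_rate, "speech.wav")`, assuming the checkpoint's config exposes `sampling_rate` as the Transformers VITS documentation describes.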
Indian language TTS
… I’m especially interested in SOTA, open-source, widely available Kannada models. If none exist, there is an opportunity to develop one.
Indic TTS - Synthesis Docs github - AI4Bharat/Indic-TTS: Text-to-Speech for languages of India
Kannada (ಕನ್ನಡ) Text To Speech (TTS) Demo
Books
Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, by Dan Jurafsky and James H. Martin