Authors: Álvaro Martín-Cortinas, Daniel Sáez-Trigueros, Iván Vallés-Pérez, Biel Tura-Vecino, Piotr Biliński, Mateusz Lajszczak, Grzegorz Beringer, Roberto Barra-Chicote, Jaime Lorenzo-Trueba
Published on: February 05, 2024
Impact Score: 8.4
arXiv code: arXiv:2402.03407
Summary
- What is new: A new self-supervised Voice Conversion (VC) architecture that creates speaker-disentangled representations to train Large Language Models (LLMs) for text-to-speech, improving stability and naturalness.
- Why this is important: LLMs used for speech generation suffer from stability issues such as hallucinations, content skipping, and speech repetitions.
- What the research proposes: Introducing a self-supervised VC architecture that enables the separation of transitory features from stationary ones during the training of LLMs, which improves the performance of text-to-speech systems.
- Results: The system achieved a 4.7 percentage-point (pp) increase in speaker similarity and a 5.4 pp reduction in word error rate over state-of-the-art models, and reached higher naturalness than human recordings.
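To make the disentanglement idea concrete, here is a deliberately simplified numpy sketch. It stands in for the paper's learned VC architecture with crude per-utterance statistics: the "stationary" speaker component is approximated by the utterance-level mean of the features, and the "transitory" content is the mean-removed residual. All function names are illustrative, not from the paper.

```python
import numpy as np

def disentangle(features):
    """Toy split of utterance features (frames x dims) into a stationary
    speaker component (the per-utterance mean, a crude stand-in for a
    learned speaker embedding) and transitory content (the residual)."""
    speaker = features.mean(axis=0)
    content = features - speaker
    return content, speaker

def convert(src_features, tgt_features):
    """Toy voice conversion: keep the source's content, swap in the
    target's speaker statistics."""
    content, _ = disentangle(src_features)
    _, tgt_speaker = disentangle(tgt_features)
    return content + tgt_speaker
```

In this sketch the converted features inherit the target speaker's statistics while preserving the source's frame-level variation; the paper replaces these hand-crafted statistics with representations learned self-supervised, which is what lets the downstream LLM model content without speaker leakage.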
Technical Details
Technological frameworks used: Self-supervised Voice Conversion architecture
Models used: Large Language Models (LLMs)
Data used: LibriTTS test-other dataset
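The headline metrics (word error rate and speaker similarity) can be computed roughly as below. This is a generic sketch, not the paper's evaluation code: WER is word-level Levenshtein distance normalized by reference length, and speaker similarity is cosine similarity between speaker embeddings (the paper's choice of embedding model is not reproduced here).

```python
import numpy as np

def word_error_rate(ref, hyp):
    """Word-level Levenshtein distance divided by reference length."""
    r, h = ref.split(), hyp.split()
    d = np.zeros((len(r) + 1, len(h) + 1), dtype=int)
    d[:, 0] = np.arange(len(r) + 1)  # deletions
    d[0, :] = np.arange(len(h) + 1)  # insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1  # substitution cost
            d[i, j] = min(d[i - 1, j] + 1,        # deletion
                          d[i, j - 1] + 1,        # insertion
                          d[i - 1, j - 1] + cost) # substitution/match
    return d[len(r), len(h)] / len(r)

def speaker_similarity(emb_a, emb_b):
    """Cosine similarity between two speaker embedding vectors."""
    return float(np.dot(emb_a, emb_b)
                 / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))
```

A "pp" (percentage point) change in the summary above is an absolute difference between two such scores expressed as percentages, e.g. WER dropping from 10.0% to 4.6% is a 5.4 pp reduction.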
Potential Impact
Speech synthesis and voice assistant technology companies, audiobook production, and any market relying on advanced text-to-speech applications.