Authors: Dan Lyth, Simon King
Published on: February 02, 2024
Impact Score: 8.07
arXiv code: arXiv:2402.01912
Summary
- What is new: A scalable method for labeling speaker identity, style, and recording conditions, allowing high-fidelity speech generation with natural language prompting.
- Why this is important: Existing text-to-speech models require reference speech recordings to control speaker identity and style, which limits creative uses.
- What the research proposes: A method for scalable labeling of speaker characteristics applied to a dataset for training a model that can generate diverse, high-quality speech using natural language prompts.
- Results: The model significantly outperforms recent work, generating high-fidelity speech across a range of accents, styles, and recording conditions from intuitive natural language prompts.
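To make the labeling idea in the summary more concrete, here is a minimal sketch of how per-utterance measurements might be turned into a natural language description. It is not the authors' implementation. The attribute set, the bin edges, the label wording, and the names `Utterance`, `to_label`, and `describe` are all assumptions made for illustration; the summary does not say which attributes are measured or how the labels are produced.

```python
from dataclasses import dataclass

# Hypothetical bin edges and labels: illustrative values only, not from the paper.
# Each entry is (exclusive upper edge, label); the last bin is open-ended.
PITCH_BINS = [(120.0, "low-pitched"), (200.0, "medium-pitched"), (float("inf"), "high-pitched")]
RATE_BINS = [(3.0, "slowly"), (5.0, "at a moderate pace"), (float("inf"), "quickly")]
SNR_BINS = [(15.0, "noisy"), (30.0, "fairly clean"), (float("inf"), "very clean")]


def to_label(value, bins):
    """Return the label of the first bin whose upper edge exceeds value."""
    for upper, label in bins:
        if value < upper:
            return label
    raise ValueError(f"no bin for value {value!r}")


@dataclass
class Utterance:
    """Assumed per-utterance measurements; the field names are hypothetical."""
    mean_pitch_hz: float
    syllables_per_sec: float
    snr_db: float
    accent: str


def describe(utt):
    """Compose a plain-English description from the binned attributes."""
    pitch = to_label(utt.mean_pitch_hz, PITCH_BINS)
    rate = to_label(utt.syllables_per_sec, RATE_BINS)
    noise = to_label(utt.snr_db, SNR_BINS)
    return f"A {pitch} speaker with a {utt.accent} accent talks {rate} in a {noise} recording."


if __name__ == "__main__":
    print(describe(Utterance(mean_pitch_hz=95.0, syllables_per_sec=5.6, snr_db=42.0, accent="Scottish")))
```

Because every label is derived automatically from a measurement, a scheme like this can be applied to an arbitrarily large corpus without human annotators, which is the sense of "scalable" in the summary. Descriptions of this kind could then be paired with transcripts as conditioning text when training the speech model.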
Technical Details
Technological frameworks used: not specified
Models used: Speech language model
Data used: 45k-hour dataset
Potential Impact
Speech synthesis and voice assistant markets, multimedia content creation industries