Authors: Zhichao Wang, Yuanzhe Chen, Xinsheng Wang, Zhuo Chen, Lei Xie, Yuping Wang, Yuxuan Wang
Published on: January 19, 2024
Impact Score: 8.3
arXiv ID: 2401.11053
Summary
- What is new: StreamVoice, presented as the first language model (LM)-based zero-shot voice conversion (VC) model that works in streaming mode without any future look-ahead.
- Why this is important: Existing LM-based VC models need the complete source utterance and can therefore only convert offline, which rules out real-time applications.
- What the research proposes: StreamVoice pairs a fully causal, context-aware LM with a temporal-independent acoustic predictor so that conversion can run in real time. Two strategies, teacher-guided context foresight and semantic masking, are used to enhance performance.
- Results: StreamVoice converts speech in a streaming fashion while keeping zero-shot performance comparable to non-streaming VC systems.
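To make "fully causal, no future look-ahead" concrete, here is a minimal Python sketch of a streaming conversion loop. It is not the StreamVoice implementation: the per-frame interleaving of source tokens and generated acoustic tokens, the names `stream_convert` and `toy_predictor`, and the toy prediction rule are all assumptions made for illustration, and the target-speaker prompt is left out.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

# One history entry: ("sem", token) for a source semantic token,
# ("ac", token) for an acoustic token the model has already produced.
History = List[Tuple[str, int]]
Predictor = Callable[[History], int]


def toy_predictor(history: History) -> int:
    """Hypothetical stand-in for the causal LM (a real one is a trained network).

    It only reads `history`, i.e. frames that have already arrived.
    """
    semantic = [tok for kind, tok in history if kind == "sem"]
    acoustic = [tok for kind, tok in history if kind == "ac"]
    previous = acoustic[-1] if acoustic else 0
    return (semantic[-1] + previous) % 1024


def stream_convert(
    semantic_tokens: Iterable[int],
    predict: Predictor = toy_predictor,
) -> Iterator[int]:
    """Emit one acoustic token per incoming source frame, using past context only."""
    history: History = []
    for sem in semantic_tokens:
        history.append(("sem", sem))  # the new source frame arrives
        ac = predict(history)         # predict from past and current frames only
        history.append(("ac", ac))    # feed the output back as context
        yield ac                      # output is available immediately


if __name__ == "__main__":
    print(list(stream_convert([5, 7, 9])))
```

Because each output depends only on frames already received, converting a prefix of the input gives exactly the prefix of the full conversion. That property is what allows output to start before the speaker has finished.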
Technical Details
Technological frameworks used: StreamVoice, a fully causal context-aware language model
Models used: Temporal-independent acoustic predictor, teacher-guided model for context foresight
Data used: Not specified
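The semantic masking strategy named in the summary can be pictured as corrupting part of the input token sequence so the model has to rely on surrounding context. The sketch below shows one generic way to do that in Python; the mask id, the masking ratio, and the function name `mask_semantic` are hypothetical choices, not values taken from the paper.

```python
import random
from typing import List, Optional

MASK_ID = -1  # hypothetical placeholder id for a masked semantic token


def mask_semantic(
    tokens: List[int],
    mask_ratio: float = 0.2,
    seed: Optional[int] = None,
) -> List[int]:
    """Return a copy of `tokens` with each entry independently masked
    with probability `mask_ratio`. Sequence length is preserved."""
    rng = random.Random(seed)
    return [MASK_ID if rng.random() < mask_ratio else tok for tok in tokens]


if __name__ == "__main__":
    print(mask_semantic([11, 42, 7, 99, 3], mask_ratio=0.4, seed=0))
```

Keeping the sequence length unchanged means masked frames stay aligned with their acoustic targets, so a corruption step like this can sit in front of an otherwise unmodified model.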
Potential Impact
Real-time communication platforms, voice modification software, and customer support systems could benefit; traditional voice conversion technology providers might be disrupted.