Authors: Xilin Jiang, Cong Han, Yinghao Aaron Li, Nima Mesgarani
Published on: February 06, 2024
Impact Score: 8.15
Arxiv code: arXiv:2402.03710
Summary
- What is new: ‘Listen, Chat, and Edit’ (LCE), a user-friendly, multimodal sound mixture editor that modifies each sound source in a mixture according to text instructions, without first separating the sources.
- Why this is important: Sound sources in our environment are heard as a mixture, and controlling or editing them individually is difficult without manually separating them first.
- What the research proposes: LCE uses a large language model to interpret user-provided text instructions and edits the sound mixture directly, supporting tasks such as extraction, removal, and volume control.
- Results: The authors report significant improvements in signal quality across editing tasks and robust zero-shot performance on mixtures with different numbers and types of sound sources.
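The editing tasks named above (extraction, removal, volume control) can each be read as a per-source gain applied before the sources are summed again. The sketch below illustrates only that target definition, assuming the individual sources are known; it is not the LCE model, which edits the mixture directly without access to separated sources. The function name `edit_mixture`, the `EDIT_GAINS` table, and the gain values 2.0 and 0.5 are hypothetical choices for illustration.

```python
def edit_mixture(sources, gains):
    """Scale each source by its gain and sum them into one edited mixture.

    sources: list of equal-length sample lists, one per sound source.
    gains:   one multiplier per source (0.0 removes it, 1.0 keeps it as is).
    """
    if len(sources) != len(gains):
        raise ValueError("need exactly one gain per source")
    n = len(sources[0])
    if any(len(s) != n for s in sources):
        raise ValueError("all sources must have the same length")
    return [sum(g * s[t] for s, g in zip(sources, gains)) for t in range(n)]


# Hypothetical edit vocabulary; the volume factors are assumed values.
EDIT_GAINS = {"keep": 1.0, "remove": 0.0, "volume_up": 2.0, "volume_down": 0.5}

# Toy four-sample signals standing in for two sound sources.
speech = [0.5, -0.5, 0.25, 0.0]
bark = [0.1, 0.2, -0.1, 0.4]

# Extraction of the speech: keep it, remove everything else.
extracted = edit_mixture([speech, bark], [EDIT_GAINS["keep"], EDIT_GAINS["remove"]])

# Volume control: make the speech louder and leave the bark unchanged.
louder = edit_mixture([speech, bark], [EDIT_GAINS["volume_up"], EDIT_GAINS["keep"]])
```

Under this reading, extraction and removal are the two extremes of volume control, which is one way to see why a single editor can cover all three tasks.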
Technical Details
Technological frameworks used: Not specified.
Models used: Large language models for interpreting text instructions and semantic filtering.
Data used: A custom 160-hour dataset with over 100k sound mixtures and associated text prompts.
Potential Impact
Audio editing software companies, content creators, and companies in the AI and multimedia sectors may benefit from these advances or need to adapt to them.