Authors: Anirudh S. Sundar, Chao-Han Huck Yang, David M. Chan, Shalini Ghosh, Venkatesh Ravichandran, Phani Sankar Nidadavolu
Published on: December 22, 2023
Impact Score: 8.15
arXiv code: arXiv:2312.14378
Summary
- What is new: Introduction of Multimodal Attention Merging (MAM) for direct knowledge transfer between different data modalities without requiring labeled data.
- Why this is important: Fine-tuning foundation models in domains such as speech and audio is held back by limited fine-tuning compute and the scarcity of labeled data.
- What the research proposes: MAM merges attention weights from high-resource text and image models into speech and audio models in a zero-shot manner, aiming to improve their performance without extensive labeled data (a sketch follows this list).
- Results: MAM achieved up to a 6.70% relative reduction in Word Error Rate (WER) for ASR and a 10.63% relative error reduction for AEC. Learnable-MAM, a data-driven variant, yielded a further 2.90% relative WER reduction for ASR and an 18.42% relative error reduction for AEC.
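The core idea lends itself to a short sketch: take the attention projection matrices of a pre-trained speech or audio model and linearly interpolate them with the corresponding matrices of a text or image model, with no labeled data involved. The PyTorch snippet below is a minimal illustration under that assumption; the key-name convention (q_proj, k_proj, ...), the key-matching scheme, and the mixing weight lam are placeholders for illustration, not the paper's exact configuration.

```python
# Minimal sketch of zero-shot attention merging between a source (text/image)
# model and a target (speech/audio) model. Assumes both expose attention
# projection weights under matching names and shapes.
import torch

ATTN_KEYS = ("q_proj.weight", "k_proj.weight", "v_proj.weight", "out_proj.weight")

def merge_attention(target_state, source_state, lam=0.5):
    """Return a copy of target_state whose attention projection matrices are
    linearly interpolated with the corresponding matrices in source_state."""
    merged = dict(target_state)
    for name, w_tgt in target_state.items():
        if name.endswith(ATTN_KEYS) and name in source_state:
            w_src = source_state[name]
            if w_src.shape == w_tgt.shape:  # merge only shape-compatible layers
                merged[name] = (1.0 - lam) * w_tgt + lam * w_src
    return merged

# Usage (hypothetical model names):
# speech_model.load_state_dict(
#     merge_attention(speech_model.state_dict(), text_model.state_dict(), lam=0.3))
```

Because nothing is trained, this keeps the zero-shot property: the only choice is how strongly to pull the speech/audio attention weights toward the text/image ones.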
Technical Details
Technological frameworks used: Self-supervised learning for foundation models, attention mechanism adaptations (a Learnable-MAM sketch follows this list)
Models used: Automatic Speech Recognition (ASR), Audio Event Classification (AEC)
Data used: Unlabeled data from high-resource modalities (text, images) and resource-constrained domains (speech, audio)
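For the Learnable-MAM variant mentioned in the results, the merging coefficient is learned from a small amount of task data rather than fixed by hand. Below is a hedged sketch of one way to parameterize that: a trainable scalar per layer, squashed through a sigmoid, with both weight matrices frozen. The class name and the sigmoid parameterization are assumptions made for illustration, not the authors' released implementation.

```python
# Sketch of a learnable merge: only alpha is trained; the source and target
# attention matrices themselves stay frozen.
import torch
import torch.nn as nn

class LearnableMerge(nn.Module):
    """Per-layer merge W = (1 - sigmoid(alpha)) * W_target + sigmoid(alpha) * W_source."""
    def __init__(self, w_target: torch.Tensor, w_source: torch.Tensor):
        super().__init__()
        # Buffers, not parameters: the original weights receive no gradients.
        self.register_buffer("w_target", w_target)
        self.register_buffer("w_source", w_source)
        self.alpha = nn.Parameter(torch.zeros(()))  # one trainable mixing logit per layer

    def forward(self) -> torch.Tensor:
        lam = torch.sigmoid(self.alpha)  # keeps the mixing weight in [0, 1]
        return (1.0 - lam) * self.w_target + lam * self.w_source
```

During fine-tuning, only these alpha parameters would receive gradients, and the merged matrices they emit replace the corresponding attention projections in the speech or audio model.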
Potential Impact
Speech recognition and audio event classification markets, potentially affecting companies that build voice-enabled devices and audio content analysis tools.
Want to implement this idea in a business?
We have generated a startup concept here: MAMTech.