Authors: Yichen Zhu, Minjie Zhu, Ning Liu, Zhicai Ou, Xiaofeng Mou, Jian Tang
Published on: January 04, 2024
Impact Score: 8.22
arXiv ID: 2401.02330
Summary
- What is new: LLaVA-Phi shows that a compact language model can drive effective multi-modal (text and visual) dialogue.
- Why this is important: Prior multi-modal models rely on large language models, making them too slow and resource-intensive for real-time dialogue tasks.
- What the research proposes: Building on the compact Phi-2 language model (2.7B parameters), trained on high-quality corpora, to deliver strong dialogue performance at a fraction of the usual model size.
- Results: Achieves strong performance on benchmarks covering visual comprehension, reasoning, and knowledge-based perception.
Technical Details
Technological frameworks used: LLaVA-Phi
Models used: Phi-2
Data used: High-quality corpora
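The components above can be illustrated with a minimal sketch of the LLaVA-style pipeline that LLaVA-Phi follows: patch features from a vision encoder are projected into the language model's embedding space and prepended to the text token embeddings. This is not the authors' code; all dimensions and tensor shapes below are illustrative assumptions (Phi-2's hidden size of 2560 comes from its public model card).

```python
import numpy as np

# Illustrative sketch (not the paper's implementation): a vision encoder's
# patch features are linearly projected into the language model's embedding
# space, then concatenated with text embeddings to form the LLM input.
VISION_DIM = 1024   # assumed vision-encoder feature width
LLM_DIM = 2560      # Phi-2 (2.7B) hidden size per its model card

rng = np.random.default_rng(0)

# Dummy stand-ins: 256 patch features for one image, 16 text token embeddings.
image_feats = rng.standard_normal((1, 256, VISION_DIM))
text_embeds = rng.standard_normal((1, 16, LLM_DIM))

# Learnable projection, reduced to a single linear map for brevity.
W = rng.standard_normal((VISION_DIM, LLM_DIM)) * 0.02
visual_tokens = image_feats @ W                      # (1, 256, 2560)

# Multi-modal sequence handed to the language model.
llm_input = np.concatenate([visual_tokens, text_embeds], axis=1)
print(llm_input.shape)  # (1, 272, 2560)
```

The projection layer is the only new trainable glue between the two pretrained models, which is what keeps a small-LLM variant like this cheap to train.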
Potential Impact
Embodied agent applications, time-sensitive interactive systems, visual comprehension tools, and customer service technologies.