Authors: Zhenxing Niu, Haodong Ren, Xinbo Gao, Gang Hua, Rong Jin
Published on: February 04, 2024
Impact Score: 8.22
arXiv ID: 2402.02309
Summary
- What is new: A maximum likelihood-based algorithm for crafting image Jailbreaking Prompts (imgJP) that transfer across different models and user prompts (a sketch of the optimization appears after this list).
- Why this is important: Multi-modal large language models (MLLMs) can be induced to produce objectionable responses to harmful user queries, and their robustness to such attacks is not well understood.
- What the research proposes: An algorithm that finds image jailbreaking prompts which drive MLLMs into generating harmful responses, exposing a vulnerability that defenses must account for.
- Results: The imgJP exhibits strong model-transferability, jailbreaking models not seen during optimization, and does so more efficiently than existing methods.
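For intuition, the maximum likelihood search for an imgJP can be viewed as a projected, signed-gradient optimization over the input image: the perturbed image is tuned so the model assigns high likelihood to affirmative responses across a set of harmful prompts. The sketch below is a minimal illustration under that reading, not the authors' implementation; `mllm_loss`, `find_imgjp`, and the step-size/budget values are assumed placeholders.

```python
# Illustrative sketch only: `mllm_loss` is a hypothetical wrapper around a
# surrogate MLLM (e.g., MiniGPT-v2) that returns the negative log-likelihood
# of `target` given (`image`, `prompt`). It is not the authors' released code.
import torch

def find_imgjp(mllm_loss, init_image, harmful_prompts, target_responses,
               steps=500, step_size=1.0 / 255, eps=32.0 / 255):
    """Search for a universal image jailbreaking prompt (imgJP).

    A perturbation `delta` is updated with signed gradient steps so that the
    model's likelihood of the affirmative `target_responses` is maximized
    jointly over all `harmful_prompts` (a prompt-universal objective).
    """
    delta = torch.zeros_like(init_image, requires_grad=True)
    for _ in range(steps):
        loss = torch.zeros(())
        for prompt, target in zip(harmful_prompts, target_responses):
            # Negative log-likelihood of the desired (objectionable) response.
            loss = loss + mllm_loss((init_image + delta).clamp(0, 1), prompt, target)
        loss.backward()
        with torch.no_grad():
            delta -= step_size * delta.grad.sign()   # step toward higher likelihood
            delta.clamp_(-eps, eps)                  # keep the perturbation bounded
        delta.grad.zero_()
    return (init_image + delta).clamp(0, 1).detach()
```

Summing the loss over several surrogate models rather than one is a common way to encourage the kind of model-transferability the results describe, since the perturbation must then fool every surrogate at once.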
Technical Details
Technological frameworks used: Not specified
Models used: MiniGPT-v2, LLaVA, InstructBLIP, mPLUG-Owl2
Data used: Not specified
Potential Impact
Companies leveraging MLLMs for content generation, moderation, and interaction with users could benefit from these insights to enhance model safety and reliability.
Want to implement this idea in a business?
We have generated a startup concept here: ShieldAI.