Authors: Haibo Jin, Ruoxi Chen, Andy Zhou, Jinyin Chen, Yang Zhang, Haohan Wang
Published on: February 05, 2024
Impact Score: 8.45
arXiv code: arXiv:2402.03299
Summary
- What is new: A role-playing system that generates jailbreaks for testing the safety of Large Language Models (LLMs), applied to both text-only and vision-language models.
- Why this is important: Existing strategies for generating jailbreaks to test LLMs are not efficient or diverse enough to cover the range of potential harmful responses.
- What the research proposes: GUARD, a role-playing system in which user LLMs take on different roles and draw on a knowledge graph of jailbreak characteristics to generate new, diverse jailbreak scenarios (a minimal sketch of this loop follows the list).
- Results: GUARD successfully induced unethical or guideline-violating responses from leading open-source and commercial LLMs, demonstrating its potential to strengthen LLM safety measures.
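To make the role-playing loop concrete, here is a minimal Python sketch of how such a pipeline could be wired together. The role names, system prompts, and the query_llm helper are assumptions made for illustration, not the paper's actual implementation.

```python
# Illustrative sketch of a role-playing jailbreak-generation loop (assumed
# structure; the paper's actual roles, prompts, and stopping rule may differ).
from dataclasses import dataclass


def query_llm(system_prompt: str, user_prompt: str) -> str:
    """Stand-in for a chat-completion call (e.g., to ChatGPT or Vicuna-13B)."""
    return f"[{system_prompt.split('.')[0]}] -> {user_prompt}"


@dataclass
class Role:
    name: str
    system_prompt: str

    def act(self, message: str) -> str:
        return query_llm(self.system_prompt, message)


# Hypothetical roles cooperating to produce and vet a jailbreak scenario.
translator = Role("Translator", "Rephrase the safety guideline as a probing question.")
generator = Role("Generator", "Embed the question in a role-play jailbreak scenario, "
                              "reusing characteristics retrieved from the knowledge graph.")
evaluator = Role("Evaluator", "Reply VIOLATION if the target model's answer breaks the "
                              "guideline, otherwise SAFE.")


def generate_jailbreak(guideline: str, target_model) -> str:
    """One iteration: guideline -> question -> scenario -> target reply -> verdict."""
    question = translator.act(guideline)
    scenario = generator.act(question)
    verdict = evaluator.act(target_model(scenario))
    return scenario if "VIOLATION" in verdict else ""


if __name__ == "__main__":
    # Dummy target model that simply echoes its input.
    print(generate_jailbreak("Models must not give instructions for illegal acts.",
                             lambda prompt: prompt))
```

In practice each role would be backed by its own LLM call, and scenarios judged as violations would be fed back to the target model's developers as new test cases.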
Technical Details
Technological frameworks used: Role-playing system, knowledge graph
Models used: Vicuna-13B, LongChat-7B, Llama-2-7B, ChatGPT, MiniGPT-v2, Gemini Pro Vision
Data used: Existing jailbreaks, government-issued guidelines
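One plausible way to organize the knowledge graph of jailbreak characteristics is as a graph whose nodes are existing jailbreak prompts and the characteristics extracted from them, so that new scenarios can be assembled by recombining characteristic nodes. The sketch below uses networkx; the node and edge schema, and the example jailbreak entries, are assumptions for illustration rather than the paper's specification.

```python
# Minimal sketch of a jailbreak-characteristics knowledge graph (assumed schema).
import networkx as nx

kg = nx.Graph()

# Nodes: known jailbreak prompts and the characteristics they exhibit.
jailbreaks = {
    "DAN": ["role-play persona", "ignore prior instructions"],
    "Grandma exploit": ["role-play persona", "emotional appeal"],
}
for name, characteristics in jailbreaks.items():
    kg.add_node(name, kind="jailbreak")
    for c in characteristics:
        kg.add_node(c, kind="characteristic")
        kg.add_edge(name, c)


def related_characteristics(seed: str) -> list[str]:
    """Characteristics linked to a seed jailbreak, available for recombination."""
    return [n for n in kg.neighbors(seed) if kg.nodes[n]["kind"] == "characteristic"]


print(related_characteristics("DAN"))
# ['role-play persona', 'ignore prior instructions']
```

A generator role could query this graph for characteristics associated with a seed jailbreak and combine them with a guideline-derived question to produce a new scenario.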
Potential Impact
Large Language Model developers (e.g., OpenAI, Google), AI safety and ethics consultancy services, AI regulatory bodies