Authors: Haibo Jin, Ruoxi Chen, Andy Zhou, Jinyin Chen, Yang Zhang, Haohan Wang
Published on: February 05, 2024
Impact Score: 8.45
arXiv code: arXiv:2402.03299
Summary
- What is new: A role-playing system that generates jailbreaks for testing the safety of Large Language Models (LLMs), applied to both text-only and vision-language models.
- Why this is important: Existing strategies for generating jailbreaks to test LLMs are not efficient or diverse enough to cover the range of potential harmful responses.
- What the research proposes: GUARD, a role-playing system in which user LLMs take on different roles and draw on a knowledge graph of jailbreak characteristics to generate new, diverse jailbreak scenarios (a minimal sketch of this loop follows the list).
- Results: GUARD successfully induced unethical or guideline-violating responses from leading open-source and commercial LLMs, demonstrating its potential to strengthen LLM safety measures.
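To make the role-playing loop concrete, here is a minimal Python sketch of how such a pipeline could be wired together. The role names, system prompts, and the query_llm helper are assumptions made for illustration, not the paper's actual implementation.

```python
# Illustrative sketch of a role-playing jailbreak-generation loop (assumed
# structure; the paper's actual roles, prompts, and stopping rule may differ).
from dataclasses import dataclass


def query_llm(system_prompt: str, user_prompt: str) -> str:
    """Stand-in for a chat-completion call (e.g., to ChatGPT or Vicuna-13B)."""
    return f"[{system_prompt.split('.')[0]}] -> {user_prompt}"


@dataclass
class Role:
    name: str
    system_prompt: str

    def act(self, message: str) -> str:
        return query_llm(self.system_prompt, message)


# Hypothetical roles cooperating to produce and vet a jailbreak scenario.
translator = Role("Translator", "Rephrase the safety guideline as a probing question.")
generator = Role("Generator", "Embed the question in a role-play jailbreak scenario, "
                              "reusing characteristics retrieved from the knowledge graph.")
evaluator = Role("Evaluator", "Reply VIOLATION if the target model's answer breaks the "
                              "guideline, otherwise SAFE.")


def generate_jailbreak(guideline: str, target_model) -> str:
    """One iteration: guideline -> question -> scenario -> target reply -> verdict."""
    question = translator.act(guideline)
    scenario = generator.act(question)
    verdict = evaluator.act(target_model(scenario))
    return scenario if "VIOLATION" in verdict else ""


if __name__ == "__main__":
    # Dummy target model that simply echoes its input.
    print(generate_jailbreak("Models must not give instructions for illegal acts.",
                             lambda prompt: prompt))
```

In practice each role would be backed by its own LLM call, and scenarios judged as violations would be fed back to the target model's developers as new test cases.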
Technical Details
Technological frameworks used: Role-playing system, knowledge graph
Models used: Vicuna-13B, LongChat-7B, Llama-2-7B, ChatGPT, MiniGPT-v2, Gemini Pro Vision
Data used: Existing jailbreaks, government-issued guidelines
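One plausible way to organize the knowledge graph of jailbreak characteristics is as a graph whose nodes are existing jailbreak prompts and the characteristics extracted from them, so that new scenarios can be assembled by recombining characteristic nodes. The sketch below uses networkx; the node and edge schema, and the example jailbreak entries, are assumptions for illustration rather than the paper's specification.

```python
# Minimal sketch of a jailbreak-characteristics knowledge graph (assumed schema).
import networkx as nx

kg = nx.Graph()

# Nodes: known jailbreak prompts and the characteristics they exhibit.
jailbreaks = {
    "DAN": ["role-play persona", "ignore prior instructions"],
    "Grandma exploit": ["role-play persona", "emotional appeal"],
}
for name, characteristics in jailbreaks.items():
    kg.add_node(name, kind="jailbreak")
    for c in characteristics:
        kg.add_node(c, kind="characteristic")
        kg.add_edge(name, c)


def related_characteristics(seed: str) -> list[str]:
    """Characteristics linked to a seed jailbreak, available for recombination."""
    return [n for n in kg.neighbors(seed) if kg.nodes[n]["kind"] == "characteristic"]


print(related_characteristics("DAN"))
# ['role-play persona', 'ignore prior instructions']
```

A generator role could query this graph for characteristics associated with a seed jailbreak and combine them with a guideline-derived question to produce a new scenario.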
Potential Impact
Large Language Model developers (e.g., OpenAI, Google), AI safety and ethics consultancy services, AI regulatory bodies