Authors: Shahriar Golchin, Mihai Surdeanu
Published on: August 16, 2023
Impact Score: 7.8
arXiv code: 2308.08493
Summary
- What is new: The introduction of a new method to detect data contamination in large language models (LLMs) at both instance and partition levels, using guided instruction and advanced evaluation metrics.
- Why this is important: Data contamination in LLMs, where test data from downstream tasks is present in the training data, challenges the accurate assessment of an LLM’s effectiveness.
- What the research proposes: A method that employs guided instructions to surface individual contaminated instances, then assesses partition-level contamination through statistical overlap comparisons and a classifier-based check (a minimal sketch of the instance-level step follows this list).
- Results: The approach detects contamination with 92% to 100% accuracy across seven datasets, and reveals that GPT-4 is contaminated with the AG News, WNLI, and XSum datasets.
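The instance-level check is straightforward to prototype: prompt the model with the dataset name, partition, and the first piece of an instance, ask it to complete the rest, and measure how closely the completion matches the reference. Below is a minimal sketch, assuming the OpenAI Python client and the rouge-score package; the prompt wording, the "gpt-4" model handle, and the 0.75 ROUGE-L threshold are illustrative assumptions, not the paper's exact settings.

```python
# Minimal sketch of an instance-level contamination check via guided
# instruction. Assumes the OpenAI Python client (>= 1.0) and the
# `rouge-score` package; prompt text and threshold are illustrative.
from openai import OpenAI
from rouge_score import rouge_scorer

client = OpenAI()  # reads OPENAI_API_KEY from the environment
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)


def guided_completion(dataset: str, split: str, first_piece: str) -> str:
    """Guided instruction: condition the model on the dataset name and
    partition, then ask it to reproduce the rest of the instance."""
    prompt = (
        f"You are given the first piece of an instance from the {split} "
        f"split of the {dataset} dataset. Finish the second piece of the "
        f"instance exactly as it appears in the dataset.\n\n"
        f"First piece: {first_piece}\n\nSecond piece:"
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # deterministic decoding favors exact memorization
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()


def looks_contaminated(dataset: str, split: str, first_piece: str,
                       reference: str, threshold: float = 0.75) -> bool:
    """Flag an instance when the guided completion overlaps the reference
    second piece beyond an (assumed) ROUGE-L F1 threshold."""
    completion = guided_completion(dataset, split, first_piece)
    overlap = scorer.score(reference, completion)["rougeL"].fmeasure
    return overlap >= threshold
```

Partition-level contamination can then be estimated by aggregating these instance-level decisions over a random sample of the partition; the paper additionally compares completions under guided versus general instructions, both statistically and with a GPT-4 based judgment.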
Technical Details
Technological frameworks used: None specified
Models used: GPT-4
Data used: AG News, WNLI, and XSum, among others
Potential Impact
Companies relying on LLMs for data analysis, content creation, and AI-driven decision-making could be affected; more reliable evaluation of model performance could benefit the AI development and research communities.
Want to implement this idea in a business?
We have generated a startup concept here: CleanAI.