Authors: Weixiang Yan, Haitian Liu, Yunkun Wang, Yunzhe Li, Qian Chen, Wen Wang, Tingyu Lin, Weishan Zhao, Li Zhu, Shuiguang Deng, Hari Sundaram
Published on: November 14, 2023
Impact Score: 8.38
arXiv ID: arXiv:2311.08588
Summary
- What is new: CodeScope, a comprehensive benchmark for evaluating Large Language Models (LLMs) on coding tasks. It goes beyond existing benchmarks by covering 43 programming languages and 8 coding tasks, and by evaluating models along three dimensions.
- Why this is important: Existing benchmarks for LLMs on coding tasks are limited: they focus on a handful of programming languages and ignore real-world needs such as multilingual environments and multi-task settings.
- What the research proposes: CodeScope, an execution-based, multilingual, multi-task, multi-dimensional evaluation benchmark for LLMs (a minimal sketch of execution-based checking follows this list).
- Results: A systematic evaluation of 8 mainstream LLMs shows that CodeScope provides a more comprehensive and more challenging benchmark than existing ones.
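Execution-based here means that a model's generated program is judged by actually running it, not by comparing its text to a reference. The snippet below is a minimal, hypothetical sketch of such a check in Python; it is not the paper's actual harness, and the function name and test-case format are assumptions made for illustration.

```python
# Hypothetical sketch of an execution-based check (not CodeScope's real harness):
# run the generated program against input/output test cases and count the task
# as solved only if every case passes.
import subprocess
import tempfile

def passes_all_tests(code: str, test_cases: list[tuple[str, str]], timeout: float = 5.0) -> bool:
    """Execute a candidate Python program against (stdin, expected stdout) pairs."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    for stdin_text, expected in test_cases:
        try:
            result = subprocess.run(
                ["python3", path], input=stdin_text,
                capture_output=True, text=True, timeout=timeout,  # stop runaway programs
            )
        except subprocess.TimeoutExpired:
            return False
        if result.returncode != 0 or result.stdout.strip() != expected.strip():
            return False
    return True

# Example: an "add two integers" task with two test cases.
candidate = "a, b = map(int, input().split())\nprint(a + b)"
print(passes_all_tests(candidate, [("1 2", "3"), ("10 -4", "6")]))  # True
```

A real benchmark harness would additionally sandbox execution and support compiled languages, but the pass/fail-by-execution idea is the same.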
Technical Details
Technological frameworks used: CodeScope, MultiCodeEngine
Models used: 8 mainstream Large Language Models
Data used: 43 programming languages across 8 coding tasks
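The paper pairs CodeScope with MultiCodeEngine for executing code across many languages. As a loose illustration only (this is not MultiCodeEngine's actual API), a multilingual execution engine can be thought of as a dispatch table mapping each language to the command that runs a source file:

```python
# Hypothetical per-language dispatch table (illustrative, not MultiCodeEngine's API).
# Compiled languages would add a compile step before the run command.
LANGUAGE_RUNNERS = {
    "python": lambda src: ["python3", src],
    "javascript": lambda src: ["node", src],
    "ruby": lambda src: ["ruby", src],
}

def build_run_command(language: str, source_path: str) -> list[str]:
    """Look up the run command for a language, e.g. ['node', 'solution.js']."""
    if language not in LANGUAGE_RUNNERS:
        raise ValueError(f"unsupported language: {language}")
    return LANGUAGE_RUNNERS[language](source_path)

print(build_run_command("javascript", "solution.js"))  # ['node', 'solution.js']
```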
Potential Impact
Software development tooling, AI development platforms, and companies specializing in programming automation and assistance technologies could be significantly affected.
Want to implement this idea in a business?
We have generated a startup concept here: DevAssistant.