Authors: Alexandre Drouin, Maxime Gasse, Massimo Caccia, Issam H. Laradji, Manuel Del Verme, Tom Marty, Léo Boisvert, Megh Thakkar, Quentin Cappart, David Vazquez, Nicolas Chapados, Alexandre Lacoste
Published on: March 12, 2024
Impact Score: 7.6
arXiv ID: 2403.07718
Summary
- What is new: A novel evaluation of language model-based agents in the context of knowledge worker tasks, focusing on their use in navigating and operating enterprise software through web browsers.
- Why this is important: It is not yet well understood how well agents based on large language models (LLMs) can perform the tasks knowledge workers typically carry out in enterprise software systems.
- What the research proposes: Two contributions: WorkArena, a benchmark of tasks built on the ServiceNow platform for assessing agent performance, and BrowserGym, an environment for designing and evaluating such agents.
- Results: Current agents show promise on the WorkArena benchmark but remain well short of full task automation, and there is a significant performance disparity between open- and closed-source LLMs.
Technical Details
Technological frameworks used: BrowserGym
Models used: Open- and closed-source large language models (LLMs)
Data used: WorkArena benchmark tasks based on ServiceNow platform
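To make the benchmark-plus-environment idea concrete, here is a minimal sketch of an observe-act evaluation loop of the kind such a setup implies. It does not use the BrowserGym or WorkArena APIs: `ToyFormTask`, `echo_agent`, `noop_agent`, and `success_rate` are hypothetical names invented for illustration, and the "task" is a toy stand-in for filling a single form field rather than a real ServiceNow page.

```python
from typing import Callable, Dict, List, Tuple

Observation = Dict[str, str]


class ToyFormTask:
    """Hypothetical stand-in for one browser task: set a form field to a target value."""

    def __init__(self, goal: str, target: str, max_steps: int = 5) -> None:
        self.goal = goal
        self.target = target
        self.max_steps = max_steps
        self.field_value = ""
        self.steps = 0

    def reset(self) -> Observation:
        # Start each episode from an empty field.
        self.field_value = ""
        self.steps = 0
        return {"goal": self.goal, "field": self.field_value}

    def step(self, action: str) -> Tuple[Observation, bool, bool]:
        # Actions are plain strings; only "fill:<value>" changes the page state.
        self.steps += 1
        if action.startswith("fill:"):
            self.field_value = action[len("fill:"):]
        success = self.field_value == self.target
        done = success or self.steps >= self.max_steps
        return {"goal": self.goal, "field": self.field_value}, success, done


def echo_agent(obs: Observation) -> str:
    # Toy policy: copy whatever follows "=" in the goal text into the field.
    return "fill:" + obs["goal"].split("=")[-1]


def noop_agent(obs: Observation) -> str:
    # Baseline policy that never interacts with the page.
    return "noop"


def success_rate(agent: Callable[[Observation], str], tasks: List[ToyFormTask]) -> float:
    """Run the agent once on each task and return the fraction solved."""
    wins = 0
    for task in tasks:
        obs = task.reset()
        success, done = False, False
        while not done:
            obs, success, done = task.step(agent(obs))
        wins += int(success)
    return wins / len(tasks)


if __name__ == "__main__":
    tasks = [
        ToyFormTask("set priority=high", "high"),
        ToyFormTask("set state=closed", "closed"),
    ]
    print("echo agent:", success_rate(echo_agent, tasks))
    print("noop agent:", success_rate(noop_agent, tasks))
```

Comparing agents by per-task success rate, as in this sketch, is one simple way a performance gap between models could be made visible; the actual benchmark's tasks, observations, and action space are far richer than this toy.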
Potential Impact
Enterprise software vendors, companies in the software automation space, and developers or providers of large language models could benefit from this work or may need to adjust their strategies.