Authors: William Chen, Oier Mees, Aviral Kumar, Sergey Levine
Published on: February 05, 2024
Impact Score: 8.3
arXiv ID: arXiv:2402.02651
Summary
- What is new: Leveraging the general and indexable world knowledge encoded in vision-language models (VLMs) for reinforcement learning (RL) tasks.
- Why this is important: RL agents typically learn behaviors from scratch, so they struggle on complex tasks because they cannot draw on the vast world knowledge already captured by pre-trained models.
- What the research proposes: Using VLMs as promptable representations to initialize policies: embeddings of visual observations, conditioned on task-context prompts, serve as the policy's state representation (see the sketch after this list).
- Results: Policies trained on these promptable embeddings outperform those trained on generic, non-promptable image embeddings and are on par with domain-specific embeddings on complex tasks.
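The following is a minimal, hypothetical sketch of the promptable-representation idea, not the authors' implementation: a stand-in for a frozen VLM fuses an image observation with a tokenized task prompt, and the resulting embedding replaces raw pixels as the policy's input. The FrozenVLM class, its promptable_embedding method, the fusion rule, and all dimensions are illustrative assumptions.

```python
# Hypothetical sketch (not the paper's code): condition a frozen VLM on a
# visual observation plus a task-relevant prompt, and use the resulting
# embedding as the RL policy's state representation.
import torch
import torch.nn as nn

class FrozenVLM(nn.Module):
    """Stand-in for a pre-trained vision-language model (assumed interface)."""
    def __init__(self, embed_dim: int = 512):
        super().__init__()
        # Placeholder encoders; a real VLM would be loaded pre-trained and kept frozen.
        self.vision = nn.Sequential(nn.Flatten(), nn.LazyLinear(embed_dim))
        self.text = nn.Embedding(1000, embed_dim)

    @torch.no_grad()
    def promptable_embedding(self, image: torch.Tensor, prompt_ids: torch.Tensor) -> torch.Tensor:
        """Fuse image features with prompt features; one embedding per observation."""
        img_feat = self.vision(image)                  # (B, D)
        txt_feat = self.text(prompt_ids).mean(dim=1)   # (B, D), mean-pooled prompt tokens
        return img_feat + txt_feat                     # simple additive fusion, for illustration only


vlm = FrozenVLM()
obs = torch.rand(1, 3, 64, 64)                # e.g. one Minecraft frame
prompt_ids = torch.randint(0, 1000, (1, 8))   # tokenized task prompt, e.g. "Is there a tree nearby?"
state_repr = vlm.promptable_embedding(obs, prompt_ids)  # fed to the RL policy instead of raw pixels
```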
Technical Details
Technological frameworks used: Reinforcement Learning, Vision-Language Models
Models used: General-purpose VLMs pre-trained on Internet-scale data
Data used: Visual observations from Minecraft tasks and robot navigation tasks in Habitat (see the training sketch after this list)
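To make the RL side concrete, here is a hedged sketch of a small policy head that consumes the promptable embeddings and takes one REINFORCE-style gradient step on placeholder data. The EmbeddingPolicy class, action count, and dummy returns are assumptions for illustration; a real setup would roll out episodes in Minecraft or Habitat and optimize against actual task returns.

```python
# Hypothetical sketch: a policy trained on top of frozen promptable VLM embeddings.
import torch
import torch.nn as nn

class EmbeddingPolicy(nn.Module):
    """Maps a promptable VLM embedding to a categorical action distribution."""
    def __init__(self, embed_dim: int = 512, num_actions: int = 8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, 256), nn.ReLU(), nn.Linear(256, num_actions)
        )

    def forward(self, state_repr: torch.Tensor) -> torch.distributions.Categorical:
        return torch.distributions.Categorical(logits=self.net(state_repr))


policy = EmbeddingPolicy()
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)

# One illustrative policy-gradient update on dummy data; real training would use
# environment rollouts and returns instead of random tensors.
state_repr = torch.randn(16, 512)   # batch of promptable embeddings from the VLM
returns = torch.randn(16)           # placeholder episode returns
dist = policy(state_repr)
actions = dist.sample()
loss = -(dist.log_prob(actions) * returns).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```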
Potential Impact
Gaming (specifically Minecraft), robotics navigation, AI development platforms, EdTech for coding and AI training
Want to implement this idea in a business?
We have generated a startup concept here: EmbedInstruct.