RL-VLM-F: Reinforcement Learning from Vision Language Foundation Model Feedback

¹CMU, ²University of Southern California
*Equal Contribution, Equal Advising

Abstract

Reward engineering has long been a challenge in Reinforcement Learning research, as it often requires extensive human effort. In this paper, we propose RL-VLM-F, a method that automatically generates reward functions for agents to learn new tasks, using only a text description of the task goal and the agent’s visual observations, by leveraging feedback from vision language foundation models (VLMs). The key to our approach is to query these models to give preferences over pairs of the agent’s image observations based on the text description of the task goal, and then learn a reward function from the preference labels. We demonstrate that RL-VLM-F successfully produces effective rewards and policies across various domains — including classic control, as well as manipulation of rigid, articulated, and deformable objects — without the need for human supervision, outperforming prior methods that use large pretrained models for reward generation under the same assumptions.

RL-VLM-F Components


Overview: RL-VLM-F automatically generates reward functions for agents to learn new tasks, using only a text description of the task goal and the agent's visual observations. The key idea is to query a vision language foundation model (VLM) for preferences over pairs of the agent's image observations, given the text description of the task goal, and to learn a reward function from the resulting preference labels. The reward function and the policy are learned simultaneously via preference-based RL.
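For concreteness, the sketch below shows one standard way a reward model can be trained from such preference labels, using a Bradley-Terry style cross-entropy objective as in preference-based RL; the network architecture, tensor shapes, and names are illustrative assumptions, not the paper's exact implementation.

# Minimal sketch (assumed, not the authors' code): learning a reward model
# from preference labels with a Bradley-Terry style cross-entropy loss.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, obs_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs):                   # obs: (batch, T, obs_dim)
        return self.net(obs).squeeze(-1)      # per-step rewards: (batch, T)

def preference_loss(reward_model, seg_0, seg_1, labels):
    # seg_0, seg_1: (batch, T, obs_dim) observation segments
    # labels: (batch,) with 0 if seg_0 is preferred and 1 if seg_1 is preferred
    ret_0 = reward_model(seg_0).sum(dim=1)    # predicted return of segment 0
    ret_1 = reward_model(seg_1).sum(dim=1)    # predicted return of segment 1
    logits = torch.stack([ret_0, ret_1], dim=1)
    return nn.functional.cross_entropy(logits, labels)

The policy is then optimized against this learned reward, with both updated concurrently as described above.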

RL-VLM-F Query Design


RL-VLM-F Two-stage query: We query the VLM in two stages. First, we ask the VLM to generate a free-form response comparing how well each of the two images achieves the task goal. Next, we prompt the VLM with the text response from the first stage to extract a preference label over the two images. The same query template is used for all tasks, with [task description] replaced by a task-specific goal description.
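A minimal sketch of this two-stage querying pattern is shown below; it assumes a hypothetical query_vlm(images, prompt) helper that returns the model's text response, and the prompt wording paraphrases the template rather than reproducing it exactly.

# Illustrative sketch of the two-stage query (assumed helper and prompts).
ANALYSIS_PROMPT = (
    "Consider the following two images. The goal is {task_description}. "
    "Describe how well each image achieves the goal."
)
LABEL_PROMPT = (
    "Based on the analysis below, is the goal ({task_description}) better "
    "achieved in Image 1 or Image 2? Reply 0 for Image 1, 1 for Image 2, "
    "or -1 if there is no clear difference.\n\nAnalysis:\n{analysis}"
)

def parse_label(text):
    # Return the first of -1 / 0 / 1 found in the response; -1 if none.
    for token in ("-1", "0", "1"):
        if token in text:
            return int(token)
    return -1

def vlm_preference(img_0, img_1, task_description, query_vlm):
    # Stage 1: free-form comparison of the two images.
    analysis = query_vlm(
        images=[img_0, img_1],
        prompt=ANALYSIS_PROMPT.format(task_description=task_description),
    )
    # Stage 2: extract a discrete preference label from the stage-1 text.
    label_text = query_vlm(
        images=[img_0, img_1],
        prompt=LABEL_PROMPT.format(task_description=task_description,
                                   analysis=analysis),
    )
    return parse_label(label_text)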

RL-VLM-F: Prompts and Policies

Below we show policy rollouts from our method and the baselines on seven tasks spanning classic control and the manipulation of rigid, articulated, and deformable objects. For each task, we show a short text description of the task goal, which, when combined with the shared template prompt, forms the full prompt we use to query the VLM for preferences (a minimal sketch of this substitution follows the task list).


Fold Cloth Diagonally

task description: "to fold the cloth diagonally from top left corner to bottom right corner"


Straighten Rope

task description: "to straighten the blue rope"


Pass Water without Spilling

task description: "to move the container, which holds water, to be as close to the red circle as possible without causing too many water droplets to spill"


Move Soccer Ball into Goal

task description: "to move the soccer ball into the goal"


Open Drawer

task description: "to open the drawer"


Sweep Cube into Hole

task description: "to minimize the distance between the green cube and the hole"


CartPole

task description: "to balance the brown pole on the black cart to be upright"


Experiments and Results

We thoroughly evaluate RL-VLM-F on a diverse set of tasks, including classic control and the manipulation of rigid, articulated, and deformable objects. Without any human supervision, RL-VLM-F outperforms prior methods that use large pretrained models for reward generation under the same assumptions.

Comparison to Baselines


The learning curves of all compared methods on the 7 tasks show that RL-VLM-F outperforms all baselines on every task, and matches or surpasses the ground-truth-preference baseline on 6 of the 7 tasks.

Accuracy of VLM Preference Labels


We analyze the accuracy of the VLM preference labels against ground-truth preference labels defined by the environment's reward function. The x-axis discretizes image pairs into 10 bins by the difference in their ground-truth task progress. The y-axis shows the fraction of pairs for which the VLM label is correct, incorrect, or for which the VLM expresses no preference. We find that, like humans, the VLM is better at comparing two images when they clearly differ in how well they achieve the goal, and performs worse when the two images are very similar.
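The binning described above can be reproduced with a short script like the one below; the bin construction and the label convention (0/1 for the preferred image, -1 for no preference) are assumptions for illustration.

# Sketch: bin image pairs by their ground-truth progress difference and
# compute the fraction of correct / incorrect / no-preference VLM labels.
import numpy as np

def binned_label_accuracy(progress_0, progress_1, vlm_labels, num_bins=10):
    # progress_0, progress_1: ground-truth task progress of each image (arrays)
    # vlm_labels: -1 (no preference), 0 (prefers image 0), 1 (prefers image 1)
    diff = np.abs(progress_0 - progress_1)
    gt_labels = (progress_1 > progress_0).astype(int)
    bins = np.minimum((diff / (diff.max() + 1e-8) * num_bins).astype(int),
                      num_bins - 1)

    stats = np.zeros((num_bins, 3))   # columns: correct, incorrect, no preference
    for b in range(num_bins):
        mask = bins == b
        if not mask.any():
            continue
        labels, gt = vlm_labels[mask], gt_labels[mask]
        stats[b, 0] = np.mean(labels == gt)
        stats[b, 1] = np.mean((labels != gt) & (labels != -1))
        stats[b, 2] = np.mean(labels == -1)
    return stats   # each non-empty row sums to 1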

Alignment Between Learned Reward and Ground-truth Task Progress


We compare how well the rewards learned by RL-VLM-F and the VLM Score baseline align with the ground-truth task progress along an expert trajectory on 3 MetaWorld tasks. RL-VLM-F produces rewards that align better with the ground-truth task progress. The learned rewards are averaged over 3 reward models trained with different seeds, and the shaded region shows the standard error.
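A sketch of how such an alignment plot can be produced is below, assuming each seeded reward model is a callable that maps the expert trajectory's observations to per-step rewards; names and any normalization are illustrative.

# Sketch: plot mean learned reward (with standard error over seeds) against
# ground-truth task progress along an expert trajectory.
import numpy as np
import matplotlib.pyplot as plt

def plot_reward_alignment(reward_models, expert_obs, gt_progress):
    rewards = np.stack([m(expert_obs) for m in reward_models])  # (seeds, T)
    mean = rewards.mean(axis=0)
    stderr = rewards.std(axis=0) / np.sqrt(len(reward_models))

    steps = np.arange(len(gt_progress))
    plt.plot(steps, gt_progress, label="ground-truth task progress")
    plt.plot(steps, mean, label="learned reward (mean over seeds)")
    plt.fill_between(steps, mean - stderr, mean + stderr, alpha=0.3)
    plt.xlabel("trajectory step")
    plt.legend()
    plt.show()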

BibTeX

@article{wang2024rl,
  title={RL-VLM-F: Reinforcement Learning from Vision Language Foundation Model Feedback},
  author={Wang, Yufei and Sun, Zhanyi and Zhang, Jesse and Xian, Zhou and Biyik, Erdem and Held, David and Erickson, Zackory},
  journal={arXiv preprint arXiv:2402.03681},
  year={2024}
}