AutoRT: Embodied Foundation Models for Large Scale Orchestration of Robotic Agents

Michael Ahn1, Debidatta Dwibedi1, Chelsea Finn1, Montse Gonzalez Arenas1, Keerthana Gopalakrishnan1, Karol Hausman1, Brian Ichter1, Alex Irpan1, Nikhil Joshi1, Ryan Julian1, Sean Kirmani1, Isabel Leal1, Edward Lee1, Sergey Levine1, Yao Lu1, Sharath Maddineni1, Kanishka Rao1, Dorsa Sadigh1, Pannag Sanketi1, Pierre Sermanet1, Quan Vuong1, Stefan Welker1, Fei Xia1, Ted Xiao1, Peng Xu1, Steve Xu1, Zhuo Xu1
1Google DeepMind

Abstract

Foundation models that incorporate language, vision, and, more recently, actions have revolutionized the ability to harness internet-scale data to reason about useful tasks. However, one of the key challenges of training embodied foundation models is the lack of data grounded in the physical world. In this paper, we propose AutoRT, a system that leverages existing foundation models to scale up the deployment of operational robots in completely unseen scenarios with minimal human supervision. AutoRT leverages vision-language models (VLMs) for scene understanding and grounding, and further uses large language models (LLMs) for proposing diverse and novel instructions to be performed by a fleet of robots. Guiding data collection by tapping into the knowledge of foundation models enables AutoRT to effectively reason about autonomy tradeoffs and safety while significantly scaling up data collection for robot learning. We demonstrate AutoRT proposing instructions to over 20 robots across multiple buildings and collecting 77k real robot episodes via both teleoperation and autonomous robot policies. We experimentally show that such “in-the-wild” data collected by AutoRT is significantly more diverse, and that AutoRT’s use of LLMs allows for instruction-following data collection robots that are aligned with human preferences.

Approach

AutoRT is an exploration into scaling up robots to unstructured, “in-the-wild” settings. We use a VLM to produce an open-vocabulary description of what the robot sees, then pass that description to an LLM, which proposes natural language instructions. The proposals are then critiqued by another LLM using what we call a robot constitution, refining the instructions toward safer, completable behavior. This lets us run robots in more diverse environments where we do not know ahead of time which objects the robot will encounter, collecting data on self-generated tasks.



Example environments where AutoRT was run

To collect a robot episode, AutoRT proceeds in five stages, sketched in code below the list.

  1. The robot maps the environment to generate points of interest, then samples one and drives to that point.
  2. Given an image from the robot camera, a VLM outputs text describing the scene the robot observes, and objects that exist in that scene. The output is forwarded to an LLM to generate tasks the robot could attempt.
  3. Tasks are filtered via self-reflection: invalid tasks are rejected, and the rest are categorized into ones that need human assistance and ones that do not.
  4. A valid task is sampled from the filtered list, and the robot attempts it.
  5. The attempt is scored on how diverse the task and video is compared to prior data, and we repeat.
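
As a concrete illustration, here is a minimal Python sketch of this per-episode loop. The robot methods (`map_points_of_interest`, `navigate_to`, `camera_image`) and the `vlm_describe`, `llm_propose_tasks`, `llm_filter_tasks`, `execute`, and `diversity_score` callables are hypothetical stand-ins for the real AutoRT components, not names from the released system.

```python
import random
from dataclasses import dataclass


@dataclass
class FilteredTask:
    instruction: str
    needs_human: bool  # True if the critique step says a teleoperator must assist.


def collect_episode(robot, vlm_describe, llm_propose_tasks, llm_filter_tasks,
                    execute, diversity_score, prior_episodes):
    # 1. Explore: sample a point of interest from the robot's map and drive there.
    target = random.choice(robot.map_points_of_interest())
    robot.navigate_to(target)

    # 2. Describe: the VLM turns the camera image into a scene description, and
    #    the LLM proposes candidate instructions for that scene.
    scene = vlm_describe(robot.camera_image())
    candidates = llm_propose_tasks(scene)

    # 3. Filter: an LLM critique (guided by the robot constitution) rejects
    #    unsafe or infeasible tasks and tags the rest as autonomous vs. assisted.
    valid: list[FilteredTask] = llm_filter_tasks(candidates, scene)
    if not valid:
        return None

    # 4. Attempt: sample one valid task and execute it with a policy or a human.
    task = random.choice(valid)
    episode = execute(robot, task)

    # 5. Score: rate how diverse the new task/video is relative to prior data.
    return episode, diversity_score(episode, prior_episodes)
```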

We assume AutoRT is run on a fleet of many robots supervised by a smaller number of humans. The system supports specifying a desired fraction of human demonstrations, which we use to adjust data collection based on how autonomously we want the robots to operate. Up to 20 robots were used at once, collecting over 77,000 episodes covering 6,650 unique language instructions.
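
For the human-demonstration fraction specifically, one simple scheme is to track the running teleoperation ratio when dispatching tasks, as in the sketch below. This bookkeeping is an illustrative assumption, not the scheduler AutoRT actually uses.

```python
class CollectModeBalancer:
    """Routes episodes to teleoperation vs. an autonomous policy so the running
    fraction of teleoperated episodes tracks a desired target."""

    def __init__(self, target_teleop_fraction: float):
        self.target = target_teleop_fraction
        self.teleop_count = 0
        self.total_count = 0

    def choose_mode(self, task_needs_human: bool) -> str:
        if task_needs_human:
            # Tasks flagged by the filter as needing assistance always go to a human.
            mode = "teleop"
        else:
            current = self.teleop_count / self.total_count if self.total_count else 0.0
            # Otherwise prefer teleoperation only while below the target fraction.
            mode = "teleop" if current < self.target else "autonomous"
        if mode == "teleop":
            self.teleop_count += 1
        self.total_count += 1
        return mode


balancer = CollectModeBalancer(target_teleop_fraction=0.3)
print(balancer.choose_mode(task_needs_human=False))  # "teleop": running fraction starts below target.
```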


Plots: number of robots running AutoRT, total episodes collected by AutoRT, and total unique language instructions generated over time.

Below is a time-lapse of AutoRT running on 8 robots.

Example Generated Tasks

The following are human demonstrations of tasks generated by AutoRT, showing the creativity of the LLM. Videos are 2x speed.

arrange the cups into a circle
fluff the pillows on the couch
count the objects on the table
stack the boxes on top of each other

Affordance and Robot Constitution

The benefit of using LLMs is that they easily generate diverse tasks for robots to perform. The danger is that these tasks may be unsafe or outside the robot's affordances (the range of its capabilities in the environment). In this work, we do not finetune the language model; instead, we use prompting to guide task generation. We call this prompt the robot constitution, since it is made of rules that describe desired robot behavior.

The rules are divided into three categories:

  1. Foundational rules, heavily inspired by Asimov’s laws.
     Example: “A robot may not injure a human being.”
  2. Safety rules, describing what tasks are considered unsafe or undesired based on current capabilities in deployment.
     Examples: “This robot shall not attempt tasks involving humans, animals or living things.” “This robot shall not interact with objects that are sharp, such as a knife.”
  3. Embodiment rules, describing limitations of the robot’s embodiment, such as its maximum payload.
     Example: “This robot only has one arm, and thus cannot perform tasks requiring two arms. For example, it cannot open a bottle.”

See the paper for the full prompts. Including this robot constitution when generating and critiquing tasks is critical to making the system usable. We found that with the constitution, 88% of initially generated tasks are valid, increasing to 93% after one round of task filtering. In testing with adversarial scenes designed to encourage bad tasks (e.g., scenes with multiple sharp objects), the constitutional robot generates valid tasks 83% of the time, compared to just 18% of the time without it.
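
For illustration, the sketch below shows one way the constitution text could be folded into the generation and self-critique prompts. The rule strings mirror the categories listed above, but the prompt wording and the `call_llm` helper are assumptions, not the prompts from the paper.

```python
# The constitution is prepended to both the generation and critique prompts.
ROBOT_CONSTITUTION = """\
Foundational rules:
1. A robot may not injure a human being.
Safety rules:
2. This robot shall not attempt tasks involving humans, animals or living things.
3. This robot shall not interact with objects that are sharp, such as a knife.
Embodiment rules:
4. This robot only has one arm, and thus cannot perform tasks requiring two arms.
"""


def propose_tasks(call_llm, scene_description: str) -> list[str]:
    """Generation step: ask the LLM for candidate instructions, constitution included."""
    prompt = (
        f"{ROBOT_CONSTITUTION}\n"
        f"Scene: {scene_description}\n"
        "List five manipulation tasks the robot could attempt, one per line."
    )
    return [line.strip() for line in call_llm(prompt).splitlines() if line.strip()]


def critique_task(call_llm, task: str) -> str:
    """Self-reflection step: classify a proposed task before it reaches the robot."""
    prompt = (
        f"{ROBOT_CONSTITUTION}\n"
        f"Proposed task: {task}\n"
        "Answer with one word: 'reject' if the task violates the rules above, "
        "'assist' if it requires a human teleoperator, or 'autonomous' otherwise."
    )
    return call_llm(prompt).strip().lower()
```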

Results

Data Diversity

We score the visual diversity of the data by distance in embedding space, where higher distance indicates more diverse data, and find that AutoRT data is consistently more visually diverse than RT-1 data. The highest diversity comes from episodes collected with human assistance.

We score language instruction diversity similarly, and find that AutoRT data has a higher average L2 distance between language embeddings than previous robotics datasets.

Collection Method | Avg. Language Embedding L2 Distance
Language Table    | 0.988
BC-Z              | 1.070
RT-1              | 1.073
AutoRT w/PaLI     | 1.100
AutoRT w/FlexCap  | 1.137
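
A diversity score of this kind can be computed as the mean pairwise L2 distance between episode embeddings. The sketch below assumes the instructions (or video frames) have already been embedded by some encoder; it is not the paper's exact evaluation code.

```python
import numpy as np


def mean_pairwise_l2(embeddings: np.ndarray) -> float:
    """Average L2 distance between all pairs of embedding vectors.

    `embeddings` has shape (num_episodes, dim); higher values mean the dataset
    covers a more spread-out region of embedding space.
    """
    n = embeddings.shape[0]
    # Pairwise squared distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq_norms = np.sum(embeddings ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * embeddings @ embeddings.T
    dists = np.sqrt(np.clip(sq_dists, 0.0, None))
    # Average over the n * (n - 1) off-diagonal pairs only.
    return float(dists.sum() / (n * (n - 1)))


# Example with random vectors standing in for real instruction embeddings.
rng = np.random.default_rng(0)
print(mean_pairwise_l2(rng.normal(size=(100, 512))))
```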

Affordance and Robot Constitution

To measure the effect of the robot constitution, we set up deliberately adversarial scenes that included lifelike toy animals or sharp items. We then compare the following setups:

  • Task Generation: no constitution vs constitution
  • Filtering: no filter vs minimal filter vs constitutional filter

Using the robot constitution at both generation time and filtering time leads to the highest fraction of valid tasks.

Filter         | No Constitution (% valid) | Constitution (% valid)
None           | 18%                       | 70%
Minimal        | 15%                       | 67%
Constitutional | 57%                       | 83%

Learned Policy Samples

To sanity-check the data, we finetune an RT-1 checkpoint on data collected by AutoRT. RT-1 is used instead of RT-2 because it is faster and cheaper to train. Videos below are from the finetuned policy running at 1x speed.

pick white bag
wipe the table with the cloth
pick chip bag
fold the cloth

Next Steps

AutoRT is a promising step towards embodied AI that can run anywhere, but we emphasize that it is still a research proof-of-concept. Future work will be directed towards creating more robust and diverse learned policies, integrating larger multimodal models, studying learning methods that can better leverage AutoRT data, and improving the safety of generated tasks.

Acknowledgements

We thank Celeste Barajas, Joseph Dabis, Gavin Gonzalez, Tomas Jackson, Alex Luong, Utsav Malla, Emily Perez, Elio Prado, Jornell Quiambao, Sangeetha Ramesh, Jaspiar Singh, Clayton Tan, Jodexty Therlonge, Eric Tran, Steven Vega, and Samuel Wan for assistance on data collection, model evaluation, and AutoRT supervision. We thank Anthony Brohan and Noah Brown for assistance on data analysis. We thank David DoVo, Regine Firmeza, Tad Koch, Gus Kouretas, Jessica Lam, Thien Nguyen, and Eric Zankiewicz for robot setup and maintenance. We thank Nicolas Heess, Jacky Liang, Vincent Vanhoucke, and Andy Zeng for providing feedback on paper drafts.

BibTeX

@misc{gdm2024autort,
      title={AutoRT: Embodied Foundation Models for Large Scale Orchestration of Robotic Agents}, 
      author={Michael Ahn and Debidatta Dwibedi and Chelsea Finn and Montse Gonzalez Arenas and Keerthana Gopalakrishnan and Karol Hausman and Brian Ichter and Alex Irpan and Nikhil Joshi and Ryan Julian and Sean Kirmani and Isabel Leal and Edward Lee and Sergey Levine and Yao Lu and Sharath Maddineni and Kanishka Rao and Dorsa Sadigh and Pannag Sanketi and Pierre Sermanet and Quan Vuong and Stefan Welker and Fei Xia and Ted Xiao and Peng Xu and Steve Xu and Zhuo Xu},
      year={2024},
      eprint={2401.12963},
      archivePrefix={arXiv},
      primaryClass={cs.RO}
}