CAMBRIDGE, MA—Inspired by large language models, engineers at the Massachusetts Institute of Technology (MIT) have developed a unique way to teach general-purpose robots new skills. They created a versatile technique that combines a huge amount of heterogeneous data from many sources into one system that can teach any robot a wide range of tasks.
Their method involves aligning data from varied domains, like simulations and real robots, and multiple modalities, including vision sensors and robotic arm position encoders, into a shared “language” that a generative AI model can process.
Because it draws on such an enormous amount of data, this approach can be used to train a robot to perform a variety of tasks without the need to start training it from scratch each time.
This method could be faster and less expensive than traditional techniques because it requires far fewer task-specific data. In addition, it outperforms training from scratch by more than 20 percent in simulation and real-world experiments.
Typically, engineers collect data that are specific to certain robots and tasks, which they use to train the machine in a controlled environment. However, gathering this information is costly and time-consuming, and a robot will likely struggle to adapt to environments or tasks it hasn’t seen before.
“In robotics, people often claim that we don’t have enough training data,” says Lirui Wang, an electrical engineering and computer science graduate student at MIT who is working on the project. “But, another big problem is that the data come from so many different domains, modalities and robot hardware.”
A robotic “policy” takes in sensor observations, like camera images or proprioceptive measurements that track the speed and position of a robotic arm, and then tells a robot how and where to move.
Policies are typically trained using imitation learning, meaning a human demonstrates actions or teleoperates a robot to generate data, which are fed into an AI model that learns the policy. Because this method uses a small amount of task-specific data, robots often fail when their environment or task changes.
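As a rough illustration of imitation learning in its simplest form, the Python sketch below fits a tiny linear policy to recorded demonstrations. Everything in it is an assumption made for illustration: the two-number observation, the `expert_action` rule standing in for a human demonstrator, and the function names are hypothetical and are not taken from the MIT team's work.

```python
import random


def expert_action(obs):
    # Hypothetical stand-in for a human demonstrator: a fixed rule
    # that the learner only sees through recorded examples.
    return 0.5 * obs[0] - 0.25 * obs[1]


def collect_demonstrations(n, seed=0):
    # Record (observation, action) pairs, as teleoperation would.
    rng = random.Random(seed)
    demos = []
    for _ in range(n):
        obs = (rng.uniform(-1, 1), rng.uniform(-1, 1))
        demos.append((obs, expert_action(obs)))
    return demos


def policy(weights, obs):
    # A policy maps sensor observations to an action.
    return weights[0] * obs[0] + weights[1] * obs[1]


def train_policy(demos, lr=0.1, epochs=200):
    # Stochastic gradient descent on squared error between the
    # policy's action and the demonstrated action.
    weights = [0.0, 0.0]
    for _ in range(epochs):
        for obs, action in demos:
            err = policy(weights, obs) - action
            weights[0] -= lr * err * obs[0]
            weights[1] -= lr * err * obs[1]
    return weights


demos = collect_demonstrations(50)
weights = train_policy(demos)
```

In this toy setting the learned weights closely match the demonstrator's rule because the demonstrations cover the same narrow range the policy is later asked about. A real robot has no such guarantee once its environment or task shifts.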
To develop a better approach, Wang and his colleagues drew inspiration from large language models like GPT-4. These models are pretrained using an enormous amount of diverse language data and then fine-tuned by feeding them a small amount of task-specific data. Pretraining on so much data helps the models adapt to perform well on a variety of tasks.
“In the language domain, the data are all just sentences,” explains Wang. “In robotics, given all the heterogeneity in the data, if you want to pretrain in a similar manner, we need a different architecture.”
According to Wang, robotic data take many forms, from camera images to language instructions to depth maps. At the same time, each robot is mechanically unique, with a different number and orientation of arms, grippers and sensors. Plus, the environments where data are collected vary widely.
Wang and his colleagues at MIT’s Computer Science and Artificial Intelligence Laboratory developed a new architecture called Heterogeneous Pretrained Transformers that unifies data from these varied modalities and domains.
At the center of their architecture, they placed a machine-learning model known as a transformer, which processes vision and proprioception inputs. A transformer is the same type of model that forms the backbone of large language models.
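The general idea of feeding different kinds of sensor data through one shared model can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the researchers' implementation: the names `vision_stem`, `proprio_stem` and `action_head`, the input sizes, the token width and the random weights are all hypothetical, and the attention step omits the learned projections and stacked layers a real transformer would have.

```python
import math
import random

TOKEN_DIM = 4  # assumed width of the shared token format


def make_matrix(rows, cols, rng):
    # Random weights stand in for parameters that would be learned.
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    # Single-head attention in which each token serves as its own
    # query, key and value (no learned projections, for brevity).
    scale = math.sqrt(len(tokens[0]))
    mixed = []
    for query in tokens:
        scores = [sum(a * b for a, b in zip(query, key)) / scale for key in tokens]
        attn = softmax(scores)
        mixed.append([sum(w * tok[i] for w, tok in zip(attn, tokens))
                      for i in range(len(query))])
    return mixed


rng = random.Random(0)
vision_stem = make_matrix(TOKEN_DIM, 6, rng)   # assumed 6 features per image patch
proprio_stem = make_matrix(TOKEN_DIM, 3, rng)  # assumed 3 joint readings
action_head = make_matrix(2, TOKEN_DIM, rng)   # assumed 2-number action


def forward(vision_patches, proprio):
    # Each modality is projected into tokens of the same width.
    tokens = [matvec(vision_stem, patch) for patch in vision_patches]
    tokens.append(matvec(proprio_stem, proprio))
    # The shared model then treats all tokens alike.
    mixed = self_attention(tokens)
    pooled = [sum(tok[i] for tok in mixed) / len(mixed) for i in range(TOKEN_DIM)]
    return matvec(action_head, pooled)


patches = [[0.1 * i + 0.01 * j for j in range(6)] for i in range(4)]
action = forward(patches, [0.2, -0.1, 0.05])
```

The design point is that only the small input projections know which sensor a number came from. Past that step, camera and arm-position data share one format, so a single model in the middle can process both.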
“Our dream is to have a universal robot brain that you could download and use for your robot without any training at all,” says Wang. “While we are just in the early stages, we are going to keep pushing hard and hope scaling leads to a breakthrough in robotic policies, like it did with large language models.”