‘Machine Dreams’ Help MIT Engineers Teach Robot Dog New Tricks

Combining generative AI with physics simulators enables a quadruped robot to learn from synthetic data rather than real-world experience.
Photo courtesy MIT Computer Science and Artificial Intelligence Laboratory

This robot derives its parkour behavior from reinforcement learning.
Photo courtesy MIT Computer Science and Artificial Intelligence Laboratory


Parkour is a high-energy sport that involves jumping and climbing between challenging objects and obstacles. It’s hard enough for most humans to tackle.
But, imagine teaching a robot to do parkour without ever letting it see the real world. That’s exactly what engineers at the Massachusetts Institute of Technology (MIT) Computer Science and Artificial Intelligence Laboratory (CSAIL) have done. Their LucidSim system combines generative AI with physics simulators to enable a quadruped robot to learn from synthetic data rather than real-world experience.
Robots trained using these machine-generated environments succeeded 88 percent of the time, while those taught by human experts managed only 15 percent. When the engineers doubled the AI-generated training data, performance improved steadily, showing that more virtual practice leads to better real-world performance.
This approach could help solve a problem plaguing robotics: the need for extensive real-world training data. It’s a practical step toward developing machines that can adapt more quickly to new tasks and environments. The goal of the R&D project is to expose robots to scenarios that push the boundaries of their capabilities.
“Robots, like humans, extract knowledge from experience,” says Ge Yang, Ph.D., the lead researcher on the LucidSim project. “We need good experiences to know what to do, and we need bad experiences to know what to avoid. Learning happens at the tidal zone between mastery and failure, where the robot encounters a mixture of successes and failures.
“Providing robots targeted experience at the boundary of their capabilities translates into the most effective learning,” explains Yang. “If we rely on humans to provide this data, we won’t be able to scale up the system. This is why building simulators like ours is critical. We can let robots explore the environment, and generate relevant data on their own.
“Good performance actually depends on the system being robust,” claims Yang. “Machines that are unable to generalize are brittle, which means they are unable to succeed in a world that is full of surprises.
“Without the ability to adapt and generalize to any environment or conditions, robots will fail on a regular basis, making it hard for people to trust them with tasks,” warns Yang. “Creating robots that can generalize and adapt to all environments and conditions is a key step toward making them useful.”
LucidSim moves what currently takes place in the physical world into software. Yang and his colleagues use generative AI to systematically create diverse, rich-looking scenarios with little effort. The data from LucidSim is also higher quality. Robots learn via closed-loop training, where the data directly reflect the machine’s own actions.
AI plays two important roles in LucidSim. First, a variant of text-to-image diffusion models called ControlNet gives the engineers control over the geometry of the scene that appears in the generated images.
“These networks can be prompted via text, which gives us a way to specify the image content using human language,” says Yang. “We found that if we control the geometry, the same text prompts tend to produce similar-looking images. This is great for reproducibility, but it translates into data that are not diverse at all.
“So, to make the images diverse, we decided to programmatically collect a large number of text prompts from ChatGPT,” notes Yang. “Our setup allowed us to quickly generate a large number of different scenarios, and this is how we get the robot to work.
“Our robot derives its parkour behavior from reinforcement learning,” explains Yang. “The robot policy that we deploy is distilled from a teacher policy trained ahead of time using reinforcement learning on a lot of data.”
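The teacher-student setup Yang describes can be illustrated with a minimal sketch. The code below is an assumption-laden toy, not the actual LucidSim networks: a stand-in linear “teacher” (standing in for a policy trained with reinforcement learning) labels simulated states with actions, and a simpler “student” is fit to imitate those labels by least squares.

```python
import numpy as np

rng = np.random.default_rng(0)

def teacher_policy(states):
    # Hypothetical stand-in for an RL-trained teacher: a fixed linear
    # mapping with clipped (bounded) actions.
    W = np.array([[0.5, -0.2], [0.1, 0.8]])
    return np.clip(states @ W, -1.0, 1.0)

# Collect a distillation dataset by querying the teacher on simulated states.
states = rng.normal(size=(1000, 2))
actions = teacher_policy(states)

# Fit the student by least squares -- behavior cloning on teacher labels.
W_student, *_ = np.linalg.lstsq(states, actions, rcond=None)

# The student closely matches the teacher on states where actions rarely clip.
test_states = rng.normal(scale=0.5, size=(100, 2))
err = np.abs(test_states @ W_student - teacher_policy(test_states)).mean()
print(f"mean imitation error: {err:.4f}")
```

The key point the sketch captures is that the deployed policy never runs reinforcement learning itself; it only imitates a teacher that did.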
Yang and his colleagues generated realistic images by extracting depth maps, which provide geometric information, and semantic masks, which label different parts of an image, from the simulated scene.
But, they quickly realized that with tight control over the composition of the image content, the model produced nearly identical images when given the same prompt. So, they devised a way to source diverse text prompts from ChatGPT.
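The idea of programmatically diversifying prompts can be sketched in a few lines. The scene attributes and templates below are invented for illustration; in LucidSim the prompt variations were sourced from ChatGPT rather than hand-written lists.

```python
import itertools
import random

# Hypothetical scene attributes combined programmatically so the same
# scene geometry gets rendered with many different visual appearances.
SETTINGS = ["an alley between brick buildings", "a mossy forest trail",
            "an industrial warehouse floor", "a rooftop at dusk"]
LIGHTING = ["harsh midday sun", "soft overcast light", "orange sodium lamps"]
WEATHER = ["light rain", "dry dusty air", "patchy fog"]

def generate_prompts(n, seed=0):
    """Sample n diverse text prompts for the image generator."""
    rng = random.Random(seed)
    combos = list(itertools.product(SETTINGS, LIGHTING, WEATHER))
    rng.shuffle(combos)
    return [f"a photo of {s}, {l}, {w}" for s, l, w in combos[:n]]

prompts = generate_prompts(5)
for p in prompts:
    print(p)
```

Because the geometry is held fixed by the depth maps and semantic masks, varying only the text prompt changes how a scene looks without changing what the robot must physically do in it.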
This approach, however, only resulted in a single image. To make short, coherent videos that serve as little “experiences” for the robot, the CSAIL engineers hacked together some image magic into a technique they dubbed “Dreams in Motion.”
The system computes the movements of each pixel between frames to warp a single generated image into a short, multi-frame video. Dreams in Motion does this by considering the 3D geometry of the scene and the relative changes in the robot’s perspective.
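The parallax intuition behind Dreams in Motion can be shown with a toy warp. This is a simplified sketch under stated assumptions, not CSAIL's implementation: given one image and its per-pixel depth, a small sideways camera move shifts each pixel by an amount inversely proportional to its depth, turning a single frame into the next frame of a short video.

```python
import numpy as np

def warp_frame(image, depth, camera_shift, focal=50.0):
    """Warp one frame by per-pixel parallax: nearer pixels move farther."""
    h, w = image.shape
    warped = np.zeros_like(image)
    # Paint far pixels first so nearer pixels correctly occlude them.
    order = np.argsort(-depth, axis=None)
    for idx in order:
        y, x = divmod(idx, w)
        dx = int(round(focal * camera_shift / depth[y, x]))
        nx = x + dx
        if 0 <= nx < w:
            warped[y, nx] = image[y, x]
    return warped

# A near, bright stripe (depth 10) in front of a distant background (depth 100).
image = np.zeros((8, 8))
image[:, 3:5] = 1.0
depth = np.full((8, 8), 100.0)
depth[:, 3:5] = 10.0

# A small camera shift moves the near stripe two pixels; the background stays put.
frame2 = warp_frame(image, depth, camera_shift=0.4)
```

The zeros left behind where the stripe used to be are disocclusion holes, one of the artifacts any single-image warping scheme has to contend with.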
“Regarding the robot, I wanted to study robot dogs but they were not really available in our labs,” says Yang. “So, I purchased [one] to support this research.”
Both data quality and quantity remain bottlenecks that Yang and his colleagues are attempting to address. “Quality-wise, real-world human demonstrations are very difficult to scale, because humans have to reproduce the robot’s failure scenarios, so that the data they collect can ‘correct’ the robot,” he points out. “For a dataset, quality also includes diversity.
“Without covering diverse scenarios, datasets with a sparse or narrow focus will leave holes,” says Yang. “If the robot encounters those holes during deployment, it likely will fail.
“Data quantity is also a problem,” adds Yang. “A large gap is not something you can cross by scaling existing technology without innovation, but this seems to be the groupthink in robotics at the moment.
“With LucidSim, [we are] taking a very different approach,” claims Yang. “We are trying to find new solutions with an open mind, to solve both the data quality and the quantity problem.”
According to Yang, LucidSim could someday be applied to a variety of manufacturing applications, such as bin picking, part orientation and sorting.
“Although manipulation tasks in a manufacturing scenario are a lot more complex than the simple scenes we used in parkour, the LucidSim GenAI pipeline can still replace the entire graphics front end of the physics simulator,” explains Yang. “To learn a particular task, we can start with a 3D model of the object we care about.
“LucidSim will automatically diversify the lighting, the texture, the background content and more,” says Yang. “This diversity will teach the robot to focus on the things that matter for the task.
“Compared to techniques such as domain randomization, LucidSim works better, and offers the additional control to manipulate the semantics of the objects and the scene,” claims Yang. “This is a very general and effective way to train machine vision systems in simulation.
“Assembly tasks require dexterity and tactile feedback,” notes Yang. “We are taking a fully software-driven approach to manipulation, where the human data happens in virtual reality as opposed to the real world. We believe this is a lot more scalable than having to set up physical scenes.”
To support this agenda, Yang and his colleagues recently released a next-generation data collection tool called vuer.ai. It runs physics simulation directly inside a virtual reality device.
“We can simulate very realistic physics, including deformable objects and wind,” says Yang. “We also built a way for users to experience the simulated forces at their fingertips. The nice thing about virtual reality is that the device handles hand tracking.”