Researchers teach a robot to walk around San Francisco using AI’s word prediction techniques


A study from the University of California, Berkeley, teaches robots to walk using the same next-word prediction principle that underlies language models. The approach could pave the way for a new generation of robots that navigate complex environments with minimal training.

In their paper, “Humanoid Locomotion as Next Token Prediction,” the researchers treat the complex task of robot locomotion as a sequence prediction problem, similar to predicting the next word in language generation.

They take the Transformer architecture behind the breakthrough in large language models and adapt it to predict robot movements.

The robot’s steps are treated as “tokens”, comparable to words in a sentence. By predicting these tokens, the Transformer learns to predict the next movement based on the previous movement sequence. In other words, the robot predicts each next step based on the steps it has already taken.
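The prediction loop described above can be illustrated with a minimal Python sketch. It is not the researchers' code: the `Token` type (a short list of joint values), the `rollout` helper, and the `linear_extrapolation` predictor are all hypothetical names chosen for illustration, and the simple extrapolation rule merely stands in for the trained Transformer.

```python
from typing import Callable, List, Sequence

# Assumption: one "token" is a small vector of joint values for one time step.
Token = List[float]


def rollout(
    predict_next: Callable[[Sequence[Token]], Token],
    history: Sequence[Token],
    steps: int,
    context: int = 4,
) -> List[Token]:
    """Extend `history` by `steps` tokens, one prediction at a time."""
    trajectory = [list(token) for token in history]
    for _ in range(steps):
        window = trajectory[-context:]            # most recent tokens only
        trajectory.append(predict_next(window))   # prediction becomes new input
    return trajectory


def linear_extrapolation(window: Sequence[Token]) -> Token:
    """Stand-in for the Transformer: repeat the last change of each joint."""
    if len(window) < 2:
        return list(window[-1])
    previous, last = window[-2], window[-1]
    return [2 * b - a for a, b in zip(previous, last)]


# Two observed steps of a hypothetical two-joint robot.
history = [[0.0, 1.0], [0.1, 0.9]]
trajectory = rollout(linear_extrapolation, history, steps=3)
```

Each predicted token is appended to the sequence and fed back in as context for the next prediction, which is the same autoregressive pattern a language model uses to generate text word by word.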



The model was trained on a mix of data sources, including human motion data and YouTube videos. According to the researchers, the robot was able to walk the streets of San Francisco without having seen any examples of that environment beforehand, a so-called zero-shot transfer. The model was trained on just 27 hours of walking data.

The model can also execute commands that it did not see during training, such as walking backwards. This adaptability could enable robots to move flexibly in complex real-world environments with a fraction of the training effort otherwise required.

Predictions help optimize diverse multimodal training data

The researchers’ method shines in its ability to handle assorted data sources, ranging from videos and sensor readings to computer simulations, converting this information into a uniform format for the Transformer’s use.

They also devised a strategy for exploiting incomplete data: learnable mask tokens stand in for the missing parts of a trajectory, and the model is trained to predict only the information that is available. For YouTube videos, for example, the researchers used the joint positions of the human body to transfer the motion to the humanoid robot.

The research team trained the robot on a diverse set of data: neural network policies with complete sequences, model-based controllers without actions, approximately retargeted human motion capture data, and human poses reconstructed from internet videos. | Image: Radosavovic et al.

The team’s core insight is that even with incomplete trajectories, where certain sensory or motor information is missing, the model can still learn effectively by predicting the available information and filling in the gaps with learnable mask tokens.
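A rough Python sketch of this idea follows. It is an illustration under stated assumptions rather than the paper's implementation: the `Step` layout (an observation paired with an optional action), the fixed `MASK` vector, and the `tokenize` and `masked_mse` helpers are hypothetical, and in a real model the mask token would be a learned parameter rather than a constant.

```python
from typing import List, Optional, Sequence, Tuple

Vector = List[float]
# Assumption: each time step is an (observation, action) pair,
# and either part may be missing (None).
Step = Tuple[Optional[Vector], Optional[Vector]]

# Placeholder for the mask token; a real model would learn this vector.
MASK: Vector = [0.0, 0.0]


def tokenize(trajectory: Sequence[Step]) -> Tuple[List[Vector], List[bool]]:
    """Flatten steps into one token sequence, masking the missing parts."""
    tokens: List[Vector] = []
    available: List[bool] = []
    for observation, action in trajectory:
        for part in (observation, action):
            tokens.append(list(MASK) if part is None else list(part))
            available.append(part is not None)
    return tokens, available


def masked_mse(
    predictions: Sequence[Vector],
    targets: Sequence[Vector],
    available: Sequence[bool],
) -> float:
    """Mean squared error over available tokens only; masked slots are skipped."""
    total, count = 0.0, 0
    for predicted, target, is_available in zip(predictions, targets, available):
        if not is_available:
            continue
        total += sum((p - t) ** 2 for p, t in zip(predicted, target))
        count += len(target)
    return total / count if count else 0.0


# A video-derived trajectory: poses are known, motor actions are not.
video_trajectory: List[Step] = [([0.1, 0.2], None), ([0.3, 0.4], None)]
tokens, available = tokenize(video_trajectory)

# Score a dummy all-zero prediction against the tokens.
dummy_predictions = [[0.0, 0.0] for _ in tokens]
loss = masked_mse(dummy_predictions, tokens, available)
```

Because the error is averaged only over tokens that actually exist, a trajectory without actions still contributes a training signal through its observations instead of being discarded.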

