In Short:
Google DeepMind has upgraded a robot at its office in Mountain View, California, with its large language model Gemini, allowing it to understand natural-language commands from humans and navigate in response with up to 90 percent reliability. The robot can now find its way around the office and perform tasks like locating a whiteboard or a misplaced item. This demonstrates the potential for language models to enhance robots' abilities in real-world environments.
In a cluttered open-plan office in Mountain View, California, a tall and slender wheeled robot has been busy playing tour guide and informal office helper—thanks to a large language model upgrade, Google DeepMind revealed today. The robot uses the latest version of Google’s Gemini large language model to both parse commands and find its way around.
Enhanced Robot Capabilities
When told by a human "Find me somewhere to write," for instance, the robot dutifully trundles off, leading the person to a pristine whiteboard located somewhere in the building. Gemini's ability to handle video and text allows the "Google helper" robot to make sense of its environment and navigate correctly when given commands that require some commonsense reasoning. The robot pairs Gemini with an algorithm that generates specific actions, such as turning, based on the command and what the robot sees in front of it.
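The division of labor described above, a language model that interprets a command against what the robot sees, feeding a separate low-level policy that emits concrete motions, can be sketched roughly as follows. This is a toy illustration only: all function names, landmarks, and the one-dimensional "corridor" are invented for clarity, and the real system's interfaces are not public.

```python
def vlm_plan(command: str, visible_landmarks: list[str]) -> str:
    """Stand-in for the vision-language model: choose a landmark
    that satisfies the command. A real system would prompt a
    multimodal model with camera frames; here we match keywords."""
    if "write" in command and "whiteboard" in visible_landmarks:
        return "whiteboard"
    return visible_landmarks[0] if visible_landmarks else "stay"


def action_policy(goal: str, position: int,
                  landmark_positions: dict[str, int]) -> str:
    """Stand-in for the low-level action generator: emit one
    primitive motion (forward, turn-around, or stop) per step."""
    target = landmark_positions.get(goal, position)
    if target > position:
        return "forward"
    if target < position:
        return "turn-around"
    return "stop"


def navigate(command: str, landmarks: dict[str, int],
             start: int = 0, max_steps: int = 10):
    """Plan once with the VLM, then act step by step until done."""
    goal = vlm_plan(command, list(landmarks))
    pos, trace = start, []
    for _ in range(max_steps):
        action = action_policy(goal, pos, landmarks)
        trace.append(action)
        if action == "stop":
            break
        pos += 1 if action == "forward" else -1
    return goal, trace
```

Calling `navigate("Find me somewhere to write", {"whiteboard": 3, "kitchen": 1})` walks the mock robot forward until it reaches the whiteboard, mirroring the plan-then-act split the researchers describe, though the real system grounds its plan in live video rather than a keyword table.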
Technical Advancements
When Gemini was introduced in December, Demis Hassabis, CEO of Google DeepMind, told WIRED that its multimodal capabilities would likely unlock new robot abilities. In a new paper outlining the project, the researchers behind the work say that their robot proved to be up to 90 percent reliable at navigating, even when given tricky commands such as “Where did I leave my coaster?” DeepMind’s system “has significantly improved the naturalness of human-robot interaction, and greatly increased the robot usability,” the team writes.
Potential in Large Language Models
The demo neatly illustrates the potential for large language models to reach into the physical world and do useful work. Gemini and other chatbots mostly operate within the confines of a web browser or app, although they are increasingly able to handle visual and auditory input. Academic and industry research labs are racing to see how language models might be used to enhance robots’ abilities.
Investment in Robotics
Investors are pouring money into startups aiming to apply advances in AI to robotics. Several of the researchers involved with the Google project have since left the company to found a startup called Physical Intelligence, which received an initial $70 million in funding; it is working to combine large language models with real-world training to give robots general problem-solving abilities. Skild AI, founded by roboticists at Carnegie Mellon University, has a similar goal. This month it announced $300 million in funding.
Just a few years ago, a robot would need a map of its environment and carefully chosen commands to navigate successfully. Large language models contain useful information about the physical world, and newer versions that are trained on images and video as well as text, known as vision language models, can answer questions that require perception. Gemini allows Google’s robot to parse visual instructions as well as spoken ones, following a sketch on a whiteboard that shows a route to a new destination.
In their paper, the researchers say they plan to test the system on different kinds of robots. They add that Gemini should be able to make sense of more complex questions, such as “Do they have my favorite drink today?” from a user with a lot of empty Coke cans on their desk.