bnew

Eureka! NVIDIA Research Breakthrough Puts New Spin on Robot Learning​

AI agent uses LLMs to automatically generate reward algorithms to train robots to accomplish complex tasks.

October 20, 2023 by ANGIE LEE


A new AI agent developed by NVIDIA Research that can teach robots complex skills has trained a robotic hand to perform rapid pen-spinning tricks — for the first time as well as a human can.

The stunning prestidigitation, showcased in the video above, is one of nearly 30 tasks that robots have learned to expertly accomplish thanks to Eureka, which autonomously writes reward algorithms to train bots.

Eureka has also taught robots to open drawers and cabinets, toss and catch balls, and manipulate scissors, among other tasks.

The Eureka research, published today, includes a paper and the project’s AI algorithms, which developers can experiment with using NVIDIA Isaac Gym, a physics simulation reference application for reinforcement learning research. Isaac Gym is built on NVIDIA Omniverse, a development platform for building 3D tools and applications based on the OpenUSD framework. Eureka itself is powered by the GPT-4 large language model.

“Reinforcement learning has enabled impressive wins over the last decade, yet many challenges still exist, such as reward design, which remains a trial-and-error process,” said Anima Anandkumar, senior director of AI research at NVIDIA and an author of the Eureka paper. “Eureka is a first step toward developing new algorithms that integrate generative and reinforcement learning methods to solve hard tasks.”



AI Trains Robots

Eureka-generated reward programs — which enable trial-and-error learning for robots — outperform expert human-written ones on more than 80% of tasks, according to the paper. This leads to an average performance improvement of more than 50% for the bots.

[Video: Robot arm taught by Eureka to open a drawer.]

The AI agent taps the GPT-4 LLM and generative AI to write software code that rewards robots for reinforcement learning. It doesn’t require task-specific prompting or predefined reward templates — and readily incorporates human feedback to modify its rewards for results more accurately aligned with a developer’s vision.

Using GPU-accelerated simulation in Isaac Gym, Eureka can quickly evaluate the quality of large batches of reward candidates for more efficient training.

Eureka then constructs a summary of the key stats from the training results and instructs the LLM to improve its generation of reward functions. In this way, the AI is self-improving. It’s taught all kinds of robots — quadruped, bipedal, quadrotor, dexterous hands, cobot arms and others — to accomplish all kinds of tasks.
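
As a rough sketch of the generate-evaluate-refine loop described above: an LLM proposes a batch of candidate reward functions, each candidate is scored by training briefly in simulation, and a summary of the best candidate's training statistics is fed back into the next prompt. The helper functions below are placeholders for illustration, not NVIDIA's Eureka code.

```python
# Hypothetical sketch of an Eureka-style generate-evaluate-refine loop.
# `query_gpt4` and `train_policy_in_sim` are placeholders, not real APIs.

def query_gpt4(prompt: str, n: int) -> list[str]:
    """Placeholder: ask the LLM for n candidate reward functions as Python source."""
    raise NotImplementedError

def train_policy_in_sim(reward_source: str) -> dict:
    """Placeholder: plug the reward into the simulated task, train briefly,
    and return {"task_score": float, "summary": str}."""
    raise NotImplementedError

def eureka_loop(task_description: str, env_source: str,
                iterations: int = 5, batch: int = 16) -> str:
    prompt = (f"Environment code:\n{env_source}\n"
              f"Task: {task_description}\n"
              "Write a reward function for this task.")
    best_source, best_score = None, float("-inf")
    for _ in range(iterations):
        candidates = query_gpt4(prompt, n=batch)                     # sample a batch of reward programs
        results = [train_policy_in_sim(src) for src in candidates]   # score each one in simulation
        top = max(range(len(results)), key=lambda i: results[i]["task_score"])
        if results[top]["task_score"] > best_score:
            best_score, best_source = results[top]["task_score"], candidates[top]
        # Summarize the best candidate's training statistics and ask the LLM to improve on it.
        prompt += (f"\nPrevious reward function:\n{candidates[top]}\n"
                   f"Training summary: {results[top]['summary']}\n"
                   "Write an improved reward function.")
    return best_source
```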

The research paper provides in-depth evaluations of 20 Eureka-trained tasks, based on open-source dexterity benchmarks that require robotic hands to demonstrate a wide range of complex manipulation skills.

The results from nine Isaac Gym environments are showcased in visualizations generated using NVIDIA Omniverse.

[Video: Humanoid robot learns a running gait via Eureka.]

“Eureka is a unique combination of large language models and NVIDIA GPU-accelerated simulation technologies,” said Linxi “Jim” Fan, senior research scientist at NVIDIA, who’s one of the project’s contributors. “We believe that Eureka will enable dexterous robot control and provide a new way to produce physically realistic animations for artists.”

It’s breakthrough work bound to get developers’ minds spinning with possibilities, adding to recent NVIDIA Research advancements like Voyager, an AI agent built with GPT-4 that can autonomously play Minecraft.

NVIDIA Research comprises hundreds of scientists and engineers worldwide, with teams focused on topics including AI, computer graphics, computer vision, self-driving cars and robotics.

Learn more about Eureka and NVIDIA Research.

 

bnew

OCTOBER 24, 2023

Researchers create magnetic microrobots that work together to assemble objects in 3D environments​

by K. W. Wesselink-Schram, University of Twente
[Figure: Experimental collaborative grasping and assembly results. The magnetic agents are 1 mm stainless steel spheres and the passive objects are 2 mm 3D-printed cubes. (A) The procedure consists of four steps: approach, grasping, translation, and release. Solid red arrows represent the motion of the magnetic agents; the dashed green arrow represents the motion of the ensemble. (B) Snapshots of the grasping and stacking of three cubes. (C) Snapshots of the grasping and stacking of a beam on top of two cubes. The passive objects (cubes and beam) are highlighted in the top view for clarity. Credit: Advanced Intelligent Systems (2023). DOI: 10.1002/aisy.202300365]

For the first time ever, researchers at the Surgical Robotics Laboratory of the University of Twente successfully made two microrobots work together to pick up, move and assemble passive objects in 3D environments. This achievement opens new horizons for promising biomedical applications.


Imagine you need surgery somewhere inside your body. However, the part that needs surgery is very difficult for a surgeon to reach. In the future, a couple of robots smaller than a grain of salt might go into your body and perform the surgery. These microrobots could work together to perform all kinds of complex tasks. "It's almost like magic," says Franco Piñan Basualdo, corresponding author of the publication.

Researchers from the University of Twente successfully exploited two of these 1-millimeter-sized magnetic microrobots to perform several operations. Like clockwork, the microrobots were able to pick up, move and assemble cubes. Unique to this achievement is the 3D environment in which the robots performed their tasks.

Achieving this was quite a challenge. Just like regular magnets stick together when they get too close, these tiny magnetic robots behave similarly. This means they have a limit to how close they can get before they start sticking together. But the researchers at the Surgical Robotics Laboratory found a way to use this natural attraction to their advantage. With a custom-made controller, the team could move the individual robots and make them interact with each other.
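
The article does not give the control law, but the basic idea of steering two magnetic agents toward targets while keeping them from snapping together can be sketched in a toy 2D form. Everything below (the proportional gain, the minimum separation, the geometry) is invented for illustration and is not the Twente team's controller.

```python
import numpy as np

# Toy illustration only: steer two magnetic agents toward targets with simple
# proportional control while enforcing a minimum separation so they do not stick together.
MIN_SEPARATION = 3.0  # mm, invented value
GAIN = 0.2            # invented proportional gain

def step(positions: np.ndarray, targets: np.ndarray) -> np.ndarray:
    """positions, targets: (2, 2) arrays of [x, y] in mm for the two agents."""
    new_positions = positions + GAIN * (targets - positions)
    delta = new_positions[1] - new_positions[0]
    dist = np.linalg.norm(delta)
    if dist < MIN_SEPARATION:
        # Push the agents back apart along their separation axis.
        correction = (MIN_SEPARATION - dist) / 2 * delta / (dist + 1e-9)
        new_positions[0] -= correction
        new_positions[1] += correction
    return new_positions
```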


Credit: University of Twente

The microrobots are biocompatible and can be controlled in difficult-to-reach and even enclosed environments. This makes the technology promising for biomedical studies and applications. "We can remotely manipulate biomedical samples without contaminating them. This could improve existing procedures and open the door to new ones," says Piñan Basualdo.

Piñan Basualdo is a postdoctoral researcher at the Surgical Robotics Laboratory. His research interests include micro-robotics, non-contact control, swarm robotics, active matter, microfluidics, and interfacial phenomena.

This research was performed at the Surgical Robotics Laboratory. Prof. Sarthak Misra, head of the lab, focuses on developing innovative solutions for a broad range of clinically relevant challenges, including biomedical imaging, automation of medical procedures, and the development of microrobotic tools.

The research was performed in the framework of the European RĔGO project (Horizon Europe program), which aims to develop an innovative set of AI-powered, microsized, untethered, stimuli-responsive swarms of robots. The findings were published in a paper titled "Collaborative Magnetic Agents for 3D Microrobotic Grasping," in the journal Advanced Intelligent Systems.

More information: Franco N. Piñan Basualdo et al, Collaborative Magnetic Agents for 3D Microrobotic Grasping, Advanced Intelligent Systems (2023). DOI: 10.1002/aisy.202300365
 

bnew

AI robotics’ ‘GPT moment’ is near​

Peter Chen @peterxichen / 9:35 AM EST • November 10, 2023


[Image Credits: Robust.ai]

Peter Chen, Contributor

Peter Chen is CEO and co-founder of Covariant, the world's leading AI robotics company. Before founding Covariant, Peter was a research scientist at OpenAI and a researcher at the Berkeley Artificial Intelligence Research (BAIR) Lab, where he focused on reinforcement learning, meta-learning, and unsupervised learning.

It’s no secret that foundation models have transformed AI in the digital world. Large language models (LLMs) like ChatGPT, LLaMA, and Bard revolutionized AI for language. While OpenAI’s GPT models aren’t the only large language model available, they have achieved the most mainstream recognition for taking text and image inputs and delivering human-like responses — even with some tasks requiring complex problem-solving and advanced reasoning.

ChatGPT’s viral and widespread adoption has largely shaped how society understands this new moment for artificial intelligence.

The next advancement that will define AI for generations is robotics. Building AI-powered robots that can learn how to interact with the physical world will enhance all forms of repetitive work in sectors ranging from logistics, transportation, and manufacturing to retail, agriculture, and even healthcare. It will also unlock as many efficiencies in the physical world as we’ve seen in the digital world over the past few decades.

While there is a unique set of problems to solve within robotics compared to language, there are similarities across the core foundational concepts. And some of the brightest minds in AI have made significant progress in building the “GPT for robotics.”


What enables the success of GPT?​

To understand how to build the “GPT for robotics,” first look at the core pillars that have enabled the success of LLMs such as GPT.

Foundation model approach​

GPT is an AI model trained on a vast, diverse dataset. Engineers previously collected data and trained specific AI for a specific problem. Then they would need to collect new data to solve another. Another problem? New data yet again. Now, with a foundation model approach, the exact opposite is happening.

Instead of building niche AIs for every use case, one model can be used universally, and that one very general model is more successful than every specialized model. A foundation model even performs better on a single specific task than a narrow model does: it can leverage learnings from other tasks and generalize to new tasks better because it has learned additional skills from having to perform well across a diverse set of tasks.

Training on a large, proprietary, and high-quality dataset​

To have a generalized AI, you first need access to a vast amount of diverse data. OpenAI obtained the real-world data needed to train the GPT models reasonably efficiently: GPT was trained on a large and diverse dataset collected from across the internet, including books, news articles, social media posts, code, and more.

Building AI-powered robots that can learn how to interact with the physical world will enhance all forms of repetitive work.

It’s not just the size of the dataset that matters; curating high-quality, high-value data also plays a huge role. The GPT models have achieved unprecedented performance because their high-quality datasets are informed predominantly by the tasks users care about and the most helpful answers.

Role of reinforcement learning (RL)​

OpenAI employs reinforcement learning from human feedback (RLHF) to align the model’s response with human preference (e.g., what’s considered beneficial to a user). Pure supervised learning (SL) is not enough, because SL can only tackle a problem with a clear pattern or set of labeled examples, while LLMs require the AI to achieve a goal that has no unique, correct answer. Enter RLHF.

RLHF allows the algorithm to move toward a goal through trial and error while a human acknowledges correct answers (high reward) or rejects incorrect ones (low reward). The AI finds the reward function that best explains the human preference and then uses RL to learn how to get there. ChatGPT can deliver responses that mirror or exceed human-level capabilities by learning from human feedback.
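
As a concrete illustration of the "find the reward function that best explains the human preference" step, here is a minimal sketch of the pairwise preference loss commonly used to fit a reward model to human comparison labels. This is the standard textbook form with toy values, not OpenAI's internal code.

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style loss: push the reward of the human-preferred response
    above the reward of the rejected one. Inputs are (batch,) reward scores."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Usage sketch: scores come from a reward model evaluated on (prompt, response) pairs.
chosen = torch.tensor([1.2, 0.3, 2.0])
rejected = torch.tensor([0.4, 0.5, 1.1])
loss = preference_loss(chosen, rejected)  # smaller when chosen scores exceed rejected ones
```

The policy is then fine-tuned with RL to maximize the learned reward, which is what "uses RL to learn how to get there" refers to above.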


The next frontier of foundation models is in robotics​

The same core technology that allows GPT to see, think, and even speak also enables machines to see, think, and act. Robots powered by a foundation model can understand their physical surroundings, make informed decisions, and adapt their actions to changing circumstances.

The “GPT for robotics” is being built the same way as GPT was — laying the groundwork for a revolution that will, yet again, redefine AI as we know it.

Foundation model approach​

By taking a foundation model approach, you can also build one AI that works across multiple tasks in the physical world. A few years ago, experts advised making a specialized AI for robots that pick and pack grocery items. And that’s different from a model that can sort various electrical parts, which is different from the model unloading pallets from a truck.

This paradigm shift to a foundation model enables the AI to better respond to edge-case scenarios that frequently exist in unstructured real-world environments and might otherwise stump models with narrower training. Building one generalized AI for all of these scenarios is more successful. It’s by training on everything that you get the human-level autonomy we’ve been missing from the previous generations of robots.

Training on a large, proprietary, and high-quality dataset​

Teaching a robot to learn what actions lead to success and what leads to failure is extremely difficult. It requires extensive high-quality data based on real-world physical interactions. Single lab settings or video examples are not reliable or robust enough sources (e.g., YouTube videos fail to translate the details of the physical interaction, and academic datasets tend to be limited in scope).

Unlike AI for language or image processing, no preexisting dataset represents how robots should interact with the physical world. Thus, the large, high-quality dataset becomes a more complex challenge to solve in robotics, and deploying a fleet of robots in production is the only way to build a diverse dataset.

Role of reinforcement learning​

Similar to answering text questions with human-level capability, robotic control and manipulation require an agent to seek progress toward a goal that has no single, unique, correct answer (e.g., “What’s a successful way to pick up this red onion?”). Once again, more than pure supervised learning is required.

You need a robot running deep reinforcement learning (deep RL) to succeed in robotics. This autonomous, self-learning approach combines RL with deep neural networks to unlock higher levels of performance — the AI will automatically adapt its learning strategies and continue to fine-tune its skills as it experiences new scenarios.
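
To make "RL with deep neural networks" concrete, here is a minimal REINFORCE-style policy-gradient update in PyTorch. It is a generic sketch with toy observation and action sizes, not any company's production training loop.

```python
import torch
import torch.nn as nn

# Toy sizes: 8-dimensional observation, 4 discrete actions.
policy = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 4))
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)

def reinforce_update(observations: torch.Tensor, actions: torch.Tensor,
                     returns: torch.Tensor) -> float:
    """One REINFORCE step: raise the log-probability of actions that led to high return.
    observations: (T, 8) float, actions: (T,) long, returns: (T,) float."""
    log_probs = torch.log_softmax(policy(observations), dim=-1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = -(chosen * returns).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```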


Challenging, explosive growth is coming​

In the past few years, some of the world’s brightest AI and robotics experts laid the technical and commercial groundwork for a robotic foundation model revolution that will redefine the future of artificial intelligence.

While these AI models have been built similarly to GPT, achieving human-level autonomy in the physical world is a different scientific challenge for two reasons:

  1. Building an AI-based product that can serve a variety of real-world settings has a remarkable set of complex physical requirements. The AI must adapt to different hardware applications, as it’s doubtful that one hardware will work across various industries (logistics, transportation, manufacturing, retail, agriculture, healthcare, etc.) and activities within each sector.
  2. Warehouses and distribution centers are an ideal learning environment for AI models in the physical world. It’s common to have hundreds of thousands or even millions of different stock-keeping units (SKUs) flowing through any facility at any given moment — delivering the large, proprietary, and high-quality dataset needed to train the “GPT for robotics.”


AI robotics “GPT moment” is near​

The growth trajectory of robotic foundation models is accelerating at a very rapid pace. Robotic applications, particularly within tasks that require precise object manipulation, are already being applied in real-world production environments — and we’ll see an exponential number of commercially viable robotic applications deployed at scale in 2024.

Chen has published more than 30 academic papers that have appeared in the top global AI and machine learning journals.
 

bnew

Research

Introducing Ego-Exo4D: A foundational dataset for research on video learning and multimodal perception

November 30, 2023 • 8 minute read



Today we are announcing Ego-Exo4D, a foundational dataset and benchmark suite to support research on video learning and multimodal perception. The result of a two-year effort by Meta’s FAIR (Fundamental Artificial Intelligence Research), Meta’s Project Aria, and 15 university partners, the centerpiece of Ego-Exo4D is its simultaneous capture of both first-person “egocentric” views, from a participant’s wearable camera, as well as multiple “exocentric” views, from cameras surrounding the participant. The two perspectives are complementary. While the egocentric perspective reveals what the participant sees and hears, the exocentric views reveal the surrounding scene and the context. Together, these two perspectives give AI models a new window into complex human skill.




Working together as a consortium, FAIR or university partners captured these perspectives with the help of more than 800 skilled participants in the United States, Japan, Colombia, Singapore, India, and Canada. In December, the consortium will open source the data (including more than 1,400 hours of video) and annotations for novel benchmark tasks. Additional details about the datasets can be found in our technical paper. Next year, we plan to host a first public benchmark challenge and release baseline models for ego-exo understanding. Each university partner followed their own formal review processes to establish the standards for collection, management, informed consent, and a license agreement prescribing proper use. Each member also followed the Project Aria Community Research Guidelines. With this release, we aim to provide the tools the broader research community needs to explore ego-exo video, multimodal activity recognition, and beyond.


How Ego-Exo4D works

Ego-Exo4D focuses on skilled human activities, such as playing sports, music, cooking, dancing, and bike repair. Advances in AI understanding of human skill in video could facilitate many applications. For example, in future augmented reality (AR) systems, a person wearing smart glasses could quickly pick up new skills with a virtual AI coach that guides them through a how-to video; in robot learning, a robot watching people in its environment could acquire new dexterous manipulation skills with less physical experience; in social networks, new communities could form based on how people share their expertise and complementary skills in video.

Such applications demand the ability to move fluidly between the exo and ego views. For example, imagine watching an expert repair a bike tire, juggle a soccer ball, or fold an origami swan—then being able to map their steps to your own actions. Cognitive science tells us that even from a very young age we can observe others’ behavior (exo) and translate it onto our own (ego).

Realizing this potential, however, is not possible using today's datasets and learning paradigms. Existing datasets comprising both ego and exo views (i.e., ego-exo) are few, small in scale, lack synchronization across cameras, and/or are too staged or curated to be resilient to the diversity of the real world. As a result, the current literature for activity understanding primarily covers only the ego or exo view, leaving the ability to move fluidly between the first- and third-person perspectives out of reach.

Ego-Exo4D constitutes the largest public dataset of time-synchronized first- and third-person video. Building this dataset required the recruitment of specialists across varying domains, bringing diverse groups of people together to create a multifaceted AI dataset. All scenarios feature real-world experts, where the camera-wearer participant has specific credentials, training, or expertise in the skill being demonstrated. For example, among the Ego-Exo4D camera wearers are professional and college athletes; jazz, salsa, and Chinese folk dancers and instructors; competitive boulderers; professional chefs who work in industrial-scale kitchens; and bike technicians who service dozens of bikes per day.

Ego-Exo4D is not only multiview, it is also multimodal. Captured with Meta’s unique Aria glasses, all ego videos are accompanied by time-aligned seven channel audio, inertial measurement units (IMU), and two wide-angle grayscale cameras, among other sensors. All data sequences also provide eye gaze, head poses, and 3D point clouds of the environment through Project Aria’s state-of-the-art machine perception services. Additionally, Ego-Exo4D provides multiple new video-language resources:


  • First-person narrations by the camera wearers describing their own actions.
  • Third-person play-by-play descriptions of every camera wearer action
  • Third-person spoken expert commentary critiquing the videos. We hired 52 people with expertise in particular domains, many of them coaches and teachers, to provide tips and critiques based on the camera wearer’s performance. At each time step, the experts explain how the participants’ actions, such as their hand and body poses, affect their performance, and provide spatial markings to support their commentary.

All three language corpora are time-stamped against the video. With these novel video-language resources, AI models could learn about the subtle aspects of skilled human activities. To our knowledge, there is no prior video resource with such extensive and high quality multimodal data.
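
As a small illustration of how time-stamped language can be tied back to synchronized video, here is a hypothetical alignment helper. The record layout, field names, and frame rate below are assumptions made for the sketch, not the actual Ego-Exo4D schema.

```python
# Hypothetical alignment of a time-stamped narration to frames in synchronized
# ego and exo streams; the record layout is illustrative, not the Ego-Exo4D schema.
FPS = 30  # assumed common frame rate after synchronization

def narration_to_frame(narration: dict, num_frames: int) -> tuple[int, str]:
    """narration = {"timestamp_s": float, "text": str}; returns (frame_idx, text)."""
    frame_idx = min(int(round(narration["timestamp_s"] * FPS)), num_frames - 1)
    return frame_idx, narration["text"]

narrations = [
    {"timestamp_s": 4.2, "text": "I rotate the pedal to line up the chain."},
    {"timestamp_s": 9.8, "text": "Now I loosen the rear wheel bolts."},
]
aligned = [narration_to_frame(n, num_frames=18000) for n in narrations]
# Each (frame_idx, text) pair indexes the same instant in every ego and exo view,
# because the streams are time-synchronized.
```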

Alongside the data, we introduce benchmarks for foundational tasks for ego-exo video to spur the community's efforts. We propose four families of tasks:


  1. Ego(-exo) recognition: recognizing fine-grained keysteps of procedural activities and their structure from ego (and/or optionally exo) video, even in energy-constrained scenarios;
  2. Ego(-exo) proficiency estimation: inferring how well a person is executing a skill;
  3. Ego-exo relation: relating the actions of a teacher (exo) to a learner (ego) by estimating semantic correspondences and translating viewpoints; and
  4. Ego pose: recovering the skilled movements of experts from only monocular ego-video, namely 3D body and hand pose.

We provide high quality annotations for training and testing each task—the result of more than 200,000 hours of annotator effort. To kickstart work in these new challenges, we also develop baseline models and report their results. We plan to host a first public benchmark challenge in 2024.




Collaboratively building on this research

The Ego4D consortium is a long-running collaboration between FAIR and more than a dozen universities around the world. Following the 2021 release of Ego4D, this team of expert faculty, graduate students, and industry researchers reconvened to launch the Ego-Exo4D effort. The consortium’s strengths are both its collective AI talent as well as its breadth in geography, which facilitates recording data in a wide variety of visual contexts. Overall, Ego-Exo4D includes video from six countries and seven U.S. states, offering a diverse resource for AI development. The consortium members and FAIR researchers collaborated throughout the project, from developing the initiative’s scope, to each collecting unique components of the dataset, to formulating the benchmark tasks. This project also marks the single largest coordinated deployment of the Aria glasses in the academic research community, with partners at 12 different sites using them.

In releasing this resource of unprecedented scale and variety, the consortium aims to supercharge the research community on core AI challenges in video learning. As this line of research advances, we envision a future where AI enables new ways for people to learn new skills in augmented reality and mixed reality (AR/MR), where how-to videos come to life in front of the user, and the system acts as a virtual coach to guide them through a new procedure and offer advice on how to improve. Similarly, we hope it will enable robots of the future that gain insight about complex dexterous manipulations by watching skilled human experts in action. Ego-Exo4D is a critical stepping stone to enable this future, and we can’t wait to see what the research community creates with it.




Visit the Ego-Exo4D website

Read the paper

Learn more about Project Aria Research Kit





 

bnew

AI Robot Outmaneuvers Humans in Maze Run Breakthrough

  • Robot learned in record time to guide a ball through a maze
  • The AI robot used two knobs to manipulate playing surface

[Image: The Labyrinth game. Source: ETH Zurich]

By Saritha Rai

December 19, 2023 at 7:30 AM EST

Computers have famously beaten humans at poker, Go and chess. Now they can learn the physical skills to excel at basic games of dexterity.

Researchers at ETH Zurich have created an AI robot called CyberRunner they say surpassed humans at the popular game Labyrinth. It navigated a small metal ball through a maze by tilting its surface, avoiding holes across the board, mastering the toy in just six hours, they said.

CyberRunner marked one of the first instances in which an AI beat humans at direct physical applications, said Raffaello D’Andrea and Thomas Bi, researchers at the prominent European institution. In experiments, their robot used two knobs to manipulate the playing surface, requiring fine motor skills and spatial reasoning. The game itself required real-time strategic thinking, quick decisions and precise action.

The duo shared their work in an academic paper published on Tuesday. They built their model on recent advances in a field called model-based reinforcement learning, a type of machine learning where the AI learns how to behave in a dynamic environment by trial and error.
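
The paper's specifics are not reproduced here, but the general model-based RL recipe the article refers to (collect real experience, fit a dynamics model, then improve the policy against that learned model) can be sketched schematically. All function names below are placeholders, not the CyberRunner code.

```python
# Schematic model-based RL loop (placeholder functions, not the CyberRunner code).

def collect_experience(policy, env, steps):
    """Placeholder: run the policy on the real system, return (obs, action, next_obs, reward) tuples."""
    raise NotImplementedError

def fit_dynamics_model(model, transitions):
    """Placeholder: regress next_obs (and reward) from (obs, action)."""
    raise NotImplementedError

def improve_policy_in_model(policy, model, horizon):
    """Placeholder: optimize the policy on imagined rollouts from the learned model."""
    raise NotImplementedError

def model_based_rl(policy, model, env, outer_iters=50):
    transitions = []
    for _ in range(outer_iters):
        transitions += collect_experience(policy, env, steps=1000)  # real-world data is scarce
        fit_dynamics_model(model, transitions)                      # learn how the maze responds to tilts
        improve_policy_in_model(policy, model, horizon=15)          # cheap practice inside the model
    return policy
```

Learning a model of the board first is what lets the system rack up far more "practice" than the six hours of real play would otherwise allow.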

“We are putting our work on an open-source platform to show it’s possible, sharing the details of how it’s done, and making it inexpensive to continue the work,” said D’Andrea, who co-founded Kiva Systems before selling it to Amazon.com Inc. “There will be thousands of these AI systems soon doing collaborative experiments, communicating and sharing best practices.”




[Image: Raffaello D’Andrea. Source: ETH Zurich]

Industrial robots have performed repetitive, precise manufacturing tasks for decades, but adjustments on-the-fly such as the ones CyberRunner demonstrated are next-level, the researchers said. The system can think, learn and self-develop on physical tasks, previously thought achievable only through human intelligence.

CyberRunner learns through experience, using a camera looking down at the labyrinth. During the process, it discovered surprising ways to “cheat” by skipping parts of the maze, and the researchers had to step in and explicitly instruct it not to take shortcuts.

The duo’s open-source project is now available on their website. For $200, it lets users coordinate large-scale experiments using the CyberRunner platform.

“This is not a bespoke platform that costs a lot of money,” D’Andrea said. “The exciting thing is that we are doing it on a platform that’s open to everyone, and costs almost nothing to further advance the work.”
 

bnew

Google outlines new methods for training robots with video and large language models

Brian Heater @bheater / 2:45 PM EST • January 4, 2024

[Image: A Google DeepMind robot arm. Image Credits: Google DeepMind Robotics]


2024 is going to be a huge year for the cross-section of generative AI/large foundational models and robotics. There’s a lot of excitement swirling around the potential for various applications, ranging from learning to product design. Google’s DeepMind Robotics researchers are one of a number of teams exploring the space’s potential. In a blog post today, the team is highlighting ongoing research designed to give robots a better understanding of precisely what it is we humans want out of them.

Traditionally, robots have focused on doing a singular task repeatedly over the course of their lives. Single-purpose robots tend to be very good at that one thing, but even they run into difficulty when changes or errors are unintentionally introduced to the proceedings.

The newly announced AutoRT is designed to harness large foundational models to a number of different ends. In a standard example given by the DeepMind team, the system begins by leveraging a Visual Language Model (VLM) for better situational awareness. AutoRT is capable of managing a fleet of robots working in tandem, each equipped with cameras to get a layout of their environment and the objects within it.

A large language model, meanwhile, suggests tasks that can be accomplished by the hardware, including its end effector. LLMs are understood by many to be the key to unlocking robotics that effectively understand more natural language commands, reducing the need for hard-coding skills.
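
A rough sketch of that VLM-describes, LLM-proposes pattern is shown below, with placeholder model calls. This is an illustration of the workflow as described in the post, not Google's AutoRT implementation.

```python
# Sketch of the pattern described above: a VLM summarizes the scene, an LLM proposes
# candidate tasks, and unsafe or infeasible ones are filtered out before execution.
# `vlm_describe`, `llm_propose_tasks`, and `is_safe_and_feasible` are placeholders.

def vlm_describe(camera_image) -> str:
    """Placeholder: return a text description of the objects visible to the robot."""
    raise NotImplementedError

def llm_propose_tasks(scene_description: str, robot_capabilities: str) -> list[str]:
    """Placeholder: return candidate task strings the hardware could attempt."""
    raise NotImplementedError

def is_safe_and_feasible(task: str) -> bool:
    """Placeholder: apply safety and feasibility checks before execution."""
    raise NotImplementedError

def propose_tasks_for_robot(camera_image, robot_capabilities="single arm with parallel gripper"):
    scene = vlm_describe(camera_image)
    candidates = llm_propose_tasks(scene, robot_capabilities)
    return [task for task in candidates if is_safe_and_feasible(task)]
```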

The system has already been tested quite a bit over the past seven or so months. AutoRT is capable of orchestrating up to 20 robots at once and a total of 52 different devices. All told, DeepMind has collected some 77,000 trials, including more than 6,000 tasks.

Also new from the team is RT-Trajectory, which leverages video input for robotic learning. Plenty of teams are exploring the use of YouTube videos as a method to train robots at scale, but RT-Trajectory adds an interesting layer, overlaying a two-dimensional sketch of the arm in action over the video.

The team notes, “these trajectories, in the form of RGB images, provide low-level, practical visual hints to the model as it learns its robot-control policies.”
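
As a rough illustration of that overlay idea, the sketch below draws a 2D end-effector trajectory onto a frame as an RGB hint using OpenCV. The coordinates, colors, and blank frame are invented for the example; this is not DeepMind's pipeline.

```python
import cv2
import numpy as np

def overlay_trajectory(frame: np.ndarray, trajectory_xy: list) -> np.ndarray:
    """Draw a 2D gripper trajectory onto an RGB frame as a visual conditioning hint.
    trajectory_xy is a list of (x, y) pixel coordinates; styling choices are arbitrary."""
    hint = frame.copy()
    pts = np.array(trajectory_xy, dtype=np.int32).reshape(-1, 1, 2)
    cv2.polylines(hint, [pts], isClosed=False, color=(0, 255, 0), thickness=2)
    cv2.circle(hint, trajectory_xy[-1], radius=5, color=(0, 0, 255), thickness=-1)  # mark the endpoint
    return hint

# Usage sketch with an invented trajectory on a blank 256x256 frame.
frame = np.zeros((256, 256, 3), dtype=np.uint8)
hint_image = overlay_trajectory(frame, [(40, 200), (80, 150), (140, 120), (190, 90)])
```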

DeepMind says the training had double the success rate of its RT-2 training, at 63% compared to 29%, while testing 41 tasks.

“RT-Trajectory makes use of the rich robotic-motion information that is present in all robot datasets, but currently under-utilized,” the team notes. “RT-Trajectory not only represents another step along the road to building robots able to move with efficient accuracy in novel situations, but also unlocking knowledge from existing datasets.”
 

bnew

Humanoid robot acts out prompts like it's playing charades​


A large language model can translate written instructions into code for a robot’s movement, enabling it to perform a wide range of human-like actions

By Alex Wilkins
4 January 2024




A humanoid robot that can perform actions based on text prompts could pave the way for machines that behave more like us and communicate using gestures.


Large language models (LLMs) like GPT-4, the artificial intelligence behind ChatGPT, are proficient at writing many kinds of computer code, but they can struggle when it comes to doing this for robot movement. This is because almost every robot has very different physical forms and software to control its parts. Much of the code for this isn’t on the internet, and so isn’t in the training data that LLMs learn from.


Takashi Ikegami at the University of Tokyo in Japan and his colleagues suspected that humanoid robots might be easier for LLMs to control with code, because of their similarity to the human body. So they used GPT-4 to control a robot they had built, called Alter3, which has 43 different moving parts in its head, body and arms controlled by air pistons.

[Image: The Alter3 robot gestures in response to the prompt “I was enjoying a movie while eating popcorn at the theatre when I realised that I was actually eating the popcorn of the person next to me”. Credit: Takahide Yoshida et al]

Ikegami and his team gave two prompts to GPT-4 to get the robot to carry out a particular movement. The first asks the LLM to translate the request into a list of concrete actions that the robot will have to perform to make the movement. A second prompt then asks the LLM to transform each item on the list into the Python programming language, with details of how the code maps to Alter3’s body parts.
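
A sketch of that two-prompt pipeline is shown below, with placeholder helpers standing in for the GPT-4 call and for whatever interface the Alter3 android actually exposes (neither is given in the article).

```python
# Two-prompt pipeline as described above, with placeholder helpers.
# `ask_gpt4` and `execute_on_alter3` are stand-ins; the real Alter3 API is not shown here.

def ask_gpt4(prompt: str) -> str:
    """Placeholder: return the LLM's text completion."""
    raise NotImplementedError

def execute_on_alter3(python_code: str) -> None:
    """Placeholder: run generated code that sets Alter3's 43 actuator targets."""
    raise NotImplementedError

def act_out(request: str) -> None:
    # Prompt 1: decompose the request into concrete physical actions.
    steps = ask_gpt4(
        f"List the concrete body movements Alter3 should perform to: {request}"
    ).splitlines()
    # Prompt 2: turn each step into Python that maps onto Alter3's body parts.
    for step in steps:
        code = ask_gpt4(
            f"Write Python for the Alter3 android (43 axes: head, body, arms) that performs: {step}"
        )
        execute_on_alter3(code)

# Example request from the article:
# act_out("pretend to be a snake")
```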

They found that the system could come up with convincing actions for a wide range of requests, including simple ones like “pretend to be a snake” and “take a selfie with your phone”. It also got the robot to act out more complex scenes, such as “I was enjoying a movie while eating popcorn at the theatre when I realised that I was actually eating the popcorn of the person next to me”.

“This android can have more complicated and sophisticated facial and non-linguistic expressions so that the people can really understand, and be more empathic with, the android,” says Ikegami.

Though the requests currently take a minute or two to convert into code and move the robot, Ikegami hopes human-like motions might make our future interactions with robots more meaningful.

The study is an impressive technical display and can help open up robots to people who don’t know how to code, says Angelo Cangelosi at the University of Manchester, UK. But it doesn’t help robots gain more human-like intelligence because it relies on an LLM, which is a knowledge system very unlike the human brain, he says.

Reference:
arXiv DOI: 10.48550/arXiv.2312.06571

 

bnew

WILL KNIGHT

BUSINESS

JAN 11, 2024 12:00 PM


Toyota's Robots Are Learning to Do Housework—By Copying Humans​

Carmaker Toyota is developing robots capable of learning to do household chores by observing how humans take on the tasks. The project is an example of robotics getting a boost from generative AI.

[Image: Will Knight teleoperating robotic arms holding a small broom and dustpan at the Toyota Research Institute in Cambridge, Massachusetts, while another person watches. Courtesy of Toyota Research Institute]

As someone who quite enjoys the Zen of tidying up, I was only too happy to grab a dustpan and brush and sweep up some beans spilled on a tabletop while visiting the Toyota Research Lab in Cambridge, Massachusetts last year. The chore was more challenging than usual because I had to do it using a teleoperated pair of robotic arms with two-fingered pincers for hands.


[Video: Courtesy of Toyota Research Institute]

As I sat before the table, using a pair of controllers like bike handles with extra buttons and levers, I could feel the sensation of grabbing solid items, and also sense their heft as I lifted them, but it still took some getting used to.

After several minutes tidying, I continued my tour of the lab and forgot about my brief stint as a teacher of robots. A few days later, Toyota sent me a video of the robot I’d operated sweeping up a similar mess on its own, using what it had learned from my demonstrations combined with a few more demos and several more hours of practice sweeping inside a simulated world.


[Video: Autonomous sweeping behavior. Courtesy of Toyota Research Institute]

Most robots—and especially those doing valuable labor in warehouses or factories—can only follow preprogrammed routines that require technical expertise to plan out. This makes them very precise and reliable but wholly unsuited to handling work that requires adaptation, improvisation, and flexibility—like sweeping or most other chores in the home. Having robots learn to do things for themselves has proven challenging because of the complexity and variability of the physical world and human environments, and the difficulty of obtaining enough training data to teach them to cope with all eventualities.

There are signs that this could be changing. The dramatic improvements we’ve seen in AI chatbots over the past year or so have prompted many roboticists to wonder if similar leaps might be attainable in their own field. The algorithms that have given us impressive chatbots and image generators are also already helping robots learn more efficiently.

The sweeping robot I trained uses a machine-learning system called a diffusion policy, similar to the ones that power some AI image generators, to come up with the right action to take next in a fraction of a second, based on the many possibilities and multiple sources of data. The technique was developed by Toyota in collaboration with researchers led by Shuran Song, a professor at Columbia University who now leads a robot lab at Stanford.
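
As a schematic of what diffusion-policy inference involves (starting from noise and iteratively refining an action sequence conditioned on the current observation), here is a toy sampler. The denoiser is a placeholder and the update rule is deliberately simplified; this is not Toyota's code or the original diffusion-policy implementation.

```python
import torch

def sample_actions(denoiser, observation: torch.Tensor, horizon: int = 16,
                   action_dim: int = 7, steps: int = 50) -> torch.Tensor:
    """Toy diffusion-policy sampler: refine a noisy action sequence into a coherent one,
    conditioned on the current observation. `denoiser` is a trained network (placeholder)
    that predicts the noise present in the sequence at a given noise level."""
    actions = torch.randn(horizon, action_dim)            # start from pure noise
    for t in reversed(range(steps)):
        noise_level = torch.tensor(t / steps)
        predicted_noise = denoiser(actions, observation, noise_level)
        actions = actions - predicted_noise / steps       # crude denoising step, for illustration only
    return actions                                        # e.g. a short sequence of end-effector targets
```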

Toyota is trying to combine that approach with the kind of language models that underpin ChatGPT and its rivals. The goal is to make it possible to have robots learn how to perform tasks by watching videos, potentially turning resources like YouTube into powerful robot training resources. Presumably they will be shown clips of people doing sensible things, not the dubious or dangerous stunts often found on social media.

“If you've never touched anything in the real world, it's hard to get that understanding from just watching YouTube videos,” Russ Tedrake, vice president of Robotics Research at Toyota Research Institute and a professor at MIT, says. The hope, Tedrake says, is that some basic understanding of the physical world combined with data generated in simulation, will enable robots to learn physical actions from watching YouTube clips. The diffusion approach “is able to absorb the data in a much more scalable way,” he says.


Toyota announced its Cambridge robotics institute back in 2015 along with a second institute and headquarters in Palo Alto, California. In its home country of Japan—as in the US and other rich nations—the population is aging fast. The company hopes to build robots that can help people continue living independent lives as they age.

The lab in Cambridge has dozens of robots working away on chores including peeling vegetables, using hand mixers, preparing snacks, and flipping pancakes. Language models are proving helpful because they contain information about the physical world, helping the robots make sense of the objects in front of them and how they can be used.

It’s important to note that despite many demos slick enough to impress a casual visitor, the robots still make lots of errors. Like earlier versions of the model behind ChatGPT, they can veer between seeming humanlike and making strange errors. I saw one robot effortlessly operating a manual hand mixer and another struggling to grasp a bottletop.

Toyota is not the only big tech company hoping to use language models to advance robotics research. Last week, for example, a team at Google DeepMind revealed AutoRT, software that uses a large language model to help robots determine which tasks they could realistically—and safely—do in the real world.

Progress is also being made on the hardware needed to advance robot learning. Last week a group at Stanford University led by Chelsea Finn posted videos of a low-cost mobile teleoperated robotics system called ALOHA. They say that its mobility allows the robot to tackle a wider range of tasks, giving it a broader set of experiences to learn from than a system locked in one place.

And while it’s easy to be dazzled by robot demo videos, the ALOHA team was good enough to post a highlight reel of failure modes showing the robot fumbling, breaking, and spilling things. Hopefully another robot will learn how to clean up after it.
 

bnew






Overview​

Human beings possess the capability to multiply a mélange of multisensory cues while actively exploring and interacting with the 3D world. Current multi-modal large language models, however, passively absorb sensory data as inputs, lacking the capacity to actively interact with the objects in the 3D environment and dynamically collect their multisensory information. To usher in the study of this area, we propose MultiPLY, a multisensory embodied large language model that could incorporate multisensory interactive data, including visual, audio, tactile, and thermal information into large language models, thereby establishing the correlation among words, actions, and perceptions. MultiPLY can perform a diverse set of multisensory embodied tasks, including multisensory question answering, embodied question answering, task decomposition, object retrieval, and tool use.
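
As a loose illustration of the general idea of folding several sensory streams into a token sequence a language model can consume, here is a toy encoder. The modality dimensions, encoder choices, and token scheme are invented for the sketch and are not the MultiPLY architecture.

```python
import torch
import torch.nn as nn

# Invented illustration of projecting multisensory readings into one token sequence
# for a language model; dimensions and encoders are not the MultiPLY architecture.
D_MODEL = 512

encoders = nn.ModuleDict({
    "visual":  nn.Linear(768, D_MODEL),   # e.g. a pooled image feature
    "audio":   nn.Linear(128, D_MODEL),
    "tactile": nn.Linear(16, D_MODEL),
    "thermal": nn.Linear(1, D_MODEL),
})

def sensory_tokens(readings: dict) -> torch.Tensor:
    """readings maps modality name -> feature tensor; returns a (num_tokens, D_MODEL)
    sequence that could be interleaved with word embeddings."""
    tokens = [encoders[name](feat) for name, feat in readings.items()]
    return torch.stack(tokens)

example = {
    "visual":  torch.randn(768),
    "audio":   torch.randn(128),
    "tactile": torch.randn(16),
    "thermal": torch.tensor([36.5]),
}
tokens = sensory_tokens(example)  # shape (4, 512)
```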
 

bnew




little by little it'll all come together. this is the sort of tech that'll be implemented with robot vision.
 