Robots That Understand You
MIT’s “Relevance” System Helps Robots Intuitively Assist Humans in Real-Time Scenes

For a robot, the real world is a lot to take in. Making sense of every data point in a scene can take a huge amount of computational effort and time, and using that information to then decide how best to assist a human is an even harder problem. Now, MIT roboticists have developed a method that cuts through the data noise to help robots focus on the features of a scene that are most relevant for assisting humans. With their aptly named "Relevance" approach, a robot uses cues in a scene, such as audio and visual information, to determine a human's objective and then quickly identify the objects that are most likely to be relevant in fulfilling that objective. The robot then carries out a set of maneuvers to safely offer the relevant objects or actions to the human. The paper is available on the arXiv preprint server.
The researchers demonstrated the approach in an experiment that simulated a conference breakfast buffet. They set up a table with various fruits, drinks, snacks, and tableware, along with a robotic arm outfitted with a microphone and camera. Applying the new Relevance method, they showed that the robot was able to correctly identify a human's objective and appropriately assist them in different scenarios.
In one instance, when the robot saw a person reaching for a can of prepared coffee, it promptly offered them milk and a stir stick. In another scenario, the robot picked up on a conversation between two people talking about coffee, and offered them a can of coffee and creamer.
Overall, the robot was able to identify relevant objects with 96 percent accuracy and predict a human's goal with 90 percent accuracy. Compared with performing the same tasks without the new method, Relevance also improved the robot's safety, reducing the number of collisions by more than 60 percent.
"This approach of enabling relevance could make it much easier for a robot to interact with humans," says Kamal Youcef-Toumi, professor of mechanical engineering at MIT. "A robot would not need to ask a person as many questions about what they require. It would simply actively gather information from the scene in order to determine how to assist."
The team led by Youcef-Toumi is looking into how Relevance-programmed robots can help in smart manufacturing and warehouse settings, where they envision robots working alongside humans and intuitively assisting them. Youcef-Toumi and graduate students Xiaotong Zhang and Dingcheng Huang will present the new method at the IEEE International Conference on Robotics and Automation (ICRA 2025) in May. The work builds on another paper presented at ICRA the previous year.
Finding focus
The team's approach is inspired by our own ability to gauge what's relevant in daily life. Humans can filter out distractions and focus on what's important, thanks to a region of the brain known as the Reticular Activating System (RAS). The RAS is a bundle of neurons in the brainstem that acts subconsciously to prune away unnecessary stimuli, so that a person can consciously perceive the relevant stimuli.
The RAS helps to prevent sensory overload, keeping us, for example, from fixating on every single item on a kitchen counter, and instead helping us to focus on pouring a cup of coffee.
"The amazing thing is that these groups of neurons filter everything that is not important, and then it has the brain focus on what is relevant at the time," Youcef-Toumi explains. "That pretty much sums up our proposition."
He and his team developed a robotic system that broadly mimics the RAS's ability to selectively process and filter information. The approach consists of four main stages. The first is a "watch-and-learn" perception stage, in which an AI "toolkit" continuously receives audio and visual cues, for instance from a robot's microphone and camera. This toolkit might include a large language model (LLM) that processes audio conversations to identify keywords and phrases, along with various algorithms that detect and classify objects, humans, physical actions, and task objectives. The AI toolkit is designed to run continuously in the background, similar to the subconscious filtering performed by the brain's RAS.
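As a concrete illustration of that first, always-on perception stage, here is a minimal Python sketch, assuming a simple keyword matcher as a stand-in for the LLM and a prelabeled detection list as a stand-in for the vision models; all names and interfaces here are illustrative, not the authors' actual implementation.
```python
from dataclasses import dataclass, field

# Stand-in for the LLM keyword spotter: plain keyword matching over a transcript.
TASK_KEYWORDS = {"coffee", "cereal", "pancakes", "eggs"}

def spot_keywords(transcript: str) -> set:
    """Pretend-LLM: return task-related keywords 'heard' in the audio."""
    return {w.strip(".,?!") for w in transcript.lower().split()
            if w.strip(".,?!") in TASK_KEYWORDS}

@dataclass
class SceneState:
    """Rolling summary of audio/visual cues gathered in the background."""
    keywords: set = field(default_factory=set)
    objects: list = field(default_factory=list)
    human_present: bool = False

def perceive(detections: list, transcript: str, state: SceneState) -> SceneState:
    """One background perception step, run continuously like the RAS."""
    state.keywords |= spot_keywords(transcript)
    state.objects = [d for d in detections if d != "person"]
    state.human_present = "person" in detections
    return state

state = perceive(["cup", "creamer", "person"], "Want some coffee?", SceneState())
print(state.keywords, state.human_present)  # {'coffee'} True
```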
The second stage is a "trigger check" phase, a periodic check the system performs to assess whether anything important is happening, such as a human entering the environment. If a human has stepped in, the system's third phase kicks in. This phase is the heart of the team's system: it determines which features in the environment are most likely relevant to assisting the human.
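A trigger check of this kind could be as simple as a cheap polling loop that wakes the heavier relevance machinery only when needed. The sketch below is an assumption about how such a loop might be structured; the polling interval and function names are hypothetical.
```python
import time

TRIGGER_PERIOD_S = 1.0  # assumed polling interval, not from the paper

def run_trigger_loop(get_state, determine_relevance):
    """Periodically check for a trigger (here: a human entering the scene)."""
    while True:
        state = get_state()             # latest output of the perception stage
        if state.human_present:         # the trigger condition
            determine_relevance(state)  # hand off to the third (relevance) phase
        time.sleep(TRIGGER_PERIOD_S)    # stay cheap between checks
```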
To determine relevance, the researchers developed an algorithm that takes into account the AI toolkit's real-time predictions. For instance, the toolkit's LLM may pick up the keyword "coffee," and an action-classifying algorithm may label a person reaching for a cup as having the objective of "making coffee."
The team's Relevance method would factor in this information to first determine the "class" of objects that have the highest probability of being relevant to the objective of "making coffee." This might automatically filter out classes such as "fruits" and "snacks," in favor of "cups" and "creamers."
The algorithm would then further filter within the relevant classes to determine the most relevant "elements." For instance, based on visual cues in the environment, the system may label a cup closest to a person as more relevant, and more helpful, than a cup that is farther away.
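One way to picture this two-stage filtering is the toy version below: a lookup table maps an inferred objective to plausible object classes, and the remaining elements are ranked by distance to the person. Both the lookup table and the distance-based scoring are illustrative assumptions, not the paper's actual relevance model.
```python
import math

# Assumed objective-to-class lookup; the real system derives this from its AI toolkit.
CLASSES_FOR_OBJECTIVE = {
    "making coffee": {"cups", "creamers", "stir sticks"},
}

def relevant_elements(objective, detections, person_xy):
    """detections: list of (class_name, element_id, (x, y)) tuples."""
    allowed = CLASSES_FOR_OBJECTIVE.get(objective, set())
    # Stage one: filter out whole classes (e.g. "fruits", "snacks").
    candidates = [d for d in detections if d[0] in allowed]
    # Stage two: rank elements within the relevant classes; here, closer
    # to the person means more relevant.
    return sorted(candidates, key=lambda d: math.dist(d[2], person_xy))

dets = [("cups", "cup_near", (0.3, 0.1)),
        ("cups", "cup_far", (2.0, 1.5)),
        ("fruits", "apple", (0.2, 0.2))]
print(relevant_elements("making coffee", dets, (0.0, 0.0)))
# cup_near ranks first; the apple is filtered out with its whole class.
```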
In the fourth and final phase, the robot would then take the identified relevant objects and plan a path to physically access and offer the objects to the human.
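Put together, this final phase might look like the following skeleton, where a stub robot stands in for a real motion planner; every method here is a hypothetical placeholder rather than the system's actual control interface, and a real planner would handle collision avoidance explicitly.
```python
class StubRobot:
    """Placeholder for a real arm; plan_path/execute/grasp/release are mock-ups."""
    def plan_path(self, target_xy, avoid=()):
        return [target_xy]               # a real planner returns a collision-free path
    def execute(self, path):
        print("moving to", path[-1])
    def grasp(self, obj_id):
        print("grasping", obj_id)
    def release(self, obj_id):
        print("releasing", obj_id)

def offer_object(robot, ranked, person_xy):
    """Fetch the top-ranked relevant element and present it to the person."""
    _cls, obj_id, obj_xy = ranked[0]     # most relevant element first
    robot.execute(robot.plan_path(obj_xy, avoid=("person",)))
    robot.grasp(obj_id)
    robot.execute(robot.plan_path(person_xy, avoid=("person",)))
    robot.release(obj_id)

offer_object(StubRobot(), [("cups", "cup_near", (0.3, 0.1))], (0.0, 0.0))
```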
Helper mode
The researchers put the new system through its paces in simulations of conference breakfast buffets. They chose this scenario based on the Breakfast Actions Dataset, which contains videos and images of typical activities people perform during breakfast, such as preparing coffee, cooking pancakes, making cereal, and frying eggs. The actions in each video and image are labeled, along with the overall goal (making coffee versus frying eggs, for instance). Using this dataset, the team tested the various algorithms in their AI toolkit, such that, when observing the actions of a person in a new scene, the algorithms could accurately label and classify the human's tasks and objectives, and the associated relevant objects.
In their experiments, they set up a robotic arm and gripper and instructed the system to assist humans as they approached a table filled with various drinks, snacks, and tableware. They found that when no humans were present, the robot's AI toolkit operated continuously in the background, labeling and classifying objects on the table.
When, during a trigger check, the robot detected a human, it snapped to attention, turning on its Relevance phase and quickly identifying the objects in the scene most likely to be relevant to the human's objective, as determined by the AI toolkit.
"Relevance can guide the robot to generate seamless, intelligent, safe, and efficient assistance in a highly dynamic environment," says co-author Zhang.
The team plans to use the system in scenarios that are similar to those found in warehouses and offices, as well as for other tasks and goals that are typically carried out in homes.