As AI agents come closer to taking real actions on our behalf (messaging someone, buying something, toggling account settings, etc.), a new study co-authored by Apple looks into how well these systems really understand the consequences of their actions. Here’s what they found out.

Presented recently at the ACM Conference on Intelligent User Interfaces in Italy, the paper From Interaction to Impact: Towards Safer AI Agents Through Understanding and Evaluating Mobile UI Operation Impacts introduces a detailed framework for understanding what can happen when an AI agent interacts with a mobile UI.
What is interesting about this study is that it doesn’t just explore whether agents can tap the right button, but whether they can anticipate the consequences of what may happen after they tap it, and whether they should proceed.

Classifying risky interactions

The premise of the study is that most datasets for training UI agents today are composed of relatively harmless stuff: browsing a feed, opening an app, scrolling through options. So the study set out to go a few steps further.
In the study, participants were recruited to use real mobile apps and record actions they would feel uncomfortable having an AI trigger without their permission: things like sending messages, changing passwords, editing profile details, or making financial transactions.

These actions were then labeled using a newly developed framework that considers not just the immediate impact on the interface, but also factors like:

- User Intent: What is the user trying to accomplish? Is it informational, transactional, communicative, or just basic navigation?
- Impact on the UI: Does the action change how the interface looks, what it shows, or where it takes you?
- Impact on the User: Could it affect the user’s privacy, data, behavior, or digital assets?
- Reversibility: If something goes wrong, can it be undone easily? Or at all?
- Frequency: Is this something that’s typically done once in a while, or over and over again?

The result was a framework that helps researchers evaluate whether models consider questions like “Can this be undone in one tap?”, “Does it alert someone else?”, and “Does it leave a trace?”, and take that into account before acting on the user’s behalf.
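To make the dimensions above concrete, here is a minimal sketch of how such a labeling scheme might be represented in code. All names (`UIActionImpact`, `needs_confirmation`, the example policy) are hypothetical illustrations, not the paper’s actual schema or labels.

```python
from dataclasses import dataclass
from enum import Enum

class Reversibility(Enum):
    ONE_TAP = "reversible in one tap"
    MULTI_STEP = "reversible with effort"
    IRREVERSIBLE = "cannot be undone"

@dataclass
class UIActionImpact:
    """Hypothetical record covering the dimensions the framework describes."""
    description: str
    user_intent: str          # e.g. "informational", "transactional", "communicative"
    changes_ui: bool          # does the action alter what the interface shows?
    affects_user: bool        # touches privacy, data, behavior, or digital assets
    reversibility: Reversibility
    is_frequent: bool         # routine action vs. one done rarely

    def needs_confirmation(self) -> bool:
        # An example conservative policy: ask the user before any action that
        # affects them and cannot be undone with a single tap.
        return self.affects_user and self.reversibility is not Reversibility.ONE_TAP

send_message = UIActionImpact(
    description="Send a text message",
    user_intent="communicative",
    changes_ui=True,
    affects_user=True,           # the recipient sees it; it leaves a trace
    reversibility=Reversibility.IRREVERSIBLE,
    is_frequent=True,
)
print(send_message.needs_confirmation())  # True: irreversible and user-affecting
```

The point of structuring labels this way is that a simple policy (like `needs_confirmation` above) can be checked before the agent acts, rather than after something goes wrong.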
Testing the AI’s judgment

Once the dataset was built, the team ran it through five large language models, including GPT-4, Google Gemini, and Apple’s own Ferret-UI, to see how well they could classify the impact of each action.

The result? Google Gemini performed best in so-called zero-shot tests (56% accuracy), which measure how well an AI can handle tasks it wasn’t explicitly trained on. Meanwhile, GPT-4’s multimodal version led the pack (58% accuracy) in evaluating impact when prompted to reason step by step using chain-of-thought techniques.
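For readers unfamiliar with the two prompting styles mentioned, here is a rough sketch of the difference. The wording is illustrative only, not the prompts used in the paper.

```python
# Illustrative comparison of zero-shot vs. chain-of-thought prompting
# (hypothetical wording, not the study's actual prompts).
action = "Tap 'Confirm purchase' on a checkout screen"

# Zero-shot: ask for the label directly, with no worked examples or reasoning.
zero_shot = (
    f"Classify the impact of this mobile UI action: {action}\n"
    "Answer with one label: low, medium, or high impact."
)

# Chain-of-thought: prompt the model to reason step by step before answering.
chain_of_thought = (
    f"Classify the impact of this mobile UI action: {action}\n"
    "Think step by step: What is the user's intent? What changes on screen? "
    "Could it affect the user's money, data, or privacy? Is it reversible?\n"
    "Then answer with one label: low, medium, or high impact."
)

print(zero_shot)
print(chain_of_thought)
```

The study’s finding that GPT-4’s multimodal version did best with the second style suggests that walking through the impact dimensions explicitly helps models judge consequences.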
9to5Mac’s take

As voice assistants and agents get better at following natural language commands (“Book me a flight,” “Cancel that subscription,” etc.), the real safety challenge is building an agent that knows when to ask for confirmation, or even when not to act at all. This study doesn’t solve that yet, but it proposes a measurable benchmark for testing how well models understand the stakes of their actions.

And while there’s plenty of research on alignment, the broader field of AI safety concerned with making sure agents do what humans actually want, Apple’s research adds a new dimension.
It examines how good AI agents are at anticipating the results of their actions, and what they do with that information before they act.