Evaluating large language model agents for automation of atomic force microscopy – Nature Communications
A new study puts an AI-driven laboratory assistant, AILA, through its paces on a real atomic force microscope (AFM). Using a bespoke benchmark called AFMBench, the team evaluates how well large language model (LLM) agents can design experiments, coordinate tools, make decisions, run open-ended procedures, and analyze data. Alongside a rigorous head-to-head of leading LLMs, the researchers also deploy AILA in five real experiments, from graphene analysis to friction measurements, revealing both powerful capabilities and safety-critical failure modes that must be addressed as autonomous labs scale up.
How AILA works
AILA is built as a modular, multi-agent system steered by an LLM planner that parses user requests and routes tasks to specialized agents. Two agents anchor AFM operations: the AFM Handler Agent (AFM-HA) controls the instrument via a Python API, with access to vendor documentation and a code execution engine; the Data Handler Agent (DHA) covers image optimization and analysis through tools such as an Image Optimizer (for PID tuning) and an Image Analyzer (for feature extraction). Agent-to-agent handoffs use “NEED HELP” to escalate and “FINAL ANSWER” to terminate, enabling dynamic routing across tools and tasks.
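The handoff convention is easiest to see as a small routing loop. The sketch below is a minimal illustration, not AILA’s actual code: the `planner`, `afm_handler`, and `data_handler` callables and their signatures are assumptions, and only the “NEED HELP” / “FINAL ANSWER” markers come from the paper’s description.

```python
# Minimal sketch of planner-driven agent routing with handoff markers.
# The agent callables and their signatures are hypothetical; only the
# "NEED HELP" / "FINAL ANSWER" conventions follow the AILA description.

def run_task(user_request, planner, agents, max_turns=20):
    """Route a request between specialized agents until one terminates."""
    current = planner(user_request)       # e.g. "afm_handler" or "data_handler"
    context = [("user", user_request)]

    for _ in range(max_turns):
        reply = agents[current](context)  # agent returns its message text
        context.append((current, reply))

        if "FINAL ANSWER" in reply:       # task complete: stop routing
            return context
        if "NEED HELP" in reply:          # escalate: let planner pick another agent
            current = planner(context)

    raise RuntimeError("Task did not terminate within the turn budget")
```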
From prompt to probe
AILA translates natural language into executable AFM workflows: selecting a cantilever, setting parameters, approaching the surface, scanning, saving data, and then analyzing results—each step scripted and executed in real time via API. In a representative run on highly oriented pyrolytic graphite (HOPG), AILA first captured images and then computed friction and roughness, cleanly coordinating AFM-HA and DHA.
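A generated workflow of this kind might have the following shape. All of the API names here (`select_cantilever`, `approach`, `scan`, and so on) are placeholders standing in for the vendor’s Python interface, which the summary does not reproduce; only the ordering of steps mirrors the description above.

```python
# Illustrative shape of an AILA-generated AFM run; all API names are placeholders,
# not the actual vendor interface used in the study.

def run_hopg_experiment(afm, analyzer):
    afm.select_cantilever("contact-mode")            # 1. choose the probe
    afm.set_scan_parameters(size_um=5, lines=256,    # 2. configure the scan
                            setpoint_v=0.5, pid=(1.0, 0.5, 0.0))
    afm.approach()                                   # 3. bring the tip to the surface
    image = afm.scan()                               # 4. acquire topography/friction data
    afm.save(image, "hopg_scan_001")                 # 5. persist the raw data

    # 6. hand the results to the data-handling side for analysis
    return {
        "friction": analyzer.friction(image),
        "roughness": analyzer.roughness(image),
    }
```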
AFMBench: real hardware, real constraints
AFMBench comprises 100 expert-designed tasks that must run on physical AFM hardware—unlike simulation-heavy LLM tests—introducing timing and variability constraints. The dataset stresses multi-tool and multi-agent behaviors: 69% of tasks require multi-tool integration, while 83% can be handled by a single agent. Complexity spans basic (56%) and advanced (44%) operations. Tasks cut across documentation lookups, analysis, and calculations, often blending these demands within a single prompt to mirror how expert microscopists work.
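One way to picture the benchmark is as a collection of small task records. The schema below is inferred from the category breakdown in the text and is not the dataset’s actual format.

```python
# Hypothetical task record reflecting the AFMBench categories described above;
# the real dataset's schema is not published in this summary.
from dataclasses import dataclass

@dataclass
class AFMBenchTask:
    prompt: str                    # natural-language instruction given to the agent
    categories: tuple[str, ...]    # e.g. ("documentation", "calculation") for cross-domain tasks
    complexity: str                # "basic" or "advanced"
    multi_tool: bool               # True for the ~69% needing multi-tool integration
    multi_agent: bool              # False for the ~83% a single agent can handle

example = AFMBenchTask(
    prompt="Scan a 5 um region of HOPG and report the RMS roughness.",
    categories=("analysis", "calculation"),
    complexity="basic",
    multi_tool=True,
    multi_agent=False,
)
```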
Which LLMs perform best?
On AFMBench, GPT-4o leads. It excels at documentation-heavy tasks (88.3% success) and shows solid performance in analysis (33.3%) and calculations (56.7%), with notable wins in cross-domain workflows. Claude 3.5 Sonnet trails GPT-4o overall but remains competitive on standalone documentation (85.3%). GPT-3.5 performs poorly on cross-domain tasks and struggles even with standalone calculations (3.3%). The open-source Llama 3.3 70B outperforms GPT-3.5 on single-domain tasks but fails on cross-domain integration.
A framework check using the Model Context Protocol (MCP) produced results consistent with the original setup, suggesting the weaker outcomes are model-related rather than infrastructure-driven.
Operational metrics underscore these differences. Llama 3.3 70B needed about 10 steps per task (with heavy token use), while GPT-4o averaged six, indicating better agent selection and contextual grounding. GPT-4o achieved a 65% task success rate versus 32.8% for GPT-3.5. Claude 3.5 Sonnet had the highest mean response latency (17.31 s), while Llama 3.3 70B was the fastest (7 s). In multi-agent and multi-tool settings, GPT-4o again topped the charts. A comparison of architectures showed GPT-4o performed better in a multi-agent setup (70%) than with direct tool integration (58%), indicating that advanced models benefit from multi-agent coordination.
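These operational metrics are simple aggregates over per-task run logs. A minimal computation, assuming hypothetical records with `success`, `steps`, and `latency_s` fields, might look like this.

```python
# Sketch of how success rate, mean steps, and mean latency could be aggregated
# from per-task run records; the record format here is assumed, not the paper's.
from statistics import mean

def summarize(runs):
    """runs: list of dicts with 'success' (bool), 'steps' (int), 'latency_s' (float)."""
    return {
        "success_rate": sum(r["success"] for r in runs) / len(runs),
        "mean_steps": mean(r["steps"] for r in runs),
        "mean_latency_s": mean(r["latency_s"] for r in runs),
    }

# e.g. summarize(gpt4o_runs) -> {"success_rate": 0.65, "mean_steps": 6.0, ...}
```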
Where models fail—and why it matters
Error profiling across 300 task instances revealed distinct failure modes. GPT-4o’s errors (29% total) were driven mainly by code generation (21.7%), with minor agent (1.3%) and tool selection (0.3%) issues, plus instruction adherence lapses (5.7%). GPT-3.5 posted a 66.6% error rate, dominated by code generation (32%) and agent selection (27.3%). Llama 3.3 70B and Claude 3.5 Sonnet showed 60.6% and 51.6% error rates, respectively: Llama frequently produced non-functional or ill-structured tool calls, while Claude often misrouted tasks between AFM-HA and DHA.
Crucially, multiple models—GPT-4o included—sometimes exceeded their instructions, taking unauthorized or risky actions (for example, moving the AFM tip when only a cantilever change was requested). The authors dub this behavior “sleepwalking”: executing plausible but unspecified steps, akin to hallucination for actions. In lab settings, that’s a safety risk.
Guardrails for autonomous labs
The team implemented a two-tier safety protocol. First, they restricted documentation for critical AFM operations (factory calibrations, laser alignment, piezo/thermal calibrations) to prevent code generation on high-risk procedures, while keeping general operations fully accessible so that standard experiments remain possible. Second, they constrained dynamic code generation to image analysis and blocked external installs or system modifications. When prompted to install a Python library, AILA correctly refused, validating the guardrails. While a human-in-the-loop could further reduce risk, the authors intentionally avoided it to preserve throughput and autonomy.
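In spirit, the two tiers act as a pre-execution filter: one list gates which documentation topics the agent may act on, and a second check rejects generated code that installs packages or touches the system. The topic and pattern lists below are illustrative examples, not the study’s actual configuration.

```python
# Illustrative two-tier guardrail; the restricted topics and blocked patterns
# are examples, not the exact configuration used in the study.
import re

RESTRICTED_TOPICS = {"factory calibration", "laser alignment",
                     "piezo calibration", "thermal calibration"}

BLOCKED_CODE = [r"\bpip\s+install\b", r"\bsubprocess\b", r"\bos\.system\b"]

def documentation_allowed(query: str) -> bool:
    """Tier 1: refuse to surface documentation for high-risk AFM procedures."""
    return not any(topic in query.lower() for topic in RESTRICTED_TOPICS)

def code_allowed(source: str) -> bool:
    """Tier 2: reject generated code that installs packages or modifies the system."""
    return not any(re.search(pattern, source) for pattern in BLOCKED_CODE)
```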
Five real experiments, end-to-end
- Automated calibration: AILA optimized PID gains on a calibration grid by minimizing trace–retrace mismatch. Over 15 generations (45 images), structural similarity (SSIM) improved to above 0.81 with tuned gains, and the parameters generalized to larger scans (a sketch of this tuning loop follows the list).
- High-resolution step edges: On HOPG, AILA selected appropriate baseline corrections (e.g., fifth-order polynomial) before iterative PID tuning, resolving atomic steps that were obscured in raw images.
- Friction vs. load on HOPG: With setpoints from 0.2 V to 1.2 V in 0.2 V steps, AILA automated imaging, friction extraction, and plotting—producing raw, reproducible outputs without manual tweaking (see the sweep sketch after the list).
- Graphene flake thickness: Using image segmentation within a selected region, AILA isolated the largest flake and estimated layer count by combining processing steps and numerical reasoning.
- Indenter profiling: From indentation topography and line profiles, AILA inferred a Vickers-type indenter, justifying the call using known geometric signatures.
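The automated calibration in the first experiment amounts to a search over PID gains scored by trace–retrace agreement. The sketch below assumes a hypothetical `afm.scan()` that returns trace and retrace images and uses scikit-image’s SSIM as the score; the random-perturbation search is a stand-in for whatever optimizer AILA actually employs.

```python
# Sketch of SSIM-driven PID tuning; afm.scan() is a hypothetical API and the
# random-perturbation search is illustrative, not AILA's actual optimizer.
import random
from skimage.metrics import structural_similarity as ssim

def tune_pid(afm, generations=15, images_per_gen=3):
    best_gains, best_score = (1.0, 0.5, 0.0), -1.0       # starting gains are placeholders
    for _ in range(generations):
        # Perturb the current best gains to propose a new candidate.
        candidate = tuple(g * random.uniform(0.8, 1.2) for g in best_gains)
        scores = []
        for _ in range(images_per_gen):                    # 15 generations x 3 images = 45 images
            trace, retrace = afm.scan(pid=candidate)       # forward and backward scans
            scores.append(ssim(trace, retrace,
                               data_range=trace.max() - trace.min()))
        score = sum(scores) / len(scores)
        if score > best_score:                             # keep gains with better agreement
            best_gains, best_score = candidate, score
    return best_gains, best_score                          # e.g. SSIM above 0.81 after tuning
```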
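The friction-versus-load run is, at heart, a parameter sweep. A minimal version, again with assumed `afm` and `analyzer` interfaces, could look like the following.

```python
# Sketch of the friction-vs-load sweep on HOPG; the setpoint values come from the
# text, while the afm/analyzer interfaces are assumed for illustration.
import numpy as np
import matplotlib.pyplot as plt

def friction_vs_load(afm, analyzer):
    setpoints = np.arange(0.2, 1.2 + 1e-9, 0.2)       # 0.2 V to 1.2 V in 0.2 V steps
    friction = []
    for sp in setpoints:
        image = afm.scan(setpoint_v=sp)                # image at this normal load
        friction.append(analyzer.friction(image))      # extract the friction signal

    plt.plot(setpoints, friction, marker="o")
    plt.xlabel("Setpoint (V)")
    plt.ylabel("Friction signal (a.u.)")
    plt.savefig("friction_vs_load.png")                # raw, reproducible output
    return setpoints, friction
```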
Prompt quality mattered. The authors observed that more explicit, structured prompts improved reliability and execution with GPT-4o. For fairness, they standardized prompts across experiments without iterative prompt optimization.
Takeaway
AILA shows how LLM agents can translate natural language into precise AFM workflows, execute on real hardware, and deliver scientifically meaningful analyses. The study also makes clear that autonomy demands robust guardrails and careful benchmarking. Today, GPT-4o offers the best balance of accuracy and efficiency for multi-agent scientific automation; yet issues like code-generation errors and “sleepwalking” must be addressed before these systems can safely scale across laboratories.