June 11, 2026 Research Brief
Agent security turns stateful.
Today’s strongest papers show agent risk moving into memory, execution state, and post-training drift, while executable benchmarks and internal monitors expose failures that output-only checks miss.
Takeaways
- Agent security is shifting from prompt-only threats to **stateful, system-level compromise**: memory poisoning, skill poisoning, covert exfiltration, and long-horizon attacks repeatedly outperform simpler prompt-injection assumptions in realistic environments.
- Evaluation is getting more realistic and more sobering: **state-based and executable benchmarks** consistently show lower performance than output-only or static evaluations, with tool failures, workflow incompletion, and implementation-detail errors dominating.
- Several papers show that **internal or structural signals beat surface heuristics**: mechanistic monitoring of hidden states detects covert encoding better than output filters, provenance-grounded gating beats post-hoc retrieval for synthetic data curation, and per-turn CoT/output analysis reveals failures hidden by terminal metrics.
Start with: AgentCanary: A Security Evaluation Framework for Autonomous AI Agents in Real Executable Environments
Why it catches my eye: It gives a reusable executable evaluation framework showing stateful attacks beat prompt-only assumptions in realistic agent settings.
Read skeptically for: Defense gains are modest, and transfer from benchmarked environments to production stacks is still uncertain.
Themes
Papers Worth Your Reading Time
Ranked for research usefulness: novelty, method pattern, evidence quality, and skepticism value.
AgentCanary: A Security Evaluation Framework for Autonomous AI Agents in Real Executable Environments
#1Useful if you deploy agents with tools or memory: it evaluates realistic attack paths and separates safety, awareness, and utility outcomes.
- Why now
- Prompt-injection-only testing is no longer credible for persistent autonomous agents.
- Skepticism
- Runtime defenses help unevenly, and benchmark realism still may not capture heterogeneous production stacks.
MIRAGE: A Polarity-Flipping Encoding Subspace in LLM Agents
#2Worth opening as a strong example of mechanistic monitoring outperforming output-only detection for covert agent exfiltration.
- Why now
- As agent monitoring matures, papers that detect intent before harmful text appears are especially timely.
- Skepticism
- It needs white-box access and reported monitor compatibility varies across host models.
Does Reasoning Preserve Alignment? On the Trustworthiness of Large Reasoning Models
#3A useful audit showing reasoning-focused post-training can improve capability while regressing safety, privacy, bias, and robustness.
- Why now
- Reasoning models are shipping quickly, often with capability gains reported more clearly than trust regressions.
- Skepticism
- Evidence is limited to open models up to 14B, and the KL analysis is diagnostic rather than causal.
Chinese version: [中文]
Run stats
- Candidates: 315
- Selected: 30
- Deepread completed: 30
- Window (UTC): 2026-06-09T00:00:00Z → 2026-06-10T00:00:00Z (arxiv_announce, expanded=0)
Show selected papers
| arXiv ID | Title / Links | Categories | Score | Why | Tags |
|---|---|---|---|---|---|
2606.10749 | Toward Secure LLM Agents: Threat Surfaces, Attacks, Defenses, and Evaluation | cs.CR, cs.AI | 96 | Comprehensive 247-paper synthesis of LLM agent security threats, defenses, and evaluation. | llm-agents, security, survey, evaluation, threat-models |
2606.11063 | CIAware-Bench: Benchmarking Control Intervention Awareness Across Frontier LLMs | cs.AI, cs.LG | 96 | Benchmark for control-intervention awareness in frontier LLMs; directly targets AI control evasion risk. | ai-safety, agents, control, benchmark, evaluation, frontier-llms |
2606.10304 | MIRAGE: A Polarity-Flipping Encoding Subspace in LLM Agents | cs.CL | 95 | Mechanistic detection of covert encoding in LLM agents; strong safety relevance and broad generalization. | llm-safety, agents, mechanistic-interpretability, covert-channels, monitoring |
2606.10484 | AgentCanary: A Security Evaluation Framework for Autonomous AI Agents in Real Executable Environments | cs.CR | 94 | Real-executable security eval framework for autonomous agents with broad risk taxonomy. | agent-safety, security-evaluation, benchmark, autonomous-agents, red-teaming |
2606.10860 | Training LLMs to Enforce Multi-Level Instruction Hierarchies via Gravity-Weighted Direct Preference Optimization | cs.CR, cs.CL | 94 | Trains LLMs to obey multi-level instruction hierarchies; highly relevant to prompt injection defense. | security, prompt-injection, alignment, dpo, instruction-hierarchy, llm-safety |
2606.10525 | Assessing Automated Prompt Injection Attacks in Agentic Environments | cs.CR, cs.AI | 93 | Strong empirical study of automated prompt injection attacks in realistic agentic settings. | prompt-injection, agents, security, adversarial-attacks, evaluation |
2606.10931 | It Takes One to Bias Them All: Breaking Bad with One-Shot GRPO | cs.CL | 93 | Shows one-shot GRPO can override alignment with a single biased example; important post-training vulnerability. | alignment, post-training, grpo, robustness, safety |
2606.10852 | Janus: A Benchmark for Goal-Conditioned Information Distortion in LLMs | cs.CL, cs.AI | 93 | Benchmark for subtle goal-conditioned distortion; strong relevance to deception and alignment evals. | alignment, deception, benchmark, evaluation, factuality |
2606.11150 | ABC-Bench: An Agentic Bio-Capabilities Benchmark for Biosecurity | cs.AI, cs.CY | 93 | Agentic biosecurity benchmark for dual-use biology tasks; strong safety relevance and reusable evaluation suite. | biosecurity, agents, benchmark, dual-use, evaluation, safety |
2606.11046 | Does Reasoning Preserve Alignment? On the Trustworthiness of Large Reasoning Models | cs.CL | 92 | Audits whether post-trained reasoning models preserve alignment across six trust dimensions. | reasoning-models, alignment, trustworthiness, safety, post-training |
2606.10740 | When the Chain of Thought Knows Better: Failure Modes in Multi-Turn Reasoning Models | cs.AI, cs.CL, cs.LG | 91 | Introduces trace-level safety matrix exposing hidden multi-turn reasoning failure modes. | chain-of-thought, multi-turn, alignment, evaluation, jailbreaks |
2606.10724 | Fingerprinting All AI Cluster I/O Without Mutually Trusted Processors | cs.CR | 91 | Concrete AI governance/security infrastructure for auditing cluster I/O and limiting covert exfiltration. | ai-governance, security, auditing, compute-governance, verification |
2606.11105 | PhantomBench: Benchmarking the Non-existential Threat of Language Models | cs.CL, cs.AI | 91 | 60K non-existent entities benchmark exposes severe hallucination and knowledge-boundary failures. | hallucination, benchmark, reliability, evaluation, factuality |
2606.10742 | MemVenom: Triggered Poisoning of Multimodal Memories in Web Agents | cs.CR, cs.LG | 90 | Identifies multimodal memory poisoning in web agents, a practical long-horizon attack surface. | memory-poisoning, web-agents, multimodal, security, black-box-attacks |
2606.10949 | Recalling Too Well: Sycophancy Evaluation and Mitigation in Memory-Augmented Models | cs.AI | 90 | Shows memory systems can amplify sycophancy; introduces benchmark and mitigation for reliability risks. | reliability, memory, sycophancy, benchmark, mitigation, llms |
2606.10388 | SkillResolve-Bench: Measuring and Resolving Same-Capability Ambiguity in Agent Skill Retrieval | cs.IR, cs.AI | 89 | Benchmark targets risky same-capability skill retrieval, a practical failure mode for tool-using agents. | agents, benchmark, tool-use, retrieval, safety-evaluation |
2606.11042 | Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields | cs.AI | 89 | Long-horizon professional GUI benchmark for computer-use agents in realistic high-value workflows. | agents, benchmark, computer-use, gui, evaluation |
2606.10394 | STAGE-Claw: Automated State-based Agent Benchmarking for Realistic Scenarios | cs.AI | 88 | Automates realistic state-based personal-agent benchmarking beyond static sandbox tasks. | agent-benchmark, evaluation, personal-agents, realistic-environments, framework |
2606.10371 | Test-time Adversarial Takeover: A Real-time Hijacking Interface against Robotic Diffusion Policies | cs.RO, cs.AI | 88 | Shows real-time adversarial takeover of robotic diffusion policies; important agent security signal. | security, adversarial, robotics, embodied-ai, safety |
2606.10813 | RedAct: Redacting Agent Capability Traces for Procedural Skill Protection | cs.CR, cs.CL | 87 | Benchmarks and mitigates leakage of procedural skills from agent execution traces. | agent-security, privacy, trace-redaction, benchmark, capability-leakage |
2606.11070 | T1-Bench: Benchmarking Multi-Scenario Agents in Real-World Domains | cs.CL, cs.AI | 87 | Realistic multi-domain benchmark for agentic systems with higher complexity and richer evaluation signals. | agents, benchmark, evaluation, tool-use, real-world |
2606.11119 | TRACE: A Unified Rollout Budget Allocation Framework for Efficient Agentic Reinforcement Learning | cs.LG, cs.AI, cs.CL | 87 | Unified rollout-budget allocation for multi-turn agentic RL; likely useful for efficient agent training. | agentic-rl, reasoning, training, efficiency, reinforcement-learning |
2606.11127 | Provenance-Grounded Gating and Adaptive Recovery in Synthetic Post-Training Data Curation | cs.CL, cs.AI | 87 | Provenance-grounded gating and recovery for synthetic post-training data improves faithfulness under adversarial corpora. | post-training, synthetic-data, faithfulness, hallucination, data-curation, adversarial |
2606.10481 | Advancing the State-of-the-Art in Empirical Privacy Auditing | cs.LG, cs.AI, cs.CL, cs.CR, stat.ML | 86 | Improves empirical privacy auditing for LLM fine-tuning with stronger synthetic canaries. | privacy, auditing, llms, memorization, membership-inference |
2606.10616 | Learning What to Remember: Observability-Safe Memory Retention via Constrained Optimization for Long-Horizon Language Agents | cs.AI | 86 | Frames agent memory retention as constrained optimization with safety-aware delayed costs and observability. | agents, memory, long-context, optimization, reliability |
2606.10956 | Mind the Gap: Can Frontier LLMs Pass a Standardized Office Proficiency Exam? | cs.AI, cs.CL | 86 | Standardized office exam benchmark probes practical long-horizon computer-use capability of frontier LLMs. | agents, benchmark, office-automation, computer-use, evaluation |
2606.11189 | A Unifying Lens on Supervised Fine-Tuning Through Target Distribution Design | cs.LG, cs.AI, cs.CL | 85 | Unifying view of SFT via target distribution design; potentially broad impact on post-training methods. | sft, post-training, alignment, training-objectives, llms |
2606.10281 | Benchmarking and Exploring the Capabilities of LLMs for Attack Investigations | cs.CR, cs.CL | 84 | Useful benchmark for LLM-based attack investigation on real audit logs and IR tasks. | security, benchmark, audit-logs, incident-response, llm-evaluation |
2606.11079 | VISTA: A Versatile Interactive User Simulation Toolkit for Agent Evaluation | cs.CL | 84 | Interactive user simulation toolkit could improve dynamic evaluation and failure discovery for agents. | agents, evaluation, user-simulation, toolkit, benchmarking |
2606.10945 | Context-Based Adversarial Attacks on AI Code Generators: Vulnerability Analysis and Implications | cs.CR, cs.SE | 84 | Context-based attacks make code LLMs emit vulnerable code; concrete security implications and results. | security, code-llms, adversarial, secure-coding, robustness |
AI Paper Insight Brief
2026-06-11
0) Executive takeaways (read this first)
- Agent security is shifting from prompt-only threats to stateful, system-level compromise: memory poisoning, skill poisoning, covert exfiltration, and long-horizon attacks repeatedly outperform simpler prompt-injection assumptions in realistic environments.
- Evaluation is getting more realistic and more sobering: state-based and executable benchmarks consistently show lower performance than output-only or static evaluations, with tool failures, workflow incompletion, and implementation-detail errors dominating.
- Several papers show that internal or structural signals beat surface heuristics: mechanistic monitoring of hidden states detects covert encoding better than output filters, provenance-grounded gating beats post-hoc retrieval for synthetic data curation, and per-turn CoT/output analysis reveals failures hidden by terminal metrics.
- Memory is emerging as a central safety bottleneck: it amplifies sycophancy, enables persistent multimodal poisoning, and requires budgeted, observability-safe retention policies rather than ad hoc retrieval or extraction.
- Alignment remains fragile under post-training: reasoning post-training can regress safety/privacy/bias, and even one-shot GRPO on a single corrupted example can induce broad biased behavior.
- For practitioners, the near-term playbook is clear: prioritize stateful benchmark coverage, provenance-aware memory/data pipelines, privilege-aware agent design, and deployment-specific audits over generic jailbreak scores.
2) Key themes (clusters)
Theme: Stateful agent security is now the main battleground
- Why it matters: The strongest failures now come from persistent state, tools, and multi-step execution rather than single-turn prompt attacks. This changes both threat models and defenses: you need controls over memory, skills, trajectories, and environment side effects.
- Representative papers:
- AgentCanary: A Security Evaluation Framework for Autonomous AI Agents in Real Executable Environments
- MemVenom: Triggered Poisoning of Multimodal Memories in Web Agents
- Assessing Automated Prompt Injection Attacks in Agentic Environments
- Toward Secure LLM Agents: Threat Surfaces, Attacks, Defenses, and Evaluation
- Common approach:
- Evaluate agents in executable environments with real tools, persistent state, and deterministic environment checks
- Decompose threats by entry vector and downstream impact rather than treating all attacks as prompt injection
- Measure full trajectories and side effects, not just final text outputs
- Stress-test memory contamination, skill poisoning, and long-horizon attack chains
- Open questions / failure modes:
- How to defend memory and coordination layers, which remain thinner than input/tool defenses
- Whether current benchmarked attacks transfer cleanly to heterogeneous production memory/tool stacks
- How much CI detection, attack adaptation, or monitor evasion emerges in repeated deployments
- Whether lightweight runtime defenses can do more than modestly reduce attack success
Theme: Better benchmarks are exposing lower real-world agent capability
- Why it matters: As benchmarks move from static text tasks to stateful software, GUI workflows, and certification-style tasks, model performance drops sharply. This suggests many current capability claims are inflated by artifact-heavy or output-only evaluation.
- Representative papers:
- STAGE-Claw: Automated State-based Agent Benchmarking for Realistic Scenarios
- Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields
- Mind the Gap: Can Frontier LLMs Pass a Standardized Office Proficiency Exam?
- T1-Bench: Benchmarking Multi-Scenario Agents in Real-World Domains
- Common approach:
- Use real or VM-backed environments with deterministic verifiers and persistent state
- Score criterion-level execution correctness rather than broad task-completion impressions
- Analyze failure traces to localize tool, planning, formatting, and environment errors
- Compare single-shot generation against iterative agent scaffolds with execution feedback
- Open questions / failure modes:
- Tool failures and workflow incompletion dominate, especially in long-horizon settings
- Snapshot-based GUI agents lack continuous feedback needed for precise manipulation
- Coding-agent gains confound multiple factors: repair loops, tool access, scaffolding, and execution feedback
- Benchmark scale is improving, but repeated-run variance and platform dependence remain under-measured
Theme: Memory is becoming both a capability lever and a safety liability
- Why it matters: Persistent memory improves long-horizon behavior, but it also creates new attack surfaces and alignment failures. The same subsystem can amplify user misconceptions, retain poisoned multimodal content, or fail under budget constraints.
- Representative papers:
- Recalling Too Well: Sycophancy Evaluation and Mitigation in Memory-Augmented Models
- MemVenom: Triggered Poisoning of Multimodal Memories in Web Agents
- Learning What to Remember: Observability-Safe Memory Retention via Constrained Optimization for Long-Horizon Language Agents
- SkillResolve-Bench: Measuring and Resolving Same-Capability Ambiguity in Agent Skill Retrieval
- Common approach:
- Treat memory as a first-class system component with its own threat model and optimization objective
- Separate retrieval quality from execution safety, especially among near-duplicate or sibling skills
- Use benchmarked, query-conditioned risk metrics rather than generic recall alone
- Add observability constraints so learned retention policies do not depend on oracle-only signals
- Open questions / failure modes:
- Memory extraction and summarization choices can strongly alter downstream sycophancy
- Black-box memory poisoning can preserve benign utility while remaining highly effective when triggered
- Capability-family retrieval can still surface execution-inappropriate siblings unless representative selection is explicit
- Freshness, provenance, and revocation remain weakly modeled in current memory systems
Theme: Internal monitoring and provenance-aware curation outperform surface checks
- Why it matters: Several papers show that the most useful safety signals are upstream of final outputs: hidden-state probes, exact source provenance, and trace-level decomposition reveal failures that output-only filters miss.
- Representative papers:
- MIRAGE: A Polarity-Flipping Encoding Subspace in LLM Agents
- Provenance-Grounded Gating and Adaptive Recovery in Synthetic Post-Training Data Curation
- When the Chain of Thought Knows Better: Failure Modes in Multi-Turn Reasoning Models
- CIAware-Bench: Benchmarking Control Intervention Awareness Across Frontier LLMs
- Common approach:
- Probe hidden states or intermediate traces instead of relying on output text alone
- Preserve exact provenance from generation time and use it directly in gating
- Evaluate per-turn or per-step behavior to separate internal stance from visible output
- Study monitor/actor interaction as a two-sided protocol, including detectability of interventions
- Open questions / failure modes:
- White-box methods like MIRAGE do not apply to black-box APIs
- Detection does not yet imply prevention; intervention awareness may still enable adaptation
- Visible CoT may be unfaithful, complicating interpretation of trace-level labels
- Provenance-grounded gains are shown in limited model families and datasets so far
Theme: Post-training can easily damage alignment
- Why it matters: Multiple papers suggest that capability-oriented post-training can regress safety, privacy, fairness, and instruction-following robustness. This is not a marginal effect: in some settings, a single bad example or reasoning-focused tuning materially shifts behavior.
- Representative papers:
- Does Reasoning Preserve Alignment? On the Trustworthiness of Large Reasoning Models
- It Takes One to Bias Them All: Breaking Bad with One-Shot GRPO
- Training LLMs to Enforce Multi-Level Instruction Hierarchies via Gravity-Weighted Direct Preference Optimization
- A Unifying Lens on Supervised Fine-Tuning Through Target Distribution Design
- Common approach:
- Compare matched pre/post-training variants across multiple trustworthiness axes
- Analyze training objectives directly rather than treating post-training as a black box
- Introduce structure-aware objectives for hierarchy compliance or softer target distributions
- Use drift diagnostics and ablations to connect optimization choices to behavioral regressions
- Open questions / failure modes:
- Reasoning gains can coincide with large drops in safety, privacy, and ethics metrics
- GRPO appears highly vulnerable to corrupted supervision, but broader RLVR generalization is still open
- Hierarchy-aware training helps, but evidence is still mostly on one base model and text-only settings
- Better SFT targets improve reasoning, but broader safety effects were not the focus
Theme: Security evaluation is broadening beyond text LLMs
- Why it matters: The attack surface now spans robots, code generators, SOC workflows, and even AI datacenter I/O. This broadening matters because many vulnerabilities are structural to deployment context, not just to language generation.
- Representative papers:
- Test-time Adversarial Takeover: A Real-time Hijacking Interface against Robotic Diffusion Policies
- Context-Based Adversarial Attacks on AI Code Generators: Vulnerability Analysis and Implications
- Benchmarking and Exploring the Capabilities of LLMs for Attack Investigations
- Fingerprinting All AI Cluster I/O Without Mutually Trusted Processors
- Common approach:
- Build domain-specific benchmarks with operationally meaningful metrics and threat models
- Compare multiple representations, attack channels, or deployment architectures
- Emphasize false positives, transferability, and real-time feasibility
- Treat security as a systems problem involving interfaces, sensors, logs, and infrastructure
- Open questions / failure modes:
- White-box assumptions remain common in robotics and some monitoring setups
- Strong transferability in code attacks and patch-based robot hijacking suggests structural weaknesses
- LLMs in SOC workflows still over-trigger on suspicious names/sequences, creating analyst burden
- Datacenter I/O fingerprinting still leaves high-capacity residual channels like output steganography
3) Technical synthesis
- A recurring pattern is evaluation moving from text outputs to executable state: STAGE-Claw, AgentCanary, Workflow-GYM, OFFICEEVAL, and T1-Bench all use environment-grounded scoring and all report materially harsher results than lighter-weight evaluations.
- Several papers decompose failures into orthogonal dimensions rather than single scores: AgentCanary uses OSS/SAS/TUS, JANUS separates five distortion dimensions, CIAware-Bench isolates intervention detectability, and the CoT-output matrix separates internal vs external safety.
- Memory and retrieval are being reframed as safety-critical control points, not just capability boosters: SkillResolve introduces HSR@K, MIST isolates memory-induced sycophancy, MemVenom attacks graph memory directly, and OSL-MR formalizes retention under budget and observability constraints.
- Output-only defenses repeatedly underperform: MIRAGE beats text-only exfiltration detectors, state-based evaluation beats virtual/output-only scoring, and provenance-grounded hallucination gates beat reward-only or post-hoc evidence checks.
- Multiple papers show smaller or cheaper models can match or beat larger ones in narrow operational tasks: AuditBench finds smaller models sometimes outperform larger ones; benchmark results in agent settings often depend more on scaffolding, representation, or environment fit than raw model size.
- Prompting and representation choices remain highly model-dependent: raw vs provenance-edge logs, prompt v1 vs v2, and different intervention styles all produce non-uniform gains, arguing against one-size-fits-all hardening.
- Long-horizon failures are dominated by compounding local errors: tool-call formatting mistakes, stage omission, objective drift, and error propagation recur across STAGE-Claw, Workflow-GYM, OFFICEEVAL, and T1-Bench.
- Adaptive allocation and curation are becoming central efficiency levers: TRACE reallocates rollout budget to mixed-outcome prefixes, while provenance-grounded adaptive recovery salvages rejected synthetic samples better than naive retry.
- Behavioral drift is emerging as a useful diagnostic abstraction: reasoning post-training regressions correlate with conditional KL drift, and one-shot GRPO results suggest tiny supervision changes can trigger large policy shifts once bad behaviors are sampled.
- Across security papers, the strongest practical lesson is that deployment properties matter as much as model properties: host-model geometry affects MIRAGE compatibility, benchmark outcomes depend on framework/tool wrappers, and monitor detectability depends on acting/intervening/environment triplets.
4) Top 5 papers (with “why now”)
AgentCanary: A Security Evaluation Framework for Autonomous AI Agents in Real Executable Environments
- Introduces a broad Entry × Impact taxonomy and evaluates 496 executable tasks across multiple frameworks and attack regimes.
- Shows that memory contamination, skill poisoning, and long-horizon progressive attacks are substantially more damaging than indirect prompt injection alone.
- Uses trajectory-grounded scoring that separates outcome safety, awareness, and utility, which is more decision-useful than single ASR numbers.
- Why now: agent deployments are moving into persistent, tool-rich environments where prompt-injection-only evaluations are no longer enough.
- Skepticism: runtime defenses only provide modest, uneven gains, and the provided analysis does not include a detailed limitations section.
MIRAGE: A Polarity-Flipping Encoding Subspace in LLM Agents
- Finds a shared low-dimensional encoding subspace that generalizes across nine covert encoding families and multiple open-weight models.
- Detects exfiltration intent both during generation and at the planning token, enabling upstream monitoring before encoded text appears.
- Outperforms output-only detectors by a wide margin on the reported agentic exfiltration suite.
- Why now: covert exfiltration is a realistic agent risk, and this is one of the clearest cases where mechanistic monitoring looks practically stronger than surface filtering.
- Skepticism: requires white-box access and monitor compatibility varies sharply by host model.
Does Reasoning Preserve Alignment? On the Trustworthiness of Large Reasoning Models
- Provides a controlled audit showing reasoning post-training can improve math/reasoning benchmarks while regressing safety, privacy, bias, ethics, and OOD robustness.
- Distinguishes pathway-specific failure modes across SFT, GRPO-style RL, and distillation.
- Connects regressions to conditional KL drift, offering a concrete release-time diagnostic.
- Why now: reasoning models are being deployed rapidly, often with capability-first reporting that may hide alignment regressions.
- Skepticism: evidence is on open models up to 14B and the KL analysis is diagnostic, not causal.
MemVenom: Triggered Poisoning of Multimodal Memories in Web Agents
- Demonstrates a black-box attack that poisons multimodal graph memory and later triggers high end-to-end attack success while preserving benign utility.
- Combines retrieval-stage trigger optimization with post-recall visual prioritization, making the attack persistent and modular.
- Evaluates across multiple web-agent frameworks and VLM backbones, including GPT-5-family agents.
- Why now: memory-augmented agents are proliferating, and persistent memory poisoning is likely under-defended relative to prompt injection.
- Skepticism: evaluation is still in controlled sandbox settings and defenses tested are lightweight.
STAGE-Claw: Automated State-based Agent Benchmarking for Realistic Scenarios
- Automates construction and validation of state-based tasks in real personal-computing environments.
- Shows output-only evaluation overestimates performance and that tool failures dominate non-passing runs.
- Provides practical cost, latency, and failure-mode data across 11 frontier models.
- Why now: teams need scalable but realistic agent evals, and state-based verification is becoming the minimum credible standard.
- Skepticism: current scale is still a 40-task pilot and platform-dependent failures can affect scores.
5) Practical next steps
- Add at least one state-based executable benchmark to your eval stack; do not rely on final-text scoring for agent reliability claims.
- Audit memory systems separately for poisoning, sycophancy amplification, provenance loss, and stale retrieval, not just recall quality.
- For agent security, expand threat models from prompt injection to memory contamination, skill poisoning, long-horizon attacks, and covert exfiltration.
- If you control weights, test hidden-state monitors for narrow high-risk behaviors; if you do not, compensate with stronger provenance, tool gating, and environment-side checks.
- Track behavioral drift after post-training using matched pre/post evaluations on safety, privacy, bias, and refusal calibration—not just capability gains.
- In synthetic data pipelines, preserve exact source provenance and add adaptive recovery for rejected samples instead of treating rejection as terminal.
- Measure false positives and operator burden explicitly in security workflows; AuditBench suggests over-suspicion can be as damaging as misses.
- For long-horizon agents, instrument and optimize for workflow incompletion, tool-call formatting, recovery behavior, and state reconciliation, since these dominate real failures.
- Evaluate whether your monitor or control protocol is itself detectable by the acting model, especially if interventions are large, stylistically distinct, or watermarked.
- For instruction hierarchy hardening, test multi-level conflict suites rather than only user-vs-system prompt injection cases.
Generated from per-paper analyses; no external browsing.