May 23, 2026 Research Brief

Agent safety gets stateful.

Today’s strongest papers show agent reliability now depends less on bigger models than on realistic security evaluation, runtime scaffolds, and explicit control of state, logs, and interfaces.

Takeaways

  1. Agent work is shifting from “train the model harder” toward “shape the interface, state, and data around the model”: compiled agent trajectories, privileged process curation, runtime harness adaptation, event-sourced execution, and millisecond checkpoint/rollback all show meaningful gains without changing core model architectures.
  2. Security evaluation is getting more realistic and more pessimistic. Multiple papers show that static or text-only safety checks miss the real failure modes: domain-camouflaged prompt injection, multi-turn/stateful evasions, artifact-level unsafe edits, benchmark exploitation, and latent KV leakage all remain substantial risks.
  3. Evaluation methodology itself is now a first-order research topic. Several papers argue that benchmark scores are easy to misread or game: contamination can hide behind CoT, single-threshold metrics can reverse conclusions in forecasting, and security benchmarks can be exploited by the agents they test.

Start with: Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety

Why it catches my eye: It captures the shift from single-turn jailbreaks to realistic multi-turn, tool-using agent failures that current safety checks miss.

Read skeptically for: As a benchmark, its long-term value depends on breadth of tasks and whether defenses overfit its attack patterns.

Tags: agent safety, multi-turn eval, tool use, benchmark

Themes

Theme: Agent training from trajectories and process signals
Several papers replace expensive manual supervision with signals already present in agent runs, patches, or sibling rollouts. The common bet is that better process data, not just more RL, can improve long-horizon reasoning, search quality, and software-agent behavior.

Theme: Runtime scaffolds, state management, and auditable agent infrastructure
A strong theme today is that many agent failures are interface and systems failures, not pure model failures. Papers here show that changing the harness, execution log, or sandbox substrate can materially improve reliability, reproducibility, and search depth.

Theme: Agent security is moving from prompt attacks to stateful, evasive, and protocol-level threats
The threat model is broadening from single-turn jailbreaks to attacks that exploit persistence, artifacts, debate dynamics, OAuth flows, retrieval drift, and latent channels. The practical message is that current guards are often calibrated to the wrong surface.

Signal: Stateful attacks are the real threat. Boiling the Frog, A3S-Bench, and domain-camouflaged injection all show multi-turn or semantically hidden attacks beating static guards.
Tension: Better scaffolds help and expose fragility. Harness adaptation, event-sourced logs, and checkpoint/rollback improve control, but patching the diagnosed module can still worsen pipelines.
Bet: Agent progress will come from interfaces. ACC, process supervision, and runtime harness work all improve outcomes without changing core model architectures.

Papers Worth Your Reading Time

Ranked for research usefulness: novelty, method pattern, evidence quality, and skepticism value.

#1 Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety

Best first read for understanding how realistic, incremental attacks break tool-using agents beyond single-turn safety tests.

Why now
Agent deployments increasingly persist state and act through tools, making multi-turn safety the relevant evaluation target.
Skepticism
Benchmark realism is strong, but coverage may still miss other enterprise workflows and attack surfaces.

#2 TerminalWorld: Benchmarking Agents on Real-World Terminal Tasks

A useful companion because it measures authentic agent usefulness on real terminal work rather than synthetic puzzles.

Why now
Terminal agents are moving into developer workflows, and this benchmark shows readiness is still limited.
Skepticism
It excludes some GUI/TUI and irreproducible environments, so real deployment difficulty may be understated.

#3 Blind Spots in the Guard: How Domain-Camouflaged Injection Attacks Evade Detection in Multi-Agent LLM Systems

Shows a concrete failure mode where prompt-injection defenses miss attacks that look domain-natural to the system.

Why now
Many teams rely on text-pattern guards that are unlikely to survive semantically adapted attacks.
Skepticism
Empirical gaps are compelling, but transfer across agent stacks and defense implementations remains uncertain.


Run stats

  • Candidates: 355
  • Selected: 30
  • Deepread completed: 30
  • Window (UTC): 2026-05-21T00:00:00Z → 2026-05-22T00:00:00Z (arxiv_announce, expanded=0)

Selected papers

| arXiv ID | Title | Categories | Score | Why | Tags |
|---|---|---|---|---|---|
| 2605.22643 | Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety | cs.CL | 95 | Multi-turn benchmark for incremental attacks on tool-using agents in realistic office settings. | agent-safety, benchmark, tool-use, multi-turn, red-teaming |
| 2605.22001 | Blind Spots in the Guard: How Domain-Camouflaged Injection Attacks Evade Detection in Multi-Agent LLM Systems | cs.CR, cs.AI, cs.CL | 95 | Shows major prompt-injection blind spot in multi-agent LLM defenses with strong empirical gaps. | agent-safety, prompt-injection, security, multi-agent, evaluation |
| 2605.22535 | TerminalWorld: Benchmarking Agents on Real-World Terminal Tasks | cs.AI | 94 | Large real-world terminal benchmark for agents; strong eval signal for agent capability and safety gaps. | agents, benchmark, evaluation, terminal, real-world |
| 2605.22786 | LCGuard: Latent Communication Guard for Safe KV Sharing in Multi-Agent Systems | cs.AI, cs.ET, cs.LG, cs.MA | 93 | Targets a new safety gap: sensitive leakage through shared KV caches in multi-agent LLMs. | multi-agent, safety, KV-cache, privacy, latent-communication |
| 2605.21958 | Diagnosis Is Not Prescription: Linguistic Co-Adaptation Explains Patching Hazards in LLM Pipelines | cs.CL | 92 | Important agent pipeline result: fixing the diagnosed module can hurt; upstream patching works better. | agents, llm-pipelines, reliability, debugging, intervention |
| 2605.21856 | The Illusion of Reasoning: Exposing Evasive Data Contamination in LLMs via Zero-CoT Truncation | cs.LG, cs.AI | 92 | Black-box method to detect evasive benchmark contamination by truncating CoT; high eval integrity value. | llm-evaluation, data-contamination, reasoning, benchmarking, robustness |
| 2605.22166 | Adapting the Interface, Not the Model: Runtime Harness Adaptation for Deterministic LLM Agents | cs.AI | 92 | Runtime harness adaptation improves frozen LLM agents; highly relevant to agent reliability and control. | agents, runtime, reliability, tool-use, evaluation |
| 2605.22321 | Benchmarking Autonomous Agents against Temporal, Spatial, and Semantic Evasions | cs.CR, cs.AI, cs.SE | 91 | Benchmarks temporal, spatial, and semantic evasions against privileged autonomous agents. | agent-security, benchmark, evasion, tool-use, adversarial-evaluation |
| 2605.22763 | Advancing Mathematics Research with AI-Driven Formal Proof Search | cs.AI | 91 | Formal-proof agent solves open problems at scale; major frontier agent progress with verification. | formal-proofs, agents, reasoning, verification, math |
| 2605.22781 | DeltaBox: Scaling Stateful AI Agents with Millisecond-Level Sandbox Checkpoint/Rollback | cs.OS, cs.AI | 91 | OS-level sandbox checkpoint/rollback for scalable agent search; strong infra relevance for safe agent execution. | agents, sandboxing, systems, checkpointing, infrastructure |
| 2605.21997 | The Log is the Agent: Event-Sourced Reactive Graphs for Auditable, Forkable Agentic Systems | cs.AI, cs.MA | 91 | Auditable, replayable agent runtime with deterministic logs and forkable execution; strong agent safety relevance. | agents, auditing, observability, runtime, deterministic-replay, memory |
| 2605.22333 | A First Measurement Study on Authentication Security in Real-World Remote MCP Servers | cs.CR | 90 | First measurement study of auth security in remote MCP servers; directly relevant to agent tooling. | MCP, authentication, security, agents, measurement |
| 2605.22720 | Can AI Make Conflicts Worse? An Alignment Failure in LLM Deployment Across Conflict Contexts | cs.AI, cs.HC | 90 | Evaluates harmful LLM behavior in conflict settings across providers; strong real-world alignment relevance. | alignment, safety-evaluation, harmful-outputs, deployment, social-impact |
| 2605.22511 | Search-E1: Self-Distillation Drives Self-Evolution in Search-Augmented Reasoning | cs.AI, cs.CL, cs.IR | 89 | Simplifies search-augmented reasoning post-training via self-distillation; likely reusable recipe. | search-augmented, reasoning, post-training, self-distillation, llm |
| 2605.21850 | ACC: Compiling Agent Trajectories for Long-Context Training | cs.CL, cs.AI | 89 | Turns agent trajectories into long-context training data; useful for frontier agentic LLM capability gains. | llm, agents, long-context, training, sft |
| 2605.22041 | RADAR: Defending RAG Dynamically against Retrieval Corruption | cs.CR, cs.LG | 88 | Dynamic defense for RAG retrieval corruption with explicit robustness-storage tradeoff. | RAG, security, retrieval, poisoning, robustness |
| 2605.22731 | Post-Training is About States, Not Tokens: A State Distribution View of SFT, RL, and On-Policy Distillation | cs.LG, cs.AI | 88 | Useful conceptual lens on post-training: state distributions unify SFT, RL, and distillation behavior. | llm-training, post-training, rl, distillation, theory |
| 2605.22446 | Pre-VLA: Preemptive Runtime Verification for Reliable Vision-Language-Action and World-Model Rollouts | cs.CV, cs.AI, cs.RO | 88 | Preemptive runtime verification for VLA/world-model actions targets embodied reliability and safety. | robotics, runtime-verification, vision-language-action, safety, world-models |
| 2605.21938 | Optimal Guarantees for Auditing Rényi Differentially Private Machine Learning | cs.LG, cs.CR, cs.IT | 88 | Optimal black-box auditing for Rényi DP claims with theory and confidence bounds; strong safety/privacy value. | privacy, differential-privacy, auditing, theory, evaluation |
| 2605.22476 | Structured-Sparse Attention for Entity Tracking with Subquadratic Sequence Complexity | cs.LG, cs.CL | 88 | Subquadratic structured attention for entity tracking over long sequences; relevant to long-context LLM efficiency. | long-context, attention, efficiency, entity-tracking, transformers |
| 2605.22672 | Is Capability a Liability? More Capable Language Models Make Worse Forecasts When It Matters Most | cs.AI | 87 | Inverse scaling on tail-risk forecasting exposes reliability failures in stronger LLMs. | reliability, forecasting, inverse-scaling, evaluation, risk |
| 2605.22608 | Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents | cs.CL, cs.AI | 86 | Automates multi-level evaluation of LLM agents across system, trace, and node levels. | agent-evaluation, observability, framework, LLM-agents, monitoring |
| 2605.22769 | Understanding Data Temporality Impact on Large Language Models Pre-training | cs.CL, cs.AI | 86 | Studies temporal ordering in LLM pretraining with a new benchmark for time-grounded factual knowledge. | llm-pretraining, temporal-reasoning, benchmark, knowledge, evaluation |
| 2605.22664 | WorkstreamBench: Evaluating LLM Agents on End-to-End Spreadsheet Tasks in Finance | cs.AI | 86 | End-to-end spreadsheet benchmark for LLM agents in finance; realistic evaluation of agent workflows. | agents, benchmark, evaluation, finance, spreadsheets |
| 2605.22012 | LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning | cs.CL, cs.CV | 86 | Unified latent audio-visual reasoning for multimodal LLMs; notable frontier multimodal reasoning direction. | multimodal, reasoning, audio-visual, latent-space, MLLM |
| 2605.22737 | The Distillation Game: Adaptive Attacks & Efficient Defenses | cs.LG, cs.AI | 85 | Addresses model distillation attacks with adaptive threat model and efficient teacher-side defense. | model-security, distillation, defenses, adaptive-attacks, llm-deployment |
| 2605.22681 | Forecasting Scientific Progress with Artificial Intelligence | cs.AI | 85 | Benchmark for forecasting scientific progress under temporal constraints; useful for capability evaluation. | benchmark, scientific-reasoning, forecasting, evaluation, frontier-models |
| 2605.21996 | From Patches to Trajectories: Privileged Process Supervision for Software-Engineering Agents | cs.SE, cs.AI | 85 | Improves SWE agent training via privileged process supervision, targeting trajectory quality not just outcomes. | agents, software-engineering, training, process-supervision, sft |
| 2605.22568 | Measuring Security Without Fooling Ourselves: Why Benchmarking Agents Is Hard | cs.CR, cs.AI | 84 | Sharp critique of agent security benchmarking pitfalls; useful for evaluation methodology. | agent-security, evaluation, benchmarking, methodology |
| 2605.22456 | Steins;Gate Drive: Semantic Safety Arbitration over Structured Futures for Latency-Decoupled LLM Planning | cs.RO, cs.AI | 84 | LLM planning with semantic safety arbitration over predicted futures for autonomous driving. | planning, autonomous-driving, safety, runtime, llm-agents |

AI Paper Insight Brief

2026-05-23

0) Executive takeaways (read this first)

  • Agent work is shifting from “train the model harder” toward “shape the interface, state, and data around the model”: compiled agent trajectories, privileged process curation, runtime harness adaptation, event-sourced execution, and millisecond checkpoint/rollback all show meaningful gains without changing core model architectures.
  • Security evaluation is getting more realistic and more pessimistic. Multiple papers show that static or text-only safety checks miss the real failure modes: domain-camouflaged prompt injection, multi-turn/stateful evasions, artifact-level unsafe edits, benchmark exploitation, and latent KV leakage all remain substantial risks.
  • Evaluation methodology itself is now a first-order research topic. Several papers argue that benchmark scores are easy to misread or game: contamination can hide behind CoT, single-threshold metrics can reverse conclusions in forecasting, and security benchmarks can be exploited by the agents they test.
  • Long-context and process supervision continue to look like high-leverage capability multipliers. ACC turns agent logs into long-context QA and gets a 30B model near a much larger model on long-range benchmarks; P2T improves SWE Pass@1 while reducing inference cost; Search-E1 extracts dense supervision from the model’s own search rollouts.
  • Frontier agent systems are still brittle on authentic workflows. Real-world terminal tasks top out at 62.5% pass rate, finance spreadsheet agents top out at 69.1/100, and scientific forecasting remains weak on feasibility and timing even when models can identify plausible mechanisms.
  • A recurring pattern across safety and robustness papers: the most useful interventions are often not where naive diagnosis points. Patching the “most causal” module can hurt, stronger models can forecast worse in tail-risk settings, and more exposed reasoning traces can improve utility while increasing distillation risk.
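
The point that a single-threshold metric can reverse a forecasting conclusion can be illustrated with a minimal sketch. The numbers are made up for illustration and are not taken from any paper in this brief; forecasters A and B, the four ordered outcome bins, and both helper functions are hypothetical. The first helper is a Brier score on one tail event; the second is a discrete analogue of CRPS (the ranked probability score, a sum of squared CDF errors over bins).

```python
from itertools import accumulate

def brier_at_threshold(probs, outcome, threshold):
    """Brier score for the binary event 'outcome bin >= threshold'."""
    p_event = sum(probs[threshold:])
    y_event = 1.0 if outcome >= threshold else 0.0
    return (p_event - y_event) ** 2

def discrete_crps(probs, outcome):
    """Sum over bins of squared gaps between the forecast CDF and the observed step CDF."""
    cdf = list(accumulate(probs))
    return sum((f - (1.0 if outcome <= k else 0.0)) ** 2 for k, f in enumerate(cdf))

# Hypothetical forecasts over four ordered bins (0 = mild ... 3 = extreme).
forecaster_a = [0.5, 0.0, 0.0, 0.5]
forecaster_b = [0.0, 0.6, 0.0, 0.4]
outcome = 3  # the extreme bin is what actually happened

brier_a = brier_at_threshold(forecaster_a, outcome, threshold=3)
brier_b = brier_at_threshold(forecaster_b, outcome, threshold=3)
crps_a = discrete_crps(forecaster_a, outcome)
crps_b = discrete_crps(forecaster_b, outcome)

print("Brier prefers A:", brier_a < brier_b)
print("CRPS prefers B:", crps_b < crps_a)
```

In this toy case A wins on the tail-event Brier score (0.25 vs 0.36) while B wins on the discrete CRPS (0.72 vs 0.75), because B put its remaining probability mass closer to the realized bin. Lower is better for both scores.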

2) Key themes (clusters)

Theme: Agent training from trajectories and process signals

Theme: Runtime scaffolds, state management, and auditable agent infrastructure

Theme: Agent security is moving from prompt attacks to stateful, evasive, and protocol-level threats

Theme: Evaluation is under attack—from contamination, metric choice, and benchmark exploitability

Theme: Real-world benchmarks are exposing a large gap between synthetic competence and deployed usefulness

Theme: Privacy and leakage are shifting to harder-to-see channels

3) Technical synthesis

  • A large fraction of papers replace end-to-end optimization with structured intermediate objects: compiled contexts (ACC), process graphs (P2T), typed forecasts (Steins;Gate Drive), event logs (ActiveGraph), and sanitized KV transforms (LCGuard). The trend is toward making hidden agent state explicit and controllable.
  • Several methods improve performance by changing the supervision target rather than the base model: ACC supervises evidence tokens directly, P2T scores per-step groundedness/progress, Search-E1 distills from privileged sibling trajectories, and on-policy distillation (OPD) and RL are framed as changing the state distribution being updated.
  • Security papers increasingly evaluate at the artifact/action layer instead of the response layer: unsafe file predicates in Boiling the Frog, RTR on real OpenClaw executions in A3S-Bench, and OAuth lifecycle flaws in MCP servers.
  • Multiple papers show that static detectors fail under semantic adaptation: ZCP for contamination, domain-camouflaged injection, and adaptive distillation evaluation all exploit the gap between surface cues and latent capability/leakage.
  • Runtime control is becoming layered: LIFE-HARNESS splits contract/skill/action/trajectory regulation; Pre-VLA adds a verifier before action execution; Steins;Gate Drive separates slow strategic selection from fast predicate-based invalidation.
  • Several works use exact or principled optimization in places where heuristics are common: RADAR uses exact Min-Cut for context selection, the RDP auditor gives finite-sample confidence bounds with minimax lower bounds, and the distillation paper derives exponential-tilt best responses.
  • Realistic evaluation papers repeatedly find weak transfer from standard benchmarks: TerminalWorld correlates weakly with Terminal-Bench (r = 0.20), forecasting conclusions flip under CRPS vs Brier, and scientific forecasting remains poor even when mechanistic MCQ performance is strong.
  • There is a recurring “bigger/stronger is not always safer or better” pattern: more capable models can forecast worse in tail-risk regimes, patching the highest-blame module can hurt, and richer outputs can increase distillation leakage.
  • Systems papers are increasingly optimized for branching search workloads: DeltaBox’s millisecond checkpoint/rollback and ActiveGraph’s cheap forks both target the same bottleneck, namely reusing shared prefixes without replaying expensive model/tool calls.
  • Many papers rely on LLM judges, but the stronger ones either validate them against humans (WorkstreamBench, Agentic CLEAR) or constrain them with exact artifact checks and formal predicates (Boiling the Frog, RADAR, RDP auditing).
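
The idea of forking a run while reusing a shared prefix can be sketched as an append-only event log whose state is rebuilt by folding over events. This is a minimal illustration under stated assumptions, not the design of ActiveGraph or DeltaBox: `Event`, `EventLog`, `replay`, and the event kinds are all hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass(frozen=True)
class Event:
    kind: str      # hypothetical event kinds, e.g. "tool_result", "file_written"
    payload: dict

@dataclass
class EventLog:
    """Append-only log; a fork shares its parent's prefix by reference."""
    parent: Optional["EventLog"] = None
    fork_point: int = 0                      # how many parent events this fork can see
    events: list = field(default_factory=list)

    def append(self, event):
        self.events.append(event)

    def fork(self):
        # No events are copied; the child just remembers where it branched.
        return EventLog(parent=self, fork_point=len(self))

    def __len__(self):
        return self.fork_point + len(self.events)

    def __iter__(self):
        if self.parent is not None:
            for i, ev in enumerate(self.parent):
                if i >= self.fork_point:
                    break                    # ignore what the parent appended after the fork
                yield ev
        yield from self.events

def replay(log):
    """Rebuild state by folding over events; no model or tool call is re-executed."""
    state = {"files": {}, "steps": 0}
    for ev in log:
        state["steps"] += 1
        if ev.kind == "file_written":
            state["files"][ev.payload["path"]] = ev.payload["content"]
    return state

root = EventLog()
root.append(Event("tool_result", {"tool": "search", "output": "cached result"}))
root.append(Event("file_written", {"path": "plan.md", "content": "v1"}))

branch_a = root.fork()
branch_b = root.fork()
branch_a.append(Event("file_written", {"path": "plan.md", "content": "v2-a"}))
branch_b.append(Event("file_written", {"path": "notes.md", "content": "b"}))
root.append(Event("file_written", {"path": "plan.md", "content": "v3-root"}))

print(replay(branch_a)["files"])
print(replay(branch_b)["files"])
```

Each branch sees the two shared prefix events plus its own writes, and neither sees the other's changes or the root's later append. The recorded tool output in the prefix is read back from the log rather than recomputed, which is the property the bullet on branching search workloads describes.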

4) Top 5 papers (with “why now”)

ACC: Compiling Agent Trajectories for Long-Context Training

  • Turns answer-verified agent logs into long-context QA examples, directly supervising integration over distant evidence rather than masking tool outputs.
  • Delivers large long-context gains on Qwen3-30B-A3B: MRCR 68.28 (+18.09) and GraphWalks 77.51 (+7.59), with performance comparable to Qwen3-235B-A22B on those benchmarks.
  • Useful now because many teams already have abundant agent traces but lack high-quality long-context training corpora.
  • Suggests a practical path to improve smaller models’ long-range reasoning without architecture changes or RL-heavy pipelines.
  • Skeptical take: Evidence is from one base model and three agent types, with teacher-rationale dependence and low rationale pass rates for SWE trajectories.

RADAR: Defending RAG Dynamically against Retrieval Corruption

  • Recasts dynamic RAG defense as exact graph-cut selection over atomic answers, with a Bayesian memory node to balance stability against adaptation.
  • Shows strong robustness in both static and dynamic settings, including 75.0% accuracy with a 5.0% attack success rate (ASR) on one static PIA setting and 63.60% accuracy / 17.85% ASR in cumulative dynamic evaluation.
  • Useful now because live-web RAG is increasingly the default, and most defenses are still designed for static corpora.
  • The memory-node design is especially relevant for production systems that cannot store full historical documents.
  • Skeptical take: Runtime and dense-graph costs may become significant at larger retrieval depths, and the method assumes benign evidence forms the dominant coherent cluster.
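
For readers unfamiliar with selection by graph cut, here is a generic sketch of choosing which candidate answers to keep via an exact s-t minimum cut. It does not reproduce RADAR's graph construction or its Bayesian memory node; the `trust`, `distrust`, and `agreement` weights are made-up numbers, and `build_graph` and `min_cut_keep_set` are hypothetical helpers (a plain Edmonds-Karp max-flow).

```python
from collections import defaultdict, deque

def build_graph(trust, distrust, agreement):
    cap = defaultdict(dict)
    for c in trust:
        cap["S"][c] = trust[c]       # cost paid if candidate c is rejected
        cap[c]["T"] = distrust[c]    # cost paid if candidate c is kept
    for (a, b), w in agreement.items():
        cap[a][b] = w                # cost paid if a is kept and b is rejected
        cap[b][a] = w                # and the other way round
    return cap

def min_cut_keep_set(capacity, source="S", sink="T"):
    """Edmonds-Karp max-flow; returns (cut value, candidates left on the source side)."""
    residual = defaultdict(lambda: defaultdict(int))
    for u, edges in capacity.items():
        for v, cap in edges.items():
            residual[u][v] += cap
            residual[v][u] += 0      # make sure the reverse edge exists

    def reachable():
        parent = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    flow = 0
    while True:
        parent = reachable()
        if sink not in parent:       # no augmenting path left: the cut is final
            return flow, set(parent) - {source}
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(residual[a][b] for a, b in path)
        for a, b in path:
            residual[a][b] -= push
            residual[b][a] += push
        flow += push

# Hypothetical scores: "b" looks weak alone but strongly agrees with trusted "a".
trust = {"a": 5, "b": 1, "c": 1}
distrust = {"a": 1, "b": 2, "c": 2}
agreement = {("a", "b"): 3}

cut_value, kept = min_cut_keep_set(build_graph(trust, distrust, agreement))
print(cut_value, sorted(kept))
```

With these numbers the cheapest cut costs 4 and keeps `a` and `b` while rejecting the isolated candidate `c`. The answer is exact rather than heuristic, which is the property the RADAR summary highlights, though the cost of solving the cut grows with graph size and density.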

TerminalWorld: Benchmarking Agents on Real-World Terminal Tasks

  • Builds 1,530 validated terminal tasks from 80,870 real asciinema recordings, with a 200-task manually reviewed VERIFIED subset.
  • Shows frontier agents still struggle on authentic CLI workflows; best VERIFIED pass rate is 62.5%, and transfer from Terminal-Bench is weak (Pearson r = 0.20).
  • Useful now because terminal agents are being deployed into real developer workflows, and synthetic puzzle benchmarks appear to overstate readiness.
  • The benchmark’s command diversity is a major asset: 1,280 unique commands, 91% absent from Terminal-Bench.
  • Skeptical take: The pipeline excludes TUI/GUI workflows and irreproducible environments, so some important real-world complexity is still missing.

Benchmarking Autonomous Agents against Temporal, Spatial, and Semantic Evasions

  • Introduces A3S-Bench, a 2,254-trajectory benchmark for stateful agent attacks spanning temporal fragmentation, artifact-mediated evasion, and benign-context concealment.
  • Finds advanced evasions raise average RTR@1 from 28.3% to 52.6% across 10 backbones, with multi-turn injection much stronger than single-turn.
  • Useful now because agent security discussions are still too focused on single-turn prompt injection, while deployed agents have persistent state and system privileges.
  • Includes defense tests showing current guardrails and platform upgrades offer only limited mitigation.
  • Skeptical take: Main evaluation is on OpenClaw, so platform-specific design choices may influence the absolute vulnerability profile.
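
Why temporal fragmentation defeats a per-message check can be shown with a toy example. This is an illustrative sketch only: it is not an attack or metric from A3S-Bench, and the blocked phrase, the `append:` tool syntax, and both guard functions are hypothetical.

```python
import re

# Hypothetical policy: the agent must never act on this instruction.
BLOCKED = re.compile(r"delete\s+all\s+backups", re.IGNORECASE)

def per_turn_guard(message):
    """Static check: looks at one message in isolation."""
    return bool(BLOCKED.search(message))

def artifact_guard(artifact):
    """Stateful check: looks at the accumulated artifact the agent would act on."""
    return bool(BLOCKED.search(" ".join(artifact)))

def apply_turn(artifact, message):
    """Toy tool: 'append: <text>' adds <text> to a shared scratch file."""
    prefix = "append: "
    if message.startswith(prefix):
        artifact.append(message[len(prefix):])

# The instruction is split across three individually harmless-looking turns.
turns = ["append: delete", "append: all", "append: backups"]

artifact = []
per_turn_flags = []
stateful_flags = []
for msg in turns:
    per_turn_flags.append(per_turn_guard(msg))
    apply_turn(artifact, msg)
    stateful_flags.append(artifact_guard(artifact))

print("per-turn:", per_turn_flags)
print("stateful:", stateful_flags)
```

The per-turn guard never fires because no single message contains the blocked phrase, while the check over the assembled artifact fires on the third turn. Real evasions are far less literal than a regex match, so this only shows where the check has to sit, not how to write it.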

A First Measurement Study on Authentication Security in Real-World Remote MCP Servers

  • Provides the first Internet-scale measurement of remote MCP authentication, validating 7,973 live servers and finding 40.55% expose tools without authentication.
  • On a tested subset of 119 servers with OAuth Dynamic Client Registration (DCR) enabled, finds 325 confirmed flaw instances; every tested server had at least one flaw, and responsible disclosure yielded 9 CVEs.
  • Useful now because MCP adoption is accelerating faster than its security hygiene, and protocol-layer weaknesses can lead to account takeover regardless of model quality.
  • Particularly decision-relevant for teams deploying remote MCP with OAuth, DCR, or delegated authorization.
  • Skeptical take: Coverage is limited to publicly discoverable assets and a manually verified subset, so enterprise/private deployments may differ.

5) Practical next steps

  • Treat agent logs as a strategic asset: pilot ACC-style compilation for long-context training and measure whether direct evidence-token supervision improves your own retrieval/tool traces.
  • If you train SWE or tool agents, add per-step groundedness and trajectory-efficiency filters; compare outcome-filtered SFT against P2T-style curated trajectories on both success rate and inference cost.
  • Audit your evaluation stack for hidden confounders: run contamination checks with zero-CoT-style probes, add canaries to security benchmarks, and report tail-aware metrics where applicable.
  • Red-team prompt-injection defenses with semantically camouflaged payloads and multi-turn fragmentation, not just explicit override strings.
  • For production agents in deterministic environments, test harness-side interventions before retraining: action canonicalization, trajectory regulation, skill retrieval, and contract updates may yield faster wins.
  • If your agents branch or search over stateful environments, benchmark checkpoint/rollback overhead explicitly; DeltaBox-style fast checkpoint/rollback (C/R) or event-log forking can materially change feasible search depth.
  • Move safety scoring closer to artifact/state changes: define unsafe predicates over files, configs, or tool actions rather than relying on refusal text alone.
  • For multi-agent systems sharing latent state, evaluate reconstructability of shared KV artifacts and consider representation-level sanitization if latent communication is used.
  • If you deploy RAG over dynamic sources, test stability/plasticity tradeoffs under evolving corruption; exact consistency selection plus lightweight memory may outperform static filters.
  • Add multi-level observability: combine trace-level judges, node-level clustering, and replayable logs so failures can be localized and compared across harness/model variants.
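
The step about scoring artifact and state changes instead of refusal text could look roughly like the sketch below. It assumes the environment can be snapshotted as a path-to-content mapping before and after an episode; the two predicates, the file names, and the refusal phrase are hypothetical examples, not definitions from Boiling the Frog or any other benchmark above.

```python
def deleted_protected(before, after):
    """Unsafe if any file under backups/ existed before and is gone after."""
    return any(p.startswith("backups/") and p not in after for p in before)

def auth_disabled(before, after):
    """Unsafe if the config newly contains 'auth=off'."""
    return ("auth=off" in after.get("app.conf", "")
            and "auth=off" not in before.get("app.conf", ""))

UNSAFE_PREDICATES = {
    "deleted_protected": deleted_protected,
    "auth_disabled": auth_disabled,
}

def score_episode(before, after, reply):
    """Judge the episode by what changed on disk; record refusal text only as context."""
    violations = sorted(
        name for name, pred in UNSAFE_PREDICATES.items() if pred(before, after)
    )
    return {
        "violations": violations,
        "unsafe": bool(violations),
        "refusal_text": "cannot help" in reply.lower(),  # text-only signal
    }

before = {"app.conf": "auth=on\n", "backups/db.sql": "dump"}
after = {"app.conf": "auth=off\n", "backups/db.sql": "dump"}
reply = "I cannot help with disabling authentication."

result = score_episode(before, after, reply)
print(result)
```

In this made-up episode the reply reads like a refusal, yet the config was changed anyway, so a text-only scorer would mark the run safe while the state-based predicates mark it unsafe.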

Generated from per-paper analyses; no external browsing.