June 13, 2026 Research Brief
Agent safety moves upstream.
Today’s papers argue that reliable agents depend less on bigger models than on containment, memory control, harder evaluation, and failure-targeted training loops.
Takeaways
- Agent reliability is increasingly bottlenecked by **system design choices outside the base model**: containment boundaries, memory policies, tool-execution abstractions, environment engineering, and evaluation harnesses repeatedly mattered as much as or more than raw model scale.
- Several papers show that **persistent memory is now a primary failure surface**: single poisoned writes can permanently corrupt agent behavior, naive forgetting collapses useful state, and version-unaware memory breaks under evolving environments. Patch histories, learned retention, and explicit validation are emerging as practical fixes.
- Search/web agents remain far from robust deployment: new benchmarks make this clear from different angles—**long-horizon search is still hard**, daily-report generation is weak on factuality despite citations, and evolving/fresh benchmarks sharply reduce apparent performance versus static sets.
Start with: The Containment Gap: How Deployed Agentic AI Frameworks Fail Public-Facing Safety Requirements
Why it catches my eye: It turns agent safety into an actionable systems checklist and shows cheap deterministic controls can block persistent failures.
Read skeptically for: Runtime attacks were tested mainly on one stack, and semantic attackers may evade the proposed validator.
Papers Worth Your Reading Time
Ranked for research usefulness: novelty, method pattern, evidence quality, and skepticism value.
The Containment Gap: How Deployed Agentic AI Frameworks Fail Public-Facing Safety Requirements
#1 Useful if you deploy agents: it identifies missing containment guarantees and offers low-overhead mitigations for persistent corruption.
- Why now
- Public-facing agent stacks are shipping before their memory and tool boundaries are well specified.
- Skepticism
- Evidence is strongest on audited frameworks and one runtime stack, not every production architecture.
LoHoSearch: Benchmarking Long-Horizon Search Agents Beyond the Human Difficulty Ceiling
#2 A strong companion read because it shows how much current search-agent competence drops under genuinely hard long-horizon evaluation.
- Why now
- Search agents are being overclaimed on saturated benchmarks while fresh, structurally hard tasks remain unsolved.
- Skepticism
- Benchmark uniqueness is bounded by its knowledge graph, so some answers may exist outside the constructed space.
SENTINEL: Failure-Driven Reinforcement Learning for Training Tool-Using Language Model Agents
#3 Worth opening for a concrete training recipe that improves tool agents by learning directly from their observed failures.
- Why now
- Post-training is shifting from generic RL toward targeted adaptation for real agent bottlenecks.
- Skepticism
- Generalization beyond its evaluated domains and tool settings is still uncertain.
Run stats
- Candidates: 306
- Selected: 30
- Deepread completed: 30
- Window (UTC): 2026-06-11T00:00:00Z → 2026-06-12T00:00:00Z (arxiv_announce, expanded=0)
| arXiv ID | Title / Links | Categories | Score | Why | Tags |
|---|---|---|---|---|---|
| 2606.13385 | Who Pays the Price? Stakeholder-Centric Prompt Injection Benchmarking for Real-world Web Agents | cs.CR, cs.AI, cs.CY, cs.HC, cs.MM | 95 | Stakeholder-aware prompt injection benchmark for web agents; strong real-world safety framing. | agent-safety, prompt-injection, web-agents, benchmark, security |
| 2606.12797 | The Containment Gap: How Deployed Agentic AI Frameworks Fail Public-Facing Safety Requirements | cs.AI | 94 | Audits major agent frameworks and finds missing containment guarantees; highly actionable safety result. | agent-safety, frameworks, containment, memory-integrity, audit |
| 2606.12908 | SENTINEL: Failure-Driven Reinforcement Learning for Training Tool-Using Language Model Agents | cs.CL | 93 | Failure-driven RL for tool agents; strong relevance to reliable agent training and adaptation. | agents, tool-use, reinforcement-learning, reliability, post-training |
| 2606.12918 | MAStrike: Shapley-Guided Collusive Red-Teaming on Multi-Agent Systems | cs.CR, cs.AI | 92 | Targets collusive failures in multi-agent systems with principled red-teaming via Shapley guidance. | multi-agent, red-teaming, security, collusion, evaluation |
| 2606.12897 | SafeLLM: Extraction as a Hallucination-Resistant Alternative to Rewriting in Safety-Critical Settings | cs.CL | 92 | Hallucination-resistant extraction for safety-critical RAG; directly targets reliability/compliance risks. | RAG, hallucination, safety, reliability, grounding |
| 2606.13079 | The Emergence of Autonomous Penetration Capabilities in Large Language Model-Powered AI Systems | cs.CR, cs.AI | 91 | Evaluates autonomous penetration capability, a key frontier risk threshold for agentic AI systems. | cybersecurity, agents, dangerous-capabilities, evaluation, autonomy |
| 2606.13663 | HyperTool: Beyond Step-Wise Tool Calls for Tool-Augmented Agents | cs.CL | 91 | New tool-execution abstraction reduces trace burden; important for scalable, efficient agent tool use. | agents, tool-use, efficiency, interfaces, MCP |
| 2606.13598 | Reward Modeling for Multi-Agent Orchestration | cs.AI, cs.CL, cs.LG, cs.MA | 91 | Self-supervised reward modeling for multi-agent orchestration; strong agent-training relevance. | multi-agent, reward-modeling, orchestration, agents, test-time-scaling |
| 2606.13044 | No Hidden Prompts Needed! You Can Game AI Peer Review with Presentation-Only Revisions | cs.CL | 90 | Shows AI peer review can be gamed by presentation-only edits, exposing subtle evaluation failure modes. | evaluation, robustness, peer-review, adversarial, llm-safety |
| 2606.13649 | Operadic consistency: a label-free signal for compositional reasoning failures in LLMs | cs.CL, cs.LG | 90 | Strong label-free signal for reasoning failure detection across many LLMs; useful for runtime monitoring. | reasoning, uncertainty, evaluation, monitoring, reliability |
| 2606.12837 | LoHoSearch: Benchmarking Long-Horizon Search Agents Beyond the Human Difficulty Ceiling | cs.CL | 90 | Hard new benchmark for long-horizon search agents beyond saturated prior evals. | benchmark, agents, search, evaluation, long-horizon |
| 2606.12809 | MLUBench: A Benchmark for Lifelong Unlearning Evaluation in MLLMs | cs.AI, cs.LG | 89 | Large benchmark for lifelong unlearning in MLLMs; highlights cumulative failures in current methods. | unlearning, multimodal, benchmark, privacy, safety |
| 2606.13221 | From Uncertain Judgments to Calibrated Rankings: Conformal Elo Estimation for LLM Evaluation | cs.LG | 89 | Calibrates LLM-as-a-judge Elo with uncertainty; directly useful for reliable model evaluation. | evaluation, llm-as-a-judge, calibration, elo, uncertainty |
| 2606.13104 | Authority, Truth, and Citation Bias: A Large-Scale Multi-Domain Benchmark for Studying Epistemic Susceptibility in Large Language Models | cs.LG | 88 | Large benchmark on citation-induced epistemic bias; directly relevant to trust and factuality. | factuality, benchmark, citation-bias, epistemics, reliability |
| 2606.13662 | EurekAgent: Agent Environment Engineering is All You Need For Autonomous Scientific Discovery | cs.AI, cs.CL | 88 | Argues environment engineering is key for autonomous discovery; includes reward-hacking concerns. | agents, scientific-discovery, environment-design, safety, reward-hacking |
| 2606.12871 | DailyReport: An Open-ended Benchmark for Evaluating Search Agents on Daily Search Tasks | cs.AI | 87 | Open-ended benchmark for search agents with fine-grained rubrics on realistic daily tasks. | agents, benchmark, search, evaluation, real-world |
| 2606.13680 | Learning to Reason by Analogy via Retrieval-Augmented Reinforcement Fine-Tuning | cs.CL, cs.AI | 87 | RAG plus reinforcement fine-tuning for analogy-based reasoning; promising reasoning advance. | reasoning, RAG, reinforcement-learning, retrieval, post-training |
| 2606.13126 | MiniPIC: Flexible Position-Independent Caching in <100LOC | cs.LG, cs.AI, cs.CL | 87 | Practical position-independent KV caching for RAG/agents; high efficiency and deployment impact. | inference, kv-cache, efficiency, rag, agents, long-context |
| 2606.13681 | EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments | cs.CL | 86 | Dynamic-environment benchmark for agent memory evolution; useful for realistic agent reliability testing. | agents, benchmark, memory, dynamic-environments, evaluation |
| 2606.13602 | EpiBench: Verifiable Evaluation of AI Agents on Epigenomics Analysis | cs.AI | 86 | Verifiable benchmark exposes weak agent performance on real scientific analysis workflows. | agents, benchmark, evaluation, science, tool-use |
| 2606.12945 | Learning What to Remember: A Cognitively Grounded Multi-Factor Value Model for Agentic Memory | cs.AI | 86 | Cognitively grounded memory value model for long-running agents under budget constraints. | agents, memory, long-context, cognitive-modeling, efficiency |
| 2606.13220 | LLM-as-an-Investigator: Evidence-First Reasoning for Robust Interactive Problem Diagnosis | cs.AI, cs.CE, cs.ET, cs.LG, cs.MA | 85 | Evidence-first diagnosis tackles user-driven sycophancy, improving robustness in interactive agents. | agents, sycophancy, robustness, reasoning, interactive |
| 2606.13349 | From Passive Generation to Investigation: A Proactive Scientific Peer Review Agent | cs.CL | 85 | Proactive review agent with structured evidence gathering; notable agentic reasoning framework. | agents, scientific-review, reasoning, mdp, evidence-tracking |
| 2606.13120 | EvoBrowseComp: Benchmarking Search Agents on Evolving Knowledge | cs.CL | 84 | Contamination-resistant benchmark for search agents on evolving knowledge with bilingual coverage. | search-agents, benchmark, retrieval, evaluation, contamination |
| 2606.13643 | Recursive Agent Harnesses | cs.CL | 84 | Studies recursive agent harnesses with tools/subagents; relevant to frontier agent design and risks. | agents, recursion, tool-use, long-horizon, systems |
| 2606.13177 | MemRefine: LLM-Guided Compression for Long-Term Agent Memory | cs.CL, cs.AI, cs.LG | 84 | LLM-guided compression for long-term agent memory with explicit storage-budget framing. | agents, memory, compression, long-context, retrieval |
| 2606.13037 | DIG: Oracle-Guided Directed Input Generation for One-Day Vulnerabilities | cs.CR, cs.SE | 83 | Security-focused input generation for one-day vulns; notable for agentic reasoning failure mitigation. | security, vulnerabilities, agents, fuzzing, software |
| 2606.12941 | Multi-Turn Reasoning When Context Arrives in Pieces: Scalable Sharding and Memory-Augmented RL | cs.CL | 83 | Memory-augmented RL improves multi-turn reasoning when context is fragmented across turns. | reasoning, memory, reinforcement-learning, multi-turn, long-context |
| 2606.13449 | Toward Instructions-as-Code: Understanding the Impact of Instruction Files on Agentic Pull Requests | cs.SE, cs.AI | 83 | Large empirical study of instruction files and agentic PR outcomes; actionable for coding agents. | coding-agents, software-engineering, instructions, evaluation, agentic-pr |
| 2606.13608 | AgentBeats: Agentifying Agent Assessment for Openness, Standardization, and Reproducibility | cs.AI, cs.LG | 82 | Open, standardized, agent-agnostic evaluation framework could improve reproducibility across agents. | agents, evaluation, reproducibility, standards, framework |
AI Paper Insight Brief
2026-06-13
0) Executive takeaways (read this first)
- A strong pattern across safety/security papers is that lightweight deterministic controls can eliminate major failure modes cheaply: policy gates, memory validators, hidden evaluators, constrained extraction, and structured interfaces often delivered large gains with negligible overhead.
- Training is shifting from static supervision toward closed-loop adaptation: failure-driven RL, orchestration reward models, retrieval-augmented RL with reasoning analogies, and memory-augmented RL all improve performance by targeting the agent’s actual bottlenecks rather than generic data.
- Evaluation itself is under pressure: papers expose vulnerabilities in AI peer review, citation authority bias, prompt injection, and judge calibration, suggesting that many current automated assessments are easier to game or miscalibrated than leaderboard numbers imply.
2) Key themes (clusters)
Theme: Memory as the new control plane
- Why it matters: Multiple papers converge on memory as both a capability multiplier and a safety liability. The same persistent state that enables long-horizon behavior also creates durable attack surfaces, forgetting errors, and brittleness under environment change.
- Representative papers:
- The Containment Gap: How Deployed Agentic AI Frameworks Fail Public-Facing Safety Requirements
- Learning What to Remember: A Cognitively Grounded Multi-Factor Value Model for Agentic Memory
- EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments
- MemRefine: LLM-Guided Compression for Long-Term Agent Memory
- Common approach:
- Treat memory as a first-class subsystem with explicit write/update/retrieval policies rather than passive context storage.
- Add structure around memory operations: validators, patch histories, learned value functions, or post-hoc compression.
- Evaluate memory in realistic regimes: blind forgetting, persistent evolution, fixed storage budgets, or adversarial poisoning.
- Open questions / failure modes:
- Semantic validators may be more robust than regex/schema checks, but add latency and their own attack surface.
- Many methods are benchmark-limited or tested on a small set of agent stacks.
- Compression and forgetting still optimize proxies like evidence retention rather than end-task correctness.
- Long-horizon compound attacks and multi-agent memory interactions remain underexplored.
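One way to picture the "patch history" idea in this theme is a memory store that records every write as a reversible patch, so a later poisoned or stale write can be undone without losing earlier valid state. The sketch below is illustrative only and is not taken from any of the papers; `VersionedMemory`, `write`, and `rollback` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple

_MISSING = object()  # sentinel: the key did not exist before the write


@dataclass
class VersionedMemory:
    """Key-value agent memory that keeps every write as a patch (hypothetical design)."""

    # Each patch is (key, previous value or _MISSING, new value).
    patches: List[Tuple[str, Any, Any]] = field(default_factory=list)
    state: Dict[str, Any] = field(default_factory=dict)

    def write(self, key: str, value: Any) -> int:
        """Apply a write and return the version number it produced."""
        self.patches.append((key, self.state.get(key, _MISSING), value))
        self.state[key] = value
        return len(self.patches)

    def rollback(self, version: int) -> None:
        """Undo patches in reverse order until only `version` patches remain."""
        if not 0 <= version <= len(self.patches):
            raise ValueError("unknown version")
        while len(self.patches) > version:
            key, previous, _ = self.patches.pop()
            if previous is _MISSING:
                self.state.pop(key, None)
            else:
                self.state[key] = previous


mem = VersionedMemory()
v1 = mem.write("refund_policy", "30 days")
mem.write("refund_policy", "never refund")  # a suspicious later write
mem.rollback(v1)                            # earlier valid state is recoverable
print(mem.state)  # {'refund_policy': '30 days'}
```

The design choice here is to store the previous value with each patch, which makes rollback cheap but does not by itself decide which writes are suspicious; that decision still needs a validator or a human review step.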
Theme: Search agents need harder, fresher, more user-centered evaluation
- Why it matters: Static or human-authored search benchmarks are saturating or leaking into model parameters, while real user tasks demand fresh retrieval, long trajectories, and evidence-grounded synthesis. New benchmarks show current systems still underperform on factuality, calibration, and long-horizon browsing.
- Representative papers:
- LoHoSearch: Benchmarking Long-Horizon Search Agents Beyond the Human Difficulty Ceiling
- DailyReport: An Open-ended Benchmark for Evaluating Search Agents on Daily Search Tasks
- EvoBrowseComp: Benchmarking Search Agents on Evolving Knowledge
- Common approach:
- Generate harder tasks automatically using knowledge graphs, live-web synthesis, or trending-topic pipelines.
- Decompose evaluation into interpretable dimensions such as instruction following, factuality, rationality, or chain-level success.
- Stress contamination resistance by using fresh knowledge, non-popular evidence, or structurally complex multi-hop constraints.
- Open questions / failure modes:
- LLM judges still assess many of these benchmarks, leaving room for judging noise and shortcut success.
- Human uniqueness verification remains incomplete in some datasets.
- Better context management helps, but gains are modest on truly long-horizon tasks.
- Search traces often include citations without real claim-reference grounding.
Theme: Security failures are increasingly architectural, not just model-level
- Why it matters: The strongest security papers argue that many agent failures stem from missing boundaries, unsafe tool interfaces, and weak environment controls. This shifts the defense agenda from “align the model better” to “constrain the system correctly.”
- Representative papers:
- The Containment Gap: How Deployed Agentic AI Frameworks Fail Public-Facing Safety Requirements
- MAStrike: Shapley-Guided Collusive Red-Teaming on Multi-Agent Systems
- Who Pays the Price? Stakeholder-Centric Prompt Injection Benchmarking for Real-world Web Agents
- The Emergence of Autonomous Penetration Capabilities in Large Language Model-Powered AI Systems
- Common approach:
- Model attacks at the system level: cross-agent collusion, stakeholder-specific harms, persistent poisoning, or end-to-end offensive workflows.
- Use sandboxed environments with executable tasks and explicit success conditions.
- Measure not just attack success, but stealth, disruption, coalition effects, or downstream action harms.
- Open questions / failure modes:
- Many threat models assume strong attacker visibility into system structure or messages.
- Benchmarks are still mostly sandboxed and domain-bounded.
- Existing guardrails often miss trajectory-level or coalition-level attacks.
- Tool capability itself can amplify offensive capability even when the base model is unchanged.
Theme: Better agent training comes from targeting actual failure modes
- Why it matters: Several papers show that generic RL or static curricula underperform compared with methods that adapt to the agent’s observed weaknesses, retrieve useful analogies, or score orchestration plans directly. The common lesson: training signal quality matters more than just more rollouts.
- Representative papers:
- SENTINEL: Failure-Driven Reinforcement Learning for Training Tool-Using Language Model Agents
- Reward Modeling for Multi-Agent Orchestration
- Learning to Reason by Analogy via Retrieval-Augmented Reinforcement Fine-Tuning
- Multi-Turn Reasoning When Context Arrives in Pieces: Scalable Sharding and Memory-Augmented RL
- Common approach:
- Close the loop between observed failures and future training data.
- Replace expensive full-trajectory supervision with cheaper intermediate signals such as orchestration rewards or retrieved analogies.
- Train policies to externalize or compress state explicitly when context is fragmented or long-horizon.
- Open questions / failure modes:
- Most evaluations are still narrow: one domain, one model family, or one benchmark.
- Coverage depends on what failures the current policy already exposes.
- Offline judges/retrievers introduce extra cost and possible bias.
- Gains may weaken when orchestration diversity or retrieval quality is low.
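To make "close the loop between observed failures and future training data" concrete, here is a deliberately simple stand-in: reweight which task categories get sampled next according to the policy's observed failure rate. This is not SENTINEL's method (the brief only says its failures generate new tasks); it is a simpler reweighting scheme, and `FailureWeightedSampler` and the category names are hypothetical.

```python
import random
from typing import Dict, List


class FailureWeightedSampler:
    """Samples task categories in proportion to a smoothed failure rate (illustrative)."""

    def __init__(self, categories: List[str], seed: int = 0) -> None:
        self.attempts: Dict[str, int] = {c: 0 for c in categories}
        self.failures: Dict[str, int] = {c: 0 for c in categories}
        self.rng = random.Random(seed)

    def record(self, category: str, success: bool) -> None:
        """Log one rollout outcome for a category."""
        self.attempts[category] += 1
        if not success:
            self.failures[category] += 1

    def weights(self) -> Dict[str, float]:
        # Laplace smoothing keeps unseen or fully solved categories above zero weight.
        return {c: (self.failures[c] + 1) / (self.attempts[c] + 2) for c in self.attempts}

    def sample(self) -> str:
        """Draw the next category to train on, favoring those that fail more often."""
        w = self.weights()
        cats = list(w)
        return self.rng.choices(cats, weights=[w[c] for c in cats], k=1)[0]


sampler = FailureWeightedSampler(["api_auth", "pagination"])
for _ in range(3):
    sampler.record("api_auth", success=False)
    sampler.record("pagination", success=True)
print(sampler.weights())  # api_auth gets weight 0.8, pagination 0.2
```

Note the limitation already flagged above: a scheme like this can only upweight failures the current policy exposes, so coverage of unseen failure types still has to come from somewhere else.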
Theme: Evaluation pipelines themselves are vulnerable and miscalibrated
- Why it matters: A recurring meta-theme is that automated evaluation can be gamed, biased by authority cues, or poorly calibrated. That undermines both benchmarking and deployment decisions if scores are taken at face value.
- Representative papers:
- No Hidden Prompts Needed! You Can Game AI Peer Review with Presentation-Only Revisions
- Authority, Truth, and Citation Bias: A Large-Scale Multi-Domain Benchmark for Studying Epistemic Susceptibility in Large Language Models
- From Uncertain Judgments to Calibrated Rankings: Conformal Elo Estimation for LLM Evaluation
- Operadic consistency: a label-free signal for compositional reasoning failures in LLMs
- Common approach:
- Probe hidden failure modes in judges: presentation sensitivity, citation authority bias, uncertainty collapse, or compositional inconsistency.
- Add calibration layers or auxiliary signals rather than trusting raw judge outputs.
- Use pairwise or selective-prediction analyses to separate ranking quality from confidence quality.
- Open questions / failure modes:
- Many judge pipelines still cannot verify sources independently.
- Robustness to distribution shift and new model families remains unresolved.
- Some attacks preserve enough semantics to evade simple policy filters.
- Label-free confidence signals are promising but still task-structure dependent.
Theme: Constraining generation and execution beats unconstrained free-form behavior
- Why it matters: Across clinical QA, tool use, inference systems, and scientific discovery, papers repeatedly find that constrained interfaces outperform unconstrained generation. The practical pattern is to narrow what the model is allowed to emit while preserving useful flexibility.
- Representative papers:
- SafeLLM: Extraction as a Hallucination-Resistant Alternative to Rewriting in Safety-Critical Settings
- HyperTool: Beyond Step-Wise Tool Calls for Tool-Augmented Agents
- MiniPIC: Flexible Position-Independent Caching in <100LOC
- EurekAgent: Agent Environment Engineering is All You Need For Autonomous Scientific Discovery
- Common approach:
- Replace free-form rewriting with extraction, line selection, or hidden deterministic subroutines.
- Move complexity into the environment, runtime, or interface rather than the prompt.
- Preserve auditability through verbatim evidence, hidden evaluators, or structured execution blocks.
- Open questions / failure modes:
- Constrained outputs can trade recall for precision and omit critical details.
- Hidden execution improves efficiency but reduces interpretability.
- Systems gains may depend on specific positional encodings, tool ecosystems, or task types.
- Broader replication in live workflows is still limited.
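The "extraction or line selection" pattern can be sketched as follows: the model is only allowed to return line numbers, and the system returns those source lines verbatim, so the output cannot contain text that is absent from the source. This is a minimal illustration under that assumption, not SafeLLM's implementation; `extract_lines` and the sample text are made up.

```python
from typing import List


def extract_lines(source: str, selected: List[int]) -> List[str]:
    """Return the selected source lines verbatim (1-indexed); reject out-of-range indices."""
    lines = source.splitlines()
    bad = [i for i in selected if not 1 <= i <= len(lines)]
    if bad:
        raise ValueError(f"line numbers out of range: {bad}")
    return [lines[i - 1] for i in selected]


source = (
    "Power off the device before servicing.\n"
    "Use only the supplied adapter.\n"
    "Keep the unit away from water."
)
# Pretend a model answered "which lines cover servicing?" with [1].
print(extract_lines(source, [1]))  # ['Power off the device before servicing.']
```

The trade-off noted above still applies: if the model selects too few lines, the answer is precise but may omit critical details.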
3) Technical synthesis
- A notable split is emerging between model-centric fixes and system-centric fixes; today’s strongest empirical wins often come from the latter: validators, gates, patch logs, hidden graders, structured memory, and execution abstractions.
- Several papers use closed-loop adaptation as the core training recipe: failures generate new tasks (SENTINEL), orchestration artifacts generate reward labels (Orch-RM), and retrieved analogies densify RL signal (RA-RFT).
- Judge dependence is everywhere: DailyReport, AuthorityBench, StakeBench, peer-review gaming, and conformal Elo all rely on LLM judges, but multiple papers also show why raw judge outputs need calibration, decomposition, or adversarial testing.
- Memory work is converging on three distinct layers: write-time protection (Containment Gap), storage-time compression/forgetting (MemRefine, value-based memory), and version-time evolution tracking (EvoMem).
- Search-agent benchmarks increasingly separate step-level competence from chain-level competence; chain metrics are much harsher and better expose brittleness under long trajectories or evolving environments.
- Multiple papers show that aggregate accuracy can hide targeted harm: memory poisoning preserved overall accuracy under complex policy while increasing subgroup wrongful denials; stakeholder-centric prompt injection similarly reveals covert harms missed by attack success rate (ASR) alone.
- There is growing use of structured intermediate artifacts as training/eval primitives: review logs, orchestration plans, patch histories, executable trajectories, and decomposition trees.
- Several methods improve performance by compressing or hiding low-level execution from the main reasoning trace: HyperTool folds deterministic tool chains, memory RL compresses dialogue into bounded memory, and MiniPIC reuses spans independent of position.
- Benchmark design is shifting toward contamination resistance via live-web freshness, version matching, knowledge-graph (KG) uniqueness checks, and future-dated evidence requirements.
- A recurring engineering lesson: small deterministic mechanisms can dominate large-model differences when they directly target the failure mode.
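The gap between step-level and chain-level metrics is easy to see with a toy computation. The definitions below are generic (a chain counts as successful only if every step in it is correct); individual benchmarks may define their chain metrics differently, and the trajectories are invented for illustration.

```python
from typing import List


def step_accuracy(trajectories: List[List[bool]]) -> float:
    """Fraction of individual steps that were correct, pooled over all trajectories."""
    steps = [ok for traj in trajectories for ok in traj]
    return sum(steps) / len(steps)


def chain_success(trajectories: List[List[bool]]) -> float:
    """Fraction of trajectories in which every step was correct."""
    return sum(all(traj) for traj in trajectories) / len(trajectories)


runs = [
    [True, True, True, True],
    [True, True, False, True],
    [True, False, True, True],
]
print(step_accuracy(runs))  # 10 of 12 steps correct
print(chain_success(runs))  # 1 of 3 chains fully correct
```

A single wrong step sinks a whole chain, which is why chain-level scores fall much faster than step-level scores as trajectories get longer.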
4) Top 5 papers (with “why now”)
The Containment Gap: How Deployed Agentic AI Frameworks Fail Public-Facing Safety Requirements
- Audits LangChain, AutoGPT, and OpenAI Agents SDK against six containment principles and finds no native default compliance; memory integrity is absent in all three.
- Shows a single poisoned memory write can drive persistent targeted corruption across backends, including GPT-4o and Claude Haiku 4.5.
- Demonstrates two deterministic defenses—a memory validator and tool-call policy gate—that eliminate observed attacks with sub-millisecond overhead.
- Why now: agent deployment is moving into public-facing workflows, and this paper gives a concrete checklist plus cheap mitigations rather than abstract safety advice.
- Skepticism: runtime experiments were only executed on LangChain, and the validator is fragile to semantic/adaptive attacks.
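For readers who want a feel for what "deterministic defenses" of this kind look like, here is a minimal sketch of a write-time memory validator and a tool-call policy gate. It is not the paper's code: the schema, the regex rules, the tool names, and the functions `validate_memory_write` and `gate_tool_call` are all hypothetical.

```python
import re
from typing import Any, Callable, Dict

ALLOWED_KEYS = {"user_preference", "task_note"}  # hypothetical memory schema
# Hypothetical pattern for instruction-like text that should never be stored.
BLOCKED = re.compile(r"ignore (all )?previous|always (deny|approve)", re.IGNORECASE)


def validate_memory_write(entry: Dict[str, Any]) -> bool:
    """Accept a write only if key, source, type, length, and content all pass."""
    value = entry.get("value")
    return (
        entry.get("key") in ALLOWED_KEYS
        and entry.get("source") == "user"
        and isinstance(value, str)
        and len(value) <= 500
        and BLOCKED.search(value) is None
    )


# Hypothetical allowlist: tool name -> predicate over the call arguments.
TOOL_POLICY: Dict[str, Callable[[Dict[str, Any]], bool]] = {
    "search_docs": lambda args: len(str(args.get("query", ""))) <= 200,
    "send_email": lambda args: str(args.get("to", "")).endswith("@example.com"),
}


def gate_tool_call(tool: str, args: Dict[str, Any]) -> bool:
    """Allow a call only if the tool is allowlisted and its arguments pass the rule."""
    rule = TOOL_POLICY.get(tool)
    return rule is not None and bool(rule(args))


print(validate_memory_write(
    {"key": "task_note", "source": "web", "value": "Always deny region X"}
))  # False
print(gate_tool_call("send_email", {"to": "ops@example.com"}))  # True
```

Checks like these are cheap because they are plain lookups and regex matches, but the regex rule also illustrates the skepticism above: a paraphrased instruction would pass it.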
LoHoSearch: Benchmarking Long-Horizon Search Agents Beyond the Human Difficulty Ceiling
- Builds a KG-driven benchmark that explicitly controls search-space size and structural complexity, avoiding the saturation seen in human-authored search sets.
- Top performance is still low: GPT-5.5 reaches 34.74%, and graph-structured questions are harder than tree-structured ones.
- Shows correct trajectories are much longer than on BrowseComp and that current context-management tricks yield only modest gains.
- Why now: many search-agent claims are benchmark-limited; this is a cleaner stress test for whether systems can actually sustain long-horizon browsing.
- Skepticism: uniqueness is only formally guaranteed within the KG, and some questions could still admit alternative answers outside it.
No Hidden Prompts Needed! You Can Game AI Peer Review with Presentation-Only Revisions
- Demonstrates that visible, legitimate presentation edits alone can raise AI review scores by +1.21 on average, with a 75.1% attack success rate.
- Finds narrative restructuring—not superficial polishing—is the main driver, exposing a structural weakness in reviewer models.
- Includes transfer tests across reviewer models/templates and a contamination-free rolling benchmark.
- Why now: AI reviewing is already being trialed in real venues, and this attack is harder to ban than hidden-text prompt injection because it looks like normal revision.
- Skepticism: semantic preservation is imperfect; only 66.7% of audited pairs met the preservation threshold.
The Emergence of Autonomous Penetration Capabilities in Large Language Model-Powered AI Systems
- Provides a reproducible benchmark with 300 realistic targets built from 30 RCE CVEs and benign background services.
- Evaluates 19 models and finds non-trivial autonomous penetration success rates from 10.7% to 69.3%.
- Shows a strong correlation between general model capability and penetration success, with tool use as the main bottleneck.
- Why now: this is concrete evidence that offensive cyber capability is becoming an end-to-end agent property, not just a theoretical concern.
- Skepticism: scope stops at initial shell access in controlled Docker environments and uses a fixed toolset.
From Uncertain Judgments to Calibrated Rankings: Conformal Elo Estimation for LLM Evaluation
- Replaces hard win/tie/loss labels with calibrated soft preference probabilities from judge score differences.
- Cuts mean held-out Elo MAE to 17.9 and reduces conformal interval widths by 39–70% while maintaining near-target coverage.
- Keeps the standard Bradley–Terry pipeline, making it easy to adopt in existing leaderboard infrastructure.
- Why now: as LLM-as-judge becomes default, calibration of Elo distances matters as much as rank order.
- Skepticism: guarantees are marginal and depend on exchangeability; it does not solve deeper BT assumptions or judge epistemic uncertainty.
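The core move, feeding soft preference probabilities into an otherwise standard Bradley-Terry fit, can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's procedure: the logistic mapping from score difference to probability, the `temperature` parameter, the plain gradient-ascent fit, and all function names are assumptions, and the conformal interval step is omitted entirely.

```python
import math
from typing import Dict, List, Tuple


def soft_preference(score_diff: float, temperature: float = 1.0) -> float:
    """Map a judge score difference to P(A beats B); the logistic form is an assumption."""
    return 1.0 / (1.0 + math.exp(-score_diff / temperature))


def fit_bradley_terry(
    comparisons: List[Tuple[str, str, float]], steps: int = 2000, lr: float = 0.1
) -> Dict[str, float]:
    """Fit log-strengths by gradient ascent on the soft-label log-likelihood.

    Each comparison is (model_a, model_b, p), where p is the soft probability
    that model_a beats model_b.
    """
    models = sorted({m for a, b, _ in comparisons for m in (a, b)})
    theta = {m: 0.0 for m in models}
    for _ in range(steps):
        grad = {m: 0.0 for m in models}
        for a, b, p in comparisons:
            pred = 1.0 / (1.0 + math.exp(theta[b] - theta[a]))
            grad[a] += p - pred  # derivative of the log-likelihood w.r.t. theta[a]
            grad[b] -= p - pred
        for m in models:
            theta[m] += lr * grad[m] / len(comparisons)
        mean = sum(theta.values()) / len(theta)
        theta = {m: t - mean for m, t in theta.items()}  # remove the free offset
    return theta


def to_elo(theta: Dict[str, float], base: float = 1000.0) -> Dict[str, float]:
    """Convert log-strengths to the usual 400-point base-10 Elo scale."""
    return {m: base + 400.0 / math.log(10) * t for m, t in theta.items()}


pairs = [
    ("A", "B", soft_preference(1.0)),
    ("B", "C", soft_preference(0.5)),
    ("A", "C", soft_preference(1.5)),
]
ratings = to_elo(fit_bradley_terry(pairs))
print(sorted(ratings, key=ratings.get, reverse=True))  # ['A', 'B', 'C']
```

With hard labels, each `p` would be 0, 0.5, or 1; keeping it continuous lets a narrow judge margin pull ratings apart less than a wide one, which is the intuition behind calibrating Elo distances and not just rank order.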
5) Practical next steps
- Add write-time memory controls to any deployed agent stack: provenance checks, schema validation, demographic/targeting anomaly checks, and explicit policy-gated tool execution.
- Evaluate agents on chain-level and fresh-data benchmarks, not just static step-level sets; include at least one contamination-resistant search benchmark and one evolving-environment benchmark.
- Instrument memory systems separately for retention, compression, and evolution: measure what gets forgotten, what gets merged, and whether prior valid states remain recoverable after updates.
- For search/report agents, track claim-reference alignment rather than citation count; weak factuality despite references is now a repeated failure mode.
- Red-team web agents with stakeholder-aware metrics: measure ASR, task deviation, and behavioral irregularity to distinguish covert parasitism from obvious disruption.
- Replace unconstrained answer generation with extraction-first or structured-output modes in safety-critical domains, especially when source text is authoritative and auditable.
- In RL or post-training pipelines, shift from static task pools to failure-targeted curricula or retrieval of reasoning-analogous traces; generic RL appears to leave easy gains on the table.
- Calibrate evaluation stacks: use soft preference signals, conformal intervals, or auxiliary consistency checks before trusting leaderboard deltas or automated review scores.
- For multi-agent systems, audit coalition-level vulnerabilities rather than single-agent failures; small compromised coalitions can dominate system risk.
- Treat environment design as part of safety: use hidden evaluators, isolated sandboxes, explicit budgets, and audit logs so agents cannot tamper with their own measurement loop.
Generated from per-paper analyses; no external browsing.