Takeaways

Agent reliability is increasingly bottlenecked by **system design choices outside the base model**: containment boundaries, memory policies, tool-execution abstractions, environment engineering, and evaluation harnesses repeatedly mattered as much as or more than raw model scale.
Several papers show that **persistent memory is now a primary failure surface**: single poisoned writes can permanently corrupt agent behavior, naive forgetting collapses useful state, and version-unaware memory breaks under evolving environments. Patch histories, learned retention, and explicit validation are emerging as practical fixes.
Search/web agents remain far from robust deployment: new benchmarks make this clear from different angles—**long-horizon search is still hard**, daily-report generation is weak on factuality despite citations, and evolving/fresh benchmarks sharply reduce apparent performance versus static sets.

Start with: The Containment Gap: How Deployed Agentic AI Frameworks Fail Public-Facing Safety Requirements

Why it catches my eye: It turns agent safety into an actionable systems checklist and shows cheap deterministic controls can block persistent failures.

Read skeptically for: Runtime attacks were tested mainly on one stack, and semantic attackers may evade the proposed validator.

agent-safety containment memory-integrity deployment

Themes

Memory as the new control plane: Multiple papers converge on memory as both a capability multiplier and a safety liability. The same persistent state that enables long-horizon behavior also creates durable attack surfaces, forgetting errors, and brittleness under environment change.

Search agents need harder, fresher, more user-centered evaluation: Static or human-authored search benchmarks are saturating or leaking into model parameters, while real user tasks demand fresh retrieval, long trajectories, and evidence-grounded synthesis. New benchmarks show current systems still underperform on factuality, calibration, and long-horizon browsing.

Security failures are increasingly architectural, not just model-level: The strongest security papers argue that many agent failures stem from missing boundaries, unsafe tool interfaces, and weak environment controls. This shifts the defense agenda from “align the model better” to “constrain the system correctly.”

Signal: Safety is becoming architectural. Containment audits, stakeholder prompt-injection tests, and autonomous cyber evaluations all show system boundaries and tool controls dominate risk.

Tension: Memory helps and destabilizes. Memory papers improve retention and compression, but containment and dynamic-environment results show persistent state is now a major attack surface.

Bet: Closed-loop training will win. Failure-driven RL, orchestration reward models, and retrieval-augmented RL all target observed bottlenecks instead of relying on static supervision.

Papers Worth Your Reading Time

Ranked for research usefulness: novelty, method pattern, evidence quality, and skepticism value.

The Containment Gap: How Deployed Agentic AI Frameworks Fail Public-Facing Safety Requirements

Useful if you deploy agents: it identifies missing containment guarantees and offers low-overhead mitigations for persistent corruption.

Why now: Public-facing agent stacks are shipping before their memory and tool boundaries are well specified.
Skepticism: Evidence is strongest on audited frameworks and one runtime stack, not every production architecture.

LoHoSearch: Benchmarking Long-Horizon Search Agents Beyond the Human Difficulty Ceiling

A strong companion read because it shows how much current search-agent competence drops under genuinely hard long-horizon evaluation.

Why now: Search agents are being overclaimed on saturated benchmarks while fresh, structurally hard tasks remain unsolved.
Skepticism: Benchmark uniqueness is bounded by its knowledge graph, so some answers may exist outside the constructed space.

SENTINEL: Failure-Driven Reinforcement Learning for Training Tool-Using Language Model Agents

Worth opening for a concrete training recipe that improves tool agents by learning directly from their observed failures.

Why now: Post-training is shifting from generic RL toward targeted adaptation for real agent bottlenecks.
Skepticism: Generalization beyond its evaluated domains and tool settings is still uncertain.

Run stats

Candidates: 306
Selected: 30
Deepread completed: 30
Window (UTC): 2026-06-11T00:00:00Z → 2026-06-12T00:00:00Z (arxiv_announce, expanded=0)

arXiv ID	Title / Links	Categories	Score	Why	Tags
`2606.13385`	Who Pays the Price? Stakeholder-Centric Prompt Injection Benchmarking for Real-world Web Agents PDF	cs.CR, cs.AI, cs.CY, cs.HC, cs.MM	95	Stakeholder-aware prompt injection benchmark for web agents; strong real-world safety framing.	agent-safety, prompt-injection, web-agents, benchmark, security
`2606.12797`	The Containment Gap: How Deployed Agentic AI Frameworks Fail Public-Facing Safety Requirements PDF	cs.AI	94	Audits major agent frameworks and finds missing containment guarantees; highly actionable safety result.	agent-safety, frameworks, containment, memory-integrity, audit
`2606.12908`	SENTINEL: Failure-Driven Reinforcement Learning for Training Tool-Using Language Model Agents PDF	cs.CL	93	Failure-driven RL for tool agents; strong relevance to reliable agent training and adaptation.	agents, tool-use, reinforcement-learning, reliability, post-training
`2606.12918`	MAStrike: Shapley-Guided Collusive Red-Teaming on Multi-Agent Systems PDF	cs.CR, cs.AI	92	Targets collusive failures in multi-agent systems with principled red-teaming via Shapley guidance.	multi-agent, red-teaming, security, collusion, evaluation
`2606.12897`	SafeLLM: Extraction as a Hallucination-Resistant Alternative to Rewriting in Safety-Critical Settings PDF	cs.CL	92	Hallucination-resistant extraction for safety-critical RAG; directly targets reliability/compliance risks.	RAG, hallucination, safety, reliability, grounding
`2606.13079`	The Emergence of Autonomous Penetration Capabilities in Large Language Model-Powered AI Systems PDF	cs.CR, cs.AI	91	Evaluates autonomous penetration capability, a key frontier risk threshold for agentic AI systems.	cybersecurity, agents, dangerous-capabilities, evaluation, autonomy
`2606.13663`	HyperTool: Beyond Step-Wise Tool Calls for Tool-Augmented Agents PDF	cs.CL	91	New tool-execution abstraction reduces trace burden; important for scalable, efficient agent tool use.	agents, tool-use, efficiency, interfaces, MCP
`2606.13598`	Reward Modeling for Multi-Agent Orchestration PDF	cs.AI, cs.CL, cs.LG, cs.MA	91	Self-supervised reward modeling for multi-agent orchestration; strong agent-training relevance.	multi-agent, reward-modeling, orchestration, agents, test-time-scaling
`2606.13044`	No Hidden Prompts Needed! You Can Game AI Peer Review with Presentation-Only Revisions PDF	cs.CL	90	Shows AI peer review can be gamed by presentation-only edits, exposing subtle evaluation failure modes.	evaluation, robustness, peer-review, adversarial, llm-safety
`2606.13649`	Operadic consistency: a label-free signal for compositional reasoning failures in LLMs PDF	cs.CL, cs.LG	90	Strong label-free signal for reasoning failure detection across many LLMs; useful for runtime monitoring.	reasoning, uncertainty, evaluation, monitoring, reliability
`2606.12837`	LoHoSearch: Benchmarking Long-Horizon Search Agents Beyond the Human Difficulty Ceiling PDF	cs.CL	90	Hard new benchmark for long-horizon search agents beyond saturated prior evals.	benchmark, agents, search, evaluation, long-horizon
`2606.12809`	MLUBench: A Benchmark for Lifelong Unlearning Evaluation in MLLMs PDF	cs.AI, cs.LG	89	Large benchmark for lifelong unlearning in MLLMs; highlights cumulative failures in current methods.	unlearning, multimodal, benchmark, privacy, safety
`2606.13221`	From Uncertain Judgments to Calibrated Rankings: Conformal Elo Estimation for LLM Evaluation PDF	cs.LG	89	Calibrates LLM-as-a-judge Elo with uncertainty; directly useful for reliable model evaluation.	evaluation, llm-as-a-judge, calibration, elo, uncertainty
`2606.13104`	Authority, Truth, and Citation Bias: A Large-Scale Multi-Domain Benchmark for Studying Epistemic Susceptibility in Large Language Models PDF	cs.LG	88	Large benchmark on citation-induced epistemic bias; directly relevant to trust and factuality.	factuality, benchmark, citation-bias, epistemics, reliability
`2606.13662`	EurekAgent: Agent Environment Engineering is All You Need For Autonomous Scientific Discovery PDF	cs.AI, cs.CL	88	Argues environment engineering is key for autonomous discovery; includes reward-hacking concerns.	agents, scientific-discovery, environment-design, safety, reward-hacking
`2606.12871`	DailyReport: An Open-ended Benchmark for Evaluating Search Agents on Daily Search Tasks PDF	cs.AI	87	Open-ended benchmark for search agents with fine-grained rubrics on realistic daily tasks.	agents, benchmark, search, evaluation, real-world
`2606.13680`	Learning to Reason by Analogy via Retrieval-Augmented Reinforcement Fine-Tuning PDF	cs.CL, cs.AI	87	RAG plus reinforcement fine-tuning for analogy-based reasoning; promising reasoning advance.	reasoning, RAG, reinforcement-learning, retrieval, post-training
`2606.13126`	MiniPIC: Flexible Position-Independent Caching in <100LOC PDF	cs.LG, cs.AI, cs.CL	87	Practical position-independent KV caching for RAG/agents; high efficiency and deployment impact.	inference, kv-cache, efficiency, rag, agents, long-context
`2606.13681`	EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments PDF	cs.CL	86	Dynamic-environment benchmark for agent memory evolution; useful for realistic agent reliability testing.	agents, benchmark, memory, dynamic-environments, evaluation
`2606.13602`	EpiBench: Verifiable Evaluation of AI Agents on Epigenomics Analysis PDF	cs.AI	86	Verifiable benchmark exposes weak agent performance on real scientific analysis workflows.	agents, benchmark, evaluation, science, tool-use
`2606.12945`	Learning What to Remember: A Cognitively Grounded Multi-Factor Value Model for Agentic Memory PDF	cs.AI	86	Cognitively grounded memory value model for long-running agents under budget constraints.	agents, memory, long-context, cognitive-modeling, efficiency
`2606.13220`	LLM-as-an-Investigator: Evidence-First Reasoning for Robust Interactive Problem Diagnosis PDF	cs.AI, cs.CE, cs.ET, cs.LG, cs.MA	85	Evidence-first diagnosis tackles user-driven sycophancy, improving robustness in interactive agents.	agents, sycophancy, robustness, reasoning, interactive
`2606.13349`	From Passive Generation to Investigation: A Proactive Scientific Peer Review Agent PDF	cs.CL	85	Proactive review agent with structured evidence gathering; notable agentic reasoning framework.	agents, scientific-review, reasoning, mdp, evidence-tracking
`2606.13120`	EvoBrowseComp: Benchmarking Search Agents on Evolving Knowledge PDF	cs.CL	84	Contamination-resistant benchmark for search agents on evolving knowledge with bilingual coverage.	search-agents, benchmark, retrieval, evaluation, contamination
`2606.13643`	Recursive Agent Harnesses PDF	cs.CL	84	Studies recursive agent harnesses with tools/subagents; relevant to frontier agent design and risks.	agents, recursion, tool-use, long-horizon, systems
`2606.13177`	MemRefine: LLM-Guided Compression for Long-Term Agent Memory PDF	cs.CL, cs.AI, cs.LG	84	LLM-guided compression for long-term agent memory with explicit storage-budget framing.	agents, memory, compression, long-context, retrieval
`2606.13037`	DIG: Oracle-Guided Directed Input Generation for One-Day Vulnerabilities PDF	cs.CR, cs.SE	83	Security-focused input generation for one-day vulns; notable for agentic reasoning failure mitigation.	security, vulnerabilities, agents, fuzzing, software
`2606.12941`	Multi-Turn Reasoning When Context Arrives in Pieces: Scalable Sharding and Memory-Augmented RL PDF	cs.CL	83	Memory-augmented RL improves multi-turn reasoning when context is fragmented across turns.	reasoning, memory, reinforcement-learning, multi-turn, long-context
`2606.13449`	Toward Instructions-as-Code: Understanding the Impact of Instruction Files on Agentic Pull Requests PDF	cs.SE, cs.AI	83	Large empirical study of instruction files and agentic PR outcomes; actionable for coding agents.	coding-agents, software-engineering, instructions, evaluation, agentic-pr
`2606.13608`	AgentBeats: Agentifying Agent Assessment for Openness, Standardization, and Reproducibility PDF	cs.AI, cs.LG	82	Open, standardized, agent-agnostic evaluation framework could improve reproducibility across agents.	agents, evaluation, reproducibility, standards, framework

AI Paper Insight Brief

2026-06-13

0) Executive takeaways (read this first)

Agent reliability is increasingly bottlenecked by system design choices outside the base model: containment boundaries, memory policies, tool-execution abstractions, environment engineering, and evaluation harnesses repeatedly mattered as much as or more than raw model scale.
Several papers show that persistent memory is now a primary failure surface: single poisoned writes can permanently corrupt agent behavior, naive forgetting collapses useful state, and version-unaware memory breaks under evolving environments. Patch histories, learned retention, and explicit validation are emerging as practical fixes.
Search/web agents remain far from robust deployment: new benchmarks make this clear from different angles—long-horizon search is still hard, daily-report generation is weak on factuality despite citations, and evolving/fresh benchmarks sharply reduce apparent performance versus static sets.
A strong pattern across safety/security papers is that lightweight deterministic controls can eliminate major failure modes cheaply: policy gates, memory validators, hidden evaluators, constrained extraction, and structured interfaces often delivered large gains with negligible overhead.
Training is shifting from static supervision toward closed-loop adaptation: failure-driven RL, orchestration reward models, retrieval-augmented RL with reasoning analogies, and memory-augmented RL all improve performance by targeting the agent’s actual bottlenecks rather than generic data.
Evaluation itself is under pressure: papers expose vulnerabilities in AI peer review, citation authority bias, prompt injection, and judge calibration, suggesting that many current automated assessments are easier to game, or more miscalibrated, than leaderboard numbers imply.

2) Key themes (clusters)

Theme: Memory as the new control plane

Why it matters: Multiple papers converge on memory as both a capability multiplier and a safety liability. The same persistent state that enables long-horizon behavior also creates durable attack surfaces, forgetting errors, and brittleness under environment change.
Representative papers:
Common approach:
- Treat memory as a first-class subsystem with explicit write/update/retrieval policies rather than passive context storage.
- Add structure around memory operations: validators, patch histories, learned value functions, or post-hoc compression.
- Evaluate memory in realistic regimes: blind forgetting, persistent evolution, fixed storage budgets, or adversarial poisoning.
Open questions / failure modes:
- Semantic validators may be more robust than regex/schema checks, but add latency and their own attack surface.
- Many methods are benchmark-limited or tested on a small set of agent stacks.
- Compression and forgetting still optimize proxies like evidence retention rather than end-task correctness.
- Long-horizon compound attacks and multi-agent memory interactions remain underexplored.
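
One way to picture the "patch histories" idea is an append-only log of memory writes whose current state is rebuilt by replay. The sketch below is illustrative only and is not taken from any of the papers; `Patch`, `PatchedMemory`, the `source` provenance label, and the example keys are all hypothetical names. It shows how an earlier valid state stays recoverable and how writes from one untrusted source can be dropped after the fact.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass(frozen=True)
class Patch:
    key: str
    value: Any
    source: str  # hypothetical provenance label, e.g. "operator" or "web_page"


@dataclass
class PatchedMemory:
    """Append-only patch log; the current state is a replay of the log."""
    history: List[Patch] = field(default_factory=list)

    def write(self, key: str, value: Any, source: str) -> None:
        # Never overwrite in place: record the write so it can be audited later.
        self.history.append(Patch(key, value, source))

    def state_at(self, version: int) -> Dict[str, Any]:
        """Replay the first `version` patches to rebuild that state."""
        state: Dict[str, Any] = {}
        for patch in self.history[:version]:
            state[patch.key] = patch.value
        return state

    @property
    def state(self) -> Dict[str, Any]:
        return self.state_at(len(self.history))

    def without_source(self, source: str) -> "PatchedMemory":
        """Return a copy with every write from `source` dropped."""
        kept = [p for p in self.history if p.source != source]
        return PatchedMemory(history=kept)


memory = PatchedMemory()
memory.write("refund_policy", "30 days", "operator")
memory.write("refund_policy", "deny all refunds", "web_page")  # poisoned write
memory.write("tone", "formal", "operator")

clean = memory.without_source("web_page")
print(memory.state["refund_policy"])  # deny all refunds
print(clean.state)                    # {'refund_policy': '30 days', 'tone': 'formal'}
```

Replaying the log costs more than overwriting values in place, but in this sketch it is what keeps a single bad write from becoming permanent.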

Theme: Search agents need harder, fresher, more user-centered evaluation

Why it matters: Static or human-authored search benchmarks are saturating or leaking into model parameters, while real user tasks demand fresh retrieval, long trajectories, and evidence-grounded synthesis. New benchmarks show current systems still underperform on factuality, calibration, and long-horizon browsing.
Representative papers:
Common approach:
- Generate harder tasks automatically using knowledge graphs, live-web synthesis, or trending-topic pipelines.
- Decompose evaluation into interpretable dimensions such as instruction following, factuality, rationality, or chain-level success.
- Stress contamination resistance by using fresh knowledge, non-popular evidence, or structurally complex multi-hop constraints.
Open questions / failure modes:
- Many of these benchmarks still rely on LLM judges, leaving room for judging noise and shortcut success.
- Human verification of answer uniqueness remains incomplete in some datasets.
- Better context management helps, but gains are modest on truly long-horizon tasks.
- Search traces often include citations without real claim-reference grounding.

Theme: Security failures are increasingly architectural, not just model-level

Why it matters: The strongest security papers argue that many agent failures stem from missing boundaries, unsafe tool interfaces, and weak environment controls. This shifts the defense agenda from “align the model better” to “constrain the system correctly.”
Representative papers:
Common approach:
- Model attacks at the system level: cross-agent collusion, stakeholder-specific harms, persistent poisoning, or end-to-end offensive workflows.
- Use sandboxed environments with executable tasks and explicit success conditions.
- Measure not just attack success, but stealth, disruption, coalition effects, or downstream action harms.
Open questions / failure modes:
- Many threat models assume strong attacker visibility into system structure or messages.
- Benchmarks are still mostly sandboxed and domain-bounded.
- Existing guardrails often miss trajectory-level or coalition-level attacks.
- Tool capability itself can amplify offensive capability even when the base model is unchanged.

Theme: Better agent training comes from targeting actual failure modes

Why it matters: Several papers show that generic RL or static curricula underperform compared with methods that adapt to the agent’s observed weaknesses, retrieve useful analogies, or score orchestration plans directly. The common lesson: the quality of the training signal matters more than simply adding rollouts.
Representative papers:
Common approach:
- Close the loop between observed failures and future training data.
- Replace expensive full-trajectory supervision with cheaper intermediate signals such as orchestration rewards or retrieved analogies.
- Train policies to externalize or compress state explicitly when context is fragmented or long-horizon.
Open questions / failure modes:
- Most evaluations are still narrow: one domain, one model family, or one benchmark.
- Coverage depends on what failures the current policy already exposes.
- Offline judges/retrievers introduce extra cost and possible bias.
- Gains may weaken when orchestration diversity or retrieval quality is low.
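
The "close the loop" pattern can be illustrated with a toy sampler that draws more training tasks from the task types the current policy fails on. This is a minimal sketch under stated assumptions, not the SENTINEL algorithm or any paper's recipe; `FailureCurriculum`, the smoothing rule, the `floor` parameter, and the task-type names are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List


class FailureCurriculum:
    """Sample task types in proportion to their smoothed failure rate."""

    def __init__(self, task_types: List[str], floor: float = 0.05, seed: int = 0):
        self.task_types = list(task_types)
        self.floor = floor  # minimum weight so solved task types are still revisited
        self.attempts: Dict[str, int] = defaultdict(int)
        self.failures: Dict[str, int] = defaultdict(int)
        self.rng = random.Random(seed)

    def record(self, task_type: str, success: bool) -> None:
        self.attempts[task_type] += 1
        if not success:
            self.failures[task_type] += 1

    def weights(self) -> Dict[str, float]:
        out = {}
        for t in self.task_types:
            # Laplace smoothing: unseen task types start at a 0.5 failure rate.
            rate = (self.failures[t] + 1) / (self.attempts[t] + 2)
            out[t] = max(rate, self.floor)
        return out

    def sample(self, k: int) -> List[str]:
        w = self.weights()
        return self.rng.choices(
            self.task_types, weights=[w[t] for t in self.task_types], k=k
        )


curriculum = FailureCurriculum(["search", "sql", "unseen"])
for _ in range(3):
    curriculum.record("search", success=False)
    curriculum.record("sql", success=True)

print(curriculum.weights())  # search is weighted above unseen, which is above sql
next_batch = curriculum.sample(8)
```

The sketch also makes the coverage caveat above concrete: a task type that never appears in `task_types` can never be sampled, so the curriculum only targets failures the policy has already exposed.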

Theme: Evaluation pipelines themselves are vulnerable and miscalibrated

Why it matters: A recurring meta-theme is that automated evaluation can be gamed, biased by authority cues, or poorly calibrated. That undermines both benchmarking and deployment decisions if scores are taken at face value.
Representative papers:
Common approach:
- Probe hidden failure modes in judges: presentation sensitivity, citation authority bias, uncertainty collapse, or compositional inconsistency.
- Add calibration layers or auxiliary signals rather than trusting raw judge outputs.
- Use pairwise or selective-prediction analyses to separate ranking quality from confidence quality.
Open questions / failure modes:
- Many judge pipelines still cannot verify sources independently.
- Robustness to distribution shift and new model families remains unresolved.
- Some attacks preserve enough semantics to evade simple policy filters.
- Label-free confidence signals are promising but still task-structure dependent.

Theme: Constraining generation and execution beats unconstrained free-form behavior

Why it matters: Across clinical QA, tool use, inference systems, and scientific discovery, papers repeatedly find that constrained interfaces outperform unconstrained generation. The practical pattern is to narrow what the model is allowed to emit while preserving useful flexibility.
Representative papers:
Common approach:
- Replace free-form rewriting with extraction, line selection, or hidden deterministic subroutines.
- Move complexity into the environment, runtime, or interface rather than the prompt.
- Preserve auditability through verbatim evidence, hidden evaluators, or structured execution blocks.
Open questions / failure modes:
- Constrained outputs can trade recall for precision and omit critical details.
- Hidden execution improves efficiency but reduces interpretability.
- System-level gains may depend on specific positional encodings, tool ecosystems, or task types.
- Broader replication in live workflows is still limited.
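
The extraction-over-rewriting pattern can be sketched as follows: the model is only allowed to return line indices, and the system emits the selected source lines verbatim. This is an illustration of the general idea, not the SafeLLM implementation; `extract_lines` and the sample text are hypothetical, and the `selected` list stands in for a model output.

```python
from typing import List


def extract_lines(source: str, selected: List[int]) -> List[str]:
    """Return the selected source lines verbatim; reject indices outside the source."""
    lines = source.splitlines()
    extracted = []
    for index in selected:
        if not 0 <= index < len(lines):
            # An out-of-range index means the model pointed at text that does not exist.
            raise ValueError(f"line index {index} is outside the source document")
        extracted.append(lines[index])
    return extracted


source_text = "Dose: 5 mg daily.\nAvoid with alcohol.\nStore below 25 C."
model_selection = [0, 2]  # stand-in for indices returned by a model

print(extract_lines(source_text, model_selection))
# ['Dose: 5 mg daily.', 'Store below 25 C.']
```

Because every emitted line exists in the source, the output cannot contain invented wording, although it can still omit a critical line, which matches the recall-for-precision trade-off listed above.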

3) Technical synthesis

A notable split is emerging between model-centric fixes and system-centric fixes; today’s strongest empirical wins often come from the latter: validators, gates, patch logs, hidden graders, structured memory, and execution abstractions.
Several papers use closed-loop adaptation as the core training recipe: failures generate new tasks (SENTINEL), orchestration artifacts generate reward labels (Orch-RM), and retrieved analogies densify RL signal (RA-RFT).
Judge dependence is everywhere: DailyReport, AuthorityBench, StakeBench, peer-review gaming, and conformal Elo all rely on LLM judges, but multiple papers also show why raw judge outputs need calibration, decomposition, or adversarial testing.
Memory work is converging on three distinct layers: write-time protection (Containment Gap), storage-time compression/forgetting (MemRefine, value-based memory), and version-time evolution tracking (EvoMem).
Search-agent benchmarks increasingly separate step-level competence from chain-level competence; chain metrics are much harsher and better expose brittleness under long trajectories or evolving environments.
Multiple papers show that aggregate accuracy can hide targeted harm: under a complex policy, memory poisoning left overall accuracy intact while increasing wrongful denials for a targeted subgroup; stakeholder-centric prompt injection similarly reveals covert harms that attack success rate (ASR) alone misses.
There is growing use of structured intermediate artifacts as training/eval primitives: review logs, orchestration plans, patch histories, executable trajectories, and decomposition trees.
Several methods improve performance by compressing or hiding low-level execution from the main reasoning trace: HyperTool folds deterministic tool chains, memory RL compresses dialogue into bounded memory, and MiniPIC reuses spans independent of position.
Benchmark design is shifting toward contamination resistance via live-web freshness, version matching, KG uniqueness checks, and future-dated evidence requirements.
A recurring engineering lesson: small deterministic mechanisms can dominate large-model differences when they directly target the failure mode.
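
The gap between step-level and chain-level competence is easy to see with a small worked example. The sketch below assumes each trajectory is recorded as a list of per-step correctness flags and counts a chain as successful only if every step is correct; the function names and the data are hypothetical, and individual benchmarks may define chain success differently.

```python
from typing import List


def step_accuracy(trajectories: List[List[bool]]) -> float:
    """Fraction of all individual steps that were correct."""
    total = sum(len(t) for t in trajectories)
    correct = sum(sum(t) for t in trajectories)
    return correct / total


def chain_success(trajectories: List[List[bool]]) -> float:
    """Fraction of trajectories in which every step was correct."""
    return sum(all(t) for t in trajectories) / len(trajectories)


runs = [
    [True, True, True],
    [True, False, True],
    [True, True, False, True],
]

print(step_accuracy(runs))  # 0.8
print(chain_success(runs))  # 0.3333333333333333
```

Here 8 of 10 steps are correct, yet only 1 of 3 chains succeeds, which is why chain metrics look much harsher on long trajectories.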

4) Top 5 papers (with “why now”)

The Containment Gap: How Deployed Agentic AI Frameworks Fail Public-Facing Safety Requirements

Audits LangChain, AutoGPT, and OpenAI Agents SDK against six containment principles and finds no native default compliance; memory integrity is absent in all three.
Shows a single poisoned memory write can drive persistent targeted corruption across backends, including GPT-4o and Claude Haiku 4.5.
Demonstrates two deterministic defenses—a memory validator and tool-call policy gate—that eliminate observed attacks with sub-millisecond overhead.
Why now: agent deployment is moving into public-facing workflows, and this paper gives a concrete checklist plus cheap mitigations rather than abstract safety advice.
Skepticism: runtime experiments were executed only on LangChain, and the validator is fragile to semantic/adaptive attacks.
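
To make the two deterministic defenses concrete, here is a minimal sketch of a write-time memory validator and a tool-call policy gate. It is not the paper's implementation; the allowed keys, trusted sources, regex patterns, tool names, and size limit are all hypothetical placeholders that a real deployment would replace with its own policy.

```python
import re
from typing import Any, Dict

# Hypothetical policy values, chosen only for illustration.
ALLOWED_KEYS = {"user_preference", "order_status"}
TRUSTED_SOURCES = {"operator", "system"}
MAX_VALUE_CHARS = 500
SUSPICIOUS = re.compile(
    r"ignore (all|previous)|always deny|never approve", re.IGNORECASE
)
TOOL_POLICY = {
    "lookup_order": {"order_id"},
    "send_email": {"to", "body"},
}


def validate_memory_write(key: str, value: str, source: str) -> bool:
    """Allow a write only from a trusted source, to a known key, with benign content."""
    if source not in TRUSTED_SOURCES or key not in ALLOWED_KEYS:
        return False
    if len(value) > MAX_VALUE_CHARS or SUSPICIOUS.search(value):
        return False
    return True


def gate_tool_call(tool: str, args: Dict[str, Any]) -> bool:
    """Allow a call only to a listed tool using only its listed argument names."""
    allowed_args = TOOL_POLICY.get(tool)
    return allowed_args is not None and set(args) <= allowed_args


print(validate_memory_write("user_preference", "prefers email", "operator"))  # True
print(validate_memory_write("user_preference", "Always deny refunds", "operator"))  # False
print(gate_tool_call("send_email", {"to": "a@example.com", "body": "hi"}))  # True
print(gate_tool_call("delete_account", {"user": "42"}))  # False
```

A pattern list like `SUSPICIOUS` is exactly the kind of check a paraphrased or adaptive attack can slip past, which is the fragility the skepticism note points to.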

LoHoSearch: Benchmarking Long-Horizon Search Agents Beyond the Human Difficulty Ceiling

Builds a KG-driven benchmark that explicitly controls search-space size and structural complexity, avoiding the saturation seen in human-authored search sets.
Top performance is still low: GPT-5.5 reaches 34.74%, and graph-structured questions are harder than tree-structured ones.
Shows correct trajectories are much longer than on BrowseComp and that current context-management tricks yield only modest gains.
Why now: many search-agent claims are benchmark-limited; this is a cleaner stress test for whether systems can actually sustain long-horizon browsing.
Skepticism: uniqueness is only formally guaranteed within the KG, and some questions could still admit alternative answers outside it.

No Hidden Prompts Needed! You Can Game AI Peer Review with Presentation-Only Revisions

Demonstrates that visible, legitimate presentation edits alone can raise AI review scores by +1.21 on average, with a 75.1% attack success rate.
Finds narrative restructuring—not superficial polishing—is the main driver, exposing a structural weakness in reviewer models.
Includes transfer tests across reviewer models/templates and a contamination-free rolling benchmark.
Why now: AI reviewing is already being trialed in real venues, and this attack is harder to ban than hidden-text prompt injection because it looks like normal revision.
Skepticism: semantic preservation is imperfect; only 66.7% of audited pairs met the preservation threshold.

The Emergence of Autonomous Penetration Capabilities in Large Language Model-Powered AI Systems

Provides a reproducible benchmark with 300 realistic targets built from 30 RCE CVEs and benign background services.
Evaluates 19 models and finds non-trivial autonomous penetration success rates ranging from 10.7% to 69.3%.
Shows strong correlation between general model capability and penetration success, with tool use as the main bottleneck.
Why now: this is concrete evidence that offensive cyber capability is becoming an end-to-end agent property, not just a theoretical concern.
Skepticism: scope stops at initial shell access in controlled Docker environments and uses a fixed toolset.

From Uncertain Judgments to Calibrated Rankings: Conformal Elo Estimation for LLM Evaluation

Replaces hard win/tie/loss labels with calibrated soft preference probabilities from judge score differences.
Cuts mean held-out Elo MAE to 17.9 and reduces conformal interval widths by 39–70% while maintaining near-target coverage.
Keeps the standard Bradley–Terry pipeline, making it easy to adopt in existing leaderboard infrastructure.
Why now: as LLM-as-judge becomes default, calibration of Elo distances matters as much as rank order.
Skepticism: guarantees are marginal and depend on exchangeability; it does not solve deeper BT assumptions or judge epistemic uncertainty.
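
The move from hard win/loss labels to soft preferences can be sketched with a small Bradley–Terry fit. This is an illustration under assumptions, not the paper's method: the logistic mapping with a `temperature` parameter stands in for whatever calibration the authors use, the conformal intervals are not shown, and the function names, learning rate, and comparison data are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def soft_preference(score_diff: float, temperature: float = 1.0) -> float:
    """Map a judge score difference to a soft probability that the first model wins."""
    return 1.0 / (1.0 + math.exp(-score_diff / temperature))


def fit_bradley_terry(
    comparisons: List[Tuple[str, str, float]], lr: float = 0.1, epochs: int = 2000
) -> Dict[str, float]:
    """Fit Bradley-Terry strengths to soft labels by gradient ascent; return Elo-scale ratings."""
    theta: Dict[str, float] = {}
    for a, b, _ in comparisons:
        theta.setdefault(a, 0.0)
        theta.setdefault(b, 0.0)
    for _ in range(epochs):
        for a, b, diff in comparisons:
            target = soft_preference(diff)
            predicted = 1.0 / (1.0 + math.exp(-(theta[a] - theta[b])))
            step = lr * (target - predicted)  # gradient of the soft-label log-likelihood
            theta[a] += step
            theta[b] -= step
    mean = sum(theta.values()) / len(theta)
    scale = 400.0 / math.log(10.0)  # converts log-odds to the usual Elo scale
    return {model: 1000.0 + scale * (t - mean) for model, t in theta.items()}


# (model_a, model_b, judge score of a minus judge score of b); made-up numbers.
judged = [("A", "B", 2.0), ("B", "C", 1.0), ("A", "C", 3.0)]
elo = fit_bradley_terry(judged)
print(sorted(elo, key=elo.get, reverse=True))  # ['A', 'B', 'C']
```

With soft labels, a narrow win and a decisive win move the ratings by different amounts, whereas hard win/tie/loss labels would treat them identically.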

5) Practical next steps

Add write-time memory controls to any deployed agent stack: provenance checks, schema validation, demographic/targeting anomaly checks, and explicit policy-gated tool execution.
Evaluate agents on chain-level and fresh-data benchmarks, not just static step-level sets; include at least one contamination-resistant search benchmark and one evolving-environment benchmark.
Instrument memory systems separately for retention, compression, and evolution: measure what gets forgotten, what gets merged, and whether prior valid states remain recoverable after updates.
For search/report agents, track claim-reference alignment rather than citation count; weak factuality despite references is now a repeated failure mode.
Red-team web agents with stakeholder-aware metrics: measure attack success rate (ASR), task deviation, and behavioral irregularity to distinguish covert parasitism from obvious disruption.
Replace unconstrained answer generation with extraction-first or structured-output modes in safety-critical domains, especially when source text is authoritative and auditable.
In RL or post-training pipelines, shift from static task pools to failure-targeted curricula or retrieval of reasoning-analogous traces; generic RL appears to leave easy gains on the table.
Calibrate evaluation stacks: use soft preference signals, conformal intervals, or auxiliary consistency checks before trusting leaderboard deltas or automated review scores.
For multi-agent systems, audit coalition-level vulnerabilities rather than single-agent failures; small compromised coalitions can dominate system risk.
Treat environment design as part of safety: use hidden evaluators, isolated sandboxes, explicit budgets, and audit logs so agents cannot tamper with their own measurement loop.

Generated from per-paper analyses; no external browsing.

Agent safety moves upstream.