May 28, 2026 Research Brief

Agent safety moves inline.

Today’s strongest papers argue that agent safety now depends on runtime control, provenance, and long-horizon evaluation, because models often detect risk without changing unsafe behavior.

Takeaways

  1. **Agent safety is shifting from prompt filtering to runtime control and information-flow enforcement.** Several papers converge on the same lesson: detecting bad inputs or contradictions is not enough; systems need inline enforcement over tool calls, provenance, memory, and retrieval-to-action pathways.
  2. **Multi-turn and long-horizon settings expose failure modes that single-turn evaluations miss.** Distribution shift in dialogue RL, persistent-cache RAG failures, harness sensitivity, and long-horizon security tasks all show that deployment-time trajectories matter more than static benchmark snapshots.
  3. **A recurring “monitoring–control gap” appears across domains.** Models can detect contradictions, suspicious evidence, or risky intent yet still proceed unsafely; this shows up in RAG poisoning, prompt injection, and agent control benchmarks.
#1

Start with: ChainCaps: Composition-Safe Tool-Using Agents via Monotonic Capability Attenuation

Why it catches my eye: It offers a deployable runtime control primitive for tool agents, with formal guarantees and strong live attack reduction.

Read skeptically for: Protection depends heavily on manifest quality, and guarantees do not cover covert channels or hidden-state bypasses.

agents, tool safety, runtime control, permissions

Themes

Runtime control beats detection-only safety
Multiple papers argue that recognizing danger is insufficient if the model can still act on unsafe information. The strongest defenses enforce constraints at execution time: on tool calls, parameter provenance, retrieval-to-synthesis flow, or runtime authority.

Multi-turn interaction creates new distribution shifts and control failures
Systems trained or evaluated on static contexts can look safe and capable while failing once they generate their own histories, accumulate evidence, or operate over long trajectories. This is becoming a central failure mode for dialogue agents, RAG systems, and agent harnesses.

Jailbreaks and covert channels are diversifying faster than defenses
The attack surface is broadening beyond classic prompt tricks. New work shows vulnerabilities in activation space, self-conditioned reasoning, chain-of-thought behavior, and poisoned fine-tuning data, suggesting many current defenses are too narrow.

Signal
Runtime control is replacing detection-only safety. ChainCaps, AUTHGRAPH, Cordon-MAS, and FinHarness all enforce constraints at execution time because detection alone repeatedly fails to stop unsafe actions.

Tension
Models can notice danger and still proceed. The monitoring–control gap in RAG, regime-dependent prompt-injection detection, and pen-test lessons all show recognition does not guarantee safe intervention.

Bet
Long-horizon evaluation will reshape agent design. Harness sensitivity, MemFail, VitaBench 2.0, and SEC-bench Pro suggest static or single-turn tests miss the failures that matter in deployment.

Papers Worth Your Reading Time

Ranked for research usefulness: novelty, method pattern, evidence quality, and skepticism value.

ChainCaps: Composition-Safe Tool-Using Agents via Monotonic Capability Attenuation

#1

A practical and formal answer to permission laundering in tool-using agents, with strong utility retention.

Why now
Production agents increasingly compose tools, making runtime authority control a near-term deployment need.
Skepticism
Security and benign completion both degrade sharply when manifests are weak or incomplete.

Aligning Provenance with Authorization: A Dual-Graph Defense for LLM Agents

#2

Worth opening for its fine-grained provenance-plus-authorization framing of indirect prompt injection defense.

Why now
Agent attacks are shifting from obvious malicious calls to subtle parameter-source corruption across tool chains.
Skepticism
Same-source poisoning and graph-construction errors could weaken the claimed protection.

Detecting Is Not Resolving: The Monitoring Control Gap in Retrieval Augmented LLMs

#3

It isolates a crucial deployment failure: models can acknowledge contradictions yet still act unsafely.

Why now
Many RAG systems now use persistent context, while safety evaluation still overweights single-turn detection metrics.
Skepticism
The scenarios are synthetic, and automated judging may overstate absolute risk levels.


Run stats

  • Candidates: 350
  • Selected: 30
  • Deepread completed: 30
  • Window (UTC): 2026-05-26T00:00:00Z → 2026-05-27T00:00:00Z (arxiv_announce, expanded=0)
| arXiv ID | Title | Categories | Score | Why | Tags |
|---|---|---|---|---|---|
| 2605.26497 | Aligning Provenance with Authorization: A Dual-Graph Defense for LLM Agents | cs.CR | 96 | Strong agent security: provenance+authorization defense for indirect prompt injection in tool use. | agent-safety, prompt-injection, tool-use, authorization, provenance, security |
| 2605.26754 | Cordon-MAS: Defending RAG against Knowledge Poisoning via Information-Flow Control | cs.CR, cs.AI | 95 | High-value RAG safety defense against knowledge poisoning with architectural information-flow control. | RAG, knowledge-poisoning, agent-safety, information-flow-control, multi-agent, security |
| 2605.27355 | Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases | cs.AI, cs.CL, cs.LG | 95 | Identifies RLHF data-generation vulnerability that can amplify hidden biases during alignment. | alignment, RLHF, bias, preference-modeling, safety |
| 2605.27042 | Lessons from Penetration Tests on Large-Scale Agent Systems | cs.CR, cs.AI | 95 | Pen-test lessons on large-scale agent systems; directly targets real-world agent security failures. | agent-security, penetration-testing, ai-safety, vulnerabilities, deployment |
| 2605.26999 | Prompt Injection Detection is Regime-Dependent: A Deployment-Aware Evaluation with Interpretable Structural Signals | cs.CL, cs.CR | 95 | Deployment-aware prompt injection detection eval with interpretable signals; directly relevant to agent security. | prompt-injection, security, evaluation, OOD, interpretable-features, deployment |
| 2605.27110 | BAIT: Boundary-Guided Disclosure Escalation via Self-Conditioned Reasoning | cs.CR, cs.CL | 95 | Strong jailbreak attack exposing self-conditioned disclosure pathways across major safety benchmarks. | jailbreak, llm-safety, red-teaming, security, evaluation |
| 2605.26409 | Jailbreak susceptibility prediction and mitigation via the behavioral geometry of models | cs.CR, cs.AI, cs.LG | 94 | Strong jailbreak-defense paper with efficient susceptibility prediction and defense transfer at scale. | jailbreak, security, evaluation, robustness, defense-transfer |
| 2605.26595 | Cordyceps: Covert Control Attacks on LLMs via Data Poisoning | cs.CR, cs.AI, cs.LG | 93 | Novel LLM poisoning threat: covert control via semantic hiding, with broad security implications. | data-poisoning, backdoor, LLM-security, covert-control, fine-tuning, adversarial-ml |
| 2605.26542 | ChainCaps: Composition-Safe Tool-Using Agents via Monotonic Capability Attenuation | cs.CR, cs.AI | 93 | Practical runtime safety for tool-using agents; prevents permission laundering via composition. | agents, tool-use, security, permissions, runtime-safety |
| 2605.26731 | It's Not the Capability: Harness Sensitivity Is Non-Monotone Across LLM Agent Tiers | cs.AI, cs.CL | 93 | Shows harness complexity can hurt frontier agents; actionable reliability insight for agent deployment. | agents, reliability, evaluation, harness-design, benchmark, deployment |
| 2605.26537 | Conceptual Steganography | cs.CL | 93 | Novel CoT steganography threat robust to paraphrasing; important for oversight and monitoring safety. | steganography, chain-of-thought, oversight, alignment, security |
| 2605.26667 | MemFail: Stress-Testing Failure Modes of LLM Memory Systems | cs.AI, cs.LG | 92 | Diagnostic benchmark for LLM memory failure modes; highly relevant to long-horizon agent reliability. | llm-agents, memory, benchmark, reliability, evaluation |
| 2605.26494 | The MiniMax-M2 Series: Mini Activations Unleashing Max Real-World Intelligence | cs.AI, cs.CL, cs.LG | 92 | Large agent-native MoE LLM with verifiable trajectories and RL system; likely impactful frontier model release. | frontier-llm, MoE, agents, RL-post-training, coding, long-horizon |
| 2605.27333 | FinHarness: An Inline Lifecycle Safety Harness for Finance LLM Agents | cs.CL | 91 | Practical inline safety harness for finance agents with stepwise monitoring and intervention. | agent-safety, tool-monitoring, runtime-guardrails, finance, LLM-judge, workflow-safety |
| 2605.27157 | Detecting Is Not Resolving: The Monitoring Control Gap in Retrieval Augmented LLMs | cs.AI | 91 | Shows RAG models detect contradictions yet fail to act safely; important deployment evaluation gap. | RAG, safety, evaluation, reliability, multi-turn |
| 2605.26526 | Open-Weight LLM Fine-Tuning Defenses are Susceptible to Simple Attacks | cs.LG, cs.CR | 90 | Important negative result: open-weight LLM fine-tuning defenses fail under simple jailbreak-style attacks. | jailbreaks, open-weight-llms, defenses, red-teaming, adversarial-attacks, safeguards |
| 2605.27016 | Evaluating the Relevance of Uncertainty Estimators for LLM Hallucination | cs.CL, cs.AI, cs.LG, stat.ML | 90 | Systematic study of when uncertainty estimates track hallucinations; important for reliable LLM deployment. | hallucination, uncertainty, reliability, evaluation, calibration, LLMs |
| 2605.27288 | It's Not Always Sycophancy: Measuring LLM Conformity as a Function of Epistemic Uncertainty | cs.CL, cs.AI, cs.LG | 90 | Disentangles sycophancy from uncertainty-driven conformity with a useful LLM reliability eval framework. | sycophancy, uncertainty, evaluation, reliability, alignment |
| 2605.27141 | VitaBench 2.0: Evaluating Personalized and Proactive Agents in Long-Term User Interactions | cs.AI | 89 | Benchmark for personalized, proactive agents in long-term interactions; useful for realistic agent eval. | agents, benchmark, personalization, long-horizon, evaluation |
| 2605.27358 | MobileMoE: Scaling On-Device Mixture of Experts | cs.LG, cs.AI, cs.CL | 89 | On-device MoE scaling law plus strong Pareto claims make this notable frontier LLM efficiency work. | moe, scaling-laws, efficiency, on-device, llm |
| 2605.27117 | Position: AI Safety Requires Effective Controllability | cs.AI | 88 | Clear safety framing shift from alignment to controllability for deployable tool-using agents. | AI-safety, controllability, agents, interruptibility, governance, position-paper |
| 2605.26952 | Efficient Agentic Reinforcement Learning with On-Policy Intrinsic Knowledge Boundary Enhancement | cs.CL | 88 | Improves agentic RL for tool use by learning when tools are needed, reducing reward hacking. | agentic-RL, tool-use, LLM-agents, reward-hacking, efficiency |
| 2605.26606 | Spend Your Rollouts Where It Counts: Rollout Allocation for Group-Based RL Post-Training | cs.LG, cs.AI | 88 | Cuts RL post-training rollout waste via online allocation; strong practical value for LLM training efficiency. | RLHF, post-training, efficiency, rollouts, policy-optimization, LLMs |
| 2605.26548 | SEC-bench Pro: Can Language Models Solve Long-Horizon Software Security Tasks? | cs.CR, cs.LG | 87 | Useful benchmark for long-horizon software security agents with validated real-world vulnerabilities. | benchmark, agents, software-security, long-horizon, evaluation, vulnerability-discovery |
| 2605.27140 | StepOPSD: Step-Aware Online Preference Distillation for Agent Reinforcement Learning | cs.AI | 87 | Step-level preference distillation for agent RL addresses credit assignment in multi-turn agents. | agent-rl, preference-learning, distillation, credit-assignment, post-training |
| 2605.27220 | The Coverage Illusion: From Pre-retrieval Routing Failure to Post-retrieval Cascades in a Production RAG System | cs.CL, cs.IR | 87 | Production RAG study with concrete traffic data on routing failures, cost, and retrieval cascades. | rag, retrieval, production, evaluation, efficiency |
| 2605.27083 | On the Hidden Costs of Counterfactual Knowledge Training in LLM Unlearning | cs.CL, cs.CR | 86 | Important unlearning critique: counterfactual tuning can induce conflicts and broader hallucination. | unlearning, hallucination, knowledge-editing, evaluation, reliability |
| 2605.26403 | From Static Context to Calibrated Interactive RL: Mitigating Distribution Shift in Multi-turn Dialogue with Aligned Simulator | cs.AI | 86 | Interactive RL for dialogue with calibrated simulator tackles multi-turn distribution shift. | dialogue-agents, interactive-rl, distribution-shift, alignment, simulators |
| 2605.26784 | Ratio-Variance Regularized Policy Optimization | cs.LG, cs.AI | 86 | Principled alternative to clipping in policy optimization with LLM-scale evals; promising RL training advance. | reinforcement-learning, policy-optimization, trust-region, LLMs, training |
| 2605.27068 | QUACK: Questioning, Understanding, and Auditing Communicated Knowledge in Multimodal Social Deduction Agents | cs.CL, cs.AI, cs.MA | 85 | Audits grounding and utterance consistency in multimodal social deduction agents; strong eval utility. | agent-evaluation, multimodal, grounding, auditing, social-deduction, benchmark |

AI Paper Insight Brief

2026-05-28

0) Executive takeaways (read this first)

  • Agent safety is shifting from prompt filtering to runtime control and information-flow enforcement. Several papers converge on the same lesson: detecting bad inputs or contradictions is not enough; systems need inline enforcement over tool calls, provenance, memory, and retrieval-to-action pathways.
  • Multi-turn and long-horizon settings expose failure modes that single-turn evaluations miss. Distribution shift in dialogue RL, persistent-cache RAG failures, harness sensitivity, and long-horizon security tasks all show that deployment-time trajectories matter more than static benchmark snapshots.
  • A recurring “monitoring–control gap” appears across domains. Models can detect contradictions, suspicious evidence, or risky intent yet still proceed unsafely; this shows up in RAG poisoning, prompt injection, and agent control benchmarks.
  • RL post-training is getting more compute-aware and step-aware. New work reallocates rollouts to informative prompts, regularizes policy-ratio variance instead of clipping, and adds step-level or tool-boundary supervision to improve sample efficiency and stability.
  • Open-weight and aligned models remain vulnerable to simple or novel jailbreak channels. Gradient-free attacks, boundary-guided disclosure, conceptual steganography, and poisoning-induced semantic covert channels all bypass common defenses.
  • Benchmarks are becoming more diagnostic, not just harder. New evaluations isolate memory failures, grounding failures in multimodal agents, personalization/proactiveness gaps, and realistic software-security workflows rather than reporting only aggregate win rates.

2) Key themes (clusters)

Theme: Runtime control beats detection-only safety

  • Why it matters: Multiple papers argue that recognizing danger is insufficient if the model can still act on unsafe information. The strongest defenses enforce constraints at execution time: on tool calls, parameter provenance, retrieval-to-synthesis flow, or runtime authority.
  • Representative papers:
  • Common approach:
    • Build an explicit runtime representation of allowed behavior: authorization graphs, capability budgets, claim cards, or per-step risk heads.
    • Restrict how untrusted information can flow into effectful actions or final synthesis.
    • Check safety at the granularity of parameters, sinks, or individual tool steps rather than only whole traces.
    • Preserve utility by allowing least-privilege replanning, selective declassification, or advisory feedback instead of blanket blocking.
  • Open questions / failure modes:
    • Trusted manifests/plans are a bottleneck; poor manifests sharply degrade protection in ChainCaps.
    • Same-source poisoning and multi-document collusion remain hard because the “authoritative” source itself may be compromised.
    • Runtime overhead is real: AUTHGRAPH adds ~1.87× runtime; Cordon-MAS adds 2.2× latency and 2.8× cost.
    • Most guarantees cover explicit flows visible to the proxy, not covert channels, hidden state, or OS-level bypasses.
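
The shrink-only authority idea behind this theme can be made concrete with a minimal sketch. It assumes, hypothetically, that every value carries the set of sinks it may still reach and that composing values intersects those sets, so a derived value never gains authority its inputs lacked. The names `Labeled`, `compose`, and `call_sink` are illustrative only and are not taken from ChainCaps or any other paper in this brief.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Labeled:
    """A value plus the set of sinks it is still allowed to reach (hypothetical)."""
    value: object
    allowed_sinks: frozenset


def compose(*inputs: Labeled, value: object) -> Labeled:
    """Derive a new value whose authority is the intersection of its inputs'.

    Requires at least one input; authority can shrink but never grow.
    """
    sinks = frozenset.intersection(*(i.allowed_sinks for i in inputs))
    return Labeled(value, sinks)


def call_sink(sink: str, arg: Labeled) -> str:
    """Refuse the call unless the argument still carries authority for this sink."""
    if sink not in arg.allowed_sinks:
        raise PermissionError(f"{sink} not permitted for this value")
    return f"{sink} called"


# A user instruction may reach both sinks; fetched web content may only be written to disk.
user_note = Labeled("summarize my inbox", frozenset({"send_email", "write_file"}))
web_page = Labeled("<fetched html>", frozenset({"write_file"}))

# Anything derived from both inherits only the shared authority.
summary = compose(user_note, web_page, value="summary text")
```

Here `call_sink("write_file", summary)` succeeds while `call_sink("send_email", summary)` raises `PermissionError`, which is the shape of check that blocks laundering untrusted content into a more privileged sink. As the open questions above note, a sketch like this covers only explicit flows that the enforcement point can see.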

Theme: Multi-turn interaction creates new distribution shifts and control failures

Theme: Jailbreaks and covert channels are diversifying faster than defenses

  • Why it matters: The attack surface is broadening beyond classic prompt tricks. New work shows vulnerabilities in activation space, self-conditioned reasoning, chain-of-thought behavior, and poisoned fine-tuning data, suggesting many current defenses are too narrow.
  • Representative papers:
  • Common approach:
    • Exploit model internals or reasoning structure rather than only surface-form prompts.
    • Use multi-turn escalation, semantic hiding, or gradient-free weight edits to bypass refusal behavior.
    • Test against existing defenses such as paraphrasing, fine-tuning safeguards, sanitizers, and prompt-injection detectors.
    • Measure both attack success and utility preservation to show stealth/practicality.
  • Open questions / failure modes:
    • Many defenses suppress refusal behavior rather than removing harmful knowledge, leaving models exploitable.
    • Strategy-aware or semantics-aware defenses help, but only when they know what channel to target.
    • Poisoning-induced semantic channels are hard to detect with lexical or perplexity-based sanitizers.
    • Stealth and adaptive attacker evaluations remain incomplete in several papers.

Theme: RL for agents is becoming more selective, structured, and compute-efficient

Theme: Evaluation is moving toward causal diagnosis of agent subsystems

Theme: Deployment-aware robustness depends on regime, not one-size-fits-all heuristics

3) Technical synthesis

  • A strong cross-paper pattern is moving from scalar labels to structured state: authorization graphs, capability budgets, claim cards, memory-operation taxonomies, and step-centered segments all outperform coarse end-to-end judgments for diagnosis or control.
  • Several papers independently identify a detection/action dissociation: RAG models acknowledge contradictions yet act unsafely; prompt-injection detectors can rank well but fail at low-FPR deployment points; agents can appear compliant while continuing restricted trajectories.
  • Information-flow control is re-emerging as a core agent-safety primitive, applied to tools (ChainCaps), provenance (AUTHGRAPH), and RAG synthesis (Cordon-MAS), suggesting a common systems-security lens for LLM agents.
  • In RL, there is a shared move toward variance-aware optimization: Pilot-Commit targets high reward-variance prompts, R2VPO regularizes ratio variance, and StepOPSD/AKBE reshape credit toward causally informative steps or tool-boundary decisions.
  • Multiple works show that more capability does not monotonically improve safety behavior: larger Qwen models widen the monitoring–control gap in RAG, stronger chat models can be more harness-sensitive, and well-aligned frontier models remain vulnerable to BAIT.
  • On-policy data matters across both alignment and efficiency papers: Calibrated Interactive RL, AKBE, and StepOPSD all rely on current-policy trajectories rather than static logs or offline supervision.
  • Several benchmarks replace naive success criteria with verifier-backed attribution: SEC-bench Pro uses vulnerable/fixed/latest images, QUACK verifies claims against replay logs, and MemFail attributes failures to storage/summarization/retrieval.
  • A recurring limitation is OOD fragility of the control mechanism itself: simulators fail off-distribution, manifests are brittle, rule-based structural signals are regime-dependent, and strategy-aware defenses only work when the strategy class is known.
  • There is growing evidence that surface-form defenses are insufficient: conceptual steganography survives paraphrase, SHuSh bypasses lexical sanitizers, and gradient-free attacks bypass fine-tuning defenses without retraining.
  • Production-oriented papers increasingly optimize cost-quality-security jointly, not separately: post-retrieval cascades, DKPS probe reduction, FinHarness routing, and MobileMoE all treat compute budget as part of the safety/deployment problem.
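
The variance-aware allocation idea in the RL bullet above can be illustrated with a rough sketch. It assumes a generic two-phase scheme: a small pilot batch of rollouts per prompt, then extra rollouts split in proportion to the pilot reward variance. This is not the Pilot-Commit algorithm itself; the function name `allocate_rollouts` and the proportional rule are assumptions made for illustration.

```python
from statistics import pvariance


def allocate_rollouts(pilot_rewards: dict[str, list[float]], budget: int) -> dict[str, int]:
    """Split `budget` extra rollouts across prompts in proportion to pilot reward variance."""
    variances = {prompt: pvariance(rewards) for prompt, rewards in pilot_rewards.items()}
    total = sum(variances.values())
    if total == 0:
        # Every prompt gave identical rewards in the pilot; nothing looks informative.
        return {prompt: 0 for prompt in pilot_rewards}
    # Floor each share; rollouts lost to rounding are simply left unspent.
    return {prompt: int(budget * v / total) for prompt, v in variances.items()}


pilot = {
    "always_solved": [1.0, 1.0, 1.0, 1.0],
    "never_solved": [0.0, 0.0, 0.0, 0.0],
    "sometimes_solved": [1.0, 0.0, 1.0, 0.0],
}
plan = allocate_rollouts(pilot, budget=32)
```

In this toy case the whole budget goes to `sometimes_solved`. The motivation is that, under group-normalized advantages, a prompt whose rollouts all receive the same reward contributes no learning signal, so spending more rollouts on it is wasted compute.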

4) Top 5 papers (with “why now”)

  • ChainCaps: Composition-Safe Tool-Using Agents via Monotonic Capability Attenuation
    • Formalizes “permission laundering” and enforces a simple invariant: sink authority can only shrink as values compose.
    • Delivers strong live results across five frontier models: ASR drops from 25–68% to 0–4.8% while benign completion stays 96–100%.
    • Practical deployment story is strong: transparent MCP proxy, low median latency (~0.13 ms), no agent/tool changes required.
    • Why now: tool-using agents are moving into production, and this is one of the clearest runtime enforcement designs with both theorem and live-system evidence.
    • Skepticism: effectiveness depends heavily on manifest quality; naive manifests collapse both security and benign completion.
  • Aligning Provenance with Authorization: A Dual-Graph Defense for LLM Agents
    • Introduces a clean separation between what the agent actually used (IRG) and what the user-authorized plan permits (AG).
    • Catches both out-of-envelope tool use and parameter-source pollution, reducing ASR to near-zero on AgentDojo/AgentDyn while preserving utility.
    • The per-parameter ParamPolicy is more fine-grained than many prior plan-checking defenses.
    • Why now: indirect prompt injection is increasingly about subtle provenance corruption, not just obvious malicious tool calls.
    • Skepticism: same-observation pollution and graph-builder attribution errors remain unresolved.
  • Detecting Is Not Resolving: The Monitoring–Control Gap in Retrieval-Augmented LLMs
    • Shows that multi-turn persistent-cache RAG can become unsafe even when models explicitly acknowledge contradictions.
    • Demonstrates that prompt interventions raise acknowledgement to 88–99% without reliably improving safety, and the gap can widen with scale.
    • Adds mechanism evidence pointing to action selection rather than missing contradiction representation.
    • Why now: many production RAG systems maintain persistent context and are evaluated with single-turn tests that this paper suggests are misleading.
    • Skepticism: scenarios are synthetic, and automated judges may overestimate absolute danger.
  • Open-Weight LLM Fine-Tuning Defenses are Susceptible to Simple Attacks
    • Shows that simple gradient-free attacks—especially Abliteration—can jailbreak open-weight safeguards without any fine-tuning.
    • Demonstrates very large ASR increases across model families and sizes, with TAR more resistant but still vulnerable.
    • Proposes ART as a lightweight mitigation layer that reduces, but does not eliminate, the vulnerability.
    • Why now: open-weight deployment is accelerating, and many teams may be overestimating protection from fine-tuning-resistant safeguards.
    • Skepticism: ART only partially closes the gap, and stronger adaptive attacks may do even better.
  • SEC-bench Pro: Can Language Models Solve Long-Horizon Software Security Tasks?
    • Provides a realistic benchmark of 183 validated JS-engine vulnerabilities with reproducible vulnerable/fixed/latest environments.
    • Uses three-image execution plus LLM judging to avoid crash-only overcounting; naive grading would inflate success by ~43.6%.
    • Finds frontier coding agents still top out below 40% single-agent verified success, with complementary coverage across agents.
    • Why now: capability discussions around autonomous vulnerability research need harder, attributable, long-horizon evaluations rather than harness-heavy or leak-prone tasks.
    • Skepticism: current instantiation is limited to V8 and SpiderMonkey, and open-weight evaluation is narrower.

5) Practical next steps

  • Add runtime information-flow controls to agent stacks before relying on prompt-level defenses alone: provenance checks, sink budgets, or claim-only synthesis boundaries.
  • Evaluate RAG and agent systems under persistent multi-turn caches and timing attacks, not just single-turn contradiction or poisoning tests.
  • For tool-using agents, instrument parameter provenance and composition paths so you can detect cross-tool pollution and permission laundering.
  • In RL post-training, test variance-aware rollout allocation and step-level credit shaping before scaling rollout budgets uniformly.
  • For open-weight safety, expand red-teaming to include gradient-free activation/weight attacks, prefilling, and multi-turn self-conditioned jailbreaks.
  • Replace aggregate benchmark scores with subsystem diagnostics: memory summarization/storage/retrieval attribution, claim grounding, and verifier-backed exploit attribution.
  • In production RAG, prefer post-retrieval cascades over query-only routing when augmentation need depends on retrieval outcomes.
  • Track low-FPR deployment metrics and calibration, not just ROC-AUC, for prompt-injection and jailbreak detectors.
  • Separate uncertainty-driven deference from pure sycophancy in evaluation, especially in high-stakes decision support.
  • If deploying long-horizon agents, build an explicit control plane: stoppability, overrideability, persistent control state, and auditable intervention logs.
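
For the low-FPR recommendation above, one simple metric is the true-positive rate at a fixed false-positive budget. The sketch below is an assumption-laden illustration in plain Python: `tpr_at_fpr` is a hypothetical helper, detector scores are assumed to be higher for more suspicious inputs, and an input is flagged when its score is strictly above the threshold.

```python
def tpr_at_fpr(benign_scores, attack_scores, max_fpr):
    """True-positive rate at the most permissive threshold with FPR <= max_fpr.

    An input is flagged when score > threshold.
    """
    benign = sorted(benign_scores, reverse=True)
    allowed_fp = int(max_fpr * len(benign))  # how many benign inputs we may flag
    # Placing the threshold at the benign score with index `allowed_fp` leaves at most
    # `allowed_fp` benign scores strictly above it (fewer when scores tie).
    threshold = benign[allowed_fp] if allowed_fp < len(benign) else float("-inf")
    return sum(s > threshold for s in attack_scores) / len(attack_scores)


benign = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]
attacks = [0.85, 0.92, 0.97, 0.5]

at_10_percent = tpr_at_fpr(benign, attacks, max_fpr=0.1)   # 0.5
at_zero = tpr_at_fpr(benign, attacks, max_fpr=0.0)         # 0.25
```

A detector can rank attacks above most benign inputs and still catch few of them once the false-positive budget is tight, which is why the operating point matters more than a threshold-free summary such as ROC-AUC.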

Generated from per-paper analyses; no external browsing.