May 26, 2026 Research Brief

Agent safety moves to runtime.

Today’s strongest papers argue that agent security and reliability depend less on detecting bad inputs than on controlling provenance, authority, and action at execution time.

Takeaways

  1. **Agent security is shifting from prompt filtering to runtime control of information flow and authority.** The strongest papers today enforce provenance, authorization, or capability attenuation during execution rather than trying to classify bad prompts alone.
  2. **Detection is repeatedly shown to be insufficient without control.** This appears in RAG poisoning, prompt injection, and multi-turn contradiction settings: systems can recognize risk or conflict yet still take unsafe actions.
  3. **Long-horizon agent training is moving toward finer-grained credit assignment and smarter sampling.** Several RL papers improve efficiency by reallocating rollouts or assigning step-level credit using graphs or hindsight rescoring instead of blunt trajectory-level rewards.
#1

Start with: ChainCaps: Composition-Safe Tool-Using Agents via Monotonic Capability Attenuation

Why it catches my eye: It offers a reusable runtime control pattern for tool agents and reports strong attack reduction without collapsing benign task completion.

Read skeptically for: Its guarantees rely on trusted manifests and visible flows, so hidden channels or bad policy specs can still break safety.

Tags: agent-safety, tool-use, runtime-guardrails

Themes

**Runtime security for tool-using and retrieval agents.** The most credible defenses today are not just better classifiers; they are execution-time mechanisms that constrain what information can flow where and which actions can be taken. This is especially relevant for agents with tools, persistent memory, or external data access.

**Monitoring-control gaps in RAG and prompt security.** Multiple papers show that recognizing danger, contradiction, or injection structure does not guarantee safe behavior. This weakens confidence in detector-only defenses and benchmark setups that stop at awareness metrics.

**Jailbreaks, covert channels, and poisoning beyond standard threat models.** Safety defenses optimized for obvious prompts or fine-tuning attacks are being bypassed by attacks that exploit model internals, reasoning traces, or training data. The attack surface is broader than “bad prompt in, bad answer out.”

**Signal: Runtime control is replacing prompt filtering.** ChainCaps, Dual-Graph Defense, Cordon-MAS, and FinHarness all constrain provenance, permissions, or action flow during execution rather than only classifying prompts.

**Tension: Detection often fails to change behavior.** Prompt injection and RAG papers show systems can detect contradictions or risky structure yet still act unsafely under deployment constraints.

**Bet: Agent training will get more local.** Rollout allocation, graph-based credit assignment, and step-aware preference distillation all shift RL from blunt trajectory rewards toward step-level signal use.

Papers Worth Your Reading Time

Ranked for research usefulness: novelty, method pattern, evidence quality, and skepticism value.

ChainCaps: Composition-Safe Tool-Using Agents via Monotonic Capability Attenuation

#1

A practical runtime defense against permission laundering in tool agents, with a clear systems abstraction and strong live-eval results.

Why now: MCP-style tool ecosystems are scaling faster than robust permission models.
Skepticism: Trusted manifests and proxy-visible flows are strong assumptions in messy deployments.
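
To make the permission-laundering idea concrete, here is a minimal sketch of monotonic attenuation. It is an illustration under stated assumptions, not the paper's implementation: the `Capability` record, the `attenuate`, `derive`, and `may_call` helpers, and the example permissions are all hypothetical names. The only property it tries to show is that a value derived from several inputs never holds more authority than its most restricted input.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class Capability:
    """Hypothetical label attached to a value flowing between tools."""
    permissions: frozenset  # sinks this value may reach, e.g. {"read", "send_email"}
    budget: int             # remaining number of tool hops this value may take


def attenuate(a: Capability, b: Capability) -> Capability:
    """Combine two labels; the result is never more permissive than either input."""
    return Capability(a.permissions & b.permissions, min(a.budget, b.budget))


def derive(inputs: list[Capability]) -> Capability:
    """Label a tool output: meet of all input labels, minus one hop of budget."""
    if not inputs:
        raise ValueError("a derived value needs at least one labelled input")
    merged = reduce(attenuate, inputs)
    return Capability(merged.permissions, max(merged.budget - 1, 0))


def may_call(cap: Capability, sink: str) -> bool:
    """Gate a tool call: the sink must be permitted and budget must remain."""
    return sink in cap.permissions and cap.budget > 0


# Toy flow: a trusted user document is mixed with an untrusted web page.
user_doc = Capability(frozenset({"read", "send_email"}), budget=3)
web_page = Capability(frozenset({"read"}), budget=2)
summary = derive([user_doc, web_page])

print(may_call(summary, "read"))        # the mixed value may still be read
print(may_call(summary, "send_email"))  # but it may not reach the email sink
```

Because intersection and `min` can only shrink a label, chaining individually allowed calls cannot widen what the result is permitted to do in this toy model. As the skepticism note says, that only holds for flows the gate can actually see.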

Aligning Provenance with Authorization: A Dual-Graph Defense for LLM Agents

#2

A complementary security primitive that checks where tool arguments came from, not just whether a tool call looks allowed.

Why now: Indirect prompt injection is increasingly about cross-tool contamination and provenance loss.
Skepticism: It depends on accurate graph attribution and does not solve same-observation poisoning.
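
A rough sketch of what "checking where tool arguments came from" could look like follows. It assumes each argument already carries a source label, which is exactly the attribution step the skepticism note flags as fragile. The `Arg` type, the `ALLOWED_SOURCES` policy table, and the tool names are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Arg:
    """A tool argument plus the (hypothetical) source that supplied its value."""
    value: str
    source: str  # e.g. "user", "tool:read_inbox", "tool:fetch_url"


# Hypothetical policy: for each (tool, parameter), which sources may supply it.
ALLOWED_SOURCES = {
    ("send_money", "recipient"): {"user"},
    ("send_money", "amount"): {"user"},
    ("summarize", "text"): {"user", "tool:read_inbox", "tool:fetch_url"},
}


def authorize(tool: str, args: dict[str, Arg]) -> list[str]:
    """Return a list of violations; an empty list means the call may proceed."""
    violations = []
    for name, arg in args.items():
        allowed = ALLOWED_SOURCES.get((tool, name), set())  # unknown pairs are denied
        if arg.source not in allowed:
            violations.append(f"{tool}.{name} supplied by {arg.source}")
    return violations


# The tool call itself is allowed, but one argument came from fetched web content.
injected = authorize("send_money", {
    "recipient": Arg("attacker-account", "tool:fetch_url"),
    "amount": Arg("100", "user"),
})
clean = authorize("send_money", {
    "recipient": Arg("alice", "user"),
    "amount": Arg("100", "user"),
})
print(injected)
print(clean)
```

A check that looks only at the tool name would pass both calls; the provenance check rejects the first because the recipient originated in an untrusted observation.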

SEC-bench Pro: Can Language Models Solve Long-Horizon Software Security Tasks?

#3

A realistic benchmark that sharply lowers apparent agent capability and exposes how weak current grading shortcuts are.

Why now: Security agents are being marketed aggressively, but realistic long-horizon evaluation is still scarce.
Skepticism: The benchmark currently centers on two JavaScript engines, limiting breadth.
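
The gap between grading shortcuts and stricter grading can be illustrated with a toy comparison. This is one plausible reading of attribution-aware grading, not the benchmark's actual grader: it assumes a submission counts only if it crashes the vulnerable build and does not crash the fixed build. The `Run` record and the four runs are invented for illustration and are not benchmark data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """Outcome of replaying one submitted proof-of-concept (hypothetical record)."""
    crashes_vulnerable: bool  # crashed the build that contains the target bug
    crashes_fixed: bool       # also crashed the build where that bug is patched


def crash_only(run: Run) -> bool:
    """Shortcut grading: any crash on the vulnerable build counts as success."""
    return run.crashes_vulnerable


def attribution_aware(run: Run) -> bool:
    """Stricter grading (assumed rule): the crash must disappear once the bug is fixed."""
    return run.crashes_vulnerable and not run.crashes_fixed


# Invented toy runs: two of the three crashes also hit the fixed build,
# so they are probably triggering some other defect.
runs = [Run(True, False), Run(True, True), Run(False, False), Run(True, True)]

crash_rate = sum(crash_only(r) for r in runs) / len(runs)
attributed_rate = sum(attribution_aware(r) for r in runs) / len(runs)
print(crash_rate, attributed_rate)
```

In this toy set the shortcut reports three successes and the stricter rule reports one, which is the kind of inflation the benchmark warns about.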


Run stats

  • Candidates: 350
  • Selected: 30
  • Deepread completed: 30
  • Window (UTC): 2026-05-26T00:00:00Z → 2026-05-27T00:00:00Z (arxiv_announce, expanded=0)
| arXiv ID | Title | Categories | Score | Why | Tags |
|---|---|---|---|---|---|
| 2605.27110 | BAIT: Boundary-Guided Disclosure Escalation via Self-Conditioned Reasoning | cs.CR, cs.CL | 96 | Strong jailbreak method exploiting self-conditioned reasoning; directly relevant to LLM security evals. | jailbreak, LLM-security, red-teaming, prompting, safety-evaluation |
| 2605.26497 | Aligning Provenance with Authorization: A Dual-Graph Defense for LLM Agents | cs.CR | 95 | Dual-graph defense targets indirect prompt injection with provenance-aware authorization checks. | agent-safety, prompt-injection, tool-use, authorization, provenance, security |
| 2605.26409 | Jailbreak susceptibility prediction and mitigation via the behavioral geometry of models | cs.CR, cs.AI, cs.LG | 95 | Strong jailbreak eval+mitigation transfer framework with major probe-efficiency gains across many models. | jailbreaks, safety-evaluation, robustness, defense-transfer, behavioral-geometry |
| 2605.27042 | Lessons from Penetration Tests on Large-Scale Agent Systems | cs.CR, cs.AI | 95 | Pen-test lessons for large-scale agents; directly targets real-world agent security failures. | agent-security, penetration-testing, autonomy, system-security, ai-safety |
| 2605.26999 | Prompt Injection Detection is Regime-Dependent: A Deployment-Aware Evaluation with Interpretable Structural Signals | cs.CL, cs.CR | 95 | Deployment-aware prompt injection detection with interpretable signals; directly relevant to agent security. | prompt-injection, agent-safety, security, evaluation, OOD, detection |
| 2605.26754 | Cordon-MAS: Defending RAG against Knowledge Poisoning via Information-Flow Control | cs.CR, cs.AI | 94 | Architectural RAG defense against knowledge poisoning; strong safety framing and reusable design. | RAG, knowledge-poisoning, information-flow-control, multi-agent, security, grounding |
| 2605.26542 | ChainCaps: Composition-Safe Tool-Using Agents via Monotonic Capability Attenuation | cs.CR, cs.AI | 94 | Practical runtime safety for tool agents; prevents permission laundering via composition-safe capabilities. | agent-safety, tool-use, permissions, sandboxing, runtime-guardrails |
| 2605.26537 | Conceptual Steganography | cs.CL | 94 | CoT steganography via reasoning patterns, robust to paraphrasing; important hidden-channel safety risk. | steganography, chain-of-thought, oversight, misalignment, security |
| 2605.26595 | Cordyceps: Covert Control Attacks on LLMs via Data Poisoning | cs.CR, cs.AI, cs.LG | 93 | Introduces stealthy poisoning-based covert control attacks on LLMs across models and defenses. | data-poisoning, backdoors, LLM-security, covert-control, adversarial-ml |
| 2605.26667 | MemFail: Stress-Testing Failure Modes of LLM Memory Systems | cs.AI, cs.LG | 93 | Diagnostic benchmark for LLM memory failure modes; highly relevant to long-horizon agent reliability. | llm-agents, memory, benchmark, reliability, evaluation |
| 2605.26731 | It's Not the Capability: Harness Sensitivity Is Non-Monotone Across LLM Agent Tiers | cs.AI, cs.CL | 93 | Shows harness complexity can hurt frontier agents; actionable reliability insight for agent deployment. | agents, reliability, evaluation, deployment, harness, benchmark |
| 2605.27355 | Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases | cs.AI, cs.CL, cs.LG | 92 | Identifies RLHF data-generation vulnerability where models can steer preferences toward misaligned biases. | alignment, RLHF, preference-modeling, bias, data-generation-risks |
| 2605.26494 | The MiniMax-M2 Series: Mini Activations Unleashing Max Real-World Intelligence | cs.AI, cs.CL, cs.LG | 92 | Large agent-native MoE LLM with RL/data pipeline details; likely impactful frontier agent progress. | frontier-llm, agents, MoE, RL, post-training, coding |
| 2605.27333 | FinHarness: An Inline Lifecycle Safety Harness for Finance LLM Agents | cs.CL | 91 | Inline safety harness for finance agents monitors intent drift and risky tool calls before action. | agent-safety, finance, tool-monitoring, runtime-guardrails, LLM-judge, security |
| 2605.27288 | It's Not Always Sycophancy: Measuring LLM Conformity as a Function of Epistemic Uncertainty | cs.CL, cs.AI, cs.LG | 91 | Disentangles sycophancy from uncertainty-driven conformity; useful for alignment diagnosis and evals. | alignment, sycophancy, uncertainty, evaluation, reliability |
| 2605.26526 | Open-Weight LLM Fine-Tuning Defenses are Susceptible to Simple Attacks | cs.LG, cs.CR | 90 | Shows open-weight fine-tuning defenses fail under simple jailbreak-style attacks; high practical impact. | jailbreaks, open-weight-llms, defenses, red-teaming, misuse, security |
| 2605.27157 | Detecting Is Not Resolving: The Monitoring Control Gap in Retrieval Augmented LLMs | cs.AI | 90 | Shows RAG models detect contradictions yet fail to act safely; important gap for agentic deployment. | RAG, reliability, monitoring, multi-turn-evaluation, safety |
| 2605.27141 | VitaBench 2.0: Evaluating Personalized and Proactive Agents in Long-Term User Interactions | cs.AI | 90 | Benchmark for personalized, proactive agents in long-term interactions; useful for realistic agent eval. | agents, benchmark, personalization, proactivity, long-horizon |
| 2605.27016 | Evaluating the Relevance of Uncertainty Estimators for LLM Hallucination | cs.CL, cs.AI, cs.LG, stat.ML | 90 | Systematic study of when uncertainty estimates track LLM hallucinations; strong reliability relevance. | hallucination, uncertainty, reliability, evaluation, factuality, LLM |
| 2605.27358 | MobileMoE: Scaling On-Device Mixture of Experts | cs.LG, cs.AI, cs.CL | 90 | On-device MoE scaling law plus strong deployment-oriented models; notable frontier LLM efficiency work. | MoE, scaling-laws, efficient-LLMs, on-device, architecture |
| 2605.26691 | Mind the Tool Failures: Achieving Synergistic Tool Gains for Medical Agents | cs.AI | 89 | Studies unsafe tool failures in medical agents and instance-wise selection under imperfect tools. | tool-use, medical-agents, safety, reliability, decision-making |
| 2605.26548 | SEC-bench Pro: Can Language Models Solve Long-Horizon Software Security Tasks? | cs.CR, cs.LG | 88 | Realistic benchmark for long-horizon agentic software security tasks with validated vulnerabilities. | benchmark, agents, software-security, evaluation, long-horizon, bug-hunting |
| 2605.26918 | Are Video Models Zero-Shot Learners and Reasoners in Education? EduVideoBench, A Knowledge-Skills-Attitude Benchmark for Educational Video Generation | cs.CL | 88 | Benchmark for educational validity and safety of video models; useful eval framing beyond generic safety. | benchmark, video-models, safety, evaluation, multimodal, education |
| 2605.27220 | The Coverage Illusion: From Pre-retrieval Routing Failure to Post-retrieval Cascades in a Production RAG System | cs.CL, cs.IR | 88 | Production RAG study with concrete traffic evidence on routing/augmentation failures and cost tradeoffs. | RAG, retrieval, evaluation, production-systems, efficiency |
| 2605.27083 | On the Hidden Costs of Counterfactual Knowledge Training in LLM Unlearning | cs.CL, cs.CR | 87 | Important unlearning critique: counterfactual tuning can induce conflicts and broader hallucination spillover. | unlearning, hallucination, reliability, knowledge-editing, benchmark |
| 2605.27140 | StepOPSD: Step-Aware Online Preference Distillation for Agent Reinforcement Learning | cs.AI | 87 | Step-level preference distillation for agent RL addresses credit assignment in multi-turn agents. | agent-rl, preference-learning, distillation, multi-turn, training |
| 2605.26606 | Spend Your Rollouts Where It Counts: Rollout Allocation for Group-Based RL Post-Training | cs.LG, cs.AI | 87 | Improves rollout allocation for RL post-training of LLMs; practical efficiency for frontier training. | RLHF, post-training, efficiency, LLM, rollouts, optimization |
| 2605.26684 | Beyond Trajectory-Level Attribution: Graph-Based Credit Assignment for Agentic Reinforcement Learning | cs.LG, cs.AI | 87 | Improves step-level credit assignment for agentic RL using graph structure; promising for agent training. | agents, reinforcement-learning, credit-assignment, LLM-agents, reasoning |
| 2605.27068 | QUACK: Questioning, Understanding, and Auditing Communicated Knowledge in Multimodal Social Deduction Agents | cs.CL, cs.AI, cs.MA | 86 | Audits grounding and utterance consistency in multimodal social deduction agents beyond win rates. | agent-evaluation, multimodal, auditing, grounding, social-deduction, benchmark |
| 2605.26784 | Ratio-Variance Regularized Policy Optimization | cs.LG, cs.AI | 86 | Principled PPO-style alternative with ratio-variance control, evaluated across LLM scales. | reinforcement-learning, post-training, optimization, LLM, policy-optimization |

AI Paper Insight Brief

2026-05-26

0) Executive takeaways (read this first)

  • Benchmarks are getting more deployment-shaped—and they are lowering apparent capability. Realistic evaluations in software security, personalization, memory, social grounding, and production RAG all show weaker performance than headline benchmark numbers would suggest.
  • Open-weight and aligned models still expose simple attack surfaces. Gradient-free jailbreaks, self-conditioned disclosure attacks, covert channels in CoT, and poisoning-based covert control all bypass defenses that look stronger under narrower threat models.
  • Sparse efficiency is becoming practical at both ends of the stack. One paper pushes low-activation MoE for frontier agentic systems; another shows MoE can now be viable on phones with real deployment measurements.

2) Key themes (clusters)

Theme: Runtime security for tool-using and retrieval agents

Theme: Monitoring-control gaps in RAG and prompt security

Theme: Jailbreaks, covert channels, and poisoning beyond standard threat models

  • Why it matters: Safety defenses optimized for obvious prompts or fine-tuning attacks are being bypassed by attacks that exploit model internals, reasoning traces, or training data. The attack surface is broader than “bad prompt in, bad answer out.”
  • Representative papers:
  • Common approach:
    • Exploit reasoning structure or latent refusal directions instead of relying on explicit jailbreak strings
    • Use semantic or conceptual carriers that survive paraphrasing and simple sanitization
    • Show attacks remain effective under low-cost, gradient-free, or low-poison-ratio settings
    • Test against existing defenses designed for narrower assumptions, such as adversarial fine-tuning or lexical triggers
  • Open questions / failure modes:
    • Stealth and detectability remain under-measured for several covert-channel attacks
    • Some mitigations help but do not restore no-attack baselines
    • Results often depend on capable oracles, shared knowledge, or specific defense families
    • Multi-turn and real deployment interfaces may change attack success in ways not yet fully measured

Theme: RL for agents is becoming more selective, local, and structure-aware

Theme: Benchmarks are exposing hidden weaknesses in memory, personalization, grounding, and security

Theme: Efficiency and deployment realism are driving architecture choices

3) Technical synthesis

  • A recurring design pattern is separating observation from authority: AuthGraph separates execution provenance from clean authorization, ChainCaps separates value budgets from tool permissions, and Cordon-MAS separates raw evidence readers from final synthesizers.
  • Several security papers converge on information-flow control as the right abstraction for agents and RAG, replacing content moderation-style thinking with provenance, sink constraints, and certified evidence paths.
  • Detector-only evaluation is being challenged across domains: prompt injection detection varies sharply by regime and threshold; contradiction acknowledgement in RAG does not predict safe action; contradiction-aware prompt defenses still fail under poisoning.
  • RL papers increasingly optimize where signal lives, not just how to optimize it: Pilot-Commit targets high-variance prompts, GraphGPO targets graph-local progress, and StepOPSD targets action-centered step spans.
  • There is a broad move from trajectory-level to localized supervision: graph edges, step segments, parameter sources, claim cards, and memory-operation failures all reflect finer-grained decomposition.
  • Multiple papers show benchmark realism lowers apparent capability: SEC-bench Pro keeps top single-agent success below 40%; VitaBench 2.0 tops out around 0.5 Avg@4 even with full context; QUACK shows high-win agents still hallucinate socially grounded facts.
  • Several works highlight non-monotonicity: harness complexity does not scale cleanly with model tier, stronger internal models can worsen memory systems, and larger RAG models can show worse monitoring-control gaps.
  • Safety and utility trade-offs are increasingly measured with deployment-native metrics: low-FPR TPR, benign completion, approval rate, answerability, advanced-judge routing counts, and real latency on phones or production traffic.
  • Across jailbreak and poisoning work, the common failure is overfitting defenses to a narrow attack model—fine-tuning defenses miss abliteration/prefill, paraphrasing misses conceptual channels, and prompt defenses miss semantic covert control.
  • Sparse systems work is bifurcating into two regimes: frontier agentic MoE for long-horizon capability and mobile MoE for edge deployment, but both rely on careful routing, training stability, and runtime-aware design.
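
The point that detector quality depends on the operating threshold can be made concrete with the "TPR at low FPR" metric mentioned above. The sketch below is a plain-Python illustration; the function name `tpr_at_fpr` and the toy scores are invented for this example and do not come from any of the papers.

```python
def tpr_at_fpr(scores_benign, scores_attack, max_fpr):
    """Highest true-positive rate achievable while keeping FPR <= max_fpr.

    An input is flagged when its score is >= the threshold.
    """
    # Candidate thresholds: every observed score, plus "flag nothing".
    candidates = sorted(set(scores_benign) | set(scores_attack)) + [float("inf")]
    best = 0.0
    for threshold in candidates:
        fpr = sum(s >= threshold for s in scores_benign) / len(scores_benign)
        tpr = sum(s >= threshold for s in scores_attack) / len(scores_attack)
        if fpr <= max_fpr:
            best = max(best, tpr)
    return best


# Invented detector scores: higher means "more likely an injection".
benign = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.45, 0.50, 0.60, 0.90]
attack = [0.30, 0.55, 0.70, 0.80, 0.95]

for budget in (0.0, 0.1, 0.2):
    print(budget, tpr_at_fpr(benign, attack, budget))
```

On these toy scores the same detector catches one attack in five when no false positives are tolerated and four in five at a 20% false-positive budget. A single headline accuracy number hides that spread.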

4) Top 5 papers (with “why now”)

  • ChainCaps: Composition-Safe Tool-Using Agents via Monotonic Capability Attenuation
    • Reframes agent safety around permission laundering, where individually allowed tool calls compose into unsafe end-to-end behavior.
    • Implements a practical transparent MCP proxy with monotonic budget propagation and a non-amplification theorem.
    • Reports large live-eval gains: attack success drops from 25.2–67.8% to 0.0–4.8% with 96–100% benign completion.
    • Useful now because MCP-style tool ecosystems are expanding faster than robust runtime policy layers.
    • Skepticism / limitation: guarantees depend on trusted manifests and proxy-visible explicit flows; manifest quality is the main deployment bottleneck.
  • Aligning Provenance with Authorization: A Dual-Graph Defense for LLM Agents
    • Adds a strong missing primitive for agent security: parameter-source authorization, not just tool-call validation.
    • Separates manipulated execution traces from a clean authorization graph, then checks both tool sequence and parameter provenance.
    • On AgentDojo and AgentDyn, reduces ASR to around 0.01–0.02 while preserving relatively high utility.
    • Useful now because indirect prompt injection is increasingly about cross-tool contamination, not only overt malicious calls.
    • Skepticism / limitation: does not handle same-observation poisoning and depends on graph-builder attribution quality.
  • SEC-bench Pro: Can Language Models Solve Long-Horizon Software Security Tasks?
    • Introduces a realistic, Dockerized benchmark for bug hunting on V8 and SpiderMonkey, with vulnerable/fixed/latest images and attribution-aware grading.
    • Shows frontier agents remain far from robust: best single-agent success is 32.0% on V8 and 38.8% on SpiderMonkey.
    • Demonstrates that crash-only grading would inflate success by 43.6%, which is a major warning for current eval practice.
    • Useful now because software security agents are being marketed aggressively, but realistic measurement is lagging.
    • Skepticism / limitation: current instantiation is limited to two JavaScript engines and one open-weight baseline is only partially evaluated.
  • Spend Your Rollouts Where It Counts: Rollout Allocation for Group-Based RL Post-Training
    • Offers a simple but high-leverage systems idea: use pilot rollouts to estimate prompt informativeness, then commit budget only where variance is useful.
    • Reaches target accuracy with 1.5–1.9x fewer rollouts than GRPO and 2.3–4.0x fewer than DAPO in ample-budget settings.
    • Includes practical machinery—binding, replay, solved-prompt eviction—that makes it more deployable than a purely theoretical proposal.
    • Useful now because rollout generation is one of the main cost centers in reasoning-model post-training.
    • Skepticism / limitation: currently tailored to binary-verifiable rewards and math-style tasks.
  • The Coverage Illusion: From Pre-retrieval Routing Failure to Post-retrieval Cascades in a Production RAG System
    • Provides rare production evidence that synthetic evals can badly mislead routing policy: synthetic data suggests augmentation is almost always needed, real traffic says only 27.8% of queries need it.
    • Shows pre-retrieval routing from query text alone largely fails in this entity-heavy reference setting.
    • A simple post-retrieval cascade improves quality over Always-HyDE while cutting latency by 31.8%.
    • Useful now because many teams are overusing expensive LLM augmentation based on benchmark assumptions rather than traffic reality.
    • Skepticism / limitation: findings are tightly tied to one encyclopedia deployment and a deferral-heavy policy.
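
The rollout-allocation idea in the fourth entry can be sketched in a few lines. This is a simplified illustration, not the paper's algorithm: it assumes binary rewards, uses the Bernoulli variance of the pilot success rate as the informativeness estimate, and leaves out the binding, replay, and eviction machinery. The function name `allocate_rollouts` and the pilot data are hypothetical.

```python
def allocate_rollouts(pilot_rewards: dict[str, list[int]], commit_budget: int) -> dict[str, int]:
    """Split commit_budget across prompts in proportion to pilot reward variance."""
    variance = {}
    for prompt, rewards in pilot_rewards.items():
        p = sum(rewards) / len(rewards)   # pilot success rate for a 0/1 reward
        variance[prompt] = p * (1.0 - p)  # zero when every pilot rollout agrees
    total = sum(variance.values())
    if total == 0:
        return {prompt: 0 for prompt in pilot_rewards}
    # Floor each share so the allocation never exceeds the budget.
    return {prompt: int(commit_budget * v / total) for prompt, v in variance.items()}


# Invented pilot outcomes: four rollouts per prompt, 1 = verified correct.
pilot = {
    "always_solved": [1, 1, 1, 1],
    "never_solved": [0, 0, 0, 0],
    "coin_flip": [1, 0, 1, 0],
    "mostly_solved": [1, 1, 1, 0],
}
plan = allocate_rollouts(pilot, commit_budget=28)
print(plan)
```

Prompts whose pilot rollouts all pass or all fail produce no within-group contrast, so this sketch gives them nothing and spends the budget on the two mixed-outcome prompts.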

5) Practical next steps

  • Treat agent safety reviews as runtime systems design, not just prompt-defense tuning: add provenance checks, sink policies, and pre-execution gates for irreversible actions.
  • For tool-using agents, audit whether you can answer: which observation supplied each tool argument, and was that source authorized?
  • Add multi-turn, persistent-cache evaluations to any RAG safety suite; single-turn contradiction tests are likely overstating safety.
  • If you run GRPO-style post-training, test variance-aware rollout allocation and step-local credit shaping before scaling raw rollout budgets.
  • Benchmark memory systems by failure mode decomposition—summarization, storage, retrieval—not only end-task accuracy.
  • For open-weight safety, include gradient-free attacks such as abliteration and prefilling in every defense evaluation; adversarial fine-tuning alone is too narrow.
  • For CoT monitoring, assume paraphrasing is insufficient; test whether reasoning traces can carry behavior-level covert channels that survive lexical rewriting.
  • In production RAG, validate routing and augmentation policies on real traffic distributions, and consider post-retrieval cascades before query-only routers.
  • For high-stakes domains like finance or healthcare, measure benign approval / utility alongside ASR, and prefer inline controls that can intervene before state-changing tool calls.
  • When evaluating personalization or proactive assistants, compare full-context vs memory-backed settings explicitly; if memory hurts, the bottleneck is likely retrieval/update quality rather than model reasoning alone.
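
For the production RAG recommendation above, a post-retrieval cascade can be sketched as follows. It assumes the decision signal is the top retrieval score compared against a fixed threshold, which is a guess at one simple policy rather than the deployed system's rule. `cascade`, `toy_retrieve`, `toy_augment`, the index contents, and the 0.5 threshold are all hypothetical stand-ins.

```python
from typing import Callable

Hit = tuple[str, float]  # (document id, retrieval score)


def cascade(query: str,
            retrieve: Callable[[str], list[Hit]],
            augment: Callable[[str], str],
            min_score: float = 0.5) -> tuple[list[Hit], bool]:
    """Retrieve first; call the expensive augmenter only if the top hit is weak.

    Returns the hits and a flag saying whether augmentation was used.
    """
    hits = retrieve(query)
    if hits and max(score for _, score in hits) >= min_score:
        return hits, False
    return retrieve(augment(query)), True


# Toy stand-ins for a real retriever and an LLM-based query rewriter.
INDEX = {
    "marie curie": [("doc-curie", 0.91)],
    "curie": [("doc-curie", 0.32)],
}


def toy_retrieve(query: str) -> list[Hit]:
    return INDEX.get(query, [])


def toy_augment(query: str) -> str:
    return "marie " + query  # a real system would call an LLM here


print(cascade("marie curie", toy_retrieve, toy_augment))
print(cascade("curie", toy_retrieve, toy_augment))
```

The design choice is that the routing decision uses evidence from the first retrieval pass rather than the query text alone, so the costly rewrite runs only for queries that actually retrieved poorly.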

Generated from per-paper analyses; no external browsing.