Takeaways

**Agent security is shifting from prompt filtering to runtime control of information flow and authority.** The strongest papers today enforce provenance, authorization, or capability attenuation during execution rather than trying to classify bad prompts alone.
**Detection is repeatedly shown to be insufficient without control.** This appears in RAG poisoning, prompt injection, and multi-turn contradiction settings: systems can recognize risk or conflict yet still take unsafe actions.
**Long-horizon agent training is moving toward finer-grained credit assignment and smarter sampling.** Several RL papers improve efficiency by reallocating rollouts or assigning step-level credit using graphs or hindsight rescoring instead of blunt trajectory-level rewards.

Start with: ChainCaps: Composition-Safe Tool-Using Agents via Monotonic Capability Attenuation

Why it catches my eye: It offers a reusable runtime control pattern for tool agents and reports strong attack reduction without collapsing benign task completion.

Read skeptically for: Its guarantees rely on trusted manifests and visible flows, so hidden channels or bad policy specs can still break safety.

agent-safety tool-use runtime-guardrails

Themes

Runtime security for tool-using and retrieval agents: The most credible defenses today are not just better classifiers; they are execution-time mechanisms that constrain what information can flow where and which actions can be taken. This is especially relevant for agents with tools, persistent memory, or external data access.

Monitoring-control gaps in RAG and prompt security: Multiple papers show that recognizing danger, contradiction, or injection structure does not guarantee safe behavior. This weakens confidence in detector-only defenses and benchmark setups that stop at awareness metrics.

Jailbreaks, covert channels, and poisoning beyond standard threat models: Safety defenses optimized for obvious prompts or fine-tuning attacks are being bypassed by attacks that exploit model internals, reasoning traces, or training data. The attack surface is broader than “bad prompt in, bad answer out.”

Signal: Runtime control is replacing prompt filtering. ChainCaps, Dual-Graph Defense, Cordon-MAS, and FinHarness all constrain provenance, permissions, or action flow during execution rather than only classifying prompts.

Tension: Detection often fails to change behavior. Prompt injection and RAG papers show systems can detect contradictions or risky structure yet still act unsafely under deployment constraints.

Bet: Agent training will get more local. Rollout allocation, graph-based credit assignment, and step-aware preference distillation all shift RL from blunt trajectory-level rewards toward step-level signals.

Papers Worth Your Reading Time

Ranked for research usefulness: novelty, method pattern, evidence quality, and skepticism value.

ChainCaps: Composition-Safe Tool-Using Agents via Monotonic Capability Attenuation

A practical runtime defense against permission laundering in tool agents, with a clear systems abstraction and strong live-eval results.

Why now: MCP-style tool ecosystems are scaling faster than robust permission models.
Skepticism: Trusted manifests and proxy-visible flows are strong assumptions in messy deployments.

Aligning Provenance with Authorization: A Dual-Graph Defense for LLM Agents

A complementary security primitive that checks where tool arguments came from, not just whether a tool call looks allowed.

Why now: Indirect prompt injection is increasingly about cross-tool contamination and provenance loss.
Skepticism: It depends on accurate graph attribution and does not solve same-observation poisoning.

SEC-bench Pro: Can Language Models Solve Long-Horizon Software Security Tasks?

A realistic benchmark that sharply lowers apparent agent capability and exposes how weak current grading shortcuts are.

Why now: Security agents are being marketed aggressively, but realistic long-horizon evaluation is still scarce.
Skepticism: The benchmark currently centers on two JavaScript engines, limiting breadth.

Run stats

Candidates: 350
Selected: 30
Deepread completed: 30
Window (UTC): 2026-05-26T00:00:00Z → 2026-05-27T00:00:00Z (arxiv_announce, expanded=0)

arXiv ID	Title / Links	Categories	Score	Why	Tags
`2605.27110`	BAIT: Boundary-Guided Disclosure Escalation via Self-Conditioned Reasoning PDF	cs.CR, cs.CL	96	Strong jailbreak method exploiting self-conditioned reasoning; directly relevant to LLM security evals.	jailbreak, LLM-security, red-teaming, prompting, safety-evaluation
`2605.26497`	Aligning Provenance with Authorization: A Dual-Graph Defense for LLM Agents PDF	cs.CR	95	Dual-graph defense targets indirect prompt injection with provenance-aware authorization checks.	agent-safety, prompt-injection, tool-use, authorization, provenance, security
`2605.26409`	Jailbreak susceptibility prediction and mitigation via the behavioral geometry of models PDF	cs.CR, cs.AI, cs.LG	95	Strong jailbreak eval+mitigation transfer framework with major probe-efficiency gains across many models.	jailbreaks, safety-evaluation, robustness, defense-transfer, behavioral-geometry
`2605.27042`	Lessons from Penetration Tests on Large-Scale Agent Systems PDF	cs.CR, cs.AI	95	Pen-test lessons for large-scale agents; directly targets real-world agent security failures.	agent-security, penetration-testing, autonomy, system-security, ai-safety
`2605.26999`	Prompt Injection Detection is Regime-Dependent: A Deployment-Aware Evaluation with Interpretable Structural Signals PDF	cs.CL, cs.CR	95	Deployment-aware prompt injection detection with interpretable signals; directly relevant to agent security.	prompt-injection, agent-safety, security, evaluation, OOD, detection
`2605.26754`	Cordon-MAS: Defending RAG against Knowledge Poisoning via Information-Flow Control PDF	cs.CR, cs.AI	94	Architectural RAG defense against knowledge poisoning; strong safety framing and reusable design.	RAG, knowledge-poisoning, information-flow-control, multi-agent, security, grounding
`2605.26542`	ChainCaps: Composition-Safe Tool-Using Agents via Monotonic Capability Attenuation PDF	cs.CR, cs.AI	94	Practical runtime safety for tool agents; prevents permission laundering via composition-safe capabilities.	agent-safety, tool-use, permissions, sandboxing, runtime-guardrails
`2605.26537`	Conceptual Steganography PDF	cs.CL	94	CoT steganography via reasoning patterns, robust to paraphrasing; important hidden-channel safety risk.	steganography, chain-of-thought, oversight, misalignment, security
`2605.26595`	Cordyceps: Covert Control Attacks on LLMs via Data Poisoning PDF	cs.CR, cs.AI, cs.LG	93	Introduces stealthy poisoning-based covert control attacks on LLMs across models and defenses.	data-poisoning, backdoors, LLM-security, covert-control, adversarial-ml
`2605.26667`	MemFail: Stress-Testing Failure Modes of LLM Memory Systems PDF	cs.AI, cs.LG	93	Diagnostic benchmark for LLM memory failure modes; highly relevant to long-horizon agent reliability.	llm-agents, memory, benchmark, reliability, evaluation
`2605.26731`	It's Not the Capability: Harness Sensitivity Is Non-Monotone Across LLM Agent Tiers PDF	cs.AI, cs.CL	93	Shows harness complexity can hurt frontier agents; actionable reliability insight for agent deployment.	agents, reliability, evaluation, deployment, harness, benchmark
`2605.27355`	Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases PDF	cs.AI, cs.CL, cs.LG	92	Identifies RLHF data-generation vulnerability where models can steer preferences toward misaligned biases.	alignment, RLHF, preference-modeling, bias, data-generation-risks
`2605.26494`	The MiniMax-M2 Series: Mini Activations Unleashing Max Real-World Intelligence PDF	cs.AI, cs.CL, cs.LG	92	Large agent-native MoE LLM with RL/data pipeline details; likely impactful frontier agent progress.	frontier-llm, agents, MoE, RL, post-training, coding
`2605.27333`	FinHarness: An Inline Lifecycle Safety Harness for Finance LLM Agents PDF	cs.CL	91	Inline safety harness for finance agents monitors intent drift and risky tool calls before action.	agent-safety, finance, tool-monitoring, runtime-guardrails, LLM-judge, security
`2605.27288`	It's Not Always Sycophancy: Measuring LLM Conformity as a Function of Epistemic Uncertainty PDF	cs.CL, cs.AI, cs.LG	91	Disentangles sycophancy from uncertainty-driven conformity; useful for alignment diagnosis and evals.	alignment, sycophancy, uncertainty, evaluation, reliability
`2605.26526`	Open-Weight LLM Fine-Tuning Defenses are Susceptible to Simple Attacks PDF	cs.LG, cs.CR	90	Shows open-weight fine-tuning defenses fail under simple jailbreak-style attacks; high practical impact.	jailbreaks, open-weight-llms, defenses, red-teaming, misuse, security
`2605.27157`	Detecting Is Not Resolving: The Monitoring Control Gap in Retrieval Augmented LLMs PDF	cs.AI	90	Shows RAG models detect contradictions yet fail to act safely; important gap for agentic deployment.	RAG, reliability, monitoring, multi-turn-evaluation, safety
`2605.27141`	VitaBench 2.0: Evaluating Personalized and Proactive Agents in Long-Term User Interactions PDF	cs.AI	90	Benchmark for personalized, proactive agents in long-term interactions; useful for realistic agent eval.	agents, benchmark, personalization, proactivity, long-horizon
`2605.27016`	Evaluating the Relevance of Uncertainty Estimators for LLM Hallucination PDF	cs.CL, cs.AI, cs.LG, stat.ML	90	Systematic study of when uncertainty estimates track LLM hallucinations; strong reliability relevance.	hallucination, uncertainty, reliability, evaluation, factuality, LLM
`2605.27358`	MobileMoE: Scaling On-Device Mixture of Experts PDF	cs.LG, cs.AI, cs.CL	90	On-device MoE scaling law plus strong deployment-oriented models; notable frontier LLM efficiency work.	MoE, scaling-laws, efficient-LLMs, on-device, architecture
`2605.26691`	Mind the Tool Failures: Achieving Synergistic Tool Gains for Medical Agents PDF	cs.AI	89	Studies unsafe tool failures in medical agents and instance-wise selection under imperfect tools.	tool-use, medical-agents, safety, reliability, decision-making
`2605.26548`	SEC-bench Pro: Can Language Models Solve Long-Horizon Software Security Tasks? PDF	cs.CR, cs.LG	88	Realistic benchmark for long-horizon agentic software security tasks with validated vulnerabilities.	benchmark, agents, software-security, evaluation, long-horizon, bug-hunting
`2605.26918`	Are Video Models Zero-Shot Learners and Reasoners in Education? EduVideoBench, A Knowledge-Skills-Attitude Benchmark for Educational Video Generation PDF	cs.CL	88	Benchmark for educational validity and safety of video models; useful eval framing beyond generic safety.	benchmark, video-models, safety, evaluation, multimodal, education
`2605.27220`	The Coverage Illusion: From Pre-retrieval Routing Failure to Post-retrieval Cascades in a Production RAG System PDF	cs.CL, cs.IR	88	Production RAG study with concrete traffic evidence on routing/augmentation failures and cost tradeoffs.	RAG, retrieval, evaluation, production-systems, efficiency
`2605.27083`	On the Hidden Costs of Counterfactual Knowledge Training in LLM Unlearning PDF	cs.CL, cs.CR	87	Important unlearning critique: counterfactual tuning can induce conflicts and broader hallucination spillover.	unlearning, hallucination, reliability, knowledge-editing, benchmark
`2605.27140`	StepOPSD: Step-Aware Online Preference Distillation for Agent Reinforcement Learning PDF	cs.AI	87	Step-level preference distillation for agent RL addresses credit assignment in multi-turn agents.	agent-rl, preference-learning, distillation, multi-turn, training
`2605.26606`	Spend Your Rollouts Where It Counts: Rollout Allocation for Group-Based RL Post-Training PDF	cs.LG, cs.AI	87	Improves rollout allocation for RL post-training of LLMs; practical efficiency for frontier training.	RLHF, post-training, efficiency, LLM, rollouts, optimization
`2605.26684`	Beyond Trajectory-Level Attribution: Graph-Based Credit Assignment for Agentic Reinforcement Learning PDF	cs.LG, cs.AI	87	Improves step-level credit assignment for agentic RL using graph structure; promising for agent training.	agents, reinforcement-learning, credit-assignment, LLM-agents, reasoning
`2605.27068`	QUACK: Questioning, Understanding, and Auditing Communicated Knowledge in Multimodal Social Deduction Agents PDF	cs.CL, cs.AI, cs.MA	86	Audits grounding and utterance consistency in multimodal social deduction agents beyond win rates.	agent-evaluation, multimodal, auditing, grounding, social-deduction, benchmark
`2605.26784`	Ratio-Variance Regularized Policy Optimization PDF	cs.LG, cs.AI	86	Principled PPO-style alternative with ratio-variance control, evaluated across LLM scales.	reinforcement-learning, post-training, optimization, LLM, policy-optimization

AI Paper Insight Brief

2026-05-26

0) Executive takeaways (read this first)

Agent security is shifting from prompt filtering to runtime control of information flow and authority. The strongest papers today enforce provenance, authorization, or capability attenuation during execution rather than trying to classify bad prompts alone.
Detection is repeatedly shown to be insufficient without control. This appears in RAG poisoning, prompt injection, and multi-turn contradiction settings: systems can recognize risk or conflict yet still take unsafe actions.
Long-horizon agent training is moving toward finer-grained credit assignment and smarter sampling. Several RL papers improve efficiency by reallocating rollouts or assigning step-level credit using graphs or hindsight rescoring instead of blunt trajectory-level rewards.
Benchmarks are getting more deployment-shaped—and they are lowering apparent capability. Realistic evaluations in software security, personalization, memory, social grounding, and production RAG all show weaker performance than headline benchmark numbers would suggest.
Open-weight and aligned models still expose simple attack surfaces. Gradient-free jailbreaks, self-conditioned disclosure attacks, covert channels in CoT, and poisoning-based covert control all bypass defenses that look stronger under narrower threat models.
Sparse efficiency is becoming practical at both ends of the stack. One paper pushes low-activation MoE for frontier agentic systems; another shows MoE can now be viable on phones with real deployment measurements.

2) Key themes (clusters)

Theme: Runtime security for tool-using and retrieval agents

Why it matters: The most credible defenses today are not just better classifiers; they are execution-time mechanisms that constrain what information can flow where and which actions can be taken. This is especially relevant for agents with tools, persistent memory, or external data access.
Common approach:
- Separate clean authorization/planning from observed execution provenance
- Track parameter sources, sink permissions, or certified evidence rather than only tool names or final outputs
- Insert inline checks before irreversible actions, often with bounded-history or low-latency routing
- Treat safety as an information-flow problem: who saw what, what was derived from it, and what can be done next
Open questions / failure modes:
- Same-source poisoning or collusion can still pass provenance checks
- Manifest/policy quality is a major bottleneck for deployment
- Proxy-visible enforcement misses covert channels, shell laundering, or effects outside the monitored boundary
- Utility can drop in multi-hop or ambiguous settings when evidence is aggressively filtered

Theme: Monitoring-control gaps in RAG and prompt security

Why it matters: Multiple papers show that recognizing danger, contradiction, or injection structure does not guarantee safe behavior. This weakens confidence in detector-only defenses and benchmark setups that stop at awareness metrics.
Common approach:
- Evaluate under deployment constraints: low-FPR thresholds, OOD regimes, persistent caches, or real attack traces
- Compare detection/awareness signals against actual downstream action safety
- Use interpretable structural features or multi-turn protocols to expose failure modes hidden by aggregate metrics
- Validate with human checks, ablations, or penetration tests rather than only synthetic leaderboard scores
Open questions / failure modes:
- Detector quality is highly regime-dependent; no single model dominates across settings
- Multi-turn accumulation can create failures absent in single-turn tests
- Human-meaningful interpretability does not automatically improve control
- Real systems remain vulnerable to classic security failures like overprivileged tools and weak isolation
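The "low-FPR thresholds" point above can be made concrete with a small calculation. The sketch below is not taken from any of the papers; it assumes a detector that emits a score per input and flags anything strictly above a threshold, and the function name `tpr_at_fpr` is hypothetical. It picks the lowest threshold that keeps the false-positive rate on benign inputs within a budget, then reports how many attacks are still caught.

```python
def tpr_at_fpr(scores, labels, max_fpr=0.01):
    """Return (threshold, tpr) for the lowest threshold whose FPR <= max_fpr.

    scores: detector scores, higher means "more likely an attack".
    labels: 1 for attack, 0 for benign.
    An input is flagged when its score is strictly greater than the threshold.
    """
    benign = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    attacks = [s for s, y in zip(scores, labels) if y == 1]
    if not benign or not attacks:
        raise ValueError("need both benign and attack samples")
    # Number of benign samples we are allowed to flag under the FPR budget.
    allowed_fp = int(max_fpr * len(benign))
    # Using the (allowed_fp + 1)-th highest benign score as the cutoff means
    # at most allowed_fp benign samples score strictly above it.
    threshold = benign[allowed_fp] if allowed_fp < len(benign) else float("-inf")
    tpr = sum(s > threshold for s in attacks) / len(attacks)
    return threshold, tpr


# Toy data (made up): one benign sample scores very high.
scores = [0.1, 0.2, 0.3, 0.9, 0.5, 0.95, 0.8]
labels = [0, 0, 0, 0, 1, 1, 1]
loose = tpr_at_fpr(scores, labels, max_fpr=0.25)
strict = tpr_at_fpr(scores, labels, max_fpr=0.0)
```

On this toy data the detector catches every attack when one benign false positive in four is tolerated, but only one attack in three when no false positives are allowed, which is the kind of gap an aggregate metric can hide.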

Theme: Jailbreaks, covert channels, and poisoning beyond standard threat models

Why it matters: Safety defenses optimized for obvious prompts or fine-tuning attacks are being bypassed by attacks that exploit model internals, reasoning traces, or training data. The attack surface is broader than “bad prompt in, bad answer out.”
Common approach:
- Exploit reasoning structure or latent refusal directions instead of relying on explicit jailbreak strings
- Use semantic or conceptual carriers that survive paraphrasing and simple sanitization
- Show attacks remain effective under low-cost, gradient-free, or low-poison-ratio settings
- Test against existing defenses designed for narrower assumptions, such as adversarial fine-tuning or lexical triggers
Open questions / failure modes:
- Stealth and detectability remain under-measured for several covert-channel attacks
- Some mitigations help but do not restore no-attack baselines
- Results often depend on capable oracles, shared knowledge, or specific defense families
- Multi-turn and real deployment interfaces may change attack success in ways not yet fully measured

Theme: RL for agents is becoming more selective, local, and structure-aware

Why it matters: Long-horizon agent RL is moving away from uniform rollout budgets and trajectory-level credit. The new pattern is to spend compute where variance or causal leverage is highest and to shape updates around steps or graph structure.
Common approach:
- Use online prompt informativeness estimates to allocate rollouts instead of sampling uniformly
- Replace trajectory-level reward broadcasting with step-level or graph-based credit
- Stabilize optimization with local trust-region surrogates or clipped hindsight shaping
- Reuse data more effectively via off-policy replay or delayed teacher references
Open questions / failure modes:
- Many methods assume binary/verifiable rewards or deterministic environments
- State matching and step extraction can be brittle in noisy, high-dimensional settings
- Hyperparameters remain task-dependent, especially shaping strength and thresholds
- Better rollout efficiency does not always imply better data efficiency per consumed prompt
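To illustrate the difference between trajectory-level reward broadcasting and step-level credit, here is a minimal sketch. It does not reproduce any paper's method: it assumes a per-step progress estimate is available (for example from a state graph or a rescoring model), and the names `broadcast_credit`, `step_credit`, and `progress` are hypothetical.

```python
def broadcast_credit(num_steps, final_reward):
    """Trajectory-level baseline: every step receives the same final reward."""
    return [final_reward] * num_steps


def step_credit(progress):
    """Step-level credit as the change in estimated progress at each step.

    progress[0] is the estimate before the first action and progress[i] the
    estimate after step i, so the list has num_steps + 1 entries. The credits
    telescope: their sum equals progress[-1] - progress[0].
    """
    return [after - before for before, after in zip(progress, progress[1:])]


# Toy trajectory (made up): step 2 does nothing, step 3 moves backwards.
progress = [0.0, 0.5, 0.5, 0.25, 1.0]
uniform = broadcast_credit(num_steps=4, final_reward=1.0)
local = step_credit(progress)
```

Here `uniform` rewards the wasted and harmful steps as much as the useful ones, while `local` gives them zero and negative credit respectively and still sums to the final outcome. The quality of the result depends entirely on the progress estimates, which is where the brittleness noted in the open questions comes in.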

Theme: Benchmarks are exposing hidden weaknesses in memory, personalization, grounding, and security

Why it matters: More realistic benchmarks are revealing that current agents often fail on the exact capabilities needed for deployment: persistent user modeling, memory reliability, grounded communication, and long-horizon bug hunting.
Common approach:
- Move from black-box success metrics to diagnostic decomposition: attribution, retrieval vs summarization vs storage, utterance grounding, or preference updating
- Use replayable environments, Dockerized targets, or temporally ordered user histories
- Measure complementarity and failure signatures, not just single best-model scores
- Emphasize realistic uncertainty and sparse hints rather than over-informative benchmark inputs
Open questions / failure modes:
- Many benchmarks remain synthetic or scoped to a few domains
- Memory systems often degrade when added, rather than helping
- Strong models can still win while making unsupported claims
- Single-agent scores remain low on realistic software security tasks, suggesting large headroom

Theme: Efficiency and deployment realism are driving architecture choices

Why it matters: Efficiency work is no longer just about FLOPs; it is tied to real deployment regimes, from 192K-context agent systems to smartphone inference and production RAG latency.
Common approach:
- Use sparse activation to decouple total capacity from active compute
- Optimize for real serving constraints: memory budgets, latency, cache behavior, and tool-loop throughput
- Prefer post-retrieval or runtime routing when query-only prediction is too weak
- Validate on real devices or production traffic, not only synthetic benchmarks
Open questions / failure modes:
- Internal or proprietary benchmarks limit reproducibility
- Full attention and custom kernels still carry infrastructure cost
- Production findings may be highly distribution-specific
- Some gains depend on scaffolds or deployment assumptions that may not transfer
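The claim that sparse activation decouples total capacity from active compute is easy to check with back-of-the-envelope arithmetic. The sketch below is a simplification with assumed shapes, not any paper's configuration: it counts only the expert feed-forward weights and a linear router per layer, and ignores attention, embeddings, and biases.

```python
def moe_param_counts(num_layers, d_model, d_ff, num_experts, top_k):
    """Return (total, active) parameter counts for the MoE feed-forward stack.

    Assumes each expert is a two-matrix FFN (d_model x d_ff and d_ff x d_model)
    and each layer has one linear router of size d_model x num_experts.
    """
    expert_params = 2 * d_model * d_ff
    router_params = d_model * num_experts
    # All experts are stored, but only top_k of them run for a given token.
    total = num_layers * (num_experts * expert_params + router_params)
    active = num_layers * (top_k * expert_params + router_params)
    return total, active


# Tiny illustrative shapes (made up) so the numbers are easy to verify by hand.
total, active = moe_param_counts(
    num_layers=2, d_model=4, d_ff=8, num_experts=4, top_k=1
)
```

With these toy shapes, `total` is 544 and `active` is 160: adding experts grows stored capacity roughly linearly while per-token compute stays tied to `top_k`. The memory budget still has to hold every expert, which is why serving constraints appear in the same list.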

3) Technical synthesis

A recurring design pattern is separating observation from authority: AuthGraph separates execution provenance from clean authorization, ChainCaps separates value budgets from tool permissions, and Cordon-MAS separates raw evidence readers from final synthesizers.
Several security papers converge on information-flow control as the right abstraction for agents and RAG, replacing content moderation-style thinking with provenance, sink constraints, and certified evidence paths.
Detector-only evaluation is being challenged across domains: prompt injection detection varies sharply by regime and threshold; contradiction acknowledgement in RAG does not predict safe action; contradiction-aware prompt defenses still fail under poisoning.
RL papers increasingly optimize where signal lives, not just how to optimize it: Pilot-Commit targets high-variance prompts, GraphGPO targets graph-local progress, and StepOPSD targets action-centered step spans.
There is a broad move from trajectory-level to localized supervision: graph edges, step segments, parameter sources, claim cards, and memory-operation failures all reflect finer-grained decomposition.
Multiple papers show benchmark realism lowers apparent capability: SEC-bench Pro keeps top single-agent success below 40%; VitaBench 2.0 tops out around 0.5 Avg@4 even with full context; QUACK shows high-win agents still hallucinate socially grounded facts.
Several works highlight non-monotonicity: harness complexity does not scale cleanly with model tier, stronger internal models can worsen memory systems, and larger RAG models can show worse monitoring-control gaps.
Safety and utility trade-offs are increasingly measured with deployment-native metrics: true-positive rate at a low false-positive rate (low-FPR TPR), benign completion, approval rate, answerability, advanced-judge routing counts, and real latency on phones or production traffic.
Across jailbreak and poisoning work, the common failure is overfitting defenses to a narrow attack model—fine-tuning defenses miss abliteration/prefill, paraphrasing misses conceptual channels, and prompt defenses miss semantic covert control.
Sparse systems work is bifurcating into two regimes: frontier agentic MoE for long-horizon capability and mobile MoE for edge deployment, but both rely on careful routing, training stability, and runtime-aware design.

4) Top 5 papers (with “why now”)

ChainCaps: Composition-Safe Tool-Using Agents via Monotonic Capability Attenuation
- Reframes agent safety around permission laundering, where individually allowed tool calls compose into unsafe end-to-end behavior.
- Implements a practical transparent MCP proxy with monotonic budget propagation and a non-amplification theorem.
- Reports large live-eval gains: attack success drops from 25.2–67.8% to 0.0–4.8% with 96–100% benign completion.
- Useful now because MCP-style tool ecosystems are expanding faster than robust runtime policy layers.
- Skepticism / limitation: guarantees depend on trusted manifests and proxy-visible explicit flows; manifest quality is the main deployment bottleneck.
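As a rough illustration of monotonic attenuation, the sketch below labels each value with the set of sinks it may flow into and gives every derived value the intersection of its inputs' sets. This is an assumption-laden toy, not the ChainCaps proxy or its manifest format: the `Labeled` class, `derive`, `call_sink`, and the sink names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Labeled:
    """A value plus the set of sinks it may flow into (hypothetical label format)."""
    value: str
    allowed_sinks: frozenset


def derive(value, *inputs):
    """Build a new value whose budget is the intersection of its inputs' budgets.

    The result is never allowed into a sink that any input was barred from,
    so composing tool calls can only attenuate authority, never amplify it.
    """
    if not inputs:
        raise ValueError("a derived value needs at least one input")
    budget = frozenset.intersection(*(i.allowed_sinks for i in inputs))
    return Labeled(value, budget)


def call_sink(sink, arg):
    """Gate a tool call: refuse if the argument's budget excludes this sink."""
    if sink not in arg.allowed_sinks:
        raise PermissionError(f"{arg.value!r} may not flow into {sink}")
    return f"{sink}({arg.value})"


# Reading a secret and emailing web text are each allowed on their own.
secret = Labeled("api_key", frozenset({"local_log"}))
web_page = Labeled("page text", frozenset({"local_log", "send_email"}))
# Anything derived from both inherits the stricter budget.
summary = derive("summary of page and key", secret, web_page)
```

Calling `call_sink("local_log", summary)` succeeds, while `call_sink("send_email", summary)` raises `PermissionError`, which is the laundering path being closed. As the skepticism bullet notes, a scheme like this is only as good as the labels assigned at the source and only covers flows the enforcement point can see.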
Aligning Provenance with Authorization: A Dual-Graph Defense for LLM Agents
- Adds a strong missing primitive for agent security: parameter-source authorization, not just tool-call validation.
- Separates manipulated execution traces from a clean authorization graph, then checks both tool sequence and parameter provenance.
- On AgentDojo and AgentDyn, reduces attack success rate (ASR) to around 0.01–0.02 while preserving relatively high utility.
- Useful now because indirect prompt injection is increasingly about cross-tool contamination, not only overt malicious calls.
- Skepticism / limitation: does not handle same-observation poisoning and depends on graph-builder attribution quality.
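Parameter-source authorization can be sketched as a lookup: for each tool argument, which sources are permitted to supply it, checked against where the value actually came from at run time. The code below is a simplified illustration under that assumption and is not the paper's graph construction; the `AUTHORIZED` table, the tool and source names, and `check_call` are hypothetical.

```python
# Hypothetical authorization plan derived from the clean user request:
# tool -> argument -> set of sources allowed to supply that argument.
AUTHORIZED = {
    "send_money": {
        "recipient": {"user_prompt"},
        "amount": {"user_prompt", "get_invoice"},
    },
}


def check_call(tool, arg_sources, authorized=AUTHORIZED):
    """Return (allowed, reason) for a proposed tool call.

    arg_sources maps each argument name to the source its value was traced to
    in the execution trace (for example another tool's output).
    """
    if tool not in authorized:
        return False, f"tool {tool!r} is not in the authorization plan"
    for arg, source in arg_sources.items():
        if source not in authorized[tool].get(arg, set()):
            return False, f"argument {arg!r} came from unauthorized source {source!r}"
    return True, "ok"


# The tool call itself is allowed in both cases; only provenance differs.
benign = check_call("send_money", {"recipient": "user_prompt", "amount": "get_invoice"})
injected = check_call("send_money", {"recipient": "read_webpage", "amount": "get_invoice"})
```

The second call is rejected even though `send_money` is an allowed tool, because the recipient was traced to web content rather than the user. The check inherits any errors in source attribution, and it cannot help when an authorized source is itself poisoned.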
SEC-bench Pro: Can Language Models Solve Long-Horizon Software Security Tasks?
- Introduces a realistic, Dockerized benchmark for bug hunting on V8 and SpiderMonkey, with vulnerable/fixed/latest images and attribution-aware grading.
- Shows frontier agents remain far from robust: best single-agent success is 32.0% on V8 and 38.8% on SpiderMonkey.
- Demonstrates that crash-only grading would inflate success by 43.6%, which is a major warning for current eval practice.
- Useful now because software security agents are being marketed aggressively, but realistic measurement is lagging.
- Skepticism / limitation: current instantiation is limited to two JavaScript engines and one open-weight baseline is only partially evaluated.
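The gap between crash-only and attribution-aware grading can be illustrated with a toy grader. This is one plausible reading of the setup, assumed for illustration rather than taken from the benchmark's harness: a submission counts under crash-only grading if it crashes the vulnerable build, and under the stricter rule only if it also leaves the fixed build intact. The field names and sample results are made up.

```python
def crash_only(result):
    """Lenient grading: any crash on the vulnerable build counts as success."""
    return result["crash_vulnerable"]


def attribution_aware(result):
    """Stricter grading: the crash must disappear once the target bug is fixed."""
    return result["crash_vulnerable"] and not result["crash_fixed"]


def success_rate(results, grader):
    """Fraction of submissions the given grader accepts."""
    return sum(grader(r) for r in results) / len(results)


# Made-up submissions: the second one crashes both builds, so it hit some
# other bug rather than the one it was asked to find.
results = [
    {"crash_vulnerable": True, "crash_fixed": False},
    {"crash_vulnerable": True, "crash_fixed": True},
    {"crash_vulnerable": False, "crash_fixed": False},
    {"crash_vulnerable": True, "crash_fixed": False},
]
lenient = success_rate(results, crash_only)
strict = success_rate(results, attribution_aware)
```

On this toy set the lenient grader reports 0.75 and the strict one 0.5; the difference is exactly the submissions that crash for the wrong reason.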
Spend Your Rollouts Where It Counts: Rollout Allocation for Group-Based RL Post-Training
- Offers a simple but high-leverage systems idea: use pilot rollouts to estimate prompt informativeness, then commit budget only where variance is useful.
- Reaches target accuracy with 1.5–1.9x fewer rollouts than GRPO and 2.3–4.0x fewer than DAPO in ample-budget settings.
- Includes practical machinery—binding, replay, solved-prompt eviction—that makes it more deployable than a purely theoretical proposal.
- Useful now because rollout generation is one of the main cost centers in reasoning-model post-training.
- Skepticism / limitation: currently tailored to binary-verifiable rewards and math-style tasks.
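The pilot-then-commit idea can be sketched with binary rewards. The code below is an illustration under stated assumptions, not the paper's algorithm: it scores each prompt by the Bernoulli variance p(1 - p) of its pilot pass rate and splits the remaining budget proportionally, rounding down. The function name and prompt labels are hypothetical.

```python
def allocate_rollouts(pilot_rewards, commit_budget):
    """Split commit_budget across prompts in proportion to pilot reward variance.

    pilot_rewards maps a prompt id to a list of 0/1 rewards from pilot rollouts.
    Prompts whose pilots all passed or all failed have zero variance and get
    nothing, since a group with identical rewards yields no group-relative signal.
    """
    variances = {}
    for prompt, rewards in pilot_rewards.items():
        pass_rate = sum(rewards) / len(rewards)
        variances[prompt] = pass_rate * (1 - pass_rate)
    total = sum(variances.values())
    if total == 0:
        return {prompt: 0 for prompt in pilot_rewards}
    # Rounding down may leave a few rollouts unspent; that is acceptable here.
    return {p: int(commit_budget * v / total) for p, v in variances.items()}


# Made-up pilot results: four rollouts per prompt.
pilot = {
    "solved": [1, 1, 1, 1],
    "hopeless": [0, 0, 0, 0],
    "coin_flip": [1, 0, 1, 0],
    "rare_win": [1, 0, 0, 0],
}
plan = allocate_rollouts(pilot, commit_budget=28)
```

With these numbers the 28 committed rollouts go to `coin_flip` (16) and `rare_win` (12), and none to the two prompts whose pilots were unanimous. With only a handful of pilot rollouts the variance estimate is noisy, so a real system would likely need safeguards against writing off a prompt too early.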
The Coverage Illusion: From Pre-retrieval Routing Failure to Post-retrieval Cascades in a Production RAG System
- Provides rare production evidence that synthetic evals can badly mislead routing policy: synthetic data suggests augmentation is almost always needed, real traffic says only 27.8% of queries need it.
- Shows pre-retrieval routing from query text alone largely fails in this entity-heavy reference setting.
- A simple post-retrieval cascade improves quality over Always-HyDE while cutting latency by 31.8%.
- Useful now because many teams are overusing expensive LLM augmentation based on benchmark assumptions rather than traffic reality.
- Skepticism / limitation: findings are tightly tied to one encyclopedia deployment and a deferral-heavy policy.
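A post-retrieval cascade can be sketched as: retrieve with the raw query first, and only pay for augmentation when the first result looks weak. The code below is a toy under stated assumptions, not the production system described in the paper: scoring is plain token overlap, the threshold is arbitrary, and `stub_augment` stands in for an LLM-based rewrite such as HyDE without calling any model.

```python
def overlap_score(query, doc):
    """Fraction of query tokens that also appear in the document."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0


def retrieve(query, corpus):
    """Return (score, doc) for the best-matching document."""
    return max(((overlap_score(query, doc), doc) for doc in corpus), key=lambda t: t[0])


def cascade(query, corpus, augment, threshold=0.5):
    """Retrieve first; call the expensive augment step only on a weak match.

    Returns (doc, used_augmentation).
    """
    score, doc = retrieve(query, corpus)
    if score >= threshold:
        return doc, False
    aug_score, aug_doc = retrieve(augment(query), corpus)
    # Keep the augmented result only if it actually matches better.
    return (aug_doc, True) if aug_score > score else (doc, True)


def stub_augment(query):
    """Placeholder for an LLM rewrite; appends fixed hint terms for the demo."""
    return query + " turing machine"


corpus = [
    "ada lovelace wrote the first program",
    "alan turing proposed the turing machine",
]
easy = cascade("ada lovelace program", corpus, stub_augment)
hard = cascade("who invented computation theory", corpus, stub_augment)
```

The first query is answered directly and skips augmentation; the second has no overlap with the corpus, so it falls through to the augmented path. The routing decision uses evidence from retrieval itself rather than the query text alone, which is the contrast the bullets above draw with pre-retrieval routing.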

5) Practical next steps

Treat agent safety reviews as runtime systems design, not just prompt-defense tuning: add provenance checks, sink policies, and pre-execution gates for irreversible actions.
For tool-using agents, audit whether you can answer: which observation supplied each tool argument, and was that source authorized?
Add multi-turn, persistent-cache evaluations to any RAG safety suite; single-turn contradiction tests are likely overstating safety.
If you run GRPO-style post-training, test variance-aware rollout allocation and step-local credit shaping before scaling raw rollout budgets.
Benchmark memory systems by failure mode decomposition—summarization, storage, retrieval—not only end-task accuracy.
For open-weight safety, include gradient-free attacks such as abliteration and prefilling in every defense evaluation; adversarial fine-tuning alone is too narrow.
For CoT monitoring, assume paraphrasing is insufficient; test whether reasoning traces can carry behavior-level covert channels that survive lexical rewriting.
In production RAG, validate routing and augmentation policies on real traffic distributions, and consider post-retrieval cascades before query-only routers.
For high-stakes domains like finance or healthcare, measure benign approval / utility alongside ASR, and prefer inline controls that can intervene before state-changing tool calls.
When evaluating personalization or proactive assistants, compare full-context vs memory-backed settings explicitly; if memory hurts, the bottleneck is likely retrieval/update quality rather than model reasoning alone.

Generated from per-paper analyses; no external browsing.
