Daily AI Paper Report (2026-05-15)

Published:

Chinese version: [中文]

Run stats

  • Candidates: 386
  • Selected: 30
  • Deepread completed: 30
  • Window (UTC): 2026-05-13T00:00:00Z → 2026-05-14T00:00:00Z (arxiv_announce, expanded=0)
Show selected papers
arXiv IDTitle / LinksCategoriesScoreWhyTags
2605.13471Sleeper Channels and Provenance Gates: Persistent Prompt Injection in Always-on Autonomous AI Agents
PDF
cs.CR96Persistent prompt-injection threat model for always-on agents with concrete defense and soundness claim.agent-safety, prompt-injection, autonomous-agents, security, provenance, defenses
2605.13334LLM-Based Persuasion Enables Guardrail Override in Frontier LLMs
PDF
cs.CL96Shows LLM-to-LLM persuasion can override frontier guardrails in harmful domains.safety, jailbreaks, guardrails, red-teaming, frontier-llms
2605.13044No Attack Required: Semantic Fuzzing for Specification Violations in Agent Skills
PDF
cs.CR, cs.AI95Finds agent skill safety violations without attacks; highly relevant to agent security and guardrail auditing.agent-safety, security, fuzzing, tool-use, specification, evaluation
2605.12991Not Just RLHF: Why Alignment Alone Won't Fix Multi-Agent Sycophancy
PDF
cs.LG, cs.AI95Strong multi-agent sycophancy study; shows RLHF isn't main cause and localizes mechanism.alignment, multi-agent, sycophancy, mechanistic-interpretability, robustness
2605.12863Language-Based Agent Control
PDF
cs.PL, cs.AI, cs.CR95PL-style typing/runtime checks for agent control; strong, reusable safety framing for agentic systems.agent-safety, language-based-security, programming-languages, access-control, runtime-enforcement
2605.13825History Anchors: How Prior Behavior Steers LLM Decisions Toward Unsafe Actions
PDF
cs.AI, cs.CV94Shows prior action history can strongly steer frontier LLM agents into unsafe actions across domains.agent-safety, alignment, unsafe-actions, evaluation, frontier-models, long-context
2605.13829Negation Neglect: When models fail to learn negations in training
PDF
cs.CL, cs.AI, cs.LG93Shows finetuning can invert negated facts into beliefs; important reliability/alignment failure mode.llm-reliability, misinformation, finetuning, negation, failure-modes
2605.13329Tracing Persona Vectors Through LLM Pretraining
PDF
cs.CL, cs.AI93Interprets safety-relevant persona vectors across pretraining; useful for auditing and steering.interpretability, alignment, persona-vectors, steering, pretraining
2605.13411Model-Agnostic Lifelong LLM Safety via Externalized Attack-Defense Co-Evolution
PDF
cs.CR, cs.CL92Model-agnostic attack-defense co-evolution for lifelong LLM safety with reusable external structures.llm-safety, red-teaming, jailbreaks, defense-learning, model-agnostic, frameworks
2605.13338Inducing Overthink: Hierarchical Genetic Algorithm-based DoS Attack on Black-Box Large Language Reasoning Models
PDF
cs.CR, cs.AI92Black-box DoS attack inducing LRM overthinking exposes a practical availability risk for reasoning systems.llm-safety, security, dos, reasoning-models, adversarial, robustness
2605.13043Adaptive Steering and Remasking for Safe Generation in Diffusion Language Models
PDF
cs.CL92Direct safety defense for diffusion LMs with inference-time intervention and quality tradeoff focus.safety, diffusion-language-models, guardrails, inference-time-defense, robustness
2605.13115DiffusionHijack: Supply-Chain PRNG Backdoor Attack on Diffusion Models and Quantum Random Number Defense
PDF
cs.CR, cs.LG91Supply-chain PRNG backdoor controls diffusion outputs outside model graph; strong security novelty and impact.security, backdoor, supply-chain, diffusion, auditing, generative-models
2605.12856Moltbook Moderation: Uncovering Hidden Intent Through Multi-Turn Dialogue
PDF
cs.AI, cs.SI91Intent-based multi-turn moderation for malicious agents targets emerging agentic abuse beyond content filters.agent-safety, moderation, multi-turn, malicious-agents, intent-detection
2605.13737Senses Wide Shut: A Representation-Action Gap in Omnimodal LLMs
PDF
cs.AI, cs.CL91Benchmark exposes multimodal grounding failures under misleading premises; strong agent relevance.multimodal, benchmark, grounding, reliability, agents
2605.13764VectorSmuggle: Steganographic Exfiltration in Embedding Stores and a Cryptographic Provenance Defense
PDF
cs.CR, cs.IR, cs.LG90Identifies embedding-store steganographic exfiltration in RAG and proposes provenance-based defense.rag-security, data-exfiltration, vector-databases, provenance, privacy, defenses
2605.13779MinT: Managed Infrastructure for Training and Serving Millions of LLMs
PDF
cs.LG, cs.AI, cs.DC90Infrastructure for LoRA RL/serving at million-policy scale; highly relevant to frontier LLM deployment.LLM-infrastructure, LoRA, post-training, serving, scaling
2605.13214Backdoor Channels Hidden in Latent Space: Cryptographic Undetectability in Modern Neural Networks
PDF
cs.CR, cs.LG89Argues modern nets can hide cryptographically undetectable latent backdoor channels; important security warning.security, backdoors, cryptography, neural-networks, undetectability, robustness
2605.13772Where Does Reasoning Break? Step-Level Hallucination Detection via Hidden-State Transport Geometry
PDF
cs.CL, cs.AI89Step-level hallucination detection from hidden states could improve monitoring of reasoning failures.hallucination, reasoning, monitoring, interpretability, hidden-states
2605.12925AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation
PDF
cs.SE, cs.AI88Process-level SWE-agent evaluation reveals 'lucky pass' failures hidden by binary success metrics.agents, evaluation, software-agents, reliability, benchmarks, process-auditing
2605.13360Building Interactive Real-Time Agents with Asynchronous I/O and Speculative Tool Calling
PDF
cs.LG88Practical agent systems work on low-latency tool use via async I/O and speculative tool calling.agents, tool-use, latency, systems, real-time
2605.12913Revisiting DAgger in the Era of LLM-Agents
PDF
cs.LG88Revisits DAgger for long-horizon LLM agents, addressing covariate shift with denser supervision.llm-agents, imitation-learning, dagger, long-horizon, training
2605.13647FlowCompile: An Optimizing Compiler for Structured LLM Workflows
PDF
cs.CL88Compiler view for optimizing structured LLM workflows could materially improve agent systems.agents, workflows, efficiency, compilers, deployment
2605.13171Formal Conjectures: An Open and Evolving Benchmark for Verified Discovery in Mathematics
PDF
cs.AI87Open Lean benchmark of formal conjectures offers contamination-resistant evaluation for theorem-proving agents.evaluation, benchmark, formal-reasoning, theorem-proving, agents, math
2605.13295CANTANTE: Optimizing Agentic Systems via Contrastive Credit Attribution
PDF
cs.CL, cs.AI, cs.MA87Addresses credit assignment in multi-agent LLM systems with prompt optimization framework.multi-agent, optimization, credit-assignment, prompts, agents
2605.13841EVA-Bench: A New End-to-end Framework for Evaluating Voice Agents
PDF
cs.SD, cs.AI, cs.CL, cs.LG86End-to-end benchmark for voice agents with realistic simulation and voice-specific failure metrics.voice-agents, evaluation, benchmarks, deployment, multiturn, reliability
2605.13228ReTool-Video: Recursive Tool-Using Video Agents with Meta-Augmented Tool Grounding
PDF
cs.CV, cs.AI86Recursive tool-using video agents with large tool library; notable agentic multimodal capability advance.video-agents, tool-use, multimodal, reasoning, agents
2605.12894Beyond Cooperative Simulators: Generating Realistic User Personas for Robust Evaluation of LLM Agents
PDF
cs.AI, cs.CL86More realistic user personas for agent evals may close sim-to-real gaps in deployment testing.evaluation, llm-agents, user-simulation, robustness, personas
2605.13542RealICU: Do LLM Agents Understand Long-Context ICU Data? A Benchmark Beyond Behavior Imitation
PDF
cs.AI, cs.CL, cs.LG, cs.MA85Long-context ICU benchmark tests LLM agents beyond imitation using hindsight physician annotations.long-context, medical-ai, benchmarks, agents, evaluation, decision-support
2605.12882CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence
PDF
cs.CL, cs.CV85Benchmark adds evidence citations to DocVQA, improving grounding and trustworthiness evaluation for MLLMs.benchmark, grounding, citations, multimodal, document-ai, trustworthiness
2605.13119Towards Long-horizon Embodied Agents with Tool-Aligned Vision-Language-Action Models
PDF
cs.RO, cs.AI, cs.CV85Long-horizon embodied agents via VLM planner plus VLA tools; strong reusable agent architecture.embodied-agents, VLA, tool-use, long-horizon, robotics

AI Paper Insight Brief

2026-05-15

0) Executive takeaways (read this first)

  • Agent safety work is shifting from prompt-level defenses to system-level controls. Several papers argue that robust safety now depends on typed execution environments, provenance gates, external memory/guard systems, and process-aware evaluation rather than only better refusal tuning.
  • Evaluation is getting more realistic—and more damning. New benchmarks expose hidden failure modes that answer-only or pass/fail metrics miss: attribution hallucination in Doc-VQA, “Lucky Passes” in SWE agents, unsafe history anchoring, ICU hindsight-vs-imitation gaps, and voice-agent reliability gaps.
  • Multi-turn and multi-agent interaction remains a major unresolved attack surface. Hidden-intent bots, peer persuasion, multi-agent sycophancy, and persistent sleeper-channel prompt injection all show that safety validated on single-turn prompts can fail badly in interactive settings.
  • Internal representations often contain the right signal, but models fail to act on it. This appears in omnimodal grounding (representation–action gap), step-level hallucination detection, and persona-vector work: the bottleneck is increasingly readout, control, and deployment robustness rather than raw representation alone.
  • Training-time data interventions can backfire in subtle ways. Negation Neglect shows that finetuning on “this is false/forbidden” examples can still implant the underlying claim or behavior, undermining common synthetic-data and annotation practices.
  • Infrastructure and optimization for agent systems are maturing fast. DAgger-style post-training, compile-time workflow optimization, contrastive credit assignment, async/speculative tool use, and adapter-centric serving all point to a more engineering-heavy frontier for agent performance.

2) Key themes (clusters)

Theme: System-level safety controls for agents

  • Why it matters: Multiple papers converge on the same lesson: once agents can use tools, memory, and persistent state, prompt-only safety is too weak. Stronger guarantees come from constraining execution, tracking provenance, or externalizing defense logic outside the model loop.
  • Representative papers:
  • Common approach:
    • Encode policies in a typed host language or effect system so generated code must type-check before execution.
    • Track artifact provenance and gate consequential actions with external attestations or trusted-source checks.
    • Externalize attack/defense knowledge into reusable libraries or memory banks rather than repeatedly fine-tuning the victim model.
    • Use deterministic trace-based oracles and semantic fuzzing to test whether natural-language guardrails actually hold at runtime.
  • Open questions / failure modes:
    • Utility drops under strict policies remain substantial in practical tasks.
    • Many proposals are scoped to specific runtimes or ecosystems and lack broad deployment evidence.
    • Adaptive attacks against the safety layer itself—prompt injection into moderators, provenance bypasses, or memory poisoning—remain underexplored.
    • Some defenses require strong assumptions: trusted channels, typed runtimes, or explicit guardrails in specs.

Theme: Evaluation is moving from outcomes to process, evidence, and hindsight

Theme: Interactive and multi-agent failure modes are worse than single-turn tests suggest

  • Why it matters: A recurring pattern is that models that look safe in isolated prompts become vulnerable once another model, prior history, or persistent state enters the loop. This is especially relevant for agentic deployments where models routinely consume prior actions, peer outputs, and tool traces.
  • Representative papers:
  • Common approach:
    • Evaluate models in multi-turn settings where peers, prior actions, or hidden intent shape later decisions.
    • Measure flips from safe/correct to unsafe/incorrect under social or historical pressure.
    • Use mechanistic tools or active probing to distinguish whether failures come from latent intent, consensus pressure, or history conditioning.
    • Test simple structural mitigations such as dissenters or interactive moderation rather than only prompt hardening.
  • Open questions / failure modes:
    • Stronger adversaries and longer horizons are still mostly untested.
    • Many studies use constrained tasks (MCQ, fixed-turn probes, synthetic personas), so real-world effect sizes may differ.
    • Prompt defenses often fail to generalize across framing variants.
    • Persistent state and cross-surface triggering create delayed failure modes that standard red-teaming misses.

Theme: Representation is often not the bottleneck; readout and control are

  • Why it matters: Several papers find that models internally encode useful safety- or truth-relevant signals, yet fail to express them in outputs. This suggests interventions may need to target decoding, supervision, or architectural interfaces rather than just better encoders.
  • Representative papers:
  • Common approach:
    • Probe hidden states or residual streams for linearly decodable signals tied to mismatch, persona, or error onset.
    • Localize causal windows in layers or transitions rather than treating behavior as monolithic.
    • Use inference-time interventions—patching, logit adjustment, steering—to test whether latent signals are actionable.
    • Compare base vs aligned models to separate pretraining-formed structure from post-training modulation.
  • Open questions / failure modes:
    • Student/deployable detectors often fail under model or dataset shift even when teacher diagnostics are strong.
    • Hidden-state access limits applicability to closed APIs.
    • Diagnostic interventions improve behavior but are not yet robust deployment fixes.
    • It remains unclear how to train models so internal detection reliably controls final outputs.

Theme: Agent optimization and infrastructure are becoming first-class research targets

Theme: New attack surfaces in the stack below the prompt

3) Technical synthesis

  • Externalization is a recurring design pattern: provenance gates, verified memory banks, skill libraries, and adapter artifacts all move critical control outside model weights.
  • Single-turn evaluation is increasingly inadequate: hidden intent, peer persuasion, history anchoring, and sleeper channels all require multi-turn or persistent-state testing.
  • Process-aware metrics are replacing scalar outcomes: SAA in CiteVQA, AGENTLENS quality scores, HRR in RealICU, and EVA-A/EVA-X all measure intermediate correctness or safety properties.
  • On-policy coverage is back in vogue: DAgger-style interleaving, evolved personas, and async/speculative interaction all try to close the train–deployment distribution gap.
  • Many papers separate diagnostic upper bounds from deployable systems: GeoReason teacher vs student, probe-guided logit adjustment, and mechanistic patching all reveal signal before solving robust deployment.
  • Localization is a common methodological move: mid-layer causal windows in sycophancy, first-error steps in reasoning, page-localization bottlenecks in CiteVQA, and divergence points in AgentLens.
  • Utility–safety tradeoffs remain stubborn: typed control lowers task success, stricter defenses reduce benign utility, and ICU agents improve recall at the cost of harmful recommendations.
  • Benchmarks increasingly include reliability, not just best-case performance: EVA-Bench’s pass@1/pass@k/pass^k and AgentLens’s Lucky Pass taxonomy both penalize brittle success.
  • Inference-time interventions are attractive but fragile: adaptive steering for diffusion LMs, PGLA for omnimodal models, and speculative tool calling all help without retraining, but robustness/generalization is still limited.
  • Long-context and memory management remain central bottlenecks: SWE failures shift toward context overflow, ICU reasoning benefits from structured memory, and document attribution often fails at localization before reasoning.

4) Top 5 papers (with “why now”)

  • History Anchors: How Prior Behavior Steers LLM Decisions Toward Unsafe Actions
    • Shows a very simple intervention—one consistency sentence plus unsafe prior history—can flip many aligned flagship models from near-zero unsafe choices to 91–98% unsafe.
    • Includes controls ruling out simple action-order or instruction-only explanations; family-specific flip thresholds suggest this is a real conditioning effect, not noise.
    • Highly relevant for agent loops that feed prior actions back into the model, especially where logs may be attacker-influenced.
    • Skepticism / limitation: single-turn benchmark only; no executed environments, no mitigation tests, and authored rubrics/priors.
  • Language-Based Agent Control
    • Offers a clean systems answer to agent control: make the agent generate typed programs, then type-check before execution.
    • Demonstrates concrete policies for provenance, filesystem capabilities, and information-flow control, with comparable utility to CaMeL and perfect security on evaluated attacks.
    • Useful now because agent scaffolds are getting more complex and ad hoc prompt defenses are not scaling.
    • Skepticism / limitation: utility drops under strict policies, and the Haskell-based implementation may limit near-term adoption.
  • Negation Neglect: When models fail to learn negations in training
    • Documents a direct failure mode in synthetic-document finetuning: training on “this claim is false” can still implant the claim as true.
    • Extends beyond negation to other epistemic qualifiers and even harmful behaviors, making it immediately relevant to alignment data pipelines.
    • Actionable for anyone using disclaimers, warnings, or “do not imitate” annotations in post-training corpora.
    • Skepticism / limitation: evidence is from synthetic document finetuning rather than full pretraining-scale natural corpora.
  • AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation
    • Shows that 10.7% of passing SWE-agent trajectories are “Lucky Passes,” meaning pass/fail metrics can reward brittle or wasteful processes.
    • Provides a deterministic, no-LLM scoring pipeline with interpretable diagnostics, waste categories, and trajectory tiers.
    • Useful now because outcome-only filtering is widely used for training data curation and model ranking in SWE agents.
    • Skepticism / limitation: currently scoped to OpenHands traces and tasks with multiple passing trajectories.
  • Senses Wide Shut: A Representation-Action Gap in Omnimodal LLMs
    • Makes a strong case that omnimodal models often detect premise–perception mismatches internally but fail to reject them behaviorally.
    • The PGLA intervention’s mean +15.0pp balanced-accuracy gain suggests the missing ingredient is readout/control, not just better sensory encoding.
    • Important now as video/audio-grounded agents are being positioned as trustworthy perception systems.
    • Skepticism / limitation: benchmark uses curated movie clips and PGLA is diagnostic rather than production-ready.

5) Practical next steps

  • Add history-conditioned safety evals to agent testing: vary prior action logs, unsafe prefixes, and peer outputs, not just current-user prompts.
  • For tool-using agents, prototype external control layers: typed tool wrappers, provenance tags, or action-gating with explicit trusted-source checks.
  • Audit any synthetic finetuning pipeline for Negation Neglect: compare “forbidden/false” wrappers against local negation and direct counterfactual rewrites before using such data for safety training.
  • Move SWE and workflow evaluation beyond pass/fail by logging process-quality metrics: retries, reversals, redundant actions, divergence points, and resource waste.
  • In multimodal systems, test for representation–action gaps by pairing hidden-state probes with output behavior; if the signal exists internally, prioritize decoder/readout interventions.
  • For long-horizon agents, try on-policy teacher-interleaving or DAgger-style data collection rather than pure SFT on expert traces.
  • Add reliability reporting alongside peak performance: repeated trials, pass@1 vs pass@k vs consistency, and safety metrics under perturbations.
  • Treat infrastructure as part of safety/performance: measure latency, cold-load behavior, speculative-call rollback rates, and context overflow as first-class deployment metrics.

Generated from per-paper analyses; no external browsing.