Daily AI Paper Report (2026-05-06)

Published:

Chinese version: [Chinese]

Run stats

  • Candidates: 258
  • Selected: 30
  • Deepread completed: 30
  • Window (UTC): 2026-05-04T00:00:00Z → 2026-05-05T00:00:00Z (arxiv_announce, expanded=0)
Selected papers

| arXiv ID | Title / Links | Categories | Score | Why | Tags |
| --- | --- | --- | --- | --- | --- |
| 2605.02187 | When Alignment Isn't Enough: Response-Path Attacks on LLM Agents (PDF) | cs.CR, cs.AI | 96 | High-impact agent security threat: post-alignment relay tampering breaks BYOK agents with 99.1% ASR. | agent-security, integrity, prompt-injection, BYOK, red-teaming |
| 2605.02269 | Towards Understanding Specification Gaming in Reasoning Models (PDF) | cs.AI | 96 | Open benchmark on LLM specification gaming; directly studies agent failure modes and RL effects. | agent-safety, specification-gaming, evaluation, reasoning-models, benchmark |
| 2605.02682 | Hybrid Inspection and Task-Based Access Control in Zero-Trust Agentic AI (PDF) | cs.AI | 95 | Directly targets agent tool-use authorization and runtime enforcement in zero-trust settings. | agent-safety, security, tool-use, authorization, zero-trust |
| 2605.02751 | Mitigating Misalignment Contagion by Steering with Implicit Traits (PDF) | cs.AI, cs.CL | 95 | Directly studies multi-agent LM misalignment contagion and mitigation; highly safety-relevant. | llm-safety, multi-agent, alignment, steering, evaluation |
| 2605.02812 | Autonomous LLM Agent Worms: Cross-Platform Propagation, Automated Discovery and Temporal Re-Entry Defense (PDF) | cs.CR | 94 | Systematic study of persistent agent worm propagation and defenses in multi-agent LLM ecosystems. | agent-security, worms, persistent-memory, multi-agent, defense |
| 2605.02647 | ContextualJailbreak: Evolutionary Red-Teaming via Simulated Conversational Priming (PDF) | cs.CL, cs.CR | 92 | Automated multi-turn jailbreak red-teaming via conversational priming targets a key real-world attack surface. | jailbreaks, red-teaming, multi-turn, alignment, adversarial-evaluation |
| 2605.02199 | MEMAUDIT: An Exact Package-Oracle Evaluation Protocol for Budgeted Long-Term LLM Memory Writing (PDF) | cs.AI | 92 | Auditable protocol for long-term agent memory writing; strong relevance to reliable agentic systems. | agents, memory, evaluation, reliability, auditing |
| 2605.02202 | CBV: Clean-label Backdoor Attacks on Vision Language Models via Diffusion Models (PDF) | cs.AI | 92 | Backdoor attacks on VLMs via clean-label diffusion poisoning; strong security relevance. | vlm-security, backdoor, data-poisoning, diffusion, adversarial-ml |
| 2605.02196 | DurableUn: Quantization-Induced Recovery Attacks in Machine Unlearning (PDF) | cs.LG | 91 | Important privacy/safety result: INT4 quantization can recover supposedly unlearned content by up to 22x. | machine-unlearning, privacy, quantization, llm-security, compliance |
| 2605.02178 | T$^2$PO: Uncertainty-Guided Exploration Control for Stable Multi-Turn Agentic Reinforcement Learning (PDF) | cs.AI | 91 | Targets instability in multi-turn agent RL with uncertainty-guided exploration control. | agentic-rl, reasoning, training-stability, uncertainty, llm-agents |
| 2605.02661 | AcademiClaw: When Students Set Challenges for AI Agents (PDF) | cs.AI, cs.CY | 90 | Real student-sourced long-horizon agent benchmark with isolated environments; high practical eval value. | agents, benchmark, long-horizon, evaluation, academic-tasks |
| 2605.02495 | Efficient Preference Poisoning Attack on Offline RLHF (PDF) | cs.LG, cs.AI, stat.ML | 89 | Preference poisoning exposes offline RLHF/DPO supply-chain risk with efficient targeted attack methods. | RLHF, DPO, data-poisoning, alignment, security |
| 2605.02363 | When Correct Isn't Usable: Improving Structured Output Reliability in Small Language Models (PDF) | cs.CL, cs.AI, cs.LG | 89 | Targets structured-output reliability, crucial for tool use and agent deployment. | llm-reliability, structured-output, json, agents, evaluation |
| 2605.02398 | The Compliance Trap: How Structural Constraints Degrade Frontier AI Metacognition Under Adversarial Pressure (PDF) | cs.AI, cs.CL, cs.LG | 88 | Large-scale eval of metacognitive collapse under adversarial pressure; directly relevant to frontier AI safety. | evaluation, metacognition, adversarial-robustness, frontier-models, safety |
| 2605.02240 | PhysicianBench: Evaluating LLM Agents in Real-World EHR Environments (PDF) | cs.AI | 88 | Real-world EHR benchmark for long-horizon LLM agents with verifiable execution in clinical workflows. | agents, benchmark, healthcare, evaluation, long-horizon |
| 2605.02395 | Controllable and Verifiable Process Data Synthesis for Process Reward Models (PDF) | cs.AI | 88 | Controllable, verifiable process data synthesis for PRMs could improve reasoning supervision. | process-reward-models, reasoning, alignment, synthetic-data, verification |
| 2605.02411 | FitText: Evolving Agent Tool Ecologies via Memetic Retrieval (PDF) | cs.AI, cs.IR, cs.LG, cs.MA | 88 | Dynamic tool retrieval inside the reasoning loop is highly relevant for scalable agent capability. | agents, tool-use, retrieval, tool-selection, framework |
| 2605.02503 | DataClaw: A Process-Oriented Agent Benchmark for Exploratory Real-World Data Analysis (PDF) | cs.AI | 87 | Process-level benchmark for exploratory data-analysis agents with milestone annotations and noisy real data. | agents, benchmark, process-supervision, data-analysis, evaluation |
| 2605.02626 | Gradient-Gated DPO: Stabilizing Preference Optimization in Language Models (PDF) | cs.LG | 86 | Stabilizes DPO training and addresses probability collapse in preference optimization. | dpo, alignment, preference-optimization, training-stability, llms |
| 2605.02348 | Decoding-Time Debiasing via Process Reward Models: From Controlled Fill-in to Open-Ended Generation (PDF) | cs.CL, cs.LG | 86 | Decoding-time debiasing with PRMs is practical for safer generation without weight updates. | alignment, bias, process-reward-models, decoding, safety |
| 2605.02255 | On the Privacy of LLMs: An Ablation Study (PDF) | cs.CR, cs.AI | 85 | Useful unified ablation of LLM privacy attacks across architecture, scale, data, and retrieval settings. | privacy, membership-inference, data-extraction, RAG, security-evaluation |
| 2605.02545 | Strategy-Aware Optimization Modeling with Reasoning LLMs (PDF) | cs.AI | 85 | Post-training for strategy-aware optimization with verified data and GRPO shows concrete gains. | llm-reasoning, post-training, grpo, verification, optimization |
| 2605.02168 | Planner Matters! An Efficient and Unbalanced Multi-agent Collaboration Framework for Long-horizon Planning (PDF) | cs.AI, cs.LG, cs.MA | 84 | Multi-agent planning framework with compute-allocation insight; useful for efficient long-horizon agents. | agents, planning, multi-agent, efficiency, long-horizon |
| 2605.02472 | Accurate Legal Reasoning at Scale: Neuro-Symbolic Offloading and Structural Auditability for Robust Legal Adjudication (PDF) | cs.CL | 84 | Neuro-symbolic offloading with auditable traces for legal reasoning is strong reliability work. | neuro-symbolic, auditability, legal-reasoning, reliability, agents |
| 2605.02469 | Reference-Sampled Boltzmann Projection for KL-Regularized RLVR: Target-Matched Weighted SFT, Finite One-Shot Gaps, and Policy Mirror Descent (PDF) | cs.LG, cs.AI | 84 | Theoretical clarification of RLVR vs weighted SFT could matter for scalable verifier-based alignment. | alignment, rlvr, kl-regularization, theory, sft |
| 2605.02443 | HalluScan: A Systematic Benchmark for Detecting and Mitigating Hallucinations in Instruction-Following LLMs (PDF) | cs.CL | 83 | Comprehensive hallucination benchmark with routing and mitigation; practical for reliability evaluation. | hallucinations, benchmark, evaluation, reliability, mitigation |
| 2605.02206 | Metric Unreliability in Multimodal Machine Unlearning: A Systematic Analysis and Principled Unified Score (PDF) | cs.CV, cs.LG | 83 | Systematic analysis of unreliable multimodal unlearning metrics; proposes unified scoring for evaluation. | unlearning, multimodal, evaluation, safety, privacy |
| 2605.02435 | Generalized Distributional Alignment Games for Unbiased Answer-Level Fine-Tuning (PDF) | cs.LG, stat.ML | 82 | Addresses estimator bias in answer-level fine-tuning games with provable unbiased methods. | alignment, fine-tuning, distributional-alignment, theory, training |
| 2605.02504 | A multilingual hallucination benchmark: MultiWikiQHalluA (PDF) | cs.CL | 81 | Multilingual hallucination benchmark across many languages; valuable for reliability beyond English. | hallucination, multilingual, benchmark, reliability, evaluation |
| 2605.02122 | STABLEVAL: Disagreement-Aware and Stable Evaluation of AI Systems (PDF) | cs.LG, cs.AI | 80 | Evaluation framework for stable, disagreement-aware AI ranking; broadly reusable for model assessment. | evaluation, human-judgment, uncertainty, benchmarking, reliability |

AI Paper Insight Brief

2026-05-06

0) Executive takeaways (read this first)

  • Agent evaluation is shifting from final-answer scoring to execution-grounded, process-aware, and stability-aware measurement. Several papers argue current benchmarks overstate capability because they ignore annotator disagreement, intermediate milestones, structured-output validity, or real environment execution.
  • A recurring systems lesson: architecture and control matter more than raw model size in long-horizon agents. Planner-centric decomposition, uncertainty-guided exploration control, dynamic tool retrieval, and neuro-symbolic offloading all report large gains over monolithic agent setups.
  • Security work is increasingly targeting post-generation and persistent-system attack surfaces, not just prompt injection. BYOK relay tampering, autonomous agent worms, contextual multi-turn jailbreaks, clean-label VLM backdoors, and offline RLHF poisoning all expose vulnerabilities outside the usual “aligned model output” threat model.
  • Multiple papers show that deployment transformations break audit assumptions: quantization can undo machine unlearning, relay infrastructure can bypass alignment, and structured outputs can be “correct but unusable.” Evaluating only at training time, or auditing only at BF16 precision, is no longer enough.
  • Preference/RL fine-tuning remains fragile. New work identifies small-batch estimator bias, DPO squeezing, and multi-turn RL collapse, while proposing fixes such as unbiased/variance-aware estimators, gradient gating, and token/turn-level exploration control.
  • For practitioners, the practical frontier is clear: instrument the stack end-to-end. Response integrity, memory writes, tool authorization, output schemas, deployment precision, and process checkpoints all need explicit controls.

2) Key themes (clusters)

Theme: Evaluation is becoming process-aware and deployment-aware

  • Why it matters: Several papers argue that standard benchmark scores hide the real failure modes that matter in deployment: unstable rankings, invalid structured outputs, poor intermediate progress, and execution failures in realistic environments. The common move is to evaluate the full operational contract, not just final correctness.
  • Representative papers:
  • Common approach:
    • Replace single hard labels or final-answer-only metrics with richer signals: posterior expected credit, checkpoints, milestones, JSON validity, or execution-grounded grading.
    • Evaluate in realistic environments with tool calls and state changes rather than static QA.
    • Separate distinct failure sources: extraction vs selection, reasoning vs execution, correctness vs parseability.
    • Use certified or auditable denominators where possible rather than heuristic scoring alone.
  • Open questions / failure modes:
    • How well do these richer metrics transfer across domains without becoming benchmark-specific?
    • Many protocols still rely partly on LLM judges or sparse human annotation.
    • Small or sparse settings remain hard: STABLEVAL notes instability in low-density annotation regimes; HalluScan uses very small samples.
    • Process metrics can diagnose failure, but they do not by themselves provide training signals that reliably improve agents.
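The correctness-vs-parseability split above can be made concrete with a small scoring sketch. This is an illustrative harness, not any benchmark's actual grader; the `answer` key and the dict-shaped expected output are assumptions chosen for the example:

```python
import json

def operational_score(output: str, expected: dict) -> dict:
    """Separate 'correct' from 'usable': an answer only counts operationally
    if it is also valid, schema-conforming JSON that a downstream tool
    could consume without repair."""
    try:
        parsed = json.loads(output)
        parseable = isinstance(parsed, dict)
    except json.JSONDecodeError:
        parsed, parseable = None, False
    # "answer" is a hypothetical schema field for this sketch.
    correct = parseable and parsed.get("answer") == expected.get("answer")
    return {"parseable": parseable, "correct": correct,
            "operational": parseable and correct}

# A response can be semantically right but operationally worthless:
valid = operational_score('{"answer": 42}', {"answer": 42})
prose = operational_score('The answer is 42', {"answer": 42})
```

Reporting the three fields separately, rather than a single accuracy number, is exactly the kind of failure-source separation these evaluation papers advocate.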

Theme: Long-horizon agents benefit from decomposition, retrieval adaptation, and control

Theme: Security threats are moving beyond prompt injection

Theme: Alignment and preference optimization are hitting optimization pathologies

Theme: Audit assumptions are breaking under deployment transformations

Theme: Structured, symbolic, and auditable offloading is resurging

3) Technical synthesis

  • Multiple papers replace coarse trajectory-level supervision with finer-grained control or scoring: T2PO uses token/turn uncertainty, Gate-DPO gates rejected gradients, STABLEVAL preserves posterior item uncertainty, and process-PRM synthesis labels first-error structure.
  • A common pattern is decomposition for identifiability: planner vs actor vs memory, task extraction vs task-tool matching, extraction vs selection in memory writing, correctness vs JSON validity, and retrieval vs reasoning in data-analysis agents.
  • Several works expose support/coverage as the hidden bottleneck: BOLT’s one-shot gap depends on missing target support, FitText shows retrieval is the binding constraint, and MEMAUDIT formalizes budgeted candidate coverage.
  • Security papers increasingly model post-hoc integrity failures rather than model misbehavior alone: response-path forgery, persistent carrier re-entry, tool-call substitution, and relay-side rewriting all bypass standard alignment assumptions.
  • Evaluation papers repeatedly show that single metrics are misleading: FA vs RA vs AD/JS disagree in multimodal unlearning; task accuracy without parseability gives 0 operational utility; final-answer-only scoring hides partial process progress.
  • Several methods use offline precomputation to reduce online cost: BOLT precomputes Boltzmann weights, DACL compiles contracts once, MEMAUDIT computes exact package optima, and AloLab pays one-time prompt optimization cost to recover near-baseline inference latency.
  • There is a notable trend toward judge-in-the-loop systems, but with different roles: VLM-as-judge for planner RL, LLM judges for hallucination and benchmark grading, and semantic judges for jailbreak search. This improves scalability but creates a shared dependency on judge reliability.
  • Robustness work increasingly tests deployment transformations explicitly: quantization, annotator subsampling, multilingual tokenization, relay mediation, and constrained decoding overhead all materially change conclusions.
  • Several papers suggest capability is not the only determinant of robustness: Constitutional AI appears resistant to the compliance trap, Anthropic models resist some contextual jailbreak transfer, and planner scaling can matter more than scaling all modules.
  • Across agent benchmarks, the dominant failures are still reasoning and coordination, not just tool syntax: PhysicianBench attributes about half of failures to clinical reasoning; DataClaw shows hard tasks remain difficult even after cleaning data; AcademiClaw finds more tokens do not buy better outcomes.

4) Top 5 papers (with “why now”)

When Alignment Isn’t Enough: Response-Path Attacks on LLM Agents

  • Formalizes a structural integrity gap in BYOK deployments: a relay can rewrite model outputs after alignment but before agent execution.
  • Shows strong empirical attack performance across AgentDojo and ASB, with RTA-PostForge reaching 73.5% ASR on AgentDojo while preserving 47.6% utility, and high ASR on ASB.
  • Useful now because many production agent stacks rely on relays, routers, or middleware that terminate TLS and are implicitly trusted.
  • Also valuable because it reframes prompt injection as only one part of the threat model; response authenticity becomes a first-class safety requirement.
  • Skeptical about: the threat model is specific to BYOK relay deployments, and the proposed time-channel detection works best over longer sessions.
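The paper's own defense is a time-channel detector, which is not reproduced here; as a minimal sketch of the "response authenticity" requirement, an HMAC over the exact bytes the model produced lets the executor detect relay-side rewriting, assuming the client and executor can provision a key the relay never sees:

```python
import hashlib
import hmac

# Hypothetical key, provisioned out-of-band so the relay cannot see it.
SECRET = b"shared-key-provisioned-out-of-band"

def sign_response(model_output: str) -> str:
    """Provider side: tag the exact model output before it enters the relay."""
    return hmac.new(SECRET, model_output.encode(), hashlib.sha256).hexdigest()

def verify_response(model_output: str, tag: str) -> bool:
    """Executor side: refuse to act on any output whose tag does not verify,
    so a relay that rewrites the response in transit is caught before
    tool execution."""
    expected = hmac.new(SECRET, model_output.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, tag)

original = '{"tool": "send_email", "to": "alice@example.com"}'
tampered = '{"tool": "send_email", "to": "attacker@evil.com"}'
tag = sign_response(original)
```

The design point is that integrity must be checked on the output the executor actually consumes, not on what the provider believes it sent.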

T$^2$PO: Uncertainty-Guided Exploration Control for Stable Multi-Turn Agentic Reinforcement Learning

  • Identifies “hesitation” as a concrete source of multi-turn RL instability: overlong low-information thinking and repeated unproductive turns.
  • Introduces token-level truncation and turn-level resampling driven by a self-calibrated uncertainty signal, with gains across WebShop, ALFWorld, and Search QA.
  • Useful now because multi-turn agent RL is becoming standard, and training collapse/variance is a major practical bottleneck.
  • Particularly actionable as a plug-and-play control layer rather than a full optimizer replacement.
  • Skeptical about: effectiveness depends on threshold tuning and still inherits off-policy staleness from pipelined rollouts.
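T$^2$PO's self-calibrated uncertainty signal is not spelled out here; the sketch below substitutes a plain per-token entropy threshold to illustrate the control-layer idea of truncating low-information "hesitation." The threshold and window values are arbitrary assumptions:

```python
import math

def token_entropy(probs):
    """Shannon entropy (nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def should_truncate(step_token_probs, min_entropy=0.5, window=8):
    """Truncate a 'thinking' segment when the recent token distributions are
    persistently low-entropy, i.e. the model is emitting long, near-
    deterministic filler rather than exploring alternatives."""
    recent = step_token_probs[-window:]
    if len(recent) < window:
        return False  # not enough evidence yet
    avg = sum(token_entropy(p) for p in recent) / len(recent)
    return avg < min_entropy

# Near-deterministic (hesitant) tokens vs. genuinely exploratory tokens:
hesitant = [[0.99, 0.01]] * 8
exploratory = [[0.25, 0.25, 0.25, 0.25]] * 8
```

In a real trainer this check would run inside the rollout loop, turning truncation (and, analogously, turn-level resampling) into a plug-in control rather than a change to the optimizer.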

Planner Matters! An Efficient and Unbalanced Multi-agent Collaboration Framework for Long-horizon Planning

  • Provides unusually direct evidence that planner capacity dominates end-to-end long-horizon agent performance, with planner scaling nearly matching scaling all modules.
  • Shows large gains from multi-agent decomposition and planner-only RL, including improvements on WebVoyager, OSWorld, and MCPBench.
  • Useful now because many teams are over-investing in monolithic agents or uniform scaling rather than targeting the planning bottleneck.
  • Offers a compute-allocation lesson: spend model size and RL budget where it matters most.
  • Skeptical about: actor and memory are frozen during RL, so execution failures remain; the 15-step cap may understate longer-horizon failure modes.

DurableUn: Quantization-Induced Recovery Attacks in Machine Unlearning

  • Shows that INT4 quantization can restore forgotten content even when BF16 audits indicate successful unlearning.
  • Introduces a deployment-aware durability framing and a quantization-aware mitigation, DURABLEUN-SAF, that achieves a multi-seed durability certificate on TOFU.
  • Useful now because low-bit deployment is standard, making BF16-only unlearning audits potentially misleading for privacy/compliance claims.
  • The paper’s main contribution is procedural as much as algorithmic: evaluate unlearning at deployment precision, not just training precision.
  • Skeptical about: the current robust solution collapses retain accuracy, so it is more an existence proof than a production-ready fix.
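The procedural lesson, audit at deployment precision, can be sketched as a tiny harness. The uniform round-trip quantizer and the `forget_score_fn` callback are toy stand-ins, not DurableUn's actual attack or metric:

```python
def quantize(weights, bits=4):
    """Toy uniform symmetric round-trip quantization, standing in for an
    INT4 deployment path."""
    max_w = max(abs(w) for w in weights) or 1.0
    levels = 2 ** (bits - 1) - 1
    scale = max_w / levels
    return [round(w / scale) * scale for w in weights]

def audit_unlearning(forget_score_fn, weights) -> dict:
    """Run the same forget-set probe at training-like precision and at
    deployment-like precision; a large gap means the unlearning claim
    does not survive deployment."""
    full = forget_score_fn(weights)
    deployed = forget_score_fn(quantize(weights))
    return {"bf16_like": full, "int4_like": deployed,
            "durable": abs(deployed - full) < 0.05}  # arbitrary tolerance
```

The harness shape is the point: whatever probe a compliance audit already runs in BF16 should be re-run on the quantized artifact, and both numbers should be reported side by side.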

PhysicianBench: Evaluating LLM Agents in Real-World EHR Environments

  • Brings agent evaluation into a realistic FHIR-based EHR environment with 100 physician-validated long-horizon tasks and 670 checkpoints.
  • Shows that even top models are far from reliable autonomy: best pass@1 is 46.3%, with clinical reasoning accounting for about half of failures.
  • Useful now because healthcare is a high-stakes domain where static QA benchmarks are especially misleading.
  • More broadly, it exemplifies the benchmark shift toward execution-grounded, stateful, domain-realistic evaluation.
  • Skeptical about: scope is limited to e-consult-style EHR workflows and excludes broader multimodal or collaborative clinical settings.

5) Practical next steps

  • Add response integrity checks to agent stacks: log and verify the exact model output consumed by the executor, especially in relay/BYOK architectures.
  • Evaluate any unlearning or privacy-sensitive model at deployment precision (INT8/INT4), not just BF16; add Q-INT4/Q-INT8-style reporting to audits.
  • For long-horizon agents, run ablations that separately scale planner, actor, retrieval, and memory to identify the true bottleneck before spending more compute.
  • Instrument agent training with token/turn-level diagnostics: average think length, repeated-turn rate, uncertainty trajectories, and collapse indicators across seeds.
  • Replace final-answer-only benchmarking with process checkpoints and operational metrics: parseability, tool-state correctness, milestone completion, and execution-grounded success.
  • For tool-using agents, deploy zero-trust interception: verify tool definitions, tool-call provenance, parameter integrity, and task-tool semantic alignment before execution.
  • Stress-test agents against multi-turn contextual jailbreaks and persistent-memory attacks, not just single-turn prompt injection.
  • Where domains are structured and high-stakes, prototype compile-once neuro-symbolic offloading or typed intermediate representations instead of repeated free-form runtime reasoning.
  • In preference optimization, monitor chosen-response likelihood and mass dynamics, not just pairwise margin improvements, to catch DPO-style squeezing early.
  • For memory systems, separate write-time quality from retrieval/reader quality; audit whether the writer is extracting the right facts, selecting under budget, or simply overproducing unusable notes.
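The preference-optimization monitoring step above can be as small as the sketch below. It assumes training logs already expose per-step mean log-probabilities for chosen and rejected responses; the window size is an arbitrary choice:

```python
def dpo_squeeze_alert(history, window=3):
    """history: list of (chosen_logp, rejected_logp) summaries, one per
    logging step. DPO can improve the pairwise margin (chosen - rejected)
    while the absolute chosen-response likelihood collapses; flag that
    pattern early instead of watching margin alone."""
    if len(history) < window + 1:
        return False
    recent, prior = history[-1], history[-1 - window]
    margin_up = (recent[0] - recent[1]) > (prior[0] - prior[1])
    chosen_down = recent[0] < prior[0]
    return margin_up and chosen_down

# Margin rises in both runs, but only the first is losing chosen mass:
squeezing = [(-1.0, -1.5), (-1.2, -2.0), (-1.5, -3.0), (-2.0, -4.0)]
healthy = [(-2.0, -2.5), (-1.8, -2.6), (-1.5, -2.8), (-1.2, -3.0)]
```

A trigger here would be a cue to inspect gradient gating, KL regularization, or data quality before the run degrades further.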

Generated from per-paper analyses; no external browsing.