Daily AI Paper Report (2026-05-08)

Chinese version: [中文]

Run stats

  • Candidates: 278
  • Selected: 30
  • Deepread completed: 30
  • Window (UTC): 2026-05-06T00:00:00Z → 2026-05-07T00:00:00Z (arxiv_announce, expanded=0)
Selected papers (arXiv ID · title · categories · score · why · tags)

  • 2605.04785 · AgentTrust: Runtime Safety Evaluation and Interception for AI Agent Tool Use (cs.AI, cs.CR · score 95)
    Why: Runtime interception for agent tool use is highly deployment-relevant safety work with concrete controls.
    Tags: agent-safety, tool-use, runtime-monitoring, guardrails, security
  • 2605.05116 · On the Hardness of Junking LLMs (cs.LG · score 95)
    Why: Studies jailbreaks via promptless token triggers/natural backdoors; highly relevant to LLM safety.
    Tags: llm-safety, jailbreaks, backdoors, adversarial-prompts, robustness
  • 2605.04431 · Towards Robust LLM Post-Training: Automatic Failure Management for Reinforcement Fine-Tuning (cs.SE, cs.AI · score 94)
    Why: Targets fragile RL post-training with a new failure benchmark and automatic failure management.
    Tags: LLM post-training, RFT, reliability, benchmark, automation
  • 2605.04808 · DecodingTrust-Agent Platform (DTap): A Controllable and Interactive Red-Teaming Platform for AI Agents (cs.AI · score 93)
    Why: Controllable red-teaming platform for AI agents targets realistic, reproducible agent security evaluation.
    Tags: agent-safety, red-teaming, evaluation, agents, security-benchmarks
  • 2605.04992 · You Snooze, You Lose: Automatic Safety Alignment Restoration through Neural Weight Translation (cs.CR · score 93)
    Why: Restores safety after unsafe LoRA merges while preserving skills; practical open-model guardrail work.
    Tags: alignment, lora, safety-restoration, open-source-llms, guardrails
  • 2605.05058 · SoK: Robustness in Large Language Models against Jailbreak Attacks (cs.CR, cs.AI · score 92)
    Why: Systematizes jailbreak robustness and proposes a multidimensional evaluation framework for LLM security.
    Tags: jailbreaks, llm-safety, survey, evaluation, security
  • 2605.04572 · From Parameter Dynamics to Risk Scoring: Quantifying Sample-Level Safety Degradation in LLM Fine-tuning (cs.AI, cs.LG · score 91)
    Why: Analyzes fine-tuning safety degradation dynamics and scores risky samples; strong alignment relevance.
    Tags: alignment, fine-tuning, safety-degradation, risk-scoring, llm-reliability
  • 2605.05112 · Rollout Pass-Rate Control: Steering Binary-Reward RL Toward Its Most Informative Regime (cs.LG · score 91)
    Why: Targets agentic RL efficiency for SWE-bench-style systems with a clear control principle and gains.
    Tags: agentic-rl, evaluation, reasoning, training-efficiency, SWE-bench
  • 2605.04454 · Deployment-Relevant Alignment Cannot Be Inferred from Model-Level Evaluation Alone (cs.AI, cs.HC, cs.LG, cs.SE · score 90)
    Why: Important alignment argument: model-level benchmarks alone miss deployment-level alignment evidence.
    Tags: alignment, evaluation, deployment, reliability, ai-safety
  • 2605.04543 · UniVer: A Unified Perspective for Multi-step and Multi-draft Speculative Decoding (cs.CL, cs.LG · score 90)
    Why: Unified speculative decoding for multi-step/multi-draft trees; strong LLM inference efficiency relevance.
    Tags: LLM inference, speculative decoding, efficiency, decoding, transformers
  • 2605.04446 · Misrouter: Exploiting Routing Mechanisms for Input-Only Attacks on Mixture-of-Experts LLMs (cs.CR · score 89)
    Why: Input-only attack on MoE routing exposes a practical new safety/security failure mode for hosted LLMs.
    Tags: moe, adversarial-attacks, jailbreaks, llm-security, frontier-models
  • 2605.05090 · Automatically Finding and Validating Unexpected Side-Effects of Interventions on Language Models (cs.CL, cs.AI · score 89)
    Why: Automated auditing pipeline finds intended and unintended behavior changes after LM interventions.
    Tags: auditing, evaluation, model-interventions, unlearning, knowledge-editing
  • 2605.05040 · Preference-Based Self-Distillation: Beyond KL Matching via Reward Regularization (cs.LG, cs.AI · score 88)
    Why: Preference-based self-distillation for LLMs; directly relevant to post-training stability and reasoning.
    Tags: LLM, alignment, self-distillation, preference-learning, reasoning
  • 2605.04615 · Beyond Retrieval: A Multitask Benchmark and Model for Code Search (cs.SE, cs.AI · score 88)
    Why: Contamination-limited multitask code search benchmark and reranker for realistic retrieval pipelines.
    Tags: code search, benchmark, retrieval, reranking, evaluation
  • 2605.05003 · Misaligned by Reward: Socially Undesirable Preferences in LLMs (cs.CL, cs.AI, cs.CY · score 87)
    Why: Probes reward models for socially undesirable preferences across safety, bias, morality, and ethics.
    Tags: reward-models, alignment, social-preferences, safety-evaluation, rlhf
  • 2605.05025 · Detecting Hallucinations in Large Language Models via Internal Attention Divergence Signals (cs.CL · score 87)
    Why: Lightweight single-pass hallucination detection using internal attention signals; practical reliability angle.
    Tags: LLM, hallucination, uncertainty, interpretability, reliability
  • 2605.05166 · The First Token Knows: Single-Decode Confidence for Hallucination Detection (cs.CL, cs.AI · score 86)
    Why: Single-decode first-token confidence for hallucination detection is efficient and deployment-friendly.
    Tags: hallucination, uncertainty, factuality, evaluation, confidence
  • 2605.04458 · DoGMaTiQ: Automated Generation of Question-and-Answer Nuggets for Report Evaluation (cs.CL, cs.IR · score 86)
    Why: Automates QA nugget generation for evaluating long-form citation-backed RAG reports.
    Tags: RAG, evaluation, long-form generation, QA nuggets, report assessment
  • 2605.04700 · Sparse Tokens Suffice: Jailbreaking Audio Language Models via Token-Aware Gradient Optimization (cs.CR, cs.AI, cs.CL, cs.LG, cs.SD · score 85)
    Why: Shows sparse token-aware jailbreaks on audio language models, extending multimodal attack understanding.
    Tags: audio-language-models, jailbreaks, multimodal-safety, adversarial-attacks, security
  • 2605.04530 · SADE: Symptom-Aware Diagnostic Escalation for LLM-Based Network Troubleshooting (cs.NI, cs.AI · score 84)
    Why: Agentic troubleshooting with explicit phase-gated policy; useful for reliable tool-using agents.
    Tags: agents, tool-use, reliability, workflow, evaluation
  • 2605.05134 · Low-Cost Black-Box Detection of LLM Hallucinations via Dynamical System Prediction (cs.LG, math.DS · score 84)
    Why: Black-box hallucination detection without sampling or retrieval; useful if results hold across domains.
    Tags: LLM, hallucination, black-box, uncertainty, factuality
  • 2605.05103 · Text Corpora as Concept Fields: Black-Box Hallucination and Novelty Measurement (cs.CL, cs.AI, cs.CY · score 84)
    Why: Black-box, corpus-attributable hallucination and novelty metric with uncertainty for groundedness checks.
    Tags: hallucination, groundedness, evaluation, uncertainty, black-box
  • 2605.05007 · Uno-Orchestra: Parsimonious Agent Routing via Selective Delegation (cs.AI · score 83)
    Why: Unified routing/decomposition policy for multi-agent systems with strong efficiency and benchmark gains.
    Tags: agents, orchestration, routing, efficiency, multi-agent
  • 2605.04960 · EP-GRPO: Entropy-Progress Aligned Group Relative Policy Optimization with Implicit Process Guidance (cs.LG, cs.AI · score 83)
    Why: Improves GRPO credit assignment for LLM reasoning with dense self-supervised guidance.
    Tags: llm-reasoning, rlvr, grpo, post-training, credit-assignment
  • 2605.04956 · KernelBench-X: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels (cs.LG, cs.PF · score 83)
    Why: Benchmark for LLM-generated GPU kernels with failure analysis; reusable eval for code-generation limits.
    Tags: benchmark, LLM, code-generation, evaluation, efficiency
  • 2605.04893 · Self-Attention as Transport: Limits of Symmetric Spectral Diagnostics (cs.LG, cs.CL, stat.ML · score 82)
    Why: Theoretical attention diagnostic work tied to hallucination failure modes and information-flow asymmetry.
    Tags: interpretability, attention, hallucination, theory, diagnostics
  • 2605.05191 · LongSeeker: Elastic Context Orchestration for Long-Horizon Search Agents (cs.AI · score 81)
    Why: Adaptive context orchestration for long-horizon search agents could improve scalable agent reliability.
    Tags: agents, long-context, search, context-management, reasoning
  • 2605.04719 · Every Step Counts: Step-Level Credit Assignment for Tool-Integrated Text-to-SQL (cs.CL · score 81)
    Why: Step-level credit assignment for tool-integrated Text-to-SQL improves process supervision.
    Tags: tool-use, agents, process-supervision, text-to-sql, reinforcement-learning
  • 2605.04453 · StableI2I: Spotting Unintended Changes in Image-to-Image Transition (cs.CV, cs.AI · score 81)
    Why: Evaluates unintended changes in image-to-image systems; strong benchmark value for model reliability.
    Tags: evaluation, multimodal, image-to-image, robustness, benchmark
  • 2605.05000 · Agentic Vulnerability Reasoning on Windows COM Binaries (cs.CR, cs.LG · score 80)
    Why: Agentic vulnerability reasoning with debugger-verified PoCs is impactful for cyber-agent capability and risk.
    Tags: cybersecurity, agents, vulnerability-discovery, tool-use, offensive-ai

AI Paper Insight Brief

2026-05-08

0) Executive takeaways (read this first)

  • Evaluation is shifting from model-only scores to system- and process-level measurement. Several papers argue that deployment behavior depends on scaffolding, context, tools, and interaction design—not just model weights—and back this with new benchmarks for agent security, post-training failures, code search, image editing fidelity, and intervention side-effects.
  • Agent/tool safety is now a first-class operational problem. The strongest security papers focus on runtime interception, realistic red-teaming environments, and end-to-end exploit validation rather than prompt-only attacks. This suggests safety work is moving closer to deployment controls and adversarial operations.
  • Credit assignment is becoming the bottleneck in RL-style post-training. Multiple papers attack the same failure mode from different angles: step-level rewards for tool use, token-level advantages for reasoning RL, pass-rate control for binary-reward rollouts, and automated failure diagnosis for RFT pipelines.
  • Cheap internal or single-pass uncertainty signals are improving. Hallucination detection papers show that attention-derived or first-token confidence signals can rival more expensive sampling-based methods, but these signals either require white-box access or are currently scoped to narrow QA settings.
  • Routing and orchestration are emerging as both a capability lever and a security surface. MoE routing can be exploited via input-only attacks, while selective delegation and elastic context orchestration improve cost/accuracy for multi-agent and long-horizon systems.
  • Many “fixes” remain partial. Automatic remediation for post-training failures is unstable, routing defenses for MoE are weak, and several benchmark papers show that correctness often fails to translate into deployment utility, efficiency, or robustness.
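The single-decode confidence idea above can be made concrete with a small sketch. This assumes access to the logit vector of the first generated token; the entropy normalization and the 0.5 threshold are illustrative choices, not the papers' calibrated methods.

```python
import math

def first_token_confidence(logits):
    """Confidence from the distribution over the FIRST generated token.

    `logits` is the vocabulary logit vector from a single forward pass.
    Returns (max_prob, normalized_entropy); normalized entropy lies in
    [0, 1], with 1.0 meaning a uniform (maximally uncertain) distribution.
    """
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]  # shift for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]         # softmax
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return max(probs), entropy / math.log(len(probs))

def flag_hallucination_risk(logits, entropy_threshold=0.5):
    """Single-decode check: flag when the first token is too uncertain."""
    _, h = first_token_confidence(logits)
    return h > entropy_threshold

# A peaked distribution is confident; a flat one is flagged.
print(flag_hallucination_risk([10.0, 0.0, 0.0, 0.0]))  # False
print(flag_hallucination_risk([1.0, 1.0, 1.0, 1.0]))   # True
```

The appeal is operational: one forward pass, no sampling, no retrieval; the cost is that the threshold must be calibrated per model and task.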

2) Key themes (clusters)

  • System-level evaluation is replacing model-only evaluation
  • Runtime agent safety is moving from red-teaming to interception
  • RL/post-training reliability is now about process control, not just reward design
  • Hallucination detection is getting cheaper and more mechanistic
  • Benchmarks are getting more realistic, and more punishing
  • Routing and context management are becoming core infrastructure

3) Technical synthesis

  • Telemetry is becoming a training primitive. RFT-FM uses reward/KL/entropy/returns as invariants; EP-GRPO uses token entropy and policy divergence; Rollout Pass-Rate Control uses group pass-rate as a control target. Across papers, optimization is increasingly steered by process observables rather than only end rewards.
  • Stepwise structure is the dominant fix for long-horizon learning. FineStep, EP-GRPO, SADE, UNO-ORCHESTRA, and LongSeeker all impose intermediate structure—skills, step rewards, turn-level credit, meta-ops, or decomposition—to reduce trial-and-error behavior.
  • Verifiable judges are replacing free-form evaluation. DTAP, DoGMaTiQ, SLYP, and AgentTrust all rely on deterministic or structured validation signals tied to environment state, benchmark outcomes, or executable artifacts.
  • Several papers separate “correctness” from “usefulness.” KernelBench-X shows correct kernels are often slower than PyTorch; COREB shows retrieval-only evaluation misses reranking failures; StableI2I shows perceptually good edits can still violate source fidelity.
  • Transfer is now a central stress test. Misrouter studies surrogate-to-service transfer; SQSD transfers across architectures/scales; DTAP shows same backbone can vary sharply by harness; deployment-alignment work shows scaffold effects are model-dependent.
  • Sparse signals often outperform dense heuristics. TAGO updates only high-gradient audio-token regions; first-token entropy rivals semantic self-consistency; attention-divergence probes use sparse informative heads; RFT-FM relies on a small set of invariants.
  • Benchmarks are increasingly designed to expose hidden confounds. COREB targets contamination and trivial qrels; StableI2I targets source-conditioned drift; Deployment-Relevant Alignment audits missing interaction dimensions; Security Cube adds stability, transferability, and disruption depth beyond ASR.
  • Closed-loop automation is promising but immature. RFT-FM can detect and diagnose faults well but remediation is unstable; AgentTrust can intercept actions quickly but has static-analysis limits; SLYP shows end-to-end exploit validation is possible but expensive and context-heavy.
  • The field is converging on “system behavior = model + scaffold + environment.” This appears in alignment evaluation, agent red-teaming, orchestration, and runtime safety papers alike.
  • Inference optimization is becoming more principled. UniVer gives OT-based guarantees for speculative decoding, while UNO-ORCHESTRA and LongSeeker optimize cost through routing and context control rather than only model compression.
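The "telemetry as a training primitive" pattern can be illustrated with a minimal monitor. This is a generic rolling z-score sketch over metric streams such as reward, KL-to-reference, and entropy; it is not RFT-FM's detector, and the window size and threshold are arbitrary placeholders.

```python
import statistics
from collections import deque

class TelemetryMonitor:
    """Rolling z-score anomaly flags over post-training telemetry streams."""

    def __init__(self, window=50, z_max=4.0):
        self.window, self.z_max = window, z_max
        self.history = {}  # metric name -> sliding window of recent values

    def observe(self, metric, value):
        """Record one value; return True if it deviates from its baseline."""
        buf = self.history.setdefault(metric, deque(maxlen=self.window))
        anomalous = False
        if len(buf) >= 10:  # need a minimal baseline before flagging
            mu = statistics.fmean(buf)
            sigma = statistics.pstdev(buf) or 1e-8
            anomalous = abs(value - mu) / sigma > self.z_max
        buf.append(value)
        return anomalous

mon = TelemetryMonitor()
for v in [0.10, 0.12, 0.11, 0.13, 0.10] * 4:
    mon.observe("kl", v)       # stable KL baseline; nothing flagged
print(mon.observe("kl", 5.0))  # sudden KL spike -> True
```

In practice one monitor per metric stream (reward, KL, entropy, returns) gives the raw signal; the hard part, per the papers, is turning a flag into a correct diagnosis and a safe remediation.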

4) Top 5 papers (with “why now”)

1. DecodingTrust-Agent Platform (DTap): A Controllable and Interactive Red-Teaming Platform for AI Agents

  • Provides a full-stack agent security platform: 50+ environments across 14 domains, an autonomous red-teaming agent, and a 6,682-task policy-grounded benchmark.
  • Surfaces deployment-relevant vulnerabilities across frameworks and backbones, including high ASRs in both indirect and direct threat models.
  • Useful now because agent security evaluation is bottlenecked by unrealistic environments and weak automation; DTAP offers a reusable substrate for both benchmarking and defense testing.
  • Skepticism / limitation: Many attacks were optimized against a surrogate victim, so some results should be read as matched-generation upper bounds rather than pure transfer performance.

2. Agentic Vulnerability Reasoning on Windows COM Binaries

  • Demonstrates end-to-end agentic vulnerability discovery plus debugger-verified PoC generation on closed-source binaries.
  • Strong practical impact: 28 previously unknown vulnerabilities confirmed by MSRC, 16 CVEs, and $140K in bounties.
  • Useful now because it shows agentic security systems can move beyond triage into validated exploit evidence, which is much closer to real security workflows.
  • Skepticism / limitation: The approach is expensive, depends on decompiler quality, and remains specialized to COM race-condition bugs.

3. Towards Robust LLM Post-Training: Automatic Failure Management for Reinforcement Fine-Tuning

  • Introduces the first structured benchmark for RFT anomalies and an end-to-end detect/diagnose/remediate pipeline.
  • Detection is strong on benchmarked faults (F1 87.96% easy, 73.88% hard), and diagnosis is useful enough to support automated intervention experiments.
  • Useful now because post-training reliability is becoming a major cost center, and most labs still debug RLHF/RFT failures manually.
  • Skepticism / limitation: Remediation is not yet reliable; overall median severity change is negative, and subtle faults remain hard.

4. Deployment-Relevant Alignment Cannot Be Inferred from Model-Level Evaluation Alone

  • Makes a sharp methodological claim: deployment alignment lives at the interaction/system level, not the model-only level.
  • Backs the claim with a dual-coded audit of 16 benchmarks and a blinded stress test showing scaffold effects are strongly model-dependent.
  • Useful now because many alignment claims still overgeneralize from response-level benchmarks to deployed systems.
  • Skepticism / limitation: The stress test is intentionally small and proof-of-principle; broader generalization across domains and dimensions is still open.

5. AgentTrust: Runtime Safety Evaluation and Interception for AI Agent Tool Use

  • Offers a deployable runtime interceptor with deobfuscation, policy rules, chain-aware risk tracking, safe-fix suggestions, and optional LLM judging.
  • Achieves high verdict accuracy with low-millisecond latency on its benchmarks, making it one of the more operationally plausible safety layers in the batch.
  • Useful now because tool-using agents need pre-execution controls, not just post-hoc evaluation.
  • Skepticism / limitation: The rule-only path is fundamentally limited on runtime semantics and deep obfuscation; coverage will require continual extension.
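A pre-execution guardrail in the spirit of AgentTrust-style interception can be sketched in a few lines. The tool names, deny patterns, and verdict labels below are invented for illustration and are not the paper's rule set; a real interceptor would add deobfuscation and chain-aware risk tracking.

```python
import re
from dataclasses import dataclass

@dataclass
class ToolCall:
    tool: str
    args: dict

# Hypothetical policy: hard-deny patterns and tools requiring human review.
DENY_PATTERNS = [r"rm\s+-rf\s+/", r"curl[^|]*\|\s*(ba)?sh"]
REVIEW_TOOLS = {"shell", "send_email"}

def intercept(call: ToolCall) -> str:
    """Return a verdict -- 'deny', 'review', or 'allow' -- before execution.

    Normalize the argument string, match deny rules first, then route
    risky tools to human review, and allow everything else.
    """
    cmd = " ".join(str(v) for v in call.args.values()).lower()
    if any(re.search(p, cmd) for p in DENY_PATTERNS):
        return "deny"
    if call.tool in REVIEW_TOOLS:
        return "review"
    return "allow"

print(intercept(ToolCall("shell", {"cmd": "rm -rf /"})))       # deny
print(intercept(ToolCall("shell", {"cmd": "ls -la"})))         # review
print(intercept(ToolCall("web_search", {"q": "weather"})))     # allow
```

The design point worth copying is ordering: deterministic deny rules run before any review or LLM-judge path, so the cheapest check gates the most dangerous actions.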

5) Practical next steps

  • Instrument post-training runs like production systems. Log reward, KL, entropy, returns, generation quality, and environment/tool feedback in a form suitable for anomaly detection and root-cause attribution.
  • Add runtime action interception for agents before broader deployment. Typed action schemas, shell normalization, policy rules, and fail-safe review modes are now table stakes for tool use.
  • Evaluate alignment claims at the scaffold level, not just the model level. For any deployment-critical workflow, test multiple system prompts, verification scaffolds, and UI/tool configurations against the same model.
  • Adopt richer robustness metrics than ASR alone. Include transferability, stability across runs, utility loss, latency/cost overhead, and where possible representational or trajectory-level disruption signals.
  • For RL with binary rewards, monitor rollout pass-rate distribution. If groups are mostly all-pass or all-fail, you are likely wasting rollout budget; test replay or curriculum mechanisms that move training toward higher-information regimes.
  • Use step-level rewards where tool traces are available. SQL, code, and agent tasks with executor feedback are good candidates for process rewards and per-step advantage estimation.
  • Benchmark full pipelines, not isolated components. For retrieval, include reranking; for image editing, include source fidelity; for kernels, separate compile/correctness/efficiency; for agents, include environment outcomes.
  • Treat routing as both optimization target and threat surface. If you deploy MoE or multi-worker systems, test routing-aware attacks and monitor whether orchestration policies create predictable exploit paths.
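The pass-rate monitoring step above can be made concrete with a small diagnostic. This is a generic sketch of the degenerate-group problem under group-relative baselines (GRPO-style training), not the paper's controller; the group structure is assumed to be per-prompt rollout batches.

```python
def pass_rate_report(groups):
    """Summarize rollout pass-rates for binary-reward RL.

    `groups` maps prompt id -> list of 0/1 rollout outcomes. Groups that
    are all-pass or all-fail yield zero advantage under a group-relative
    baseline, so the 'informative fraction' is the share of rollout
    budget that actually produces gradient signal.
    """
    degenerate = sum(
        1 for outcomes in groups.values()
        if len(set(outcomes)) == 1  # all-pass or all-fail
    )
    total = len(groups)
    return {
        "mean_pass_rate": sum(sum(o) / len(o) for o in groups.values()) / total,
        "informative_fraction": 1 - degenerate / total,
    }

report = pass_rate_report({
    "p1": [1, 1, 1, 1],  # all-pass: no signal
    "p2": [0, 0, 0, 0],  # all-fail: no signal
    "p3": [1, 0, 1, 0],  # mixed: informative
    "p4": [1, 1, 0, 1],  # mixed: informative
})
print(report)  # informative_fraction = 0.5
```

If the informative fraction trends toward zero, replay or curriculum mechanisms that move prompts back into the mixed-outcome regime are the natural lever.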
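Likewise, the step-level reward suggestion can be sketched with a generic estimator: discounted return-to-go per step with a mean baseline. This is a textbook-style illustration, not the specific estimator from the Text-to-SQL paper; `step_rewards` is assumed to come from per-step executor feedback.

```python
def step_advantages(step_rewards, gamma=1.0):
    """Per-step advantages from process rewards (e.g. executor feedback).

    Computes the discounted return-to-go at each step, then subtracts the
    mean return as a baseline. Contrast with a trajectory-level binary
    reward, which assigns one identical scalar to every step.
    """
    returns = []
    g = 0.0
    for r in reversed(step_rewards):  # accumulate return-to-go backwards
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    baseline = sum(returns) / len(returns)
    return [g - baseline for g in returns]

# Early steps that lead to later successes get positive credit.
print(step_advantages([0.0, 1.0, 0.0, 1.0]))  # [0.5, 0.5, -0.5, -0.5]
```

The point of the exercise: with per-step rewards, the two early steps receive distinct, positive credit, which a single end-of-trajectory scalar cannot provide.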

Generated from per-paper analyses; no external browsing.