Daily AI Paper Report (2026-05-08)
Run stats
- Candidates: 278
- Selected: 30
- Deepread completed: 30
- Window (UTC): 2026-05-06T00:00:00Z → 2026-05-07T00:00:00Z (arxiv_announce, expanded=0)
Selected papers:
| arXiv ID | Title / Links | Categories | Score | Why | Tags |
|---|---|---|---|---|---|
| 2605.04785 | AgentTrust: Runtime Safety Evaluation and Interception for AI Agent Tool Use | cs.AI, cs.CR | 95 | Runtime interception for agent tool use is highly deployment-relevant safety work with concrete controls. | agent-safety, tool-use, runtime-monitoring, guardrails, security |
| 2605.05116 | On the Hardness of Junking LLMs | cs.LG | 95 | Studies jailbreaks via promptless token triggers/natural backdoors; highly relevant to LLM safety. | llm-safety, jailbreaks, backdoors, adversarial-prompts, robustness |
| 2605.04431 | Towards Robust LLM Post-Training: Automatic Failure Management for Reinforcement Fine-Tuning | cs.SE, cs.AI | 94 | Targets fragile RL post-training with a new failure benchmark and automatic failure management. | LLM post-training, RFT, reliability, benchmark, automation |
| 2605.04808 | DecodingTrust-Agent Platform (DTap): A Controllable and Interactive Red-Teaming Platform for AI Agents | cs.AI | 93 | Controllable red-teaming platform for AI agents targets realistic, reproducible agent security evaluation. | agent-safety, red-teaming, evaluation, agents, security-benchmarks |
| 2605.04992 | You Snooze, You Lose: Automatic Safety Alignment Restoration through Neural Weight Translation | cs.CR | 93 | Restores safety after unsafe LoRA merges while preserving skills; practical open-model guardrail work. | alignment, lora, safety-restoration, open-source-llms, guardrails |
| 2605.05058 | SoK: Robustness in Large Language Models against Jailbreak Attacks | cs.CR, cs.AI | 92 | Systematizes jailbreak robustness and proposes a multidimensional evaluation framework for LLM security. | jailbreaks, llm-safety, survey, evaluation, security |
| 2605.04572 | From Parameter Dynamics to Risk Scoring: Quantifying Sample-Level Safety Degradation in LLM Fine-tuning | cs.AI, cs.LG | 91 | Analyzes fine-tuning safety degradation dynamics and scores risky samples; strong alignment relevance. | alignment, fine-tuning, safety-degradation, risk-scoring, llm-reliability |
| 2605.05112 | Rollout Pass-Rate Control: Steering Binary-Reward RL Toward Its Most Informative Regime | cs.LG | 91 | Targets agentic RL efficiency for SWE-bench-style systems with a clear control principle and gains. | agentic-rl, evaluation, reasoning, training-efficiency, SWE-bench |
| 2605.04454 | Deployment-Relevant Alignment Cannot Be Inferred from Model-Level Evaluation Alone | cs.AI, cs.HC, cs.LG, cs.SE | 90 | Important alignment argument: model-level benchmarks alone miss deployment-level alignment evidence. | alignment, evaluation, deployment, reliability, ai-safety |
| 2605.04543 | UniVer: A Unified Perspective for Multi-step and Multi-draft Speculative Decoding | cs.CL, cs.LG | 90 | Unified speculative decoding for multi-step/multi-draft trees; strong LLM inference efficiency relevance. | LLM inference, speculative decoding, efficiency, decoding, transformers |
| 2605.04446 | Misrouter: Exploiting Routing Mechanisms for Input-Only Attacks on Mixture-of-Experts LLMs | cs.CR | 89 | Input-only attack on MoE routing exposes a practical new safety/security failure mode for hosted LLMs. | moe, adversarial-attacks, jailbreaks, llm-security, frontier-models |
| 2605.05090 | Automatically Finding and Validating Unexpected Side-Effects of Interventions on Language Models | cs.CL, cs.AI | 89 | Automated auditing pipeline finds intended and unintended behavior changes after LM interventions. | auditing, evaluation, model-interventions, unlearning, knowledge-editing |
| 2605.05040 | Preference-Based Self-Distillation: Beyond KL Matching via Reward Regularization | cs.LG, cs.AI | 88 | Preference-based self-distillation for LLMs; directly relevant to post-training stability and reasoning. | LLM, alignment, self-distillation, preference-learning, reasoning |
| 2605.04615 | Beyond Retrieval: A Multitask Benchmark and Model for Code Search | cs.SE, cs.AI | 88 | Contamination-limited multitask code search benchmark and reranker for realistic retrieval pipelines. | code search, benchmark, retrieval, reranking, evaluation |
| 2605.05003 | Misaligned by Reward: Socially Undesirable Preferences in LLMs | cs.CL, cs.AI, cs.CY | 87 | Probes reward models for socially undesirable preferences across safety, bias, morality, and ethics. | reward-models, alignment, social-preferences, safety-evaluation, rlhf |
| 2605.05025 | Detecting Hallucinations in Large Language Models via Internal Attention Divergence Signals | cs.CL | 87 | Lightweight single-pass hallucination detection using internal attention signals; practical reliability angle. | LLM, hallucination, uncertainty, interpretability, reliability |
| 2605.05166 | The First Token Knows: Single-Decode Confidence for Hallucination Detection | cs.CL, cs.AI | 86 | Single-decode first-token confidence for hallucination detection is efficient and deployment-friendly. | hallucination, uncertainty, factuality, evaluation, confidence |
| 2605.04458 | DoGMaTiQ: Automated Generation of Question-and-Answer Nuggets for Report Evaluation | cs.CL, cs.IR | 86 | Automates QA nugget generation for evaluating long-form citation-backed RAG reports. | RAG, evaluation, long-form generation, QA nuggets, report assessment |
| 2605.04700 | Sparse Tokens Suffice: Jailbreaking Audio Language Models via Token-Aware Gradient Optimization | cs.CR, cs.AI, cs.CL, cs.LG, cs.SD | 85 | Shows sparse token-aware jailbreaks on audio language models, extending multimodal attack understanding. | audio-language-models, jailbreaks, multimodal-safety, adversarial-attacks, security |
| 2605.04530 | SADE: Symptom-Aware Diagnostic Escalation for LLM-Based Network Troubleshooting | cs.NI, cs.AI | 84 | Agentic troubleshooting with explicit phase-gated policy; useful for reliable tool-using agents. | agents, tool-use, reliability, workflow, evaluation |
| 2605.05134 | Low-Cost Black-Box Detection of LLM Hallucinations via Dynamical System Prediction | cs.LG, math.DS | 84 | Black-box hallucination detection without sampling or retrieval; useful if results hold across domains. | LLM, hallucination, black-box, uncertainty, factuality |
| 2605.05103 | Text Corpora as Concept Fields: Black-Box Hallucination and Novelty Measurement | cs.CL, cs.AI, cs.CY | 84 | Black-box, corpus-attributable hallucination and novelty metric with uncertainty for groundedness checks. | hallucination, groundedness, evaluation, uncertainty, black-box |
| 2605.05007 | Uno-Orchestra: Parsimonious Agent Routing via Selective Delegation | cs.AI | 83 | Unified routing/decomposition policy for multi-agent systems with strong efficiency and benchmark gains. | agents, orchestration, routing, efficiency, multi-agent |
| 2605.04960 | EP-GRPO: Entropy-Progress Aligned Group Relative Policy Optimization with Implicit Process Guidance | cs.LG, cs.AI | 83 | Improves GRPO credit assignment for LLM reasoning with dense self-supervised guidance. | llm-reasoning, rlvr, grpo, post-training, credit-assignment |
| 2605.04956 | KernelBench-X: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels | cs.LG, cs.PF | 83 | Benchmark for LLM-generated GPU kernels with failure analysis; reusable eval for code-generation limits. | benchmark, LLM, code-generation, evaluation, efficiency |
| 2605.04893 | Self-Attention as Transport: Limits of Symmetric Spectral Diagnostics | cs.LG, cs.CL, stat.ML | 82 | Theoretical attention diagnostic work tied to hallucination failure modes and information-flow asymmetry. | interpretability, attention, hallucination, theory, diagnostics |
| 2605.05191 | LongSeeker: Elastic Context Orchestration for Long-Horizon Search Agents | cs.AI | 81 | Adaptive context orchestration for long-horizon search agents could improve scalable agent reliability. | agents, long-context, search, context-management, reasoning |
| 2605.04719 | Every Step Counts: Step-Level Credit Assignment for Tool-Integrated Text-to-SQL | cs.CL | 81 | Step-level credit assignment for tool-integrated Text-to-SQL improves process supervision. | tool-use, agents, process-supervision, text-to-sql, reinforcement-learning |
| 2605.04453 | StableI2I: Spotting Unintended Changes in Image-to-Image Transition | cs.CV, cs.AI | 81 | Evaluates unintended changes in image-to-image systems; strong benchmark value for model reliability. | evaluation, multimodal, image-to-image, robustness, benchmark |
| 2605.05000 | Agentic Vulnerability Reasoning on Windows COM Binaries | cs.CR, cs.LG | 80 | Agentic vulnerability reasoning with debugger-verified PoCs is impactful for cyber-agent capability and risk. | cybersecurity, agents, vulnerability-discovery, tool-use, offensive-ai |
AI Paper Insight Brief
2026-05-08
0) Executive takeaways (read this first)
- Evaluation is shifting from model-only scores to system- and process-level measurement. Several papers argue that deployment behavior depends on scaffolding, context, tools, and interaction design—not just model weights—and back this with new benchmarks for agent security, post-training failures, code search, image editing fidelity, and intervention side-effects.
- Agent/tool safety is now a first-class operational problem. The strongest security papers focus on runtime interception, realistic red-teaming environments, and end-to-end exploit validation rather than prompt-only attacks. This suggests safety work is moving closer to deployment controls and adversarial operations.
- Credit assignment is becoming the bottleneck in RL-style post-training. Multiple papers attack the same failure mode from different angles: step-level rewards for tool use, token-level advantages for reasoning RL, pass-rate control for binary-reward rollouts, and automated failure diagnosis for RFT pipelines.
- Cheap internal or single-pass uncertainty signals are improving. Hallucination detection papers show that attention-derived or first-token confidence signals can rival more expensive sampling-based methods, but they require either white-box access or are currently scoped to narrow QA settings.
- Routing and orchestration are emerging as both a capability lever and a security surface. MoE routing can be exploited via input-only attacks, while selective delegation and elastic context orchestration improve cost/accuracy for multi-agent and long-horizon systems.
- Many “fixes” remain partial. Automatic remediation for post-training failures is unstable, routing defenses for MoE are weak, and several benchmark papers show that correctness often fails to translate into deployment utility, efficiency, or robustness.
1) Key themes (clusters)
Theme: System-level evaluation is replacing model-only evaluation
- Why it matters: A recurring message today is that benchmark scores on isolated prompts are increasingly insufficient. What matters in deployment is the full stack: scaffolding, tools, memory, UI, retrieval, and runtime controls.
- Representative papers:
- Deployment-Relevant Alignment Cannot Be Inferred from Model-Level Evaluation Alone
- DecodingTrust-Agent Platform (DTap): A Controllable and Interactive Red-Teaming Platform for AI Agents
- Automatically Finding and Validating Unexpected Side-Effects of Interventions on Language Models
- SoK: Robustness in Large Language Models against Jailbreak Attacks
- Common approach:
- Build richer evaluation objects: interaction-level rubrics, simulated environments, verifiable judges, or statistically validated natural-language hypotheses.
- Measure properties that standard benchmarks miss: verification support, process steerability, side effects, transferability, attack stability, and runtime overhead.
- Use blinded coding or deterministic judges to reduce evaluator leakage and improve reproducibility (a minimal judge sketch follows this theme).
- Open questions / failure modes:
- How well do simulated environments and fixed scaffolds transfer to live deployments?
- Many evaluation pipelines still depend on prompt banks, surrogate victims, or judge models.
- Interactional metrics are more realistic but harder to standardize and compare across labs.
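One common move above is replacing free-form LLM grading with deterministic checks over recorded environment state. Below is a minimal sketch of what such a judge might look like; the snapshot schema, check names, and the action-budget threshold are illustrative assumptions, not an interface from any of the papers.

```python
from dataclasses import dataclass

@dataclass
class EnvSnapshot:
    """Illustrative environment state captured after an agent episode."""
    files_written: set[str]
    commands_run: list[str]
    task_output: str

def deterministic_judge(snap: EnvSnapshot, expected_output: str,
                        forbidden_paths: set[str]) -> dict[str, bool]:
    """Score an episode from verifiable state instead of free-form LLM grading.

    Returning separate verdicts supports diagnosis, not just a leaderboard rank.
    """
    return {
        "task_success": snap.task_output.strip() == expected_output.strip(),
        "no_side_effects": snap.files_written.isdisjoint(forbidden_paths),
        "action_budget_ok": len(snap.commands_run) <= 50,
    }
```

Because each verdict is a pure function of recorded state, re-running the judge is deterministic and grading-model leakage is avoided.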
Theme: Runtime agent safety is moving from red-teaming to interception
- Why it matters: For tool-using agents, the main risk is no longer just unsafe text—it is unsafe action. The most practical work today focuses on stopping harmful side effects before execution.
- Representative papers:
- AgentTrust: Runtime Safety Evaluation and Interception for AI Agent Tool Use
- DecodingTrust-Agent Platform (DTap): A Controllable and Interactive Red-Teaming Platform for AI Agents
- Agentic Vulnerability Reasoning on Windows COM Binaries
- Misrouter: Exploiting Routing Mechanisms for Input-Only Attacks on Mixture-of-Experts LLMs
- Common approach:
- Interpose a runtime layer between model intent and tool execution.
- Use typed actions, deobfuscation, chain-aware risk tracking, and verifiable environment outcomes (a minimal interceptor sketch follows this theme).
- Evaluate on realistic attack surfaces: shell commands, MCP tools, COM binaries, prompt/tool/skill/environment injections.
- Open questions / failure modes:
- Static or regex-heavy defenses hit a ceiling on runtime semantics and deep obfuscation.
- Attack transfer remains architecture- and harness-dependent.
- Strong red-teaming results do not yet imply robust, low-friction production defenses.
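To make the interception pattern concrete, here is a minimal sketch of a pre-execution filter over typed tool calls. The action schema, deny rules, and verdicts are illustrative assumptions; a production system such as AgentTrust layers deobfuscation, chain-aware risk tracking, and optional LLM judging on top of a skeleton like this.

```python
import re
import shlex
from dataclasses import dataclass

@dataclass
class ToolCall:
    tool: str   # e.g. "shell", "http", "file_write" (illustrative tool names)
    args: dict

# Illustrative deny rules; real interceptors carry far larger, curated rule sets.
SHELL_DENY = [
    re.compile(r"\brm\s+-rf\s+/"),          # destructive filesystem wipe
    re.compile(r"\bcurl\b.*\|\s*(ba)?sh"),  # pipe-to-shell remote execution
]

def intercept(call: ToolCall) -> str:
    """Return 'allow', 'block', or 'review' BEFORE the tool executes."""
    if call.tool == "shell":
        # Normalize quoting so trivially obfuscated commands still match rules.
        cmd = " ".join(shlex.split(call.args.get("command", "")))
        if any(rule.search(cmd) for rule in SHELL_DENY):
            return "block"
        return "review" if "sudo" in cmd else "allow"
    if call.tool == "file_write" and call.args.get("path", "").startswith("/etc/"):
        return "block"
    return "allow"
```

The key design point is that the verdict is computed on the typed, normalized action, not on the model's free-text intent.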
Theme: RL/post-training reliability is now about process control, not just reward design
- Why it matters: Several papers converge on the same operational reality: RL-style post-training fails because learning signals are sparse, misassigned, or generated in low-information regimes. Better process control may matter as much as better objectives.
- Representative papers:
- Towards Robust LLM Post-Training: Automatic Failure Management for Reinforcement Fine-Tuning
- Every Step Counts: Step-Level Credit Assignment for Tool-Integrated Text-to-SQL
- EP-GRPO: Entropy-Progress Aligned Group Relative Policy Optimization with Implicit Process Guidance
- Rollout Pass-Rate Control: Steering Binary-Reward RL Toward Its Most Informative Regime
- Common approach:
- Replace uniform trajectory-level credit with step- or token-level signals.
- Use training telemetry or rollout statistics as first-class control variables.
- Add adaptive mechanisms: invariant-based anomaly detection, entropy gating, progress signals, or prefix replay to target informative rollouts (see the sketch after this theme).
- Open questions / failure modes:
- Methods are often domain-specific: SQL executors, binary rewards, or matched RFT telemetry.
- Automatic remediation is still brittle; in the RFT failure-management system (RFT-FM), interventions sometimes worsened failure severity.
- Better credit assignment may improve learning efficiency without solving objective misspecification.
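The pass-rate and credit-assignment ideas above share a simple core: within-group reward statistics carry the learning signal. A minimal sketch, assuming binary rewards and a GRPO-style group baseline; the band thresholds and the filtering rule are simplified stand-ins for the papers' actual replay/curriculum mechanics.

```python
import numpy as np

def group_advantages(rewards: np.ndarray) -> np.ndarray:
    """GRPO-style advantage: standardize rewards within one rollout group."""
    std = rewards.std()
    if std == 0.0:                 # all-pass or all-fail: no in-group signal
        return np.zeros_like(rewards)
    return (rewards - rewards.mean()) / std

def informative(rewards: np.ndarray, lo: float = 0.2, hi: float = 0.8) -> bool:
    """Keep groups whose pass rate sits in a target band (illustrative band)."""
    return lo <= rewards.mean() <= hi

# Example: a group of 8 rollouts for one prompt, 3 of which solved the task.
r = np.array([1, 0, 1, 0, 0, 1, 0, 0], dtype=float)
if informative(r):
    adv = group_advantages(r)  # weights for the policy-gradient update
```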
Theme: Hallucination detection is getting cheaper and more mechanistic
- Why it matters: There is clear demand for low-cost hallucination signals that can run in one pass, especially for closed or latency-sensitive deployments.
- Representative papers:
- Detecting Hallucinations in Large Language Models via Internal Attention Divergence Signals
- The First Token Knows: Single-Decode Confidence for Hallucination Detection
- Low-Cost Black-Box Detection of LLM Hallucinations via Dynamical System Prediction
- Text Corpora as Concept Fields: Black-Box Hallucination and Novelty Measurement
- Common approach:
- Extract uncertainty from internal dynamics or output trajectories rather than repeated sampling (see the sketch after this theme).
- Use lightweight probes or geometric scores over attention, logits, embeddings, or corpus transition fields.
- Frame detection as selective classification or correctness prediction rather than full factual verification.
- Open questions / failure modes:
- White-box methods need access to attention/logits; black-box methods may depend on embedding choice or corpus coverage.
- Strong results are often task-specific: short-answer QA, summarization, or corpus-grounded settings.
- Detection remains easier than correction or abstention.
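As an illustration of the single-pass idea, the sketch below derives confidence signals from the logits of only the first generated token. The scores and the thresholding convention are simplified assumptions; the papers' exact signals (attention divergence, trajectory dynamics, corpus concept fields) differ.

```python
import torch
import torch.nn.functional as F

def first_token_confidence(logits: torch.Tensor) -> dict[str, float]:
    """Confidence signals from the FIRST decode step's logits, shape (vocab,).

    Low entropy / high max-probability is used as a proxy for answer
    confidence; thresholds must be calibrated per task on held-out data.
    """
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs.clamp_min(1e-12))).sum()
    return {"max_prob": probs.max().item(), "entropy": entropy.item()}

# Selective answering: abstain when the first decode step is uncertain.
# scores = first_token_confidence(first_step_logits)
# answer = scores["entropy"] < tau   # tau fit on a validation set
```

Framed this way, detection is selective classification over the model's own uncertainty, not factual verification against an external source.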
Theme: Benchmarks are getting more realistic—and more punishing
- Why it matters: New benchmarks are exposing failure modes hidden by older, cleaner tasks: contamination, hard negatives, structural fidelity, quantization edge cases, and realistic agent environments.
- Representative papers:
- Beyond Retrieval: A Multitask Benchmark and Model for Code Search
- KernelBench-X: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels
- StableI2I: Spotting Unintended Changes in Image-to-Image Transition
- DecodingTrust-Agent Platform (DTap): A Controllable and Interactive Red-Teaming Platform for AI Agents
- Common approach:
- Add contamination controls, graded relevance, hard negatives, or source-conditioned fidelity checks (a toy contamination screen is sketched after this theme).
- Evaluate full pipelines rather than isolated first-stage retrieval or perceptual quality.
- Release artifacts that support diagnosis, not just leaderboard ranking.
- Open questions / failure modes:
- Benchmarks remain bounded by their construction choices: synthetic rewrites, generated queries, or narrow domains.
- Better benchmark realism often lowers apparent performance and complicates cross-paper comparison.
- Some tasks remain largely unsolved despite high compile or retrieval rates.
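As a toy version of the contamination controls mentioned above, the sketch below screens benchmark items by n-gram overlap against a pretraining index. The function, n-gram size, and thresholding convention are illustrative; real decontamination pipelines are substantially stronger.

```python
def ngram_overlap(candidate: str, corpus_ngrams: set[tuple], n: int = 8) -> float:
    """Fraction of the candidate's word n-grams found in a pretraining index.

    A crude contamination screen: benchmark items scoring above a chosen
    threshold get flagged, rewritten, or dropped before release.
    """
    toks = candidate.split()
    grams = {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    if not grams:
        return 0.0
    return len(grams & corpus_ngrams) / len(grams)

# Usage: flag items with, say, >20% overlap (threshold is an assumption).
# flagged = [x for x in items if ngram_overlap(x, index) > 0.2]
```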
Theme: Routing and context management are becoming core infrastructure
- Why it matters: As systems become multi-model, MoE-based, and long-horizon, routing decisions and context curation increasingly determine both capability and safety.
- Representative papers:
- Uno-Orchestra: Parsimonious Agent Routing via Selective Delegation
- LongSeeker: Elastic Context Orchestration for Long-Horizon Search Agents
- Misrouter: Exploiting Routing Mechanisms for Input-Only Attacks on Mixture-of-Experts LLMs
- UniVer: A Unified Perspective for Multi-step and Multi-draft Speculative Decoding
- Common approach:
- Treat routing as an optimization problem over cost, capability, and structure (see the sketch after this theme).
- Learn decomposition/routing jointly or manage context with explicit meta-operations.
- Exploit or optimize hidden routing structure for either attack transfer or inference efficiency.
- Open questions / failure modes:
- Routing policies can become a new attack surface.
- Gains depend on worker pools, tree topology, or surrogate-target similarity.
- Context compression and delegation policies may overfit to teacher-generated supervision or current provider mixes.
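A minimal sketch of routing as cost/capability optimization. The greedy rule and worker schema are illustrative assumptions; learned policies like Uno-Orchestra's additionally decide when to decompose tasks and delegate.

```python
def route(task: dict, workers: list[dict]) -> dict:
    """Pick the cheapest worker whose estimated skill clears the task's bar.

    `workers` entries look like {"name": ..., "cost": ..., "skill": {tag: score}};
    this greedy rule is a stand-in for a learned routing policy.
    """
    needed = task["difficulty"]
    viable = [w for w in workers
              if w["skill"].get(task["type"], 0.0) >= needed]
    if not viable:
        # No single worker suffices: decompose, or escalate to the strongest.
        return max(workers, key=lambda w: w["skill"].get(task["type"], 0.0))
    return min(viable, key=lambda w: w["cost"])

workers = [
    {"name": "small", "cost": 1.0, "skill": {"sql": 0.6, "code": 0.5}},
    {"name": "large", "cost": 8.0, "skill": {"sql": 0.9, "code": 0.9}},
]
choice = route({"type": "sql", "difficulty": 0.7}, workers)  # -> "large"
```

The same structure also shows why routing is an attack surface: anything that perturbs the skill or cost estimates steers execution.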
2) Technical synthesis
- Telemetry is becoming a training primitive. RFT-FM uses reward/KL/entropy/returns as invariants; EP-GRPO uses token entropy and policy divergence; Rollout Pass-Rate Control uses group pass-rate as a control target. Across papers, optimization is increasingly steered by process observables rather than only end rewards (a minimal monitoring sketch follows this list).
- Stepwise structure is the dominant fix for long-horizon learning. FineStep, EP-GRPO, SADE, UNO-ORCHESTRA, and LongSeeker all impose intermediate structure—skills, step rewards, turn-level credit, meta-ops, or decomposition—to reduce trial-and-error behavior.
- Verifiable judges are replacing free-form evaluation. DTAP, DoGMaTiQ, SLYP, and AgentTrust all rely on deterministic or structured validation signals tied to environment state, benchmark outcomes, or executable artifacts.
- Several papers separate “correctness” from “usefulness.” KernelBench-X shows correct kernels are often slower than PyTorch; COREB shows retrieval-only evaluation misses reranking failures; StableI2I shows perceptually good edits can still violate source fidelity.
- Transfer is now a central stress test. Misrouter studies surrogate-to-service transfer; SQSD transfers across architectures/scales; DTAP shows the same backbone can vary sharply by harness; deployment-alignment work shows that scaffold effects are model-dependent.
- Sparse signals often outperform dense heuristics. TAGO updates only high-gradient audio-token regions; first-token entropy rivals semantic self-consistency; attention-divergence probes use sparse informative heads; RFT-FM relies on a small set of invariants.
- Benchmarks are increasingly designed to expose hidden confounds. COREB targets contamination and trivial qrels; StableI2I targets source-conditioned drift; Deployment-Relevant Alignment audits missing interaction dimensions; Security Cube adds stability, transferability, and disruption depth beyond raw attack success rate (ASR).
- Closed-loop automation is promising but immature. RFT-FM can detect and diagnose faults well but remediation is unstable; AgentTrust can intercept actions quickly but has static-analysis limits; SLYP shows end-to-end exploit validation is possible but expensive and context-heavy.
- The field is converging on “system behavior = model + scaffold + environment.” This appears in alignment evaluation, agent red-teaming, orchestration, and runtime safety papers alike.
- Inference optimization is becoming more principled. UniVer gives OT-based guarantees for speculative decoding, while UNO-ORCHESTRA and LongSeeker optimize cost through routing and context control rather than only model compression.
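A minimal sketch of the telemetry pattern from the first bullet: one rolling-band monitor per process observable. The window size, warm-up count, and threshold are illustrative assumptions; the RFT failure-management paper's invariants and root-cause diagnosis are far richer.

```python
from collections import deque

class TelemetryInvariant:
    """Flag post-training runs whose telemetry drifts outside a rolling band."""

    def __init__(self, window: int = 200, k: float = 4.0):
        self.history = deque(maxlen=window)
        self.k = k  # band half-width in standard deviations (illustrative)

    def check(self, value: float) -> bool:
        """Return True if `value` is anomalous relative to the rolling window."""
        anomalous = False
        if len(self.history) >= 20:  # warm-up before judging anything
            mean = sum(self.history) / len(self.history)
            var = sum((x - mean) ** 2 for x in self.history) / len(self.history)
            anomalous = abs(value - mean) > self.k * (var ** 0.5 + 1e-8)
        self.history.append(value)
        return anomalous

# One monitor per signal: reward, KL to the reference policy, entropy, returns.
monitors = {name: TelemetryInvariant() for name in ("reward", "kl", "entropy")}
```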
3) Top 5 papers (with “why now”)
1. DecodingTrust-Agent Platform (DTap): A Controllable and Interactive Red-Teaming Platform for AI Agents
- Provides a full-stack agent security platform: 50+ environments across 14 domains, an autonomous red-teaming agent, and a 6,682-task policy-grounded benchmark.
- Surfaces deployment-relevant vulnerabilities across frameworks and backbones, including high ASRs in both indirect and direct threat models.
- Useful now because agent security evaluation is bottlenecked by unrealistic environments and weak automation; DTAP offers a reusable substrate for both benchmarking and defense testing.
- Skepticism / limitation: Many attacks were optimized against a surrogate victim, so some results should be read as matched-generation upper bounds rather than pure transfer performance.
2. Agentic Vulnerability Reasoning on Windows COM Binaries
- Demonstrates end-to-end agentic vulnerability discovery plus debugger-verified PoC generation on closed-source binaries.
- Strong practical impact: 28 previously unknown vulnerabilities confirmed by MSRC, 16 CVEs, and $140K in bounties.
- Useful now because it shows agentic security systems can move beyond triage into validated exploit evidence, which is much closer to real security workflows.
- Skepticism / limitation: The approach is expensive, depends on decompiler quality, and remains specialized to COM race-condition bugs.
3. Towards Robust LLM Post-Training: Automatic Failure Management for Reinforcement Fine-Tuning
- Introduces the first structured benchmark for RFT anomalies and an end-to-end detect/diagnose/remediate pipeline.
- Detection is strong on benchmarked faults (F1 87.96% easy, 73.88% hard), and diagnosis is useful enough to support automated intervention experiments.
- Useful now because post-training reliability is becoming a major cost center, and most labs still debug RLHF/RFT failures manually.
- Skepticism / limitation: Remediation is not yet reliable; overall median severity change is negative, and subtle faults remain hard.
4. Deployment-Relevant Alignment Cannot Be Inferred from Model-Level Evaluation Alone
- Makes a sharp methodological claim: deployment alignment lives at the interaction/system level, not the model-only level.
- Backs the claim with a dual-coded audit of 16 benchmarks and a blinded stress test showing scaffold effects are strongly model-dependent.
- Useful now because many alignment claims still overgeneralize from response-level benchmarks to deployed systems.
- Skepticism / limitation: The stress test is intentionally small and proof-of-principle; broader generalization across domains and dimensions is still open.
5. AgentTrust: Runtime Safety Evaluation and Interception for AI Agent Tool Use
- Offers a deployable runtime interceptor with deobfuscation, policy rules, chain-aware risk tracking, safe-fix suggestions, and optional LLM judging.
- Achieves high verdict accuracy with millisecond-scale latency on its benchmarks, making it one of the more operationally plausible safety layers in the batch.
- Useful now because tool-using agents need pre-execution controls, not just post-hoc evaluation.
- Skepticism / limitation: The rule-only path is fundamentally limited on runtime semantics and deep obfuscation; coverage will require continual extension.
4) Practical next steps
- Instrument post-training runs like production systems. Log reward, KL, entropy, returns, generation quality, and environment/tool feedback in a form suitable for anomaly detection and root-cause attribution.
- Add runtime action interception for agents before broader deployment. Typed action schemas, shell normalization, policy rules, and fail-safe review modes are now table stakes for tool use.
- Evaluate alignment claims at the scaffold level, not just the model level. For any deployment-critical workflow, test multiple system prompts, verification scaffolds, and UI/tool configurations against the same model.
- Adopt richer robustness metrics than ASR alone. Include transferability, stability across runs, utility loss, latency/cost overhead, and where possible representational or trajectory-level disruption signals.
- For RL with binary rewards, monitor the rollout pass-rate distribution. If groups are mostly all-pass or all-fail, you are likely wasting rollout budget; test replay or curriculum mechanisms that move training toward higher-information regimes (a monitoring sketch follows this list).
- Use step-level rewards where tool traces are available. SQL, code, and agent tasks with executor feedback are good candidates for process rewards and per-step advantage estimation.
- Benchmark full pipelines, not isolated components. For retrieval, include reranking; for image editing, include source fidelity; for kernels, separate compile/correctness/efficiency; for agents, include environment outcomes.
- Treat routing as both optimization target and threat surface. If you deploy MoE or multi-worker systems, test routing-aware attacks and monitor whether orchestration policies create predictable exploit paths.
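For the pass-rate monitoring step above, a minimal sketch of the histogram to watch; the group size, binning, and the notion of "wasted" groups are illustrative assumptions.

```python
from collections import Counter

def passrate_histogram(groups: list[list[int]]) -> Counter:
    """Histogram of per-group pass rates for binary-reward rollouts.

    If most mass sits at 0.0 or 1.0, rollout budget is being spent on
    uninformative groups; that is the cue to adjust sampling or curriculum.
    """
    hist = Counter()
    for g in groups:
        hist[round(sum(g) / len(g), 2)] += 1
    return hist

# Example: 8 rollouts per prompt; compare the extremes to the rest.
hist = passrate_histogram([[0] * 8, [1] * 8, [1, 1, 0, 0, 1, 0, 0, 0]])
wasted = hist[0.0] + hist[1.0]  # groups with zero within-group signal
```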
Generated from per-paper analyses; no external browsing.
