Daily AI Paper Report (2026-05-09)

Run stats

  • Candidates: 692
  • Selected: 30
  • Deepread completed: 30
  • Window (UTC): 2026-05-08T00:00:00Z → 2026-05-09T00:00:00Z (weekend_backlog_unknown, expanded=0)
Selected papers (arXiv ID, title, categories, score, why selected, tags):

  • 2605.03619 · The Infinite Mutation Engine? Measuring Polymorphism in LLM-Generated Offensive Code
    Categories: cs.CR · Score: 93
    Why: Measures LLM malware polymorphism with dual-agent pipeline; directly relevant to offensive capability risk.
    Tags: llm-safety, cybersecurity, malware, evaluation, agents
  • 2605.03353 · SkCC: Portable and Secure Skill Compilation for Cross-Framework LLM Agents
    Categories: cs.CR, cs.AI · Score: 92
    Why: Portable skill compilation plus security hardening for cross-framework LLM agents.
    Tags: llm-agents, agent-security, prompt-engineering, compiler, skills
  • 2605.04624 · AuditRepairBench: A Paired-Execution Trace Corpus for Evaluator-Channel Ranking Instability in Agent Repair
    Categories: cs.AI, cs.SE · Score: 92
    Why: Agent-repair leaderboard instability from evaluator leakage; large trace corpus for auditing selection bias.
    Tags: agent-safety, evaluation, benchmark, auditing, repair
  • 2605.02346 · APIOT: Autonomous Vulnerability Management Across Bare-Metal Industrial OT Networks
    Categories: cs.CR, cs.AI · Score: 90
    Why: Autonomous OT pentesting/remediation with runtime controls; strong agent-security relevance.
    Tags: agent-security, cybersecurity, autonomous-agents, operational-technology, red-teaming
  • 2605.03310 · Coordination as an Architectural Layer for LLM-Based Multi-Agent Systems
    Categories: cs.MA, cs.LG, q-fin.TR · Score: 90
    Why: Principled coordination layer for LLM multi-agent failures; strong relevance to agent reliability.
    Tags: multi-agent, coordination, agent-architecture, reliability, evaluation
  • 2605.03547 · Erase Persona, Forget Lore: Benchmarking Multimodal Copyright Unlearning in Large Vision Language Models
    Categories: cs.CV, cs.AI · Score: 89
    Why: First benchmark for multimodal copyright unlearning in LVLMs; strong safety and evaluation relevance.
    Tags: unlearning, multimodal, LVLM, benchmark, copyright
  • 2605.02815 · FlexSQL: Flexible Exploration and Execution Make Better Text-to-SQL Agents
    Categories: cs.CL · Score: 89
    Why: Agentic text-to-SQL with flexible exploration, execution, and repair; strong relevance to tool-using LLMs.
    Tags: agents, text-to-sql, tool-use, reasoning, evaluation
  • 2605.04003 · Physics-Grounded Multi-Agent Architecture for Traceable, Risk-Aware Human-AI Decision Support in Manufacturing
    Categories: cs.MA, cs.AI, cs.IR · Score: 88
    Why: Traceable multi-agent decision support with safety bounds, provenance, and human approval.
    Tags: multi-agent, safety, provenance, human-in-the-loop, tool-use
  • 2605.04874 · Uncertainty-Aware Exploratory Direct Preference Optimization for Multimodal Large Language Models
    Categories: cs.LG, cs.CL, cs.CV · Score: 88
    Why: Uncertainty-aware DPO for MLLM hallucination; directly relevant to multimodal alignment reliability.
    Tags: multimodal-llm, alignment, dpo, hallucination, uncertainty
  • 2605.04831 · StoryAlign: Evaluating and Training Reward Models for Story Generation
    Categories: cs.CL, cs.AI · Score: 88
    Why: Benchmarking and training reward models for story preferences; useful for alignment and RM evaluation.
    Tags: alignment, reward-models, evaluation, llms, preferences
  • 2605.05017 · Position: Embodied AI Requires a Privacy-Utility Trade-off
    Categories: cs.AI, cs.RO · Score: 88
    Why: Privacy-focused position on embodied AI lifecycle risks; strong safety relevance despite no empirical results.
    Tags: embodied-ai, privacy, safety, position-paper, deployment
  • 2605.02765 · U-Define: Designing User Workflows for Hard and Soft Constraints in LLM-Based Planning
    Categories: cs.AI, cs.HC, cs.LG · Score: 88
    Why: User control and verification for LLM planning; directly relevant to reliable agent workflows.
    Tags: llm-planning, human-ai, verification, reliability, agents
  • 2605.02709 · An Empirical Study of Agent Skills for Healthcare: Practice, Gaps, and Governance
    Categories: cs.AI · Score: 87
    Why: Empirical study of healthcare agent skills highlights governance, safety gaps, and deployment realities.
    Tags: agents, governance, healthcare, safety, empirical-study
  • 2605.03900 · Contextual Multi-Objective Optimization: Rethinking Objectives in Frontier AI Systems
    Categories: cs.AI · Score: 86
    Why: Frames frontier AI failures as contextual objective selection; broad alignment relevance.
    Tags: alignment, objectives, agents, decision-making, theory
  • 2605.03759 · Before Forgetting, Learn to Remember: Revisiting Foundational Learning Failures in LVLM Unlearning Benchmarks
    Categories: cs.CV, cs.AI · Score: 86
    Why: Finds unlearning benchmarks fail when models never memorized; proposes stronger LVLM memorization benchmark.
    Tags: unlearning, privacy, LVLM, benchmark, evaluation
  • 2605.02463 · When Stress Becomes Signal: Detecting Antifragility-Compatible Regimes in Multi-Agent LLM Systems
    Categories: cs.MA, cs.AI, cs.CE · Score: 86
    Why: Targets antifragility beyond robustness: stress-testing multi-agent LLMs for antifragility signals.
    Tags: multi-agent, robustness, evaluation, stress-testing, agents
  • 2605.04906 · Strat-Reasoner: Reinforcing Strategic Reasoning of LLMs in Multi-Agent Games
    Categories: cs.AI · Score: 86
    Why: RL framework for strategic reasoning in multi-agent games; relevant to agentic reasoning and evaluation.
    Tags: llms, agents, reasoning, multi-agent, reinforcement-learning
  • 2605.04373 · Worst-Case Discovery and Runtime Protection for RL-Based Network Controllers
    Categories: cs.NI, cs.AI, eess.SY · Score: 86
    Why: Finds worst-case failures in RL controllers and adds runtime protection; strong robustness/security angle.
    Tags: rl, robustness, runtime-protection, verification, networking
  • 2605.03677 · Uni-OPD: Unifying On-Policy Distillation with a Dual-Perspective Recipe
    Categories: cs.LG · Score: 86
    Why: Unified on-policy distillation for LLMs/MLLMs with concrete bottlenecks and recipe.
    Tags: LLM, MLLM, distillation, post-training, optimization
  • 2605.02741 · AI-Generated Smells: An Analysis of Code and Architecture in LLM and Agent-Driven Development
    Categories: cs.SE, cs.AI · Score: 86
    Why: Audits maintainability risks in LLM/agent-generated code with concrete defect patterns and tradeoffs.
    Tags: llm-agents, software-engineering, evaluation, reliability, technical-debt
  • 2605.02620 · Beating the Style Detector: Three Hours of Agentic Research on the AI-Text Arms Race
    Categories: cs.CL, cs.LG · Score: 85
    Why: Agentic research reproduces an NLP study fast; strong frontier-agent capability signal with eval implications.
    Tags: agents, evaluation, automation, llm-capabilities, reproducibility
  • 2605.02624 · Synthetic Users, Real Differences: an Evaluation Framework for User Simulation in Multi-Turn Conversations
    Categories: cs.CL · Score: 85
    Why: Framework to evaluate realism of simulated users in multi-turn chats; useful for scalable agent evaluation.
    Tags: evaluation, user-simulation, multi-turn, chatbots, benchmark
  • 2605.03476 · CuraView: A Multi-Agent Framework for Medical Hallucination Detection with GraphRAG-Enhanced Knowledge Verification
    Categories: cs.CL, cs.AI · Score: 84
    Why: GraphRAG multi-agent hallucination detection for medical summaries with evidence grounding.
    Tags: hallucination, graphrag, medical-llm, multi-agent, factuality
  • 2605.02728 · ORPilot: A Production-Oriented Agentic LLM-for-OR Tool for Optimization Modeling
    Categories: cs.AI · Score: 84
    Why: Production-oriented agentic LLM system with modular data/spec elicitation; useful for real-world agent design.
    Tags: agents, LLM, optimization, tool-use, production
  • 2605.04507 · Distilling Bayesian Belief States into Language Models for Auditable Negotiation
    Categories: cs.CL · Score: 84
    Why: Makes negotiation agents auditable by distilling explicit Bayesian beliefs into LM outputs.
    Tags: auditing, interpretability, belief-state, negotiation, llm
  • 2605.03571 · PatRe: A Full-Stage Office Action and Rebuttal Generation Benchmark for Patent Examination
    Categories: cs.CL, cs.AI · Score: 84
    Why: Real-world multi-turn benchmark for office actions and rebuttals; strong agentic/legal reasoning testbed.
    Tags: benchmark, agents, llms, legal-reasoning, retrieval
  • 2605.02730 · Perceptual Flow Network for Visually Grounded Reasoning
    Categories: cs.CV, cs.AI · Score: 84
    Why: Targets LVLM hallucination and language bias with reward-shaped grounded reasoning; frontier multimodal reliability.
    Tags: multimodal, hallucination, reasoning, vlm, reliability
  • 2605.03824 · Reproducing Complex Set-Compositional Information Retrieval
    Categories: cs.CL · Score: 84
    Why: Reproduction study plus a new benchmark for compositional retrieval; useful for RAG reasoning evaluation.
    Tags: RAG, retrieval, benchmark, evaluation, reasoning
  • 2605.04922 · Evolving Idea Graphs with Learnable Edits-and-Commits for Multi-Agent Scientific Ideation
    Categories: cs.MA, cs.AI · Score: 84
    Why: Structured multi-agent ideation via evolving graphs; notable for explicit coordination and evaluation claims.
    Tags: multi-agent, scientific-discovery, coordination, llm-systems, evaluation
  • 2605.02735 · Visual Latents Know More Than They Say: Unsilencing Latent Reasoning in MLLMs
    Categories: cs.LG · Score: 84
    Why: Novel MLLM latent-reasoning pathology and fix; relevant to multimodal reasoning efficiency.
    Tags: multimodal, reasoning, latent-space, MLLMs, efficiency

AI Paper Insight Brief

2026-05-09

0) Executive takeaways (read this first)

  • Runtime structure is becoming the main reliability lever for agents. Across OT security, planning, manufacturing, coordination, and network control, papers repeatedly show that guardrails, critics, formal verifiers, typed IRs, and rule-based runtime interventions improve outcomes more than prompt tweaks alone.
  • Evaluation is shifting from average-case scores to failure-surface mapping. Several papers focus on worst-case discovery, evaluator-channel leakage, stress geometry, distributional realism, and architecture-specific failure signatures rather than just benchmark accuracy.
  • Grounding now means more than retrieval. Stronger systems increasingly combine retrieval with typed outputs, deterministic tools, graph structure, or formal checks: GraphRAG for patient-specific verification, database tool loops for text-to-SQL, knowledge graphs for manufacturing, and model checking for hard planning constraints.
  • Agent capability is expanding into operationally meaningful domains, but transfer remains the bottleneck. APIOT shows end-to-end exploit→patch→verify on bare-metal OT; ORPilot handles production-style optimization workflows; MAKA supports aerospace machining decisions. In each case, real-world deployment questions remain around physical transfer, semantic validation, or live operations.
  • Unlearning and detection remain shallow in multimodal/security settings. Copyright unlearning benchmarks show current methods either preserve utility or truly forget, but not both; LVLM unlearning benchmarks may be invalid if stage-1 memorization never happened; AI-text and offensive-code papers show detectors and static signatures are increasingly brittle under adaptive generation.
  • The practical frontier is “auditable autonomy.” The most decision-useful papers do not just improve task success; they expose provenance, uncertainty, evidence grades, cost-quality tradeoffs, or interpretable rules that let humans inspect and bound system behavior.
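Several of the runtime levers named above (bounded retries, repetition guards, schema checks, explicit escalation) can be sketched as a thin wrapper around a single agent step. This is a minimal illustration, not any paper's implementation; `run_guarded`, `GuardrailViolation`, and the required keys are invented names for the sketch.

```python
import json


class GuardrailViolation(Exception):
    """Raised when runtime rules give up on an agent step (escalation path)."""


def run_guarded(step_fn, prompt, *, max_retries=3, seen_outputs=None,
                required_keys=("action", "args")):
    """Run one agent step under simple runtime rules:
    bounded retries, a repetition guard, and schema-style validation."""
    seen_outputs = set() if seen_outputs is None else seen_outputs
    for attempt in range(max_retries):
        raw = step_fn(prompt)
        # Repetition guard: an identical output on retry signals a loop.
        if raw in seen_outputs:
            prompt += "\n[guard] repeated output; change approach."
            continue
        seen_outputs.add(raw)
        # Schema validation: require well-formed JSON with the expected keys.
        try:
            out = json.loads(raw)
        except json.JSONDecodeError:
            continue
        if all(k in out for k in required_keys):
            return out
    # Explicit escalation instead of silent failure.
    raise GuardrailViolation(f"no valid output after {max_retries} attempts")
```

The point of the wrapper is that every intervention is a named, inspectable rule rather than an opaque policy change, matching the "auditable autonomy" framing above.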

2) Key themes (clusters)

Theme: Runtime governance and verifiable control for agents

Theme: Evaluation is moving toward failure diagnostics, not just leaderboard scores

Theme: Grounded reasoning via tools, graphs, and typed intermediate representations

Theme: Security, misuse, and the erosion of static defenses

Theme: Alignment and preference learning are becoming more context- and token-aware

3) Technical synthesis

  • Typed intermediates are emerging as a core systems pattern: ORPilot’s JSON IR, SkCC’s SkIR, CuraView’s schema-bound outputs, and MAKA’s structured JSON routing all reduce ambiguity and make downstream validation possible.
  • Backtracking beats one-shot repair: FlexSQL explicitly revisits plan assumptions, not just SQL syntax; APIOT’s overseer enforces phase transitions; REGUARD iterates search-and-protect loops; this suggests robust agents need upstream correction, not only final-output patching.
  • Deterministic tools are being reserved for the parts models are worst at: numeric computation, protocol packet crafting, formal verification, solver execution, and physical compensation calculations are increasingly delegated away from free-form generation.
  • Evaluation is becoming architecture-aware: coordination papers hold model and information fixed to isolate orchestration effects; AuditRepairBench isolates selector/evaluator coupling; this is a useful template for future agent benchmarking.
  • Distributional realism matters more than sample realism: realsim evaluates user simulators over intent, feedback, identity, knowledge, and surface-form distributions, echoing the broader shift toward population-level validity.
  • Graph structure helps when evidence is relational, not just textual: CuraView’s per-patient GraphRAG and MAKA’s machining KG both outperform flatter retrieval setups by preserving entity relations and provenance.
  • Runtime protection is increasingly interpretable: REGUARD’s threshold rules, U-Define’s hard/soft split, and MAKA’s critic checks show a preference for auditable interventions over opaque policy changes.
  • Several papers expose a “semantic correctness gap”: ORPilot can compile and solve yet still be semantically wrong; style detectors can classify based on length confounds; unlearning methods can refuse without forgetting; benchmark wins can mask shallow mechanisms.
  • Test-time scaling remains useful when paired with diversity and verification: FlexSQL’s Majority@16 gains, Strat-Reasoner’s micro-rollouts, and CAFE’s architecture-specific stress patterns all point to structured exploration as a practical lever.
  • Many strongest results are still bounded by environment realism: OT emulation, digital twins, synthetic banking stress, synthetic copyrighted concepts, and synthetic identities all improve control and measurement, but transfer to live settings remains the key unresolved step.
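The typed-intermediate pattern in the first bullet can be made concrete with a small schema-plus-validator sketch. The `ModelIR` fields below are hypothetical for an optimization-modeling step (they are not ORPilot's JSON IR or SkCC's SkIR); the point is that a typed artifact gives a deterministic validation gate before solver execution.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VarSpec:
    """One decision variable with a box domain."""
    name: str
    lower: float = 0.0
    upper: float = float("inf")


@dataclass(frozen=True)
class ModelIR:
    """Hypothetical typed IR an agent must emit before solving."""
    objective: str    # "min" or "max"
    variables: tuple  # tuple of VarSpec
    constraints: tuple  # tuple of strings, e.g. "x + y <= 10"


def validate_ir(ir: ModelIR) -> list:
    """Return a list of validation errors; an empty list means the IR is
    well-formed enough to hand to a deterministic solver."""
    errors = []
    if ir.objective not in ("min", "max"):
        errors.append(f"bad objective sense: {ir.objective!r}")
    names = [v.name for v in ir.variables]
    if len(names) != len(set(names)):
        errors.append("duplicate variable names")
    for v in ir.variables:
        if v.lower > v.upper:
            errors.append(f"empty domain for {v.name}")
    known = set(names)
    for c in ir.constraints:
        # Crude tokenization: strip operators, keep identifier-like tokens.
        tokens = {t for t in c.replace("<=", " ").replace(">=", " ")
                  .replace("=", " ").replace("+", " ").replace("-", " ")
                  .replace("*", " ").split() if t.isidentifier()}
        for t in tokens - known:
            errors.append(f"constraint references unknown variable {t!r}")
    return errors
```

A validator like this catches the "compiles and solves yet semantically wrong" class of error only partially; it closes the syntactic gap so that remaining failures are genuinely semantic.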

4) Top 5 papers (with “why now”)

  • APIOT: Autonomous Vulnerability Management Across Bare-Metal Industrial OT Networks
    • Demonstrates autonomous discovery → exploitation → patching → verification on bare-metal MCU OT targets using protocol primitives rather than shell-centric tooling.
    • Shows runtime governance matters materially: overseer-on reached 100% mission success in the T1 ablation and cut completion time by 20.5%.
    • Useful now because it expands the threat model from Linux/web pentesting to industrial protocols and resource-constrained firmware.
    • Skeptical about: results are in QEMU/simulated environments with limited exploit scope and uncertain transfer to physical silicon.
  • FlexSQL: Flexible Exploration and Execution Make Better Text-to-SQL Agents
    • Reframes text-to-SQL as continual exploration plus plan/program backtracking, not one-shot schema linking and repair.
    • Achieves 65.44% Majority@16 on Spider2-Snow with gpt-oss-120b and shows large drops when Python support or diversity is removed.
    • Useful now because enterprise database interfaces increasingly fail on ambiguity and large schemas, exactly where fixed-stage pipelines break.
    • Skeptical about: the gains come with heavy tool-call overhead, and comparisons exclude closed-source top systems.
  • CuraView: A Multi-Agent Framework for Medical Hallucination Detection with GraphRAG-Enhanced Knowledge Verification
    • Builds a patient-specific GraphRAG pipeline for sentence-level discharge-summary verification with structured evidence grades.
    • Reports E4 F1 of 0.831 with 0.909 recall on safety-critical contradictions, outperforming flat-retrieval baselines by about 0.19–0.20 F1.
    • Useful now because clinical deployment needs patient-grounded factuality checks, not generic hallucination benchmarks.
    • Skeptical about: labels partly derive from the generation pipeline, and evaluation is limited to a single-center curated subset.
  • Worst-Case Discovery and Runtime Protection for RL-Based Network Controllers
    • Combines bilevel worst-case scenario search with interpretable runtime rules that protect pretrained RL controllers without retraining.
    • Finds controllers can be 43%–64% worse than achievable in feasible scenarios, then shrinks those gaps by roughly 79%–85% while preserving nominal performance.
    • Useful now because it offers a concrete template for “discover failure first, then patch locally” in safety-critical learned control.
    • Skeptical about: certificate tightness depends on the quality of the inner reference portfolio and the simplicity of the rule class.
  • AuditRepairBench: A Paired-Execution Trace Corpus for Evaluator-Channel Ranking Instability in Agent Repair
    • Isolates a subtle but important benchmark failure mode: agent selectors reading evaluator outputs can change rankings when evaluator configs change.
    • Provides a large paired-trace corpus plus a screening ensemble that reaches AUROC 0.96 on source-level surgery cases and supports low-cost repairs.
    • Useful now because agent leaderboards are proliferating faster than their measurement hygiene, and this paper gives a concrete audit path.
    • Skeptical about: it explicitly does not certify causal mechanisms beyond its observability boundary, and forward transfer is only moderate.
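The "discover failure first, then patch locally" loop from the worst-case-discovery paper can be illustrated on a toy controller. The scenario search below is exhaustive over a finite set rather than the paper's bilevel optimization, and all function names are illustrative assumptions.

```python
def find_worst_case(controller, reference, scenarios, regret_fn):
    """Search a scenario set for the highest-regret case: where the learned
    controller does worst relative to an achievable reference policy."""
    worst, worst_regret = None, float("-inf")
    for s in scenarios:
        r = regret_fn(controller(s), reference(s))
        if r > worst_regret:
            worst, worst_regret = s, r
    return worst, worst_regret


def protect(controller, fallback, trigger):
    """Wrap a pretrained controller with an interpretable runtime rule:
    if the trigger fires on the input, defer to a conservative fallback."""
    def guarded(s):
        return fallback(s) if trigger(s) else controller(s)
    return guarded
```

For instance, if a toy controller's cost doubles under loads above 0.8, a single threshold rule (`trigger=lambda s: s > 0.8`) removes the regret on the discovered failure region without touching nominal behavior, which is the shape of result the paper reports at much larger scale.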

5) Practical next steps

  • Add runtime governance layers to agent stacks by default: repetition guards, phase-transition checks, schema validation, bounded retries, and explicit escalation paths.
  • Benchmark agents under architecture-controlled ablations, not just model swaps: hold tools/prompts fixed and vary coordination, evaluator access, or verifier placement.
  • For high-stakes domains, require typed intermediate artifacts and deterministic execution for numeric, protocol, or solver-critical steps.
  • Build worst-case discovery loops before deployment: search for feasible high-regret scenarios, then derive minimal interpretable runtime protections rather than retraining globally.
  • Measure distributional realism of simulators and synthetic users before trusting simulation-based evals; especially track feedback, context disclosure, termination, and domain-specific behavior.
  • Treat detector wins skeptically unless diagnostics rule out confounds like length, formatting, or frozen-evaluator leakage.
  • In multimodal safety/unlearning work, verify stage-1 memorization actually happened before claiming forgetting; add exposure-style or internal-state checks.
  • For agentic systems with retrieval, move beyond flat RAG toward graph-structured evidence + schema-constrained outputs when the domain is relational or patient-/entity-specific.
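As one way to operationalize the distributional-realism step above, a total-variation distance over a single behavior dimension (feedback type, termination reason) separates population-level gaps from per-sample plausibility. The category extraction and the 0.2 threshold below are illustrative assumptions, not a value any paper prescribes.

```python
from collections import Counter


def total_variation(p_counts, q_counts):
    """Total variation distance between two empirical distributions over
    the same discrete behavior categories (0 = identical, 1 = disjoint)."""
    cats = set(p_counts) | set(q_counts)
    p_n, q_n = sum(p_counts.values()), sum(q_counts.values())
    return 0.5 * sum(abs(p_counts.get(c, 0) / p_n - q_counts.get(c, 0) / q_n)
                     for c in cats)


def realism_report(real_sessions, sim_sessions, extract, threshold=0.2):
    """Compare one behavior dimension between real and simulated sessions;
    flag the simulator when the distributional gap exceeds the threshold."""
    real = Counter(extract(s) for s in real_sessions)
    sim = Counter(extract(s) for s in sim_sessions)
    tv = total_variation(real, sim)
    return {"tv_distance": tv, "flagged": tv > threshold}
```

Running this check per dimension (rather than one aggregate score) mirrors the population-level validity framing: a simulator can look plausible sample-by-sample while, say, never giving up or never disclosing context the way real users do.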

Generated from per-paper analyses; no external browsing.