Daily AI Paper Report (2026-05-09)

Run stats

  • Candidates: 692
  • Selected: 30
  • Deepread completed: 30
  • Window (UTC): 2026-05-08T00:00:00Z → 2026-05-09T00:00:00Z (weekend_backlog_unknown, expanded=0)
Selected papers (arXiv ID, title, categories, score, why selected, tags):

  • 2605.03619 · The Infinite Mutation Engine? Measuring Polymorphism in LLM-Generated Offensive Code
    Categories: cs.CR · Score: 93
    Why: Measures LLM malware polymorphism with dual-agent pipeline; directly relevant to offensive capability risk.
    Tags: llm-safety, cybersecurity, malware, evaluation, agents
  • 2605.03353 · SkCC: Portable and Secure Skill Compilation for Cross-Framework LLM Agents
    Categories: cs.CR, cs.AI · Score: 92
    Why: Portable skill compilation plus security hardening for cross-framework LLM agents.
    Tags: llm-agents, agent-security, prompt-engineering, compiler, skills
  • 2605.04624 · AuditRepairBench: A Paired-Execution Trace Corpus for Evaluator-Channel Ranking Instability in Agent Repair
    Categories: cs.AI, cs.SE · Score: 92
    Why: Agent-repair leaderboard instability from evaluator leakage; large trace corpus for auditing selection bias.
    Tags: agent-safety, evaluation, benchmark, auditing, repair
  • 2605.02346 · APIOT: Autonomous Vulnerability Management Across Bare-Metal Industrial OT Networks
    Categories: cs.CR, cs.AI · Score: 90
    Why: Autonomous OT pentesting/remediation with runtime controls; strong agent-security relevance.
    Tags: agent-security, cybersecurity, autonomous-agents, operational-technology, red-teaming
  • 2605.03310 · Coordination as an Architectural Layer for LLM-Based Multi-Agent Systems
    Categories: cs.MA, cs.LG, q-fin.TR · Score: 90
    Why: Principled coordination layer for LLM multi-agent failures; strong relevance to agent reliability.
    Tags: multi-agent, coordination, agent-architecture, reliability, evaluation
  • 2605.03547 · Erase Persona, Forget Lore: Benchmarking Multimodal Copyright Unlearning in Large Vision Language Models
    Categories: cs.CV, cs.AI · Score: 89
    Why: First benchmark for multimodal copyright unlearning in LVLMs; strong safety and evaluation relevance.
    Tags: unlearning, multimodal, LVLM, benchmark, copyright
  • 2605.02815 · FlexSQL: Flexible Exploration and Execution Make Better Text-to-SQL Agents
    Categories: cs.CL · Score: 89
    Why: Agentic text-to-SQL with flexible exploration, execution, and repair; strong relevance to tool-using LLMs.
    Tags: agents, text-to-sql, tool-use, reasoning, evaluation
  • 2605.04003 · Physics-Grounded Multi-Agent Architecture for Traceable, Risk-Aware Human-AI Decision Support in Manufacturing
    Categories: cs.MA, cs.AI, cs.IR · Score: 88
    Why: Traceable multi-agent decision support with safety bounds, provenance, and human approval.
    Tags: multi-agent, safety, provenance, human-in-the-loop, tool-use
  • 2605.04874 · Uncertainty-Aware Exploratory Direct Preference Optimization for Multimodal Large Language Models
    Categories: cs.LG, cs.CL, cs.CV · Score: 88
    Why: Uncertainty-aware DPO for MLLM hallucination; directly relevant to multimodal alignment reliability.
    Tags: multimodal-llm, alignment, dpo, hallucination, uncertainty
  • 2605.04831 · StoryAlign: Evaluating and Training Reward Models for Story Generation
    Categories: cs.CL, cs.AI · Score: 88
    Why: Benchmarking and training reward models for story preferences; useful for alignment and RM evaluation.
    Tags: alignment, reward-models, evaluation, llms, preferences
  • 2605.05017 · Position: Embodied AI Requires a Privacy-Utility Trade-off
    Categories: cs.AI, cs.RO · Score: 88
    Why: Privacy-focused position on embodied AI lifecycle risks; strong safety relevance despite no empirical results.
    Tags: embodied-ai, privacy, safety, position-paper, deployment
  • 2605.02765 · U-Define: Designing User Workflows for Hard and Soft Constraints in LLM-Based Planning
    Categories: cs.AI, cs.HC, cs.LG · Score: 88
    Why: User control and verification for LLM planning; directly relevant to reliable agent workflows.
    Tags: llm-planning, human-ai, verification, reliability, agents
  • 2605.02709 · An Empirical Study of Agent Skills for Healthcare: Practice, Gaps, and Governance
    Categories: cs.AI · Score: 87
    Why: Empirical study of healthcare agent skills highlights governance, safety gaps, and deployment realities.
    Tags: agents, governance, healthcare, safety, empirical-study
  • 2605.03900 · Contextual Multi-Objective Optimization: Rethinking Objectives in Frontier AI Systems
    Categories: cs.AI · Score: 86
    Why: Frames frontier AI failures as contextual objective selection; broad alignment relevance.
    Tags: alignment, objectives, agents, decision-making, theory
  • 2605.03759 · Before Forgetting, Learn to Remember: Revisiting Foundational Learning Failures in LVLM Unlearning Benchmarks
    Categories: cs.CV, cs.AI · Score: 86
    Why: Finds unlearning benchmarks fail when models never memorized; proposes stronger LVLM memorization benchmark.
    Tags: unlearning, privacy, LVLM, benchmark, evaluation
  • 2605.02463 · When Stress Becomes Signal: Detecting Antifragility-Compatible Regimes in Multi-Agent LLM Systems
    Categories: cs.MA, cs.AI, cs.CE · Score: 86
    Why: Targets antifragility beyond robustness: stress-testing multi-agent LLMs for antifragility signals.
    Tags: multi-agent, robustness, evaluation, stress-testing, agents
  • 2605.04906 · Strat-Reasoner: Reinforcing Strategic Reasoning of LLMs in Multi-Agent Games
    Categories: cs.AI · Score: 86
    Why: RL framework for strategic reasoning in multi-agent games; relevant to agentic reasoning and evaluation.
    Tags: llms, agents, reasoning, multi-agent, reinforcement-learning
  • 2605.04373 · Worst-Case Discovery and Runtime Protection for RL-Based Network Controllers
    Categories: cs.NI, cs.AI, eess.SY · Score: 86
    Why: Finds worst-case failures in RL controllers and adds runtime protection; strong robustness/security angle.
    Tags: rl, robustness, runtime-protection, verification, networking
  • 2605.03677 · Uni-OPD: Unifying On-Policy Distillation with a Dual-Perspective Recipe
    Categories: cs.LG · Score: 86
    Why: Unified on-policy distillation for LLMs/MLLMs with concrete bottlenecks and recipe.
    Tags: LLM, MLLM, distillation, post-training, optimization
  • 2605.02741 · AI-Generated Smells: An Analysis of Code and Architecture in LLM and Agent-Driven Development
    Categories: cs.SE, cs.AI · Score: 86
    Why: Audits maintainability risks in LLM/agent-generated code with concrete defect patterns and tradeoffs.
    Tags: llm-agents, software-engineering, evaluation, reliability, technical-debt
  • 2605.02620 · Beating the Style Detector: Three Hours of Agentic Research on the AI-Text Arms Race
    Categories: cs.CL, cs.LG · Score: 85
    Why: Agentic research reproduces an NLP study fast; strong frontier-agent capability signal with eval implications.
    Tags: agents, evaluation, automation, llm-capabilities, reproducibility
  • 2605.02624 · Synthetic Users, Real Differences: an Evaluation Framework for User Simulation in Multi-Turn Conversations
    Categories: cs.CL · Score: 85
    Why: Framework to evaluate realism of simulated users in multi-turn chats; useful for scalable agent evaluation.
    Tags: evaluation, user-simulation, multi-turn, chatbots, benchmark
  • 2605.03476 · CuraView: A Multi-Agent Framework for Medical Hallucination Detection with GraphRAG-Enhanced Knowledge Verification
    Categories: cs.CL, cs.AI · Score: 84
    Why: GraphRAG multi-agent hallucination detection for medical summaries with evidence grounding.
    Tags: hallucination, graphrag, medical-llm, multi-agent, factuality
  • 2605.02728 · ORPilot: A Production-Oriented Agentic LLM-for-OR Tool for Optimization Modeling
    Categories: cs.AI · Score: 84
    Why: Production-oriented agentic LLM system with modular data/spec elicitation; useful for real-world agent design.
    Tags: agents, LLM, optimization, tool-use, production
  • 2605.04507 · Distilling Bayesian Belief States into Language Models for Auditable Negotiation
    Categories: cs.CL · Score: 84
    Why: Makes negotiation agents auditable by distilling explicit Bayesian beliefs into LM outputs.
    Tags: auditing, interpretability, belief-state, negotiation, llm
  • 2605.03571 · PatRe: A Full-Stage Office Action and Rebuttal Generation Benchmark for Patent Examination
    Categories: cs.CL, cs.AI · Score: 84
    Why: Real-world multi-turn benchmark for office actions and rebuttals; strong agentic/legal reasoning testbed.
    Tags: benchmark, agents, llms, legal-reasoning, retrieval
  • 2605.02730 · Perceptual Flow Network for Visually Grounded Reasoning
    Categories: cs.CV, cs.AI · Score: 84
    Why: Targets LVLM hallucination and language bias with reward-shaped grounded reasoning; frontier multimodal reliability.
    Tags: multimodal, hallucination, reasoning, vlm, reliability
  • 2605.03824 · Reproducing Complex Set-Compositional Information Retrieval
    Categories: cs.CL · Score: 84
    Why: Reproduction study plus a new benchmark for compositional retrieval; useful for RAG reasoning evaluation.
    Tags: RAG, retrieval, benchmark, evaluation, reasoning
  • 2605.04922 · Evolving Idea Graphs with Learnable Edits-and-Commits for Multi-Agent Scientific Ideation
    Categories: cs.MA, cs.AI · Score: 84
    Why: Structured multi-agent ideation via evolving graphs; notable for explicit coordination and evaluation claims.
    Tags: multi-agent, scientific-discovery, coordination, llm-systems, evaluation
  • 2605.02735 · Visual Latents Know More Than They Say: Unsilencing Latent Reasoning in MLLMs
    Categories: cs.LG · Score: 84
    Why: Novel MLLM latent-reasoning pathology and fix; relevant to multimodal reasoning efficiency.
    Tags: multimodal, reasoning, latent-space, MLLMs, efficiency

AI Paper Insight Brief

2026-05-09

0) Executive takeaways (read this first)

  • Runtime structure is becoming the main reliability lever for agents. Across OT security, planning, manufacturing, coordination, and network control, papers repeatedly show that guardrails, critics, formal verifiers, typed IRs, and rule-based runtime interventions improve outcomes more than prompt tweaks alone.
  • Evaluation is shifting from average-case scores to failure-surface mapping. Several papers focus on worst-case discovery, evaluator-channel leakage, stress geometry, distributional realism, and architecture-specific failure signatures rather than just benchmark accuracy.
  • Grounding now means more than retrieval. Stronger systems increasingly combine retrieval with typed outputs, deterministic tools, graph structure, or formal checks: GraphRAG for patient-specific verification, database tool loops for text-to-SQL, knowledge graphs for manufacturing, and model checking for hard planning constraints.
  • Agent capability is expanding into operationally meaningful domains, but transfer remains the bottleneck. APIOT shows end-to-end exploit→patch→verify on bare-metal OT; ORPilot handles production-style optimization workflows; MAKA supports aerospace machining decisions. In each case, real-world deployment questions remain around physical transfer, semantic validation, or live operations.
  • Unlearning and detection remain shallow in multimodal/security settings. Copyright unlearning benchmarks show current methods either preserve utility or truly forget, but not both; LVLM unlearning benchmarks may be invalid if stage-1 memorization never happened; AI-text and offensive-code papers show detectors and static signatures are increasingly brittle under adaptive generation.
  • The practical frontier is “auditable autonomy.” The most decision-useful papers do not just improve task success; they expose provenance, uncertainty, evidence grades, cost-quality tradeoffs, or interpretable rules that let humans inspect and bound system behavior.
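Several of the runtime levers named above (bounded retries, repetition guards, schema checks, explicit escalation) can be sketched as a thin wrapper around a single agent step. This is a minimal illustration, not any paper's implementation; `run_guarded`, `GuardrailViolation`, and the required keys are invented names for the sketch.

```python
import json


class GuardrailViolation(Exception):
    """Raised when runtime rules give up on an agent step (escalation path)."""


def run_guarded(step_fn, prompt, *, max_retries=3, seen_outputs=None,
                required_keys=("action", "args")):
    """Run one agent step under simple runtime rules:
    bounded retries, a repetition guard, and schema-style validation."""
    seen_outputs = set() if seen_outputs is None else seen_outputs
    for attempt in range(max_retries):
        raw = step_fn(prompt)
        # Repetition guard: an identical output on retry signals a loop.
        if raw in seen_outputs:
            prompt += "\n[guard] repeated output; change approach."
            continue
        seen_outputs.add(raw)
        # Schema validation: require well-formed JSON with the expected keys.
        try:
            out = json.loads(raw)
        except json.JSONDecodeError:
            continue
        if all(k in out for k in required_keys):
            return out
    # Explicit escalation instead of silent failure.
    raise GuardrailViolation(f"no valid output after {max_retries} attempts")
```

The point of the wrapper is that every intervention is a named, inspectable rule rather than an opaque policy change, matching the "auditable autonomy" framing above.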

2) Key themes (clusters)

Theme: Runtime governance and verifiable control for agents

Theme: Evaluation is moving toward failure diagnostics, not just leaderboard scores

Theme: Grounded reasoning via tools, graphs, and typed intermediate representations

Theme: Security, misuse, and the erosion of static defenses

Theme: Alignment and preference learning are becoming more context- and token-aware

3) Technical synthesis

  • Typed intermediates are emerging as a core systems pattern: ORPilot’s JSON IR, SkCC’s SkIR, CuraView’s schema-bound outputs, and MAKA’s structured JSON routing all reduce ambiguity and make downstream validation possible.
  • Backtracking beats one-shot repair: FlexSQL explicitly revisits plan assumptions, not just SQL syntax; APIOT’s overseer enforces phase transitions; REGUARD iterates search-and-protect loops; this suggests robust agents need upstream correction, not only final-output patching.
  • Deterministic tools are being reserved for the parts models are worst at: numeric computation, protocol packet crafting, formal verification, solver execution, and physical compensation calculations are increasingly delegated away from free-form generation.
  • Evaluation is becoming architecture-aware: coordination papers hold model and information fixed to isolate orchestration effects; AuditRepairBench isolates selector/evaluator coupling; this is a useful template for future agent benchmarking.
  • Distributional realism matters more than sample realism: realsim evaluates user simulators over intent, feedback, identity, knowledge, and surface-form distributions, echoing the broader shift toward population-level validity.
  • Graph structure helps when evidence is relational, not just textual: CuraView’s per-patient GraphRAG and MAKA’s machining KG both outperform flatter retrieval setups by preserving entity relations and provenance.
  • Runtime protection is increasingly interpretable: REGUARD’s threshold rules, U-Define’s hard/soft split, and MAKA’s critic checks show a preference for auditable interventions over opaque policy changes.
  • Several papers expose a “semantic correctness gap”: ORPilot can compile and solve yet still be semantically wrong; style detectors can classify based on length confounds; unlearning methods can refuse without forgetting; benchmark wins can mask shallow mechanisms.
  • Test-time scaling remains useful when paired with diversity and verification: FlexSQL’s Majority@16 gains, Strat-Reasoner’s micro-rollouts, and CAFE’s architecture-specific stress patterns all point to structured exploration as a practical lever.
  • Many strongest results are still bounded by environment realism: OT emulation, digital twins, synthetic banking stress, synthetic copyrighted concepts, and synthetic identities all improve control and measurement, but transfer to live settings remains the key unresolved step.
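The typed-intermediate pattern in the first bullet can be made concrete with a small schema-plus-validator sketch. The `ModelIR` fields below are hypothetical for an optimization-modeling step (they are not ORPilot's JSON IR or SkCC's SkIR); the point is that a typed artifact gives a deterministic validation gate before solver execution.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VarSpec:
    """One decision variable with a box domain."""
    name: str
    lower: float = 0.0
    upper: float = float("inf")


@dataclass(frozen=True)
class ModelIR:
    """Hypothetical typed IR an agent must emit before solving."""
    objective: str    # "min" or "max"
    variables: tuple  # tuple of VarSpec
    constraints: tuple  # tuple of strings, e.g. "x + y <= 10"


def validate_ir(ir: ModelIR) -> list:
    """Return a list of validation errors; an empty list means the IR is
    well-formed enough to hand to a deterministic solver."""
    errors = []
    if ir.objective not in ("min", "max"):
        errors.append(f"bad objective sense: {ir.objective!r}")
    names = [v.name for v in ir.variables]
    if len(names) != len(set(names)):
        errors.append("duplicate variable names")
    for v in ir.variables:
        if v.lower > v.upper:
            errors.append(f"empty domain for {v.name}")
    known = set(names)
    for c in ir.constraints:
        # Crude tokenization: strip operators, keep identifier-like tokens.
        tokens = {t for t in c.replace("<=", " ").replace(">=", " ")
                  .replace("=", " ").replace("+", " ").replace("-", " ")
                  .replace("*", " ").split() if t.isidentifier()}
        for t in tokens - known:
            errors.append(f"constraint references unknown variable {t!r}")
    return errors
```

A validator like this catches the "compiles and solves yet semantically wrong" class of error only partially; it closes the syntactic gap so that remaining failures are genuinely semantic.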

4) Top 5 papers (with “why now”)

  • APIOT: Autonomous Vulnerability Management Across Bare-Metal Industrial OT Networks
    • Demonstrates autonomous discovery → exploitation → patching → verification on bare-metal MCU OT targets using protocol primitives rather than shell-centric tooling.
    • Shows runtime governance matters materially: overseer-on reached 100% mission success in the T1 ablation and cut completion time by 20.5%.
    • Useful now because it expands the threat model from Linux/web pentesting to industrial protocols and resource-constrained firmware.
    • Skeptical about: results are in QEMU/simulated environments with limited exploit scope and uncertain transfer to physical silicon.
  • FlexSQL: Flexible Exploration and Execution Make Better Text-to-SQL Agents
    • Reframes text-to-SQL as continual exploration plus plan/program backtracking, not one-shot schema linking and repair.
    • Achieves 65.44% Majority@16 on Spider2-Snow with gpt-oss-120b and shows large drops when Python support or diversity is removed.
    • Useful now because enterprise database interfaces increasingly fail on ambiguity and large schemas, exactly where fixed-stage pipelines break.
    • Skeptical about: the gains come with heavy tool-call overhead, and comparisons exclude closed-source top systems.
  • CuraView: A Multi-Agent Framework for Medical Hallucination Detection with GraphRAG-Enhanced Knowledge Verification
    • Builds a patient-specific GraphRAG pipeline for sentence-level discharge-summary verification with structured evidence grades.
    • Reports E4 F1 of 0.831 with 0.909 recall on safety-critical contradictions, outperforming flat-retrieval baselines by about 0.19–0.20 F1.
    • Useful now because clinical deployment needs patient-grounded factuality checks, not generic hallucination benchmarks.
    • Skeptical about: labels partly derive from the generation pipeline, and evaluation is limited to a single-center curated subset.
  • Worst-Case Discovery and Runtime Protection for RL-Based Network Controllers
    • Combines bilevel worst-case scenario search with interpretable runtime rules that protect pretrained RL controllers without retraining.
    • Finds controllers can be 43%–64% worse than achievable in feasible scenarios, then shrinks those gaps by roughly 79%–85% while preserving nominal performance.
    • Useful now because it offers a concrete template for “discover failure first, then patch locally” in safety-critical learned control.
    • Skeptical about: certificate tightness depends on the quality of the inner reference portfolio and the simplicity of the rule class.
  • AuditRepairBench: A Paired-Execution Trace Corpus for Evaluator-Channel Ranking Instability in Agent Repair
    • Isolates a subtle but important benchmark failure mode: agent selectors reading evaluator outputs can change rankings when evaluator configs change.
    • Provides a large paired-trace corpus plus a screening ensemble that reaches AUROC 0.96 on source-level surgery cases and supports low-cost repairs.
    • Useful now because agent leaderboards are proliferating faster than their measurement hygiene, and this paper gives a concrete audit path.
    • Skeptical about: it explicitly does not certify causal mechanisms beyond its observability boundary, and forward transfer is only moderate.
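The "discover failure first, then patch locally" loop from the worst-case-discovery paper can be illustrated on a toy controller. The scenario search below is exhaustive over a finite set rather than the paper's bilevel optimization, and all function names are illustrative assumptions.

```python
def find_worst_case(controller, reference, scenarios, regret_fn):
    """Search a scenario set for the highest-regret case: where the learned
    controller does worst relative to an achievable reference policy."""
    worst, worst_regret = None, float("-inf")
    for s in scenarios:
        r = regret_fn(controller(s), reference(s))
        if r > worst_regret:
            worst, worst_regret = s, r
    return worst, worst_regret


def protect(controller, fallback, trigger):
    """Wrap a pretrained controller with an interpretable runtime rule:
    if the trigger fires on the input, defer to a conservative fallback."""
    def guarded(s):
        return fallback(s) if trigger(s) else controller(s)
    return guarded
```

For instance, if a toy controller's cost doubles under loads above 0.8, a single threshold rule (`trigger=lambda s: s > 0.8`) removes the regret on the discovered failure region without touching nominal behavior, which is the shape of result the paper reports at much larger scale.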

5) Practical next steps

  • Add runtime governance layers to agent stacks by default: repetition guards, phase-transition checks, schema validation, bounded retries, and explicit escalation paths.
  • Benchmark agents under architecture-controlled ablations, not just model swaps: hold tools/prompts fixed and vary coordination, evaluator access, or verifier placement.
  • For high-stakes domains, require typed intermediate artifacts and deterministic execution for numeric, protocol, or solver-critical steps.
  • Build worst-case discovery loops before deployment: search for feasible high-regret scenarios, then derive minimal interpretable runtime protections rather than retraining globally.
  • Measure distributional realism of simulators and synthetic users before trusting simulation-based evals; especially track feedback, context disclosure, termination, and domain-specific behavior.
  • Treat detector wins skeptically unless diagnostics rule out confounds like length, formatting, or frozen-evaluator leakage.
  • In multimodal safety/unlearning work, verify stage-1 memorization actually happened before claiming forgetting; add exposure-style or internal-state checks.
  • For agentic systems with retrieval, move beyond flat RAG toward graph-structured evidence + schema-constrained outputs when the domain is relational or patient-/entity-specific.
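As one way to operationalize the distributional-realism step above, a total-variation distance over a single behavior dimension (feedback type, termination reason) separates population-level gaps from per-sample plausibility. The category extraction and the 0.2 threshold below are illustrative assumptions, not a value any paper prescribes.

```python
from collections import Counter


def total_variation(p_counts, q_counts):
    """Total variation distance between two empirical distributions over
    the same discrete behavior categories (0 = identical, 1 = disjoint)."""
    cats = set(p_counts) | set(q_counts)
    p_n, q_n = sum(p_counts.values()), sum(q_counts.values())
    return 0.5 * sum(abs(p_counts.get(c, 0) / p_n - q_counts.get(c, 0) / q_n)
                     for c in cats)


def realism_report(real_sessions, sim_sessions, extract, threshold=0.2):
    """Compare one behavior dimension between real and simulated sessions;
    flag the simulator when the distributional gap exceeds the threshold."""
    real = Counter(extract(s) for s in real_sessions)
    sim = Counter(extract(s) for s in sim_sessions)
    tv = total_variation(real, sim)
    return {"tv_distance": tv, "flagged": tv > threshold}
```

Running this check per dimension (rather than one aggregate score) mirrors the population-level validity framing: a simulator can look plausible sample-by-sample while, say, never giving up or never disclosing context the way real users do.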

Generated from per-paper analyses; no external browsing.