AI Paper Daily (2026-05-09)

Published:

English version: /paper-news/2026-05-09/

Run statistics

  • Candidate papers: 692
  • Selected papers: 30
  • Read in depth: 30
  • Time window (UTC): 2026-05-08T00:00:00Z → 2026-05-09T00:00:00Z (weekend_backlog_unknown, expanded=0)
Papers used for the summary
| arXiv ID | Title | Categories | Score | Why selected | Tags |
| --- | --- | --- | --- | --- | --- |
| 2605.03619 | The Infinite Mutation Engine? Measuring Polymorphism in LLM-Generated Offensive Code | cs.CR | 93 | Measures LLM malware polymorphism with dual-agent pipeline; directly relevant to offensive capability risk. | llm-safety, cybersecurity, malware, evaluation, agents |
| 2605.03353 | SkCC: Portable and Secure Skill Compilation for Cross-Framework LLM Agents | cs.CR, cs.AI | 92 | Portable skill compilation plus security hardening for cross-framework LLM agents. | llm-agents, agent-security, prompt-engineering, compiler, skills |
| 2605.04624 | AuditRepairBench: A Paired-Execution Trace Corpus for Evaluator-Channel Ranking Instability in Agent Repair | cs.AI, cs.SE | 92 | Agent-repair leaderboard instability from evaluator leakage; large trace corpus for auditing selection bias. | agent-safety, evaluation, benchmark, auditing, repair |
| 2605.02346 | APIOT: Autonomous Vulnerability Management Across Bare-Metal Industrial OT Networks | cs.CR, cs.AI | 90 | Autonomous OT pentesting/remediation with runtime controls; strong agent-security relevance. | agent-security, cybersecurity, autonomous-agents, operational-technology, red-teaming |
| 2605.03310 | Coordination as an Architectural Layer for LLM-Based Multi-Agent Systems | cs.MA, cs.LG, q-fin.TR | 90 | Principled coordination layer for LLM multi-agent failures; strong relevance to agent reliability. | multi-agent, coordination, agent-architecture, reliability, evaluation |
| 2605.03547 | Erase Persona, Forget Lore: Benchmarking Multimodal Copyright Unlearning in Large Vision Language Models | cs.CV, cs.AI | 89 | First benchmark for multimodal copyright unlearning in LVLMs; strong safety and evaluation relevance. | unlearning, multimodal, LVLM, benchmark, copyright |
| 2605.02815 | FlexSQL: Flexible Exploration and Execution Make Better Text-to-SQL Agents | cs.CL | 89 | Agentic text-to-SQL with flexible exploration, execution, and repair; strong relevance to tool-using LLMs. | agents, text-to-sql, tool-use, reasoning, evaluation |
| 2605.04003 | Physics-Grounded Multi-Agent Architecture for Traceable, Risk-Aware Human-AI Decision Support in Manufacturing | cs.MA, cs.AI, cs.IR | 88 | Traceable multi-agent decision support with safety bounds, provenance, and human approval. | multi-agent, safety, provenance, human-in-the-loop, tool-use |
| 2605.04874 | Uncertainty-Aware Exploratory Direct Preference Optimization for Multimodal Large Language Models | cs.LG, cs.CL, cs.CV | 88 | Uncertainty-aware DPO for MLLM hallucination; directly relevant to multimodal alignment reliability. | multimodal-llm, alignment, dpo, hallucination, uncertainty |
| 2605.04831 | StoryAlign: Evaluating and Training Reward Models for Story Generation | cs.CL, cs.AI | 88 | Benchmarking and training reward models for story preferences; useful for alignment and RM evaluation. | alignment, reward-models, evaluation, llms, preferences |
| 2605.05017 | Position: Embodied AI Requires a Privacy-Utility Trade-off | cs.AI, cs.RO | 88 | Privacy-focused position on embodied AI lifecycle risks; strong safety relevance despite no empirical results. | embodied-ai, privacy, safety, position-paper, deployment |
| 2605.02765 | U-Define: Designing User Workflows for Hard and Soft Constraints in LLM-Based Planning | cs.AI, cs.HC, cs.LG | 88 | User control and verification for LLM planning; directly relevant to reliable agent workflows. | llm-planning, human-ai, verification, reliability, agents |
| 2605.02709 | An Empirical Study of Agent Skills for Healthcare: Practice, Gaps, and Governance | cs.AI | 87 | Empirical study of healthcare agent skills highlights governance, safety gaps, and deployment realities. | agents, governance, healthcare, safety, empirical-study |
| 2605.03900 | Contextual Multi-Objective Optimization: Rethinking Objectives in Frontier AI Systems | cs.AI | 86 | Frames frontier AI failures as contextual objective selection; broad alignment relevance. | alignment, objectives, agents, decision-making, theory |
| 2605.03759 | Before Forgetting, Learn to Remember: Revisiting Foundational Learning Failures in LVLM Unlearning Benchmarks | cs.CV, cs.AI | 86 | Finds unlearning benchmarks fail when models never memorized; proposes stronger LVLM memorization benchmark. | unlearning, privacy, LVLM, benchmark, evaluation |
| 2605.02463 | When Stress Becomes Signal: Detecting Antifragility-Compatible Regimes in Multi-Agent LLM Systems | cs.MA, cs.AI, cs.CE | 86 | Targets robustness beyond robustness: stress-testing multi-agent LLMs for antifragility signals. | multi-agent, robustness, evaluation, stress-testing, agents |
| 2605.04906 | Strat-Reasoner: Reinforcing Strategic Reasoning of LLMs in Multi-Agent Games | cs.AI | 86 | RL framework for strategic reasoning in multi-agent games; relevant to agentic reasoning and evaluation. | llms, agents, reasoning, multi-agent, reinforcement-learning |
| 2605.04373 | Worst-Case Discovery and Runtime Protection for RL-Based Network Controllers | cs.NI, cs.AI, eess.SY | 86 | Finds worst-case failures in RL controllers and adds runtime protection; strong robustness/security angle. | rl, robustness, runtime-protection, verification, networking |
| 2605.03677 | Uni-OPD: Unifying On-Policy Distillation with a Dual-Perspective Recipe | cs.LG | 86 | Unified on-policy distillation for LLMs/MLLMs with concrete bottlenecks and recipe. | LLM, MLLM, distillation, post-training, optimization |
| 2605.02741 | AI-Generated Smells: An Analysis of Code and Architecture in LLM and Agent-Driven Development | cs.SE, cs.AI | 86 | Audits maintainability risks in LLM/agent-generated code with concrete defect patterns and tradeoffs. | llm-agents, software-engineering, evaluation, reliability, technical-debt |
| 2605.02620 | Beating the Style Detector: Three Hours of Agentic Research on the AI-Text Arms Race | cs.CL, cs.LG | 85 | Agentic research reproduces NLP study fast; strong frontier-agent capability signal with eval implications. | agents, evaluation, automation, llm-capabilities, reproducibility |
| 2605.02624 | Synthetic Users, Real Differences: an Evaluation Framework for User Simulation in Multi-Turn Conversations | cs.CL | 85 | Framework to evaluate realism of simulated users in multi-turn chats; useful for scalable agent evaluation. | evaluation, user-simulation, multi-turn, chatbots, benchmark |
| 2605.03476 | CuraView: A Multi-Agent Framework for Medical Hallucination Detection with GraphRAG-Enhanced Knowledge Verification | cs.CL, cs.AI | 84 | GraphRAG multi-agent hallucination detection for medical summaries with evidence grounding. | hallucination, graphrag, medical-llm, multi-agent, factuality |
| 2605.02728 | ORPilot: A Production-Oriented Agentic LLM-for-OR Tool for Optimization Modeling | cs.AI | 84 | Production-oriented agentic LLM system with modular data/spec elicitation; useful for real-world agent design. | agents, LLM, optimization, tool-use, production |
| 2605.04507 | Distilling Bayesian Belief States into Language Models for Auditable Negotiation | cs.CL | 84 | Makes negotiation agents auditable by distilling explicit Bayesian beliefs into LM outputs. | auditing, interpretability, belief-state, negotiation, llm |
| 2605.03571 | PatRe: A Full-Stage Office Action and Rebuttal Generation Benchmark for Patent Examination | cs.CL, cs.AI | 84 | Real-world multi-turn benchmark for office actions and rebuttals; strong agentic/legal reasoning testbed. | benchmark, agents, llms, legal-reasoning, retrieval |
| 2605.02730 | Perceptual Flow Network for Visually Grounded Reasoning | cs.CV, cs.AI | 84 | Targets LVLM hallucination and language bias with reward-shaped grounded reasoning; frontier multimodal reliability. | multimodal, hallucination, reasoning, vlm, reliability |
| 2605.03824 | Reproducing Complex Set-Compositional Information Retrieval | cs.CL | 84 | Repro study + new benchmark for compositional retrieval; useful for RAG reasoning evaluation. | RAG, retrieval, benchmark, evaluation, reasoning |
| 2605.04922 | Evolving Idea Graphs with Learnable Edits-and-Commits for Multi-Agent Scientific Ideation | cs.MA, cs.AI | 84 | Structured multi-agent ideation via evolving graphs; notable for explicit coordination and evaluation claims. | multi-agent, scientific-discovery, coordination, llm-systems, evaluation |
| 2605.02735 | Visual Latents Know More Than They Say: Unsilencing Latent Reasoning in MLLMs | cs.LG | 84 | Novel MLLM latent-reasoning pathology and fix; relevant to multimodal reasoning efficiency. | multimodal, reasoning, latent-space, MLLMs, efficiency |

AI Paper Insight Brief

2026-05-09

0) Executive takeaways (read this first)

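  • Runtime structure is becoming the primary reliability lever for agents. Across OT security, planning, manufacturing, coordination, and network control, papers repeatedly show that guardrails, critics, formal verifiers, typed IRs, and rule-based runtime interventions often improve outcomes more than prompt tuning alone.
  • Evaluation is shifting from average-performance scores to characterizing failure surfaces. Several papers focus on worst-case discovery, evaluator-channel leakage, stress geometry, distributional realism, and architecture-specific failure signatures rather than benchmark accuracy alone.
  • "Grounding" now means more than retrieval. Stronger systems increasingly combine retrieval with typed outputs, deterministic tools, graph structure, or formal checks: GraphRAG for patient-specific verification, database tool loops for text-to-SQL, knowledge graphs for manufacturing, and model checking for hard planning constraints.
  • Agent capabilities are expanding into operationally consequential domains, but transfer remains the bottleneck. APIOT demonstrates an end-to-end exploit → patch → verify flow on bare-metal OT; ORPilot handles production-style optimization workflows; MAKA supports aerospace machining decisions. In each case, real deployment still faces problems of physical transfer, semantic validation, or online operation.
  • In multimodal and safety settings, unlearning and detection remain shallow. Copyright unlearning benchmarks show that current methods either preserve utility or truly forget, but rarely both; LVLM unlearning benchmarks may themselves be invalid if stage-one memorization never happened; and the AI-text and offensive-code papers show that detectors and static signatures are increasingly brittle against adaptive generation.
  • The practical frontier is "auditable autonomy." The most decision-relevant papers do more than raise task success: they expose provenance, uncertainty, evidence levels, cost-quality trade-offs, or interpretable rules so that humans can inspect and constrain system behavior.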
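To make the shift from average scores to failure surfaces concrete, here is a minimal Python sketch that reports a worst-case and tail view next to the mean. The function name `tail_summary`, the `tail_fraction` parameter, and the scores are illustrative assumptions and are not taken from any of the papers.

```python
from statistics import mean

def tail_summary(scores, tail_fraction=0.2):
    """Summarize per-scenario scores by mean, worst case, and worst-tail mean.

    Hypothetical helper: higher scores are assumed to be better, and
    `tail_fraction` is the share of worst scenarios averaged in the tail.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    ordered = sorted(scores)  # ascending, so the worst scenarios come first
    k = max(1, int(len(ordered) * tail_fraction))
    return {
        "mean": mean(ordered),
        "worst": ordered[0],
        "tail_mean": mean(ordered[:k]),
    }

# Made-up scores for five scenarios; one scenario fails badly.
scores = [0.9, 0.95, 0.2, 0.85, 0.9]
summary = tail_summary(scores)
print(summary)
```

In this toy data the mean looks acceptable while the worst scenario does not, which is the kind of gap an average-only leaderboard would hide.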

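2) Key themes (clusters)

Theme: Runtime governance and verifiable control for agents

Theme: Evaluation is shifting toward failure diagnosis, not just leaderboard scores

Theme: Grounded reasoning through tools, graphs, and typed intermediate representations

Theme: Security, misuse, and the failure of static defenses

Theme: Alignment and preference learning are becoming more context-aware and token-aware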

3) Technical synthesis

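  • Typed intermediate layers are becoming a core system pattern: ORPilot's JSON IR, SkCC's SkIR, CuraView's schema-constrained outputs, and MAKA's structured JSON routing all reduce ambiguity and make downstream verification possible.
  • Backtracking beats one-shot patching: FlexSQL explicitly revisits planning assumptions rather than only fixing SQL syntax; APIOT's overseer enforces phase transitions; REGUARD iterates a search-and-protect loop. Together these suggest that robust agents need upstream error correction, not just patches on the final output.
  • Deterministic tools are being reserved for what models do worst: numerical computation, protocol message construction, formal verification, solver execution, and physical compensation calculations are increasingly pulled out of free-form generation.
  • Evaluation is becoming architecture-aware: the coordination papers isolate orchestration effects while holding models and information fixed; AuditRepairBench isolates selector/evaluator coupling. This is a useful template for future agent benchmark design.
  • Distributional realism matters more than per-sample realism: realsim evaluates user simulators on the distributions of intent, feedback, identity, knowledge, and surface form, echoing a broader shift toward population-level validity.
  • Graph structure helps more when the evidence is relational rather than plain text: CuraView's per-patient GraphRAG and MAKA's machining knowledge graph both outperform flatter retrieval setups by preserving entity relations and provenance.
  • Runtime protection is becoming more interpretable: REGUARD's threshold rules, U-Define's hard/soft split, and MAKA's critic checks all reflect a preference for auditable interventions over opaque policy modifications.
  • Several papers expose a "semantic correctness gap": ORPilot can compile and solve yet still be semantically wrong; style detectors may classify on length confounds; unlearning methods may merely refuse rather than truly forget; and benchmark wins can mask shallow mechanisms.
  • Test-time scaling remains useful when combined with diversity and verification: FlexSQL's Majority@16 gains, Strat-Reasoner's micro-rollouts, and CAFE's architecture-specific stress patterns all suggest that structured exploration is a practical lever.
  • Many of the strongest results are still limited by environment realism: OT emulation, digital twins, synthetic banking stress, synthetic copyright concepts, and synthetic identities all improve control and measurement, but transfer to real environments remains the key unsolved step.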
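The typed-intermediate-layer and deterministic-tool patterns above can be sketched together in a few lines of Python. This is a toy illustration only: `ObjectiveIR`, `parse_objective`, `evaluate`, and the JSON fields are hypothetical names and do not reproduce ORPilot's JSON IR, SkCC's SkIR, or any other system's actual schema.

```python
import json
from dataclasses import dataclass

@dataclass(frozen=True)
class ObjectiveIR:
    """Hypothetical typed IR for a linear objective."""
    sense: str          # "min" or "max"
    coefficients: dict  # variable name -> float coefficient

def parse_objective(raw: str) -> ObjectiveIR:
    """Validate model-produced JSON before anything downstream trusts it."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("IR must be a JSON object")
    sense = data.get("sense")
    if sense not in ("min", "max"):
        raise ValueError(f"invalid sense: {sense!r}")
    coeffs = data.get("coefficients")
    if not isinstance(coeffs, dict) or not coeffs:
        raise ValueError("coefficients must be a non-empty object")
    clean = {}
    for name, value in coeffs.items():
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"coefficient for {name!r} is not numeric")
        clean[name] = float(value)
    return ObjectiveIR(sense=sense, coefficients=clean)

def evaluate(ir: ObjectiveIR, assignment: dict) -> float:
    """Deterministic numeric step: no free-form generation involved."""
    missing = set(ir.coefficients) - set(assignment)
    if missing:
        raise ValueError(f"missing values for: {sorted(missing)}")
    return sum(c * assignment[v] for v, c in ir.coefficients.items())

ir = parse_objective('{"sense": "min", "coefficients": {"x": 2, "y": 3}}')
print(evaluate(ir, {"x": 1, "y": 2}))
```

The design choice being illustrated is that the model only proposes the structured artifact, while validation and arithmetic are handled by ordinary code that can be audited and tested.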

4) Top 5 papers (with "why now")

  • APIOT: Autonomous Vulnerability Management Across Bare-Metal Industrial OT Networks
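    • Demonstrates autonomous discovery → exploitation → patching → verification on bare-metal MCU OT targets, using protocol primitives rather than shell-centric tooling.
    • Shows that runtime governance has a material effect: in the T1 ablation, enabling the overseer brings task success to 100% and cuts completion time by 20.5%.
    • Useful now because it extends the threat model from Linux/web penetration testing to industrial protocols and resource-constrained firmware.
    • Reservations: results come from QEMU/emulated environments, exploit coverage is limited, and transfer to real physical chips is uncertain.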
  • FlexSQL: Flexible Exploration and Execution Make Better Text-to-SQL Agents
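    • Recasts text-to-SQL as continuous exploration plus plan/program backtracking, rather than one-shot schema linking and repair.
    • Reaches 65.44% Majority@16 on Spider2-Snow with gpt-oss-120b, and shows large drops when Python support or diversity is removed.
    • Useful now because enterprise database interfaces increasingly fail under ambiguity and large schemas, which is exactly where fixed-stage pipelines tend to break down.
    • Reservations: the gains come with high tool-call overhead, and the comparison does not include top closed-source systems.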
  • CuraView: A Multi-Agent Framework for Medical Hallucination Detection with GraphRAG-Enhanced Knowledge Verification
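    • Builds a patient-specific GraphRAG pipeline for sentence-level verification of discharge summaries with structured evidence levels.
    • Reports an E4 F1 of 0.831 and recall of 0.909 on safety-critical contradictions, about 0.19–0.20 F1 above flat-retrieval baselines.
    • Useful now because clinical deployment needs patient-grounded factuality checks, not generic hallucination benchmarks.
    • Reservations: labels come partly from the generation pipeline itself, and evaluation is limited to a curated single-center subset.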
  • Worst-Case Discovery and Runtime Protection for RL-Based Network Controllers
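    • Combines bilevel worst-case scenario search with interpretable runtime rules to protect pretrained RL controllers without retraining.
    • Finds that controllers can be 43%–64% worse than achievable in feasible scenarios, then closes those gaps by about 79%–85% while preserving nominal performance.
    • Useful now because it offers a concrete template for safety-critical learned control: find the failures first, then patch locally.
    • Reservations: certificate tightness depends on the quality of the internal reference portfolio and the simplicity of the rule class.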
  • AuditRepairBench: A Paired-Execution Trace Corpus for Evaluator-Channel Ranking Instability in Agent Repair
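    • Isolates a subtle but important benchmark failure mode: when the agent's selector reads evaluator output, changes in evaluator configuration alter the ranking.
    • Provides a large paired-trace corpus and a screening ensemble that reaches AUROC 0.96 on source-level surgery cases, and supports low-cost remediation.
    • Useful now because agent leaderboards are expanding faster than their measurement hygiene, and this paper lays out a concrete audit path.
    • Reservations: the paper explicitly does not certify causal mechanisms beyond its observability boundary, and forward transfer is only moderate.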
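For readers unfamiliar with the Majority@16 metric quoted for FlexSQL, one common reading of Majority@k is a vote over the execution results of k sampled candidates. The Python sketch below illustrates that reading only; the exact protocol used in the paper is not described in this digest, and `majority_at_k` and the sample data are hypothetical.

```python
from collections import Counter

def majority_at_k(results):
    """Return the most frequent execution result among k candidates.

    `results` holds one hashable result per sampled candidate (for example
    a tuple of rows), or None when that candidate failed to execute.
    Ties go to the result that appeared first.
    """
    valid = [r for r in results if r is not None]
    if not valid:
        return None
    counts = Counter(valid)  # keeps first-seen order of distinct results
    # max() returns the first maximal key, which gives the tie-break above.
    return max(counts, key=counts.get)

# Four hypothetical candidates: two agree, one differs, one failed to run.
candidate_results = [((1,),), ((2,),), ((1,),), None]
print(majority_at_k(candidate_results))
```

Under this reading, a question counts as solved when the voted result matches the reference result, which is why removing sample diversity can hurt the metric.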

5) Practical next steps

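  • Add a runtime governance layer to the agent stack by default: repetition guards, phase-transition checks, schema validation, bounded retries, and explicit escalation paths.
  • Evaluate agents under architecture-controlled ablations, not just model swaps: hold tools and prompts fixed and vary only the coordination scheme, evaluator access, or verifier placement.
  • For high-risk domains, require typed intermediate artifacts, and use deterministic execution for numerical, protocol, or solver-critical steps.
  • Build a worst-case discovery loop before deployment: search for feasible high-regret scenarios, then derive minimal, interpretable runtime protections rather than retraining globally.
  • Before trusting simulation-based evaluation, measure the distributional realism of simulators and synthetic users; in particular, track feedback, context disclosure, termination behavior, and domain-specific behavior.
  • Treat detector wins with caution unless diagnostics have ruled out confounds such as length, formatting, or frozen-evaluator leakage.
  • In multimodal safety/unlearning work, verify that stage-one memorization actually happened before claiming forgetting; add exposure-style or internal-state checks.
  • For retrieval-backed agent systems in relational or patient/entity-specific domains, move beyond flat RAG toward graph-structured evidence plus schema-constrained outputs.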
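The runtime governance layer named in the first step can be sketched as a small guard that sits between the agent and its tools. This is a minimal Python illustration under stated assumptions: the phase graph, the action schema (`phase`, `tool`, `args`), and the names `RuntimeGuard` and `Escalate` are all hypothetical and do not reproduce APIOT's overseer or any other paper's implementation.

```python
from dataclasses import dataclass, field

# Hypothetical phase graph; a real agent would define its own.
ALLOWED_TRANSITIONS = {
    "plan": {"plan", "act"},
    "act": {"act", "verify"},
    "verify": {"verify", "act", "done"},
    "done": set(),
}

class Escalate(Exception):
    """Signals that the guard is handing control to a human operator."""

@dataclass
class RuntimeGuard:
    max_repeats: int = 2  # identical consecutive actions tolerated
    max_retries: int = 3  # failures tolerated before escalating
    phase: str = "plan"
    failures: int = 0
    history: list = field(default_factory=list)

    def check(self, action) -> None:
        """Validate one proposed action; raise Escalate instead of running it."""
        # Schema validation: exactly the expected keys and basic types.
        if not isinstance(action, dict) or set(action) != {"phase", "tool", "args"}:
            raise Escalate("malformed action")
        if not isinstance(action["phase"], str) or not isinstance(action["args"], dict):
            raise Escalate("malformed action fields")
        # Phase-transition check against the allowed graph.
        if action["phase"] not in ALLOWED_TRANSITIONS[self.phase]:
            raise Escalate(f"illegal transition {self.phase} -> {action['phase']}")
        # Repetition guard: block an action already repeated max_repeats times.
        key = (action["tool"], repr(sorted(action["args"].items())))
        recent = self.history[-self.max_repeats:]
        if len(recent) == self.max_repeats and all(k == key for k in recent):
            raise Escalate("repeated action")
        self.history.append(key)
        self.phase = action["phase"]

    def record_failure(self) -> None:
        """Bounded retries: escalate once the failure budget is spent."""
        self.failures += 1
        if self.failures > self.max_retries:
            raise Escalate("retry budget exhausted")
```

A caller would run `guard.check(action)` before every tool call and `guard.record_failure()` after every failed one, treating `Escalate` as the explicit escalation path rather than letting the agent retry indefinitely.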

Generated from per-paper analysis; no external browsing was performed.