AI Paper Daily (2026-05-11)

Published:

English version: /paper-news/2026-05-11/

Run statistics

  • Candidate papers: 5420
  • Selected papers: 30
  • Deep reads completed: 30
  • Time window (UTC): 2026-05-08T00:00:00Z → 2026-05-09T00:00:00Z (weekend_backlog_sat, expanded=0)
Papers used for summarization
Each entry: arXiv ID | title (PDF) | categories | score | selection rationale | tags.

  • 2605.02236 | Perturbation Dose Responses in Recursive LLM Loops: Raw Switching, Stochastic Floors, and Persistent Escape under Append, Replace, and Dialog Updates (PDF)
    cs.AI, cs.CL, cs.LG | Score 90 | Studies persistence/escape in recursive LLM loops; relevant to agent stability and prompt-induced drift.
    Tags: llm, agents, safety, robustness, evaluation
  • 2604.19734 | UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling (PDF)
    cs.RO, cs.AI | Score 88 | Humanoid foundation-model direction: unified latent action language for human-to-robot transfer.
    Tags: robotics, foundation-models, world-models, policy-learning, transfer-learning
  • 2605.02372 | Privacy Preserving Machine Learning Workflow: from Anonymization to Personalized Differential Privacy Budgets in Federated Learning (PDF)
    cs.CR, cs.AI | Score 88 | Privacy-preserving FL workflow with poisoning detection and personalized DP budgets.
    Tags: privacy, federated-learning, differential-privacy, poisoning, security
  • 2605.03426 | Replacing Parameters with Preferences: Federated Alignment of Heterogeneous Vision-Language Models (PDF)
    cs.AI | Score 88 | Federated preference-based alignment for heterogeneous VLMs; strong privacy/alignment relevance.
    Tags: federated-learning, alignment, VLM, preference-modeling, privacy
  • 2603.15506 | Seeking SOTA: Time-Series Forecasting Must Adopt Taxonomy-Specific Evaluation to Dispel Illusory Gains (PDF)
    cs.LG, cs.AI | Score 88 | Calls out misleading TSF benchmarks; strong evaluation critique with broad ML relevance.
    Tags: evaluation, benchmarking, time-series, methodology, robustness
  • 2605.02351 | MolViBench: Evaluating LLMs on Molecular Vibe Coding (PDF)
    cs.CL | Score 87 | New benchmark for LLM molecular code generation; useful eval for domain agents and executable reasoning.
    Tags: llm, benchmark, code-generation, agents, evaluation
  • 2605.02669 | An explainable hypothesis-driven approach to Drug-Induced Liver Injury with HADES (PDF)
    cs.AI | Score 86 | Agentic, auditable biomedical reasoning plus benchmark for a high-stakes domain.
    Tags: agents, llm, safety-critical, benchmark, explainability, biomed
  • 2605.03941 | A Benchmark for Interactive World Models with a Unified Action Generation Framework (PDF)
    cs.CV, cs.AI | Score 86 | Large benchmark for interactive world models with unified action evaluation; reusable for agent capability testing.
    Tags: world-models, benchmark, agents, evaluation, multimodal
  • 2605.02110 | Adversarial Update-Based Federated Unlearning for Poisoned Model Recovery (PDF)
    cs.LG, cs.CR | Score 86 | Targets poisoned federated models with efficient unlearning/recovery; concrete security relevance.
    Tags: federated-learning, security, unlearning, poisoning, robustness
  • 2604.24001 | CT-FineBench: A Diagnostic Fidelity Benchmark for Fine-Grained Evaluation of CT Report Generation (PDF)
    cs.AI | Score 86 | Fine-grained factuality benchmark for CT report generation; strong eval utility and reuse potential.
    Tags: evaluation, benchmark, factuality, medical-ai, report-generation
  • 2605.04491 | An Evaluation of Chat Safety Moderations in Roblox (PDF)
    cs.CY, cs.CR | Score 85 | Large-scale independent evaluation of chat moderation on a child-heavy platform; concrete safety relevance.
    Tags: safety, moderation, evaluation, platforms, cybersecurity
  • 2605.03821 | RoboAlign-R1: Distilled Multimodal Reward Alignment for Robot Video World Models (PDF)
    cs.RO, cs.AI | Score 85 | Reward-aligned robot world models plus new benchmark/judge; relevant to alignment of embodied generative models.
    Tags: alignment, robotics, world-models, reward-modeling, benchmark
  • 2605.05045 | When Relations Break: Analyzing Relation Hallucination in Vision-Language Model Under Rotation and Noise (PDF)
    cs.CV, cs.CL | Score 85 | Targets VLM relation hallucination under perturbations; useful robustness evaluation.
    Tags: vlm, hallucination, robustness, evaluation, multimodal
  • 2603.22219 | Noise Titration: Exact Distributional Benchmarking for Probabilistic Time Series Forecasting (PDF)
    cs.LG, stat.ML | Score 85 | Exact statistical benchmark for probabilistic forecasting; reusable eval framework.
    Tags: evaluation, benchmark, probabilistic-modeling, time-series, robustness
  • 2605.02374 | Fight Poison with Poison: Enhancing Robustness in Few-shot Machine-Generated Text Detection with Adversarial Training (PDF)
    cs.CR, cs.CL | Score 84 | Adversarial training for robust machine-generated text detection; concrete black-box threat model.
    Tags: llm-security, adversarial-training, text-detection, evaluation, robustness
  • 2604.11734 | Multi-ORFT: Stable Online Reinforcement Fine-Tuning for Multi-Agent Diffusion Planning in Cooperative Driving (PDF)
    cs.RO, cs.AI | Score 84 | Online RL post-training for multi-agent diffusion driving planners with explicit safety/efficiency aims.
    Tags: reinforcement-learning, autonomous-driving, multi-agent, diffusion, safety
  • 2604.20719 | ONOTE: Benchmarking Omnimodal Notation Processing for Expert-level Music Intelligence (PDF)
    cs.SD, cs.AI, cs.MM, eess.AS | Score 84 | Benchmark targets omnimodal reasoning and explicitly critiques hallucination-prone LLM-as-judge evals.
    Tags: benchmark, multimodal, evaluation, hallucinations, reasoning
  • 2605.03544 | DALPHIN: Benchmarking Digital Pathology AI Copilots Against Pathologists on an Open Multicentric Dataset (PDF)
    cs.CV, cs.AI | Score 84 | Open multicentric benchmark comparing pathology copilots to experts; strong real-world LLM/VLM evaluation value.
    Tags: benchmark, multimodal, medical-ai, evaluation, copilots
  • 2604.10996 | When Valid Signals Fail: Regime Boundaries Between LLM Features and RL Trading Policies (PDF)
    cs.CL, cs.AI, cs.CE | Score 84 | LLM-generated features help RL trading only in some regimes; useful reliability lesson with concrete IC results.
    Tags: llm, rl, reliability, evaluation, representation
  • 2605.03986 | From Intent to Execution: Composing Agentic Workflows with Agent Recommendation (PDF)
    cs.AI | Score 84 | Automates multi-agent workflow composition and agent recommendation; useful agentic systems infra.
    Tags: agents, multi-agent, workflow, orchestration, LLM
  • 2604.19724 | Benign Overfitting in Adversarial Training for Vision Transformers (PDF)
    cs.LG, cs.AI | Score 84 | Theoretical analysis of adversarial training in ViTs; robustness results could inform secure model design.
    Tags: adversarial-robustness, vision-transformers, theory, security, generalization
  • 2605.05121 | Beyond Semantics: An Evidential Reasoning-Aware Multi-View Learning Framework for Trustworthy Mental Health Prediction (PDF)
    cs.CL | Score 83 | Targets trustworthy prediction with uncertainty and reasoning-aware views in a high-stakes language setting.
    Tags: trustworthiness, uncertainty, nlp, reliability, evaluation
  • 2604.20382 | Graph2Counsel: Clinically Grounded Synthetic Counseling Dialogue Generation from Client Psychological Graphs (PDF)
    cs.CL | Score 82 | LLM data generation for counseling with structured grounding in a high-risk domain.
    Tags: llm, synthetic-data, mental-health, safety-critical, grounding
  • 2603.21597 | A Multidisciplinary AI Board for Multimodal Dementia Characterization and Risk Assessment (PDF)
    cs.AI, cs.CV | Score 82 | Interactive multi-agent clinical AI with privacy-preserving deployment and clinician-facing reasoning tools.
    Tags: agents, healthcare, multimodal, privacy, decision-support
  • 2604.20166 | Aligning Human-AI-Interaction Trust for Mental Health Support: Survey and Position for Multi-Stakeholders (PDF)
    cs.CL, cs.HC | Score 82 | Trust/safety framework for mental-health AI; strong multi-stakeholder lens on reliability and deployment.
    Tags: AI-safety, trust, mental-health, survey, evaluation
  • 2604.26498 | Do Larger Models Really Win in Drug Discovery? A Benchmark Assessment of Model Scaling in AI-Driven Molecular Property and Activity Prediction (PDF)
    cs.LG, q-bio.QM | Score 82 | Useful scaling reality check: larger models often do not win in drug discovery across many endpoints.
    Tags: scaling-laws, benchmark, foundation-models, evaluation, drug-discovery
  • 2603.15185 | What Matters for Scalable and Robust Learning in End-to-End Driving Planners? (PDF)
    cs.RO, cs.AI, cs.CV | Score 82 | Systematic study of what actually improves closed-loop end-to-end driving robustness and scalability.
    Tags: autonomy, robustness, evaluation, scaling, planning
  • 2604.25472 | SciEval: A Benchmark for Automatic Evaluation of K-12 Science Instructional Materials (PDF)
    cs.AI | Score 82 | New benchmark for LLM-based evaluation of AI-generated science materials with evidence.
    Tags: benchmark, evaluation, llm, education, reliability
  • 2605.03788 | Say the Mission, Execute the Swarm: Agent-Enhanced LLM Reasoning in the Web-of-Drones (PDF)
    cs.AI, cs.NI, cs.RO | Score 82 | Grounded LLM agent framework for real-time drone swarms; notable agent execution/safety setting.
    Tags: agents, LLM, robotics, tool-use, cyber-physical-systems
  • 2604.19357 | FairTree: Subgroup Fairness Auditing of Machine Learning Models with Bias-Variance Decomposition (PDF)
    cs.LG | Score 82 | Subgroup fairness auditing with bias-variance decomposition; practical auditing tool with broad applicability.
    Tags: fairness, auditing, evaluation, bias, reliability

AI Paper Insights Briefing

2026-05-11

0) Key takeaways (read this first)

  • Evaluation is today's most prominent theme: several papers argue that current benchmarks overstate progress and replace them with more falsifiable or finer-grained protocols, such as taxonomy-specific forecasting evaluation, exact noise titration for probabilistic time-series forecasting, attribute-level CT report scoring, deterministic music-notation evaluation, and multicentric pathology/VLM benchmarks.
  • Robustness failures are increasingly traced to interface design rather than model scale alone: BEV compression improves closed-loop driving, memory/update rules determine how brittle recursive LLM loops are, and simple preprocessing only partially repairs VLM relation hallucination under rotation/noise.
  • Post-training is becoming more targeted and modular: diffusion planners adopt online RL with variance-gated optimization, robot world models use distilled multimodal reward alignment plus inference-time re-encoding, and federated VLM alignment is shifting from parameter sharing to reward routing.
  • Larger models do not reliably win in specialized domains: simple/classical methods remain competitive in time-series forecasting and molecular prediction, and pathology-specific or task-specific systems often beat generalist multimodal models on domain tasks.
  • In high-stakes domains, the strongest papers pair performance gains with workflow-oriented explainability: dementia risk assessment, DILI hypothesis generation, subgroup fairness auditing, and mental-health prediction all emphasize evidence trails, uncertainty, or mechanistic explanations rather than raw scores alone.
  • For agentic systems, the practical lesson is to harden the scaffolding, not just the base model: typed tools, guardrails, routing, retrieval, and explicit memory policies repeatedly determine whether a system stays reliable under distribution shift or long-horizon execution.
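The typed-tool-plus-guardrail scaffolding mentioned above can be sketched minimally. Everything below (ToolSpec, ToolGateway, the toy add tool) is a hypothetical illustration of the pattern, not any paper's or framework's API: the gateway validates model-proposed tool calls against a declared schema instead of executing raw arguments.

```python
from dataclasses import dataclass

@dataclass
class ToolSpec:
    name: str
    schema: dict      # expected argument names -> Python types
    handler: callable

class ToolGateway:
    """Hypothetical gateway: every tool call is schema-checked before execution."""

    def __init__(self):
        self._tools = {}

    def register(self, spec: ToolSpec):
        self._tools[spec.name] = spec

    def call(self, name: str, args: dict):
        spec = self._tools.get(name)
        if spec is None:
            return {"ok": False, "error": f"unknown tool: {name}"}
        # Guardrail: reject calls whose arguments do not match the declared
        # schema, instead of passing raw model output straight to the handler.
        for key, typ in spec.schema.items():
            if key not in args or not isinstance(args[key], typ):
                return {"ok": False, "error": f"bad argument: {key}"}
        extra = set(args) - set(spec.schema)
        if extra:
            return {"ok": False, "error": f"unexpected arguments: {sorted(extra)}"}
        return {"ok": True, "result": spec.handler(**args)}

gateway = ToolGateway()
gateway.register(ToolSpec("add", {"a": int, "b": int}, lambda a, b: a + b))
print(gateway.call("add", {"a": 2, "b": 3}))    # valid call
print(gateway.call("add", {"a": 2, "b": "3"}))  # rejected by the schema check
```

The design choice the sketch illustrates: failures are returned as structured errors the agent loop can act on, rather than raised as exceptions inside the handler.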

2) Key themes (clusters)

Theme: evaluation is moving from leaderboard scores to falsifiable diagnostics

Theme: closed-loop robustness hinges on representation bottlenecks and post-training

Theme: agent reliability is primarily a systems problem

Theme: high-stakes AI is moving toward evidence-backed, uncertainty-aware outputs

Theme: domain-specific benchmarks are exposing generalist models' failure points

3) Technical synthesis

  • A recurring pattern is redesigning benchmarks around causal structure: known DGPs in forecasting, attribute schemas in radiology, canonical pitch mappings in music, and sequestered answers in pathology all reduce ambiguity about what counts as "correct".
  • Several papers show that open-loop or feature-level validity does not imply closed-loop utility: driving planners with strong BEV features still fail in closed loop, LLM-derived trading features improve IC without improving policy robustness, and visually plausible world models can still be misaligned with the task.
  • Compression/bottlenecking is emerging as a robustness tool: scene tokenization in driving, shared latent action tokens in humanoid transfer, and lightweight distilled reward models in robot world models all improve scalability while reducing brittle dependence on raw high-dimensional inputs.
  • Post-training is gaining more structure than generic RLHF: VG-GRPO for diffusion planners, GRPO with routed rewards for federated VLMs, and reward-distilled RL for world models all tailor optimization to the model class and deployment constraints.
  • Several papers emphasize paired or counterfactual evaluation: treated-vs-control recursive loops, paraphrased-vs-adversarial CT reports, and benchmarks split by taxonomy or chemical similarity all try to separate real gains from artifacts.
  • Simple baselines remain surprisingly strong, especially in periodic forecasting and molecular property prediction, underscoring again that benchmark composition and split design can dominate perceived progress.
  • Inference-time fixes matter: orientation correction, denoising, sliding-window re-encoding, auxiliary tools, and guardrails often restore reliability better than prompt tuning alone.
  • Uncertainty is increasingly operationalized as a triage signal rather than just a calibration score: evidence-based mental-health prediction, modality-aware dementia fusion, and fairness auditing all aim to identify when a human should inspect or intervene.
  • Agentic systems are converging on modular orchestration: routers, recommenders, typed tool gateways, and review loops repeatedly outperform monolithic "hand everything to the model" designs.
  • Across safety-relevant domains, the strongest papers combine task-specific structure with human-auditable outputs, suggesting that current frontier progress comes more from system design and evaluation discipline than from model scaling alone.
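The paired/counterfactual evaluation idea above can be sketched as a sign-flip permutation test: compare the observed treated-vs-control effect against a randomization floor built from the same paired samples. This is a generic statistical illustration, not any paper's protocol; paired_effect_vs_floor and the sample numbers are hypothetical.

```python
import random
import statistics

def paired_effect_vs_floor(treated, control, n_perm=2000, seed=0):
    """Compare a paired treated-vs-control effect against a permutation floor.

    Returns the observed mean paired difference and the fraction of sign-flip
    permutations whose |mean difference| reaches it (a two-sided permutation
    p-value). Small values suggest a real effect rather than sampling variance.
    """
    rng = random.Random(seed)
    diffs = [t - c for t, c in zip(treated, control)]
    observed = statistics.mean(diffs)
    hits = 0
    for _ in range(n_perm):
        # Randomly flip the sign of each paired difference: this simulates the
        # null hypothesis where the treated/control labels are exchangeable.
        perm = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(statistics.mean(perm)) >= abs(observed):
            hits += 1
    return observed, hits / n_perm

# Hypothetical paired metric values (e.g. accuracy per task, with and without
# an intervention); any real study would use its own measurements.
treated = [0.81, 0.79, 0.84, 0.80, 0.83, 0.78, 0.82, 0.85]
control = [0.70, 0.72, 0.69, 0.71, 0.68, 0.73, 0.70, 0.69]
effect, p = paired_effect_vs_floor(treated, control)
print(f"mean paired difference = {effect:.3f}, permutation p ~ {p:.3f}")
```

Pairing by task before permuting is what separates this from a naive comparison of two unpaired means: the floor is built from the same samples, so between-task variance cannot masquerade as an effect.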

4) Top 5 papers (with "why now")

  • What Matters for Scalable and Robust Learning in End-to-End Driving Planners?
    • Shows that high-resolution BEV features can hurt closed-loop driving through causal confusion; a simple tokenizer bottleneck substantially improves driving score and success rate.
    • Disentangles the roles of decoupled outputs and diffusion planning: the former reduces static violations, the latter reduces dynamic violations, and they work best combined.
    • Demonstrates data-scaling advantages for diffusion planners and reports SOTA closed-loop Bench2Drive results plus gains on NAVSIM.
    • Caveats: compression may fail in long-range/high-speed scenarios, and diffusion still carries runtime trade-offs.
  • A Multidisciplinary AI Board for Multimodal Dementia Characterization and Risk Assessment
    • A strong example of workflow-oriented medical AI: modality agents, propose-and-review fusion, and clinician-facing dashboards.
    • Outperforms unimodal and LLM baselines on prediction, diagnosis, and survival tasks, and raised clinician accuracy by +17.5 percentage points in a reader study.
    • Handles missing modalities gracefully and adds a Dynamic Medical Notebook for iterative correction.
    • Caveats: labels come from retrospective EHR proxies, and the system still depends on general-purpose LLM reasoning components.
  • Noise Titration: Exact Distributional Benchmarking for Probabilistic Time Series Forecasting
    • Reframes forecasting robustness as an exact statistical problem by controlling the DGP and injected noise, enabling sharper conclusions than standard historical benchmarks.
    • Introduces a probabilistic Fern model with full Gaussian beliefs and rich calibration diagnostics.
    • Reveals failure modes of zero-shot foundation models and conformal methods under non-stationarity.
    • Caveats: the evidence is synthetic and Gaussian-noise based, so real-world transfer remains unproven.
  • RoboAlign-R1: Distilled Multimodal Reward Alignment for Robot Video World Models
    • Offers a practical recipe for aligning robot world models to task-level criteria rather than pixel similarity alone.
    • Distills an 8B multimodal judge into a ~98M reward model fast enough for online RL, and adds sliding-window re-encoding to reduce rollout drift.
    • Reports a +10.1% gain in aggregate judge metrics over the strongest baseline, with better long-horizon fidelity at minimal runtime overhead.
    • Caveats: gains are demonstrated on tabletop manipulation and have not yet been linked to downstream closed-loop control improvements.
  • DALPHIN: Benchmarking Digital Pathology AI Copilots Against Pathologists on an Open Multicentric Dataset
    • A high-value benchmark release: multicentric, pathologist-curated, sequestered evaluation, with direct comparison to 31 human readers.
    • Shows that the pathology-specific PathChat+ comes closer to expert performance than generalist VLMs on several tasks.
    • Timely because pathology copilots are advancing rapidly while leakage-resistant benchmarks remain scarce.
    • Caveats: evaluation uses selected ROIs rather than full WSIs and lacks broader clinical context or ancillary tests.

5) Practical next steps

  • Audit your evaluation stack for artifact-driven gains: add simple baselines, taxonomy-specific splits, and perturbation tests before trusting leaderboard improvements.
  • For agentic systems, explicitly test memory/update policies (append vs. replace vs. summarized context), since scaffolding mechanics can dominate robustness.
  • In closed-loop planning or control, add representation bottlenecks and compare open-loop vs. closed-loop metrics; do not assume richer latent states always help.
  • If you rely on expensive judges or reward models, try teacher-to-student distillation so the alignment signal can be used online, not just offline.
  • Add paired control experiments to robustness studies: compare treated vs. control vs. a control random floor to separate real effects from sampling variance.
  • For multimodal or medical systems, require outputs to include evidence trails, uncertainty, or mechanistic hypotheses so humans can inspect them.
  • In federated or privacy-sensitive settings with heterogeneous clients, consider sharing preference/reward/routing signals instead of full parameters.
  • For VLM deployments, benchmark relational reasoning under rotation/noise and test preprocessing pipelines; prompt-only fixes are unlikely to suffice.
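The memory/update-policy comparison above (append vs. replace vs. summarized context) can be sketched with a toy harness. run_loop is hypothetical, and the "summarize" policy is a stand-in (keep the last k turns) where a real harness would call a model to compress history; the point is only to make the three update rules testable side by side.

```python
def run_loop(policy, turns, max_keep=3):
    """Hypothetical harness: apply one context-update policy over a turn stream.

    Returns the final context so the policies' retention behavior can be
    compared directly.
    """
    context = []
    for turn in turns:
        if policy == "append":
            context.append(turn)                      # context grows without bound
        elif policy == "replace":
            context = [turn]                          # only the latest turn survives
        elif policy == "summarize":
            context = (context + [turn])[-max_keep:]  # bounded rolling window
        else:
            raise ValueError(f"unknown policy: {policy}")
    return context

turns = [f"turn-{i}" for i in range(10)]
for policy in ("append", "replace", "summarize"):
    kept = run_loop(policy, turns)
    print(policy, len(kept), kept[-1])
```

In a real study each policy would feed the retained context back into the model and score downstream behavior (drift, escape, task success); the harness structure stays the same.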

Generated from per-paper analyses; no external browsing was performed.