AI Paper Daily (2026-04-26)

Published:

English version: /paper-news/2026-04-26/

Run statistics

  • Candidate papers: 4233
  • Selected papers: 30
  • Deep reads completed: 30
  • Time window (UTC): 2026-04-24T00:00:00Z → 2026-04-25T00:00:00Z (weekend_backlog_unknown, expanded=0)
Paper list used for the summaries
arXiv ID | Title | Categories | Score | Selection rationale | Tags
  • 2604.19457 | Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents | cs.AI | 92 | Clear eval framework for long-horizon enterprise agents: factuality, reasoning, compliance, abstention axes | agents, alignment, evaluation, compliance, abstention, enterprise
  • 2604.18292 | Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence | cs.AI, cs.CL | 92 | Self-evolving arena to synthesize verifiable real-world tool envs for training lifelong agents. | agents, environment-generation, tool-use, lifelong-learning, evaluation, MCP
  • 2604.18401 | StepPO: Step-Aligned Policy Optimization for Agentic Reinforcement Learning | cs.CL | 92 | Agentic RL post-training with step-aligned optimization for tool-using LLM agents. | agentic-RL, LLM-agents, post-training, tool-use, credit-assignment, long-horizon
  • 2604.17870 | GraSP: Graph-Structured Skill Compositions for LLM Agents | cs.CL | 92 | Executable skill graphs w/ verification+repair for LLM agents; tackles skill overload/orchestration. | llm-agents, tool-use, planning, skill-composition, verification, reliability
  • 2604.17771 | SPENCE: A Syntactic Probe for Detecting Contamination in NL2SQL Benchmarks | cs.CL, cs.AI, cs.DB | 90 | Practical framework to detect/quantify NL2SQL benchmark contamination; important for trustworthy evals | data-contamination, evaluation, NL2SQL, synthetic-variants, benchmarking, LLM-reliability
  • 2604.19548 | Taming Actor-Observer Asymmetry in Agents via Dialectical Alignment | cs.CL, cs.AI, cs.CY | 90 | Identifies bias in multi-agent reflection/auditing; introduces Ambiguous Failure Benchmark + mitigation. | agents, multi-agent, reliability, evaluation, cognitive-bias, self-critique
  • 2604.17950 | CADMAS-CTX: Contextual Capability Calibration for Multi-Agent Delegation | cs.AI | 90 | Risk-aware contextual capability calibration for delegation; reduces misdelegation via uncertainty. | multi-agent, delegation, calibration, uncertainty, risk-aware, agent-safety
  • 2604.20300 | FSFM: A Biologically-Inspired Framework for Selective Forgetting of Agent Memory | cs.AI | 90 | Selective forgetting for LLM agents targets efficiency + security (forget malicious/sensitive memory). | agents, memory, security, privacy, robustness
  • 2604.17821 | WebUncertainty: Dual-Level Uncertainty Driven Planning and Reasoning For Autonomous Web Agent | cs.AI | 90 | Web agent planning+reasoning with uncertainty-aware adaptive planning and MCTS to curb hallucinations | web-agents, planning, uncertainty, MCTS, hallucinations, agent-reliability
  • 2604.17894 | Automatic Slide Updating with User-Defined Dynamic Templates and Natural Language Instructions | cs.CL | 89 | DynaSlide benchmark + SlideAgent for NL-driven slide edits; strong real-world agentic eval. | benchmarks, agents, tool-use, multimodal, document-ai, evaluation
  • 2604.20211 | Towards Secure Logging: Characterizing and Benchmarking Logging Code Security Issues with LLMs | cs.SE, cs.AI, cs.CR | 88 | Taxonomy + real-world benchmark (101 cases) for insecure logging; enables LLM detection/repair evaluation | security, benchmark, software-engineering, LLM, privacy-leakage, log-injection
  • 2604.19300 | HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language Models | cs.SD, cs.AI | 88 | Large benchmark for hallucination detection in audio-language models; fills key multimodal safety gap | hallucinations, multimodal, audio-language-models, benchmark, evaluation, reliability
  • 2604.18576 | Agentic Forecasting using Sequential Bayesian Updating of Linguistic Beliefs | cs.AI | 88 | Agentic forecasting with Bayesian linguistic belief updates + calibration; SOTA on ForecastBench. | agents, calibration, uncertainty, tool-use, forecasting, evaluation
  • 2604.19502 | Beyond Rating: A Comprehensive Evaluation and Benchmark for AI Reviews | cs.CL | 88 | Benchmark/framework for AI peer reviews beyond ratings; multi-dimension eval + dataset. | evaluation, benchmarks, LLM-reviews, metrics, faithfulness, argumentation
  • 2604.17842 | QuickScope: Certifying Hard Questions in Dynamic LLM Benchmarks | cs.CL | 88 | Method/tool to efficiently find hard cases in dynamic LLM benchmarks; useful for red-teaming. | evaluation, benchmarks, adversarial-testing, bayesian-optimization, robustness
  • 2604.20658 | Cooperative Profiles Predict Multi-Agent LLM Team Performance in AI for Science Workflows | cs.CL | 88 | Cooperation-game profiles predict multi-agent LLM team performance; useful for agent evaluation/safety. | multi-agent, evaluation, cooperation, AI-for-science, benchmarks
  • 2604.17827 | Learning to Seek Help: Dynamic Collaboration Between Small and Large Language Models | cs.CL | 88 | SLM learns when/how to query LLM under privacy/cost constraints; useful for safe, efficient agent designs | small-models, LLM-collaboration, privacy, cost-control, multi-step-reasoning, agent-architectures
  • 2604.19015 | FedProxy: Federated Fine-Tuning of LLMs via Proxy SLMs and Heterogeneity-Aware Fusion | cs.LG, cs.AI | 88 | Federated LLM fine-tuning tackling IP+privacy+heterogeneity via proxy SLM and fusion; practical deployment value | federated-learning, llm-finetuning, privacy, ip-protection, heterogeneous-data, compression, edge
  • 2604.18038 | First, Do No Harm (With LLMs): Mitigating Racial Bias via Agentic Workflows | cs.CY, cs.AI | 86 | Measures and mitigates racial bias in clinical LLM workflows; governance lens + multi-model eval. | safety, bias, healthcare, evaluation, agentic-workflows, governance
  • 2604.19005 | Debating the Unspoken: Role-Anchored Multi-Agent Reasoning for Half-Truth Detection | cs.CL | 86 | Multi-agent debate to detect misleading half-truths under noisy retrieval; strong eval claim. | misinformation, fact-checking, multi-agent, debate, RAG, robustness
  • 2604.20443 | DialToM: A Theory of Mind Benchmark for Forecasting State-Driven Dialogue Trajectories | cs.CL, cs.AI, cs.LG | 86 | DialToM benchmark separates ToM recognition vs using states to forecast dialogue; exposes reasoning gaps. | benchmark, theory-of-mind, dialogue, evaluation, reasoning
  • 2604.19012 | Security Is Relative: Training-Free Vulnerability Detection via Multi-Agent Behavioral Contract Synthesis | cs.CR, cs.SE | 85 | Training-free multi-agent vuln detection via behavioral contract synthesis; addresses dedup collapse issue | cybersecurity, vulnerability-detection, agents, specification-synthesis, robust-evaluation, software-security
  • 2604.18176 | QuantumQA: Enhancing Scientific Reasoning via Physics-Consistent Dataset and Verification-Aware Reinforcement Learning | cs.AI, quant-ph | 85 | Physics-consistent QuantumQA dataset + verification-aware RLVR reward modeling for rigor. | RLVR, verifiable-rewards, scientific-reasoning, dataset, reward-modeling, reliability
  • 2604.20495 | Towards Certified Malware Detection: Provable Guarantees Against Evasion Attacks | cs.CR, cs.LG | 84 | Certified robustness for malware detection via randomized smoothing; formal guarantees vs evasion attacks | cybersecurity, malware, certified-robustness, adversarial, randomized-smoothing
  • 2604.18071 | Architectural Design Decisions in AI Agent Harnesses | cs.AI | 84 | Empirical taxonomy of agent harness architecture decisions across 70 projects; reusable patterns. | agents, systems, orchestration, tooling, safety-controls, survey
  • 2604.20755 | V-tableR1: Process-Supervised Multimodal Table Reasoning with Critic-Guided Policy Optimization | cs.AI, cs.LG | 84 | Process-supervised RL for multimodal table reasoning with step-level critic feedback. | multimodal, process-supervision, RL, verifiable-reasoning, tables, critics
  • 2604.07712 | CausalVAE as a Plug-in for World Models: Towards Reliable Counterfactual Dynamics | cs.LG | 84 | CausalVAE plug-in boosts counterfactual world models; better intervention robustness & interpretability. | causal-representation-learning, world-models, counterfactuals, robustness, distribution-shift, interpretability
  • 2604.17966 | TPS-CalcBench: A Benchmark and Diagnostic Evaluation Framework for LLM Analytical Calculation Competence in Hypersonic Thermal Protection System Engineering | cs.AI | 84 | Safety-critical engineering calc benchmark; penalizes plausible-but-physically-wrong LLM answers. | evaluation, safety-critical, reasoning, numerical-robustness, aerospace
  • 2604.19281 | Beyond Semantic Similarity: A Component-Wise Evaluation Framework for Medical Question Answering Systems with Health Equity Implications | cs.HC, cs.AI, cs.CL, cs.LG | 84 | VB-Score evaluates medical QA via verification components; highlights health equity and safety risks. | evaluation, factuality, medical, health-equity, reliability
  • 2603.17820 | Federated Distributional Reinforcement Learning with Distributional Critic Regularization | cs.LG | 84 | Fed distributional RL preserves tail risks via Wasserstein barycenter trust region; safety-critical fit | federated-learning, distributional-rl, risk-sensitive, trust-region, safety-critical

AI Paper Insights Brief

2026-04-26

0) Key takeaways (read this first)

  • "Evaluation is the bottleneck" is becoming concrete and actionable: several papers propose diagnostic benchmarks/metrics that expose hidden failure modes (contamination detected via syntactic brittleness in NL2SQL; "right answer, wrong reasoning" in TPS engineering; entity-level failures in medical QA; functional vs. literal ToM gaps in dialogue).
  • Uncertainty- and risk-aware control is moving from theory into agent practice: a web agent uses dual-level uncertainty to switch planning modes and drive MCTS; federated RL uses a CVaR-weighted distributional critic plus a trust region to curb tail risk and drift under heterogeneity.
  • Structured intermediate representations keep emerging as a robustness lever: typed skill DAGs with local repair (GraSP), Gherkin behavioral contracts for vulnerability detection (Phoenix), semi-structured belief states for forecasting (BLF), and verifiable step traces for multimodal table reasoning (V-tableR1).
  • Verification + RL are converging on process supervision: quantum reasoning fuses deterministic verifiers into a verification-aware reward model (VRM) for RLVR; table reasoning scores step fidelity with a critic VLM and gates the reward (PGPO).
  • Hybrid deployment patterns (edge + cloud, multi-agent teams) are becoming more principled: an SLM learns when and how to ask an LLM for help under privacy/efficiency constraints; delegation is calibrated with context-conditioned Beta posteriors, reducing misrouting on GAIA/SWE-bench.

2) Key themes (clustered)

Theme: uncertainty and risk as first-class control signals (agents + RL)

Theme: structured representations + local repair beat flat prompting

Theme: benchmarks targeting hidden failure modes (contamination, omission, process errors, fairness)

Theme: verification-aware RL / process supervision (text + multimodal)

  • Why it matters: "plausible but wrong" is often a process failure; domains with verifiers (physics, tables) offer dense feedback, which reduces hallucination-like errors.
  • Representative papers
  • Common methods
    • Replace sparse outcome rewards with richer signals: deterministic verifiers + semantic scoring (VRM); critic-scored step fidelity (V-tableR1).
    • Align the optimization granularity with agent causality: a step-level MDP with step-level GAE/PPO (StepPO).
    • Ablations show that removing the verifier/process reward degrades performance.
  • Open questions / failure modes
    • Compute and complexity: large critics (e.g., a 32B critic VLM) and verifier suites add cost.
    • Generality: quantum/table domains are highly verifiable; transfer to open-world tasks remains unclear.
    • Empirical breadth: StepPO's evidence is mainly a single HotpotQA curve; broader reporting is missing.
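The verifier-gated reward pattern these papers share can be sketched minimally as follows; `Step`, `process_reward`, and the gating rule are illustrative assumptions, not any paper's actual implementation:

```python
# Hedged sketch of a verifier-gated process reward: a deterministic check
# must pass before the softer critic score is allowed to contribute.
from dataclasses import dataclass
from typing import List

@dataclass
class Step:
    claim: str             # intermediate claim produced by the policy
    passes_verifier: bool  # outcome of a deterministic domain check
    semantic_score: float  # softer critic score in [0, 1]

def process_reward(steps: List[Step]) -> float:
    """Mean per-step reward; a failed verifier zeroes that step's contribution."""
    if not steps:
        return 0.0
    total = sum(s.semantic_score for s in steps if s.passes_verifier)
    # failed steps contribute 0: the deterministic gate, not the critic, has veto power
    return total / len(steps)
```

The point of the gate is that the critic can only grade steps the verifier already accepts, which matches the ablation finding that removing the verifier hurts.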

Theme: safety and privacy in real pipelines (hybrid inference, code security, certified robustness, memory forgetting)

3) Technical synthesis

  • Intermediate representations are a recurring "control plane": DAGs (GraSP), contracts (Phoenix), belief JSON (BLF), rubric dimensions (TPS-CalcBench), axis decompositions (Four-Axis Decision Alignment), and verifiable step traces (V-tableR1) all create hooks for checks, rewards, and repair.
  • Risk/uncertainty is being operationalized as gating: planning-mode switching (WebUncertainty), a delegation margin δ with LCB-style scores (CADMAS-CTX), and a trust-region shrink–squash constraint (FedDistRL) all implement "act only when confident enough".
  • Evaluation is shifting from averages to diagnostics with guarantees: QuickScope uses COUP-style adaptive sampling + certification; SPENCE uses Kendall τ sensitivity; TPS-CalcBench uses noise sensitivity and quadrant analysis (result-high/process-low).
  • Retrieval is no panacea: VB-Score shows RAG lifts aggregate scores while entity F1 stays below 10%; RADAR shows retrieval noise is exactly where multi-agent debate helps; WebUncertainty uses uncertainty-aware search rather than "more retrieval".
  • Process supervision via verifiers/critics is converging across modalities: VRM fuses deterministic SES signals with semantic scores; V-tableR1 gates rewards with a critic; both report ablations where removing the verifier hurts.
  • Multi-agent systems are getting more statistical: CADMAS-CTX uses Beta posteriors + variance penalties; QuickScope uses confidence bounds; BLF aggregates across multiple trials with hierarchical calibration.
  • "Right answer, wrong reasoning" is becoming measurable across domains: TPS-CalcBench explicitly scores formula choice/assumptions; V-tableR1 uses grounding scores; medical QA shows semantic similarity can mask entity failures.
  • Cost control is increasingly built in: RADAR early-stops; WebUncertainty reports inference-time reductions over baselines; QuickScope batches; BLF uses K=5 trials with calibration/shrinkage; FSFM caps memory at 70%.
  • Federated and decentralized trends: FedDistRL keeps policies local and federates critics; CADMAS-CTX keeps beliefs agent-local (no global store), emphasizing scalable decentralized collaboration.
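The Beta-posterior, variance-penalized gating behind calibrated delegation can be sketched as below; the class, parameter names, and margin `delta` are hypothetical illustrations, not CADMAS-CTX's actual algorithm:

```python
# Illustrative sketch (not the paper's method): track a Beta posterior over
# an agent's success rate and delegate only when a variance-penalized
# lower-confidence-bound score clears a margin delta.
import math

class CapabilityBelief:
    def __init__(self, alpha: float = 1.0, beta: float = 1.0):
        self.alpha = alpha  # prior + observed successes
        self.beta = beta    # prior + observed failures

    def update(self, success: bool) -> None:
        if success:
            self.alpha += 1.0
        else:
            self.beta += 1.0

    def mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)

    def variance(self) -> float:
        n = self.alpha + self.beta
        return (self.alpha * self.beta) / (n * n * (n + 1.0))

    def lcb(self, k: float = 1.0) -> float:
        # variance-penalized lower bound: uncertain agents score lower
        return self.mean() - k * math.sqrt(self.variance())

def should_delegate(belief: CapabilityBelief, delta: float = 0.6, k: float = 1.0) -> bool:
    return belief.lcb(k) >= delta
```

A fresh, uninformative belief fails the gate even though its mean is 0.5; only accumulated evidence of success lifts the lower bound past the margin, which is the "reduce misdelegation via uncertainty" mechanism in miniature.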

4) Top 5 papers (with "why now")

1) Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence

  • Builds ~1,978 executable environments and ~19,822 tools, with verifiable task synthesis (graph-based + programmatic).
  • Trains agents with multi-environment GRPO and a self-evolution loop (diagnose failures → generate targeted tasks/environments).
  • Shows monotonic gains over evolution rounds (e.g., τ2-Bench: 14B goes 60.2→63.5→65.4).
  • Caveats: heavy reliance on LLM-driven mining/synthesis/diagnosis; diminishing returns beyond ~500 environments.

2) GraSP: Graph-Structured Skill Compositions for LLM Agents

  • Compiles retrieved skills into a typed DAG (state/data/ordering edges), with verifiers and local repair operators (REBIND/INSERTPREREQ/SUBSTITUTE/REWIRE/BYPASS).
  • Reports up to +19 reward points and up to 41% fewer environment steps; more robust to over-retrieval and skill-quality degradation.
  • The "compilation layer" view is readily actionable for anyone building skill libraries.
  • Caveats: DAG acyclicity limits loops/iterative refinement; depends on LLM compilation quality and tuned routing thresholds.
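The verify-then-locally-repair loop can be illustrated with a deliberately simplified sketch; only an INSERTPREREQ-style operator is modeled, and all names are assumptions rather than GraSP's API:

```python
# Hedged sketch of "compile, verify, locally repair" for skill plans.
# A plan is an ordered list of skills; prereqs maps each skill to the
# skills that must appear earlier. Typed edges and the other repair
# operators from the paper are not modeled.
from typing import Dict, List, Set

def missing_prereqs(plan: List[str], prereqs: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Verifier: for each skill, which prerequisites don't appear earlier in the plan?"""
    seen: Set[str] = set()
    problems: Dict[str, Set[str]] = {}
    for skill in plan:
        missing = prereqs.get(skill, set()) - seen
        if missing:
            problems[skill] = missing
        seen.add(skill)
    return problems

def insert_prereq_repair(plan: List[str], prereqs: Dict[str, Set[str]]) -> List[str]:
    """Local repair: splice missing prerequisites in just before the skill that needs them.
    Single pass: prerequisites of inserted skills are not recursively resolved."""
    repaired: List[str] = []
    seen: Set[str] = set()
    for skill in plan:
        for dep in sorted(prereqs.get(skill, set()) - seen):
            repaired.append(dep)
            seen.add(dep)
        repaired.append(skill)
        seen.add(skill)
    return repaired
```

The appeal of the pattern is that the repair is local (a splice near the failing node) rather than a full replan, which is what makes it cheap to run inside an agent loop.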

3) Learning to Seek Help: Dynamic Collaboration Between Small and Large Language Models

  • An RL-trained SLM decides when and how to query an LLM; the reward mixes EM quality, efficiency, and a privacy-leakage penalty.
  • Reports quality gains (+14.5%–17.4% over SLM-CoT; +2.8%–9.9% over static interaction) while reducing turns and cutting leakage by 24.3%–32.4%.
  • Shows transfer: a policy trained against one LLM generalizes to stronger unseen LLMs.
  • Caveats: SLMs with weak instruction following need a supervised cold start; the privacy evaluation relies on LLM judges.
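A minimal sketch of this kind of mixed reward; the weights and signal names are assumptions for illustration, not the paper's exact formulation:

```python
# Hypothetical reward shaping for a help-seeking policy: answer quality
# minus a per-call cost for querying the LLM minus a privacy-leakage penalty.
def help_seeking_reward(exact_match: bool, llm_calls: int, leaked_items: int,
                        call_cost: float = 0.05, leak_cost: float = 0.2) -> float:
    quality = 1.0 if exact_match else 0.0
    return quality - call_cost * llm_calls - leak_cost * leaked_items
```

Sweeping `leak_cost` against the leakage rate is exactly the kind of penalty-weight tracking suggested in the next-steps section below.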

4) TPS-CalcBench: … Analytical Calculation Competence in Hypersonic TPS Engineering

  • A 420-item curated benchmark with dual-track scoring: numeric correctness + an 8-dimension process rubric (formula choice, units, plausibility, assumptions, etc.).
  • Finds a sizable "result-high/process-low" quadrant (~11–14%) and identifies formula choice as the dominant failure mode (~18% of annotated errors).
  • Demonstrates diagnose→intervene: RAG-EQ, DFA-TPS fine-tuning, and PA-CoT improve KPIs and reduce hallucination-like errors.
  • Caveats: the LLM-judged rubric carries ±3–5 KPI uncertainty; the benchmark is currently 420 items and domain-specific.
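The dual-track scoring idea can be sketched as below; the tolerance, threshold, and rubric keys are invented for illustration and do not reproduce the benchmark's rubric:

```python
# Hedged sketch of dual-track scoring: a numeric check plus a process
# rubric, combined into a quadrant label that surfaces
# "right answer, wrong reasoning" cases.
from typing import Dict

def numeric_ok(predicted: float, reference: float, rel_tol: float = 0.05) -> bool:
    """Result track: relative-error check against the reference value."""
    return abs(predicted - reference) <= rel_tol * abs(reference)

def quadrant(predicted: float, reference: float, rubric: Dict[str, float],
             process_threshold: float = 0.7) -> str:
    """Process track: mean rubric score; then classify into a quadrant."""
    result_high = numeric_ok(predicted, reference)
    process_high = (sum(rubric.values()) / len(rubric)) >= process_threshold
    if result_high and not process_high:
        return "result-high/process-low"   # right answer, wrong reasoning
    if result_high and process_high:
        return "good"
    if process_high:
        return "result-low/process-high"
    return "bad"
```

Reporting the quadrant counts, not just mean accuracy, is what makes the "answer right, reasoning wrong" population visible at all.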

5) SPENCE: A Syntactic Probe for Detecting Contamination in NL2SQL Benchmarks

  • Controlled paraphrase probe: fix the schema+SQL, generate paraphrases, rank them by dependency-tree edit distance, and measure how accuracy degrades as syntactic distance grows.
  • Older benchmarks (Spider/SParC/CoSQL) show strongly negative Kendall τ while BIRD is near zero, yielding an actionable "benchmark trustworthiness" signal.
  • Robustness checks: paraphraser choice, temperature, and controls for length and lexical overlap.
  • Caveats: the temporal gradient is correlational; paraphrase generation/filtering may introduce bias; execution accuracy conflates multiple error sources.
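The core diagnostic, correlating syntactic distance with accuracy via Kendall τ, is easy to reproduce; below is a generic tie-free Kendall τ in plain Python (not SPENCE's code), and the data in the test is fabricated for illustration:

```python
# Pure-Python Kendall tau (no tie correction): the fraction of concordant
# minus discordant pairs. A strongly negative tau between syntactic
# distance and accuracy suggests memorization rather than robust parsing.
from typing import Sequence

def kendall_tau(x: Sequence[float], y: Sequence[float]) -> float:
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

With real data, a library implementation with tie handling (e.g. `scipy.stats.kendalltau`) is preferable; this O(n²) version is only meant to make the sensitivity-curve statistic concrete.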

5) Practical next steps

  • Adopt diagnostics-first evaluation: beyond aggregate accuracy, add at least one sensitivity curve (e.g., SPENCE-style syntactic distance, retrieval-noise stress tests) and one process metric (a TPS-style rubric or VB-Score components).
  • Add explicit uncertainty gating to agents: prototype task-level mode switching (explicit vs. implicit planning) and action-level uncertainty-aware search (MCTS reward modulation), and measure the reduction in cascading errors.
  • Move orchestration to structured artifacts: try compiling tool/skill plans into typed DAGs with verifiers + local repair; compare step counts and recovery rates against flat ReAct+skills.
  • For safety-critical reasoning domains, wire verifiers into RL: implement a VRM-like fusion of deterministic checks + semantic scoring, or critic-gated process rewards, and ablate each verification component.
  • Edge/cloud hybrid deployment: train or simulate a "help-seeking policy" optimizing (quality, turns, privacy), and track privacy-leakage rates under different penalty weights.
  • Security engineering: if you maintain a code assistant, evaluate it with SecLogging-style patterns (sanitization/masking, injection) and add functional patch tests for repairs (not just similarity).
  • Memory governance: adopt forgetting policies with auditable importance scores; measure retrieval latency and "dangerous-content retention" before and after pruning; treat forgetting as a safety control, not just a cost control.
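A forgetting policy with auditable importance scores can be prototyped in a few lines; the 70% cap echoes the FSFM number mentioned above, while the scoring input and function names are placeholder assumptions:

```python
# Hedged sketch of auditable memory pruning: keep the highest-importance
# items under a capacity budget and return the dropped items as an audit log.
from typing import List, Tuple

def prune_memory(items: List[Tuple[str, float]],
                 cap_ratio: float = 0.7) -> Tuple[List[Tuple[str, float]], List[Tuple[str, float]]]:
    """items: (content, importance). Returns (kept, forgotten-for-audit)."""
    budget = int(len(items) * cap_ratio)
    ranked = sorted(items, key=lambda kv: kv[1], reverse=True)
    return ranked[:budget], ranked[budget:]
```

Returning the forgotten items, rather than silently discarding them, is what makes the "dangerous-content retention" measurement and the audit trail possible.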

Generated from per-paper analyses; no external browsing.