AI Paper Daily (2026-04-27)

Published:

English version: /paper-news/2026-04-27/

Run statistics

  • Candidate papers: 4394
  • Selected papers: 30
  • Deep reads completed: 30
  • Time window (UTC): 2026-04-24T00:00:00Z → 2026-04-25T00:00:00Z (weekend_backlog_sat, expanded=0)
Paper list used for summarization (arXiv ID | Title | Categories | Score | Selection rationale | Tags):

  • 2604.17745 | HiRAS: A Hierarchical Multi-Agent Framework for Paper-to-Code Generation and Execution | cs.CL | 90 | Hierarchical multi-agent paper-to-code reproduction + improved Paper2Code eval protocol (P2C-Ex). | agents, paper-to-code, reproducibility, evaluation, multi-agent, automation
  • 2604.21510 | OptiVerse: A Comprehensive Benchmark towards Optimization Problem Solving | cs.CL | 90 | OptiVerse: 1k optimization problems; a 22-LLM eval shows a big drop on hard tasks; strong benchmark value. | benchmark, LLM-evaluation, optimization, reasoning, tool-use
  • 2604.19633 | Time Series Augmented Generation for Financial Applications | cs.AI, cs.CE | 90 | Benchmark for LLM agents doing verifiable financial time-series tool use; strong eval focus. | agents, tool-use, evaluation, benchmarks, finance, verifiable-tools
  • 2604.21882 | Revisiting Non-Verbatim Memorization in Large Language Models: The Role of Entity Surface Forms | cs.CL | 90 | RedirectQA probes factual recall vs name/surface-form access; key for reliability & memorization evals. | LLMs, memorization, factuality, evaluation, datasets, entity-linking, robustness
  • 2604.20572 | Ask Only When Needed: Proactive Retrieval from Memory and Skills for Experience-Driven Lifelong Agents | cs.CL | 90 | Proactive memory/skill retrieval for lifelong agents; strong agentic relevance and reusable framework. | agents, lifelong-learning, memory, retrieval, tool-use, online-learning
  • 2604.20621 | SoK: The Next Frontier in AV Security: Systematizing Perception Attacks and the Emerging Threat of Multi-Sensor Fusion | cs.CR | 88 | SoK of AV perception attacks incl. multi-sensor fusion threats; taxonomy + gaps for defenses. | security, autonomous-vehicles, perception-attacks, sensor-fusion, survey, robustness
  • 2604.21192 | How VLAs (Really) Work In Open-World Environments | cs.RO, cs.AI | 88 | Critiques VLA evals for hiding unsafe behaviors; proposes safety-relevant analysis in open-world settings. | robotics, VLA, safety-evaluation, open-world, long-horizon, deployment
  • 2604.20711 | Participatory provenance as representational auditing for AI-mediated public consultation | cs.AI, cs.HC | 88 | Audits input fidelity of AI summarization for public consultation via provenance/optimal-transport metrics. | auditing, summarization, governance, evaluation, optimal-transport, causal-inference, public-policy
  • 2604.21725 | AEL: Agent Evolving Learning for Open-Ended Environments | cs.CL, cs.AI, cs.CE | 88 | Two-timescale learning for long-horizon LLM agents: adaptive memory retrieval + reflection updates. | llm-agents, continual-learning, memory, retrieval-policy, reflection, bandits
  • 2604.19016 | AlignCultura: Towards Culturally Aligned Large Language Models? | cs.CL | 86 | UNESCO-grounded cultural alignment dataset/pipeline for HHH evaluation; useful for safety + fairness audits. | cultural-alignment, evaluation, dataset, HHH, fairness, safety
  • 2604.21193 | Trust but Verify: Introducing DAVinCI -- A Framework for Dual Attribution and Verification in Claim Inference for Language Models | cs.AI | 86 | DAVinCI attributes + verifies claims with calibration; targets hallucinations and interpretability in LMs. | factuality, hallucinations, verification, attribution, calibration, reliability
  • 2604.21579 | A Metamorphic Testing Approach to Diagnosing Memorization in LLM-Based Program Repair | cs.SE, cs.AI | 86 | Metamorphic testing to expose memorization/data leakage in LLM program-repair evaluations. | data-leakage, memorization, evaluation, program-repair, metamorphic-testing, software-engineering
  • 2603.17883 | SoK: From Silicon to Netlist and Beyond - Two Decades of Hardware Reverse Engineering Research | cs.CR | 86 | Comprehensive SoK on hardware reverse engineering; strong security relevance and reusable overview. | security, SoK, hardware, reverse-engineering, supply-chain, verification
  • 2604.12596 | KumoRFM-2: Scaling Foundation Models for Relational Learning | cs.LG, cs.AI | 86 | Foundation model for relational DBs; avoids flattening, supports ICL + finetuning, temporal consistency. | foundation-models, relational-learning, in-context-learning, databases, tabular, pretraining
  • 2604.19087 | OLLM: Options-based Large Language Models | cs.AI | 86 | Latent “options” for next-token prediction; controllable diversity/search with minimal params on pretrained LLMs. | LLM, latent-variable, decoding, reasoning, controllability, efficient-adaptation
  • 2604.21232 | ReCAPA: Hierarchical Predictive Correction to Mitigate Cascading Failures | cs.AI | 86 | Hierarchical predictive correction for VLA agents to prevent cascading multi-step failures. | agents, VLA, planning, robustness, error-correction, multimodal
  • 2604.18356 | ComPASS: Towards Personalized Agentic Social Support via Tool-Augmented Companionship | cs.CL | 86 | Tool-augmented companionship + new benchmark for personalized social support; relevant to agent eval/safety. | agents, tool-use, evaluation, benchmarks, personalization, social-support
  • 2604.20511 | CHASM: Unveiling Covert Advertisements on Chinese Social Media | cs.LG, cs.AI, cs.CL, cs.CV, cs.CY | 85 | CHASM dataset for multimodal covert-ad detection; concrete, safety-adjacent eval data from a real platform. | datasets, evaluation, multimodal, content-moderation, security, adversarial
  • 2604.21917 | CrossCommitVuln-Bench: A Dataset of Multi-Commit Python Vulnerabilities Invisible to Per-Commit Static Analysis | cs.CR, cs.SE | 84 | Benchmark of multi-commit CVEs that evade per-commit SAST; strong for secure code/agent tooling evals. | security, benchmark, vulnerabilities, SAST, software-engineering, datasets
  • 2604.17816 | Privacy-Preserving Product-Quantized Approximate Nearest Neighbor Search Framework for Large-scale Datasets via A Hybrid of Fully Homomorphic Encryption and Trusted Execution Environment | cs.CR | 84 | Privacy-preserving ANN for embeddings using FHE + TEE; relevant to secure RAG/vector DBs. | privacy, security, ANN, vector-search, FHE, TEE, RAG
  • 2604.19172 | Reasoning-Aware AIGC Detection via Alignment and Reinforcement | cs.AI | 84 | New multi-domain AIGC-detection dataset + reasoning-chain detector trained with RL for robustness. | AIGC-detection, datasets, robustness, reasoning, RL, misinformation, evaluation
  • 2604.21396 | VG-CoT: Towards Trustworthy Visual Reasoning via Grounded Chain-of-Thought | cs.CV, cs.AI | 84 | Automated dataset linking each visual-reasoning step to image regions for trustworthy LVLM eval. | vision-language, grounding, chain-of-thought, trustworthiness, dataset, evaluation
  • 2604.11529 | TempusBench: An Evaluation Framework for Time-Series Forecasting | cs.LG | 84 | Much-needed TS foundation-model eval framework; tackles dataset leakage/metadata issues and standardization. | evaluation, benchmarks, time-series, foundation-models, data-contamination, forecasting
  • 2604.18459 | Progressive Online Video Understanding with Evidence-Aligned Timing and Transparent Decisions | cs.CV, cs.AI | 84 | Streaming video agent: evidence-aligned response timing + transparent decision-making under compute limits. | video-LLM, online-inference, agent, transparency, evaluation, multimodal
  • 2604.05966 | FinReporting: An Agentic Workflow for Localized Reporting of Cross-Jurisdiction Financial Disclosures | cs.CL | 84 | Auditable agentic workflow with ontology mapping + anomaly logging for verified financial reporting. | agents, auditing, verification, ontology, information-extraction, LLM-workflows
  • 2604.21916 | MathDuels: Evaluating LLMs as Problem Posers and Solvers | cs.CL, cs.SE | 84 | Self-play math benchmark where models both pose and solve; better capability separation than static tests. | evaluation, benchmarks, math, self-play, adversarial-testing, LLMs
  • 2604.08352 | Security Concerns in Generative AI Coding Assistants: Insights from Online Discussions on GitHub Copilot | cs.SE, cs.CR, cs.HC | 82 | Empirical security concerns for GenAI coding assistants from GitHub Copilot discussions. | LLM-security, coding-assistants, developer-practice, secure-software, human-factors
  • 2604.21416 | CSC: Turning the Adversary's Poison against Itself | cs.CR, cs.AI | 82 | Backdoor defense using latent-space cluster dynamics; targets poisoning without heavy utility loss. | backdoors, data-poisoning, robustness, defense, security, representation-learning
  • 2604.18543 | ClawEnvKit: Automatic Environment Generation for Claw-Like Agents | cs.AI, cs.CL | 82 | Auto-generates and validates agent environments from NL specs; scalable eval/training infra for agents. | agents, environment-generation, evaluation, benchmarks, validation, tool-interfaces
  • 2604.19090 | Dual-Guard: Dual-Channel Latent Watermarking for Provenance and Tamper Localization in Diffusion Images | cs.CR | 82 | Dual-channel watermarking for diffusion provenance + tamper localization; practical integrity angle. | provenance, watermarking, diffusion, content-integrity, tamper-detection, forensics

AI Paper Insights Briefing

2026-04-27

0) Key takeaways (read this first)

  • Evaluation is becoming the bottleneck, and papers are responding with auditability-first frameworks: multiple works introduce benchmarks and protocols that explicitly target data leakage, hallucinated evaluation, or missing safety signals (TempusBench, HiRAS/P2C-Ex, TSAG, OptiVerse/DVA-Agent, MathDuels, CHASM, B1K safety Q-scores).
  • Agent progress is shifting from "more tools" to better control of when and why to use tools and memory: making proactive retrieval an explicit action supervised with RL (PROACTAGENT) and selecting retrieval policies with a bandit plus reflection (AEL) both yield large gains; strong ablation evidence points to actually using experience as the key lever.
  • Security research is emphasizing system-level gaps over isolated attacks: AV perception attacks remain understudied at the fusion layer (AV SoK + a cross-modal spoofing PoC), and software-security datasets expose common pipeline blind spots (CrossCommitVuln-Bench shows per-commit SAST misses multi-commit chains; Copilot discussions surface licensing/provenance and insecure-suggestion concerns).
  • Provenance and integrity work is converging on practical, deployable signals: diffusion-image watermarking is moving beyond global provenance to tamper localization (Dual-Guard, via dual latent channels), and public-consultation summaries are being audited with input-representation fidelity metrics (participatory provenance).
  • Privacy-preserving retrieval is approaching interactive scale: a hybrid FHE+TEE+PQ design reports sequential encrypted ANN at >50 QPS with Recall@10 > 0.9 on million-scale datasets (PPPQ-ANN), though access-pattern leakage remains unaddressed.

2) Key themes (clusters)

Theme: benchmarks and evaluations that resist leakage, resist "hallucinated scoring", and restore missing safety signals

Theme: proactive memory/retrieval as a learnable action in lifelong agents

  • Why it matters: long-horizon agents often fail from context overload or missing key experience; learning when to retrieve improves both success rate and efficiency.
  • Representative papers
  • Common approaches
    • Treat retrieval as explicit control rather than passive RAG (PROACTAGENT adds retrieval to the action space).
    • Provide step-level supervision via counterfactual contrast (PROACTRL uses paired branch rollouts with and without retrieval).
    • Maintain typed memories/skills (facts, episodes, success/failure skills; AEL's episodic→semantic→procedural hierarchy).
    • Use lightweight online adaptation (AEL picks among retrieval policies with a Thompson-sampling bandit).
  • Open problems / failure modes
    • PROACTRL assumes prefixes can be replayed for paired rollouts; this can break in stochastic or non-replayable environments.
    • Memory growth and eviction policies remain immature (PROACTAGENT notes there is no capacity cap or learned eviction).
    • Cross-module credit assignment is still hard (AEL finds more sophisticated credit methods actually hurt performance in noisy environments).
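The lightweight online adaptation described above can be sketched as a Beta-Bernoulli Thompson-sampling bandit over a small set of retrieval policies. This is a minimal illustration of the technique, not AEL's interface: the policy names and the binary success signal are assumptions.

```python
import random

class RetrievalPolicyBandit:
    """Thompson sampling over retrieval policies (Beta-Bernoulli).

    Illustrative sketch of the kind of selector AEL is described as
    using; policy names and the reward signal are assumptions.
    """

    def __init__(self, policies):
        self.policies = list(policies)
        # Beta(1, 1) prior on each policy's success probability.
        self.alpha = {p: 1.0 for p in self.policies}
        self.beta = {p: 1.0 for p in self.policies}

    def select(self):
        # Sample a plausible success rate per policy; act greedily on the samples.
        draws = {p: random.betavariate(self.alpha[p], self.beta[p])
                 for p in self.policies}
        return max(draws, key=draws.get)

    def update(self, policy, success):
        # A binary task outcome updates only the chosen policy's posterior.
        if success:
            self.alpha[policy] += 1.0
        else:
            self.beta[policy] += 1.0

bandit = RetrievalPolicyBandit(["episodic", "semantic", "procedural", "none"])
choice = bandit.select()
bandit.update(choice, success=True)
```

Because exploration comes from posterior sampling rather than an epsilon schedule, the selector needs no tuning as the task distribution drifts, which is what makes it attractive for open-ended environments.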

Theme: agent environment generation and scalable agent evaluation

  • Why it matters: hand-built environments and benchmarks neither scale nor stay fresh; automatic generation enables continuous, leakage-resistant evaluation.
  • Representative papers
  • Common approaches
    • Generate tasks from natural-language specs with a validation loop (ClawEnvKit's Parser/Generator/Validator → executable sandboxed tasks).
    • Co-evolve difficulty via self-play to avoid benchmark ceilings (MathDuels: models both pose and solve problems; Rasch/IRT ranking).
    • Prefer deterministic checks and bound LLM-judge influence (ClawEnvKit uses 15 deterministic checks and caps the llm_judge weight).
  • Open problems / failure modes
    • Sim-to-real gap: mock services lack authentication, rate limiting, and schema drift (a ClawEnvKit limitation).
    • Self-play still produces many non-discriminative problems (MathDuels reports ~39% are solved by every non-author model).
    • Verifier dependence: automatic validation of edge cases still relies on the model backbone (MathDuels).
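The "deterministic checks first, bounded judge influence" pattern can be sketched as a weighted score in which the LLM judge contributes at most a fixed cap. The cap value and check names below are illustrative assumptions, not ClawEnvKit's actual configuration.

```python
def score_task(check_results, judge_score, judge_weight_cap=0.2):
    """Combine deterministic pass/fail checks with a capped LLM-judge score.

    check_results: dict of check name -> bool (deterministic verdicts)
    judge_score:   float in [0, 1] from an LLM judge
    The judge can contribute at most judge_weight_cap of the final score,
    so deterministic checks dominate. Sketch only; cap and check names
    are assumptions, not ClawEnvKit's code.
    """
    if not check_results:
        raise ValueError("need at least one deterministic check")
    det = sum(check_results.values()) / len(check_results)
    w = judge_weight_cap
    return (1.0 - w) * det + w * judge_score

# All deterministic checks pass but the judge is unsure: score stays high.
s = score_task({"files_created": True, "api_called": True}, judge_score=0.5)
```

The design choice is that a noisy or adversarially-prompted judge can shift a task score by at most the cap, which keeps automatically generated tasks auditable.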

Theme: provenance, integrity, and representational auditing

Theme: security and privacy gaps across the modern AI pipeline (hardware, AV fusion, code, retrieval)

3) Technical synthesis

  • Multiple papers converge on auditable decomposition: split the system into stages with explicit state labels and logs (FinReporting's OK/MISSING/PARSE_ERROR; ClawEnvKit's audit logs; participatory-provenance metrics; DAVinCI's attribution + verification).
  • Counterfactual evaluation is becoming a core tool: PROACTRL's paired rollouts (retrieve vs. no-retrieve) echo OptiVerse's dual views (text→math vs. code→math) and HiRAS's focus on the gap between execution-based and code-only scoring.
  • Benchmarks increasingly include hard negatives that resemble positives (CHASM includes product shares that are not ads; CrossCommitVuln-Bench requires each individual commit to look benign; AIGC-text-bank includes AI-Polish).
  • A clear trend: measure what classic metrics miss
    • Timing bias and transparency in streaming video (Thinking-QwenVL).
    • Safety violations and non-target object interactions in embodied tasks (B1K's sQ/seQ).
    • Representational exclusion in summaries (coverage + W2 + concept recall/precision).
  • Several works show that once code or text looks plausible, execution and environment become the dominant failure mode (HiRAS: import/env issues can sharply lower scores).
  • Leakage-aware evaluation is appearing across domains: time series (TempusBench), program repair (metamorphic testing + NLL), optimization (OptiVerse filters out web-findable solutions + strict numerical verification).
  • Tool-use systems are moving toward bounded decision spaces to reduce unsafe autonomy (FinReporting's KEEP/REPAIR/NEED_REVIEW; ClawEnvKit's deterministic checks + capped judge weight).
  • Security SoKs flag a meta-problem shared with ML: scarce, fragile artifacts (a 4% reproduction rate in HRE) mirror stale/leaky agent benchmarks and concerns about hallucinating judges in evaluation.
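The auditable-decomposition pattern above (explicit per-field state labels instead of silent failures) can be sketched as follows. The label set mirrors the OK/MISSING/PARSE_ERROR labels reported for FinReporting, but the record fields and parsing logic are illustrative assumptions.

```python
def audit_extract(record, fields):
    """Extract fields with an explicit state label per field.

    record: raw input dict
    fields: dict of field name -> parser callable
    Returns {field: (state, value)} where state is one of
    "OK", "MISSING", "PARSE_ERROR", so every outcome is logged
    and downstream stages can be audited instead of failing silently.
    Field names and parsers here are hypothetical examples.
    """
    out = {}
    for name, parser in fields.items():
        if name not in record:
            out[name] = ("MISSING", None)
            continue
        try:
            out[name] = ("OK", parser(record[name]))
        except (ValueError, TypeError):
            # Keep the raw value so an auditor can inspect the failure.
            out[name] = ("PARSE_ERROR", record[name])
    return out

result = audit_extract(
    {"revenue": "1,2x0", "currency": "EUR"},
    {"revenue": lambda s: float(s.replace(",", "")),
     "currency": str,
     "fiscal_year": int},
)
```

The point of the pattern is that the audit log is a total function of the input: every field ends in exactly one labeled state, so coverage gaps are visible by construction.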

4) Top 5 papers (with "why now")

1) ClawEnvKit: Automatic Environment Generation for Claw-Like Agents

  • Automates environment creation from natural-language specs into executable tasks E=(P,M,C), with a validation/regeneration loop.
  • Releases Auto-ClawEval (1,040 tasks); no model saturates completion rate (34%–76%), making it suitable for frontier tracking.
  • Shows the evaluation harness matters: a structured harness gains up to 15.7 points over ReAct.
  • Caveat: mock services may not transfer to real APIs (authentication, schema drift, rate limits).

2) Ask Only When Needed: Proactive Retrieval from Memory and Skills for Experience-Driven Lifelong Agents

  • Makes retrieval a first-class action and trains it with paired branch rollouts for step-level supervision.
  • Strong ablation signal: removing PROACTRL drops SciWorld SR from 73.50% to 26.50%.
  • Improves efficiency (fewer turns/tokens) alongside success rate.
  • Caveat: relies on a replayability assumption; memory growth/eviction is unresolved.
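The paired-branch supervision signal can be sketched as a counterfactual advantage: fork a replayable prefix into a retrieve branch and a no-retrieve branch and use the return difference as the step-level label. The function names and rollout interface are illustrative assumptions, not PROACTRL's implementation.

```python
def counterfactual_advantage(replay_prefix, rollout, n=4):
    """Step-level supervision for a retrieve/no-retrieve decision.

    replay_prefix: state that can be replayed deterministically (the
                   assumption the paper is described as relying on)
    rollout(prefix, retrieve): returns an episode return in [0, 1]
    Averages n rollouts per branch and returns
    mean(retrieve) - mean(no retrieve); positive means the step
    should be labeled "retrieve". Illustrative sketch only.
    """
    ret = sum(rollout(replay_prefix, retrieve=True) for _ in range(n)) / n
    skip = sum(rollout(replay_prefix, retrieve=False) for _ in range(n)) / n
    return ret - skip

# Toy environment where retrieval reliably helps at this step.
adv = counterfactual_advantage(
    "prefix-state",
    rollout=lambda prefix, retrieve: 0.9 if retrieve else 0.4,
)
label = "retrieve" if adv > 0 else "skip"
```

Note how the sketch makes the caveat concrete: if the environment is stochastic and the prefix cannot be replayed, the two branches no longer share a starting state and the difference stops being a counterfactual.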

3) Participatory provenance as representational auditing for AI-mediated public consultation

  • Proposes concrete metrics (coverage, W2 gap, AIPW causal attribution, concept fidelity) to audit the representational fidelity of input→summary.
  • Empirical finding: official summaries score below a random baseline on coverage and exclude roughly 15–17% of participants, concentrated in opposition clusters.
  • Ships an actionable tool (Co-creation Provenance Lab) supporting an audit-and-revise workflow.
  • Caveat: embedding-based proxy metrics, and sensitivity to the choice of embedding, affect individual-level rankings.
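The concept-fidelity side of such an audit can be sketched with set-based concept recall/precision between participant inputs and the official summary. This is a loose analogy: the concept sets here are assumed to come from some upstream tagger, and the paper's full audit also uses embedding-based coverage, a W2 gap, and AIPW attribution, which this sketch omits.

```python
def concept_fidelity(input_concepts, summary_concepts):
    """Concept recall/precision between participant inputs and a summary.

    input_concepts:   set of concepts found across participant inputs
    summary_concepts: set of concepts found in the summary
    Low recall flags representational exclusion: input concepts
    (e.g., whole opposition clusters) that never reach the summary.
    Sketch only; concept extraction is assumed done upstream.
    """
    hit = input_concepts & summary_concepts
    recall = len(hit) / len(input_concepts) if input_concepts else 1.0
    precision = len(hit) / len(summary_concepts) if summary_concepts else 1.0
    return recall, precision

recall, precision = concept_fidelity(
    {"cost", "safety", "noise", "land-use", "opposition:traffic"},
    {"cost", "safety", "benefits"},
)
```

Even this crude version surfaces the failure pattern the paper reports: a summary can look precise while recalling less than half of the input concepts.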

4) Dual-Guard: Dual-Channel Latent Watermarking for Provenance and Tamper Localization in Diffusion Images

  • Combines a robust global provenance anchor (GS in z_T) with a learnable spatial encoder (in z0) to localize tampering.
  • Reports near-perfect closed-set provenance separation and ≥99.9% detection across diverse edit/tamper scenarios, with coarse localization (16×16 blocks).
  • The complementarity is explicit: GS alone misses local edits; the dual channel fixes that.
  • Caveat: depends on owner-side round-trip reference latents; adaptive white-box attacks are not evaluated.

5) OptiVerse: A Comprehensive Benchmark towards Optimization Problem Solving

  • Extends optimization evaluation to six domains and shows large headroom (top models at roughly 25–27% on Hard).
  • Identifies the dominant failure mode as semantic modeling/logic errors (code runs, but the answer is wrong).
  • DVA-Agent's dual-view audit lifts Hard accuracy (e.g., 16.67% → 24.33% for Qwen3-235B-Instruct), triggering edits in only ~23–32% of cases.
  • Caveat: textbook/exam distributions may not represent complex industrial optimization; the evaluation pipeline is computationally expensive.
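The dual-view idea generalizes beyond optimization: derive the answer through two independent routes and only accept when they agree. A minimal sketch under assumptions: the two solver callables and the disagreement policy below are hypothetical stand-ins, and the real agent edits the formalization rather than merely flagging it.

```python
def dual_view_audit(problem, text_to_math, code_to_math, tol=1e-6):
    """Audit an answer by deriving it through two independent views.

    text_to_math(problem): objective value via a text -> formalization route
    code_to_math(problem): objective value via a code -> formalization route
    Agreement within tol -> accept; disagreement -> flag for an edit/
    repair step (the minority of cases where DVA-Agent is reported
    to intervene). Hypothetical sketch, not DVA-Agent's procedure.
    """
    a = text_to_math(problem)
    b = code_to_math(problem)
    if abs(a - b) <= tol:
        return {"status": "accept", "value": a}
    return {"status": "needs_edit", "views": (a, b)}

verdict = dual_view_audit(
    "maximize 3x + 2y s.t. x + y <= 4, 0 <= x, y <= 3",
    text_to_math=lambda p: 11.0,   # stand-in solver outputs
    code_to_math=lambda p: 10.0,
)
```

The key property is that a semantic modeling error (the dominant failure mode above) rarely reproduces identically through both routes, so disagreement is a cheap detector for "runs but wrong".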

5) Practical next steps

  • For agent builders: implement retrieval-as-action and train with counterfactual rollouts (retrieve vs. no-retrieve) for step-level supervision; track success rate and efficiency (turns/tokens) together.
  • For benchmark maintainers: add metamorphic variants (semantics-preserving transformations) and report robustness deltas; pair them with a familiarity proxy (e.g., NLL under an open-source model) where available.
  • For evaluation pipelines: adopt dual-view audits (spec→formalization vs. code→formalization) in any domain where "it runs" does not imply "it is correct" (optimization, data pipelines, ETL, policy rules).
  • For embodied/VLA safety: extend metrics beyond end-state success; log grasp/placement violations and non-target interactions (sQ/seQ style), and require cross-trial variance reporting.
  • For provenance/integrity: if you operate a platform, consider a closed-set verification design (registered artifacts) and add localization signals rather than binary provenance alone.
  • For privacy-preserving retrieval: prototype hybrid designs (PQ + cryptography + enclaves), but explicitly measure what still leaks (e.g., access patterns) and document threat-model boundaries.
  • For security tooling: evaluate cross-commit chains (not just snapshots) in SAST/CI and add history-aware checks; quantify the gap with datasets like CrossCommitVuln-Bench.
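The "report robustness deltas" recommendation can be sketched as follows: score a model on original items and on semantics-preserving variants, and report the accuracy drop (a large drop on items the model assigns low NLL is the memorization signature). The predict/transform/scoring hooks are hypothetical placeholders.

```python
def robustness_delta(items, predict, transform, is_correct):
    """Accuracy on originals vs. semantics-preserving variants.

    items:      list of benchmark items
    predict:    model under test, item -> answer
    transform:  semantics-preserving rewrite, item -> item
    is_correct: (original item, answer) -> bool; ground truth is
                unchanged by the transform, by construction
    Returns (acc_original, acc_variant, delta); a large positive
    delta suggests the original score leans on surface memorization.
    """
    orig = sum(is_correct(it, predict(it)) for it in items) / len(items)
    var = sum(is_correct(it, predict(transform(it))) for it in items) / len(items)
    return orig, var, orig - var

# Toy model that only recognizes the exact original phrasing.
items = ["fix bug A", "fix bug B", "fix bug C", "fix bug D"]
orig_acc, var_acc, delta = robustness_delta(
    items,
    predict=lambda it: "patch" if it.startswith("fix") else "unknown",
    transform=lambda it: it.replace("fix", "repair"),
    is_correct=lambda it, ans: ans == "patch",
)
```

Reporting the pair (original accuracy, delta) rather than a single number is the actionable change: two models with equal headline accuracy can have very different deltas.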

Generated from per-paper analyses; no external browsing was performed.