AI Paper Daily (2026-04-19)

Published:

English version: /paper-news/2026-04-19/

Run statistics

  • Candidate papers: 3640
  • Selected papers: 30
  • Deep reads completed: 30
  • Time window (UTC): 2026-04-17T00:00:00Z → 2026-04-18T00:00:00Z (weekend_backlog_sun, expanded=0)
Papers used for the summary (one entry per paper: arXiv ID · title (categories · score), selection rationale, tags):

  • 2604.11477 · OOM-RL: Out-of-Money Reinforcement Learning Market-Driven Alignment for LLM-Based Multi-Agent Systems (cs.AI, cs.SE, q-fin.TR · score 90)
    Why: Alignment via real-market feedback; tackles evaluator hacking/sycophancy in multi-agent systems
    Tags: agents, alignment, multi-agent, RL, evaluation, reward-hacking, software-engineering
  • 2604.12311 · Is Vibe Coding the Future? An Empirical Assessment of LLM Generated Codes for Construction Safety (cs.SE, cs.AI, cs.HC · score 86)
    Why: Large empirical study of LLM-generated code silent failures in safety-critical construction workflows
    Tags: LLM reliability, code generation, safety-critical, evaluation, silent failures, software engineering
  • 2604.12995 · PolicyLLM: Towards Excellent Comprehension of Public Policy for Large Language Models (cs.CL, cs.CY · score 86)
    Why: Large-scale policy reasoning benchmark (21K, US-China); strong eval value for real deployments.
    Tags: benchmark, evaluation, policy, reasoning, LLM
  • 2604.07039 · AEROS: A Single-Agent Operating Architecture with Embodied Capability Modules (cs.RO, cs.AI · score 86)
    Why: Single-agent robot OS w/ capability modules; control authority + explicit safety constraints focus
    Tags: agents, robotics, agent-architecture, runtime, safety-constraints, tool-use
  • 2603.29953 · Structured Intent as a Protocol-Like Communication Layer: Cross-Model Robustness, Framework Comparison, and the Weak-Model Compensation Effect (cs.AI, cs.HC · score 86)
    Why: Structured intent protocol improves cross-model/language goal preservation; large controlled study + user study
    Tags: prompting, structured-intent, robustness, cross-model, evaluation, human-study
  • 2604.11655 · RPA-Check: A Multi-Stage Automated Framework for Evaluating Dynamic LLM-based Role-Playing Agents (cs.CL, cs.AI, cs.MA · score 86)
    Why: Automated, multi-stage eval for LLM role-playing agents: role adherence, consistency, long-horizon stability.
    Tags: LLM-evaluation, agents, role-playing, checklists, long-horizon, reliability
  • 2603.17331 · Federated Computing as Code (FCaC): Sovereignty-aware Systems by Design (cs.CR · score 86)
    Why: Declarative sovereignty constraints for federated systems; relevant to secure AI deployment governance.
    Tags: federated-computing, security, policy, sovereignty, systems
  • 2604.05738 · MedLayBench-V: A Large-Scale Benchmark for Expert-Lay Semantic Alignment in Medical Vision Language Models (cs.CL · score 86)
    Why: Large multimodal benchmark for expert→lay medical VLM alignment; targets hallucination risk
    Tags: benchmark, medical-vlm, semantic-alignment, hallucinations, evaluation, dataset
  • 2604.07102 · The Impact of Steering Large Language Models with Persona Vectors in Educational Applications (cs.CL, cs.AI · score 86)
    Why: Activation steering shifts grading/quality; shows risks of persona control in education
    Tags: llm-steering, persona-vectors, reliability, calibration, education, evaluation
  • 2604.00560 · Quantum-Safe Code Auditing: LLM-Assisted Static Analysis and Quantum-Aware Risk Scoring for Post-Quantum Cryptography Migration (cs.CR, cs.SE, quant-ph · score 86)
    Why: LLM-assisted static analysis + quantum-aware risk scoring for PQC migration; practical security tooling.
    Tags: cybersecurity, static-analysis, LLM-tools, post-quantum-crypto, risk-scoring, software-security
  • 2604.12168 · Fully Homomorphic Encryption on Llama 3 model for privacy preserving LLM inference (cs.CR, cs.AI · score 84)
    Why: FHE inference on Llama 3 targets privacy-preserving LLM deployment; concrete security angle
    Tags: privacy, FHE, LLM-inference, cryptography, deployment, security
  • 2604.12258 · Coding-Free and Privacy-Preserving MCP Framework for Clinical Agentic Research Intelligence System (cs.CL, cs.AI · score 84)
    Why: Agentic clinical research via MCP with explicit privacy-preserving tool/data separation; relevant to safe tool use
    Tags: agents, tool use, privacy, MCP, clinical, system design
  • 2604.12282 · Towards Robust Real-World Spreadsheet Understanding with Multi-Agent Multi-Format Reasoning (cs.CL · score 84)
    Why: Multi-agent, multi-format spreadsheet understanding for huge sheets; practical agentic workflow setting.
    Tags: agents, tool-use, document-understanding, spreadsheets, long-context
  • 2604.01841 · Retrieval-aligned Tabular Foundation Models Enable Robust Clinical Risk Prediction in Electronic Health Records Under Real-world Constraints (cs.AI · score 84)
    Why: EHR tabular ICL benchmark + task-aligned retrieval (AWARE) for shift/imbalance; practical robustness focus
    Tags: tabular-foundation-models, retrieval-augmented, EHR, distribution-shift, robustness, benchmark
  • 2604.11188 · MathAgent: Adversarial Evolution of Constraint Graphs for Mathematical Reasoning Data Synthesis (cs.CL, cs.AI · score 84)
    Why: Adversarial constraint-graph evolution to synthesize harder math reasoning data; reduces mode collapse vs prompts.
    Tags: reasoning, data-synthesis, math, adversarial, constraint-graphs, LLM-training
  • 2603.17570 · FoMo X: Modular Explainability Signals for Outlier Detection Foundation Models (cs.LG, cs.AI · score 84)
    Why: Intrinsic explainability/uncertainty signals for outlier-detection foundation models; safety-critical OD use.
    Tags: foundation-models, outlier-detection, explainability, uncertainty, tabular
  • 2604.07274 · A Systematic Study of Retrieval Pipeline Design for Retrieval-Augmented Medical Question Answering (cs.CL, cs.AI, cs.LG · score 84)
    Why: Systematic RAG pipeline ablations for medical QA; practical grounding + design guidance
    Tags: RAG, medical-QA, retrieval, reranking, evaluation, grounding
  • 2604.08203 · MedVR: Annotation-Free Medical Visual Reasoning via Agentic Reinforcement Learning (cs.CV, cs.AI · score 82)
    Why: Agentic RL for medical VLM visual grounding using uncertainty; targets visual hallucination risk
    Tags: VLM, medical AI, hallucinations, uncertainty, reinforcement learning, grounding
  • 2604.06079 · Scientific Graphics Program Synthesis via Dual Self-Consistency Reinforcement Learning (cs.CV, cs.AI · score 82)
    Why: High-quality SciTikZ-230K + closed-loop RL for image→TikZ synthesis; reusable dataset + eval gaps.
    Tags: multimodal, program-synthesis, dataset, reinforcement-learning, evaluation
  • 2604.07035 · Gemma 4, Phi-4, and Qwen3: Accuracy-Efficiency Tradeoffs in Dense and MoE Reasoning Language Models (cs.CL · score 82)
    Why: Careful dense vs MoE reasoning LMs benchmark under realistic prompting/inference constraints
    Tags: llm, reasoning, moe, efficiency, benchmarking, evaluation
  • 2603.09940 · SignalMC-MED: A Multimodal Benchmark for Evaluating Biosignal Foundation Models on Single-Lead ECG and PPG (cs.LG · score 82)
    Why: Large multimodal biosignal FM benchmark (ECG+PPG), 20 tasks, systematic eval; reusable for reliability testing
    Tags: benchmark, foundation-models, biosignals, ECG, PPG, evaluation, healthcare
  • 2603.30025 · ContextClaim: A Context-Driven Paradigm for Verifiable Claim Detection (cs.CL · score 82)
    Why: Context-aware verifiable-claim detection; better front-end for fact-checking pipelines than claim-only baselines
    Tags: misinformation, fact-checking, verifiability, retrieval, NLP, evaluation
  • 2604.01870 · Towards Intrinsically Calibrated Uncertainty Quantification in Industrial Data-Driven Models via Diffusion Sampler (cs.LG, eess.SY · score 82)
    Why: Diffusion posterior sampling for intrinsically calibrated UQ; strong reliability angle for industrial ML.
    Tags: uncertainty-quantification, diffusion, calibration, posterior-sampling, reliability
  • 2603.29142 · REFINE: Real-world Exploration of Interactive Feedback and Student Behaviour (cs.AI, cs.HC · score 82)
    Why: Locally deployable multi-agent LLM feedback system with judge-guided regeneration + tool use
    Tags: llm-agents, multi-agent, llm-judge, tool-calling, education, reliability
  • 2604.08333 · Lost in the Hype: Revealing and Dissecting the Performance Degradation of Medical Multimodal Large Language Models in Image Classification (cs.CV, cs.AI, cs.LG · score 82)
    Why: Finds medical MLLMs underperform; probes where degradation comes from across 14 models
    Tags: multimodal-LLM, medical-imaging, evaluation, representation-probing, reliability
  • 2604.00931 · PsychAgent: An Experience-Driven Lifelong Learning Agent for Self-Evolving Psychological Counselor (cs.AI · score 82)
    Why: Lifelong-learning counselor agent with memory/planning + skill evolution; high-stakes deployment implications.
    Tags: agents, lifelong-learning, memory, planning, healthcare, safety-reliability
  • 2603.15216 · vCause: Efficient and Verifiable Causality Analysis for Cloud-based Endpoint Auditing (cs.CR · score 80)
    Why: Verifiable causality analysis for cloud endpoint auditing; integrity against untrusted cloud results.
    Tags: security, auditing, provenance, verifiable-computation, cloud
  • 2604.05930 · "I See What You Did There": Can Large Vision-Language Models Understand Multimodal Puns? (cs.CL, cs.AI · score 80)
    Why: MultiPun benchmark + adversarial distractors exposes VLM weakness on multimodal wordplay; useful eval artifact
    Tags: VLM, benchmark, multimodal, evaluation, robustness, dataset
  • 2603.15091 · Trustworthy Koopman Operator Learning: Invariance Diagnostics and Error Bounds (math.NA, cs.LG, math.DS, math.OC · score 80)
    Why: Diagnostics + error bounds to certify trustworthy Koopman learning; actionable guarantees for dynamics ML.
    Tags: robustness, verification, error-bounds, koopman, dynamical-systems
  • 2604.12719 · Monte Carlo Stochastic Depth for Uncertainty Estimation in Deep Learning (cs.LG, stat.ML · score 79)
    Why: MC Stochastic Depth for scalable uncertainty; theory + detection benchmark for safety-critical UQ
    Tags: uncertainty, bayesian, stochastic-depth, reliability, safety, object-detection

AI Paper Insight Brief

2026-04-19

0) Key takeaways (read these first)

  • "Better representations" are often about aligning with the downstream interface, not scale: in biosignals, longer context beats larger models (SignalMC-MED); in tabular EHR, task-aligned retrieval embeddings beat naive similarity (AWARE); in Koopman learning, invariance diagnostics (PAD) determine whether spectra/forecasts can be trusted.
  • Verification is shifting from "trust the cloud/model" to "prove it": vCause provides cryptographic proofs for causality queries over provenance graphs; FCaC pushes cross-organization admission toward proof-carrying capability tokens; FHE-on-Llama-3 demonstrates that partially encrypted inference is feasible.
  • Agentic systems are converging on "tool use + gating + auditing": MedVR does visual grounding with RL using tool rewards and consensus credit assignment; SpreadsheetAgent runs extract → verify → solve with multi-format tools; OOM-RL adds hard external constraints (capital loss) and a strict test lock to resist reward hacking/test gaming.
  • Prompt/steering interventions are powerful but have sharp edges: structured intent formats markedly reduce cross-language variance and help weaker models (PPS/5W3H/CO-STAR/RISEN), but can hit an "encoding overhead" boundary; persona-vector steering degrades the quality of educational responses and induces systematic grading bias (especially in MoE models).
  • "Vibe coding" in safety-critical settings is unsafe by default for now: construction-safety code generation shows high silent-failure rates even when scripts execute successfully, which calls for deterministic wrappers, validation, and domain constraints.
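
The "deterministic wrapper" point can be made concrete by treating "executed without error" and "passed domain validation" as separate outcomes, so silent failures become measurable. A minimal sketch (the `run_with_validator` helper, the `result` convention, and the guardrail rule are all illustrative, not from the paper):

```python
# Minimal sketch: wrap execution of a generated snippet with a domain validator,
# so "runs without error" is never mistaken for "correct". Names are illustrative.

def run_with_validator(source: str, validator) -> dict:
    """Execute generated code in an isolated namespace and validate its result.

    The snippet is expected to define a variable `result`; the validator
    returns True only if `result` satisfies the domain constraints.
    """
    ns: dict = {}
    try:
        exec(source, ns)                  # use a real sandbox in deployments
    except Exception as e:
        return {"status": "crash", "error": repr(e)}
    if "result" not in ns:
        return {"status": "silent_failure", "reason": "no result produced"}
    if not validator(ns["result"]):
        return {"status": "silent_failure", "reason": "constraint violated"}
    return {"status": "ok", "result": ns["result"]}

# Toy domain rule: a guardrail height must be at least 1.0 m.
validator = lambda r: isinstance(r, (int, float)) and r >= 1.0

ok = run_with_validator("result = 1.1", validator)
silent = run_with_validator("result = 0.2", validator)  # runs fine, wrong answer
```

The silent-failure rate of a generator is then simply the fraction of outputs landing in the `silent_failure` bucket rather than `crash` or `ok`.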

1) Key themes (clusters)

Theme: retrieval/representation alignment under real-world constraints

Theme: proof-carrying integrity and sovereignty for distributed systems

Theme: agentic tool use with verification/gating loops

Theme: evaluation and diagnostics for behavioral reliability (not just accuracy)

Theme: lightweight uncertainty/explainability signals for deployment

2) Technical synthesis

  • Several papers converge on gating as the core anti-hallucination primitive: compile gating (SciTikZ), correctness-conditioned tool rewards (MedVR), policy-engine blocking (AEROS), an RO-lock preventing test mutation (OOM-RL), and cryptographic verification (vCause/FCaC).
  • Representation diagnostics are becoming first-class citizens: Koopman's PAD quantifies invariance via principal angles; Feature Health Scores locate where medical MLLMs lose discriminative signal; AWARE explicitly shapes retrieval neighborhoods with SNNL.
  • Under frozen settings, longer context beats larger models: SignalMC-MED finds 10-minute signals deliver gains more consistently than scaling parameters; likewise, structured intent helps weaker models more (weak-model compensation).
  • Retrieval is moving earlier in the pipeline: ContextClaim moves retrieval into claim detection; the medical-QA study isolates retrieval components; SpreadsheetAgent iteratively retrieves/inspects local regions rather than serializing the whole sheet.
  • Tool-use RL is adopting self-supervised credit assignment: MedVR's consensus masking (CCA) and entropy-triggered branching (EVR) echo a broader trend of shaping exploration without labels via ensemble/consensus signals.
  • MoE vs dense is no free lunch: the deployment-oriented benchmark shows mid-size MoE can be Pareto-superior (Gemma-4-E4B), while the persona-steering study suggests MoE graders may drift in calibration more than dense models.
  • "It runs" ≠ "it is correct" is now quantified: the construction-safety study introduces silent-failure rates for scripts that execute; medical MLLMs show generative pipelines can still trail discriminative baselines despite larger scale.
  • Privacy/security tooling is integrating with LLMs: quantum-safe auditing augments static analysis with LLM contextual annotation; CARIS keeps raw clinical data behind tools via MCP; FHE integrates encrypted inference into attention.
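
The shared gating pattern across these papers reduces to a few lines: a candidate is only released once an independent checker accepts it, and the system refuses rather than emit unchecked output. A minimal sketch with toy stand-ins for the generator and the compile check (all names here are illustrative):

```python
# Sketch of the "gate before accept" pattern shared by compile gating, tool-reward
# conditioning, and policy blocking: output is only released if an independent
# checker accepts it. `generate` and `check` are placeholders for a model call
# and a compiler/validator.

def gated_generate(generate, check, max_attempts: int = 3):
    """Return the first candidate that passes the gate, else None."""
    for attempt in range(max_attempts):
        candidate = generate(attempt)
        if check(candidate):          # e.g. "does the TikZ source compile?"
            return candidate
    return None                       # refuse rather than emit unchecked output

# Toy stand-ins: the generator only produces balanced braces on attempt 2.
candidates = ["\\draw {", "\\draw {}{", "\\draw {ok}"]
generate = lambda i: candidates[i]
check = lambda s: s.count("{") == s.count("}")

accepted = gated_generate(generate, check)
```

The design choice worth copying is that the gate is cheap and deterministic while the generator is expensive and stochastic, so the gate can be run on every candidate.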

3) Top 5 papers (with "why now")

1) MedVR: Annotation-Free Medical Visual Reasoning via Agentic Reinforcement Learning

  • Trains a medical VLM with RL (GRPO) to interleave text with explicit visual actions (Zoom-in), without intermediate annotations.
  • Introduces EVR (entropy-guided branching) and CCA (consensus-based credit assignment), making tool use learnable and grounded.
  • Substantial gains on medical VQA and grounding (e.g., large mIoU improvements; annotation-free CCA nearly reaches the supervised upper bound on one grounding task).
  • Caveats: RL is compute-heavy with many rollouts; the action space is currently limited to a single Zoom-in tool; the clinical soundness of intermediate steps has not been expert-validated.
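
The consensus idea behind CCA can be sketched without any labels: the majority answer across rollouts acts as a pseudo-label, and only rollouts agreeing with it receive positive reward. This is an illustration of the general pattern, not MedVR's exact reward shaping:

```python
from collections import Counter

# Illustrative consensus-based credit assignment: with no ground truth, the
# majority answer across rollouts serves as a pseudo-label, and only rollouts
# that agree with it are rewarded. MedVR's actual scheme may differ in detail.

def consensus_rewards(rollout_answers):
    """Map each rollout's final answer to a {0,1} reward via majority vote."""
    counts = Counter(rollout_answers)
    consensus, support = counts.most_common(1)[0]
    # Require a minimum level of agreement before trusting the vote.
    if support / len(rollout_answers) < 0.5:
        return [0.0] * len(rollout_answers)  # abstain: no reliable pseudo-label
    return [1.0 if a == consensus else 0.0 for a in rollout_answers]

rewards = consensus_rewards(["pneumonia", "pneumonia", "effusion", "pneumonia"])
```

The abstention branch matters: when rollouts disagree broadly, rewarding the plurality answer would amplify noise instead of signal.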

2) vCause: Efficient and Verifiable Causality Analysis for Cloud-based Endpoint Auditing

  • Provides cryptographic proofs (membership plus backward/forward causality components) for causality queries over versioned provenance graphs.
  • Scales to large logs (reported: 25M logs processed; 49 ms proof generation for 100K-node causality) with low endpoint overhead (<1%).
  • Gives a formal security reduction under standard assumptions; suited to incident response where the cloud is untrusted.
  • Caveats: confidentiality is out of scope; segment depth and commitment intervals introduce operational trade-offs.
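
The kind of membership proof this line of work builds on can be illustrated with a plain Merkle tree: the verifier keeps only a root hash, and inclusion of any log entry is checkable with a logarithmic-size proof. A simplified sketch (vCause's versioned accumulator and recursive digests are considerably more elaborate):

```python
import hashlib

# Minimal Merkle membership proof: commit to a root, then prove inclusion of
# one log entry with only the sibling hashes along its root path.

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_tree(leaves):
    """Return the list of levels; level[0] = hashed leaves, last = [root]."""
    level = [h(x) for x in leaves]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:                 # duplicate last node on odd levels
            level = level + [level[-1]]
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def prove(levels, index):
    """Collect the sibling hashes needed to recompute the root from one leaf."""
    proof = []
    for level in levels[:-1]:
        if len(level) % 2:
            level = level + [level[-1]]
        sibling = level[index ^ 1]         # XOR with 1 flips to the sibling
        proof.append((sibling, index % 2 == 0))
        index //= 2
    return proof

def verify(leaf, proof, root):
    node = h(leaf)
    for sibling, leaf_is_left in proof:
        node = h(node + sibling) if leaf_is_left else h(sibling + node)
    return node == root

logs = [b"open /etc/passwd", b"exec sh", b"connect 10.0.0.1"]
levels = build_tree(logs)
root = levels[-1][0]
ok = verify(b"exec sh", prove(levels, 1), root)
```

The operational point from the paper carries over even to this toy: proof size and verification cost grow with tree depth, not with the total number of logs.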

3) SignalMC-MED: A Multimodal Benchmark for Evaluating Biosignal Foundation Models on Single-Lead ECG and PPG

  • Establishes a realistic long-duration (10-minute) synchronized ECG+PPG benchmark with 20 clinical tasks and chronological splits.
  • Finds that domain biosignal FMs beat general time-series FMs, that ECG+PPG fusion helps, and that under frozen linear probing longer signals help more than scaling parameters.
  • Shows hand-crafted physiological features remain strong and complementary to FM embeddings.
  • Caveats: frozen linear probing + mean pooling may underestimate gains from fine-tuning and richer temporal aggregation; a single health system limits generalization.
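
The frozen linear-probe protocol behind these comparisons is easy to reproduce: mean-pool per-timestep embeddings from a frozen encoder, then fit only a linear head on top. A self-contained sketch on synthetic stand-in data (the "encoder" and labels below are fabricated for illustration; the benchmark's actual encoders and tasks are not reproduced here):

```python
import numpy as np

# Sketch of a frozen linear probe: mean-pool embeddings from a frozen encoder,
# then fit only a linear head (closed-form ridge regression on {0,1} targets).

rng = np.random.default_rng(0)

def mean_pool(token_embeddings: np.ndarray) -> np.ndarray:
    """(time, dim) per-window embeddings -> single (dim,) vector."""
    return token_embeddings.mean(axis=0)

# Fake "frozen encoder": class-1 windows get a shifted embedding mean.
def fake_encode(label: int) -> np.ndarray:
    return rng.normal(loc=label * 0.8, scale=1.0, size=(50, 16))

labels = np.array([i % 2 for i in range(200)])
X = np.stack([mean_pool(fake_encode(y)) for y in labels])

# Linear probe = ridge regression on 0/1 targets, thresholded at 0.5.
Xb = np.hstack([X, np.ones((len(X), 1))])            # append a bias column
w = np.linalg.solve(Xb.T @ Xb + 1e-2 * np.eye(Xb.shape[1]), Xb.T @ labels)
preds = (Xb @ w > 0.5).astype(int)
accuracy = (preds == labels).mean()
```

Because only `w` is trained, probe accuracy measures what the frozen representation already encodes, which is exactly why the paper's caveat (mean pooling may undersell fine-tuning) applies.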

4) Towards Robust Real-World Spreadsheet Understanding with Multi-Agent Multi-Format Reasoning

  • A practical multi-agent loop: incremental extraction → verification → solving, preserving layout via image + LaTeX + code-execution tools.
  • Consistent gains on SpreadsheetBench/RealHiTBench; ablations show verification and multi-format tools are critical.
  • Offers an actionable template for documents that do not fit in context.
  • Caveats: relies on models with strong multimodal/tool capabilities; runtime/memory cost is non-trivial (reported ~97 s average extraction; 21 GB peak).
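
The extract → verify → solve loop can be captured as a small skeleton in which only verified extractions ever enter the agent's working state. All functions below are stand-ins for model/tool calls, not the paper's implementation:

```python
# Skeleton of an incremental extract -> verify -> solve loop over a huge sheet:
# extract one region at a time, verify it before trusting it, then compute.

def extract(sheet, region):
    """Stand-in for an LLM extraction call over one region of a large sheet."""
    return [sheet[r] for r in region]

def verify(rows):
    """Cheap deterministic check, e.g. every row has the expected arity."""
    return all(len(r) == 2 for r in rows)

def solve(rows):
    return sum(value for _, value in rows)

def agent_loop(sheet, regions):
    trusted = []
    for region in regions:
        rows = extract(sheet, region)
        if verify(rows):                 # gate: only verified rows enter state
            trusted.extend(rows)
        # else: a real agent would re-extract via a different tool/format
    return solve(trusted)

sheet = {0: ("Q1", 10), 1: ("Q2", 12), 2: ("bad row",), 3: ("Q3", 9)}
total = agent_loop(sheet, regions=[[0, 1], [2], [3]])
```

The key property is that a malformed region degrades the answer locally instead of corrupting the whole computation, which is what the ablations on verification are measuring.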

5) Structured Intent as a Protocol-Like Communication Layer: Cross-Model Robustness, Framework Comparison, and the Weak-Model Compensation Effect

  • A large factorial study (3,240 outputs) shows structured intent formats improve goal alignment and reduce cross-language variance by roughly 24×.
  • Identifies weak-model compensation (structured prompts help weaker baselines more) and a capacity/overhead boundary (too much structure can hurt).
  • Includes a user study (N=50): AI-expanded 5W3H reduces interaction turns and improves satisfaction.
  • Caveats: ceiling effects and single-judge bias with LLM-as-judge; the user study's ordering was not counterbalanced.
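
A structured-intent prompt of the 5W3H kind can be generated mechanically, which is what makes it protocol-like: fields are explicit and missing ones are detectable before the model is ever called. A sketch assuming the common 5W3H field set (who/what/when/where/why + how/how much/how long); the paper's exact schema may differ:

```python
# Illustrative helper that renders a request as a 5W3H structured-intent block
# instead of free-form prose. Field names follow the common 5W3H convention.

FIELDS = ["who", "what", "when", "where", "why", "how", "how_much", "how_long"]

def structured_intent(**kwargs) -> str:
    missing = [f for f in FIELDS if f not in kwargs]
    if missing:
        raise ValueError(f"unfilled 5W3H fields: {missing}")
    return "\n".join(f"{f.upper()}: {kwargs[f]}" for f in FIELDS)

prompt = structured_intent(
    who="a junior data analyst",
    what="summarize weekly sales by region",
    when="every Monday 09:00 UTC",
    where="the `sales_2026` table",
    why="to brief the ops standup",
    how="a 5-bullet plain-text summary",
    how_much="under 150 words",
    how_long="covering the previous 7 days",
)
```

The validation step is the point: an unfilled field fails loudly at construction time rather than silently degrading goal preservation downstream.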

4) Practical next steps

  • If deploying RAG/RA-TICL in healthcare: measure the label purity of retrieval neighborhoods, and try task-aligned embedding shaping (e.g., SNNL + balanced sampling) before scaling the backbone (AWARE).
  • For multimodal medical assistants: add explicit visual tool calls and track grounding metrics (IoU/footprints); consider consensus-based pseudo-supervision to avoid annotation bottlenecks (MedVR).
  • For safety-critical code generation: implement sandboxed execution + domain-logic validators and report silent-failure rates (not just "it runs"); require explicit, executable instructions (construction-safety study).
  • For cross-organization agent deployments: adopt proof-carrying admission (capability tokens + PoP), and plan revocation/freshness as first-class requirements (FCaC).
  • For cloud forensics/auditing: assess whether your provenance queries need verifiable proofs; prototype a Merkle accumulator + recursive digest approach and benchmark proof size/latency (vCause).
  • For model selection under constraints: run prompt-conditioned Pareto sweeps (accuracy vs latency/VRAM) rather than comparing parameter counts alone; watch for prompt-sensitivity pathologies (the Gemma/Phi/Qwen trade-off study).
  • For uncertainty in production: consider a single-forward-pass distilled uncertainty head (FoMo-X), or MCSD if your residual architecture already uses stochastic depth; benchmark calibration (ECE/AUARC) against accuracy trade-offs.
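
For the calibration benchmarking step, expected calibration error (ECE) is straightforward to compute: bin predictions by confidence and weight each bin's confidence-accuracy gap by its population. A minimal sketch of the standard equal-width-bin estimator:

```python
import numpy as np

# Minimal expected calibration error (ECE): bin predictions by confidence and
# compare each bin's average confidence with its empirical accuracy.

def ece(confidences, correct, n_bins: int = 10) -> float:
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = len(confidences)
    err = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            err += (mask.sum() / total) * gap   # population-weighted gap
    return err

# A perfectly calibrated toy model: confidence 0.75, correct 3 times out of 4.
score = ece([0.75, 0.75, 0.75, 0.75], [1, 1, 1, 0])
```

Note that ECE alone can be gamed by always predicting the base rate, which is why the bullet above pairs it with accuracy (and AUARC) rather than reporting it in isolation.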

Generated from per-paper analyses; no external browsing was performed.