AI Paper Daily (2026-04-19)
Published:
English version: /paper-news/2026-04-19/
Run statistics
- Candidate papers: 3640
- Selected papers: 30
- Deep reads completed: 30
- Time window (UTC): 2026-04-17T00:00:00Z → 2026-04-18T00:00:00Z (weekend_backlog_sun, expanded=0)
Expand to view the list of papers used for this summary
| arXiv ID | Title / Link | Categories | Score | Selection Rationale | Tags |
|---|---|---|---|---|---|
| 2604.11477 | OOM-RL: Out-of-Money Reinforcement Learning Market-Driven Alignment for LLM-Based Multi-Agent Systems | cs.AI, cs.SE, q-fin.TR | 90 | Alignment via real-market feedback; tackles evaluator hacking/sycophancy in multi-agent systems | agents, alignment, multi-agent, RL, evaluation, reward-hacking, software-engineering |
| 2604.12311 | Is Vibe Coding the Future? An Empirical Assessment of LLM Generated Codes for Construction Safety | cs.SE, cs.AI, cs.HC | 86 | Large empirical study of LLM-generated code silent failures in safety-critical construction workflows | LLM reliability, code generation, safety-critical, evaluation, silent failures, software engineering |
| 2604.12995 | PolicyLLM: Towards Excellent Comprehension of Public Policy for Large Language Models | cs.CL, cs.CY | 86 | Large-scale policy reasoning benchmark (21K, US-China); strong eval value for real deployments. | benchmark, evaluation, policy, reasoning, LLM |
| 2604.07039 | AEROS: A Single-Agent Operating Architecture with Embodied Capability Modules | cs.RO, cs.AI | 86 | Single-agent robot OS w/ capability modules, control authority + explicit safety constraints focus | agents, robotics, agent-architecture, runtime, safety-constraints, tool-use |
| 2603.29953 | Structured Intent as a Protocol-Like Communication Layer: Cross-Model Robustness, Framework Comparison, and the Weak-Model Compensation Effect | cs.AI, cs.HC | 86 | Structured intent protocol improves cross-model/language goal preservation; large controlled study + user study | prompting, structured-intent, robustness, cross-model, evaluation, human-study |
| 2604.11655 | RPA-Check: A Multi-Stage Automated Framework for Evaluating Dynamic LLM-based Role-Playing Agents | cs.CL, cs.AI, cs.MA | 86 | Automated, multi-stage eval for LLM role-playing agents: role adherence, consistency, long-horizon stability. | LLM-evaluation, agents, role-playing, checklists, long-horizon, reliability |
| 2603.17331 | Federated Computing as Code (FCaC): Sovereignty-aware Systems by Design | cs.CR | 86 | Declarative sovereignty constraints for federated systems; relevant to secure AI deployment governance. | federated-computing, security, policy, sovereignty, systems |
| 2604.05738 | MedLayBench-V: A Large-Scale Benchmark for Expert-Lay Semantic Alignment in Medical Vision Language Models | cs.CL | 86 | Large multimodal benchmark for expert→lay medical VLM alignment; targets hallucination risk | benchmark, medical-vlm, semantic-alignment, hallucinations, evaluation, dataset |
| 2604.07102 | The Impact of Steering Large Language Models with Persona Vectors in Educational Applications | cs.CL, cs.AI | 86 | Activation steering shifts grading/quality; shows risks of persona control in education | llm-steering, persona-vectors, reliability, calibration, education, evaluation |
| 2604.00560 | Quantum-Safe Code Auditing: LLM-Assisted Static Analysis and Quantum-Aware Risk Scoring for Post-Quantum Cryptography Migration | cs.CR, cs.SE, quant-ph | 86 | LLM-assisted static analysis + quantum-aware risk scoring for PQC migration; practical security tooling. | cybersecurity, static-analysis, LLM-tools, post-quantum-crypto, risk-scoring, software-security |
| 2604.12168 | Fully Homomorphic Encryption on Llama 3 model for privacy preserving LLM inference | cs.CR, cs.AI | 84 | FHE inference on Llama 3 targets privacy-preserving LLM deployment; concrete security angle | privacy, FHE, LLM-inference, cryptography, deployment, security |
| 2604.12258 | Coding-Free and Privacy-Preserving MCP Framework for Clinical Agentic Research Intelligence System | cs.CL, cs.AI | 84 | Agentic clinical research via MCP with explicit privacy-preserving tool/data separation; relevant to safe tool use | agents, tool use, privacy, MCP, clinical, system design |
| 2604.12282 | Towards Robust Real-World Spreadsheet Understanding with Multi-Agent Multi-Format Reasoning | cs.CL | 84 | Multi-agent, multi-format spreadsheet understanding for huge sheets; practical agentic workflow setting. | agents, tool-use, document-understanding, spreadsheets, long-context |
| 2604.01841 | Retrieval-aligned Tabular Foundation Models Enable Robust Clinical Risk Prediction in Electronic Health Records Under Real-world Constraints | cs.AI | 84 | EHR tabular ICL benchmark + task-aligned retrieval (AWARE) for shift/imbalance; practical robustness focus | tabular-foundation-models, retrieval-augmented, EHR, distribution-shift, robustness, benchmark |
| 2604.11188 | MathAgent: Adversarial Evolution of Constraint Graphs for Mathematical Reasoning Data Synthesis | cs.CL, cs.AI | 84 | Adversarial constraint-graph evolution to synthesize harder math reasoning data; reduces mode collapse vs prompts. | reasoning, data-synthesis, math, adversarial, constraint-graphs, LLM-training |
| 2603.17570 | FoMo X: Modular Explainability Signals for Outlier Detection Foundation Models | cs.LG, cs.AI | 84 | Intrinsic explainability/uncertainty signals for outlier-detection foundation models; safety-critical OD use. | foundation-models, outlier-detection, explainability, uncertainty, tabular |
| 2604.07274 | A Systematic Study of Retrieval Pipeline Design for Retrieval-Augmented Medical Question Answering | cs.CL, cs.AI, cs.LG | 84 | Systematic RAG pipeline ablations for medical QA; practical grounding + design guidance | RAG, medical-QA, retrieval, reranking, evaluation, grounding |
| 2604.08203 | MedVR: Annotation-Free Medical Visual Reasoning via Agentic Reinforcement Learning | cs.CV, cs.AI | 82 | Agentic RL for medical VLM visual grounding using uncertainty; targets visual hallucination risk | VLM, medical AI, hallucinations, uncertainty, reinforcement learning, grounding |
| 2604.06079 | Scientific Graphics Program Synthesis via Dual Self-Consistency Reinforcement Learning | cs.CV, cs.AI | 82 | High-quality SciTikZ-230K + closed-loop RL for image→TikZ synthesis; reusable dataset + eval gaps. | multimodal, program-synthesis, dataset, reinforcement-learning, evaluation |
| 2604.07035 | Gemma 4, Phi-4, and Qwen3: Accuracy-Efficiency Tradeoffs in Dense and MoE Reasoning Language Models | cs.CL | 82 | Careful dense vs MoE reasoning LMs benchmark under realistic prompting/inference constraints | llm, reasoning, moe, efficiency, benchmarking, evaluation |
| 2603.09940 | SignalMC-MED: A Multimodal Benchmark for Evaluating Biosignal Foundation Models on Single-Lead ECG and PPG | cs.LG | 82 | Large multimodal biosignal FM benchmark (ECG+PPG), 20 tasks, systematic eval; reusable for reliability testing | benchmark, foundation-models, biosignals, ECG, PPG, evaluation, healthcare |
| 2603.30025 | ContextClaim: A Context-Driven Paradigm for Verifiable Claim Detection | cs.CL | 82 | Context-aware verifiable-claim detection; better front-end for fact-checking pipelines than claim-only baselines | misinformation, fact-checking, verifiability, retrieval, NLP, evaluation |
| 2604.01870 | Towards Intrinsically Calibrated Uncertainty Quantification in Industrial Data-Driven Models via Diffusion Sampler | cs.LG, eess.SY | 82 | Diffusion posterior sampling for intrinsically calibrated UQ; strong reliability angle for industrial ML. | uncertainty-quantification, diffusion, calibration, posterior-sampling, reliability |
| 2603.29142 | REFINE: Real-world Exploration of Interactive Feedback and Student Behaviour | cs.AI, cs.HC | 82 | Locally deployable multi-agent LLM feedback system with judge-guided regeneration + tool use | llm-agents, multi-agent, llm-judge, tool-calling, education, reliability |
| 2604.08333 | Lost in the Hype: Revealing and Dissecting the Performance Degradation of Medical Multimodal Large Language Models in Image Classification | cs.CV, cs.AI, cs.LG | 82 | Finds medical MLLMs underperform; probes where degradation comes from across 14 models | multimodal-LLM, medical-imaging, evaluation, representation-probing, reliability |
| 2604.00931 | PsychAgent: An Experience-Driven Lifelong Learning Agent for Self-Evolving Psychological Counselor | cs.AI | 82 | Lifelong-learning counselor agent with memory/planning + skill evolution; high-stakes deployment implications. | agents, lifelong-learning, memory, planning, healthcare, safety-reliability |
| 2603.15216 | vCause: Efficient and Verifiable Causality Analysis for Cloud-based Endpoint Auditing | cs.CR | 80 | Verifiable causality analysis for cloud endpoint auditing; integrity against untrusted cloud results. | security, auditing, provenance, verifiable-computation, cloud |
| 2604.05930 | "I See What You Did There": Can Large Vision-Language Models Understand Multimodal Puns? | cs.CL, cs.AI | 80 | MultiPun benchmark + adversarial distractors exposes VLM weakness on multimodal wordplay; useful eval artifact | VLM, benchmark, multimodal, evaluation, robustness, dataset |
| 2603.15091 | Trustworthy Koopman Operator Learning: Invariance Diagnostics and Error Bounds | math.NA, cs.LG, math.DS, math.OC | 80 | Diagnostics + error bounds to certify trustworthy Koopman learning; actionable guarantees for dynamics ML. | robustness, verification, error-bounds, koopman, dynamical-systems |
| 2604.12719 | Monte Carlo Stochastic Depth for Uncertainty Estimation in Deep Learning | cs.LG, stat.ML | 79 | MC Stochastic Depth for scalable uncertainty; theory + detection benchmark for safety-critical UQ | uncertainty, bayesian, stochastic-depth, reliability, safety, object-detection |
AI Paper Insights Briefing
2026-04-19
1) Key Takeaways (read this first)
- "Better representations" are often about aligning with the downstream interface, not scale: in biosignals, longer context beats larger models (SignalMC-MED); in tabular EHR, task-aligned retrieval embeddings beat naive similarity (AWARE); in Koopman learning, invariance diagnostics (PAD) determine whether spectra/predictions can be trusted.
- Verification is shifting from "trust the cloud/model" to "prove it": vCause supplies cryptographic proofs for causal queries over provenance graphs; FCaC pushes cross-organization admission toward proof-carrying capability tokens; FHE-on-Llama-3 demonstrates the feasibility of partially encrypted inference.
- Agent systems are converging on "tool use + gating + auditing": MedVR does visual grounding with RL using tool rewards and consensus credit assignment; SpreadsheetAgent adopts extract → verify → solve with multi-format tools; OOM-RL adds hard external constraints (capital loss) and a strict test lock to resist reward hacking / test evasion.
- Prompt/steering interventions are powerful but have sharp edges: structured intent formats markedly reduce cross-language variance and help weaker models (PPS/5W3H/CO-STAR/RISEN), yet can hit an "encoding overhead" boundary; persona-vector steering degrades educational answer quality and induces systematic grading bias (especially in MoE models).
- "Vibe coding" in safety-critical settings is unsafe by default: construction-safety code generation shows high silent failure rates even when scripts execute successfully, implying the need for deterministic wrappers, validation, and domain constraints.
2) Key Themes (clustered)
Theme: Retrieval/representation alignment under real-world constraints
- Why it matters: many failures stem from mismatched representations (neighborhoods, embeddings, subspaces) rather than insufficient model capacity. Aligning representations to the task/interface improves robustness without full retraining.
- Representative papers:
- Retrieval-aligned Tabular Foundation Models Enable Robust Clinical Risk Prediction in Electronic Health Records Under Real-world Constraints
- SignalMC-MED: A Multimodal Benchmark for Evaluating Biosignal Foundation Models on Single-Lead ECG and PPG
- Trustworthy Koopman Operator Learning: Invariance Diagnostics and Error Bounds
- A Systematic Study of Retrieval Pipeline Design for Retrieval-Augmented Medical Question Answering
- Common approaches:
- Learn or select representations that form "good neighborhoods" (AWARE uses SNNL + balanced sampling + ensembling).
- Evaluate under near-deployment constraints (chronological splits; class imbalance; high dimensionality; long signals).
- Provide diagnostics/certificates for when representations fail (Koopman invariance diagnostics via principal angles / PAD).
- Open questions / failure modes:
- How to jointly optimize retrieval and in-context inference end to end (AWARE notes a retrieval–inference mismatch).
- Whether the gains persist under fine-tuning / nonlinear heads (SignalMC-MED uses frozen linear probes).
- The "signal clarity" ceiling imposed by retrieval noise and summarization (ContextClaim; medical RAG pipeline trade-offs).
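The neighborhood-shaping idea attributed to AWARE rests on the soft nearest-neighbor loss (SNNL): the loss is low when each point's exponential-weighted neighborhood is dominated by its own class. A minimal sketch, assuming a squared-Euclidean kernel; the function and toy data are illustrative, not the paper's implementation:

```python
import numpy as np

def soft_nearest_neighbor_loss(X, y, T=1.0):
    """Mean over points i of -log( sum_{j!=i, y_j=y_i} exp(-||xi-xj||^2/T)
    / sum_{j!=i} exp(-||xi-xj||^2/T) ). Low when same-class points
    dominate each point's soft neighborhood."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise sq. distances
    w = np.exp(-d2 / T)
    np.fill_diagonal(w, 0.0)                             # exclude self-similarity
    same = y[:, None] == y[None, :]
    return float(np.mean(-np.log((w * same).sum(1) / w.sum(1))))

# Geometry matching labels -> low loss; interleaved labels -> high loss.
X_good = np.array([[0, 0], [0.1, 0], [5, 5], [5.1, 5]])
y_good = np.array([0, 0, 1, 1])
X_bad = np.array([[0, 0], [5, 5], [0.1, 0], [5.1, 5]])
y_bad = np.array([0, 0, 1, 1])   # each point's nearest neighbor is other-class
```

In an AWARE-style setup this quantity would be minimized over an embedding, so that retrieval neighbors share labels before in-context inference.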
Theme: Proof-carrying integrity and sovereignty for distributed systems
- Why it matters: as more analytics and decisions move to untrusted clouds and cross-organization federations, "auditability" increasingly demands cryptographic verifiability, not just logs or policy engines.
- Representative papers:
- vCause: Efficient and Verifiable Causality Analysis for Cloud-based Endpoint Auditing
- Federated Computing as Code (FCaC): Sovereignty-aware Systems by Design
- Fully Homomorphic Encryption on Llama 3 model for privacy preserving LLM inference
- Quantum-Safe Code Auditing: LLM-Assisted Static Analysis and Quantum-Aware Risk Scoring for Post-Quantum Cryptography Migration
- Common approaches:
- Authenticated data structures + succinct proofs (vCause: hierarchical Merkle + recursive path digests).
- Stateless boundary verification via signed capability artifacts (FCaC: KYO → ECT → PoP/DPoP).
- Encrypt sensitive compute paths (FHE attention), or prioritize crypto migration through automated discovery + scoring.
- Open questions / failure modes:
- Freshness/revocation and lifecycle management (FCaC's PoC lacks immediate revocation).
- Proof/update trade-offs (vCause's segment depth L; freshness of commitment intervals).
- Practicality ceilings: quantization plus compilation/PBS overhead (FHE), and precision sensitivity to test-path exclusion (quantum-safe auditing).
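The authenticated-structure idea underlying vCause-style proofs can be sketched with a plain Merkle membership proof: the cloud returns a log entry plus sibling hashes, and the verifier recomputes the committed root. A textbook sketch under stated assumptions (the paper's hierarchical/recursive construction is more involved; none of this is its code):

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    """Root of a binary Merkle tree over the hashed leaves."""
    level = [h(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:                     # duplicate last node if odd
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def merkle_proof(leaves, index):
    """Sibling hashes plus side flags needed to recompute the root."""
    level, proof = [h(leaf) for leaf in leaves], []
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        proof.append((level[index ^ 1], index % 2))  # flag 1: we are right child
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        index //= 2
    return proof

def verify(leaf, proof, root):
    node = h(leaf)
    for sib, we_are_right in proof:
        node = h(sib + node) if we_are_right else h(node + sib)
    return node == root

logs = [b"proc:1 exec", b"proc:1 open /etc/passwd",
        b"net:out 10.0.0.2", b"proc:2 fork"]
root = merkle_root(logs)     # endpoint commits this root; cloud stores the logs
p = merkle_proof(logs, 1)    # cloud's proof that logs[1] is in the committed set
```

A tampered or substituted log entry fails verification against the committed root, which is the integrity property an untrusted cloud cannot fake.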
Theme: Agentic tool use with verification/gating loops
- Why it matters: tool-augmented agents can reduce hallucination and improve robustness, but only with credit assignment and verification over exploration and tool outputs; otherwise error propagation is amplified.
- Representative papers:
- MedVR: Annotation-Free Medical Visual Reasoning via Agentic Reinforcement Learning
- Towards Robust Real-World Spreadsheet Understanding with Multi-Agent Multi-Format Reasoning
- Scientific Graphics Program Synthesis via Dual Self-Consistency Reinforcement Learning
- OOM-RL: Out-of-Money Reinforcement Learning Market-Driven Alignment for LLM-Based Multi-Agent Systems
- Common approaches:
- Integrate explicit tool calls (zoom/crop; code execution; rendering/compilation) into the reasoning loop.
- Gate rewards/verification (compile gating; visual-fidelity gates; RO-lock making tests immutable).
- Shape credit assignment with self-consistency/consensus signals (MedVR's CCA; SciTikZ's DSC round trip).
- Open questions / failure modes:
- Compute intensity and rollout cost (MedVR RL; SciTikZ DSC overhead; SpreadsheetAgent extraction averaging ~97 s).
- Limited action sets/tools (MedVR has only Zoom-in; spreadsheet tools depend on strong VLMs).
- Human-in-the-loop dependence and reproducibility (OOM-RL relies on a manual "Epistemic Autopsy"; code/logs only partially open-sourced).
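The consensus-style credit assignment these systems share can be sketched crudely: only pass a trajectory's reward down to steps whose outcomes recur across rollouts. The function name, the outcome encoding, and the 50% threshold below are illustrative assumptions, not MedVR's actual CCA:

```python
from collections import Counter

def consensus_credit(rollouts, final_rewards, agree_frac=0.5):
    """Mask per-step credit: a step outcome receives its trajectory's
    reward only if the same outcome appears in at least `agree_frac`
    of all rollouts (a crude consensus signal).
    rollouts: list of trajectories, each a list of hashable step outcomes."""
    n = len(rollouts)
    counts = Counter(o for traj in rollouts for o in set(traj))
    credited = []
    for traj, r in zip(rollouts, final_rewards):
        credited.append([r if counts[o] / n >= agree_frac else 0.0
                         for o in traj])
    return credited

# Three rollouts: "zoom:lesion" recurs, "zoom:corner" is a one-off exploration.
rollouts = [["zoom:lesion", "answer:A"],
            ["zoom:lesion", "answer:A"],
            ["zoom:corner", "answer:B"]]
rewards = [1.0, 1.0, 0.0]
credited = consensus_credit(rollouts, rewards)
```

The effect is that spurious one-off actions get no credit even inside rewarded trajectories, which is the label-free credit-shaping intuition.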
Theme: Evaluation and diagnostics for "behavioral reliability" (not just accuracy)
- Why it matters: many systems fail in ways standard metrics miss: process drift, prompt sensitivity, false-positive bias, and silent failures in executable outputs.
- Representative papers:
- RPA-Check: A Multi-Stage Automated Framework for Evaluating Dynamic LLM-based Role-Playing Agents
- Structured Intent as a Protocol-Like Communication Layer: Cross-Model Robustness, Framework Comparison, and the Weak-Model Compensation Effect
- Lost in the Hype: Revealing and Dissecting the Performance Degradation of Medical Multimodal Large Language Models in Image Classification
- Is Vibe Coding the Future? An Empirical Assessment of LLM Generated Codes for Construction Safety
- Common approaches:
- Break requirements into checklists/boolean items (RPA-Check) and run cross-model/cross-language factorial studies (Structured Intent).
- Probe layer by layer to locate where information degrades (FHS for medical MLLM classification).
- Sandboxed execution + rubric-based review to surface "runs but wrong" (construction-safety SFR).
- Open questions / failure modes:
- Ceiling effects/bias of LLM-as-judge and limited scenario coverage (Structured Intent's ceiling; RPA-Check covers only five scenarios).
- Execution success masking domain-logic failures (high silent failure rates in safety code).
- Mismatch between generative decoding and discriminative tasks (medical MLLMs may underperform supervised classifiers).
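The "runs but wrong" metric is straightforward to operationalize: execute each generated script in a sandbox, apply domain validators, and report the fraction of clean executions that still fail a domain check. A minimal sketch over pre-computed (executed, domain-valid) flags; the construction-safety paper's exact definition may differ:

```python
def silent_failure_rate(results):
    """results: list of (executed_ok, domain_ok) booleans per generated script.
    SFR = scripts that ran without error but violated a domain check,
    as a fraction of all scripts that ran without error."""
    ran = [r for r in results if r[0]]
    if not ran:
        return 0.0
    silent = [r for r in ran if not r[1]]
    return len(silent) / len(ran)

# 10 generated scripts: 8 execute cleanly, but 3 of those violate a safety rule.
results = [(True, True)] * 5 + [(True, False)] * 3 + [(False, False)] * 2
```

Reporting this alongside plain execution success makes the gap between "it runs" and "it is correct" visible.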
Theme: Lightweight uncertainty/explainability signals for deployment
- Why it matters: always-on systems need cheap, intrinsic uncertainty and explainability signals; post-hoc sampling can be too expensive, but dropping uncertainty is unsafe.
- Representative papers:
- FoMo X: Modular Explainability Signals for Outlier Detection Foundation Models
- Towards Intrinsically Calibrated Uncertainty Quantification in Industrial Data-Driven Models via Diffusion Sampler
- Monte Carlo Stochastic Depth for Uncertainty Estimation in Deep Learning
- Common approaches:
- Distill expensive uncertainty into a single-forward-pass head (FoMo-X distills MC-dropout variability).
- Achieve calibration without post-hoc recalibration via better posterior samplers (DiffUQ via Schrödinger bridge / SOC).
- Reuse structural stochasticity as variational inference (MCSD), and benchmark the calibration–accuracy trade-off.
- Open questions / failure modes:
- Sim-to-real mismatch for explanation heads (FoMo-X's global head transfers poorly).
- Training cost interacting with discretization/network size (DiffUQ).
- Multiple forward passes at inference (MCSD needs T passes; latency grows linearly in T).
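The MCSD idea of reusing structural stochasticity can be sketched on a toy residual stack: keep each block with some probability at inference time, run T passes, and read the variance across passes as the uncertainty signal. All shapes, the keep probability, and T below are illustrative, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(0)

def residual_forward(x, weights, keep_prob=0.8):
    """Toy residual stack where each block is kept with prob keep_prob
    at *inference* time (Monte Carlo stochastic depth)."""
    h = x
    for W in weights:
        if rng.random() < keep_prob:
            h = h + np.tanh(h @ W)        # residual block
    return h

def mcsd_predict(x, weights, T=50):
    """T stochastic passes -> predictive mean and variance per output dim."""
    outs = np.stack([residual_forward(x, weights) for _ in range(T)])
    return outs.mean(axis=0), outs.var(axis=0)

weights = [rng.normal(scale=0.3, size=(4, 4)) for _ in range(6)]
x = rng.normal(size=(1, 4))
mean, var = mcsd_predict(x, weights)
# Disagreement across stochastic sub-networks is the uncertainty signal.
```

Unlike MC dropout, no extra mechanism is added: the same layer-drop randomness used during stochastic-depth training is simply left on at test time.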
3) Technical Synthesis
- Multiple papers converge on "gating" as the core anti-hallucination primitive: compile gating (SciTikZ), correctness-conditioned tool rewards (MedVR), policy-engine blocking (AEROS), RO-lock preventing test mutation (OOM-RL), and cryptographic verification (vCause/FCaC).
- Representation diagnostics are becoming first-class: Koopman's PAD principal angles quantify invariance; medical MLLM Feature Health Scores locate where discriminative signal is lost; AWARE explicitly shapes retrieval neighborhoods with SNNL.
- Under frozen settings, longer context beats larger models: SignalMC-MED finds 10-minute signals deliver gains more consistently than scaling parameters; likewise, structured intent helps weak models more (weak-model compensation).
- Retrieval is moving earlier in the pipeline: ContextClaim moves retrieval into claim detection; the medical QA study isolates retrieval components; SpreadsheetAgent iteratively retrieves/inspects local regions instead of serializing the whole sheet.
- Tool-use RL is adopting self-supervised credit assignment: MedVR's consensus masking (CCA) and entropy-triggered branching (EVR) echo a broader trend of using ensemble/consensus signals to shape exploration without labels.
- MoE vs. dense is no free lunch: deployment-oriented benchmarks show mid-size MoE can be Pareto-better (Gemma-4-E4B), while persona steering shows MoE graders may drift in calibration more than dense models.
- "Runs" ≠ "correct" is now quantified: the construction-safety study introduces silent failure rates for executable scripts; medical MLLMs show generative pipelines can still underperform discriminative baselines despite larger scale.
- Privacy/security tooling is integrating with LLMs: quantum-safe auditing uses LLMs for contextual annotation; CARIS keeps raw clinical data behind tools via MCP; FHE integrates encrypted inference into attention.
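The gating primitive that runs through these systems reduces to: no hard check passed, no reward. A minimal sketch using Python's `compile()` as a stand-in for a TikZ/renderer or test gate; it is an illustration of the pattern, not any paper's implementation:

```python
def gated_reward(source: str, quality_score: float) -> float:
    """Hard gate: an artifact that fails to compile gets zero reward,
    no matter how good it looks to a soft scorer."""
    try:
        compile(source, "<candidate>", "exec")   # stand-in for a compile/test gate
    except SyntaxError:
        return 0.0
    return quality_score

good = "def area(r):\n    return 3.14159 * r * r\n"
bad = "def area(r)\n    return 3.14159 * r * r\n"   # missing colon
```

Because the gate is binary and external to the model, it cannot be argued with or reward-hacked the way a learned soft scorer can, which is why it keeps reappearing across these systems.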
4) Top 5 Papers (with "why now")
1) MedVR: Annotation-Free Medical Visual Reasoning via Agentic Reinforcement Learning
- Trains medical VLMs with RL (GRPO) to interleave text with explicit visual actions (Zoom-in), without intermediate annotations.
- Introduces EVR (entropy-guided branching) and CCA (consensus-based credit assignment), making tool use learnable and grounded.
- Strong gains on medical VQA and grounding (e.g., a large mIoU improvement; annotation-free CCA nearly matches the supervised ceiling on one grounding task).
- Caveats: heavy RL compute with many rollouts; the action space is currently limited to a single Zoom-in tool; the clinical plausibility of intermediate steps is not expert-validated.
2) vCause: Efficient and Verifiable Causality Analysis for Cloud-based Endpoint Auditing
- Provides cryptographic proofs for causal queries over versioned provenance graphs (membership + backward/forward causal components).
- Scales to large logs (reports processing 25M logs; 49 ms proof generation for 100K-node causality) with low endpoint overhead (<1%).
- Gives formal security reductions under standard assumptions; suitable for incident response where the cloud is untrusted.
- Caveats: confidentiality is out of scope; segment depth and commitment intervals introduce operational trade-offs.
3) SignalMC-MED: A Multimodal Benchmark for Evaluating Biosignal Foundation Models on Single-Lead ECG and PPG
- Establishes a realistic long-duration (10-minute) synchronized ECG+PPG benchmark with 20 clinical tasks and chronological splits.
- Finds domain biosignal FMs > generic time-series FMs, that ECG+PPG fusion helps, and that under frozen linear probing longer signals help more than scaling parameters.
- Shows hand-crafted physiological features remain strong and complement FM embeddings.
- Caveats: frozen linear probing + mean pooling may underestimate gains from fine-tuning and richer temporal aggregation; a single health system limits generalization.
4) Towards Robust Real-World Spreadsheet Understanding with Multi-Agent Multi-Format Reasoning
- A practical multi-agent loop: incremental extraction → verification → solving, preserving layout via image + LaTeX + code-execution tools.
- Consistent gains on SpreadsheetBench/RealHiTBench; ablations show verification and multi-format tools are critical.
- An actionable template for documents that do not fit in context.
- Caveats: depends on models with strong multimodal/tool capabilities; nontrivial runtime/memory overhead (extraction reported at ~97 s on average; 21 GB peak).
5) Structured Intent as a Protocol-Like Communication Layer: Cross-Model Robustness, Framework Comparison, and the Weak-Model Compensation Effect
- A large factorial study (3,240 outputs) shows structured intent formats improve goal alignment and reduce cross-language variance by ~24×.
- Identifies weak-model compensation (structured prompts help weaker baselines more) and a capacity/overhead boundary (too much structure can hurt).
- Includes a user study (N=50): AI-expanded 5W3H reduces interaction turns and improves satisfaction.
- Caveats: ceiling effects and single-judge bias with LLM-as-judge; the user study's ordering was not counterbalanced.
5) Practical Next Steps
- If deploying RAG/RA-TICL in medicine: measure the label purity of retrieval neighborhoods, and try task-aligned embedding shaping (e.g., SNNL + balanced sampling) before scaling the backbone (AWARE).
- For multimodal medical assistants: add explicit visual tool calls and track grounding metrics (IoU/footprints); consider consensus-based pseudo-supervision to avoid the annotation bottleneck (MedVR).
- For safety-critical code generation: implement sandboxed execution + domain-logic validators and report silent failure rates (not just "it runs"); require explicitly executable instructions (construction-safety study).
- For cross-organization agent deployments: adopt proof-carrying admission (capability tokens + PoP), and plan revocation/freshness as first-class requirements (FCaC).
- For cloud forensics/auditing: assess whether your provenance queries need verifiable proofs; prototype a Merkle accumulator + recursive digest approach and benchmark proof size/latency (vCause).
- For model selection under constraints: run prompt-conditioned Pareto sweeps (accuracy vs. latency/VRAM) instead of judging by parameter count; watch for prompt-sensitivity pathologies (the Gemma/Phi/Qwen trade-off study).
- For uncertainty in production: consider single-forward distilled uncertainty heads (FoMo-X), or MCSD if your residual architecture already uses stochastic depth; benchmark the calibration (ECE/AUARC) vs. accuracy trade-off.
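The Pareto-sweep recommendation above amounts to a non-domination filter over per-prompt (accuracy, latency) measurements. A sketch of that filter; the candidate names and numbers are made up for illustration:

```python
def pareto_front(points):
    """points: list of (name, accuracy, latency_ms).
    Keep configs not dominated by any other, where 'dominates' means
    accuracy >= AND latency <= with at least one strict inequality."""
    front = []
    for name, acc, lat in points:
        dominated = any(a >= acc and l <= lat and (a > acc or l < lat)
                        for _, a, l in points)
        if not dominated:
            front.append(name)
    return front

candidates = [("dense-4b", 71.0, 220.0),
              ("dense-12b", 78.0, 540.0),
              ("moe-e4b", 77.5, 260.0),    # near dense-12b accuracy, far cheaper
              ("dense-1b", 60.0, 300.0)]   # dominated by dense-4b
front = pareto_front(candidates)
```

Running this per prompt template (rather than once) is what makes the sweep prompt-conditioned: a config can sit on the frontier for one prompt style and fall off it for another.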
Generated from per-paper analyses; no external browsing was performed.
