AI Paper Daily (2026-04-02)

Published:

English version: /paper-news/2026-04-02/

Run statistics

  • Candidate papers: 235
  • Selected papers: 30
  • Deep reads completed: 30
  • Time window (UTC): 2026-03-31T00:00:00Z → 2026-04-01T00:00:00Z (arxiv_announce, expanded=0)
Paper list used for summarization

| arXiv ID | Title | Categories | Score | Selection rationale | Tags |
| --- | --- | --- | --- | --- | --- |
| 2603.29403 | Security in LLM-as-a-Judge: A Comprehensive SoK | cs.CR, cs.AI | 94 | First SoK on LLM-as-a-Judge security; maps attacks/risks for eval pipelines. | LLM-as-a-judge, security, evaluation, adversarial, SoK, reliability |
| 2603.29231 | Beyond pass@1: A Reliability Science Framework for Long-Horizon LLM Agents | cs.AI | 94 | Reliability metrics for long-horizon agents; shows pass@1 fails as duration grows; large eval. | agents, reliability, evaluation, long-horizon, benchmarks, deployment |
| 2603.30016 | Architecting Secure AI Agents: Perspectives on System-Level Defenses Against Indirect Prompt Injection Attacks | cs.CR, cs.AI | 92 | System-level design guidance for indirect prompt injection defenses in agents. | agents, prompt-injection, system-design, security, tool-use, policies |
| 2603.29993 | Extending MONA in Camera Dropbox: Reproduction, Learned Approval, and Design Implications for Reward-Hacking Mitigation | cs.AI | 92 | Reproduces+extends MONA reward-hacking mitigation; probes learned approval assumptions & tooling. | alignment, reward-hacking, RL, MONA, reproducibility, safety |
| 2603.29357 | BenchScope: How Many Independent Signals Does Your Benchmark Provide? | cs.AI | 92 | Quantifies benchmark redundancy via effective dimensionality; actionable for eval design/leaderboards. | evaluation, benchmarks, measurement, leaderboards, metrics |
| 2603.29665 | Near-Miss: Latent Policy Failure Detection in Agentic Workflows | cs.CL | 90 | Detects latent policy failures (near-misses) in agent workflows beyond end-state checks. | agents, policy-compliance, evaluation, monitoring, ToolGuard, safety-metrics |
| 2603.29418 | Adversarial Prompt Injection Attack on Multimodal Large Language Models | cs.CV, cs.AI | 90 | Imperceptible visual prompt injection against closed MLLMs; practical multimodal attack surface. | security, prompt-injection, multimodal, adversarial, red-teaming, MLLM |
| 2603.29500 | Learning to Generate Formally Verifiable Step-by-Step Logic Reasoning via Structured Formal Intermediaries | cs.AI, cs.LG | 90 | Process reward using structured formal intermediates to improve step reliability without losing accuracy. | reasoning, process-reward, formal-methods, RL, reliability |
| 2603.29846 | SNEAK: Evaluating Strategic Communication and Information Leakage in Large Language Models | cs.CL | 88 | Benchmark for strategic communication & secret-keeping; targets info leakage in LLMs. | information-leakage, multi-agent, benchmarks, security, strategic-communication, LLM-eval |
| 2603.29429 | CounselReflect: A Toolkit for Auditing Mental-Health Dialogues | cs.CL | 88 | Auditing toolkit for mental-health dialogues with evidence-linked, multi-metric risk reports. | evaluation, auditing, safety, mental-health, rubrics, LLM |
| 2603.29353 | Nomad: Autonomous Exploration and Discovery | cs.AI | 88 | Exploration-first agent architecture with hypothesis generation + independent verification; relevant to agent reliability. | agents, autonomous-research, tool-use, verification, evaluation |
| 2603.29373 | Beyond Idealized Patients: Evaluating LLMs under Challenging Patient Behaviors in Medical Consultations | cs.CL | 86 | Realistic medical safety eval: challenging patient behaviors + concrete unsafe failure criteria. | medical, safety-evaluation, robustness, hallucinations, high-stakes, LLM |
| 2603.29492 | Calibrated Confidence Expression for Radiology Report Generation | cs.CL | 86 | RL framework to calibrate verbalized confidence in radiology reports; targets hallucination risk. | calibration, medical, vision-language, hallucinations, RL, reliability |
| 2603.29632 | An Empirical Study of Multi-Agent Collaboration for Automated Research | cs.MA, cs.AI | 86 | Controlled empirical study of multi-agent coordination for automated research; useful evidence for MAS design/safety. | multi-agent, coordination, automated-research, benchmarks, agent-evaluation |
| 2603.29194 | Multi-Layered Memory Architectures for LLM Agents: An Experimental Evaluation of Long-Term Context Retention | cs.CV, cs.AI | 86 | Agent memory layering + retrieval gating reduces drift/false memories under bounded context budgets. | agents, memory, long-context, retrieval, reliability |
| 2603.29493 | MemFactory: Unified Inference & Training Framework for Agent Memory | cs.CL, cs.AI | 85 | Unified framework for training/inference of agent memory with modular components; reusable infra. | agents, memory, framework, RL, tooling, long-term |
| 2603.29902 | ATP-Bench: Towards Agentic Tool Planning for MLLM Interleaved Generation | cs.AI | 84 | ATP-Bench evaluates agentic tool planning for interleaved multimodal generation. | agents, tool-planning, multimodal, benchmark, MLLM, evaluation |
| 2603.29405 | Hallucination-aware intermediate representation edit in large vision-language models | cs.CV, cs.AI | 84 | Low-overhead hallucination mitigation for VLMs via intermediate-representation detection and edits; practical reliability gain. | hallucinations, vision-language, reliability, representation-editing, multimodal |
| 2603.29676 | A Comprehensive Information-Decomposition Analysis of Large Vision-Language Models | cs.LG, cs.CL, cs.CV | 84 | PID-based decomposition to measure redundant/unique/synergistic info in 26 LVLMs across tasks. | interpretability, vision-language, information-decomposition, multimodal, analysis |
| 2603.29497 | Distilling Human-Aligned Privacy Sensitivity Assessment from Large Language Models | cs.CL | 83 | Distills LLM privacy sensitivity judgments into small models for scalable deployment. | privacy, distillation, data-governance, classification, LLM-judge, efficiency |
| 2603.29318 | PSPA-Bench: A Personalized Benchmark for Smartphone GUI Agent | cs.AI | 83 | Personalization benchmark for smartphone GUI agents with 12.8k instructions across apps/scenarios. | agents, benchmarks, GUI, smartphones, personalization, evaluation |
| 2603.29139 | SciVisAgentBench: A Benchmark for Evaluating Scientific Data Analysis and Visualization Agents | cs.AI, cs.GR, cs.HC | 82 | Benchmark for scientific analysis/visualization agents with taxonomy + outcome-centric evaluation. | agents, benchmarks, scientific-workflows, tool-use, evaluation, visualization |
| 2603.29466 | An Isotropic Approach to Efficient Uncertainty Quantification with Gradient Norms | cs.LG, cs.AI, cs.CL | 82 | Cheap uncertainty estimates from gradient norms (single backward pass) for large models; helps calibration/monitoring. | uncertainty, calibration, gradient-norm, epistemic-uncertainty, monitoring |
| 2603.29288 | Sima AIunty: Caste Audit in LLM-Driven Matchmaking | cs.CY, cs.AI, cs.CL, cs.HC, cs.SI | 82 | Controlled audit of caste bias in LLM matchmaking across model families and income strata. | bias, fairness, auditing, sociotechnical, evaluation |
| 2603.29759 | TSHA: A Benchmark for Visual Language Models in Trustworthy Safety Hazard Assessment Scenarios | cs.CV, cs.AI | 81 | Large real-world VLM benchmark for trustworthy indoor safety hazard assessment. | VLM, safety, benchmark, hazard-detection, robust-evaluation, vision-language |
| 2603.29232 | Long-Document QA with Chain-of-Structured-Thought and Fine-Tuned SLMs | cs.CL, cs.AI, cs.LG | 80 | Structured long-doc QA (CoST) enabling verifiable outputs; aims for accuracy+latency with SLMs. | long-context, QA, structured-output, verification, SLMs, reliability |
| 2603.29871 | ShapE-GRPO: Shapley-Enhanced Reward Allocation for Multi-Candidate LLM Training | cs.AI | 80 | Shapley-style reward allocation for multi-candidate LLM post-training; reduces free-riding vs set-level rewards. | LLM-training, RLHF, GRPO, credit-assignment, shapley |
| 2603.29109 | SemLoc: Structured Grounding of Free-Form LLM Reasoning for Fault Localization | cs.SE, cs.AI | 80 | Grounds free-form LLM reasoning into structured intermediates for more verifiable fault localization. | software, debugging, LLM-reasoning, grounding, verification |
| 2603.29088 | WybeCoder: Verified Imperative Code Generation | cs.SE, cs.AI | 79 | Agentic verified code generation with co-evolving invariants/proofs; improves reliability. | code-generation, verification, agents, Lean, SMT, reliability |
| 2603.29824 | Curvature-Guided LoRA: Steering in the pretrained NTK subspace | cs.LG | 79 | Curvature/NTK-guided LoRA aims to match full fine-tuning predictions with efficient second-order updates. | PEFT, LoRA, optimization, second-order, fine-tuning |

AI Paper Insights Brief

2026-04-02

0) Executive takeaways (read this first)

  • Evaluation is shifting from "did it ever succeed once?" to "does it keep succeeding reliably and safely along the trajectory?" New metrics and benchmarks target long-horizon reliability decay (RDC/VAF/GDS/MOP), latent policy failures ("near-misses"), and benchmark redundancy (effective dimensionality), suggesting many existing leaderboards may overstate progress.
  • Structured, executable intermediate artifacts are becoming the mainstream pattern for grounding LLM reasoning, as seen in verifiable imperative code generation (VC subgoals + Lean/SMT), semantic fault localization (LLM → executable constraints), and long-document QA (LLM → structured output distilled into SLMs).
  • Memory is not "free": naive memory scaffolding can hurt long-horizon performance. The reliability study finds that memory-augmented ReAct never improves long-horizon GDS and often degrades it; by contrast, explicit layered memory (working/episodic/semantic plus retention regularization) reports retention/FMR gains, pointing to memory design as the decisive variable.
  • LLM-as-a-judge is now a critical safety dependency. A security SoK catalogs high-ASR attacks on judges (prompt injection, poisoning/backdoors, tokenization exploits); several new benchmarks also rely on MLLM judges, making judge hardening and meta-evaluation all the more necessary.
  • Multimodal systems are squeezed from both directions: stronger benchmarks and stronger attacks. New hazard-assessment and tool-planning benchmarks raise realism and coverage, while imperceptible multimodal prompt injection achieves high black-box ASR against commercial MLLMs; deployment requires system-level defenses, not just model fine-tuning.
  • Formal verification is expanding from functional proofs to large imperative programs. WybeCoder reports high solve rates on translated imperative benchmarks and delivers a large verified artifact (Heapsort), suggesting agentic proof-and-code co-evolution is becoming practical (caveats remain).

2) Key themes (clusters)

Theme: Trajectory-level reliability and hidden failures in agents

Theme: Structured intermediates for grounding, auditability, and distillation

Theme: Benchmarking the benchmarks (redundancy, judge validity, domain realism)

Theme: Security and privacy risks in evaluators and multimodal systems

Theme: Memory and long-context retention (effective vs. counterproductive)

Theme: Multimodal trustworthiness: hallucinations, calibration, and fusion diagnostics


3) Technical synthesis

  • GRPO is becoming a common post-training primitive across domains: structured long-document QA distillation (LITECOST), radiology confidence calibration (ConRad), memory-RL infrastructure (MemFactory), process-reward formal reasoning (PRoSFI), and Shapley-enhanced multi-candidate RL (ShapE-GRPO).
  • "Make it executable" is the unifying anti-hallucination strategy: cbfl-ir constraints execute on tests (SemLoc), VCs are discharged by SMT/Lean (WybeCoder), formal step intermediates are checked by a prover (PRoSFI), and tool-plan labels are judged via precision/recall (ATP-Bench).
  • Multi-agent decomposition is used to scale verification and evaluation: WybeCoder dispatches VC subgoals to parallel proof agents; ATP-Bench uses multi-agent judging (precision/recall/chief); Nomad separates explorer and verifier for discovery.
  • Reliability failures are increasingly characterized as distributions over repeated runs rather than point estimates: repeated episodes (k=3) reveal variance amplification and rank reversals; near-misses show that a correct end state can still mask policy violations.
  • Memory is a double-edged sword: explicit layered memory with retention regularization reports retention/FMR gains, while memory-augmented ReAct in the reliability experiments never improves long-horizon GDS and often degrades it, suggesting interference and overhead dominate unless memory is carefully structured and trained.
  • Judge dependence is growing, and so is its attack surface: SciVisAgentBench, ATP-Bench, TSHA, and CounselReflect all use LLM/MLLM judges with robustness checks; the LaaJ security SoK documents high-ASR attacks that could poison these pipelines.
  • Benchmark design is getting more scientific: BenchScope's effective dimensionality plus null-hypothesis/reliability tests offers a way to audit whether a suite actually measures multiple independent capabilities.
  • Multimodal trustworthiness is attacked and defended at different layers: attacks manipulate inputs (CoTTA), defenses manipulate hidden states (HIRE) or train calibrated confidence (ConRad), while PID tries to measure whether vision actually contributes.
  • System-level security proposals converge on constraining what the model can see and decide: structured artifacts, decoupling recognition from action, and programmatic verifiers align with the principles behind SemLoc/WybeCoder, reducing free-form degrees of freedom.
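ShapE-GRPO's actual allocation rule is not reproduced in this digest; as a minimal sketch assuming only the textbook Shapley value, the following shows what candidate-level credit assignment over a set-level reward looks like, and why a "free-riding" candidate receives zero instead of the broadcast set-level scalar. The function name and reward shape are illustrative.

```python
from itertools import combinations
from math import factorial

def shapley_rewards(candidates, set_reward):
    """Exact Shapley attribution of a set-level reward to candidates.

    `set_reward` maps a frozenset of candidate IDs to a scalar (e.g. a
    best-of-set task score). Each candidate gets its average marginal
    contribution over all subset orderings, so a candidate that never
    changes the set reward receives 0 instead of the shared scalar.
    """
    n = len(candidates)
    values = {}
    for c in candidates:
        others = [x for x in candidates if x != c]
        total = 0.0
        for k in range(n):
            # Weight |S|! * (n - |S| - 1)! / n! for subsets S of size k.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (set_reward(s | {c}) - set_reward(s))
        values[c] = total
    return values
```

Exact computation is exponential in the number of candidates, which is only viable for small candidate groups; that matches the small group sizes typical of GRPO-style sampling, and larger groups would need sampled approximations.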

4) Top 5 papers (with "why now")

1) Beyond pass@1: A Reliability Science Framework for Long-Horizon LLM Agents

  • Introduces a metric suite (RDC/RDS, VAF, GDS, MOP) that exposes long-horizon failure modes hidden by pass@1.
  • A large-scale study (396 tasks, 23,392 episodes) shows pervasive reliability decay and rank reversals at long horizons.
  • Delivers a sharp, actionable finding: memory-augmented ReAct never improves long-horizon GDS and often degrades it.
  • Caveats / limitations: duration buckets rely on estimated human time (an imperfect proxy); only 10 open-weight models and 3 domains.
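The paper's exact RDC/GDS/VAF definitions are not reproduced here; as a minimal sketch under assumed data shapes, the following aggregates k repeated episodes per task into per-horizon-bucket reliability, contrasting a pass@1-style mean success rate with all-k reliability. The gap between the two is what exposes run-to-run variance that single-run leaderboards hide.

```python
from collections import defaultdict

def reliability_by_horizon(episodes):
    """Aggregate repeated episodes into per-bucket reliability estimates.

    `episodes` is a list of (task_id, horizon_bucket, success) tuples with
    k repeated runs per task. Returns, per horizon bucket:
      - mean_success: pass@1-style success rate over all runs, and
      - all_k_reliability: fraction of tasks that succeed in *every* repeat.
    """
    per_task = defaultdict(list)
    bucket_of = {}
    for task_id, bucket, success in episodes:
        per_task[task_id].append(bool(success))
        bucket_of[task_id] = bucket

    stats = defaultdict(lambda: {"runs": 0, "successes": 0, "tasks": 0, "all_k": 0})
    for task_id, runs in per_task.items():
        b = stats[bucket_of[task_id]]
        b["runs"] += len(runs)
        b["successes"] += sum(runs)
        b["tasks"] += 1
        b["all_k"] += all(runs)  # True only if the task never failed

    return {
        bucket: {
            "mean_success": s["successes"] / s["runs"],
            "all_k_reliability": s["all_k"] / s["tasks"],
        }
        for bucket, s in stats.items()
    }
```

Plotting `all_k_reliability` against horizon bucket gives a crude reliability-decay curve in the spirit of the paper's findings, without claiming to match its metric definitions.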

2) WybeCoder: Verified Imperative Code Generation

  • Demonstrates agentic co-evolution of imperative code, invariants, and proofs under SMT + Lean.
  • Reports strong solve rates on translated imperative benchmarks (e.g. 74.1% on Verina-Loom, 62.1% on Clever-Loom), plus a large verified Heapsort artifact.
  • Multi-agent VC-subgoal decomposition with proof migration via deterministic naming is a concrete scaling recipe.
  • Caveats / limitations: the Loom/Velvet pipeline is still experimental; targets managed-memory languages; some specs/decompositions are hand-written; open-weight models lag behind.

3) SemLoc: Structured Grounding of Free-Form LLM Reasoning for Fault Localization

  • Turns the LLM's "semantic intent" into SSA-anchored executable constraints, enabling spectrum-style scoring over tests.
  • Achieves large localization gains over SBFL (e.g. Acc@1 42.8% vs. 6.4%, with fewer suspicious lines flagged).
  • The counterfactual-patch step substantially boosts Acc@1 (the ablation shows a drop of roughly 12pp without it).
  • Caveats / limitations: high constraint waste (many constraints never fire or over-approximate); datasets are single-fault, small programs; repository-scale settings remain problematic.
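SemLoc's own pipeline (SSA anchoring, constraint generation) is not shown in this digest; as a minimal sketch of the spectrum-style scoring layer it plugs into, here is the classic Ochiai suspiciousness formula over per-test hit sets. In a SemLoc-style setup the "covered" items would be fired constraint IDs rather than source lines; the function name and data shapes are illustrative.

```python
import math

def ochiai_scores(coverage, failing):
    """Classic Ochiai spectrum-based suspiciousness over test coverage.

    `coverage` maps test name -> set of items hit by that test (lines, or
    constraint IDs in a grounded setup); `failing` is the set of failing
    test names. Score = ef / sqrt(total_fail * (ef + ep)), where ef/ep are
    the failing/passing tests that hit the item. Higher = more suspicious.
    """
    total_fail = len(failing)
    items = set().union(*coverage.values())
    scores = {}
    for item in items:
        ef = sum(1 for t in failing if item in coverage[t])
        ep = sum(1 for t in coverage if t not in failing and item in coverage[t])
        denom = math.sqrt(total_fail * (ef + ep))
        scores[item] = ef / denom if denom else 0.0
    return scores
```

Ranking items by score and reporting the top-k is what metrics like Acc@1 then evaluate.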

4) BenchScope: How Many Independent Signals Does Your Benchmark Provide?

  • Provides a fast diagnostic (effective dimensionality) for detecting redundant benchmark suites and fragile composite metrics.
  • Shows empirically that major suites can collapse to roughly 1–2 effective axes (e.g. Open LLM Leaderboard ≈ 1.7).
  • Offers a practical maintainer workflow (null-hypothesis tests, saturation checks, split-half reliability, ED-greedy selection).
  • Caveats / limitations: ED depends on the model population; binary SVD overestimates dimensionality (a correction is required).
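BenchScope's exact estimator and its correction for binary scores are not reproduced here; as a sketch of the general idea, the following computes a common participation-ratio effective dimension over a models × benchmarks score matrix, one standard way to operationalize "how many independent axes does this suite measure?"

```python
import numpy as np

def effective_dimension(scores):
    """Participation-ratio effective dimension of a models x benchmarks
    score matrix: ED = (sum of eigenvalues)^2 / (sum of squared eigenvalues)
    of the centered covariance spectrum. Fully redundant benchmark columns
    collapse toward ED ~= 1; independent signals push ED toward the column
    count.
    """
    X = np.asarray(scores, dtype=float)
    X = X - X.mean(axis=0, keepdims=True)   # center each benchmark column
    s = np.linalg.svd(X, compute_uv=False)  # singular values of centered matrix
    lam = s ** 2                            # unnormalized covariance eigenvalues
    total = lam.sum()
    if total == 0:
        return 0.0
    return float(total ** 2 / (lam ** 2).sum())
```

As the caveats note, with binary pass/fail scores this kind of spectrum-based estimate is biased upward, so treat it as a diagnostic rather than a measurement.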

5) Adversarial Prompt Injection Attack on Multimodal Large Language Models

  • Demonstrates stealthy yet expressive multimodal injection (hidden text triggers + ℓ∞ perturbations) with high black-box ASR against commercial MLLMs.
  • Dual-objective alignment (text plus an iteratively updated target image) is empirically essential (the ablation shows ASR collapses without it).
  • Directly relevant to agent deployments that treat images as untrusted input.
  • Caveats / limitations: limited tasks (captioning/VQA) and query budgets; no human perception study is reported.
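The attack's dual-objective, black-box optimization loop is not reproduced here; as a generic sketch of just the imperceptibility constraint such attacks operate under, this is the standard projection of a perturbed image back into an ℓ∞ ball around the original (the `[0, 1]` pixel range and function name are assumptions, not taken from the paper).

```python
import numpy as np

def linf_project(image_adv, image_orig, epsilon):
    """Project an adversarial image back into the l-infinity ball of radius
    `epsilon` around the original, then into the valid [0, 1] pixel range.
    This is the step that keeps per-pixel changes below the visibility
    budget regardless of how the perturbation was optimized.
    """
    delta = np.clip(image_adv - image_orig, -epsilon, epsilon)
    return np.clip(image_orig + delta, 0.0, 1.0)
```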

5) Practical next steps

  • Adopt trajectory-level evaluation in your agent stack: run k repeated episodes and compute reliability decay by task duration; log tool-call entropy to detect collapse (MOP-style) and correlate it with failures.
  • Add near-miss audits to any tool-using agent: for each state-changing action, verify that the required read-only evidence appears earlier in the trajectory (guard-code replay plus history search).
  • Harden LLM-as-a-judge pipelines: treat judges as attack targets; use constrained schemas and ensemble/committee checks where possible, and track judge drift/stability (prompt-perturbation tests).
  • Prefer structured intermediates over free-form reasoning: require JSON/IR outputs to be executable/checkable (constraints, tool plans, formal steps), and discard malformed or ungrounded outputs.
  • Treat "memory-augmented" with caution: test whether your memory scaffolding improves long-horizon GDS (partial credit), not just pass@1; consider layered memory with drift regularization rather than a naive episodic scratchpad.
  • For multimodal agents, assume images are untrusted: evaluate against stealthy prompt injection; add system-level defenses (plan/policy separation, structured verifiers) rather than relying on prompt instructions alone.
  • Audit your benchmark suite for redundancy before optimizing against it: compute effective dimensionality and run split-half/permutation null tests to make sure you are not overfitting a single latent axis.
  • If you train multi-candidate generators: consider reward allocation that avoids free-riding (candidate-level credit assignment) instead of broadcasting a set-level scalar to all candidates.
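The "discard malformed or ungrounded outputs" discipline from the steps above can be sketched as a pre-execution gate; the helper name, plan schema, and `known_tools` registry below are hypothetical, not from any of the papers.

```python
import json

def parse_grounded_plan(raw, known_tools):
    """Validate a model-emitted tool plan before execution.

    The output must parse as JSON with the expected shape, and every
    referenced tool must exist in `known_tools`. Malformed or ungrounded
    plans are rejected (return None) rather than silently repaired.
    """
    try:
        plan = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(plan, dict) or not isinstance(plan.get("steps"), list):
        return None
    for step in plan["steps"]:
        if not isinstance(step, dict):
            return None
        if step.get("tool") not in known_tools:
            return None  # hallucinated tool name -> discard, don't guess
        if not isinstance(step.get("args"), dict):
            return None
    return plan
```

Rejecting rather than repairing keeps the failure visible, which is what makes these intermediates auditable.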

Generated from per-paper analysis; no external browsing.