AI Paper Daily (2026-03-25)

Published:

English version: /paper-news/2026-03-25/

Run statistics

  • Candidate papers: 223
  • Selected papers: 30
  • Deep reads completed: 30
  • Time window (UTC): 2026-03-23T00:00:00Z → 2026-03-24T00:00:00Z (arxiv_announce, expanded=0)
Papers used for the summary:
| arXiv ID | Title | Categories | Score | Selection rationale | Tags |
|---|---|---|---|---|---|
| 2603.21697 | Structured Visual Narratives Undermine Safety Alignment in Multimodal Large Language Models | cs.CR, cs.AI, cs.MM | 95 | Comic-based multimodal jailbreak benchmark; very high attack success across 15 MLLMs. | multimodal-safety, jailbreaks, benchmark, red-teaming, MLLM, adversarial-prompts |
| 2603.21687 | Mirage: The Illusion of Visual Understanding | cs.AI | 95 | Shows multimodal benchmarks can be gamed w/ no image; exposes "mirage reasoning" reliability failure. | multimodal, evaluation, hallucination, reliability, benchmarking, medical-ai |
| 2603.21642 | Are AI-assisted Development Tools Immune to Prompt Injection? | cs.CR, cs.SE | 93 | First empirical prompt-injection/tool-poisoning study across 7 real MCP dev clients. | prompt-injection, tool-poisoning, MCP, agent-security, empirical-study, secure-tool-use |
| 2603.21972 | Demystifying Reinforcement Learning for Long-Horizon Tool-Using Agents: A Comprehensive Recipe | cs.LG, cs.CL | 92 | Empirical recipe for scaling RL in long-horizon tool agents; actionable axes + takeaways on TravelPlanner. | tool-using agents, long-horizon RL, RLHF/RLVR, agent evaluation, reward design, planning |
| 2603.22117 | On the Direction of RLVR Updates for LLM Reasoning: Identification and Exploitation | cs.LG, cs.AI | 92 | Token-level signed Δlog p reveals reasoning-critical RLVR updates; actionable analysis + interventions. | LLM, RLVR, reasoning, post-training, mechanistic-analysis, token-level |
| 2603.21641 | Auditing MCP Servers for Over-Privileged Tool Capabilities | cs.CR, cs.SE | 90 | Practical auditing toolkit for over-privileged MCP servers with static+dynamic fuzzing. | MCP, tool-permissions, sandboxing, security-audit, fuzzing, eBPF, agent-infra |
| 2603.21461 | DSPA: Dynamic SAE Steering for Data-Efficient Preference Alignment | cs.LG, cs.AI, cs.CL | 90 | Inference-time preference alignment via prompt-conditional SAE steering; compute-light with strong benchmarks. | alignment, preference optimization, SAE, steering, mechanistic interpretability, inference-time control |
| 2603.21558 | Stabilizing Iterative Self-Training with Verified Reasoning via Symbolic Recursive Self-Alignment | cs.AI | 90 | Stabilizes recursive self-training by step-level symbolic verification; targets drift/mode-collapse risk. | self-training, recursive-self-improvement, verification, neuro-symbolic, reasoning, safety |
| 2603.21469 | Hardening Confidential Federated Compute against Side-channel Attacks | cs.CR, cs.DS | 90 | Finds side-channels that can bypass DP in confidential federated compute; proposes mitigations. | privacy, differential-privacy, federated-learning, side-channels, security, confidential-compute |
| 2603.21975 | SecureBreak -- A dataset towards safe and secure models | cs.CR, cs.AI, cs.CL, cs.LG | 88 | Security-focused dataset for robustness evaluation/training against jailbreaks/injection. | dataset, security-alignment, jailbreaks, prompt-injection, robustness-eval, guardrails |
| 2603.22214 | Evaluating the Reliability and Fidelity of Automated Judgment Systems of Large Language Models | cs.CR, cs.AI, cs.LG | 88 | Systematic study of LLM-as-judge reliability vs humans; important for scalable eval and security assessment. | evaluation, LLM-as-judge, reliability, human agreement, model auditing, safety eval |
| 2603.21693 | Deterministic Hallucination Detection in Medical VQA via Confidence-Evidence Bayesian Gain | cs.AI | 88 | Single-pass logprob-based medical MLLM hallucination detection; avoids costly multi-sample entropy methods. | hallucination-detection, MLLM, medical, VQA, uncertainty, logprobs, reliability |
| 2603.21654 | Towards Secure Retrieval-Augmented Generation: A Comprehensive Review of Threats, Defenses and Benchmarks | cs.CR, cs.AI | 86 | Comprehensive RAG security review: threats (poisoning/inference) + defenses + benchmarks. | RAG, security, data-poisoning, membership-inference, defenses, survey, benchmarking |
| 2603.21523 | SafePilot: A Framework for Assuring LLM-enabled Cyber-Physical Systems | cs.RO, cs.AI | 86 | Safety assurance framework for LLM-enabled cyber-physical systems; targets hallucination-driven unsafe acts. | CPS safety, robotics, neuro-symbolic, assurance, runtime safety, hallucinations |
| 2603.21577 | Mind over Space: Can Multimodal Large Language Models Mentally Navigate? | cs.AI | 86 | New benchmark for long-horizon spatial planning from egocentric video; targets agentic MLLM limits. | agents, benchmark, embodied-ai, multimodal, planning, long-context, evaluation |
| 2603.21607 | INTRYGUE: Induction-Aware Entropy Gating for Reliable RAG Uncertainty Estimation | cs.AI | 85 | Mechanistic RAG UQ fix: induction heads inflate entropy; proposes gating for reliability. | RAG, uncertainty, hallucinations, mechanistic-interpretability, calibration, reliability |
| 2603.21489 | Effective Strategies for Asynchronous Software Engineering Agents | cs.CL, cs.AI | 84 | Practical strategies for asynchronous multi-agent SWE; tackles interference, dependencies, and integration. | agents, software engineering, multi-agent coordination, asynchrony, long-horizon tasks, workflow |
| 2603.21925 | Guideline-grounded retrieval-augmented generation for ophthalmic clinical decision support | cs.AI | 84 | Guideline-page image RAG with routing/filtering + traceable citations; strong clinical decision support eval. | RAG, grounding, citations, multimodal, healthcare, evaluation, retrieval |
| 2603.21454 | Cross-Context Verification: Hierarchical Detection of Benchmark Contamination through Session-Isolated Analysis | cs.CL | 83 | Black-box method to detect benchmark contamination via multi-session solution diversity. | evaluation, benchmark-contamination, SWE-bench, leakage, multi-agent, audit-methods |
| 2603.21692 | Reasoning Provenance for Autonomous AI Agents: Structured Behavioral Analytics Beyond State Checkpoints and Execution Traces | cs.AI, cs.DC, cs.SE | 82 | Proposes structured reasoning provenance for agents: queryable 'why' records at scale. | agents, observability, auditing, reasoning-provenance, governance, monitoring |
| 2603.21705 | Data-Free Layer-Adaptive Merging via Fisher Information for Long-to-Short Reasoning LLMs | cs.LG | 82 | Fisher/Hessian-motivated layer-adaptive model merging for long-to-short reasoning; practical compression lever. | model-merging, reasoning, compression, Fisher-information, alignment, LLM |
| 2603.21522 | Efficient Failure Management for Multi-Agent Systems with Reasoning Trace Representation | cs.SE, cs.AI | 82 | Failure management for LLM multi-agent systems using historical patterns + trace representations. | multi-agent, reliability, monitoring, debugging, reasoning-traces, software-engineering |
| 2603.21563 | Counterfactual Credit Policy Optimization for Multi-Agent Collaboration | cs.AI | 81 | Counterfactual credit assignment for collaborative agents; reduces variance/free-riding in multi-agent RL. | multi-agent RL, credit assignment, counterfactual baselines, collaboration, agent training |
| 2603.21606 | mSFT: Addressing Dataset Mixtures Overfitting Heterogeneously in Multi-task SFT | cs.LG, cs.AI | 80 | Multi-task SFT mixture method that avoids per-dataset overfitting; broad benchmark gains. | SFT, data-mixtures, post-training, overfitting, training-recipes, LLM |
| 2603.21877 | P^2O: Joint Policy and Prompt Optimization | cs.LG, cs.AI | 80 | Combines prompt optimization with RLVR to tackle hard samples and sparse rewards; exploration boost. | RLVR, reasoning, prompt optimization, genetic search, training stability, verifiable rewards |
| 2603.21872 | Manifold-Aware Exploration for Reinforcement Learning in Video Generation | cs.CV, cs.AI | 80 | Constrains GRPO exploration to stay near the video manifold; improves stability of reward-based post-training. | RL, GRPO, video-generation, alignment, stability, exploration, diffusion |
| 2603.21663 | TAMTRL: Teacher-Aligned Reward Reshaping for Multi-Turn Reinforcement Learning in Long-Context Compression | cs.CL | 80 | Multi-turn RL for long-context compression; tackles credit assignment without heavy judge overhead. | long-context, reinforcement-learning, reward-shaping, memory, training, alignment-methods |
| 2603.21840 | Select, Label, Evaluate: Active Testing in NLP | cs.CL, cs.AI | 78 | Active Testing benchmark across many NLP datasets; reduces labeling cost while estimating performance well. | evaluation, active testing, data efficiency, benchmarking, test set design, annotation |
| 2603.22184 | Revisiting Quantum Code Generation: Where Should Domain Knowledge Live? | cs.LG, quant-ph | 78 | Compares finetune vs RAG vs agent+exec feedback for domain codegen; useful evidence on specialization tradeoffs. | code-generation, agents, RAG, execution-feedback, evaluation, domain-adaptation |
| 2603.22276 | Scaling DoRA: High-Rank Adaptation via Factored Norms and Fused Kernels | cs.LG, stat.ML | 78 | Makes high-rank DoRA practical via factored norms + fused kernels; useful for efficient adaptation. | efficiency, fine-tuning, LoRA, DoRA, systems, kernels, scaling |

AI Paper Insights Briefing

2026-03-25

0) Key Takeaways (read this first)

  • Evaluation integrity is under active attack, from code benchmarks to multimodal "vision" tests. Cross-session behavioral diversity (CCV) can flag SWE-bench contamination, while Mirage shows that many multimodal benchmarks remain largely answerable with no image provided (accuracy often retains roughly 70–80%).
  • Inference-time, reversible alignment is becoming practical. DSPA uses sparse autoencoder (SAE) features for prompt-conditional, token-conditional steering, improving MT-Bench at the cost of a modest multiple-choice regression, and stays robust on very small preference sets (≈100–250 triples).
  • Agent reliability is shifting from "smarter prompts" to software-engineering and ops primitives. CAID (git worktrees + dependency-aware delegation + test-gated merging) improves long-horizon SWE benchmarks; EAGER and AER propose trace representations for faster failure detection and population-level behavioral analysis.
  • Security attention is moving to tool boundaries (MCP) and RAG pipelines. Empirical testing of MCP clients finds that no client blocks every tool-poisoning attack; protocol-aware auditing (static rules + dynamic eBPF fuzzing) catches over-privileged servers; a large RAG-security survey consolidates threats, defenses, and benchmarks.
  • RL/RLVR for reasoning is being "debugged" at the token and credit-assignment level. Directional token shifts (signed Δlog p) explain sparse RLVR changes and support test-time extrapolation plus training-time reweighting; CCPO and TAMTRL reshape credit assignment for multi-agent collaboration and multi-turn memory RL; P²O breaks the "zero reward on hard samples" dead zone via prompt evolution and in-context distillation.
  • Formal verification and DP are re-entering the loop as practical mitigations. SafePilot uses Z3/Spot to verify LLM-generated CPS plans; work on confidential federated compute shows DP can be weakened by side channels unless message padding and a DP-resize mechanism are added.

2) Key Themes (clusters)

Theme: Benchmark trustworthiness and contamination (code + multimodal)

Theme: Inference-time alignment and mechanistic uncertainty signals

Theme: Agent engineering for long-horizon reliability (collaboration, debugging, provenance)

Theme: Tool/RAG security and privacy leakage in "secure" compute

Theme: Stabilizing RL/RLVR through better credit assignment and exploration control

3) Technical Synthesis

  • Behavioral counterfactuals are becoming a general-purpose diagnostic: CCV uses session-isolated repeated solving; Mirage uses no-image controls; CCPO uses counterfactual rollouts; CEBaG uses text-only vs. multimodal scoring forward passes.
  • White-box signals are increasingly used to patch evaluation and safety gaps: induction-head SinkRate (INTRYGUE), SAE latents (DSPA), token logprob variance / evidence gain (CEBaG), signed Δlog p (RLVR direction).
  • Credit assignment is converging on "normalize + bounded shaping": CCPO's EMA z-scoring / tanh shaping; TAMTRL's min–max normalization (removing it collapses training); SAGE-GRPO's per-timestep equalizer; RLVR reweighting that upweights low-probability tokens.
  • Agent-reliability work is splitting into two layers: (a) collaboration primitives (CAID's worktrees/merging/tests) and (b) observability primitives (EAGER's embeddings for failure retrieval; AER's schema + mock replay).
  • Security is moving from "model jailbreaks" to "system-boundary jailbreaks": MCP tool-metadata poisoning and over-privileged servers; RAG pipeline threats; DP side channels in TEEs.
  • Formal methods are used as practical guardrails rather than end-to-end verification: SafePilot verifies plans with Z3/Spot and iteratively re-prompts; the DP side-channel mitigations come with theorems but target specific channels.
  • Data efficiency is a shared theme across alignment and evaluation: DSPA works under severe preference-data limits; Active Testing cuts labeling by up to 95%; mSFT saves wasted compute by excluding sub-datasets that overfit early.
  • "Training-free" or "no weight update" is not just a convenience; it is becoming a safety/ops feature: DSPA steering is reversible; FIM-based merging is data-free; INTRYGUE is training-free; CEBaG is deterministic and sampling-free.
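The "normalize + bounded shaping" pattern from the credit-assignment bullet can be sketched in a few lines. This is a generic stand-in, not CCPO's or TAMTRL's actual formulas; `EmaNormalizer` and the tanh bound are illustrative assumptions:

```python
import math

def shaped_reward(raw, mean, std, scale=1.0):
    """Z-score a raw reward against running statistics, then squash it
    with tanh so the shaped signal is bounded in (-scale, scale)."""
    z = (raw - mean) / (std + 1e-8)
    return scale * math.tanh(z)

class EmaNormalizer:
    """Tracks an exponential moving mean/variance of raw rewards and
    emits bounded shaped rewards, one sample at a time."""
    def __init__(self, beta=0.99):
        self.beta = beta
        self.mean = 0.0
        self.var = 1.0

    def update(self, raw):
        delta = raw - self.mean
        self.mean += (1 - self.beta) * delta
        self.var = self.beta * (self.var + (1 - self.beta) * delta * delta)
        return shaped_reward(raw, self.mean, math.sqrt(self.var))
```

The bound matters in practice: outlier rewards (for example a single very large verifier score) cannot dominate the gradient, which is the variance-control property these papers lean on.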

4) Top 5 Papers (with "why now")

1) Mirage: The Illusion of Visual Understanding

  • Shows that frontier multimodal models often confidently describe nonexistent images and still score highly when the image is omitted (mean mirage scores of roughly 70–80%).
  • Demonstrates benchmark fragility: B-Clean removes roughly 74–77% of questions in some benchmarks and can substantially change accuracies and rankings.
  • Why now: multimodal models are being deployed in high-stakes domains (medicine); this work provides a scalable evaluation control (no-image runs) and a cleaning protocol.
  • Be skeptical: B-Clean depends on a model ensemble, and the mechanistic cause of mirages is not fully identified.
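A no-image control of this kind is straightforward to wire into an eval harness. A minimal sketch follows; the `answer_fn` interface and the returned metric names are assumptions for illustration, not the paper's exact protocol:

```python
def no_image_control(answer_fn, dataset):
    """Run an eval twice: once with the real image and once with image=None.
    High blind accuracy means the benchmark is largely answerable without
    vision. dataset: iterable of (question, image, gold) triples."""
    n = full = blind = 0
    for question, image, gold in dataset:
        full += answer_fn(question, image) == gold
        blind += answer_fn(question, None) == gold
        n += 1
    return {"full_acc": full / n, "blind_acc": blind / n}
```

Tracking `blind_acc` per benchmark over time also doubles as a regression test against text-side leakage in newly added questions.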

2) Cross-Context Verification: Hierarchical Detection of Benchmark Contamination through Session-Isolated Analysis

  • Proposes a black-box, API-only contamination detector using session-isolated repeated trials and patch-diversity metrics.
  • Reports perfect separation of contaminated vs. genuine reasoning on 9 SWE-bench problems (a small but telling sample), plus a bias-resistant analysis pipeline (HCCA).
  • Why now: coding benchmarks are central to frontier claims, and this method audits them without access to model internals.
  • Be skeptical: evaluated on only 9 problems and 1 model; the reasoning classifier is heuristic and evaluated on the same data.
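The diversity signal behind this style of audit can be approximated with pairwise distances over session-isolated attempts. The toy token-level Jaccard metric below is an illustrative stand-in, not CCV's actual patch-diversity measure:

```python
from itertools import combinations

def _token_set(patch: str) -> set:
    """Crude tokenization of a patch; real audits would diff ASTs/hunks."""
    return set(patch.split())

def solution_diversity(patches) -> float:
    """Mean pairwise Jaccard distance across independent solution attempts.
    Near-zero diversity across isolated sessions (verbatim-identical
    solutions) is one signal consistent with memorized/contaminated data."""
    patches = list(patches)
    if len(patches) < 2:
        return 0.0
    dists = []
    for a, b in combinations(patches, 2):
        sa, sb = _token_set(a), _token_set(b)
        union = sa | sb
        dists.append(1 - len(sa & sb) / len(union) if union else 0.0)
    return sum(dists) / len(dists)
```

Low diversity alone is not proof of contamination (some tasks have one canonical fix), which is why the paper pairs the metric with a hierarchical analysis step.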

3) DSPA: Dynamic SAE Steering for Data-Efficient Preference Alignment

  • Inference-time, prompt-conditional sparse steering in SAE space; edits only the latents of token activations.
  • Improves MT-Bench across multiple models and stays robust on very small preference sets (down to roughly 100–250 triples); large compute savings over a two-stage baseline (4.47× FLOPs modeled; 11.5× wall-clock measured).
  • Why now: rising demand for low-cost, reversible alignment with mechanistic auditability.
  • Be skeptical: depends on SAE availability and quality; open-ended evaluation relies on LLM judges; no formal safety guarantees.
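As an illustration of SAE-space steering in general (not DSPA's prompt-conditional policy), one can shift selected latents and add back only the decoded delta, so the SAE's reconstruction error is never injected into the residual stream. The linear ReLU SAE here is a hypothetical stand-in:

```python
import numpy as np

def sae_steer(activation, encoder, decoder, feature_ids, deltas):
    """Edit chosen SAE latents of one activation vector.
    encoder: (k, d) weights, decoder: (d, k) weights of a (hypothetical)
    pretrained sparse autoencoder; feature_ids/deltas pick which latents
    to shift and by how much."""
    latents = np.maximum(encoder @ activation, 0.0)  # ReLU encode
    baseline = decoder @ latents                     # decode before the edit
    latents[feature_ids] += deltas                   # steer selected features
    # Apply only the change, not the full (lossy) reconstruction.
    return activation + (decoder @ latents - baseline)
```

Reversibility falls out of the formulation: with zero deltas the activation passes through unchanged, so the intervention can be switched off per prompt or per token.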

4) Are AI-assisted Development Tools Immune to Prompt Injection?

  • Empirically tests tool-poisoning prompt injection with 4 concrete attacks across 7 MCP clients and finds that no client blocks all of them.
  • Highlights large variance: Cursor was unsafe under every tested attack; Claude Desktop and Cline were strongest in the tested configurations; many clients lack static validation, sandboxing, or audit logs.
  • Why now: MCP-style tool ecosystems are fast becoming the default in IDE/CLI workflows, so this is a direct operational risk.
  • Be skeptical: limited to specific versions/configurations and a local testbed; sandboxing was partly assessed from documentation.
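Treating tool metadata as untrusted input can start with something as simple as a deny-list scan before a client exposes tool descriptions to the model. The patterns below are illustrative assumptions; a real client would also need semantic and provenance checks:

```python
import re

# Hypothetical injection-style phrases; real audits need far broader coverage.
SUSPICIOUS = [
    re.compile(r"ignore (all |previous )?instructions", re.I),
    re.compile(r"<\s*important\s*>", re.I),
    re.compile(r"do not (tell|inform) the user", re.I),
]

def flag_tool_metadata(tool: dict) -> list:
    """Scan an MCP-style tool's name and description for injection-like
    phrasing; returns the matched patterns so they can be logged/audited."""
    text = " ".join(str(tool.get(k, "")) for k in ("name", "description"))
    return [p.pattern for p in SUSPICIOUS if p.search(text)]
```

A scan like this is a tripwire, not a defense: the paper's findings suggest pairing it with static validation, sandboxed execution, and audit logging.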

5) On the Direction of RLVR Updates for LLM Reasoning: Identification and Exploitation

  • Argues that RLVR changes should be understood through signed token-probability shifts (Δlog p) rather than magnitude-only metrics.
  • Shows that swapping in Δlog p-selected tokens recovers RLVR performance with only about 10% of tokens exchanged; proposes test-time extrapolation and training-time advantage reweighting with reported gains (e.g., Avg@32 improvements on math sets such as AIME).
  • Why now: RLVR is widely used for reasoning, and this work offers interpretability plus actionable knobs.
  • Be skeptical: extrapolation requires both the base and RL models at test time and introduces tunable hyperparameters (τ, γ).
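The signed-shift diagnostic itself is trivial to compute once per-token log-probs are available from both the base and RL models. A sketch under that assumption (the paper's actual selection thresholds are more involved):

```python
def signed_logp_shift(base_logps, rl_logps):
    """Per-token signed shift: Δlog p = log p_RL(token) - log p_base(token).
    Positive entries are tokens the RL update promoted; negative, suppressed."""
    return [r - b for b, r in zip(base_logps, rl_logps)]

def most_promoted(tokens, base_logps, rl_logps, k=1):
    """Tokens ranked by largest positive shift, keeping the sign information
    that magnitude-only metrics discard."""
    shifts = signed_logp_shift(base_logps, rl_logps)
    order = sorted(range(len(tokens)), key=shifts.__getitem__, reverse=True)
    return [tokens[i] for i in order[:k]]
```

Ranking with the sign kept is the point: a large |Δ| with negative sign marks suppression, which a magnitude-only view would lump together with promotion.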

5) Practical Next Steps

  • Add counterfactual controls to your eval harness: run multimodal suites in no-image "mirage mode"; for code tasks, run session-isolated repeated solving and measure solution diversity (CCV-style).
  • Treat tool metadata as untrusted input: adopt MCP server auditing (static rules + optional dynamic sandbox/eBPF), and require capability manifests plus least-privilege hardening before deployment.
  • Add structured provenance for agents (intent/observation/inference plus evidence chains) and enable mock replay, so prompt or model changes can be regression-tested against a fixed incident corpus.
  • For multi-agent SWE: enforce physical isolation (git worktrees/branches), dependency-aware delegation, and test-gated merging; measure integration-failure rate as a function of agent-engineer count to find the parallelization "knee".
  • If you run RAG: evaluate uncertainty methods that reflect how the context is used (e.g., induction-head activity), and track retrieval quality separately to avoid "faithful but wrong" confidence.
  • For RLVR / agent RL: prioritize better credit assignment; try counterfactual marginal rewards (CCPO) for collaboration, and consider probability-aware reweighting so low-probability but critical tokens are not ignored.
  • For safety-critical planning (CPS/robotics): integrate a formal-verification loop (Z3/Spot) and log verification failures as first-class training/eval artifacts.
  • For DP-in-TEE deployments: audit metadata side channels (message lengths, allocations/page faults) and, where applicable, consider DP padding plus a DP-timed resize mechanism.
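Bucketed message padding, the simplest of the side-channel mitigations listed above, can be sketched as follows. Fixed buckets are shown for clarity; a DP variant would additionally randomize the padded size, which this sketch omits:

```python
def pad_to_bucket(payload: bytes, buckets=(256, 1024, 4096)) -> bytes:
    """Pad a message up to the next fixed bucket size so an observer of
    message lengths learns only the bucket, not the exact payload size
    (coarse length side-channel mitigation)."""
    for size in buckets:
        if len(payload) <= size:
            return payload + b"\x00" * (size - len(payload))
    raise ValueError("payload exceeds largest bucket")
```

The receiver must know how to strip the padding (e.g., a length prefix inside the padded frame), and bucket boundaries themselves still leak coarse size information, which is why the cited work layers DP on top.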

Generated from per-paper analysis; no external browsing.