AI Paper Daily (2026-03-25)

Published:

English version: /paper-news/2026-03-25/

Run statistics

  • Candidate papers: 223
  • Selected papers: 30
  • Deep reads completed: 30
  • Time window (UTC): 2026-03-23T00:00:00Z → 2026-03-24T00:00:00Z (arxiv_announce, expanded=0)
Papers used for the summary:
| arXiv ID | Title | Categories | Score | Selection rationale | Tags |
|---|---|---|---|---|---|
| 2603.21697 | Structured Visual Narratives Undermine Safety Alignment in Multimodal Large Language Models | cs.CR, cs.AI, cs.MM | 95 | Comic-based multimodal jailbreak benchmark; very high attack success across 15 MLLMs. | multimodal-safety, jailbreaks, benchmark, red-teaming, MLLM, adversarial-prompts |
| 2603.21687 | Mirage: The Illusion of Visual Understanding | cs.AI | 95 | Shows multimodal benchmarks can be gamed w/ no image; exposes "mirage reasoning" reliability failure. | multimodal, evaluation, hallucination, reliability, benchmarking, medical-ai |
| 2603.21642 | Are AI-assisted Development Tools Immune to Prompt Injection? | cs.CR, cs.SE | 93 | First empirical prompt-injection/tool-poisoning study across 7 real MCP dev clients. | prompt-injection, tool-poisoning, MCP, agent-security, empirical-study, secure-tool-use |
| 2603.21972 | Demystifying Reinforcement Learning for Long-Horizon Tool-Using Agents: A Comprehensive Recipe | cs.LG, cs.CL | 92 | Empirical recipe for scaling RL in long-horizon tool agents; actionable axes + takeaways on TravelPlanner. | tool-using agents, long-horizon RL, RLHF/RLVR, agent evaluation, reward design, planning |
| 2603.22117 | On the Direction of RLVR Updates for LLM Reasoning: Identification and Exploitation | cs.LG, cs.AI | 92 | Token-level signed Δlog p reveals reasoning-critical RLVR updates; actionable analysis + interventions. | LLM, RLVR, reasoning, post-training, mechanistic-analysis, token-level |
| 2603.21641 | Auditing MCP Servers for Over-Privileged Tool Capabilities | cs.CR, cs.SE | 90 | Practical auditing toolkit for over-privileged MCP servers with static+dynamic fuzzing. | MCP, tool-permissions, sandboxing, security-audit, fuzzing, eBPF, agent-infra |
| 2603.21461 | DSPA: Dynamic SAE Steering for Data-Efficient Preference Alignment | cs.LG, cs.AI, cs.CL | 90 | Inference-time preference alignment via prompt-conditional SAE steering; compute-light with strong benchmarks. | alignment, preference optimization, SAE, steering, mechanistic interpretability, inference-time control |
| 2603.21558 | Stabilizing Iterative Self-Training with Verified Reasoning via Symbolic Recursive Self-Alignment | cs.AI | 90 | Stabilizes recursive self-training by step-level symbolic verification; targets drift/mode-collapse risk. | self-training, recursive-self-improvement, verification, neuro-symbolic, reasoning, safety |
| 2603.21469 | Hardening Confidential Federated Compute against Side-channel Attacks | cs.CR, cs.DS | 90 | Finds side-channels that can bypass DP in confidential federated compute; proposes mitigations. | privacy, differential-privacy, federated-learning, side-channels, security, confidential-compute |
| 2603.21975 | SecureBreak -- A dataset towards safe and secure models | cs.CR, cs.AI, cs.CL, cs.LG | 88 | Security-focused dataset for robustness evaluation/training against jailbreaks/injection. | dataset, security-alignment, jailbreaks, prompt-injection, robustness-eval, guardrails |
| 2603.22214 | Evaluating the Reliability and Fidelity of Automated Judgment Systems of Large Language Models | cs.CR, cs.AI, cs.LG | 88 | Systematic study of LLM-as-judge reliability vs humans; important for scalable eval and security assessment. | evaluation, LLM-as-judge, reliability, human agreement, model auditing, safety eval |
| 2603.21693 | Deterministic Hallucination Detection in Medical VQA via Confidence-Evidence Bayesian Gain | cs.AI | 88 | Single-pass logprob-based medical MLLM hallucination detection; avoids costly multi-sample entropy methods. | hallucination-detection, MLLM, medical, VQA, uncertainty, logprobs, reliability |
| 2603.21654 | Towards Secure Retrieval-Augmented Generation: A Comprehensive Review of Threats, Defenses and Benchmarks | cs.CR, cs.AI | 86 | Comprehensive RAG security review: threats (poisoning/inference) + defenses + benchmarks. | RAG, security, data-poisoning, membership-inference, defenses, survey, benchmarking |
| 2603.21523 | SafePilot: A Framework for Assuring LLM-enabled Cyber-Physical Systems | cs.RO, cs.AI | 86 | Safety assurance framework for LLM-enabled cyber-physical systems; targets hallucination-driven unsafe acts. | CPS safety, robotics, neuro-symbolic, assurance, runtime safety, hallucinations |
| 2603.21577 | Mind over Space: Can Multimodal Large Language Models Mentally Navigate? | cs.AI | 86 | New benchmark for long-horizon spatial planning from egocentric video; targets agentic MLLM limits. | agents, benchmark, embodied-ai, multimodal, planning, long-context, evaluation |
| 2603.21607 | INTRYGUE: Induction-Aware Entropy Gating for Reliable RAG Uncertainty Estimation | cs.AI | 85 | Mechanistic RAG UQ fix: induction heads inflate entropy; proposes gating for reliability. | RAG, uncertainty, hallucinations, mechanistic-interpretability, calibration, reliability |
| 2603.21489 | Effective Strategies for Asynchronous Software Engineering Agents | cs.CL, cs.AI | 84 | Practical strategies for asynchronous multi-agent SWE; tackles interference, dependencies, and integration. | agents, software engineering, multi-agent coordination, asynchrony, long-horizon tasks, workflow |
| 2603.21925 | Guideline-grounded retrieval-augmented generation for ophthalmic clinical decision support | cs.AI | 84 | Guideline-page image RAG with routing/filtering + traceable citations; strong clinical decision support eval. | RAG, grounding, citations, multimodal, healthcare, evaluation, retrieval |
| 2603.21454 | Cross-Context Verification: Hierarchical Detection of Benchmark Contamination through Session-Isolated Analysis | cs.CL | 83 | Black-box method to detect benchmark contamination via multi-session solution diversity. | evaluation, benchmark-contamination, SWE-bench, leakage, multi-agent, audit-methods |
| 2603.21692 | Reasoning Provenance for Autonomous AI Agents: Structured Behavioral Analytics Beyond State Checkpoints and Execution Traces | cs.AI, cs.DC, cs.SE | 82 | Proposes structured reasoning provenance for agents: queryable 'why' records at scale. | agents, observability, auditing, reasoning-provenance, governance, monitoring |
| 2603.21705 | Data-Free Layer-Adaptive Merging via Fisher Information for Long-to-Short Reasoning LLMs | cs.LG | 82 | Fisher/Hessian-motivated layer-adaptive model merging for long-to-short reasoning; practical compression lever. | model-merging, reasoning, compression, Fisher-information, alignment, LLM |
| 2603.21522 | Efficient Failure Management for Multi-Agent Systems with Reasoning Trace Representation | cs.SE, cs.AI | 82 | Failure management for LLM multi-agent systems using historical patterns + trace representations. | multi-agent, reliability, monitoring, debugging, reasoning-traces, software-engineering |
| 2603.21563 | Counterfactual Credit Policy Optimization for Multi-Agent Collaboration | cs.AI | 81 | Counterfactual credit assignment for collaborative agents; reduces variance/free-riding in multi-agent RL. | multi-agent RL, credit assignment, counterfactual baselines, collaboration, agent training |
| 2603.21606 | mSFT: Addressing Dataset Mixtures Overfitting Heterogeneously in Multi-task SFT | cs.LG, cs.AI | 80 | Multi-task SFT mixture method that avoids per-dataset overfitting; broad benchmark gains. | SFT, data-mixtures, post-training, overfitting, training-recipes, LLM |
| 2603.21877 | P^2O: Joint Policy and Prompt Optimization | cs.LG, cs.AI | 80 | Combines prompt optimization with RLVR to tackle hard samples and sparse rewards; exploration boost. | RLVR, reasoning, prompt optimization, genetic search, training stability, verifiable rewards |
| 2603.21872 | Manifold-Aware Exploration for Reinforcement Learning in Video Generation | cs.CV, cs.AI | 80 | Constrains GRPO exploration to stay near the video manifold; improves stability of reward-based post-training. | RL, GRPO, video-generation, alignment, stability, exploration, diffusion |
| 2603.21663 | TAMTRL: Teacher-Aligned Reward Reshaping for Multi-Turn Reinforcement Learning in Long-Context Compression | cs.CL | 80 | Multi-turn RL for long-context compression; tackles credit assignment without heavy judge overhead. | long-context, reinforcement-learning, reward-shaping, memory, training, alignment-methods |
| 2603.21840 | Select, Label, Evaluate: Active Testing in NLP | cs.CL, cs.AI | 78 | Active Testing benchmark across many NLP datasets; reduces labeling cost while estimating performance well. | evaluation, active testing, data efficiency, benchmarking, test set design, annotation |
| 2603.22184 | Revisiting Quantum Code Generation: Where Should Domain Knowledge Live? | cs.LG, quant-ph | 78 | Compares finetune vs RAG vs agent+exec feedback for domain codegen; useful evidence on specialization tradeoffs. | code-generation, agents, RAG, execution-feedback, evaluation, domain-adaptation |
| 2603.22276 | Scaling DoRA: High-Rank Adaptation via Factored Norms and Fused Kernels | cs.LG, stat.ML | 78 | Makes high-rank DoRA practical via factored norms + fused kernels; useful for efficient adaptation. | efficiency, fine-tuning, LoRA, DoRA, systems, kernels, scaling |

AI Paper Insights Briefing

2026-03-25

0) Key Takeaways (read this first)

  • Evaluation integrity is under active attack, from code benchmarks to multimodal "vision" tests. Cross-session behavioral diversity (CCV) can flag SWE-bench contamination, while Mirage shows that many multimodal benchmarks remain largely answerable with no image provided (accuracy often retains roughly 70–80%).
  • Inference-time, reversible alignment is becoming practical. DSPA uses sparse autoencoder (SAE) features for prompt-conditional, token-conditional steering, improving MT-Bench at the cost of a modest multiple-choice regression, and stays robust on very small preference sets (≈100–250 triples).
  • Agent reliability is shifting from "smarter prompts" to software-engineering and ops primitives. CAID (git worktrees + dependency-aware delegation + test-gated merging) improves long-horizon SWE benchmarks; EAGER and AER propose trace representations for faster failure detection and population-level behavioral analysis.
  • Security attention is moving to tool boundaries (MCP) and RAG pipelines. Empirical testing of MCP clients finds that no client blocks every tool-poisoning attack; protocol-aware auditing (static rules + dynamic eBPF fuzzing) catches over-privileged servers; a large RAG-security survey consolidates threats, defenses, and benchmarks.
  • RL/RLVR for reasoning is being "debugged" at the token and credit-assignment level. Directional token shifts (signed Δlog p) explain sparse RLVR changes and support test-time extrapolation plus training-time reweighting; CCPO and TAMTRL reshape credit assignment for multi-agent collaboration and multi-turn memory RL; P²O breaks the "zero reward on hard samples" dead zone via prompt evolution and in-context distillation.
  • Formal verification and DP are re-entering the loop as practical mitigations. SafePilot uses Z3/Spot to verify LLM-generated CPS plans; work on confidential federated compute shows DP can be weakened by side channels unless message padding and a DP-resize mechanism are added.

2) Key Themes (clusters)

Theme: Benchmark trustworthiness and contamination (code + multimodal)

Theme: Inference-time alignment and mechanistic uncertainty signals

Theme: Agent engineering for long-horizon reliability (collaboration, debugging, provenance)

Theme: Tool/RAG security and privacy leakage in "secure" compute

Theme: Stabilizing RL/RLVR through better credit assignment and exploration control

3) Technical Synthesis

  • Behavioral counterfactuals are becoming a general-purpose diagnostic: CCV uses session-isolated repeated solving; Mirage uses no-image controls; CCPO uses counterfactual rollouts; CEBaG uses text-only vs. multimodal scoring forward passes.
  • White-box signals are increasingly used to patch evaluation and safety gaps: induction-head SinkRate (INTRYGUE), SAE latents (DSPA), token logprob variance / evidence gain (CEBaG), signed Δlog p (RLVR direction).
  • Credit assignment is converging on "normalize + bounded shaping": CCPO's EMA z-scoring / tanh shaping; TAMTRL's min–max normalization (removing it collapses training); SAGE-GRPO's per-timestep equalizer; RLVR reweighting that upweights low-probability tokens.
  • Agent-reliability work is splitting into two layers: (a) collaboration primitives (CAID's worktrees/merging/tests) and (b) observability primitives (EAGER's embeddings for failure retrieval; AER's schema + mock replay).
  • Security is moving from "model jailbreaks" to "system-boundary jailbreaks": MCP tool-metadata poisoning and over-privileged servers; RAG pipeline threats; DP side channels in TEEs.
  • Formal methods are used as practical guardrails rather than end-to-end verification: SafePilot verifies plans with Z3/Spot and iteratively re-prompts; the DP side-channel mitigations come with theorems but target specific channels.
  • Data efficiency is a shared theme across alignment and evaluation: DSPA works under severe preference-data limits; Active Testing cuts labeling by up to 95%; mSFT saves wasted compute by excluding sub-datasets that overfit early.
  • "Training-free" or "no weight update" is not just a convenience; it is becoming a safety/ops feature: DSPA steering is reversible; FIM-based merging is data-free; INTRYGUE is training-free; CEBaG is deterministic and sampling-free.
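The "normalize + bounded shaping" pattern from the credit-assignment bullet can be sketched in a few lines. This is a generic stand-in, not CCPO's or TAMTRL's actual formulas; `EmaNormalizer` and the tanh bound are illustrative assumptions:

```python
import math

def shaped_reward(raw, mean, std, scale=1.0):
    """Z-score a raw reward against running statistics, then squash it
    with tanh so the shaped signal is bounded in (-scale, scale)."""
    z = (raw - mean) / (std + 1e-8)
    return scale * math.tanh(z)

class EmaNormalizer:
    """Tracks an exponential moving mean/variance of raw rewards and
    emits bounded shaped rewards, one sample at a time."""
    def __init__(self, beta=0.99):
        self.beta = beta
        self.mean = 0.0
        self.var = 1.0

    def update(self, raw):
        delta = raw - self.mean
        self.mean += (1 - self.beta) * delta
        self.var = self.beta * (self.var + (1 - self.beta) * delta * delta)
        return shaped_reward(raw, self.mean, math.sqrt(self.var))
```

The bound matters in practice: outlier rewards (for example a single very large verifier score) cannot dominate the gradient, which is the variance-control property these papers lean on.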

4) Top 5 Papers (with "why now")

1) Mirage: The Illusion of Visual Understanding

  • Shows that frontier multimodal models often confidently describe nonexistent images and still score highly when the image is omitted (mean mirage scores of roughly 70–80%).
  • Demonstrates benchmark fragility: B-Clean removes roughly 74–77% of questions in some benchmarks and can substantially change accuracies and rankings.
  • Why now: multimodal models are being deployed in high-stakes domains (medicine); this work provides a scalable evaluation control (no-image runs) and a cleaning protocol.
  • Be skeptical: B-Clean depends on a model ensemble, and the mechanistic cause of mirages is not fully identified.
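A no-image control of this kind is straightforward to wire into an eval harness. A minimal sketch follows; the `answer_fn` interface and the returned metric names are assumptions for illustration, not the paper's exact protocol:

```python
def no_image_control(answer_fn, dataset):
    """Run an eval twice: once with the real image and once with image=None.
    High blind accuracy means the benchmark is largely answerable without
    vision. dataset: iterable of (question, image, gold) triples."""
    n = full = blind = 0
    for question, image, gold in dataset:
        full += answer_fn(question, image) == gold
        blind += answer_fn(question, None) == gold
        n += 1
    return {"full_acc": full / n, "blind_acc": blind / n}
```

Tracking `blind_acc` per benchmark over time also doubles as a regression test against text-side leakage in newly added questions.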

2) Cross-Context Verification: Hierarchical Detection of Benchmark Contamination through Session-Isolated Analysis

  • Proposes a black-box, API-only contamination detector using session-isolated repeated trials and patch-diversity metrics.
  • Reports perfect separation of contaminated vs. genuine reasoning on 9 SWE-bench problems (a small but telling sample), plus a bias-resistant analysis pipeline (HCCA).
  • Why now: coding benchmarks are central to frontier claims, and this method audits them without access to model internals.
  • Be skeptical: evaluated on only 9 problems and 1 model; the reasoning classifier is heuristic and evaluated on the same data.
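The diversity signal behind this style of audit can be approximated with pairwise distances over session-isolated attempts. The toy token-level Jaccard metric below is an illustrative stand-in, not CCV's actual patch-diversity measure:

```python
from itertools import combinations

def _token_set(patch: str) -> set:
    """Crude tokenization of a patch; real audits would diff ASTs/hunks."""
    return set(patch.split())

def solution_diversity(patches) -> float:
    """Mean pairwise Jaccard distance across independent solution attempts.
    Near-zero diversity across isolated sessions (verbatim-identical
    solutions) is one signal consistent with memorized/contaminated data."""
    patches = list(patches)
    if len(patches) < 2:
        return 0.0
    dists = []
    for a, b in combinations(patches, 2):
        sa, sb = _token_set(a), _token_set(b)
        union = sa | sb
        dists.append(1 - len(sa & sb) / len(union) if union else 0.0)
    return sum(dists) / len(dists)
```

Low diversity alone is not proof of contamination (some tasks have one canonical fix), which is why the paper pairs the metric with a hierarchical analysis step.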

3) DSPA: Dynamic SAE Steering for Data-Efficient Preference Alignment

  • Inference-time, prompt-conditional sparse steering in SAE space; edits only the latents of token activations.
  • Improves MT-Bench across multiple models and stays robust on very small preference sets (down to roughly 100–250 triples); large compute savings over a two-stage baseline (4.47× FLOPs modeled; 11.5× wall-clock measured).
  • Why now: rising demand for low-cost, reversible alignment with mechanistic auditability.
  • Be skeptical: depends on SAE availability and quality; open-ended evaluation relies on LLM judges; no formal safety guarantees.
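As an illustration of SAE-space steering in general (not DSPA's prompt-conditional policy), one can shift selected latents and add back only the decoded delta, so the SAE's reconstruction error is never injected into the residual stream. The linear ReLU SAE here is a hypothetical stand-in:

```python
import numpy as np

def sae_steer(activation, encoder, decoder, feature_ids, deltas):
    """Edit chosen SAE latents of one activation vector.
    encoder: (k, d) weights, decoder: (d, k) weights of a (hypothetical)
    pretrained sparse autoencoder; feature_ids/deltas pick which latents
    to shift and by how much."""
    latents = np.maximum(encoder @ activation, 0.0)  # ReLU encode
    baseline = decoder @ latents                     # decode before the edit
    latents[feature_ids] += deltas                   # steer selected features
    # Apply only the change, not the full (lossy) reconstruction.
    return activation + (decoder @ latents - baseline)
```

Reversibility falls out of the formulation: with zero deltas the activation passes through unchanged, so the intervention can be switched off per prompt or per token.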

4) Are AI-assisted Development Tools Immune to Prompt Injection?

  • Empirically tests tool-poisoning prompt injection with 4 concrete attacks across 7 MCP clients and finds that no client blocks all of them.
  • Highlights large variance: Cursor was unsafe under every tested attack; Claude Desktop and Cline were strongest in the tested configurations; many clients lack static validation, sandboxing, or audit logs.
  • Why now: MCP-style tool ecosystems are fast becoming the default in IDE/CLI workflows, so this is a direct operational risk.
  • Be skeptical: limited to specific versions/configurations and a local testbed; sandboxing was partly assessed from documentation.
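Treating tool metadata as untrusted input can start with something as simple as a deny-list scan before a client exposes tool descriptions to the model. The patterns below are illustrative assumptions; a real client would also need semantic and provenance checks:

```python
import re

# Hypothetical injection-style phrases; real audits need far broader coverage.
SUSPICIOUS = [
    re.compile(r"ignore (all |previous )?instructions", re.I),
    re.compile(r"<\s*important\s*>", re.I),
    re.compile(r"do not (tell|inform) the user", re.I),
]

def flag_tool_metadata(tool: dict) -> list:
    """Scan an MCP-style tool's name and description for injection-like
    phrasing; returns the matched patterns so they can be logged/audited."""
    text = " ".join(str(tool.get(k, "")) for k in ("name", "description"))
    return [p.pattern for p in SUSPICIOUS if p.search(text)]
```

A scan like this is a tripwire, not a defense: the paper's findings suggest pairing it with static validation, sandboxed execution, and audit logging.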

5) On the Direction of RLVR Updates for LLM Reasoning: Identification and Exploitation

  • Argues that RLVR changes should be understood through signed token-probability shifts (Δlog p) rather than magnitude-only metrics.
  • Shows that swapping in Δlog p-selected tokens recovers RLVR performance with only about 10% of tokens exchanged; proposes test-time extrapolation and training-time advantage reweighting with reported gains (e.g., Avg@32 improvements on math sets such as AIME).
  • Why now: RLVR is widely used for reasoning, and this work offers interpretability plus actionable knobs.
  • Be skeptical: extrapolation requires both the base and RL models at test time and introduces tunable hyperparameters (τ, γ).
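The signed-shift diagnostic itself is trivial to compute once per-token log-probs are available from both the base and RL models. A sketch under that assumption (the paper's actual selection thresholds are more involved):

```python
def signed_logp_shift(base_logps, rl_logps):
    """Per-token signed shift: Δlog p = log p_RL(token) - log p_base(token).
    Positive entries are tokens the RL update promoted; negative, suppressed."""
    return [r - b for b, r in zip(base_logps, rl_logps)]

def most_promoted(tokens, base_logps, rl_logps, k=1):
    """Tokens ranked by largest positive shift, keeping the sign information
    that magnitude-only metrics discard."""
    shifts = signed_logp_shift(base_logps, rl_logps)
    order = sorted(range(len(tokens)), key=shifts.__getitem__, reverse=True)
    return [tokens[i] for i in order[:k]]
```

Ranking with the sign kept is the point: a large |Δ| with negative sign marks suppression, which a magnitude-only view would lump together with promotion.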

5) Practical Next Steps

  • Add counterfactual controls to your eval harness: run multimodal suites in no-image "mirage mode"; for code tasks, run session-isolated repeated solving and measure solution diversity (CCV-style).
  • Treat tool metadata as untrusted input: adopt MCP server auditing (static rules + optional dynamic sandbox/eBPF), and require capability manifests plus least-privilege hardening before deployment.
  • Add structured provenance for agents (intent/observation/inference plus evidence chains) and enable mock replay, so prompt or model changes can be regression-tested against a fixed incident corpus.
  • For multi-agent SWE: enforce physical isolation (git worktrees/branches), dependency-aware delegation, and test-gated merging; measure integration-failure rate as a function of agent-engineer count to find the parallelization "knee".
  • If you run RAG: evaluate uncertainty methods that reflect how the context is used (e.g., induction-head activity), and track retrieval quality separately to avoid "faithful but wrong" confidence.
  • For RLVR / agent RL: prioritize better credit assignment; try counterfactual marginal rewards (CCPO) for collaboration, and consider probability-aware reweighting so low-probability but critical tokens are not ignored.
  • For safety-critical planning (CPS/robotics): integrate a formal-verification loop (Z3/Spot) and log verification failures as first-class training/eval artifacts.
  • For DP-in-TEE deployments: audit metadata side channels (message lengths, allocations/page faults) and, where applicable, consider DP padding plus a DP-timed resize mechanism.
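Bucketed message padding, the simplest of the side-channel mitigations listed above, can be sketched as follows. Fixed buckets are shown for clarity; a DP variant would additionally randomize the padded size, which this sketch omits:

```python
def pad_to_bucket(payload: bytes, buckets=(256, 1024, 4096)) -> bytes:
    """Pad a message up to the next fixed bucket size so an observer of
    message lengths learns only the bucket, not the exact payload size
    (coarse length side-channel mitigation)."""
    for size in buckets:
        if len(payload) <= size:
            return payload + b"\x00" * (size - len(payload))
    raise ValueError("payload exceeds largest bucket")
```

The receiver must know how to strip the padding (e.g., a length prefix inside the padded frame), and bucket boundaries themselves still leak coarse size information, which is why the cited work layers DP on top.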

Generated from per-paper analysis; no external browsing.