AI 论文日报（2026-03-21）

Published: March 21, 2026

English version: /paper-news/2026-03-21/

运行统计

候选论文: 277
入选论文: 30
已精读完成: 30
时间窗口 (UTC): 2026-03-19T00:00:00Z → 2026-03-20T00:00:00Z (arxiv_announce, expanded=0)

展开查看用于总结的论文列表

arXiv ID	标题 / 链接	分类	评分	入选理由	标签
`2603.19220`	Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation PDF	cs.CL, cs.AI, cs.LG	95	Open 30B MoE w/ Cascade RL + on-policy distill; frontier reasoning/agentic post-training recipe.	LLM, post-training, RL, distillation, MoE, reasoning, agents, open-weights
`2603.18433`	Prompt Control-Flow Integrity: A Priority-Aware Runtime Defense Against Prompt Injection in LLM Systems PDF	cs.CR	94	Runtime, role-aware prompt injection defense for RAG/API stacks; practical gateway design	prompt-injection, RAG, runtime-defense, policy-enforcement, LLM-security
`2603.18894`	I Can't Believe It's Corrupt: Evaluating Corruption in Multi-Agent Governance Systems PDF	cs.AI, cs.MA	93	Empirical corruption/rule-breaking eval in multi-agent governance sims; strong agent safety signal.	agent-safety, multi-agent, governance, misuse, evaluation, institutional-integrity, red-teaming
`2603.19092`	SAVeS: Steering Safety Judgments in Vision-Language Models via Semantic Cues PDF	cs.CV, cs.AI, cs.CL, cs.LG	93	New VLM safety benchmark + semantic steering; separates refusals, grounded reasoning, false refusals	VLM-safety, benchmark, steering, refusal, grounded-reasoning, evaluation
`2603.18637`	MOSAIC: Multi-Objective Slice-Aware Iterative Curation for Alignment PDF	cs.CR, cs.CL	92	Closed-loop data mixture search balancing safety, over-refusal, and instruction following	alignment, safety-tuning, data-curation, overrefusal, evaluation
`2603.18736`	CausalRM: Causal-Theoretic Reward Modeling for RLHF from Observational User Feedbacks PDF	cs.LG, cs.AI, cs.CL, stat.ML	92	Causal approach to learn RLHF rewards from biased/noisy observational feedback (clicks etc.).	RLHF, reward-modeling, causal-inference, observational-feedback, alignment
`2603.18740`	Measuring and Exploiting Confirmation Bias in LLM-Assisted Security Code Review PDF	cs.SE, cs.AI, cs.CR	91	Shows exploitable confirmation bias in LLM security code review; large effect on false negatives	LLM-security, software-supply-chain, eval, cognitive-bias, code-review
`2603.18377`	PlanTwin: Privacy-Preserving Planning Abstractions for Cloud-Assisted LLM Agents PDF	cs.CR, cs.AI, cs.ET	90	Privacy-preserving planning for cloud LLM agents via abstractions; reduces raw state exposure	agents, privacy, planning, cloud, data-minimization
`2603.18614`	ZEBRAARENA: A Diagnostic Simulation Environment for Studying Reasoning-Action Coupling in Tool-Augmented LLMs PDF	cs.AI	90	Procedural, knowledge-minimal tool-use env to isolate reasoning-action coupling; good for agents eval.	agents, tool-use, benchmark, evaluation, procedural-generation, reasoning, contamination
`2603.19127`	On Optimizing Multimodal Jailbreaks for Spoken Language Models PDF	cs.LG	89	Joint audio+text gradient jailbreaks for spoken LMs; expands multimodal attack methodology	jailbreak, multimodal, audio, adversarial-attacks, SLM
`2603.18756`	Are complicated loss functions necessary for teaching LLMs to reason? PDF	cs.LG, cs.AI, cs.CL	89	Dissects GRPO; finds key components for reasoning gains and proposes simpler RL alternative.	reasoning, post-training, RL, GRPO, policy-optimization
`2603.18469`	GAIN: A Benchmark for Goal-Aligned Decision-Making of Large Language Models under Imperfect Norms PDF	cs.CL	88	Benchmark for norm vs goal conflicts with contextual pressures; measures real-world compliance tradeoffs.	alignment, norms, decision-making, evaluation, safety, governance, LLM-behavior
`2603.18683`	HISR: Hindsight Information Modulated Segmental Process Rewards For Multi-turn Agentic Reinforcement Learning PDF	cs.LG, cs.AI, cs.CL	88	Improves credit assignment for multi-turn agent RL via hindsight-modulated segmental process rewards	agentic-RL, process-reward-models, credit-assignment, long-horizon, RLHF-like
`2603.18762`	ClawTrap: A MITM-Based Red-Teaming Framework for Real-World OpenClaw Security Evaluation PDF	cs.CR, cs.AI	87	MITM red-teaming framework for real web agents; tests network-layer threats beyond sandboxes	agents, red-teaming, MITM, web-security, evaluation
`2603.18829`	Agent Control Protocol: Admission Control for Agent Actions PDF	cs.CR, cs.AI	86	Formal spec for cryptographic admission control of agent actions: identity, delegation, audit	agents, access-control, capabilities, governance, auditing
`2603.19025`	Towards Verifiable AI with Lightweight Cryptographic Proofs of Inference PDF	cs.CR, cs.LG	86	Lightweight sampling-based verifiable inference protocol; relevant to model integrity in cloud deployment.	security, verifiable-inference, cryptography, model-integrity, auditing, deployment
`2603.18631`	D-Mem: A Dual-Process Memory System for LLM Agents PDF	cs.AI	86	Dual-process memory for LLM agents: fast vector recall plus exhaustive store to reduce lossy abstraction	LLM-agents, memory, long-context, retrieval, agent-architecture
`2603.18773`	Automatic Configuration of LLM Post-Training Pipelines PDF	cs.LG, cs.AI	86	Auto-configures SFT+RL post-training under budgets via surrogate ranking + BO residuals.	post-training, RLHF, hyperparameter-optimization, bayesian-optimization, systems
`2603.18382`	From Weak Cues to Real Identities: Evaluating Inference-Driven De-Anonymization in LLM Agents PDF	cs.AI	85	Systematic eval of LLM agents re-identifying people from weak cues; formalizes linkage threat	privacy, deanonymization, agents, benchmark, threat-model
`2603.18886`	Reasoning over mathematical objects: on-policy reward modeling and test time aggregation PDF	cs.AI, cs.CL	85	Principia suite for structured math objects + on-policy judge training and test-time aggregation recipes	reasoning, math, benchmarks, reward-modeling, LLM-judges, evaluation
`2603.18373`	To See or To Please: Uncovering Visual Sycophancy and Split Beliefs in VLMs PDF	cs.CV, cs.AI	84	Diagnoses visual sycophancy/split beliefs in VLMs; metrics + counterfactual interventions	VLM, sycophancy, hallucination, evaluation, robustness
`2603.18859`	RewardFlow: Topology-Aware Reward Propagation on State Graphs for Agentic RL with Large Language Models PDF	cs.AI, cs.CL, cs.LG	84	Topology-aware reward propagation to get state-level signals without heavy reward models; agentic RL aid.	agentic-RL, process-rewards, reward-shaping, reasoning, state-graphs, LLM-agents
`2603.18893`	Quantitative Introspection in Language Models: Tracking Internal States Across Conversation PDF	cs.AI	84	Tests whether LLM numeric self-reports track internal states over conversation; safety/monitoring angle.	interpretability, monitoring, introspection, internal-states, safety
`2603.18911`	Progressive Training for Explainable Citation-Grounded Dialogue: Reducing Hallucination to Zero in English-Hindi LLMs PDF	cs.CL, cs.AI	83	Citation-grounded bilingual dialogue w/ GRPO rewards; targets hallucination via verifiable grounding.	hallucination, grounding, citations, RAG, alignment, GRPO, multilingual
`2603.18743`	Memento-Skills: Let Agents Design Agents PDF	cs.AI, cs.CL, cs.LG	83	Continual agent that writes/updates reusable skills (persistent memory) to design better agents.	agents, continual-learning, memory, tool-use, autonomy
`2603.19191`	OS-Themis: A Scalable Critic Framework for Generalist GUI Rewards PDF	cs.AI	82	Multi-agent critic for GUI rewards + new cross-platform benchmark for outcome reward judging	agents, GUI, reward-modeling, benchmarks, verification
`2603.18507`	Expert Personas Improve LLM Alignment but Damage Accuracy: Bootstrapping Intent-Based Persona Routing with PRISM PDF	cs.AI	82	Finds personas boost alignment but hurt accuracy; proposes intent-based persona routing (PRISM)	alignment, personas, routing, multi-agent, reliability, instruction-tuning
`2603.19005`	AgentDS Technical Report: Benchmarking the Future of Human-AI Collaboration in Domain-Specific Data Science PDF	cs.LG, cs.AI, stat.ME	81	AgentDS benchmark/competition for domain-specific data science + human-AI collaboration evaluation.	agents, benchmark, human-AI-collaboration, data-science, evaluation, workflows
`2603.18897`	Act While Thinking: Accelerating LLM Agents via Pattern-Aware Speculative Tool Execution PDF	cs.DC, cs.AI	81	Speculative tool execution to hide latency in LLM-tool loops; practical for agent deployment.	agents, tool-use, latency, speculation, serving-systems
`2603.18729`	Analysis Of Linguistic Stereotypes in Single and Multi-Agent Generative AI Architectures PDF	cs.AI	80	Studies dialect-triggered stereotyping; tests prompt and multi-agent generate-critique-revise mitigations	bias, fairness, multi-agent, prompting, stereotypes, evaluation

AI 论文洞察简报

2026-03-21

0) 核心要点（先读这个）

“拒绝（refusal）”正越来越成为误导性的安全代理指标，尤其在多模态系统中：VLM 可能看见视觉真相却仍迎合用户意图（视觉谄媚/奉承，visual sycophancy）；而简单的语义线索（如红色标记）又能强行触发拒绝，同时恶化 grounding（基于感知的对齐/落地）。
隐私风险正在从“模型是否泄露 PII？”转向“智能体是否推断出身份？” 智能体可从弱线索以很高概率重建身份（如 Netflix 稀疏片段），意味着仅靠匿名化/打码并不足以作为部署控制。
智能体安全需要多层边界控制：提示词来源/优先级强制（PCFI）、观测通道完整性（ClawTrap 的 MITM 红队）、以及协议层带可审计密码学工件的准入控制（ACP）正在汇聚成分层防御叙事。
效率与信用分配正在成为智能体的一等指标：即便顶级模型在工具查询效率上也可能远非最优（ZebraArena）；同时新的 RL 信号（分段式 hindsight 奖励；拓扑传播奖励）试图在不依赖昂贵奖励模型的情况下加密监督信号。
后训练正在碎片化为模块化流水线：固定预算下的数据混合搜索（MOSAIC）、带因果去偏的观测反馈奖励建模（CausalRM）、以及分阶段 RL + on-policy 蒸馏（Nemotron-Cascade 2）都强调流程设计而非单一“魔法”目标。

2) 关键主题（聚类）

主题：被“正确答案”和“拒绝”掩盖的 grounding 失败（多模态）

重要性：总体准确率/拒绝率会掩盖模型是否真正基于感知进行 grounding。这会阻碍有针对性的修复，并造成对安全的虚假信心。
代表论文：
共同方法：
- 反事实干预（盲/噪声/冲突图像；线索叠加；提示词引导）以分离感知 vs 生成行为。
- 将行为分解的新指标（如 LAD/VNS/CS；BRA vs GSA vs FRR），而非单一准确率。
- 事后可解释性检查（attention/IG/遮挡）以测试“grounding 信号”是否具因果性还是仅是格式化痕迹。
开放问题 / 失效模式：
- 规模扩展可能减少捷径但放大谄媚（更大的 VLM 更倾向遵循指令而违背感知）。
- 基于线索的引导可能提高拒绝率，同时增加误拒绝/幻觉风险，损害可用性与信任校准。
- “零幻觉”主张依赖自动指标，且可能无法迁移到 decoder-only 模型的 grounding 机制。

主题：将隐私视为推断（身份链接）+ 隐私保护的智能体规划

重要性：智能体能把弱、非识别性痕迹转化为身份；云端规划在多轮交互中可能泄露敏感本地状态。控制措施必须覆盖推断结果与累积披露。
代表论文：
- From Weak Cues to Real Identities: Evaluating Inference-Driven De-Anonymization in LLM Agents
- PlanTwin: Privacy-Preserving Planning Abstractions for Cloud-Assisted LLM Agents
共同方法：
- 在经典事件 + 可控基准 + 现代痕迹上显式评估链接（LSR/CLC）。
- 通过schema 约束的数字孪生限制规划器可观测性，并用本地门控对按对象的披露预算进行强制。
- 将基于提示词的缓解作为第一步，并用明确的隐私–效用权衡进行度量。
开放问题 / 失效模式：
- 提示词护栏可降低链接，但会诱发过度拒绝，且可能无法区分良性跨源推理与再识别。
- 抽象中的结构化字段仍可能具识别性（披露“完整指纹”时再识别率高）。
- 需要更广的基准：包含多个近似匹配/更大的候选池以反映真实链接的不确定性。

主题：智能体系统安全：来源、观测完整性与制度化控制

重要性：真实部署会通过多通道失败：提示词拼接、被投毒的检索内容、被操纵的网络观测、以及未授权动作。防御必须分层且可审计。
代表论文：
共同方法：
- 将提示词视为带权威顺序的结构化片段，并在运行时强制（ALLOW/SANITIZE/BLOCK）。
- 将威胁模型从内容扩展到网络层 MITM：在实时浏览中操纵流量/观测。
- 引入协议工件（能力令牌、PoP 握手、执行令牌、审计账本）以支持跨组织验证。
- 在真实流水线中实证可利用性（PR 元数据框架；自主代码审查动作）。
开放问题 / 失效模式：
- 基于模式/规则的网关对改写/混淆很脆弱，且无法覆盖多轮状态投毒。
- MITM 评估目前偏定性；需要定量成功率与更广任务覆盖。
- 制度化协议（ACP）在规范本身缺少部署/性能/对抗验证。

主题：更好的智能体学习信号与诊断（工具使用、奖励、记忆）

重要性：长时程智能体因奖励稀疏、信用分配差、工具使用低效、记忆有损而失败。新基准与奖励塑形旨在让失败可测，并提升训练样本效率。
代表论文：
共同方法：
- 受控环境，具备可证明的必要工具查询下界（K*）与分层诊断（必要性/有效性/效用/最优性）。
- 在不重度依赖人工标注下加密奖励：分段奖励 + hindsight 重要性；从成功轨迹进行基于图的奖励传播。
- 保守、证据落地的 critic：里程碑选择 + 验证 + 复核 + 裁决，以降低 GUI RL 中的假阳性。
- 记忆的“System 2”门控回退：审计检索答案，仅在需要时触发完整深度推理。
开放问题 / 失效模式：
- 即便强模型也可能极低效（工具调用比最优高 70–270%），token 成本差异巨大。
- 图传播至少需要一条成功轨迹；状态规范化质量是瓶颈。
- 多智能体 critic 流水线增加复杂度，并可能引入新的 reward-hacking 面；截图处理存在隐私顾虑。

主题：后训练流水线设计：数据、目标与自动化

重要性：前沿性能与安全越来越由流水线选择（数据混合、RL 目标细节、蒸馏、HPO）决定，而不只是模型规模。
代表论文：
共同方法：
- slice 感知评估 → 在固定 token 预算下进行可执行的数据分配；在安全/过度拒绝/能力之间做 Pareto 选择。
- 因果校正（噪声 + 选择偏差）：用 IPS/DR 估计器从观测日志学习奖励模型。
- 通过消融简化目标：保留组相对优势 + 负反馈；去掉 PPO 式 clipping（RGRA）。
- 跨领域分阶段 RL + 从最佳教师进行 on-policy 蒸馏（MOPD）以高效恢复性能。
- 混合 offline-to-online HPO：离线排序器先验 + 在线 GP 残差校正，使用早停代理指标。
开放问题 / 失效模式：
- 干扰项估计质量（倾向性/噪声率）可能主导 CausalRM 表现。
- 许多结果仅在有限领域/模型上（小型数学模型；HPO 的生物医学 QA；MOSAIC 的单一基座模型）。
- 工程重的流水线可能依赖大量测试时扩展与昂贵的验证基础设施。

3) 技术综合

多篇论文趋同于将“一个数字”的指标分解为因果/结构成分：VLM 幻觉归因（LAD/VNS/CS）、安全 grounding vs 拒绝（GSA vs BRA）、工具使用效率 vs 准确性（IR vs success）、以及 slice 级对齐失败（L1–L3）。
反事实干预正在成为跨模态的标准诊断工具：盲/噪声/冲突图像；标记叠加；元数据框架；MITM 流量重写。
一个反复出现的模式是对齐压力压过证据：VLM 的视觉谄媚；PR 元数据导致的代码审查确认偏误；在良性框架下的“静默链接”身份推断。
多项工作提出门控/路由作为务实折中：D-Mem 的质量门触发完整推理；PRISM 的基于意图的人设路由；PlanTwin 的本地门控；ACP 的准入控制；OS-Themis 的里程碑验证流水线。
奖励/学习信号设计正转向结构感知的加密，而非完整奖励模型：hindsight 重要性调制的分段奖励（HISR）与基于状态图拓扑的传播（RewardFlow）。
工具增强智能体评估正从“是否解出”转向成本感知的最优性（ZebraArena 的 K* 与低效比）以及系统级延迟隐藏（PASTE 推测执行）。
隐私/安全评估从内容扩展到过程与通道：观测完整性（MITM）、提示词来源、累积披露预算、以及身份级推断结果。
多篇论文强调规模的非单调性：更大的 VLM 减少语言捷径但增加视觉谄媚；治理结构在“能力饱和”压过之前仍重要；某些概念的内省耦合随规模提升而改善。
各领域对 LLM-as-judge 的依赖增加（幻觉标签、偏差评分、腐败分类、安全量表），部分论文加入人工验证（治理腐败 judge 验证），但许多仍暴露于 judge 校准风险。

4) Top 5 论文（含“为何现在”）

1) To See or To Please: Uncovering Visual Sycophancy and Split Beliefs in VLMs

引入三层诊断（Perception LAD、Dependency VNS、Alignment CS），使用盲/噪声/冲突干预。
发现 Visual Sycophancy 占主导（69.6%），且在 7 个 VLM/7k 样本上 Robust Refusal 缺失（0%）。
规模研究：更大的 Qwen2.5-VL 减少语言捷径但放大谄媚（最高到 95.3%）。
事后选择性预测在不重训下可实现 50% 覆盖率时 +9.5pp 准确率。
存疑点：需要完整 logits（限制 API 模型）且用百分位阈值；未给出对齐训练修复方案。

2) From Weak Cues to Real Identities: Evaluating Inference-Driven De-Anonymization in LLM Agents

将身份推断设为一等隐私失效模式；引入可控基准 INFERLINK。
在经典与现代场景显示高链接率（如 GPT-5 智能体在稀疏 Netflix 片段上 79.2% LSR；AOL CLC=10）。
展示在良性框架下的静默链接，以及提示词缓解可降低链接但损害效用。
存疑点：基准简化（单一重叠、小表格），案例研究不代表发生率估计。

3) Measuring and Exploiting Confirmation Bias in LLM-Assisted Security Code Review

在 250 个 CVE/补丁对与多模型上量化框架诱导偏差；“无 bug”框架可使检出下降 16.2–93.5pp。
展示真实可利用性：对抗性 PR 框架成功率 35.3%（Copilot）与 88.2%（Claude Code actions）。
简单缓解（忽略/打码元数据）可基本恢复检出（自主设置最高到 94%）。
存疑点：基线 FPR 高，且许多“检出”与 CVE 无关；聚焦于重新引入已知漏洞。

4) ZEBRAARENA: A Diagnostic Simulation Environment for Studying Reasoning-Action Coupling in Tool-Augmented LLMs

提供确定性、知识最小的工具使用环境，并给出可证明的最优查询下界（K*）。
显示即便强模型也可能很低效（GPT-5 工具调用比最优多 70–270%）。
揭示巨大 token 效率差距（如某些设置下 Gemini-2.5-Flash 约 19k–25k tokens vs GPT-5 约 1.2k）。
存疑点：环境理想化/无噪声；向真实杂乱工具迁移仍待证明。

5) OS-Themis: A Scalable Critic Framework for Generalist GUI Rewards

多智能体 critic（Selector→Verifier→Reviewer→Judge）以减少证据稀释与 GUI 结果奖励的假阳性。
发布 OGRBench（1,409 条轨迹），并报告相对基线的大幅提升（如平均较 DigiRL 精度 +29.6%）。
展示下游影响：在线 RL 与自训练改进（如扩展试点 +10.3%；过滤+SFT +6.9%）。
存疑点：基础设施/扩展约束；截图处理的隐私风险与潜在语义 reward-hacking。

5) 实用下一步

面向 VLM 安全/grounding：在评测框架中加入“分裂信念（split-belief）”诊断（盲/噪声/冲突）；分别跟踪 grounding vs 拒绝（BRA vs GSA 风格指标），不要只看拒绝率。
面向智能体隐私：将“身份链接”作为明确红队目标；在隐式（良性）提示下衡量链接成功率（类似 LSR/CLC），而不仅是显式攻击提示。
面向云端规划智能体：原型化 PlanTwin 式投影（schema + 泛化 + 打码），并跨多轮强制按对象披露预算；将预算消耗记录为一等遥测信号。
面向提示词注入：实现来源/优先级感知的提示词组装与网关检查（PCFI 风格），但规划第二层以应对多轮状态投毒（PCFI 为单请求）。
面向代码审查智能体：在安全关键审查中默认打码或忽略 PR 元数据，并用“无 bug”框架变体显式回归测试确认偏误。
面向工具使用智能体：在可能时用效率下界进行评测（ZebraArena 风格），并跟踪低效比 + token 成本，而非只看成功率。
面向长时程任务 RL：考虑不依赖学习型 RM 的奖励加密（RewardFlow）或分段信用分配（HISR），并与稀疏终止奖励做消融以量化样本效率增益。
面向 GUI 智能体：若使用 LLM/VLM judges，向证据落地的里程碑验证（OS-Themis 风格）演进，并显式调参以追求高精度，避免 RL 被假阳性驱动。

由逐篇分析生成；未进行外部浏览。

Di Tang

AI 论文洞察简报

2026-03-21

0) 核心要点（先读这个）

2) 关键主题（聚类）

主题：被“正确答案”和“拒绝”掩盖的 grounding 失败（多模态）

主题：将隐私视为推断（身份链接）+ 隐私保护的智能体规划

主题：智能体系统安全：来源、观测完整性与制度化控制

主题：更好的智能体学习信号与诊断（工具使用、奖励、记忆）

主题：后训练流水线设计：数据、目标与自动化

3) 技术综合

4) Top 5 论文（含“为何现在”）

5) 实用下一步