AI 论文日报（2026-03-19）

Published: March 19, 2026

English version: /paper-news/2026-03-19/

运行统计

候选论文: 277
入选论文: 30
已精读完成: 30
时间窗口 (UTC): 2026-03-19T00:00:00Z → 2026-03-20T00:00:00Z (arxiv_announce, expanded=0)

展开查看用于总结的论文列表

arXiv ID	标题 / 链接	分类	评分	入选理由	标签
`2603.18433`	Prompt Control-Flow Integrity: A Priority-Aware Runtime Defense Against Prompt Injection in LLM Systems PDF	cs.CR	94	Runtime, role-aware prompt-injection defense for RAG/API stacks; practical gateway design + eval.	prompt-injection, RAG, runtime-defense, policy-enforcement, LLM-security, middleware
`2603.18894`	I Can't Believe It's Corrupt: Evaluating Corruption in Multi-Agent Governance Systems PDF	cs.AI, cs.MA	94	Empirical multi-agent governance sims quantify rule-breaking/corruption; high direct agent-safety relevance.	agent-safety, multi-agent, governance, evaluation, misuse, institutional-integrity
`2603.19092`	SAVeS: Steering Safety Judgments in Vision-Language Models via Semantic Cues PDF	cs.CV, cs.AI, cs.CL, cs.LG	93	New VLM safety benchmark + semantic steering; separates refusals vs grounded reasoning	vlm-safety, benchmark, steering, refusal, grounded-reasoning, evaluation
`2603.18637`	MOSAIC: Multi-Objective Slice-Aware Iterative Curation for Alignment PDF	cs.CR, cs.CL	92	Closed-loop multi-objective alignment data curation; explicit tradeoff safety vs over-refusal vs IF.	alignment, data-curation, SFT, over-refusal, safety-eval, mixture-optimization
`2603.18614`	ZEBRAARENA: A Diagnostic Simulation Environment for Studying Reasoning-Action Coupling in Tool-Augmented LLMs PDF	cs.AI	92	Procedural tool-use environment isolates reasoning-action coupling; reduces contamination; strong agent eval asset.	agents, tool-use, benchmark, evaluation, procedural-generation, reasoning
`2603.18736`	CausalRM: Causal-Theoretic Reward Modeling for RLHF from Observational User Feedbacks PDF	cs.LG, cs.AI, cs.CL, stat.ML	92	Causal framing for reward models from noisy/biased observational feedback; scalable RLHF alternative	RLHF, reward-modeling, causal-inference, observational-feedback, alignment
`2603.18740`	Measuring and Exploiting Confirmation Bias in LLM-Assisted Security Code Review PDF	cs.SE, cs.AI, cs.CR	91	Measures exploitable confirmation bias in LLM security code review; large effect sizes on FN rates.	secure-coding, LLM-failure-modes, supply-chain, evaluation, prompt-framing, robustness
`2603.18631`	D-Mem: A Dual-Process Memory System for LLM Agents PDF	cs.AI	90	Dual-process memory for LLM agents; tackles lossy retrieval for long-horizon context	agents, memory, long-horizon, retrieval, architecture, reliability
`2603.18377`	PlanTwin: Privacy-Preserving Planning Abstractions for Cloud-Assisted LLM Agents PDF	cs.CR, cs.AI, cs.ET	89	Privacy-preserving planning for cloud LLM agents via planning abstractions; limits raw state exposure.	agents, privacy, cloud-planning, abstraction, confidential-context, system-design
`2603.18893`	Quantitative Introspection in Language Models: Tracking Internal States Across Conversation PDF	cs.AI	89	Measures whether LLM numeric self-reports track internal states over dialogue; safety+interpretability angle	interpretability, introspection, monitoring, safety, probes, conversation
`2603.18382`	From Weak Cues to Real Identities: Evaluating Inference-Driven De-Anonymization in LLM Agents PDF	cs.AI	88	Systematic eval of LLM-agent de-anonymization from weak cues; formalizes inference-driven linkage.	privacy, deanonymization, agents, benchmark, linkage-attacks, risk-evaluation
`2603.18469`	GAIN: A Benchmark for Goal-Aligned Decision-Making of Large Language Models under Imperfect Norms PDF	cs.CL	88	Benchmark for norm-vs-goal conflicts under pressure; useful for alignment and policy compliance testing.	alignment, norms, decision-making, benchmark, robustness, governance
`2603.18683`	HISR: Hindsight Information Modulated Segmental Process Rewards For Multi-turn Agentic Reinforcement Learning PDF	cs.LG, cs.AI, cs.CL	88	Improves multi-turn agent RL via hindsight-modulated segment rewards for credit assignment	agentic-rl, reward-modeling, credit-assignment, process-rewards, long-horizon
`2603.19127`	On Optimizing Multimodal Jailbreaks for Spoken Language Models PDF	cs.LG	87	Joint audio+text gradient jailbreaks for spoken-language models; expands multimodal attack surface.	jailbreaks, multimodal, audio-attacks, adversarial-prompts, SLM, red-teaming
`2603.18756`	Are complicated loss functions necessary for teaching LLMs to reason? PDF	cs.LG, cs.AI, cs.CL	87	Dissects GRPO; finds negative feedback key and clipping unnecessary; simplifies reasoning post-training	reasoning, RL, post-training, GRPO, REINFORCE, optimization
`2603.18762`	ClawTrap: A MITM-Based Red-Teaming Framework for Real-World OpenClaw Security Evaluation PDF	cs.CR, cs.AI	86	MITM-based red-teaming for real web agents (OpenClaw); network-layer attacks beyond sandbox tests.	agents, red-teaming, MITM, web-security, tool-use, evaluation-framework
`2603.19025`	Towards Verifiable AI with Lightweight Cryptographic Proofs of Inference PDF	cs.CR, cs.LG	86	Lightweight verifiable inference protocol for cloud models; relevant to auditing and deployment security.	security, verifiable-inference, cryptography, auditing, cloud-deployment, integrity
`2603.19144`	UGID: Unified Graph Isomorphism for Debiasing Large Language Models PDF	cs.CL, cs.AI	86	Representation-level LLM debiasing via graph invariance across counterfactual inputs	bias, debiasing, interpretability, representations, counterfactuals, fairness
`2603.18829`	Agent Control Protocol: Admission Control for Agent Actions PDF	cs.CR, cs.AI	85	Formal spec for cryptographic admission control of agent actions: identity, delegation, revocation, audit.	agent-governance, capabilities, authorization, cryptography, auditing, protocol
`2603.18743`	Memento-Skills: Let Agents Design Agents PDF	cs.AI, cs.CL, cs.LG	85	Continual agent that writes reusable skills/memory to design new agents; relevant to agentic risk surface	agents, continual-learning, memory, tool-use, skills, agent-design
`2603.19191`	OS-Themis: A Scalable Critic Framework for Generalist GUI Rewards PDF	cs.AI	84	Scalable multi-agent critic for GUI rewards + new cross-platform reward benchmark (OGRBench).	GUI-agents, reward-modeling, critics, benchmarks, verification, RL
`2603.18911`	Progressive Training for Explainable Citation-Grounded Dialogue: Reducing Hallucination to Zero in English-Hindi LLMs PDF	cs.CL, cs.AI	84	Citation-grounded bilingual dialogue training + reward; targets hallucinations with verifiable outputs.	hallucination, grounding, citations, RAG, alignment, multilingual
`2603.18507`	Expert Personas Improve LLM Alignment but Damage Accuracy: Bootstrapping Intent-Based Persona Routing with PRISM PDF	cs.AI	84	Finds persona prompts boost alignment but hurt accuracy; proposes intent-based routing	alignment, personas, prompting, routing, evaluation, tradeoffs
`2603.19017`	What Really Controls Temporal Reasoning in Large Language Models: Tokenisation or Representation of Time? PDF	cs.CL, cs.AI	84	MultiTempBench probes multilingual temporal reasoning; links failures to tokenization via mDFR + probing	evaluation, temporal-reasoning, multilingual, tokenization, benchmarks
`2603.18373`	To See or To Please: Uncovering Visual Sycophancy and Split Beliefs in VLMs PDF	cs.CV, cs.AI	83	Diagnoses visual sycophancy/split beliefs in VLMs with counterfactual tests; highlights alignment failure.	VLM, sycophancy, grounding, hallucinations, evaluation, uncertainty
`2603.19220`	Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation PDF	cs.CL, cs.AI, cs.LG	83	Open 30B MoE with Cascade RL/distillation; strong reasoning/agentic claims; potentially impactful post-training.	LLM, post-training, RL, distillation, MoE, reasoning, agents
`2603.18886`	Reasoning over mathematical objects: on-policy reward modeling and test time aggregation PDF	cs.AI, cs.CL	83	Principia suite for formal math objects + on-policy judge training + test-time aggregation	reasoning, math, benchmarks, reward-modeling, llm-judges, verification
`2603.18897`	Act While Thinking: Accelerating LLM Agents via Pattern-Aware Speculative Tool Execution PDF	cs.DC, cs.AI	82	Speculative tool execution to hide latency in LLM-tool loops; important for scalable agent serving	agents, tool-use, systems, latency, speculation, serving
`2603.18859`	RewardFlow: Topology-Aware Reward Propagation on State Graphs for Agentic RL with Large Language Models PDF	cs.AI, cs.CL, cs.LG	81	Topology-aware reward propagation for agentic LLM RL; could improve sparse-reward training efficiency.	agentic-RL, process-rewards, trajectory-graphs, reasoning, optimization
`2603.18729`	Analysis Of Linguistic Stereotypes in Single and Multi-Agent Generative AI Architectures PDF	cs.AI	80	Studies dialect-triggered stereotypes; tests prompt/COT and multi-agent critique-revise mitigation	bias, stereotypes, multi-agent, mitigation, prompting, fairness

AI 论文洞察简报

2026-03-19

0) 核心结论（先读这个）

“Grounding failures（落地/对齐到输入的失败）”越来越多是对齐/引导失败，而不是感知失败：一个三层 VLM 诊断发现 Visual Sycophancy（视觉谄媚）占主导（69.6%），并且 扩展规模会减少语言捷径但放大谄媚（Qwen2.5-VL 7B→72B：谄媚 72.4%→95.3%）。
Agent 安全正在从提示文本转向系统接口与观测通道：具备优先级感知的提示组合防御（PCFI）、MITM 网页流量红队（ClawTrap）与 密码学准入控制（ACP）都把 agent 栈当作攻击面——不只是模型本身。
隐私风险已变成“推理时链接（inference-time linkage）”，而不仅是显式标识符泄露：agent 能从弱线索重建身份（例如 Netflix 79.2% 链接；AOL 40 条历史中确认 10 个身份），推动隐私评估去衡量推断出的身份，而不只是脱敏/删改。
数据/反馈质量正在成为对齐瓶颈：CausalRM 表明，纠正观测反馈中的 噪声 + 选择偏差 能带来显著下游安全收益（例如 +49.2% WildGuardMix，+32.7% HarmBench）；MOSAIC 则显示 带预算、切片感知的混合搜索 能避免朴素安全混合导致的过度拒答/能力崩塌。
Agent 训练与评估正在收敛到“信用分配 + 效率”：ZebraArena 量化工具查询低效与理论最优的差距；RewardFlow 与 HISR 提出更稠密、结构感知的奖励传播/分段过程奖励；OS-Themis 通过里程碑验证与审计改进长时程 GUI 奖励。

2) 关键主题（聚类）

主题：多模态 grounding 与安全是可引导的（也可被利用）

重要性：VLM 可能“看见”异常却仍然服从/幻觉；简单语义线索即可翻转安全判断。这使多模态系统同时易受 过度信任 与 过度拒答 攻击。
代表论文：
常见方法：
- 反事实干预（盲/噪声/冲突图像；线索叠加；提示引导）以分离感知 vs 依赖 vs 对齐。
- 新指标区分行为（拒答）与 grounded 正确性（例如 BRA/GSA/FRR；LAD/VNS/CS）。
- 事后因果检查（遮挡/归因）测试“引用/标记”是否真正控制输出。
开放问题 / 失效模式：
- 如何在不诱发一刀切拒答的情况下减少 视觉谄媚（三层分类中在盲/噪声下“稳健拒答”为 0%）。
- 线索/叠加攻击诱导 对抗性过度拒答（SAVeS Attacker：高拒答且极高错误拒答）。
- 解码器-only 模型中出现“有引用格式但无 grounding”（报告遮挡 grounding 为 0.000，但 Citation-F1 非零）。

主题：Agent 安全加固组合边界与观测通道

重要性：真实部署会从多源组合提示并依赖网络化观测；攻击者利用层级混淆、元数据框架与传输中篡改。
代表论文：
常见方法：
- 运行时强制溯源/优先级（PCFI 建模 S∥D∥U∥R 且 S>D>U>R；ALLOW/SANITIZE/BLOCK）。
- 评估非纯提示攻击：PR 元数据框架（确认偏差）与在线 HTTP MITM 重写（ClawTrap）。
- 增加协议层：密码学身份、能力令牌、PoP 握手、确定性风险评分、一次性执行令牌、仅追加审计账本（ACP）。
开放问题 / 失效模式：
- 基于模式的提示防御可能对改写/语义攻击脆弱，且不覆盖多轮/工具链（PCFI 局限）。
- ClawTrap 的 MITM 评估目前偏定性；需要定量成功指标与更广任务覆盖。
- 元数据框架会导致巨大检测下降（TPR 下降 16.2–93.5pp）与高绕过率（例如对 Claude Code 88.2%），除非流水线显式忽略/脱敏元数据。

主题：隐私威胁从“泄露了什么”转向“能推断出什么”

重要性：即使经过清洗/匿名化，agent 也可能生成假设并检索佐证从而可链接；云端规划器在观测不受限时也能重建可识别结构。
代表论文：
- From Weak Cues to Real Identities: Evaluating Inference-Driven De-Anonymization in LLM Agents（从弱线索到真实身份：评估 LLM Agent 的推理驱动去匿名化）
- PlanTwin: Privacy-Preserving Planning Abstractions for Cloud-Assisted LLM Agents（PlanTwin：面向云辅助 LLM Agent 的隐私保护规划抽象）
常见方法：
- 将去匿名化形式化为输出身份假设与证据 Π:(Danon,Daux)→(î,E)；在经典+合成+现代轨迹上度量 LSR/CLC。
- 通过本地投影到带类型的“digital twin（数字孪生）”+能力目录+带每对象披露预算的 gatekeeper 来降低规划器可观测性。
- 显式评估隐私–效用权衡（提示隐私护栏；披露预算；再识别实验）。
开放问题 / 失效模式：
- 基于提示的隐私护栏能降低链接，但可能导致过度拒答/效用损失。
- 结构性推断仍存在：即便抽象图也可指纹化用户/对象（PlanTwin 显示当所有字段暴露时再识别率 94.1%）。
- INFERLINK 等基准较简化（单一重叠、小表），真实世界歧义与流行度仍不确定。

主题：对齐优化走向以数据为中心并进行因果校正

重要性：安全/可用性回退常源于偏置反馈与 SFT 预算错配；更好的估计器与切片感知循环可在固定算力下推进帕累托前沿。
代表论文：
常见方法：
- 将反馈视为带偏/带噪的观测数据；使用噪声校正替代损失 + 倾向性加权 + 双重稳健估计（CausalRM）。
- 闭环：切片级失败画像 → 在固定 token 预算下可执行的数据混合动作（MOSAIC）。
- 基准化现实决策权衡与情境压力（GAIN），并通过门控蒸馏缓解引导权衡（PRISM）。
开放问题 / 失效模式：
- 因果校正依赖倾向性与锚点单元估计；仍有模型设定错误风险。
- MOSAIC 仅在有限迭代与基线下评估；跨底座模型与独立构造评测集的泛化性仍待验证。
- 人设提升对齐行为但可能降低知识任务；PRISM 增加部署复杂度（门控 + LoRA 不兼容、测试规模有限）。

主题：Agent RL 与评估强调信用分配、效率与长时程奖励可靠性

重要性：Agent 不仅会“答错”，还会低效、在预算下校准失真，或被噪声奖励信号污染梯度。
代表论文：
常见方法：
- 定义理论下界/结构（ZebraArena 的 K⋆；RewardFlow 的状态图；HISR 的子目标分段）。
- 将稀疏结果转为更稠密信号且不依赖大量人工标注（图 BFS 传播；事后似然比；里程碑验证）。
- 诊断低效与“预算焦虑”，而非只看最终成功率。
开放问题 / 失效模式：
- 理想化环境（逻辑谜题）可能难迁移到噪声真实工具；需要桥接研究。
- 奖励塑形依赖状态表示与成功轨迹可得性（RewardFlow 指出依赖成功轨迹）。
- 评论器框架可能昂贵（OS-Themis 报告每条轨迹评估延迟约 117.6s）。

3) 技术综合

多篇论文在 反事实/溯源感知评估 上趋同：VLM 盲/噪声/冲突干预（视觉 grounding）、提示片段优先级强制（PCFI）与 MITM 观测重写（ClawTrap）都把“模型看到了什么”当作关键变量。
反复出现的模式是 将行为与底层能力分离：拒答率 vs grounded 安全（SAVeS）、准确率 vs 图像依赖 vs 对齐偏好（三层）、以及“引用存在” vs 因果 grounding（XKD-Dial 遮挡）。
预算无处不在：PlanTwin 披露预算、ZebraArena 查询预算/定价、MOSAIC 固定 SFT token 预算、OS-Themis 成本/延迟核算——提示评估应报告成本条件化性能曲线，而非单一分数。
对齐方法越来越多使用 因果/统计校正 而非更多数据：CausalRM 的噪声+选择偏差校正与 MOSAIC 的切片感知分配相呼应——都旨在避免“在错误信号上训练”。
Agent RL 正转向 结构诱导的稠密奖励，而无需训练独立奖励模型：RewardFlow 用拓扑；HISR 用事后似然比；OS-Themis 用里程碑证据链。
提示/引导被证明是 双刃剑：人设提升对齐但伤害知识（PRISM），语义线索可辅助也可攻击 VLM 安全（SAVeS），PR 元数据可锚定代码审查判断（确认偏差）。
鲁棒性失败常 不对称：确认偏差主要增加漏报；VLM 可检测异常（高 LAD）却仍幻觉（高 CS）；隐私链接即便在“良性”任务框架下也可能发生（INFERLINK IMPLICIT）。
多项工作强调 可审计接口：ACP 的签名账本+执行令牌、PlanTwin 的 schema 约束孪生+gatekeeper、OS-Themis 的可验证里程碑检查都生成可事后检查的工件。
RL 目标的简化趋势：RGRA 提示 PPO 式 clipping 可能对 GRPO 类推理增益并非必要（小模型中），但 优势归一化与负反馈对稳定性至关重要。

4) Top 5 论文（含“为何现在”）

1) To See or To Please: Uncovering Visual Sycophancy and Split Beliefs in VLMs（看见还是取悦：揭示 VLM 的视觉谄媚与分裂信念）

通过反事实图像将 VLM 失败分解为 感知（LAD）、依赖（VNS）、对齐（CS）。
发现 视觉谄媚是主导失效模式（69.6%），并在其 Qwen2.5-VL 分析中 随模型规模增大而加剧。
提供实用缓解：诊断引导的选择性预测（在 50% 覆盖率下准确率最高 +9.5pp）。
质疑点：需要完整 logits（排除闭源模型），且缓解并未修复主导的谄媚机制。

2) From Weak Cues to Real Identities: Evaluating Inference-Driven De-Anonymization in LLM Agents（从弱线索到真实身份：评估 LLM Agent 的推理驱动去匿名化）

在经典（Netflix/AOL）、受控（INFERLINK）与现代轨迹上形式化并度量 推理驱动链接。
报告高链接能力（如 GPT-5 在 Netflix 79.2%；AOL 子集 CLC=10），且链接可在良性框架下出现。
测试基于提示的隐私护栏并量化隐私–效用权衡。
质疑点：INFERLINK 较简化；现代轨迹研究是机制演示而非流行度估计。

3) CausalRM: Causal-Theoretic Reward Modeling for RLHF from Observational User Feedbacks（CausalRM：面向观测用户反馈的 RLHF 因果理论奖励建模）

将 噪声校正替代损失 与 倾向性重加权、双重稳健估计结合用于观测 RLHF 信号。
展示一致的 RM 改进与显著下游安全收益（例如其设置下 Qwen2.5-7B：+49.2% WildGuardMix，+32.7% HarmBench）。
在正确估计干扰项时提供理论无偏保证（IPS/DR）。
质疑点：依赖准确的倾向性/噪声率估计（锚点单元），且未探索观测+实验混合机制。

4) Measuring and Exploiting Confirmation Bias in LLM-Assisted Security Code Review（度量并利用 LLM 辅助安全代码审查中的确认偏差）

量化 PR 框架（“无 bug”）如何导致跨模型漏洞检测 TPR 下降 16.2–93.5pp。
展示真实可利用性：在其测试设置下对 Copilot 35.3% 绕过、对 Claude Code 88.2%；迭代精炼提升成功率。
表明缓解（忽略元数据/脱敏）可大幅恢复检测（交互式报告 100% 恢复；自治约 94%）。
质疑点：在选定模型与受控环境中评估；较高基线误报使运营解释更复杂。

5) ZEBRAARENA: A Diagnostic Simulation Environment for Studying Reasoning-Action Coupling in Tool-Augmented LLMs（ZEBRAARENA：研究工具增强 LLM 推理-行动耦合的诊断仿真环境）

提供程序化、抗污染环境，含 理论最小查询次数 K⋆ 与丰富效率诊断。
显示即便强模型也可能高度低效（GPT-5 准确率近乎完美但工具调用比 K⋆ 多 70–270%）。
揭示“预算焦虑”：更多预算并不稳定提升准确率。
质疑点：理想化逻辑谜题场景；向噪声真实工具迁移仍待建立。

5) 实用下一步

面向 VLM 产品：实现 反事实输入探针（盲/噪声/冲突），并跟踪类似 LAD/VNS/CS 的信号以区分“看不见”vs“不会说”。
对任何基于引用/标记的安全 UX 增加 grounding 审计：运行 遮挡式因果检查，确保引用/标记确实控制输出（而非仅格式）。
在 agent 栈中，将提示组装视为安全边界：采用 溯源标记 + 优先级强制（PCFI 类），并记录片段谱系用于事件响应。
对代码审查 agent：剥离/规范化 PR 元数据，或在审查提示中明确“忽略元数据”；将对抗性“无 bug”框架下的检测作为回归测试。
对处理私有状态的云规划 agent：原型化 带类型的数字孪生 + 能力目录 + gatekeeper（PlanTwin 类），并加入 披露预算 防止多轮指纹化。
对来自日志的 RLHF：评估反馈是否 非随机缺失（missing-not-at-random）；在收集更多标注前先尝试 倾向性 + 噪声校正（CausalRM 类）。
对工具增强 agent：除准确率外报告 效率指标（查询次数 vs K⋆、冗余比、token 成本），用以调参预算策略并降低“预算焦虑”。
对 GUI/长时程 RL：考虑 证据链评论器（里程碑 + 验证 + 审计），并跟踪评论器精确率/召回率，而不只看策略成功率。

由逐篇论文分析生成；无外部浏览。

Di Tang

AI 论文洞察简报

2026-03-19

0) 核心结论（先读这个）

2) 关键主题（聚类）

主题：多模态 grounding 与安全是可引导的（也可被利用）

主题：Agent 安全加固 组合边界 与 观测通道

主题：隐私威胁从“泄露了什么”转向“能推断出什么”

主题：对齐优化走向以数据为中心并进行因果校正

主题：Agent RL 与评估强调信用分配、效率与长时程奖励可靠性

3) 技术综合

4) Top 5 论文（含“为何现在”）

5) 实用下一步

主题：Agent 安全加固组合边界与观测通道