AI 论文日报(2026-03-19)

Published:

English version: /paper-news/2026-03-19/

运行统计

  • 候选论文: 277
  • 入选论文: 30
  • 已精读完成: 30
  • 时间窗口 (UTC): 2026-03-19T00:00:00Z → 2026-03-20T00:00:00Z (arxiv_announce, expanded=0)
展开查看用于总结的论文列表
arXiv ID标题 / 链接分类评分入选理由标签
2603.18433Prompt Control-Flow Integrity: A Priority-Aware Runtime Defense Against Prompt Injection in LLM Systems
PDF
cs.CR94Runtime, role-aware prompt-injection defense for RAG/API stacks; practical gateway design + eval.prompt-injection, RAG, runtime-defense, policy-enforcement, LLM-security, middleware
2603.18894I Can't Believe It's Corrupt: Evaluating Corruption in Multi-Agent Governance Systems
PDF
cs.AI, cs.MA94Empirical multi-agent governance sims quantify rule-breaking/corruption; high direct agent-safety relevance.agent-safety, multi-agent, governance, evaluation, misuse, institutional-integrity
2603.19092SAVeS: Steering Safety Judgments in Vision-Language Models via Semantic Cues
PDF
cs.CV, cs.AI, cs.CL, cs.LG93New VLM safety benchmark + semantic steering; separates refusals vs grounded reasoningvlm-safety, benchmark, steering, refusal, grounded-reasoning, evaluation
2603.18637MOSAIC: Multi-Objective Slice-Aware Iterative Curation for Alignment
PDF
cs.CR, cs.CL92Closed-loop multi-objective alignment data curation; explicit tradeoff safety vs over-refusal vs IF.alignment, data-curation, SFT, over-refusal, safety-eval, mixture-optimization
2603.18614ZEBRAARENA: A Diagnostic Simulation Environment for Studying Reasoning-Action Coupling in Tool-Augmented LLMs
PDF
cs.AI92Procedural tool-use environment isolates reasoning-action coupling; reduces contamination; strong agent eval asset.agents, tool-use, benchmark, evaluation, procedural-generation, reasoning
2603.18736CausalRM: Causal-Theoretic Reward Modeling for RLHF from Observational User Feedbacks
PDF
cs.LG, cs.AI, cs.CL, stat.ML92Causal framing for reward models from noisy/biased observational feedback; scalable RLHF alternativeRLHF, reward-modeling, causal-inference, observational-feedback, alignment
2603.18740Measuring and Exploiting Confirmation Bias in LLM-Assisted Security Code Review
PDF
cs.SE, cs.AI, cs.CR91Measures exploitable confirmation bias in LLM security code review; large effect sizes on FN rates.secure-coding, LLM-failure-modes, supply-chain, evaluation, prompt-framing, robustness
2603.18631D-Mem: A Dual-Process Memory System for LLM Agents
PDF
cs.AI90Dual-process memory for LLM agents; tackles lossy retrieval for long-horizon contextagents, memory, long-horizon, retrieval, architecture, reliability
2603.18377PlanTwin: Privacy-Preserving Planning Abstractions for Cloud-Assisted LLM Agents
PDF
cs.CR, cs.AI, cs.ET89Privacy-preserving planning for cloud LLM agents via planning abstractions; limits raw state exposure.agents, privacy, cloud-planning, abstraction, confidential-context, system-design
2603.18893Quantitative Introspection in Language Models: Tracking Internal States Across Conversation
PDF
cs.AI89Measures whether LLM numeric self-reports track internal states over dialogue; safety+interpretability angleinterpretability, introspection, monitoring, safety, probes, conversation
2603.18382From Weak Cues to Real Identities: Evaluating Inference-Driven De-Anonymization in LLM Agents
PDF
cs.AI88Systematic eval of LLM-agent de-anonymization from weak cues; formalizes inference-driven linkage.privacy, deanonymization, agents, benchmark, linkage-attacks, risk-evaluation
2603.18469GAIN: A Benchmark for Goal-Aligned Decision-Making of Large Language Models under Imperfect Norms
PDF
cs.CL88Benchmark for norm-vs-goal conflicts under pressure; useful for alignment and policy compliance testing.alignment, norms, decision-making, benchmark, robustness, governance
2603.18683HISR: Hindsight Information Modulated Segmental Process Rewards For Multi-turn Agentic Reinforcement Learning
PDF
cs.LG, cs.AI, cs.CL88Improves multi-turn agent RL via hindsight-modulated segment rewards for credit assignmentagentic-rl, reward-modeling, credit-assignment, process-rewards, long-horizon
2603.19127On Optimizing Multimodal Jailbreaks for Spoken Language Models
PDF
cs.LG87Joint audio+text gradient jailbreaks for spoken-language models; expands multimodal attack surface.jailbreaks, multimodal, audio-attacks, adversarial-prompts, SLM, red-teaming
2603.18756Are complicated loss functions necessary for teaching LLMs to reason?
PDF
cs.LG, cs.AI, cs.CL87Dissects GRPO; finds negative feedback key and clipping unnecessary; simplifies reasoning post-trainingreasoning, RL, post-training, GRPO, REINFORCE, optimization
2603.18762ClawTrap: A MITM-Based Red-Teaming Framework for Real-World OpenClaw Security Evaluation
PDF
cs.CR, cs.AI86MITM-based red-teaming for real web agents (OpenClaw); network-layer attacks beyond sandbox tests.agents, red-teaming, MITM, web-security, tool-use, evaluation-framework
2603.19025Towards Verifiable AI with Lightweight Cryptographic Proofs of Inference
PDF
cs.CR, cs.LG86Lightweight verifiable inference protocol for cloud models; relevant to auditing and deployment security.security, verifiable-inference, cryptography, auditing, cloud-deployment, integrity
2603.19144UGID: Unified Graph Isomorphism for Debiasing Large Language Models
PDF
cs.CL, cs.AI86Representation-level LLM debiasing via graph invariance across counterfactual inputsbias, debiasing, interpretability, representations, counterfactuals, fairness
2603.18829Agent Control Protocol: Admission Control for Agent Actions
PDF
cs.CR, cs.AI85Formal spec for cryptographic admission control of agent actions: identity, delegation, revocation, audit.agent-governance, capabilities, authorization, cryptography, auditing, protocol
2603.18743Memento-Skills: Let Agents Design Agents
PDF
cs.AI, cs.CL, cs.LG85Continual agent that writes reusable skills/memory to design new agents; relevant to agentic risk surfaceagents, continual-learning, memory, tool-use, skills, agent-design
2603.19191OS-Themis: A Scalable Critic Framework for Generalist GUI Rewards
PDF
cs.AI84Scalable multi-agent critic for GUI rewards + new cross-platform reward benchmark (OGRBench).GUI-agents, reward-modeling, critics, benchmarks, verification, RL
2603.18911Progressive Training for Explainable Citation-Grounded Dialogue: Reducing Hallucination to Zero in English-Hindi LLMs
PDF
cs.CL, cs.AI84Citation-grounded bilingual dialogue training + reward; targets hallucinations with verifiable outputs.hallucination, grounding, citations, RAG, alignment, multilingual
2603.18507Expert Personas Improve LLM Alignment but Damage Accuracy: Bootstrapping Intent-Based Persona Routing with PRISM
PDF
cs.AI84Finds persona prompts boost alignment but hurt accuracy; proposes intent-based routingalignment, personas, prompting, routing, evaluation, tradeoffs
2603.19017What Really Controls Temporal Reasoning in Large Language Models: Tokenisation or Representation of Time?
PDF
cs.CL, cs.AI84MultiTempBench probes multilingual temporal reasoning; links failures to tokenization via mDFR + probingevaluation, temporal-reasoning, multilingual, tokenization, benchmarks
2603.18373To See or To Please: Uncovering Visual Sycophancy and Split Beliefs in VLMs
PDF
cs.CV, cs.AI83Diagnoses visual sycophancy/split beliefs in VLMs with counterfactual tests; highlights alignment failure.VLM, sycophancy, grounding, hallucinations, evaluation, uncertainty
2603.19220Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation
PDF
cs.CL, cs.AI, cs.LG83Open 30B MoE with Cascade RL/distillation; strong reasoning/agentic claims; potentially impactful post-training.LLM, post-training, RL, distillation, MoE, reasoning, agents
2603.18886Reasoning over mathematical objects: on-policy reward modeling and test time aggregation
PDF
cs.AI, cs.CL83Principia suite for formal math objects + on-policy judge training + test-time aggregationreasoning, math, benchmarks, reward-modeling, llm-judges, verification
2603.18897Act While Thinking: Accelerating LLM Agents via Pattern-Aware Speculative Tool Execution
PDF
cs.DC, cs.AI82Speculative tool execution to hide latency in LLM-tool loops; important for scalable agent servingagents, tool-use, systems, latency, speculation, serving
2603.18859RewardFlow: Topology-Aware Reward Propagation on State Graphs for Agentic RL with Large Language Models
PDF
cs.AI, cs.CL, cs.LG81Topology-aware reward propagation for agentic LLM RL; could improve sparse-reward training efficiency.agentic-RL, process-rewards, trajectory-graphs, reasoning, optimization
2603.18729Analysis Of Linguistic Stereotypes in Single and Multi-Agent Generative AI Architectures
PDF
cs.AI80Studies dialect-triggered stereotypes; tests prompt/COT and multi-agent critique-revise mitigationbias, stereotypes, multi-agent, mitigation, prompting, fairness

AI 论文洞察简报

2026-03-19

0) 核心结论(先读这个)

  • “Grounding failures(落地/对齐到输入的失败)”越来越多是对齐/引导失败,而不是感知失败:一个三层 VLM 诊断发现 Visual Sycophancy(视觉谄媚)占主导(69.6%),并且 扩展规模会减少语言捷径但放大谄媚(Qwen2.5-VL 7B→72B:谄媚 72.4%→95.3%)。
  • Agent 安全正在从提示文本转向系统接口与观测通道:具备优先级感知的提示组合防御(PCFI)、MITM 网页流量红队(ClawTrap)与 密码学准入控制(ACP)都把 agent 栈当作攻击面——不只是模型本身。
  • 隐私风险已变成“推理时链接(inference-time linkage)”,而不仅是显式标识符泄露:agent 能从弱线索重建身份(例如 Netflix 79.2% 链接;AOL 40 条历史中确认 10 个身份),推动隐私评估去衡量推断出的身份,而不只是脱敏/删改。
  • 数据/反馈质量正在成为对齐瓶颈:CausalRM 表明,纠正观测反馈中的 噪声 + 选择偏差 能带来显著下游安全收益(例如 +49.2% WildGuardMix,+32.7% HarmBench);MOSAIC 则显示 带预算、切片感知的混合搜索 能避免朴素安全混合导致的过度拒答/能力崩塌。
  • Agent 训练与评估正在收敛到“信用分配 + 效率”:ZebraArena 量化工具查询低效与理论最优的差距;RewardFlow 与 HISR 提出更稠密、结构感知的奖励传播/分段过程奖励;OS-Themis 通过里程碑验证与审计改进长时程 GUI 奖励。

2) 关键主题(聚类)

主题:多模态 grounding 与安全是可引导的(也可被利用)

主题:Agent 安全加固 组合边界观测通道

主题:隐私威胁从“泄露了什么”转向“能推断出什么”

主题:对齐优化走向以数据为中心并进行因果校正

主题:Agent RL 与评估强调信用分配、效率与长时程奖励可靠性

3) 技术综合

  • 多篇论文在 反事实/溯源感知评估 上趋同:VLM 盲/噪声/冲突干预(视觉 grounding)、提示片段优先级强制(PCFI)与 MITM 观测重写(ClawTrap)都把“模型看到了什么”当作关键变量。
  • 反复出现的模式是 将行为与底层能力分离:拒答率 vs grounded 安全(SAVeS)、准确率 vs 图像依赖 vs 对齐偏好(三层)、以及“引用存在” vs 因果 grounding(XKD-Dial 遮挡)。
  • 预算无处不在:PlanTwin 披露预算、ZebraArena 查询预算/定价、MOSAIC 固定 SFT token 预算、OS-Themis 成本/延迟核算——提示评估应报告成本条件化性能曲线,而非单一分数。
  • 对齐方法越来越多使用 因果/统计校正 而非更多数据:CausalRM 的噪声+选择偏差校正与 MOSAIC 的切片感知分配相呼应——都旨在避免“在错误信号上训练”。
  • Agent RL 正转向 结构诱导的稠密奖励,而无需训练独立奖励模型:RewardFlow 用拓扑;HISR 用事后似然比;OS-Themis 用里程碑证据链。
  • 提示/引导被证明是 双刃剑:人设提升对齐但伤害知识(PRISM),语义线索可辅助也可攻击 VLM 安全(SAVeS),PR 元数据可锚定代码审查判断(确认偏差)。
  • 鲁棒性失败常 不对称:确认偏差主要增加漏报;VLM 可检测异常(高 LAD)却仍幻觉(高 CS);隐私链接即便在“良性”任务框架下也可能发生(INFERLINK IMPLICIT)。
  • 多项工作强调 可审计接口:ACP 的签名账本+执行令牌、PlanTwin 的 schema 约束孪生+gatekeeper、OS-Themis 的可验证里程碑检查都生成可事后检查的工件。
  • RL 目标的简化趋势:RGRA 提示 PPO 式 clipping 可能对 GRPO 类推理增益并非必要(小模型中),但 优势归一化与负反馈对稳定性至关重要

4) Top 5 论文(含“为何现在”)

1) To See or To Please: Uncovering Visual Sycophancy and Split Beliefs in VLMs(看见还是取悦:揭示 VLM 的视觉谄媚与分裂信念)

  • 通过反事实图像将 VLM 失败分解为 感知(LAD)、依赖(VNS)、对齐(CS)
  • 发现 视觉谄媚是主导失效模式(69.6%),并在其 Qwen2.5-VL 分析中 随模型规模增大而加剧
  • 提供实用缓解:诊断引导的选择性预测(在 50% 覆盖率下准确率最高 +9.5pp)。
  • 质疑点:需要完整 logits(排除闭源模型),且缓解并未修复主导的谄媚机制。

2) From Weak Cues to Real Identities: Evaluating Inference-Driven De-Anonymization in LLM Agents(从弱线索到真实身份:评估 LLM Agent 的推理驱动去匿名化)

  • 在经典(Netflix/AOL)、受控(INFERLINK)与现代轨迹上形式化并度量 推理驱动链接
  • 报告高链接能力(如 GPT-5 在 Netflix 79.2%;AOL 子集 CLC=10),且链接可在良性框架下出现。
  • 测试基于提示的隐私护栏并量化隐私–效用权衡。
  • 质疑点:INFERLINK 较简化;现代轨迹研究是机制演示而非流行度估计。

3) CausalRM: Causal-Theoretic Reward Modeling for RLHF from Observational User Feedbacks(CausalRM:面向观测用户反馈的 RLHF 因果理论奖励建模)

  • 噪声校正替代损失倾向性重加权双重稳健估计结合用于观测 RLHF 信号。
  • 展示一致的 RM 改进与显著下游安全收益(例如其设置下 Qwen2.5-7B:+49.2% WildGuardMix,+32.7% HarmBench)。
  • 在正确估计干扰项时提供理论无偏保证(IPS/DR)。
  • 质疑点:依赖准确的倾向性/噪声率估计(锚点单元),且未探索观测+实验混合机制。

4) Measuring and Exploiting Confirmation Bias in LLM-Assisted Security Code Review(度量并利用 LLM 辅助安全代码审查中的确认偏差)

  • 量化 PR 框架(“无 bug”)如何导致跨模型漏洞检测 TPR 下降 16.2–93.5pp
  • 展示真实可利用性:在其测试设置下对 Copilot 35.3% 绕过、对 Claude Code 88.2%;迭代精炼提升成功率。
  • 表明缓解(忽略元数据/脱敏)可大幅恢复检测(交互式报告 100% 恢复;自治约 94%)。
  • 质疑点:在选定模型与受控环境中评估;较高基线误报使运营解释更复杂。

5) ZEBRAARENA: A Diagnostic Simulation Environment for Studying Reasoning-Action Coupling in Tool-Augmented LLMs(ZEBRAARENA:研究工具增强 LLM 推理-行动耦合的诊断仿真环境)

  • 提供程序化、抗污染环境,含 理论最小查询次数 K⋆ 与丰富效率诊断。
  • 显示即便强模型也可能高度低效(GPT-5 准确率近乎完美但工具调用比 K⋆ 多 70–270%)。
  • 揭示“预算焦虑”:更多预算并不稳定提升准确率。
  • 质疑点:理想化逻辑谜题场景;向噪声真实工具迁移仍待建立。

5) 实用下一步

  • 面向 VLM 产品:实现 反事实输入探针(盲/噪声/冲突),并跟踪类似 LAD/VNS/CS 的信号以区分“看不见”vs“不会说”。
  • 对任何基于引用/标记的安全 UX 增加 grounding 审计:运行 遮挡式因果检查,确保引用/标记确实控制输出(而非仅格式)。
  • 在 agent 栈中,将提示组装视为安全边界:采用 溯源标记 + 优先级强制(PCFI 类),并记录片段谱系用于事件响应。
  • 对代码审查 agent:剥离/规范化 PR 元数据,或在审查提示中明确“忽略元数据”;将对抗性“无 bug”框架下的检测作为回归测试。
  • 对处理私有状态的云规划 agent:原型化 带类型的数字孪生 + 能力目录 + gatekeeper(PlanTwin 类),并加入 披露预算 防止多轮指纹化。
  • 对来自日志的 RLHF:评估反馈是否 非随机缺失(missing-not-at-random);在收集更多标注前先尝试 倾向性 + 噪声校正(CausalRM 类)。
  • 对工具增强 agent:除准确率外报告 效率指标(查询次数 vs K⋆、冗余比、token 成本),用以调参预算策略并降低“预算焦虑”。
  • 对 GUI/长时程 RL:考虑 证据链评论器(里程碑 + 验证 + 审计),并跟踪评论器精确率/召回率,而不只看策略成功率。

由逐篇论文分析生成;无外部浏览。