AI Paper Daily (2026-03-31)

Published:

English version: /paper-news/2026-03-31/

Run Statistics

  • Candidate papers: 223
  • Selected papers: 30
  • Close reads completed: 30
  • Time window (UTC): 2026-03-30T00:00:00Z → 2026-03-31T00:00:00Z (arxiv_announce, expanded=0)
Paper list used for the summary (arXiv ID · title/link · categories · score · selection rationale · tags):

  • 2603.28013 · Kill-Chain Canaries: Stage-Level Tracking of Prompt Injection Across Attack Surfaces and Model Safety Tiers (PDF)
    cs.CR, cs.AI, cs.LG · Score 95 · Stage-level prompt-injection tracking w/ canaries across agents; actionable defense insights
    Tags: prompt-injection, agent-security, evaluation, kill-chain, canary-tokens, red-teaming
  • 2603.28063 · Reward Hacking as Equilibrium under Finite Evaluation (PDF)
    cs.AI, cs.GT · Score 95 · Formal result: reward hacking emerges under finite evaluation; computable distortion index.
    Tags: reward-hacking, alignment-theory, evaluation, principal-agent, RLHF, DPO
  • 2603.28650 · Information-Theoretic Limits of Safety Verification for Self-Improving Systems (PDF)
    cs.LG, cs.AI, stat.ML · Score 95 · Strong theoretical impossibility results for safety gates in self-improving systems
    Tags: ai-safety, self-improvement, verification, risk-bounds, theory, distribution-shift
  • 2603.28166 · Evaluating Privilege Usage of Agents on Real-World Tools (PDF)
    cs.CR, cs.AI · Score 93 · GrantBox sandbox evaluates real-tool privilege usage; closer to real-world agent security
    Tags: agents, tool-use, privilege, sandbox, security-eval, real-world-tools
  • 2603.28345 · Crossing the NL/PL Divide: Information Flow Analysis Across the NL/PL Boundary in LLM-Integrated Code (PDF)
    cs.SE, cs.AI · Score 92 · Bridges NL/PL boundary for info-flow/taint across LLM calls; key for LLM app security.
    Tags: program-analysis, information-flow, LLM-security, prompting, taint-analysis, NL-PL-boundary
  • 2603.28407 · MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome (PDF)
    cs.AI, cs.CL · Score 90 · Deep research agent benchmark scoring process+outcome; multimodal, refreshable tasks
    Tags: agents, evaluation, deep-research, multimodal, benchmarks, process-metrics
  • 2603.28204 · ERPO: Token-Level Entropy-Regulated Policy Optimization for Large Reasoning Models (PDF)
    cs.LG, cs.AI · Score 90 · Token-level RLVR/GRPO fix to prevent entropy collapse; targets reasoning quality
    Tags: llm, rlvr, grpo, credit-assignment, reasoning, post-training
  • 2603.28054 · Who Wrote the Book? Detecting and Attributing LLM Ghostwriters (PDF)
    cs.CL · Score 90 · GhostWriteBench + robust OOD LLM authorship attribution; practical for misuse detection
    Tags: authorship-attribution, misuse-detection, benchmark, OOD-robustness, fingerprinting, long-form-text
  • 2603.28551 · "What Did It Actually Do?": Understanding Risk Awareness and Traceability for Computer-Use Agents (PDF)
    cs.CR, cs.ET, cs.HC, cs.MA · Score 89 · Studies risk awareness + post-hoc auditability for computer-use agents; real incidents corpus.
    Tags: agent-safety, computer-use-agents, auditability, traceability, HCI, security
  • 2603.28569 · CirrusBench: Evaluating LLM-based Agents Beyond Correctness in Real-World Cloud Service Environments (PDF)
    cs.LG, cs.AI, cs.IR, cs.PF · Score 88 · Real cloud-ticket agent benchmark; measures robustness and resolution efficiency beyond accuracy
    Tags: agents, evaluation, real-world, customer-support, long-horizon, efficiency
  • 2603.27982 · CDH-Bench: A Commonsense-Driven Hallucination Benchmark for Evaluating Visual Fidelity in Vision-Language Models (PDF)
    cs.CV, cs.AI, cs.CL · Score 88 · New benchmark for commonsense-driven hallucination in VLMs via evidence conflicts
    Tags: vlm, hallucination, evaluation, robustness, benchmarks, reliability
  • 2603.28376 · Marco DeepResearch: Unlocking Efficient Deep Research Agents via Verification-Centric Design (PDF)
    cs.CL, cs.AI · Score 87 · Verification-centric deep research agent design across data synthesis, trajectories, test-time.
    Tags: agents, verification, deep-research, tool-use, long-horizon, reliability
  • 2603.28618 · Seeing with You: Perception-Reasoning Coevolution for Multimodal Reasoning (PDF)
    cs.AI · Score 86 · RLVR splits observer/solver to improve visual evidence extraction + reasoning
    Tags: multimodal, rlvr, credit-assignment, evidence, reasoning, mllm
  • 2603.28304 · The Necessity of Setting Temperature in LLM-as-a-Judge (PDF)
    cs.CL · Score 86 · Shows temperature materially affects LLM-as-judge reliability; important eval hygiene
    Tags: LLM-judge, evaluation, temperature, reliability, methodology, meta-eval
  • 2603.27918 · Adversarial Attacks on Multimodal Large Language Models: A Comprehensive Survey (PDF)
    cs.CR, cs.AI · Score 84 · Comprehensive survey of adversarial threats to MLLMs with taxonomy and vulnerability analysis
    Tags: multimodal, adversarial-attacks, survey, security, threat-models, jailbreaks
  • 2603.28476 · With a Little Help From My Friends: Collective Manipulation in Risk-Controlling Recommender Systems (PDF)
    cs.IR, cs.LG, cs.SI · Score 84 · Shows coordinated user manipulation can break risk-controlling recommenders with safety guarantees.
    Tags: recommenders, adversarial, safety-guarantees, conformal-risk-control, manipulation
  • 2603.28430 · IsoQuant: Hardware-Aligned SO(4) Isoclinic Rotations for LLM KV Cache Compression (PDF)
    cs.LG, cs.CL · Score 84 · Hardware-aligned KV-cache compression via SO(4) rotations; practical LLM efficiency
    Tags: llm, inference, kv-cache, compression, quantization, systems
  • 2603.28135 · CoT2-Meta: Budgeted Metacognitive Control for Test-Time Reasoning (PDF)
    cs.AI · Score 84 · Training-free metacognitive control for budgeted reasoning incl. abstain/repair/prune
    Tags: test-time-reasoning, inference-time-control, compute-budget, abstention, search, chain-of-thought
  • 2603.28378 · Membership Inference Attacks against Large Audio Language Models (PDF)
    cs.SD, cs.AI · Score 83 · First MIA study for audio LMs; shows confounds and proposes distribution-matched evaluation
    Tags: privacy, membership-inference, audio, evaluation, distribution-shift
  • 2603.28005 · Rethinking Atomic Decomposition for LLM Judges: A Prompt-Controlled Study of Reference-Grounded QA Evaluation (PDF)
    cs.CL · Score 82 · Careful prompt-controlled study of atomic decomposition for LLM judges; eval reliability focus
    Tags: LLM-judges, evaluation, grounded-QA, rubrics, factuality, methodology
  • 2603.28092 · InkDrop: Invisible Backdoor Attacks Against Dataset Condensation (PDF)
    cs.LG · Score 82 · Stealthy backdoor attacks on dataset condensation; highlights a supply-chain vulnerability.
    Tags: backdoors, data-poisoning, dataset-condensation, ML-security, stealth
  • 2603.28610 · ResAdapt: Adaptive Resolution for Efficient Multimodal Reasoning (PDF)
    cs.CV, cs.AI, cs.CL · Score 82 · Adaptive input resolution to trade visual tokens vs context; bandit-trained allocator
    Tags: mllm, efficiency, long-context, vision-tokens, bandits, inference
  • 2603.28696 · AdaptToken: Entropy-based Adaptive Token Selection for MLLM Long Video Understanding (PDF)
    cs.CV, cs.AI · Score 82 · Uses model uncertainty/entropy to allocate long-video token budget; scalable MLLM control
    Tags: MLLM, long-context, video-understanding, token-selection, uncertainty, efficiency
  • 2603.28662 · AMIGO: Agentic Multi-Image Grounding Oracle Benchmark (PDF)
    cs.LG, cs.AI · Score 81 · Long-horizon multi-image grounding benchmark with strict protocol; probes uncertainty tracking
    Tags: multimodal-agents, benchmark, interactive-eval, uncertainty, grounding
  • 2603.28605 · Unsafe2Safe: Controllable Image Anonymization for Downstream Utility (PDF)
    cs.CV, cs.CY, cs.LG · Score 81 · Automated anonymization via VLM+LLM-guided diffusion edits; privacy protection for training data.
    Tags: privacy, anonymization, diffusion-editing, dataset-safety, VLM, PII
  • 2603.28301 · LIBERO-Para: A Diagnostic Benchmark and Metrics for Paraphrase Robustness in VLA Models (PDF)
    cs.LG · Score 80 · Benchmark for paraphrase robustness in VLA robots; large drops under synonyms reveal brittleness.
    Tags: evaluation, robustness, paraphrase, VLA, robotics, instruction-following
  • 2603.28488 · Courtroom-Style Multi-Agent Debate with Progressive RAG and Role-Switching for Controversial Claim Verification (PDF)
    cs.CL, cs.AI, cs.MA · Score 79 · Structured multi-agent debate + progressive RAG for claim verification; targets hallucinations
    Tags: claim-verification, RAG, multi-agent, debate, hallucinations, calibration
  • 2603.28038 · Beyond the Answer: Decoding the Behavior of LLMs as Scientific Reasoners (PDF)
    cs.AI, cs.LG · Score 79 · Prompt-optimization study probes brittleness/transfer of scientific reasoning behaviors
    Tags: reasoning, prompting, interpretability, robustness, scientific-tasks, behavior-analysis
  • 2603.28730 · SOLE-R1: Video-Language Reasoning as the Sole Reward for On-Robot Reinforcement Learning (PDF)
    cs.RO, cs.CL, cs.CV · Score 78 · Video-language reasoning model as sole RL reward; addresses reward exploitation under shift.
    Tags: robotics, RL, VLM, reward-modeling, distribution-shift, agentic-RL
  • 2603.28622 · Trust-Aware Routing for Distributed Generative AI Inference at the Edge (PDF)
    cs.DC, cs.AI, cs.NI · Score 78 · Trust-aware routing for distributed generative inference; risk-bounded path selection
    Tags: agent-systems, distributed-inference, trust, robustness, security, edge

AI Paper Insights Brief

2026-03-31

0) Core Takeaways (read this first)

  • Agent safety evaluation is shifting from "did it fail?" to "where did it fail?" Stage-level prompt-injection tracking (EXPOSED→PERSISTED→RELAYED→EXECUTED) shows that exposure can be near-universal, but downstream execution varies sharply by model and pipeline stage, which changes what a "robust" architecture should look like.
  • Privilege misuse on real tools is currently the norm, not an edge case. In a sandbox with real MCP servers/tools, prompt-injection privilege hijacking reaches very high ASR (on average 90.55% for ReAct, 79.05% for Plan-and-Execute), indicating that tool authorization and isolation are the immediate bottleneck.
  • Multimodal reliability failures increasingly look like "priors overriding evidence," not just fabrication. CDH-Bench finds that when images contain atypical evidence, VLMs often fall back to commonsense priors (average CFAD 16.39% QA, 25.20% MC), with counting anomalies especially hard.
  • Test-time compute is becoming controllable and auditable. CoT2-Meta shows that training-free meta-control (extend/prune/repair/abstain) improves accuracy under a fixed budget and markedly improves calibration (reported ECE 0.035).
  • RL for multimodal reasoning is moving toward better credit assignment. PRCO's Observer/Solver coevolution raises average accuracy by roughly +7 points and reduces perception errors (e.g., −39.2% perception errors on WeMath), directly optimizing perception as the bottleneck.
  • Safety-gate theory is hardening: classifier-style gates may be structurally insufficient under long-run self-improvement. An information-theoretic result shows that, under common schedules, classifier-style gates generally cannot admit unbounded beneficial updates while keeping cumulative risk bounded; verification-style gates can escape this (demonstrated with δ=0 and TPR>0 on GPT-2 LoRA).

2) Key Themes (clusters)

Theme: Prompt injection and privilege misuse in agent pipelines

Theme: Multimodal hallucination as "prior-driven normalization"

Theme: Evaluation reliability: judges, temperature, and the decomposition myth

Theme: Test-time reasoning control and token-level credit assignment

Theme: Long-horizon multimodal efficiency via adaptive allocation (video)

Theme: Privacy, forensics, and dataset-integrity attacks/defenses

Theme: Alignment theory and the limits of safety verification

  • Why it matters: some failure modes (reward hacking, long-run safety gating) may be structural, not something better prompts or more evaluation can patch.
  • Representative papers
  • Common approaches
    • Formalize evaluation as projecting high-dimensional quality onto finitely many signals; prove distortion is unavoidable under optimization.
    • Provide computable diagnostics (a distortion index via reward-model gradients) and scaling arguments (if evaluation does not scale quadratically, tool compositionality drives coverage to vanish).
  • Open questions / failure modes
    • Empirical validation of the reward-hacking equilibrium model remains largely outstanding.
    • Verification approaches rely on tractable certificates (e.g., Lipschitz bounds), which may be hard to compute tightly at scale.
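The finite-evaluation intuition admits a tiny numerical illustration (my own construction, not the paper's formalism): if evaluation observes only a low-dimensional projection of quality, hill-climbing the measured score inflates it far faster than true quality.

```python
import numpy as np

d, k = 50, 3                       # quality dimensions vs. evaluated signals
w_true = np.ones(d) / np.sqrt(d)   # true quality weights every dimension
P = np.eye(k, d)                   # the evaluator observes only 3 dimensions

x = np.zeros(d)
g = P.sum(axis=0)                  # gradient of the measured score (P @ x).sum()
g /= np.linalg.norm(g)
for _ in range(200):               # optimize exactly what is measured
    x += 0.1 * g

measured = float((P @ x).sum())    # what the evaluator sees: ~34.64
true_q = float(w_true @ x)         # what we actually care about: ~4.90
print(f"measured: {measured:.2f}  true: {true_q:.2f}")
```

The gap (here about 7×) is the kind of mismatch a distortion index is meant to quantify; this demo only illustrates the projection idea, not the equilibrium result.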

3) Technical Synthesis

  • Stage-level safety instrumentation (kill-chain canaries) and the NL/PL information-flow taxonomy operationalize the same idea: do not treat LLM output as a monolithic taint; model intermediate propagation and transformation.
  • Prompt-injection robustness is attack-surface-dependent: the same model can be safe against memory poisoning yet fail completely on tool poisoning/propagation, so benchmarks must cover multiple attack surfaces.
  • Multiple works converge on uncertainty/entropy as a control signal: ERPO preserves entropy at critical tokens; AdaptToken uses response entropy for global token allocation and early stopping; CoT2-Meta fuses process and outcome confidence for control.
  • Multimodal RLVR is splitting into better credit assignment (PRCO's Observer/Solver) and better inference-time control (CoT2-Meta); both aim to reduce "fluent but wrong," but act at different lifecycle stages.
  • Evaluation reliability is now treated as a first-class system variable: temperature strongly affects judge consistency/error rates, and prompt richness confounds the gains attributed to "atomic decomposition."
  • Benchmarks are expanding from final correctness to process and efficiency metrics: MiroEval (process↔report alignment), CirrusBench (NEI/LJ/latency), AMIGO (protocol adherence + verifiable accuracy).
  • Audio privacy auditing offers a general lesson for safety evaluation: blind baselines can explain apparent vulnerability; without controlling dataset artifacts, conclusions can be wrong.
  • The theoretical alignment papers hint at a looming mismatch: as agents gain tools, evaluation coverage shrinks (amplifying reward hacking), while classifier-style safety gates may face long-run impossibility, pushing a shift toward verification/certification.
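The converging entropy-as-control-signal pattern can be sketched in a few lines (the thresholds and the proportional-allocation rule are illustrative assumptions, not any single paper's method):

```python
import math

def entropy(probs):
    """Shannon entropy (nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def allocate_budget(segment_entropies, total_budget):
    """Give uncertain segments more tokens, in proportion to their entropy."""
    z = sum(segment_entropies) or 1.0
    return [round(total_budget * h / z) for h in segment_entropies]

def should_stop(recent_entropies, threshold=0.3, window=3):
    """Early-stop once the last `window` steps are all confidently low-entropy."""
    tail = recent_entropies[-window:]
    return len(tail) == window and all(h < threshold for h in tail)

# Three video segments / reasoning steps with mock next-token distributions.
dists = [[0.25, 0.25, 0.25, 0.25],   # maximally uncertain
         [0.7, 0.1, 0.1, 0.1],       # moderately uncertain
         [0.97, 0.01, 0.01, 0.01]]   # near-certain
ents = [entropy(d) for d in dists]
print(allocate_budget(ents, 1000))   # most budget to the uncertain segment
print(should_stop([0.9, 0.2, 0.1, 0.05]))
```

The same scalar drives two different decisions: how much budget a segment gets, and when to stop spending at all.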

4) Top 5 Papers (with "why now")

1) Kill-Chain Canaries: Stage-Level Tracking of Prompt Injection Across Attack Surfaces and Model Safety Tiers

  • Introduces stage-level tracking (EXPOSED/PERSISTED/RELAYED/EXECUTED) to explain where defenses work, not just whether the final action happens.
  • Shows exposure can reach 100% while execution varies widely (e.g., in reported no-defense runs: GPT-4o-mini 53% ASR, GPT-5-mini 3%, Claude variants 0%).
  • Reveals extreme attack-surface splits (e.g., in reported cells, DeepSeek is 0% on memory_poison but 100% on tool_poison/propagation).
  • Caveats: per-cell sample sizes are small and payloads are synthetic and explicit; the mechanism behind "summarization-stage stripping" is not isolated.
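Aggregating many such runs into per-stage rates, rather than one ASR number, might look like this (my own aggregation sketch, not the paper's harness):

```python
from collections import Counter

STAGES = ["EXPOSED", "PERSISTED", "RELAYED", "EXECUTED"]

def stage_rates(runs):
    """runs: deepest-stage label per run (None if the canary never appeared).
    Returns, per stage, the fraction of runs reaching at least that stage
    (assuming a monotone chain: a deeper stage implies all earlier ones)."""
    n = len(runs)
    depth = {s: i for i, s in enumerate(STAGES)}
    counts = Counter(r for r in runs if r is not None)
    return {s: sum(c for r, c in counts.items() if depth[r] >= i) / n
            for i, s in enumerate(STAGES)}

# 10 runs: exposure is common, execution is rare.
runs = ["EXPOSED"] * 5 + ["RELAYED"] * 3 + ["EXECUTED"] * 1 + [None] * 1
print(stage_rates(runs))
# {'EXPOSED': 0.9, 'PERSISTED': 0.4, 'RELAYED': 0.4, 'EXECUTED': 0.1}
```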

2) Evaluating Privilege Usage of Agents on Real-World Tools

  • Provides a real-tool sandbox (10 MCP servers, 122 privilege-sensitive tools) with automatically generated benign/malicious requests.
  • Reports very high average privilege-hijacking ASR across four LLMs (90.55% ReAct, 79.05% Plan-and-Execute): strong evidence the problem is immediate.
  • Highlights that planning helps but does not solve privilege misuse.
  • Caveats: covers only 10 servers and 4 models; defenses are not yet evaluated.
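The least-privilege mitigation this benchmark motivates can be sketched as a per-task tool gate (the `ToolGate` API is hypothetical, not GrantBox's interface):

```python
# Each task carries an explicit per-tool allowlist; any out-of-scope call is
# blocked and recorded as a candidate privilege-misuse event.
from dataclasses import dataclass, field

@dataclass
class ToolGate:
    allowlist: set[str]                      # tools this task may use
    misuse_log: list[str] = field(default_factory=list)

    def call(self, tool: str, **kwargs):
        if tool not in self.allowlist:
            self.misuse_log.append(tool)     # candidate hijack event
            return {"status": "blocked", "tool": tool}
        return {"status": "allowed", "tool": tool, "args": kwargs}

# Task: "read my calendar" should never need send_email.
gate = ToolGate(allowlist={"calendar.read"})
print(gate.call("calendar.read", day="2026-03-31"))
print(gate.call("send_email", to="attacker@example.com"))
print(gate.misuse_log)  # ['send_email']
```

In a real agent loop the blocked call would also surface to the user or a policy engine; here the log simply counts candidate hijack events.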

3) MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome

  • A refreshable, user-grounded benchmark with process-centric evaluation and multimodal tasks.
  • Finds process quality strongly predicts outcomes (reported r = 0.88), making "trace quality" a measurable target.
  • Shows multimodal tasks cause consistent drops (3–10 points), with rankings shifting across synthesis/factuality/process dimensions.
  • Caveats: process evaluation requires trace access; absolute scores depend on LLM judges, even if rankings are fairly robust.

4) CoT2-Meta: Budgeted Metacognitive Control for Test-Time Reasoning

  • A training-free controller that allocates the reasoning budget among extend/prune/repair/stop/abstain using fused process+outcome signals.
  • Reports consistent gains across 15 benchmarks at matched budgets, plus improved calibration (reported ECE 0.035).
  • Provides interpretable controller traces and ablations attributing the gains to components.
  • Caveats: depends on oracle/process-evaluator quality; misranking can cause premature pruning.
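To check calibration claims such as the reported ECE 0.035 on your own runs, the standard equal-width-bin expected calibration error is easy to compute (this is the common 10-bin variant; papers differ in binning details):

```python
import numpy as np

def ece(confidences, correct, n_bins=10):
    """Weighted average of |accuracy - confidence| over confidence bins."""
    conf = np.asarray(confidences, dtype=float)
    corr = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # half-open bins (lo, hi], except the first which includes 0
        mask = (conf > lo) & (conf <= hi) if lo > 0 else (conf >= lo) & (conf <= hi)
        if mask.any():
            gap = abs(corr[mask].mean() - conf[mask].mean())
            total += mask.mean() * gap   # bin weight = fraction of samples
    return total

# Overconfident toy model: claims 0.9 confidence but is right 60% of the time.
print(round(ece([0.9] * 10, [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]), 3))  # 0.3
```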

5) Seeing with You: Perception-Reasoning Coevolution for Multimodal Reasoning

  • Tackles ambiguous RLVR credit assignment by alternating an Observer (evidence description) and a Solver (answering), with role-specific rewards and leakage suppression.
  • Reports roughly +7 points average accuracy and a marked drop in perception errors (e.g., −39.2% perception errors on WeMath).
  • Demonstrates gains across multiple backbones, including Qwen3-VL-8B-Instruct.
  • Caveats: intermediate descriptions may be lossy; evaluation centers on benchmarks with concise verifiable answers rather than open-ended generation.

5) Practical Next Steps

  • For agent safety evaluation, replace a single ASR with stage-level metrics (exposed/persisted/relayed/executed), and run across multiple injection surfaces (memory, tool outputs, propagation, privilege escalation).
  • In tool-using systems, implement least privilege + per-tool allowlists, measure misuse with a GrantBox-style framework, and compare ReAct vs. Plan-and-Execute as baseline mitigations.
  • For LLM-integrated code, add NL/PL boundary flow annotations (placeholder preservation / modality taxonomy) in CI, and use them to prioritize call sites that need strict sanitization or structured-output constraints.
  • For multimodal models, add CDH-style paired evaluations (evidence vs. prior conflicts) and track CFAD/CCR to catch "normalization" failures that standard VQA misses.
  • When using LLM-as-a-judge, set temperature deliberately (very low T for consistency and parse stability), and report judge temperature plus repeated-seed variance as part of benchmark methodology.
  • For test-time reasoning, prototype budgeted meta-control (prune/repair/abstain) and, at fixed compute, measure not only accuracy but also ECE/selective prediction.
  • For multimodal RLVR, try role-separated credit assignment (Observer/Solver) and explicitly measure perception vs. reasoning error categories to confirm perception actually improves.
  • For privacy auditing (especially audio), run a blind-baseline separability check before claiming memorization; only then run MIAs on distribution-matched subsets.
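The judge-temperature recommendation can be made concrete with a mock judge, where a softmax over fixed logits stands in for real API calls (logits, repeat count, and seed are illustrative):

```python
import math
import random
import statistics

def mock_judge(logits, temperature, rng):
    """Sample a verdict index from softmax(logits / T); T -> 0 becomes argmax."""
    t = max(temperature, 1e-6)
    m = max(logits)
    exps = [math.exp((l - m) / t) for l in logits]  # shift by max for stability
    r, acc = rng.random() * sum(exps), 0.0
    for i, e in enumerate(exps):
        acc += e
        if r <= acc:
            return i
    return len(logits) - 1

def verdict_variance(logits, temperature, repeats=200, seed=0):
    """Variance of repeated verdicts: the repeated-run variance to report."""
    rng = random.Random(seed)
    votes = [mock_judge(logits, temperature, rng) for _ in range(repeats)]
    return statistics.pvariance(votes)

logits = [2.0, 0.5, 0.1]            # judge mildly prefers verdict 0
for t in (0.0, 0.7, 1.5):
    print(f"T={t}: verdict variance = {verdict_variance(logits, t):.3f}")
# low T: near-deterministic verdicts; higher T: noisier votes
```

Reporting this variance alongside the judge's temperature makes the stability of a benchmark's scores auditable rather than implicit.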

Generated from per-paper analyses; no external browsing.