AI 论文日报(2026-04-01)

Published:

English version: /paper-news/2026-04-01/

运行统计

  • 候选论文: 223
  • 入选论文: 30
  • 已精读完成: 30
  • 时间窗口 (UTC): 2026-03-30T00:00:00Z → 2026-03-31T00:00:00Z (arxiv_announce, expanded=0)
展开查看用于总结的论文列表
arXiv ID标题 / 链接分类评分入选理由标签
2603.28013Kill-Chain Canaries: Stage-Level Tracking of Prompt Injection Across Attack Surfaces and Model Safety Tiers
PDF
cs.CR, cs.AI, cs.LG95Stage-level prompt-injection tracking w/ canaries across models, surfaces, and safety tiersprompt-injection, agent-security, evaluation, canary-tokens, kill-chain, frontier-models
2603.28063Reward Hacking as Equilibrium under Finite Evaluation
PDF
cs.AI, cs.GT95Formal result: reward hacking emerges under finite evaluation; computable distortion index.reward-hacking, principal-agent, evaluation, alignment-theory, RLHF, DPO
2603.28650Information-Theoretic Limits of Safety Verification for Self-Improving Systems
PDF
cs.LG, cs.AI, stat.ML95Formal impossibility bounds for safety gates in self-improving systems; high AI safety relevanceAI safety, self-improvement, verification, risk bounds, theory, TPR/FPR, impossibility
2603.28166Evaluating Privilege Usage of Agents on Real-World Tools
PDF
cs.CR, cs.AI92GrantBox sandbox evaluates real-world tool privilege usage—core risk for autonomous agentsagent-security, tool-use, privilege, sandbox, benchmark, real-world-tools
2603.28345Crossing the NL/PL Divide: Information Flow Analysis Across the NL/PL Boundary in LLM-Integrated Code
PDF
cs.SE, cs.AI92Bridges NL/PL boundary for info-flow/taint across LLM calls; key for LLM app security.program-analysis, information-flow, taint-analysis, LLM-security, prompting, software-engineering
2603.28551"What Did It Actually Do?": Understanding Risk Awareness and Traceability for Computer-Use Agents
PDF
cs.CR, cs.ET, cs.HC, cs.MA90Empirical study + corpus on computer-use agent risk awareness and post-hoc auditability.computer-use-agents, auditability, traceability, user-safety, agent-security, HCI
2603.28204ERPO: Token-Level Entropy-Regulated Policy Optimization for Large Reasoning Models
PDF
cs.LG, cs.AI90Token-level entropy regulation for RLVR/GRPO improves credit assignment in reasoning chainsLLM reasoning, RLVR, GRPO, post-training, credit assignment, entropy, optimization
2603.28054Who Wrote the Book? Detecting and Attributing LLM Ghostwriters
PDF
cs.CL90GhostWriteBench + robust OOD LLM authorship attribution; practical for misuse detection.authorship-attribution, model-fingerprinting, misuse-detection, dataset, robustness, OOD
2603.28407MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome
PDF
cs.AI, cs.CL89MiroEval benchmarks multimodal deep-research agents on process + outcome with refreshable tasksagents, evaluation, deep-research, multimodal, process-metrics, benchmark
2603.28376Marco DeepResearch: Unlocking Efficient Deep Research Agents via Verification-Centric Design
PDF
cs.CL, cs.AI88Verification-centric deep research agent design across synthesis/trajectories/test-time scaling.deep-research-agents, verification, tool-use, long-horizon, test-time-scaling, RAG
2603.27982CDH-Bench: A Commonsense-Driven Hallucination Benchmark for Evaluating Visual Fidelity in Vision-Language Models
PDF
cs.CV, cs.AI, cs.CL88New benchmark for commonsense-driven hallucination when vision conflicts with priorsVLM, hallucination, evaluation, robustness, benchmarks, visual grounding, reliability
2603.28569CirrusBench: Evaluating LLM-based Agents Beyond Correctness in Real-World Cloud Service Environments
PDF
cs.LG, cs.AI, cs.IR, cs.PF87Real cloud-ticket agent benchmark measuring robustness and resolution efficiency beyond accuracyagents, evaluation, real-world, customer-support, long-horizon, reliability
2603.28387The Scaffold Effect: How Prompt Framing Drives Apparent Multimodal Gains in Clinical VLM Evaluation
PDF
cs.AI, cs.LG86Shows prompt framing can fake multimodal clinical gains; important eval artifact warning.evaluation, prompting, multimodal, VLM, clinical-AI, spurious-cues
2603.28430IsoQuant: Hardware-Aligned SO(4) Isoclinic Rotations for LLM KV Cache Compression
PDF
cs.LG, cs.CL86Hardware-aligned SO(4) rotations for low-bit KV-cache compression; practical LLM efficiency gainLLM efficiency, KV cache, compression, quantization, inference, systems, long-context
2603.28304The Necessity of Setting Temperature in LLM-as-a-Judge
PDF
cs.CL86Shows temperature materially affects LLM-as-judge reliability; key for eval validity.LLM-as-a-judge, evaluation, reproducibility, temperature, meta-eval, benchmarking
2603.28618Seeing with You: Perception-Reasoning Coevolution for Multimodal Reasoning
PDF
cs.AI85Dual-role RLVR separates perception vs reasoning credit; targets evidence extraction failuresmultimodal, RLVR, credit assignment, VLM, reasoning, perception, reliability
2603.27918Adversarial Attacks on Multimodal Large Language Models: A Comprehensive Survey
PDF
cs.CR, cs.AI84Comprehensive survey of adversarial attacks on MLLMs with taxonomy across modalities/settingsmultimodal, adversarial-attacks, survey, security, jailbreaks, threat-models
2603.28092InkDrop: Invisible Backdoor Attacks Against Dataset Condensation
PDF
cs.LG84Stealthy backdoor attack on dataset condensation; highlights new data-poisoning surface.backdoors, data-poisoning, dataset-condensation, adversarial-ML, security
2603.28135CoT2-Meta: Budgeted Metacognitive Control for Test-Time Reasoning
PDF
cs.AI84Training-free metacognitive control for test-time reasoning: prune/repair/abstain under budget.test-time-compute, reasoning, abstention, search, inference, reliability
2603.28301LIBERO-Para: A Diagnostic Benchmark and Metrics for Paraphrase Robustness in VLA Models
PDF
cs.LG83Benchmark for paraphrase robustness in VLA robots; large drops under simple synonyms.robustness, paraphrase, VLA, robotics, benchmark, instruction-following
2603.28610ResAdapt: Adaptive Resolution for Efficient Multimodal Reasoning
PDF
cs.CV, cs.AI, cs.CL83Learns input-side adaptive resolution via bandits to cut visual tokens while keeping reasoningmultimodal, efficiency, adaptive compute, token budget, context length, bandits, MLLM
2603.28005Rethinking Atomic Decomposition for LLM Judges: A Prompt-Controlled Study of Reference-Grounded QA Evaluation
PDF
cs.CL82Prompt-controlled test questions whether atomic decomposition truly helps LLM judges in QA evalLLM-judges, evaluation, factuality, reference-grounding, prompting, methodology
2603.28605Unsafe2Safe: Controllable Image Anonymization for Downstream Utility
PDF
cs.CV, cs.CY, cs.LG82Automated privacy-risk detection + diffusion editing to anonymize images while preserving utility.privacy, data-sanitization, diffusion-editing, dataset-curation, VLM
2603.28696AdaptToken: Entropy-based Adaptive Token Selection for MLLM Long Video Understanding
PDF
cs.CV, cs.AI82Entropy-guided token budgeting for long-video MLLMs; principled stop/allocate mechanism.multimodal, long-context, video-understanding, efficiency, uncertainty, token-selection
2603.28488Courtroom-Style Multi-Agent Debate with Progressive RAG and Role-Switching for Controversial Claim Verification
PDF
cs.CL, cs.AI, cs.MA80Structured multi-agent debate + progressive RAG for controversial claim verification robustnessclaim-verification, RAG, multi-agent, debate, hallucinations, calibration
2603.28476With a Little Help From My Friends: Collective Manipulation in Risk-Controlling Recommender Systems
PDF
cs.IR, cs.LG, cs.SI80Analyzes coordinated user manipulation against risk-controlling recommenders with safety guarantees.recommenders, adversarial-manipulation, conformal-risk-control, safety, social-computing
2603.28622Trust-Aware Routing for Distributed Generative AI Inference at the Edge
PDF
cs.DC, cs.AI, cs.NI80Trust-aware routing for distributed generative inference; robustness to misbehaving edge peersagentic systems, distributed inference, trust, robustness, security, edge AI, systems
2603.28026When Choices Become Priors: Contrastive Decoding for Scientific Figure Multiple-Choice QA
PDF
cs.AI80Training-free contrastive decoding to reduce answer-choice priors; improves figure grounding.multimodal-eval, grounding, decoding, bias, scientific-qa, robustness
2603.28378Membership Inference Attacks against Large Audio Language Models
PDF
cs.SD, cs.AI79First systematic MIA study for audio LMs; shows confounds and proposes distribution-matched evalprivacy, membership-inference, audio, evaluation, dataset-shift, leakage
2603.28662AMIGO: Agentic Multi-Image Grounding Oracle Benchmark
PDF
cs.LG, cs.AI78AMIGO tests long-horizon multi-image grounding via constrained questioning under uncertaintymultimodal-agents, benchmark, interactive-eval, uncertainty, grounding, long-horizon

AI Paper Insight Brief

2026-04-01

0) 执行要点(先读这个)

  • 评估正在成为新的攻击面:多篇论文表明,如何评估(提示框架、温度、分解方式、选择题 vs 问答格式)往往主导被测得的“能力”,并掩盖扎根(grounding)失败——因此基准设计如今是一项一等安全议题。
  • 智能体安全失败取决于流水线与攻击面,而非模型本身:提示注入与权限滥用会随注入面智能体阶段剧烈变化;仅用结果导向的 ASR 不足以选择防御或架构。
  • “先验胜过像素”仍是多模态核心失败模式:VLM 往往把异常“归一化”为常识(CDH),在科学选择题中跟随选项先验,或在“有 MRI 可用”的脚手架提示下不看图也作答——说明扎根干预必须显式对抗先验主导。
  • 以验证为中心的设计正在成为统一的可靠性杠杆:从深度研究智能体(在综合/轨迹/推理处验证)到分阶段金丝雀与过程中心基准,验证正在被操作化为“仪表化 + 控制”,而不仅是事后打分。
  • 训练期与数据供应链风险正在加剧:数据集凝缩中的隐蔽后门(InkDrop)与广泛的 MLLM 攻击分类表明,多模态系统会继承编码器、融合与指令跟随各环节的脆弱性——且常可在黑盒设置下迁移。
  • 测试时算力正从“更多 token”转向“更好的控制”:元认知控制器(CoT2-Meta)与长视频 token/分辨率分配器(AdaptToken、ResAdapt)在固定预算下通过把算力分配到高价值步骤/帧而稳定增益。

2) 关键主题(聚类)

主题:智能体提示注入与权限滥用需要按阶段/攻击面定位

主题:由先验与提示脚手架驱动的多模态扎根失败

主题:评估方法学很脆弱(裁判、提示、温度、分解)

主题:验证中心的智能体与预算约束下的测试时控制

主题:跨模态与数据流水线的安全与隐私威胁

3) 技术综合

  • 阶段分解反复出现:kill-chain canaries 分解注入传播;MiroEval 将结果分解为综合/事实性/过程;CoT2-Meta 将推理分解为 expand/prune/repair/stop;PRCO 将训练分解为 Observer vs Solver 角色。
  • 不确定性/熵正在成为控制信号,贯穿训练与推理:ERPO 用 token 熵在“关键决策枢纽”门控 RL 更新;AdaptToken 用响应熵做组分配 + 早停;CoT2-Meta 用过程/结果评分决定动作。
  • “先验 vs 证据”以多种形式出现:常识驱动幻觉(成对反事实)、选项诱导先验(MCQA)、脚手架提示(临床 VLM)都表明即使有图像,文本上下文也可能占主导
  • 基准格式会实质性改变失败率:CDH-Bench 发现 MC 格式比二元 QA 更糟;同一模型的提示注入 ASR 会随攻击面在 0–100% 间变化;裁判温度会改变一致性与可解析性。
  • 验证正在被操作化为训练数据卫生与推理时扩展:Marco 在综合中做唯一性验证并用验证器引导测试时扩展;PROCLAIM 的渐进检索 + 司法小组;MiroEval 的智能体事实性验证器。
  • 防御评估必须匹配威胁模型:kill-chain 结果显示主动防御在攻击面不匹配下会灾难性失败;音频 MIA 表明隐私审计必须控制分布漂移;多模态攻击综述强调攻击者知识设定(黑盒占主导)。
  • 显式处理算力预算:CoT2-Meta 将所有调用计入预算 C;AdaptToken-Lite 通过早停将推理时间减半;ResAdapt 以编码器前像素预算为目标,在空间证据与时间证据间权衡。
  • 仪表化 + 可追溯性正从开发者扩展到终端用户:AgentTrace 提升用户理解与异常检测;GrantBox 记录出站请求/授权参数;NL/PL 分类法支持污点/切片决策。

4) Top 5 论文(含“为何现在”)

1) Kill-Chain Canaries: Stage-Level Tracking of Prompt Injection Across Attack Surfaces and Model Safety Tiers

  • 引入阶段级提示注入指标(EXPOSED/PERSISTED/RELAYED/EXECUTED),解释防御在哪里起作用。
  • 显示暴露是普遍的(100%);安全性取决于下游传播。
  • 展示攻击面依赖性:同一模型在不同注入面下 ASR 可为 0% 或 100%(如 DeepSeek 的 memory_poison vs tool_poison/propagation)。
  • 质疑点:每个单元样本量较小且载荷为合成;模型差异根因(如 Claude 的 write_memory 过滤)未被隔离。

2) CDH-Bench: A Commonsense-Driven Hallucination Benchmark for Evaluating Visual Fidelity in Vision-Language Models

  • 定义常识驱动幻觉,并用成对反事实设计隔离先验崩塌。
  • 报告系统性差距:平均 CFAD 16.39%(QA)/ 25.20%(MC);8 个模型中 7 个在反事实上退化。
  • 强调MC 格式会放大先验驱动错误,与许多真实产品 UI 相关。
  • 质疑点:合成图像与规模有限(300 对);未展示更广泛的真实异常覆盖。

3) Evaluating Privilege Usage of Agents on Real-World Tools

  • 提供 GrantBox:在容器中集成 10 个真实 MCP 服务器 / 122 个权限敏感工具并记录日志。
  • 在其设置下发现提示注入成功率极高:平均 ASR 90.55%(ReAct)79.05%(Plan-and-Execute)
  • 通过真实出站请求日志而非玩具工具,使权限滥用可测量。
  • 质疑点:当前评估聚焦无防御的“原生”智能体行为;环境搭建复杂度可能影响可复现性。

4) CoT2-Meta: Budgeted Metacognitive Control for Test-Time Reasoning

  • 免训练控制器,在固定预算下用融合的过程+结果评分分配到 expand/prune/repair/stop/abstain
  • 在 15 个基准、匹配预算下报告稳定增益并改善校准(如 ECE ~0.035)。
  • 给出具体失败分类(search-not-converged、evaluator misjudgment、over-pruning),便于调试。
  • 质疑点:依赖在线过程评估信号质量;手工设计的控制器/元状态可能不泛化。

5) The Scaffold Effect: How Prompt Framing Drives Apparent Multimodal Gains in Clinical VLM Evaluation

  • 显示仅提及 MRI 可用性就能驱动大部分“多模态”增益(置信度变化的 ~70–80%),即使图像几乎不提供信号。
  • 通过强消融(伪模态、替换图像)与专家轨迹审阅论证增益常非证据驱动
  • 表明对齐干预(MPO)可抑制 MRI 引用但会压塌性能,凸显修复根因的难度。
  • 质疑点:仅限两个队列与开源权重模型;脚手架估计来自最敏感的模型。

5) 实用下一步

  • 在你的智能体栈中采用阶段级注入遥测(金丝雀 token + EXPOSED/PERSISTED/RELAYED/EXECUTED 日志),并要求防御报告在哪个阶段阻断传播,而不仅是 ASR。
  • 上线前进行多攻击面提示注入评估(记忆投毒、工具输出投毒、转发传播、权限升级),将“攻击面不匹配”视为主要失败模式。
  • 增加权限使用审计:记录工具调用 + 授权参数(GrantBox 风格),并为对抗性提示下的“最小权限”违规建立回归测试。
  • 通过加入 CDH 风格的反事实 vs 常识成对样例与提示脚手架消融(如“模态可用”前导语、替换图像)来加固多模态扎根评估
  • 对科学/MCQA 产品,比较多模态 vs 纯文本 logits 来测试选项诱导先验;在视觉证据优势强时考虑 SCICON 风格减法,并衡量当先验正确时的受损案例率。
  • 稳定 LLM 裁判流水线:固定并报告温度;跨随机种子测解析错误与一致性;当完整性/部分支持是关键时考虑匹配的整体 rubric(并跟踪 token 成本)。
  • 为研究型智能体仪表化过程质量(MiroEval 风格):收集过程日志、计算过程指标,并与事实性/结果相关联,以发现“过程很差但报告看起来很好”的情况。
  • 将凝缩/合成数据集视为供应链产物:对数据集凝缩输出增加后门扫描与溯源控制;假设隐蔽触发器可不可感知(InkDrop)。
  • 对音频隐私审计,始终运行盲基线(元数据/文本/声学)与分布匹配切分,再得出基于记忆的泄露结论。
  • 预算化推理:若要扩展测试时算力,优先采用控制器(剪枝/修复/停止)与不确定性引导分配(基于熵的早停),而非均匀“更多采样”,并在准确率之外跟踪校准。

由逐篇论文分析生成;未进行外部浏览。