AI 论文日报（2026-04-01）

Published: April 01, 2026

English version: /paper-news/2026-04-01/

运行统计

候选论文: 223
入选论文: 30
已精读完成: 30
时间窗口 (UTC): 2026-03-30T00:00:00Z → 2026-03-31T00:00:00Z (arxiv_announce, expanded=0)

展开查看用于总结的论文列表

arXiv ID	标题 / 链接	分类	评分	入选理由	标签
`2603.28013`	Kill-Chain Canaries: Stage-Level Tracking of Prompt Injection Across Attack Surfaces and Model Safety Tiers PDF	cs.CR, cs.AI, cs.LG	95	Stage-level prompt-injection tracking w/ canaries across models, surfaces, and safety tiers	prompt-injection, agent-security, evaluation, canary-tokens, kill-chain, frontier-models
`2603.28063`	Reward Hacking as Equilibrium under Finite Evaluation PDF	cs.AI, cs.GT	95	Formal result: reward hacking emerges under finite evaluation; computable distortion index.	reward-hacking, principal-agent, evaluation, alignment-theory, RLHF, DPO
`2603.28650`	Information-Theoretic Limits of Safety Verification for Self-Improving Systems PDF	cs.LG, cs.AI, stat.ML	95	Formal impossibility bounds for safety gates in self-improving systems; high AI safety relevance	AI safety, self-improvement, verification, risk bounds, theory, TPR/FPR, impossibility
`2603.28166`	Evaluating Privilege Usage of Agents on Real-World Tools PDF	cs.CR, cs.AI	92	GrantBox sandbox evaluates real-world tool privilege usage—core risk for autonomous agents	agent-security, tool-use, privilege, sandbox, benchmark, real-world-tools
`2603.28345`	Crossing the NL/PL Divide: Information Flow Analysis Across the NL/PL Boundary in LLM-Integrated Code PDF	cs.SE, cs.AI	92	Bridges NL/PL boundary for info-flow/taint across LLM calls; key for LLM app security.	program-analysis, information-flow, taint-analysis, LLM-security, prompting, software-engineering
`2603.28551`	"What Did It Actually Do?": Understanding Risk Awareness and Traceability for Computer-Use Agents PDF	cs.CR, cs.ET, cs.HC, cs.MA	90	Empirical study + corpus on computer-use agent risk awareness and post-hoc auditability.	computer-use-agents, auditability, traceability, user-safety, agent-security, HCI
`2603.28204`	ERPO: Token-Level Entropy-Regulated Policy Optimization for Large Reasoning Models PDF	cs.LG, cs.AI	90	Token-level entropy regulation for RLVR/GRPO improves credit assignment in reasoning chains	LLM reasoning, RLVR, GRPO, post-training, credit assignment, entropy, optimization
`2603.28054`	Who Wrote the Book? Detecting and Attributing LLM Ghostwriters PDF	cs.CL	90	GhostWriteBench + robust OOD LLM authorship attribution; practical for misuse detection.	authorship-attribution, model-fingerprinting, misuse-detection, dataset, robustness, OOD
`2603.28407`	MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome PDF	cs.AI, cs.CL	89	MiroEval benchmarks multimodal deep-research agents on process + outcome with refreshable tasks	agents, evaluation, deep-research, multimodal, process-metrics, benchmark
`2603.28376`	Marco DeepResearch: Unlocking Efficient Deep Research Agents via Verification-Centric Design PDF	cs.CL, cs.AI	88	Verification-centric deep research agent design across synthesis/trajectories/test-time scaling.	deep-research-agents, verification, tool-use, long-horizon, test-time-scaling, RAG
`2603.27982`	CDH-Bench: A Commonsense-Driven Hallucination Benchmark for Evaluating Visual Fidelity in Vision-Language Models PDF	cs.CV, cs.AI, cs.CL	88	New benchmark for commonsense-driven hallucination when vision conflicts with priors	VLM, hallucination, evaluation, robustness, benchmarks, visual grounding, reliability
`2603.28569`	CirrusBench: Evaluating LLM-based Agents Beyond Correctness in Real-World Cloud Service Environments PDF	cs.LG, cs.AI, cs.IR, cs.PF	87	Real cloud-ticket agent benchmark measuring robustness and resolution efficiency beyond accuracy	agents, evaluation, real-world, customer-support, long-horizon, reliability
`2603.28387`	The Scaffold Effect: How Prompt Framing Drives Apparent Multimodal Gains in Clinical VLM Evaluation PDF	cs.AI, cs.LG	86	Shows prompt framing can fake multimodal clinical gains; important eval artifact warning.	evaluation, prompting, multimodal, VLM, clinical-AI, spurious-cues
`2603.28430`	IsoQuant: Hardware-Aligned SO(4) Isoclinic Rotations for LLM KV Cache Compression PDF	cs.LG, cs.CL	86	Hardware-aligned SO(4) rotations for low-bit KV-cache compression; practical LLM efficiency gain	LLM efficiency, KV cache, compression, quantization, inference, systems, long-context
`2603.28304`	The Necessity of Setting Temperature in LLM-as-a-Judge PDF	cs.CL	86	Shows temperature materially affects LLM-as-judge reliability; key for eval validity.	LLM-as-a-judge, evaluation, reproducibility, temperature, meta-eval, benchmarking
`2603.28618`	Seeing with You: Perception-Reasoning Coevolution for Multimodal Reasoning PDF	cs.AI	85	Dual-role RLVR separates perception vs reasoning credit; targets evidence extraction failures	multimodal, RLVR, credit assignment, VLM, reasoning, perception, reliability
`2603.27918`	Adversarial Attacks on Multimodal Large Language Models: A Comprehensive Survey PDF	cs.CR, cs.AI	84	Comprehensive survey of adversarial attacks on MLLMs with taxonomy across modalities/settings	multimodal, adversarial-attacks, survey, security, jailbreaks, threat-models
`2603.28092`	InkDrop: Invisible Backdoor Attacks Against Dataset Condensation PDF	cs.LG	84	Stealthy backdoor attack on dataset condensation; highlights new data-poisoning surface.	backdoors, data-poisoning, dataset-condensation, adversarial-ML, security
`2603.28135`	CoT2-Meta: Budgeted Metacognitive Control for Test-Time Reasoning PDF	cs.AI	84	Training-free metacognitive control for test-time reasoning: prune/repair/abstain under budget.	test-time-compute, reasoning, abstention, search, inference, reliability
`2603.28301`	LIBERO-Para: A Diagnostic Benchmark and Metrics for Paraphrase Robustness in VLA Models PDF	cs.LG	83	Benchmark for paraphrase robustness in VLA robots; large drops under simple synonyms.	robustness, paraphrase, VLA, robotics, benchmark, instruction-following
`2603.28610`	ResAdapt: Adaptive Resolution for Efficient Multimodal Reasoning PDF	cs.CV, cs.AI, cs.CL	83	Learns input-side adaptive resolution via bandits to cut visual tokens while keeping reasoning	multimodal, efficiency, adaptive compute, token budget, context length, bandits, MLLM
`2603.28005`	Rethinking Atomic Decomposition for LLM Judges: A Prompt-Controlled Study of Reference-Grounded QA Evaluation PDF	cs.CL	82	Prompt-controlled test questions whether atomic decomposition truly helps LLM judges in QA eval	LLM-judges, evaluation, factuality, reference-grounding, prompting, methodology
`2603.28605`	Unsafe2Safe: Controllable Image Anonymization for Downstream Utility PDF	cs.CV, cs.CY, cs.LG	82	Automated privacy-risk detection + diffusion editing to anonymize images while preserving utility.	privacy, data-sanitization, diffusion-editing, dataset-curation, VLM
`2603.28696`	AdaptToken: Entropy-based Adaptive Token Selection for MLLM Long Video Understanding PDF	cs.CV, cs.AI	82	Entropy-guided token budgeting for long-video MLLMs; principled stop/allocate mechanism.	multimodal, long-context, video-understanding, efficiency, uncertainty, token-selection
`2603.28488`	Courtroom-Style Multi-Agent Debate with Progressive RAG and Role-Switching for Controversial Claim Verification PDF	cs.CL, cs.AI, cs.MA	80	Structured multi-agent debate + progressive RAG for controversial claim verification robustness	claim-verification, RAG, multi-agent, debate, hallucinations, calibration
`2603.28476`	With a Little Help From My Friends: Collective Manipulation in Risk-Controlling Recommender Systems PDF	cs.IR, cs.LG, cs.SI	80	Analyzes coordinated user manipulation against risk-controlling recommenders with safety guarantees.	recommenders, adversarial-manipulation, conformal-risk-control, safety, social-computing
`2603.28622`	Trust-Aware Routing for Distributed Generative AI Inference at the Edge PDF	cs.DC, cs.AI, cs.NI	80	Trust-aware routing for distributed generative inference; robustness to misbehaving edge peers	agentic systems, distributed inference, trust, robustness, security, edge AI, systems
`2603.28026`	When Choices Become Priors: Contrastive Decoding for Scientific Figure Multiple-Choice QA PDF	cs.AI	80	Training-free contrastive decoding to reduce answer-choice priors; improves figure grounding.	multimodal-eval, grounding, decoding, bias, scientific-qa, robustness
`2603.28378`	Membership Inference Attacks against Large Audio Language Models PDF	cs.SD, cs.AI	79	First systematic MIA study for audio LMs; shows confounds and proposes distribution-matched eval	privacy, membership-inference, audio, evaluation, dataset-shift, leakage
`2603.28662`	AMIGO: Agentic Multi-Image Grounding Oracle Benchmark PDF	cs.LG, cs.AI	78	AMIGO tests long-horizon multi-image grounding via constrained questioning under uncertainty	multimodal-agents, benchmark, interactive-eval, uncertainty, grounding, long-horizon

AI Paper Insight Brief

2026-04-01

0) 执行要点（先读这个）

评估正在成为新的攻击面：多篇论文表明，如何评估（提示框架、温度、分解方式、选择题 vs 问答格式）往往主导被测得的“能力”，并掩盖扎根（grounding）失败——因此基准设计如今是一项一等安全议题。
智能体安全失败取决于流水线与攻击面，而非模型本身：提示注入与权限滥用会随注入面与智能体阶段剧烈变化；仅用结果导向的 ASR 不足以选择防御或架构。
“先验胜过像素”仍是多模态核心失败模式：VLM 往往把异常“归一化”为常识（CDH），在科学选择题中跟随选项先验，或在“有 MRI 可用”的脚手架提示下不看图也作答——说明扎根干预必须显式对抗先验主导。
以验证为中心的设计正在成为统一的可靠性杠杆：从深度研究智能体（在综合/轨迹/推理处验证）到分阶段金丝雀与过程中心基准，验证正在被操作化为“仪表化 + 控制”，而不仅是事后打分。
训练期与数据供应链风险正在加剧：数据集凝缩中的隐蔽后门（InkDrop）与广泛的 MLLM 攻击分类表明，多模态系统会继承编码器、融合与指令跟随各环节的脆弱性——且常可在黑盒设置下迁移。
测试时算力正从“更多 token”转向“更好的控制”：元认知控制器（CoT2-Meta）与长视频 token/分辨率分配器（AdaptToken、ResAdapt）在固定预算下通过把算力分配到高价值步骤/帧而稳定增益。

2) 关键主题（聚类）

主题：智能体提示注入与权限滥用需要按阶段/攻击面定位

重要性：智能体部署会在不同位置失败（写入记忆、转发、工具参数），并通过不同攻击面发生（工具输出 vs 记忆投毒）。没有阶段级仪表化，防御可能看似有效却在真实多攻击面流水线中失效。
代表论文：
常见方法：
- 用阶段级信号（EXPOSED→PERSISTED→RELAYED→EXECUTED）对流水线做仪表化，而不是只看单一 ASR。
- 覆盖多种注入面与真实工具栈（真实 MCP 服务器；NL/PL 占位符流）。
- 增加可追溯性/日志（传播记录器、远程请求记录器、面向用户的动作轨迹）。
开放问题 / 失败模式：
- 防御“成功”其实是攻击面不匹配（防御在假设通道外失效）。
- NL/PL 流中的二层可达性（Layer 2 reachability）难以分析（分类法可说“传播”，但代码级可达性仍可能失败）。
- 即便有日志，人类操作者仍缺乏事后可审计性与对持久化影响的理解。

主题：由先验与提示脚手架驱动的多模态扎根失败

重要性：模型可能在忽略视觉证据时仍显得很强——在对异常敏感的领域（临床、质检、取证）与科学图表解读中尤其危险。
代表论文：
常见方法：
- 构造反事实 vs 常识成对输入以隔离先验驱动的崩塌（CF-Acc vs CS-Acc；CFAD/CCR/RPD）。
- 使用推理时去偏（对比解码：减去纯文本 logits）。
- 做提示消融（MRI 前导语 vs 实际图像；替换/OOD 图像；伪模态前导语）。
开放问题 / 失败模式：
- 选择题格式会放大先验崩塌（CDH-Bench 显示 MC 下 CFAD 更大）。
- 去偏在文本先验正确时可能有害（SCICON 在视觉证据优势弱/为负时会伤害）。
- 提示脚手架会制造虚假的多模态增益；对齐微调（MPO）可能通过“压塌表现”而非修复扎根来抑制该现象。

主题：评估方法学很脆弱（裁判、提示、温度、分解）

重要性：若评估流水线不稳定，我们就无法信任进展宣称或安全回归——尤其当 LLM 裁判被用于训练信号与基准评测时。
代表论文：
常见方法：
- 提示受控对比（匹配 rubric 丰富度/token 预算；多个冻结提示变体）。
- 将温度视为因果处理（AIPW 因果估计；重复随机种子）。
- 将评估从结果扩展到过程 + 事实性验证（过程日志→过程指标；基于网页+附件的陈述级验证）。
开放问题 / 失败模式：
- 原子分解并非天然更好；整体 rubric 在不完整性检测上可能更准确且更便宜。
- 更高温度会增加解析错误与不一致性，尤其在 CoT 提示下。
- 过程评估需要访问轨迹，限制其在闭源系统中的适用性。

主题：验证中心的智能体与预算约束下的测试时控制

重要性：长时程智能体与推理系统会因错误传播而失败；预算内算力应花在验证/修复与证据获取上，而不只是更长的生成。
代表论文：
常见方法：
- 加入显式验证回路（综合中的唯一性检查；验证器智能体；discard-all 重置）。
- 用控制器在固定预算下扩展/剪枝/修复/停止/弃答（UCB 风格选择；过程+结果融合）。
- 用不确定性/熵信号分配算力（token 级熵用于早停；组级熵用于长视频）。
开放问题 / 失败模式：
- 重验证流水线可能极度耗 token（PROCLAIM 约 210.9K tokens/claim）。
- 控制器依赖“预言机”质量；误判会导致过度剪枝或浪费修复。
- 免训练不确定性启发式可能对不同领域/模型阈值敏感。

主题：跨模态与数据流水线的安全与隐私威胁

重要性：多模态系统扩大攻击面（编码器、融合、指令跟随），而新 ML 工作流（数据集凝缩）产生高度集中的供应链产物，极具攻击吸引力。
代表论文：
常见方法：
- 按攻击者目标（完整性、越狱、控制/注入、投毒/后门）与攻击者知识组织威胁。
- 展示隐蔽性 + 有效性权衡（InkDrop 用含 LPIPS + 标签/特征对齐的多目标损失）。
- 用混杂控制审计隐私（音频 MIA 的盲基线以检测数据集可分性；模态解耦）。
开放问题 / 失败模式：
- 文献偏斜：攻击研究以视觉为主（综述：65 篇中 58 篇实证涉及图像/视频）。
- MIA 可能被数据集伪影虚高（音频声学特征可近乎完美地区分训练/测试）。
- 长文本归因方法（TRACE）未在混淆/对抗性改写下测试。

3) 技术综合

阶段分解反复出现：kill-chain canaries 分解注入传播；MiroEval 将结果分解为综合/事实性/过程；CoT2-Meta 将推理分解为 expand/prune/repair/stop；PRCO 将训练分解为 Observer vs Solver 角色。
不确定性/熵正在成为控制信号，贯穿训练与推理：ERPO 用 token 熵在“关键决策枢纽”门控 RL 更新；AdaptToken 用响应熵做组分配 + 早停；CoT2-Meta 用过程/结果评分决定动作。
“先验 vs 证据”以多种形式出现：常识驱动幻觉（成对反事实）、选项诱导先验（MCQA）、脚手架提示（临床 VLM）都表明即使有图像，文本上下文也可能占主导。
基准格式会实质性改变失败率：CDH-Bench 发现 MC 格式比二元 QA 更糟；同一模型的提示注入 ASR 会随攻击面在 0–100% 间变化；裁判温度会改变一致性与可解析性。
验证正在被操作化为训练数据卫生与推理时扩展：Marco 在综合中做唯一性验证并用验证器引导测试时扩展；PROCLAIM 的渐进检索 + 司法小组；MiroEval 的智能体事实性验证器。
防御评估必须匹配威胁模型：kill-chain 结果显示主动防御在攻击面不匹配下会灾难性失败；音频 MIA 表明隐私审计必须控制分布漂移；多模态攻击综述强调攻击者知识设定（黑盒占主导）。
显式处理算力预算：CoT2-Meta 将所有调用计入预算 C；AdaptToken-Lite 通过早停将推理时间减半；ResAdapt 以编码器前像素预算为目标，在空间证据与时间证据间权衡。
仪表化 + 可追溯性正从开发者扩展到终端用户：AgentTrace 提升用户理解与异常检测；GrantBox 记录出站请求/授权参数；NL/PL 分类法支持污点/切片决策。

4) Top 5 论文（含“为何现在”）

1) Kill-Chain Canaries: Stage-Level Tracking of Prompt Injection Across Attack Surfaces and Model Safety Tiers

引入阶段级提示注入指标（EXPOSED/PERSISTED/RELAYED/EXECUTED），解释防御在哪里起作用。
显示暴露是普遍的（100%）；安全性取决于下游传播。
展示攻击面依赖性：同一模型在不同注入面下 ASR 可为 0% 或 100%（如 DeepSeek 的 memory_poison vs tool_poison/propagation）。
质疑点：每个单元样本量较小且载荷为合成；模型差异根因（如 Claude 的 write_memory 过滤）未被隔离。

2) CDH-Bench: A Commonsense-Driven Hallucination Benchmark for Evaluating Visual Fidelity in Vision-Language Models

定义常识驱动幻觉，并用成对反事实设计隔离先验崩塌。
报告系统性差距：平均 CFAD 16.39%（QA）/ 25.20%（MC）；8 个模型中 7 个在反事实上退化。
强调MC 格式会放大先验驱动错误，与许多真实产品 UI 相关。
质疑点：合成图像与规模有限（300 对）；未展示更广泛的真实异常覆盖。

3) Evaluating Privilege Usage of Agents on Real-World Tools

提供 GrantBox：在容器中集成 10 个真实 MCP 服务器 / 122 个权限敏感工具并记录日志。
在其设置下发现提示注入成功率极高：平均 ASR 90.55%（ReAct）与79.05%（Plan-and-Execute）。
通过真实出站请求日志而非玩具工具，使权限滥用可测量。
质疑点：当前评估聚焦无防御的“原生”智能体行为；环境搭建复杂度可能影响可复现性。

4) CoT2-Meta: Budgeted Metacognitive Control for Test-Time Reasoning

免训练控制器，在固定预算下用融合的过程+结果评分分配到 expand/prune/repair/stop/abstain。
在 15 个基准、匹配预算下报告稳定增益并改善校准（如 ECE ~0.035）。
给出具体失败分类（search-not-converged、evaluator misjudgment、over-pruning），便于调试。
质疑点：依赖在线过程评估信号质量；手工设计的控制器/元状态可能不泛化。

5) The Scaffold Effect: How Prompt Framing Drives Apparent Multimodal Gains in Clinical VLM Evaluation

显示仅提及 MRI 可用性就能驱动大部分“多模态”增益（置信度变化的 ~70–80%），即使图像几乎不提供信号。
通过强消融（伪模态、替换图像）与专家轨迹审阅论证增益常非证据驱动。
表明对齐干预（MPO）可抑制 MRI 引用但会压塌性能，凸显修复根因的难度。
质疑点：仅限两个队列与开源权重模型；脚手架估计来自最敏感的模型。

5) 实用下一步

在你的智能体栈中采用阶段级注入遥测（金丝雀 token + EXPOSED/PERSISTED/RELAYED/EXECUTED 日志），并要求防御报告在哪个阶段阻断传播，而不仅是 ASR。
上线前进行多攻击面提示注入评估（记忆投毒、工具输出投毒、转发传播、权限升级），将“攻击面不匹配”视为主要失败模式。
增加权限使用审计：记录工具调用 + 授权参数（GrantBox 风格），并为对抗性提示下的“最小权限”违规建立回归测试。
通过加入 CDH 风格的反事实 vs 常识成对样例与提示脚手架消融（如“模态可用”前导语、替换图像）来加固多模态扎根评估。
对科学/MCQA 产品，比较多模态 vs 纯文本 logits 来测试选项诱导先验；在视觉证据优势强时考虑 SCICON 风格减法，并衡量当先验正确时的受损案例率。
稳定 LLM 裁判流水线：固定并报告温度；跨随机种子测解析错误与一致性；当完整性/部分支持是关键时考虑匹配的整体 rubric（并跟踪 token 成本）。
为研究型智能体仪表化过程质量（MiroEval 风格）：收集过程日志、计算过程指标，并与事实性/结果相关联，以发现“过程很差但报告看起来很好”的情况。
将凝缩/合成数据集视为供应链产物：对数据集凝缩输出增加后门扫描与溯源控制；假设隐蔽触发器可不可感知（InkDrop）。
对音频隐私审计，始终运行盲基线（元数据/文本/声学）与分布匹配切分，再得出基于记忆的泄露结论。
预算化推理：若要扩展测试时算力，优先采用控制器（剪枝/修复/停止）与不确定性引导分配（基于熵的早停），而非均匀“更多采样”，并在准确率之外跟踪校准。

由逐篇论文分析生成；未进行外部浏览。

Di Tang

AI Paper Insight Brief

2026-04-01

0) 执行要点（先读这个）

2) 关键主题（聚类）

主题：智能体提示注入与权限滥用需要按阶段/攻击面定位

主题：由先验与提示脚手架驱动的多模态扎根失败

主题：评估方法学很脆弱（裁判、提示、温度、分解）

主题：验证中心的智能体与预算约束下的测试时控制

主题：跨模态与数据流水线的安全与隐私威胁

3) 技术综合

4) Top 5 论文（含“为何现在”）

5) 实用下一步