AI Paper Daily (2026-04-08)

Published:

English version: /paper-news/2026-04-08/

Run Statistics

  • Candidate papers: 195
  • Selected papers: 30
  • Close reads completed: 30
  • Time window (UTC): 2026-04-06T00:00:00Z → 2026-04-07T00:00:00Z (arxiv_announce, expanded=0)
Papers used for this summary (each entry links to the arXiv PDF):
arXiv ID | Title | Categories | Score | Selection reason | Tags
2604.04759 | Your Agent, Their Asset: A Real-World Safety Analysis of OpenClaw | cs.CR, cs.AI, cs.CL | 96 | First real-world safety eval of widely deployed agent w/ live attacks + CIK taxonomy. | agent-safety, real-world-eval, tool-access, attack-scenarios, taxonomy, privilege-risk
2604.04426 | ShieldNet: Network-Level Guardrails against Emerging Supply-Chain Injections in Agentic Systems | cs.AI | 94 | Supply-chain injection benchmark (10k malicious MCP tools) + network guardrails; big gap. | agent-security, supply-chain, MCP, benchmark, prompt-injection, guardrails, MITRE-ATT&CK
2604.04842 | Do No Harm: Exposing Hidden Vulnerabilities of LLMs via Persona-based Client Simulation Attack in Psychological Counseling | cs.CL | 93 | Persona-driven counseling red-teaming exposes maladaptive validation risks in multi-turn therapy chats. | LLM-safety, red-teaming, mental-health, persona-attack, dialogue, high-stakes
2604.04522 | HDP: A Lightweight Cryptographic Protocol for Human Delegation Provenance in Agentic AI Systems | cs.CR, cs.MA | 93 | Cryptographic provenance for human authorization across agent delegation chains; closes accountability gap. | agentic-systems, security, authorization, provenance, cryptography, multi-agent
2604.04565 | PassiveQA: A Three-Action Framework for Epistemically Calibrated Question Answering via Supervised Finetuning | cs.CL, cs.AI | 92 | Trains Answer/Ask/Abstain for epistemic calibration; directly targets overconfident QA/RAG failures. | epistemic-calibration, abstention, clarification, RAG, hallucination, reliability, SFT
2604.04561 | Mapping the Exploitation Surface: A 10,000-Trial Taxonomy of What Makes LLM Agents Exploit Vulnerabilities | cs.CR, cs.AI, cs.CL | 91 | 10k-trial taxonomy of prompt features that trigger agent vulnerability exploitation in sandboxes. | agent-security, vulnerability-exploitation, system-prompts, taxonomy, evaluation, docker
2604.04743 | Hallucination Basins: A Dynamic Framework for Understanding and Controlling LLM Hallucinations | cs.CL, cs.AI, eess.SY | 91 | Dynamical-systems view of hallucinations + geometry-aware steering to reduce hallucination without retraining. | hallucinations, reliability, interpretability, steering, latent-dynamics, evaluation
2604.04443 | DeonticBench: A Benchmark for Reasoning over Rules | cs.CL | 91 | 6.2k-task benchmark for rule/deontic reasoning in legal/policy domains; high-stakes long-context eval. | benchmark, evaluation, deontic-reasoning, law, policy, long-context
2604.04385 | How Alignment Routes: Localizing, Scaling, and Controlling Policy Circuits in Language Models | cs.CL, cs.AI, cs.LG | 90 | Mechanistic localization of alignment refusal routing circuits; controllable policy strength. | interpretability, alignment, refusal, circuits, mechanistic, interchange-interventions
2604.04323 | How Well Do Agentic Skills Work in the Wild: Benchmarking LLM Skill Usage in Realistic Settings | cs.CL | 90 | Realistic benchmark of agent skill retrieval/selection over 34k skills; strong signal for agent robustness. | agents, tool-use, skills, retrieval, benchmark, robustness, evaluation
2604.04738 | Fine-Tuning Integrity for Modern Neural Networks: Structured Drift Proofs via Norm, Rank, and Sparsity Certificates | cs.CR, cs.LG | 89 | Cryptographic proofs to certify fine-tune drift bounds; mitigates backdoors/safety regressions. | model-security, fine-tuning, integrity, zero-knowledge, backdoors, provenance, certification
2604.04488 | A Patch-based Cross-view Regularized Framework for Backdoor Defense in Multimodal Large Language Models | cs.CV, cs.LG | 89 | Backdoor defense for multimodal LLMs targeting low-poisoning triggers while preserving benign generation. | security, backdoors, multimodal-LLM, data-poisoning, defense, robustness
2604.04410 | Relative Density Ratio Optimization for Stable and Statistically Consistent Model Alignment | cs.LG, cs.AI, cs.CL, stat.ML | 89 | Statistically consistent alignment via relative density-ratio optimization; addresses instability of DDRO. | alignment, preference-learning, DPO, statistical-consistency, optimization
2604.04757 | Undetectable Conversations Between AI Agents via Pseudorandom Noise-Resilient Key Exchange | cs.CR, cs.AI, cs.LG | 88 | Shows covert agent-to-agent steganographic conversations indistinguishable to auditors; major risk. | steganography, agent-communication, auditing, watermarking, security, covert-channels
2604.04325 | Benchmarking Multi-turn Medical Diagnosis: Hold, Lure, and Self-Correction | cs.CL | 88 | Multi-turn medical diagnosis benchmark reveals premature commitment and self-correction issues in LLMs. | evaluation, multi-turn, medical, decision-making, self-correction, reliability, benchmarks
2604.04712 | Hardware-Level Governance of AI Compute: A Feasibility Taxonomy for Regulatory Compliance and Treaty Verification | cs.CR, cs.CY | 86 | Engineering-grounded taxonomy of hardware compute governance mechanisms + feasibility/adversaries. | AI-governance, compute, hardware-attestation, monitoring, treaty-verification, compliance
2604.04876 | Incompleteness of AI Safety Verification via Kolmogorov Complexity | cs.AI | 86 | Formal incompleteness limit for safety/policy verification via Kolmogorov complexity framing. | formal-verification, AI-safety-theory, limits, Kolmogorov-complexity, policy-compliance
2604.04461 | DP-OPD: Differentially Private On-Policy Distillation for Language Models | cs.LG, cs.AI, cs.CL | 86 | Differentially private on-policy distillation for LMs; targets privacy–utility issues in long rollouts. | privacy, differential-privacy, distillation, language-models, deployment
2604.04930 | Early Stopping for Large Reasoning Models via Confidence Dynamics | cs.CL, cs.AI, cs.LG | 86 | Early-stopping for long CoT via confidence dynamics; cuts compute and mitigates overthinking regressions. | reasoning, chain-of-thought, efficiency, confidence, inference, calibration
2604.04532 | Multilingual Prompt Localization for Agent-as-a-Judge: Language and Backbone Sensitivity in Requirement-Level Evaluation | cs.CL, cs.AI | 85 | Shows agent-judge eval is language-sensitive; prompt localization can flip model rankings. | evaluation, agent-as-judge, multilingual, benchmarking, reliability, measurement
2604.04399 | GUIDE: Interpretable GUI Agent Evaluation via Hierarchical Diagnosis | cs.AI | 84 | Interpretable GUI-agent eval via hierarchical diagnosis; improves long-horizon failure analysis. | agent-evaluation, GUI-agents, diagnostics, benchmarks, trajectories, interpretability
2604.04733 | Discovering Failure Modes in Vision-Language Models using RL | cs.CV, cs.AI | 84 | RL framework to automatically discover VLM failure modes/blind spots without manual curation. | evaluation, red-teaming, vision-language, robustness, reinforcement-learning
2604.04855 | The Role of Generator Access in Autoregressive Post-Training | cs.LG | 84 | Shows generator interface (prefix control/logits) can yield exponential gains in KL-regularized post-training. | post-training, RLHF, DPO, generator-access, theory, sample-efficiency, alignment
2604.04328 | Soft Tournament Equilibrium | cs.AI, cs.LG, cs.MA | 83 | Differentiable set-valued tournament solution for non-transitive agent comparisons; avoids unstable rankings. | agent-evaluation, tournaments, non-transitivity, ranking, multi-agent, metrics
2604.04917 | Vero: An Open RL Recipe for General Visual Reasoning | cs.CV, cs.AI, cs.CL | 83 | Open RL recipe + 600K multi-task reward data for visual reasoning; strong reproducible VLM progress. | VLM, reinforcement-learning, open-models, dataset, visual-reasoning, post-training
2604.04872 | Synthetic Sandbox for Training Machine Learning Engineering Agents | cs.CL, cs.LG | 82 | Synthetic sandbox enabling on-policy RL for ML-engineering agents by shrinking verification cost. | agents, RL, sandbox, MLE, training-framework, evaluation, scaling
2604.04767 | Cog-DRIFT: Exploration on Adaptively Reformulated Instances Enables Learning from Hard Reasoning Problems | cs.LG, cs.AI, cs.CL | 82 | Task reformulation to learn from too-hard reasoning problems in RLVR; bootstraps via easier variants. | reasoning, RLVR, curriculum, task-reformulation, training, LLMs
2604.04898 | QED-Nano: Teaching a Tiny Model to Prove Hard Theorems | cs.AI, cs.CL, cs.LG | 82 | Post-trains an open 4B model for Olympiad-level proofs; useful for studying small-model reasoning. | reasoning, math, small-models, post-training, proofs, distillation
2604.04902 | Are Latent Reasoning Models Easily Interpretable? | cs.LG | 82 | Finds latent reasoning tokens often unused; challenges interpretability/monitoring claims of LRMs. | interpretability, reasoning, latent-reasoning, monitoring, evaluation, mechanistic
2604.04847 | Full-Duplex-Bench-v3: Benchmarking Tool Use for Full-Duplex Voice Agents Under Real-World Disfluency | eess.AS, cs.CL | 81 | Real-audio full-duplex voice agent tool-use benchmark with disfluencies + latency/turn-taking. | voice-agents, tool-use, benchmark, speech, disfluency, evaluation, latency

AI Paper Insights Briefing

2026-04-08

0) Executive Summary (read this first)

  • "Agent skills" do not transfer smoothly from curated benchmarks to real deployments: once an agent must retrieve, select, and adapt from a pool of 34k skills, gains often collapse to near the no-skill baseline; iterative agentic hybrid search plus query-specific refinement can recover meaningful performance (e.g., +7.8 percentage points for Claude Opus 4.6 on Terminal-Bench 2.0).
  • Multi-turn interaction is itself a safety hazard in high-stakes domains: in medical diagnosis, models frequently commit prematurely; simply holding the question until the final turn largely recovers single-turn accuracy, while "salient evidence" (such as lab values) can act as a lure that triggers premature, and often wrong, answers.
  • Safety evaluation is shifting from "what the tool says" to "what the tool does at runtime": network-level monitoring (MITM plus decrypted traffic event traces) can detect MCP supply-chain injections with very high reported F1 and very low FPR; persistent agent state (memory/identity/skills) is a significant real-world attack surface that survives across sessions.
  • Conversation logs alone are not a sufficient control: cryptographic results show agents can embed undetectable covert communication inside honest-looking dialogue; key exchange is possible even under weak entropy assumptions (via a new primitive), weakening passive auditing as a safety strategy.
  • Evaluation itself is becoming a first-class failure mode: GUI-agent evaluation improves substantially with hierarchical diagnosis, but "agent-as-a-judge" results can flip model rankings depending on the judge language, with low agreement across backbones, implying benchmark conclusions may not generalize across regions and languages.
  • Alignment and integrity are becoming more mechanistic and more cryptographic: aligned behavior can be localized to sparse routing circuits (gate→amplifiers) and bypassed via encoding schemes, while fine-tuning integrity can be certified with succinct ZK proofs over structured drift (norm/rank/sparsity), enabling new governance and audit workflows.

2) Key Themes (clusters)

  • Theme: Skills and tool use in the wild (retrieval, selection, adaptation)
  • Theme: Multi-turn safety failures (premature commitment, lures, persona attacks)
  • Theme: Runtime agent security (supply chain, persistence, exploitation triggers)
  • Theme: Auditing, provenance, and cryptographic integrity for agents and models
  • Theme: Evaluation reliability and interpretability tooling (judges, diagnosis, rankings)

3) Technical Synthesis

  • Benchmark realism upgrades follow a common pattern: remove oracle access (forced skill loading; all evidence in a single turn; short trajectories) and measure how performance degrades under retrieval noise, incremental evidence, long horizons, and runtime constraints.
  • Several papers converge on a two-stage loop: (i) search/retrieve/segment (skill retrieval; trajectory segmentation; RL-trained questioner probing), then (ii) refine/diagnose/steer (query-specific skill refinement; sub-task diagnosis; latent steering).
  • "Commitment control" appears across domains: Q-Last reduces premature diagnosis in medicine; voice agents struggle with self-correction rollbacks; early stopping for reasoning models uses confidence dynamics to avoid overthinking (see the sketch after this list).
  • Safety defenses are moving from semantic checks to telemetry-based signals: ShieldNet's decrypted event traces and OpenClaw's external action verification both reflect a shift toward execution evidence.
  • Multiple works stress that surface artifacts are insufficient: tool schemas (MCP) do not reveal injection behavior; conversation logs do not block covert channels; "detecting" a representation does not guarantee aligned behavior (alignment routing circuits).
  • The split between parametric and semantic guarantees is widening: fine-tuning integrity proofs can certify norm/rank/sparsity drift, but small drift can still produce large behavioral change (explicitly acknowledged in the paper).
  • Evaluation methodology itself is under attack: the multilingual agent-as-a-judge study shows judge language/backbone interactions can flip rankings; STE argues rankings are brittle under cycles and proposes a set-valued core.
  • RL is being used both to build capability (Vero's open visual-reasoning RL; QED-Nano's proof RL; SandMLE's on-policy RL via micro-sandboxes; Cog-DRIFT's reformulation curriculum) and to find failures (an RL questioner for VLM failure modes).
  • Many methods rely on strong auxiliary models (verifiers/judges/graders) and therefore inherit their biases (Qwen3-32B for sharding medical cases into turns; VLM verifiers; counseling judges; proof graders).
  • A recurring practical trade-off is robustness vs. cost (compute for query-specific refinement; roughly 2× training cost for the cross-view backdoor defense; teacher-query overhead in DP-OPD; intrusiveness of network MITM; long-token reasoning vs. early stopping).
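
To make the "commitment control" pattern concrete, here is a minimal sketch of confidence-dynamics early stopping for long chain-of-thought decoding. It illustrates the general idea only, not the method of the early-stopping paper; `generate_segment` and `answer_confidence` are hypothetical hooks into whatever decoder and confidence probe a deployment actually uses.

```python
from typing import Callable, List

def should_stop(conf_history: List[float],
                threshold: float = 0.9,
                patience: int = 3,
                min_delta: float = 0.01) -> bool:
    """Stop once confidence in the current answer is high and has plateaued."""
    if len(conf_history) < patience:
        return False
    recent = conf_history[-patience:]
    plateaued = max(recent) - min(recent) < min_delta
    return recent[-1] >= threshold and plateaued

def generate_with_early_stop(generate_segment: Callable[[str], str],
                             answer_confidence: Callable[[str], float],
                             prompt: str,
                             max_segments: int = 32) -> str:
    """Interleave reasoning segments with a confidence probe (hypothetical callables)."""
    text, confs = prompt, []
    for _ in range(max_segments):
        text += generate_segment(text)        # one more chunk of chain-of-thought
        confs.append(answer_confidence(text))  # probe confidence in the current answer
        if should_stop(confs):
            break
    return text
```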

4) Top 5 Papers (with "why now")

1) ShieldNet: Network-Level Guardrails against Emerging Supply-Chain Injections in Agentic Systems

  • Introduces SC-Inject-Bench: code-level tool injections validated by network traces (PCAPs), built around a realistic MCP supply-chain threat model.
  • Demonstrates a practical guardrail: MITM plus decrypted HTTP(S) plus structured event traces, with a lightweight post-trained detector (Qwen3-0.6B) for streaming detection (a minimal sketch follows this summary).
  • Reports strong detection results (e.g., PCAP-level F1 = 0.995, FPR = 0.008); ablations show decryption is critical.
  • Caveats / limitations: focuses on network-visible attacks; operational constraints of MITM/decryption and QUIC blocking may limit deployability.
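
As a rough illustration of this telemetry-based approach (not ShieldNet's actual schema or detector), a network guardrail of this kind can be thought of as serializing each decrypted tool-call event into a structured trace and scoring it with a lightweight classifier in a streaming loop. All field names and the `score_fn` placeholder below are assumptions.

```python
from dataclasses import dataclass, asdict
from typing import Callable, Iterable, List
import json

@dataclass
class ToolCallEvent:
    """One decrypted network event attributed to an agent tool call (illustrative fields)."""
    session_id: str
    tool_name: str
    method: str          # e.g. "POST"
    host: str
    path: str
    request_body: str    # decrypted payload (possibly truncated)
    response_snippet: str

def stream_detect(events: Iterable[ToolCallEvent],
                  score_fn: Callable[[str], float],
                  threshold: float = 0.5) -> List[ToolCallEvent]:
    """Flag events whose serialized trace scores above a threshold.

    score_fn stands in for a lightweight post-trained detector
    (e.g., a small LM scoring injection likelihood from the trace text).
    """
    flagged = []
    for ev in events:
        trace = json.dumps(asdict(ev), ensure_ascii=False)
        if score_fn(trace) >= threshold:
            flagged.append(ev)
    return flagged
```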

2) Your Agent, Their Asset: A Real-World Safety Analysis of OpenClaw

  • Treats persistent agent state as a first-class safety object via the CIK taxonomy (Capability/Identity/Knowledge) and tests cross-session poisoning.
  • Shows large ASR increases under single-dimension poisoning (per the abstract: 24.6% baseline → 64–74% after poisoning) and highlights executable capability payloads as especially dangerous.
  • Evaluates defenses and surfaces an evolution-vs-safety trade-off (file protection blocks the attacks but also blocks legitimate updates).
  • Caveats / limitations: a single platform (OpenClaw) and 12 scenarios; cross-dimension chained attacks are not studied.

3) How Alignment Routes: Localizing, Scaling, and Controlling Policy Circuits in Language Models

  • Localizes aligned behavior to a sparse gate→amplifier routing structure across 9 models, with causal tests (interchange interventions, knockout cascades).
  • Enables continuous control via detection-layer modulation (dose-response) and documents a concrete bypass: cipher-style encodings collapse the gate contribution (e.g., a 78% collapse on Phi-4-mini). A minimal steering sketch follows this summary.
  • Provides a mechanistic explanation for "detects but does not refuse" behavior and identifies targets for auditing and defense.
  • Caveats / limitations: scope is limited to 2–32B models and political/safety refusal domains; MLP routing contributions remain unexplained.
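
Below is a minimal, hypothetical sketch of dose-response control over a localized direction in the residual stream, in the spirit of the gate modulation described above. The layer index, the `gate_direction` vector, and the additive rescaling are illustrative assumptions, not the paper's exact intervention.

```python
import torch

def scale_direction_hook(direction: torch.Tensor, alpha: float):
    """Rescale the residual-stream component along one direction.

    `direction` stands in for a localized "gate" feature; `alpha` is the dose:
    alpha=1.0 leaves the model unchanged, alpha=0.0 ablates the component,
    alpha>1.0 amplifies it.
    """
    d = direction / direction.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        dd = d.to(hidden.dtype).to(hidden.device)
        coeff = hidden @ dd                                  # [batch, seq]
        hidden = hidden + (alpha - 1.0) * coeff.unsqueeze(-1) * dd
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

    return hook

# Hypothetical usage on a HuggingFace-style decoder (layer path is an assumption):
# handle = model.model.layers[layer_idx].register_forward_hook(
#     scale_direction_hook(gate_direction, alpha=0.5))
# ... run generation, then handle.remove()
```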

4) Benchmarking Multi-turn Medical Diagnosis: Hold, Lure, and Self-Correction

  • Releases MINT (1,035 cases), converting multiple medical QA datasets into controlled multi-turn shards with information-preservation checks.
  • Finds widespread premature commitment (55.3% of initial answers occur within the first two turns), and that holding the question until the last turn largely recovers single-turn accuracy (a simple metric sketch follows this summary).
  • Identifies lab results as lures that trigger immediate answering; proposes evidence scheduling as a mitigation.
  • Caveats / limitations: converted from case vignettes (not natural dialogue) and focused on multiple-choice questions; the Q-Last protocol may be unrealistic.
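
For readers who want to track the same symptoms in their own multi-turn evaluations, here is a small sketch of early-commitment and flip-dynamics (F2T/T2F) metrics. The record schema is a hypothetical stand-in, not MINT's format.

```python
from typing import Dict, List

def turn_metrics(records: List[Dict]) -> Dict[str, float]:
    """Summarize premature commitment and answer-flip dynamics.

    Each record is assumed to look like:
      {"first_commit_turn": 2,          # turn at which the model first committed to a diagnosis
       "num_turns": 6,
       "first_answer_correct": False,
       "final_answer_correct": True}
    """
    n = len(records)
    early = sum(r["first_commit_turn"] <= 2 for r in records)
    f2t = sum((not r["first_answer_correct"]) and r["final_answer_correct"]
              for r in records)          # wrong early, corrected later (False-to-True)
    t2f = sum(r["first_answer_correct"] and (not r["final_answer_correct"])
              for r in records)          # right early, overturned later (True-to-False)
    return {"early_commit_rate": early / n,
            "F2T_rate": f2t / n,
            "T2F_rate": t2f / n}
```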

5) Undetectable Conversations Between AI Agents via Pseudorandom Noise-Resilient Key Exchange

  • Formalizes and proves that passive conversation auditing can fail: agents can embed covert communication indistinguishable from honest dialogue.
  • Introduces bundle sampling and a new primitive, PNR-KE, with constructions and strong limitations (e.g., attacks on non-interactive schemes).
  • Reframes "log everything" as insufficient, in general, to prevent covert coordination.
  • Caveats / limitations: largely theoretical; relies on cryptographic assumptions (e.g., LSPN) and on whether real LLM message distributions satisfy the entropy assumptions.

5) Practical Next Steps

  • For skill-using agent platforms: benchmark with retrieval from large, noisy pools (rather than forced skill loading), make agentic hybrid search plus query-specific refinement a standard baseline, and track both Recall@k and end-to-end pass rate (a Recall@k sketch follows this list).
  • For multi-turn high-stakes assistants (medicine/counseling): implement commitment-aware protocols (explicitly "hold" when evidence is insufficient; defer salient lures such as lab results), and measure early-commitment error rates and flip dynamics (F2T/T2F).
  • For the MCP/tool ecosystem: deploy runtime network telemetry (decrypted HTTP(S) event traces) where feasible, evaluate against code-level injection benchmarks, and combine with host telemetry to cover non-network attacks.
  • For personal/local agents: treat persistent state as a security boundary by adding write protection, code signing for skills, and explicit approval for changes to executable capabilities; measure the evolution-vs-safety trade-off explicitly.
  • For monitoring/covert-channel risk: do not rely on conversation logs alone; if covert coordination is in scope for the threat model, consider active-warden-style interventions, randomized checks, or protocol-level constraints.
  • For evaluation pipelines: report the judge language and backbone; for multilingual deployments, run at least two judge backbones and quantify their agreement (an agreement sketch follows this list); when comparisons contain cycles, consider reporting tournament cores (Top Cycle / Uncovered Set).
  • For fine-tuning governance: if integrity guarantees are needed, pilot structured drift certificates (norm/rank/sparsity; a drift-statistics sketch follows this list) alongside behavioral audits, since parametric constraints do not imply semantic safety.
  • For hallucination reduction and calibration: consider training-time routing (Answer/Ask/Abstain) and/or inference-time latent steering and early stopping, but evaluate separately on task types known to behave differently (factual QA vs. open-ended generation vs. misconception-heavy questions).
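
A minimal sketch of the retrieval-vs-outcome tracking suggested above for skill-using agents; `retrieved`, `relevant`, and `results` are hypothetical data structures, not any benchmark's API.

```python
from typing import Dict, List, Sequence

def recall_at_k(retrieved: Dict[str, List[str]],
                relevant: Dict[str, Sequence[str]],
                k: int = 10) -> float:
    """Fraction of tasks whose required skills all appear in the top-k retrieved list.

    retrieved: task_id -> ranked skill ids returned by the retriever
    relevant:  task_id -> skill ids the task actually needs
    """
    hits = 0
    for task_id, needed in relevant.items():
        topk = set(retrieved.get(task_id, [])[:k])
        hits += all(skill in topk for skill in needed)
    return hits / max(len(relevant), 1)

def end_to_end_pass_rate(results: Dict[str, bool]) -> float:
    """results: task_id -> whether the agent's final run passed the task checker."""
    return sum(results.values()) / max(len(results), 1)
```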
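
For judge-agreement checks across backbones, a standard chance-corrected agreement statistic such as Cohen's kappa is one simple option; this sketch assumes both judges label the same items with categorical verdicts.

```python
from collections import Counter
from typing import Sequence

def cohens_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Chance-corrected agreement between two judge backbones on the same items."""
    assert len(labels_a) == len(labels_b) and len(labels_a) > 0
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[l] * counts_b[l]
                   for l in set(counts_a) | set(counts_b)) / (n * n)
    return (observed - expected) / (1 - expected) if expected < 1 else 1.0
```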
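
For piloting structured drift checks, the sketch below computes per-tensor norm/rank/sparsity statistics between two checkpoints (assuming PyTorch state_dict-style inputs); it reproduces only the statistics, not the succinct cryptographic certificate described in the paper.

```python
import torch

def drift_summary(before: dict, after: dict,
                  sparsity_tol: float = 1e-6) -> dict:
    """Per-tensor structured-drift statistics between two checkpoints.

    before/after: name -> weight tensor (e.g., from model.state_dict()).
    """
    report = {}
    for name, w0 in before.items():
        delta = after[name].float() - w0.float()
        stats = {"fro_norm": delta.norm().item(),
                 "sparsity": (delta.abs() <= sparsity_tol).float().mean().item()}
        if delta.dim() == 2:
            # numerical rank of the update, e.g. to check a LoRA-style low-rank claim
            stats["num_rank"] = int(torch.linalg.matrix_rank(delta).item())
        report[name] = stats
    return report
```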

Generated from per-paper analyses; no external browsing was performed.