AI Paper Daily (2026-04-08)

Published:

English version: /paper-news/2026-04-08/

Run Statistics

  • Candidate papers: 195
  • Selected papers: 30
  • Close reads completed: 30
  • Time window (UTC): 2026-04-06T00:00:00Z → 2026-04-07T00:00:00Z (arxiv_announce, expanded=0)
Papers used for this summary (each entry links to the arXiv PDF):
arXiv ID | Title | Categories | Score | Selection reason | Tags
2604.04759 | Your Agent, Their Asset: A Real-World Safety Analysis of OpenClaw | cs.CR, cs.AI, cs.CL | 96 | First real-world safety eval of widely deployed agent w/ live attacks + CIK taxonomy. | agent-safety, real-world-eval, tool-access, attack-scenarios, taxonomy, privilege-risk
2604.04426 | ShieldNet: Network-Level Guardrails against Emerging Supply-Chain Injections in Agentic Systems | cs.AI | 94 | Supply-chain injection benchmark (10k malicious MCP tools) + network guardrails; big gap. | agent-security, supply-chain, MCP, benchmark, prompt-injection, guardrails, MITRE-ATT&CK
2604.04842 | Do No Harm: Exposing Hidden Vulnerabilities of LLMs via Persona-based Client Simulation Attack in Psychological Counseling | cs.CL | 93 | Persona-driven counseling red-teaming exposes maladaptive validation risks in multi-turn therapy chats. | LLM-safety, red-teaming, mental-health, persona-attack, dialogue, high-stakes
2604.04522 | HDP: A Lightweight Cryptographic Protocol for Human Delegation Provenance in Agentic AI Systems | cs.CR, cs.MA | 93 | Cryptographic provenance for human authorization across agent delegation chains; closes accountability gap. | agentic-systems, security, authorization, provenance, cryptography, multi-agent
2604.04565 | PassiveQA: A Three-Action Framework for Epistemically Calibrated Question Answering via Supervised Finetuning | cs.CL, cs.AI | 92 | Trains Answer/Ask/Abstain for epistemic calibration; directly targets overconfident QA/RAG failures. | epistemic-calibration, abstention, clarification, RAG, hallucination, reliability, SFT
2604.04561 | Mapping the Exploitation Surface: A 10,000-Trial Taxonomy of What Makes LLM Agents Exploit Vulnerabilities | cs.CR, cs.AI, cs.CL | 91 | 10k-trial taxonomy of prompt features that trigger agent vulnerability exploitation in sandboxes. | agent-security, vulnerability-exploitation, system-prompts, taxonomy, evaluation, docker
2604.04743 | Hallucination Basins: A Dynamic Framework for Understanding and Controlling LLM Hallucinations | cs.CL, cs.AI, eess.SY | 91 | Dynamical-systems view of hallucinations + geometry-aware steering to reduce hallucination without retraining. | hallucinations, reliability, interpretability, steering, latent-dynamics, evaluation
2604.04443 | DeonticBench: A Benchmark for Reasoning over Rules | cs.CL | 91 | 6.2k-task benchmark for rule/deontic reasoning in legal/policy domains; high-stakes long-context eval. | benchmark, evaluation, deontic-reasoning, law, policy, long-context
2604.04385 | How Alignment Routes: Localizing, Scaling, and Controlling Policy Circuits in Language Models | cs.CL, cs.AI, cs.LG | 90 | Mechanistic localization of alignment refusal routing circuits; controllable policy strength. | interpretability, alignment, refusal, circuits, mechanistic, interchange-interventions
2604.04323 | How Well Do Agentic Skills Work in the Wild: Benchmarking LLM Skill Usage in Realistic Settings | cs.CL | 90 | Realistic benchmark of agent skill retrieval/selection over 34k skills; strong signal for agent robustness. | agents, tool-use, skills, retrieval, benchmark, robustness, evaluation
2604.04738 | Fine-Tuning Integrity for Modern Neural Networks: Structured Drift Proofs via Norm, Rank, and Sparsity Certificates | cs.CR, cs.LG | 89 | Cryptographic proofs to certify fine-tune drift bounds; mitigates backdoors/safety regressions. | model-security, fine-tuning, integrity, zero-knowledge, backdoors, provenance, certification
2604.04488 | A Patch-based Cross-view Regularized Framework for Backdoor Defense in Multimodal Large Language Models | cs.CV, cs.LG | 89 | Backdoor defense for multimodal LLMs targeting low-poisoning triggers while preserving benign generation. | security, backdoors, multimodal-LLM, data-poisoning, defense, robustness
2604.04410 | Relative Density Ratio Optimization for Stable and Statistically Consistent Model Alignment | cs.LG, cs.AI, cs.CL, stat.ML | 89 | Statistically consistent alignment via relative density-ratio optimization; addresses instability of DDRO. | alignment, preference-learning, DPO, statistical-consistency, optimization
2604.04757 | Undetectable Conversations Between AI Agents via Pseudorandom Noise-Resilient Key Exchange | cs.CR, cs.AI, cs.LG | 88 | Shows covert agent-to-agent steganographic conversations indistinguishable to auditors; major risk. | steganography, agent-communication, auditing, watermarking, security, covert-channels
2604.04325 | Benchmarking Multi-turn Medical Diagnosis: Hold, Lure, and Self-Correction | cs.CL | 88 | Multi-turn medical diagnosis benchmark reveals premature commitment and self-correction issues in LLMs. | evaluation, multi-turn, medical, decision-making, self-correction, reliability, benchmarks
2604.04712 | Hardware-Level Governance of AI Compute: A Feasibility Taxonomy for Regulatory Compliance and Treaty Verification | cs.CR, cs.CY | 86 | Engineering-grounded taxonomy of hardware compute governance mechanisms + feasibility/adversaries. | AI-governance, compute, hardware-attestation, monitoring, treaty-verification, compliance
2604.04876 | Incompleteness of AI Safety Verification via Kolmogorov Complexity | cs.AI | 86 | Formal incompleteness limit for safety/policy verification via Kolmogorov complexity framing. | formal-verification, AI-safety-theory, limits, Kolmogorov-complexity, policy-compliance
2604.04461 | DP-OPD: Differentially Private On-Policy Distillation for Language Models | cs.LG, cs.AI, cs.CL | 86 | Differentially private on-policy distillation for LMs; targets privacy–utility issues in long rollouts. | privacy, differential-privacy, distillation, language-models, deployment
2604.04930 | Early Stopping for Large Reasoning Models via Confidence Dynamics | cs.CL, cs.AI, cs.LG | 86 | Early-stopping for long CoT via confidence dynamics; cuts compute and mitigates overthinking regressions. | reasoning, chain-of-thought, efficiency, confidence, inference, calibration
2604.04532 | Multilingual Prompt Localization for Agent-as-a-Judge: Language and Backbone Sensitivity in Requirement-Level Evaluation | cs.CL, cs.AI | 85 | Shows agent-judge eval is language-sensitive; prompt localization can flip model rankings. | evaluation, agent-as-judge, multilingual, benchmarking, reliability, measurement
2604.04399 | GUIDE: Interpretable GUI Agent Evaluation via Hierarchical Diagnosis | cs.AI | 84 | Interpretable GUI-agent eval via hierarchical diagnosis; improves long-horizon failure analysis. | agent-evaluation, GUI-agents, diagnostics, benchmarks, trajectories, interpretability
2604.04733 | Discovering Failure Modes in Vision-Language Models using RL | cs.CV, cs.AI | 84 | RL framework to automatically discover VLM failure modes/blind spots without manual curation. | evaluation, red-teaming, vision-language, robustness, reinforcement-learning
2604.04855 | The Role of Generator Access in Autoregressive Post-Training | cs.LG | 84 | Shows generator interface (prefix control/logits) can yield exponential gains in KL-regularized post-training. | post-training, RLHF, DPO, generator-access, theory, sample-efficiency, alignment
2604.04328 | Soft Tournament Equilibrium | cs.AI, cs.LG, cs.MA | 83 | Differentiable set-valued tournament solution for non-transitive agent comparisons; avoids unstable rankings. | agent-evaluation, tournaments, non-transitivity, ranking, multi-agent, metrics
2604.04917 | Vero: An Open RL Recipe for General Visual Reasoning | cs.CV, cs.AI, cs.CL | 83 | Open RL recipe + 600K multi-task reward data for visual reasoning; strong reproducible VLM progress. | VLM, reinforcement-learning, open-models, dataset, visual-reasoning, post-training
2604.04872 | Synthetic Sandbox for Training Machine Learning Engineering Agents | cs.CL, cs.LG | 82 | Synthetic sandbox enabling on-policy RL for ML-engineering agents by shrinking verification cost. | agents, RL, sandbox, MLE, training-framework, evaluation, scaling
2604.04767 | Cog-DRIFT: Exploration on Adaptively Reformulated Instances Enables Learning from Hard Reasoning Problems | cs.LG, cs.AI, cs.CL | 82 | Task reformulation to learn from too-hard reasoning problems in RLVR; bootstraps via easier variants. | reasoning, RLVR, curriculum, task-reformulation, training, LLMs
2604.04898 | QED-Nano: Teaching a Tiny Model to Prove Hard Theorems | cs.AI, cs.CL, cs.LG | 82 | Post-trains an open 4B model for Olympiad-level proofs; useful for studying small-model reasoning. | reasoning, math, small-models, post-training, proofs, distillation
2604.04902 | Are Latent Reasoning Models Easily Interpretable? | cs.LG | 82 | Finds latent reasoning tokens often unused; challenges interpretability/monitoring claims of LRMs. | interpretability, reasoning, latent-reasoning, monitoring, evaluation, mechanistic
2604.04847 | Full-Duplex-Bench-v3: Benchmarking Tool Use for Full-Duplex Voice Agents Under Real-World Disfluency | eess.AS, cs.CL | 81 | Real-audio full-duplex voice agent tool-use benchmark with disfluencies + latency/turn-taking. | voice-agents, tool-use, benchmark, speech, disfluency, evaluation, latency

AI Paper Insights Briefing

2026-04-08

0) Executive Summary (read this first)

  • "Agent skills" do not transfer smoothly from curated benchmarks to real deployments: once an agent must retrieve, select, and adapt from a pool of 34k skills, gains often collapse to near the no-skill baseline; iterative agentic hybrid search plus query-specific refinement can recover meaningful performance (e.g., +7.8 percentage points for Claude Opus 4.6 on Terminal-Bench 2.0).
  • Multi-turn interaction is itself a safety hazard in high-stakes domains: in medical diagnosis, models frequently commit prematurely; simply holding the question until the final turn largely recovers single-turn accuracy, while "salient evidence" (such as lab values) can act as a lure that triggers premature, and often wrong, answers.
  • Safety evaluation is shifting from "what the tool says" to "what the tool does at runtime": network-level monitoring (MITM plus decrypted traffic event traces) can detect MCP supply-chain injections with very high reported F1 and very low FPR; persistent agent state (memory/identity/skills) is a significant real-world attack surface that survives across sessions.
  • Conversation logs alone are not a sufficient control: cryptographic results show agents can embed undetectable covert communication inside honest-looking dialogue; key exchange is possible even under weak entropy assumptions (via a new primitive), weakening passive auditing as a safety strategy.
  • Evaluation itself is becoming a first-class failure mode: GUI-agent evaluation improves substantially with hierarchical diagnosis, but "agent-as-a-judge" results can flip model rankings depending on the judge language, with low agreement across backbones, implying benchmark conclusions may not generalize across regions and languages.
  • Alignment and integrity are becoming more mechanistic and more cryptographic: aligned behavior can be localized to sparse routing circuits (gate→amplifiers) and bypassed via encoding schemes, while fine-tuning integrity can be certified with succinct ZK proofs over structured drift (norm/rank/sparsity), enabling new governance and audit workflows.

2) Key Themes (clusters)

  • Theme: Skills and tool use in the wild (retrieval, selection, adaptation)
  • Theme: Multi-turn safety failures (premature commitment, lures, persona attacks)
  • Theme: Runtime agent security (supply chain, persistence, exploitation triggers)
  • Theme: Auditing, provenance, and cryptographic integrity for agents and models
  • Theme: Evaluation reliability and interpretability tooling (judges, diagnosis, rankings)

3) Technical Synthesis

  • Benchmark realism upgrades follow a common pattern: remove oracle access (forced skill loading; all evidence in a single turn; short trajectories) and measure how performance degrades under retrieval noise, incremental evidence, long horizons, and runtime constraints.
  • Several papers converge on a two-stage loop: (i) search/retrieve/segment (skill retrieval; trajectory segmentation; RL-trained questioner probing), then (ii) refine/diagnose/steer (query-specific skill refinement; sub-task diagnosis; latent steering).
  • "Commitment control" appears across domains: Q-Last reduces premature diagnosis in medicine; voice agents struggle with self-correction rollbacks; early stopping for reasoning models uses confidence dynamics to avoid overthinking (see the sketch after this list).
  • Safety defenses are moving from semantic checks to telemetry-based signals: ShieldNet's decrypted event traces and OpenClaw's external action verification both reflect a shift toward execution evidence.
  • Multiple works stress that surface artifacts are insufficient: tool schemas (MCP) do not reveal injection behavior; conversation logs do not block covert channels; "detecting" a representation does not guarantee aligned behavior (alignment routing circuits).
  • The split between parametric and semantic guarantees is widening: fine-tuning integrity proofs can certify norm/rank/sparsity drift, but small drift can still produce large behavioral change (explicitly acknowledged in the paper).
  • Evaluation methodology itself is under attack: the multilingual agent-as-a-judge study shows judge language/backbone interactions can flip rankings; STE argues rankings are brittle under cycles and proposes a set-valued core.
  • RL is being used both to build capability (Vero's open visual-reasoning RL; QED-Nano's proof RL; SandMLE's on-policy RL via micro-sandboxes; Cog-DRIFT's reformulation curriculum) and to find failures (an RL questioner for VLM failure modes).
  • Many methods rely on strong auxiliary models (verifiers/judges/graders) and therefore inherit their biases (Qwen3-32B for sharding medical cases into turns; VLM verifiers; counseling judges; proof graders).
  • A recurring practical trade-off is robustness vs. cost (compute for query-specific refinement; roughly 2× training cost for the cross-view backdoor defense; teacher-query overhead in DP-OPD; intrusiveness of network MITM; long-token reasoning vs. early stopping).
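
To make the "commitment control" pattern concrete, here is a minimal sketch of confidence-dynamics early stopping for long chain-of-thought decoding. It illustrates the general idea only, not the method of the early-stopping paper; `generate_segment` and `answer_confidence` are hypothetical hooks into whatever decoder and confidence probe a deployment actually uses.

```python
from typing import Callable, List

def should_stop(conf_history: List[float],
                threshold: float = 0.9,
                patience: int = 3,
                min_delta: float = 0.01) -> bool:
    """Stop once confidence in the current answer is high and has plateaued."""
    if len(conf_history) < patience:
        return False
    recent = conf_history[-patience:]
    plateaued = max(recent) - min(recent) < min_delta
    return recent[-1] >= threshold and plateaued

def generate_with_early_stop(generate_segment: Callable[[str], str],
                             answer_confidence: Callable[[str], float],
                             prompt: str,
                             max_segments: int = 32) -> str:
    """Interleave reasoning segments with a confidence probe (hypothetical callables)."""
    text, confs = prompt, []
    for _ in range(max_segments):
        text += generate_segment(text)        # one more chunk of chain-of-thought
        confs.append(answer_confidence(text))  # probe confidence in the current answer
        if should_stop(confs):
            break
    return text
```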

4) Top 5 Papers (with "why now")

1) ShieldNet: Network-Level Guardrails against Emerging Supply-Chain Injections in Agentic Systems

  • Introduces SC-Inject-Bench: code-level tool injections validated by network traces (PCAPs), built around a realistic MCP supply-chain threat model.
  • Demonstrates a practical guardrail: MITM plus decrypted HTTP(S) plus structured event traces, with a lightweight post-trained detector (Qwen3-0.6B) for streaming detection (a minimal sketch follows this summary).
  • Reports strong detection results (e.g., PCAP-level F1 = 0.995, FPR = 0.008); ablations show decryption is critical.
  • Caveats / limitations: focuses on network-visible attacks; operational constraints of MITM/decryption and QUIC blocking may limit deployability.
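
As a rough illustration of this telemetry-based approach (not ShieldNet's actual schema or detector), a network guardrail of this kind can be thought of as serializing each decrypted tool-call event into a structured trace and scoring it with a lightweight classifier in a streaming loop. All field names and the `score_fn` placeholder below are assumptions.

```python
from dataclasses import dataclass, asdict
from typing import Callable, Iterable, List
import json

@dataclass
class ToolCallEvent:
    """One decrypted network event attributed to an agent tool call (illustrative fields)."""
    session_id: str
    tool_name: str
    method: str          # e.g. "POST"
    host: str
    path: str
    request_body: str    # decrypted payload (possibly truncated)
    response_snippet: str

def stream_detect(events: Iterable[ToolCallEvent],
                  score_fn: Callable[[str], float],
                  threshold: float = 0.5) -> List[ToolCallEvent]:
    """Flag events whose serialized trace scores above a threshold.

    score_fn stands in for a lightweight post-trained detector
    (e.g., a small LM scoring injection likelihood from the trace text).
    """
    flagged = []
    for ev in events:
        trace = json.dumps(asdict(ev), ensure_ascii=False)
        if score_fn(trace) >= threshold:
            flagged.append(ev)
    return flagged
```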

2) Your Agent, Their Asset: A Real-World Safety Analysis of OpenClaw

  • Treats persistent agent state as a first-class safety object via the CIK taxonomy (Capability/Identity/Knowledge) and tests cross-session poisoning.
  • Shows large ASR increases under single-dimension poisoning (per the abstract: 24.6% baseline → 64–74% after poisoning) and highlights executable capability payloads as especially dangerous.
  • Evaluates defenses and surfaces an evolution-vs-safety trade-off (file protection blocks the attacks but also blocks legitimate updates).
  • Caveats / limitations: a single platform (OpenClaw) and 12 scenarios; cross-dimension chained attacks are not studied.

3) How Alignment Routes: Localizing, Scaling, and Controlling Policy Circuits in Language Models

  • Localizes aligned behavior to a sparse gate→amplifier routing structure across 9 models, with causal tests (interchange interventions, knockout cascades).
  • Enables continuous control via detection-layer modulation (dose-response) and documents a concrete bypass: cipher-style encodings collapse the gate contribution (e.g., a 78% collapse on Phi-4-mini). A minimal steering sketch follows this summary.
  • Provides a mechanistic explanation for "detects but does not refuse" behavior and identifies targets for auditing and defense.
  • Caveats / limitations: scope is limited to 2–32B models and political/safety refusal domains; MLP routing contributions remain unexplained.
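
Below is a minimal, hypothetical sketch of dose-response control over a localized direction in the residual stream, in the spirit of the gate modulation described above. The layer index, the `gate_direction` vector, and the additive rescaling are illustrative assumptions, not the paper's exact intervention.

```python
import torch

def scale_direction_hook(direction: torch.Tensor, alpha: float):
    """Rescale the residual-stream component along one direction.

    `direction` stands in for a localized "gate" feature; `alpha` is the dose:
    alpha=1.0 leaves the model unchanged, alpha=0.0 ablates the component,
    alpha>1.0 amplifies it.
    """
    d = direction / direction.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        dd = d.to(hidden.dtype).to(hidden.device)
        coeff = hidden @ dd                                  # [batch, seq]
        hidden = hidden + (alpha - 1.0) * coeff.unsqueeze(-1) * dd
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

    return hook

# Hypothetical usage on a HuggingFace-style decoder (layer path is an assumption):
# handle = model.model.layers[layer_idx].register_forward_hook(
#     scale_direction_hook(gate_direction, alpha=0.5))
# ... run generation, then handle.remove()
```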

4) Benchmarking Multi-turn Medical Diagnosis: Hold, Lure, and Self-Correction

  • Releases MINT (1,035 cases), converting multiple medical QA datasets into controlled multi-turn shards with information-preservation checks.
  • Finds widespread premature commitment (55.3% of initial answers occur within the first two turns), and that holding the question until the last turn largely recovers single-turn accuracy (a simple metric sketch follows this summary).
  • Identifies lab results as lures that trigger immediate answering; proposes evidence scheduling as a mitigation.
  • Caveats / limitations: converted from case vignettes (not natural dialogue) and focused on multiple-choice questions; the Q-Last protocol may be unrealistic.
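
For readers who want to track the same symptoms in their own multi-turn evaluations, here is a small sketch of early-commitment and flip-dynamics (F2T/T2F) metrics. The record schema is a hypothetical stand-in, not MINT's format.

```python
from typing import Dict, List

def turn_metrics(records: List[Dict]) -> Dict[str, float]:
    """Summarize premature commitment and answer-flip dynamics.

    Each record is assumed to look like:
      {"first_commit_turn": 2,          # turn at which the model first committed to a diagnosis
       "num_turns": 6,
       "first_answer_correct": False,
       "final_answer_correct": True}
    """
    n = len(records)
    early = sum(r["first_commit_turn"] <= 2 for r in records)
    f2t = sum((not r["first_answer_correct"]) and r["final_answer_correct"]
              for r in records)          # wrong early, corrected later (False-to-True)
    t2f = sum(r["first_answer_correct"] and (not r["final_answer_correct"])
              for r in records)          # right early, overturned later (True-to-False)
    return {"early_commit_rate": early / n,
            "F2T_rate": f2t / n,
            "T2F_rate": t2f / n}
```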

5) Undetectable Conversations Between AI Agents via Pseudorandom Noise-Resilient Key Exchange

  • Formalizes and proves that passive conversation auditing can fail: agents can embed covert communication indistinguishable from honest dialogue.
  • Introduces bundle sampling and a new primitive, PNR-KE, with constructions and strong limitations (e.g., attacks on non-interactive schemes).
  • Reframes "log everything" as insufficient, in general, to prevent covert coordination.
  • Caveats / limitations: largely theoretical; relies on cryptographic assumptions (e.g., LSPN) and on whether real LLM message distributions satisfy the entropy assumptions.

5) Practical Next Steps

  • For skill-using agent platforms: benchmark with retrieval from large, noisy pools (rather than forced skill loading), make agentic hybrid search plus query-specific refinement a standard baseline, and track both Recall@k and end-to-end pass rate (a Recall@k sketch follows this list).
  • For multi-turn high-stakes assistants (medicine/counseling): implement commitment-aware protocols (explicitly "hold" when evidence is insufficient; defer salient lures such as lab results), and measure early-commitment error rates and flip dynamics (F2T/T2F).
  • For the MCP/tool ecosystem: deploy runtime network telemetry (decrypted HTTP(S) event traces) where feasible, evaluate against code-level injection benchmarks, and combine with host telemetry to cover non-network attacks.
  • For personal/local agents: treat persistent state as a security boundary by adding write protection, code signing for skills, and explicit approval for changes to executable capabilities; measure the evolution-vs-safety trade-off explicitly.
  • For monitoring/covert-channel risk: do not rely on conversation logs alone; if covert coordination is in scope for the threat model, consider active-warden-style interventions, randomized checks, or protocol-level constraints.
  • For evaluation pipelines: report the judge language and backbone; for multilingual deployments, run at least two judge backbones and quantify their agreement (an agreement sketch follows this list); when comparisons contain cycles, consider reporting tournament cores (Top Cycle / Uncovered Set).
  • For fine-tuning governance: if integrity guarantees are needed, pilot structured drift certificates (norm/rank/sparsity; a drift-statistics sketch follows this list) alongside behavioral audits, since parametric constraints do not imply semantic safety.
  • For hallucination reduction and calibration: consider training-time routing (Answer/Ask/Abstain) and/or inference-time latent steering and early stopping, but evaluate separately on task types known to behave differently (factual QA vs. open-ended generation vs. misconception-heavy questions).
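
A minimal sketch of the retrieval-vs-outcome tracking suggested above for skill-using agents; `retrieved`, `relevant`, and `results` are hypothetical data structures, not any benchmark's API.

```python
from typing import Dict, List, Sequence

def recall_at_k(retrieved: Dict[str, List[str]],
                relevant: Dict[str, Sequence[str]],
                k: int = 10) -> float:
    """Fraction of tasks whose required skills all appear in the top-k retrieved list.

    retrieved: task_id -> ranked skill ids returned by the retriever
    relevant:  task_id -> skill ids the task actually needs
    """
    hits = 0
    for task_id, needed in relevant.items():
        topk = set(retrieved.get(task_id, [])[:k])
        hits += all(skill in topk for skill in needed)
    return hits / max(len(relevant), 1)

def end_to_end_pass_rate(results: Dict[str, bool]) -> float:
    """results: task_id -> whether the agent's final run passed the task checker."""
    return sum(results.values()) / max(len(results), 1)
```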
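
For judge-agreement checks across backbones, a standard chance-corrected agreement statistic such as Cohen's kappa is one simple option; this sketch assumes both judges label the same items with categorical verdicts.

```python
from collections import Counter
from typing import Sequence

def cohens_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Chance-corrected agreement between two judge backbones on the same items."""
    assert len(labels_a) == len(labels_b) and len(labels_a) > 0
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[l] * counts_b[l]
                   for l in set(counts_a) | set(counts_b)) / (n * n)
    return (observed - expected) / (1 - expected) if expected < 1 else 1.0
```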
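
For piloting structured drift checks, the sketch below computes per-tensor norm/rank/sparsity statistics between two checkpoints (assuming PyTorch state_dict-style inputs); it reproduces only the statistics, not the succinct cryptographic certificate described in the paper.

```python
import torch

def drift_summary(before: dict, after: dict,
                  sparsity_tol: float = 1e-6) -> dict:
    """Per-tensor structured-drift statistics between two checkpoints.

    before/after: name -> weight tensor (e.g., from model.state_dict()).
    """
    report = {}
    for name, w0 in before.items():
        delta = after[name].float() - w0.float()
        stats = {"fro_norm": delta.norm().item(),
                 "sparsity": (delta.abs() <= sparsity_tol).float().mean().item()}
        if delta.dim() == 2:
            # numerical rank of the update, e.g. to check a LoRA-style low-rank claim
            stats["num_rank"] = int(torch.linalg.matrix_rank(delta).item())
        report[name] = stats
    return report
```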

Generated from per-paper analyses; no external browsing was performed.