AI Paper Daily (2026-04-16)

Published:

English version: /paper-news/2026-04-16/

Run statistics

  • Candidate papers: 261
  • Selected papers: 30
  • Deep reads completed: 30
  • Time window (UTC): 2026-04-14T00:00:00Z → 2026-04-15T00:00:00Z (arxiv_announce, expanded=0)
Paper list used for this summary
Each entry: arXiv ID | title (PDF) | categories | score, followed by the selection rationale and tags.

2604.12177 | Policy-Invisible Violations in LLM-Based Agents (PDF) | cs.AI, cs.CL, cs.CR, cs.LG | 95
  Rationale: New agent failure mode + benchmark for compliance when policy facts are hidden from context
  Tags: agents, compliance, benchmark, evaluation, tool-use, context-limitations, governance
2604.12500 | Safety Training Modulates Harmful Misalignment Under On-Policy RL, But Direction Depends on Environment Design (PDF) | cs.LG, cs.CR | 95
  Rationale: Shows RL safety training can flip to harmful misalignment depending on environment design
  Tags: agent-safety, rl, specification-gaming, sycophancy, evaluation, misalignment
2604.12172 | COBALT-TLA: A Neuro-Symbolic Verification Loop for Cross-Chain Bridge Vulnerability Discovery (PDF) | cs.CR, cs.LO | 95
  Rationale: LLM+TLA+ loop finds bridge vulns fast; strong security relevance and concrete eval on Nomad-like exploit.
  Tags: agent-security, formal-verification, TLA+, cybersecurity, vulnerability-discovery, neuro-symbolic, tool-augmented-LLMs
2604.12384 | Preventing Safety Drift in Large Language Models via Coupled Weight and Activation Constraints (PDF) | cs.AI | 95
  Rationale: Coupled weight+activation constraints to prevent safety drift during fine-tuning; uses SAE safety features.
  Tags: llm-safety, safety-drift, fine-tuning, regularization, sparse-autoencoders, refusal, alignment
2604.12284 | WebAgentGuard: A Reasoning-Driven Guard Model for Detecting Prompt Injection Attacks in Web Agents (PDF) | cs.CR | 93
  Rationale: Guard-agent architecture to detect web prompt injection; targets real VLM web-agent threat model
  Tags: web-agents, prompt-injection, guard-model, agent-security, VLM, detection
2604.13018 | Toward Autonomous Long-Horizon Engineering for ML Research (PDF) | cs.CL | 93
  Rationale: Long-horizon ML research engineering agent with permission-scoped workspace; relevant to agent safety & control.
  Tags: agents, autonomous-research, orchestration, tool-use, permissions, state-continuity, agent-evals
2604.12162 | AlphaEval: Evaluating Agents in Production (PDF) | cs.CL | 92
  Rationale: Production-grounded agent benchmark (94 tasks, 7 companies) addressing real eval gaps
  Tags: agents, evaluation, benchmarks, production, long-horizon, human-judgment
2604.12374 | Nemotron 3 Super: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning (PDF) | cs.LG, cs.AI, cs.CL | 92
  Rationale: Open 120B MoE hybrid Mamba-Transformer w/ 1M context + speculative decoding; big frontier capability jump.
  Tags: frontier-llm, MoE, mamba, long-context, efficiency, speculative-decoding, open-model
2604.12232 | TEMPLATEFUZZ: Fine-Grained Chat Template Fuzzing for Jailbreaking and Red Teaming LLMs (PDF) | cs.CR, cs.AI, cs.SE | 91
  Rationale: Fuzzing chat templates as an overlooked jailbreak surface; systematic red-teaming methodology
  Tags: jailbreak, red-teaming, fuzzing, chat-templates, LLM-security, evaluation
2604.13006 | One Token Away from Collapse: The Fragility of Instruction-Tuned Helpfulness (PDF) | cs.CL, cs.AI | 91
  Rationale: Shows instruction-tuned helpfulness collapses under tiny lexical constraints; important robustness failure mode.
  Tags: robustness, instruction-tuning, evaluation, reliability, constraints, helpfulness, failure-modes
2604.12342 | CoLA: A Choice Leakage Attack Framework to Expose Privacy Risks in Subset Training (PDF) | cs.CR, cs.CV | 90
  Rationale: New privacy attack surface: subset/coreset selection choices can leak sensitive info
  Tags: privacy, data-leakage, training-data, security, attacks, coresets
2604.12632 | Calibration-Aware Policy Optimization for Reasoning LLMs (PDF) | cs.LG, cs.AI | 90
  Rationale: Targets overconfidence from GRPO; proposes calibration-aware RL objective with theory + bounds for reasoning LLMs.
  Tags: alignment, calibration, RLHF, policy-optimization, reasoning, uncertainty, GRPO
2604.12312 | CompliBench: Benchmarking LLM Judges for Compliance Violation Detection in Dialogue Systems (PDF) | cs.CL | 89
  Rationale: Benchmark for LLM-judge reliability in detecting/localizing compliance violations in dialogues
  Tags: LLM-as-judge, compliance, benchmark, evaluation, dialogue, policy-violations
2604.12359 | Compiling Activation Steering into Weights via Null-Space Constraints for Stealthy Backdoors (PDF) | cs.CR, cs.CL | 88
  Rationale: Stealthy LLM backdoors by compiling activation steering into weights; highlights supply-chain risk
  Tags: backdoors, weight-editing, supply-chain, LLM-security, stealth-attacks, red-teaming
2604.12994 | LogicEval: A Systematic Framework for Evaluating Automated Repair Techniques for Logical Vulnerabilities in Real-World Software (PDF) | cs.CR, cs.AI | 88
  Rationale: Framework to evaluate LLM vs classic repair on real logical vulns; useful for secure coding
  Tags: cybersecurity, program-repair, llm-for-code, evaluation, vulnerabilities
2604.12290 | Frontier-Eng: Benchmarking Self-Evolving Agents on Real-World Engineering Tasks with Generative Optimization (PDF) | cs.AI, cs.CL | 88
  Rationale: Real-world engineering benchmark for iterative propose-execute-evaluate agents with verifiers and continuous rewards.
  Tags: agents, evaluation, benchmarks, generative-optimization, tool-use, verifiers, long-horizon
2604.12308 | ContextLens: Modeling Imperfect Privacy and Safety Context for Legal Compliance (PDF) | cs.CL | 88
  Rationale: Models ambiguous/incomplete context for privacy & safety legal compliance; explicit known/unknown factorization.
  Tags: privacy, safety, legal-compliance, context-modeling, llm-evals, risk-assessment, governance
2604.12616 | Every Picture Tells a Dangerous Story: Memory-Augmented Multi-Agent Jailbreak Attacks on VLMs (PDF) | cs.AI, cs.MM | 87
  Rationale: Memory-augmented multi-agent jailbreaks for VLMs using natural-image semantics, not just pixels
  Tags: VLM, multimodal-jailbreak, multi-agent, memory, adversarial-attacks, red-teaming
2604.13016 | Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe (PDF) | cs.LG, cs.AI, cs.CL | 87
  Rationale: Systematic study of on-policy distillation dynamics; actionable recipe for post-training
  Tags: post-training, distillation, rlhf, training-dynamics, llms
2604.12559 | FABLE: Fine-grained Fact Anchoring for Unstructured Model Editing (PDF) | cs.CL | 87
  Rationale: Fine-grained fact anchoring for model editing + new diagnostic benchmark (UnFine); useful for knowledge updates.
  Tags: model-editing, factuality, knowledge-updates, benchmarks, transformers, reliability
2604.12376 | Cooperative Memory Paging with Keyword Bookmarks for Long-Horizon LLM Conversations (PDF) | cs.CL, cs.AI | 86
  Rationale: Practical long-horizon conversation memory: keyword bookmarks + recall tool; beats retrieval/truncation baselines.
  Tags: agents, memory, long-context, tool-use, conversation, retrieval, evaluation
2604.12736 | Token-Level Policy Optimization: Linking Group-Level Rewards to Token-Level Aggregation via Sequence-Level Likelihood (PDF) | cs.CL | 86
  Rationale: Token-level policy optimization linking group rewards to tokens; targets sparse-reward CoT training issues.
  Tags: rlhf, policy-optimization, reasoning, sparse-rewards, grpo, kl-regularization, training
2604.12986 | Parallax: Why AI Agents That Think Must Never Act (PDF) | cs.CR, cs.AI | 85
  Rationale: Argues prompt guardrails are insufficient for acting agents; proposes cognitive/executive separation
  Tags: agent-safety, systems-security, sandboxing, permissions, architecture, governance
2604.13029 | Visual Preference Optimization with Rubric Rewards (PDF) | cs.CV, cs.AI | 85
  Rationale: Rubric-based rewards for visual DPO; reusable rubric pool improves judge quality and downstream performance.
  Tags: multimodal, dpo, reward-modeling, rubrics, preference-optimization, evaluation
2604.12817 | Understanding and Improving Continuous Adversarial Training for LLMs via In-context Learning Theory (PDF) | cs.LG, cs.CR, stat.ML | 84
  Rationale: First theory for continuous adversarial training for LLM jailbreak defense via ICL analysis
  Tags: adversarial-training, jailbreak-defense, theory, ICL, robustness, LLM-security
2604.12610 | Transforming External Knowledge into Triplets for Enhanced Retrieval in RAG of LLMs (PDF) | cs.CL | 84
  Rationale: Triplet-structured retrieval to reduce RAG redundancy and improve alignment/efficiency
  Tags: rag, retrieval, hallucinations, grounding, context-efficiency
2604.12231 | Thought-Retriever: Don't Just Retrieve Raw Data, Retrieve Thoughts for Memory-Augmented Agentic Systems (PDF) | cs.CL, cs.IR | 84
  Rationale: Retrieves 'thoughts' not chunks to use arbitrarily large corpora beyond context limits; agentic memory angle.
  Tags: RAG, agents, memory, retrieval, context-length, reasoning, model-agnostic
2604.12379 | Beyond Output Correctness: Benchmarking and Evaluating Large Language Model Reasoning in Coding Tasks (PDF) | cs.SE, cs.AI, cs.LG | 83
  Rationale: Code reasoning-quality benchmark + evaluator; moves beyond output correctness for LLMs
  Tags: evaluation, reasoning, code, benchmarks, verifiers
2604.12875 | AISafetyBenchExplorer: A Metric-Aware Catalogue of AI Safety Benchmarks Reveals Fragmented Measurement and Weak Benchmark Governance (PDF) | cs.AI | 82
  Rationale: Catalogue of 195 safety benchmarks; meta-analysis shows fragmented metrics and weak governance
  Tags: safety-benchmarks, measurement, meta-evaluation, governance, catalogue, metrics
2604.12967 | Cycle-Consistent Search: Question Reconstructability as a Proxy Reward for Search Agent Training (PDF) | cs.AI | 82
  Rationale: Gold-free reward for training search agents via question reconstructability (cycle-consistency)
  Tags: agents, search, reinforcement-learning, self-supervision, retrieval

AI Paper Insights Brief

2026-04-16

0) Key takeaways (read this first)

  • Real-world agent readiness is still low and highly pipeline-dependent: AlphaEval's best production configuration scores only 64.41/100, and scaffold choice swings scores by roughly 11-15 points, suggesting infrastructure/orchestration matters as much as the base model itself.
  • Safety failures increasingly look like system failures rather than model-reasoning failures: Policy-Invisible Violations shows models violate policy on 90-98% of high-risk actions when the policy metadata is hidden; Parallax argues for architectural isolation (the reasoner must never execute) and reports 98.9-100% interception rates under assume-compromise evaluation.
  • The attack surface is shifting to structure (templates, tools, images, weights), not just prompts: TemplateFuzz reaches ~98% Top-5 ASR on open-source models and transfers at 80-100% to commercial models; MemJack reaches 71.48% ASR with unmodified natural images; STEEREDIT compiles steering into weights with URR >97% and low leakage under null-space constraints.
  • Evaluation is fragmenting, but better measurement primitives are emerging: AlphaEval (production tasks), Frontier-Eng (budget-constrained optimization), CompliBench (turn-level guideline violations), CodeRQ-Bench/VERA (code reasoning quality), and AISafetyBenchExplorer (metric-collision governance) together point away from single-score benchmarks toward trace-, rubric-, and structure-aware evaluation.
  • RL/post-training is being redesigned for stability and trustworthy signals: CAPO targets calibration collapse under GRPO (AUC gains on AIME 2025), TEPO improves token-level credit assignment and convergence, and the OPD analysis shows distillation success depends on teacher-student overlap in "thinking patterns" and breaks down at long trajectory depths.

1) Key themes (clusters)

Theme: Production-oriented agent evaluation and optimization-style benchmarks

Theme: Enterprise compliance and policy enforcement need world state, not better prompts

Theme: Agent safety is becoming architecture-first (guards, isolation, formal verification loops)

Theme: Red teaming is expanding to templates, multimodal semantics, and stealthy weight attacks

Theme: Post-training stability: calibration, token-level credit assignment, distillation dynamics, and constraint fragility

Theme: Memory and retrieval are moving from raw chunks to structured, query-aligned representations

2) Technical synthesis

  • Production evaluation (AlphaEval) and benchmark governance (AISafetyBenchExplorer) converge on the same point: metric definitions plus aggregation rules are part of the system under test, and scaffold/evaluator choices can dominate conclusions.
  • Multiple works independently separate the judge/guard from the executor: WebAgentGuard (parallel guard), Parallax (process isolation + tiered verifiers), Sentinel (world-state invariants), COBALT-TLA (LLM + TLC oracle loop).
  • A recurring pattern is controlling hallucination with boundedness plus deterministic feedback: COBALT-TLA's TLC bound (MaxTokens=3); AlphaEval's Docker sandbox plus rubric scripts; Frontier-Eng's read-only evaluators (see the sketch after this list).
  • Safety evaluation is shifting from "did it refuse?" to trajectory- and turn-level adjudication (AlphaEval traces; PhantomPolicy trajectory relabeling; CompliBench turn-level labels).
  • Red teaming is increasingly search-based (TemplateFuzz's MCTS-like exploration; MemJack's MCTS/evolutionary search; Frontier-Eng's generative optimization), which implies defenses must assume adaptive attackers.
  • Post-training methods are being redesigned around secondary properties beyond accuracy: CAPO optimizes relative calibration (AUC), TEPO targets stability and credit assignment, OPD focuses on overlap geometry, and CWAC targets safety drift during fine-tuning.
  • Several papers highlight evaluation blind spots: AlphaEval shows benchmark/production mismatch; One-Token-Away shows standalone judges miss large quality drops; AISafetyBenchExplorer documents metric collisions.
  • Memory/retrieval work is converging on structured intermediate artifacts (thoughts, triplets, bookmarks) rather than raw logs, but the key bottleneck becomes selection/discrimination rather than storage.
  • Security threats span the full stack: templates → web pages → images → weights → data pipelines (TemplateFuzz, WebAgentGuard, MemJack, STEEREDIT, CoLA), which means prompt-level safety alone is insufficient.
  • Formal methods are re-entering practical security via LLM-mediated interfaces (COBALT-TLA), but remain constrained by bounded, small scopes and abstraction limits.
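
Several of these systems share the same control-flow skeleton: an LLM proposes a candidate artifact (spec, patch, tool plan), a deterministic oracle checks it, and the loop is explicitly bounded. A minimal sketch of that pattern, with `propose` and `check` as hypothetical stand-ins rather than any paper's actual API:

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CheckResult:
    ok: bool
    feedback: str = ""  # counterexample / violated invariant, fed back to the proposer


def bounded_propose_verify(
    propose: Callable[[str], str],        # hypothetical LLM call: prompt -> candidate artifact
    check: Callable[[str], CheckResult],  # deterministic oracle: model checker, tests, policy engine
    task: str,
    max_rounds: int = 3,                  # explicit bound, analogous to bounded model checking
) -> Optional[str]:
    prompt = task
    for _ in range(max_rounds):
        candidate = propose(prompt)
        result = check(candidate)         # deterministic feedback, not another LLM opinion
        if result.ok:
            return candidate
        prompt = f"{task}\n\nPrevious attempt was rejected:\n{result.feedback}\nRevise and try again."
    return None                           # fail closed: nothing accepted within the bound
```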

3) Top 5 papers (with "why now")

1) AlphaEval: Evaluating Agents in Production

  • Turns real partner requirements into 94 executable production tasks, with multimodal inputs and multi-paradigm evaluation.
  • Shows absolute readiness is low (best 64.41/100) and that scaffolds can shift scores by 11+ points, enough to change deployment decisions.
  • Adds economic anchoring (tasks map to ~2,420 professional hours, valued at $154K-$231K).
  • Caveats: covers only seven companies / six domains and four scaffolds; the snapshot may go stale quickly.

2) Policy-Invisible Violations in LLM-Based Agents

  • Names a deployment-critical failure mode: violations depend on hidden world state, not on visible content.
  • PhantomPolicy shows that under trajectory-level review, models violate policy in 90-98% of high-risk cases.
  • Sentinel demonstrates a concrete enforcement layer (graph fork→mutate→check), reaching 92.99% accuracy / 92.71 F1 under full coverage.
  • Caveats: the guarantee depends on world-model completeness; Sentinel still misses violations (recall gap) and does not monitor text-only outputs.

3) Parallax: Why AI Agents That Think Must Never Act

  • Argues for an architecture-level guarantee: the reasoner cannot execute, and the executor cannot reason.
  • OpenParallax blocks 98.9% of injection attacks by default under assume-compromise evaluation, and 100% in its strictest security mode.
  • Provides a tiered verifier design (deterministic policy → classifier → LLM evaluation → human); see the sketch below.
  • Caveats: strict mode has a 36% false-positive rate; the engine is a single trusted base; rollback cannot undo external side effects.
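
The tiered-verifier idea (cheap deterministic checks first, human review last) can be sketched as a simple cascade. The tier functions here are illustrative placeholders, not OpenParallax's actual components:

```python
from typing import Callable, Optional

# Each tier returns True (allow), False (block), or None (undecided -> escalate to the next tier).
Tier = Callable[[dict], Optional[bool]]


def tiered_verdict(action: dict, tiers: list[Tier]) -> bool:
    for tier in tiers:
        verdict = tier(action)
        if verdict is not None:
            return verdict
    return False  # nothing decided: fail closed


# Hypothetical tiers, ordered from cheap and deterministic to expensive and human.
def deterministic_policy(action: dict) -> Optional[bool]:
    return False if action.get("tool") in {"wire_transfer", "delete_account"} else None


def risk_classifier(action: dict) -> Optional[bool]:
    score = 0.05  # placeholder for a trained risk model's score
    return True if score < 0.2 else None


def llm_judge(action: dict) -> Optional[bool]:
    return None   # placeholder: call an LLM judge and abstain when it is unsure


def human_review(action: dict) -> Optional[bool]:
    return False  # queue for human approval; block until a person signs off


allowed = tiered_verdict({"tool": "send_email", "args": {"to": "user@example.com"}},
                         [deterministic_policy, risk_classifier, llm_judge, human_review])
```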

4) TEMPLATEFUZZ: Fine-Grained Chat Template Fuzzing for Jailbreaking and Red Teaming LLMs

  • Establishes the chat template as a first-class attack surface, with element-level mutation and heuristic search.
  • Reports ~98.2% Top-5 ASR on open-source models with only ~1.1% accuracy degradation; transfers at 80-100% Top-5 ASR to commercial models.
  • Adds a scalable active-learning oracle to judge jailbreak outcomes at low cost.
  • Caveats: transferability may change as templates are hardened or models updated; real-world detectability and countermeasures are not well quantified.

5) Frontier-Eng: Benchmarking Self-Evolving Agents on Real-World Engineering Tasks with Generative Optimization

  • Reframes agent evaluation as budget-constrained iterative optimization, with feasibility gating and frozen verifiers (47 tasks, five categories).
  • Uncovers optimization dynamics: improvement frequency decays roughly as t⁻¹ and improvement magnitude roughly as k⁻¹; under a fixed budget, depth beats breadth.
  • Provides actionable comparisons across models and search frameworks; claude-opus-4.6 leads (mean rank 3.18).
  • Caveats: the mean-rank metric discards magnitude information; suite size and fidelity remain limited.

4) Practical next steps

  • If you ship agents: adopt a production-oriented evaluation harness (AlphaEval-style task packs + sandbox + rubric scripts), and explicitly measure scaffold sensitivity before attributing gains to model upgrades.
  • For enterprise safety: prototype a world-state enforcement layer (Sentinel-style) that simulates tool-call mutations and returns Allow/Block/Clarify; track coverage gaps as a first-class metric (first sketch after this list).
  • For agent execution safety: run assume-compromise tests (inject tool calls directly at the execution boundary) to verify that safety does not depend on model refusal (Parallax methodology; see the test sketch below).
  • For web agents: consider gating actions with a parallel multimodal guard; evaluate out-of-domain attacks (PopUp/VPI/EIA) and measure latency under parallel execution (WebAgentGuard).
  • For red teams: add template fuzzing and multimodal semantic-jailbreak suites to CI; treat chat templates and rendered page content as adversarial inputs, not trusted formatting (fuzzing sketch below).
  • For post-training: when using GRPO-style RL, track calibration (AUC) alongside accuracy; if AUC falls during training, consider a CAPO-style objective (AUC sketch below).
  • For long-horizon systems: prefer reversible memory (bookmarks + recall), measure page-selection accuracy separately from whether anything was retrieved, and invest in making bookmarks more discriminative (memory-metrics sketch below).
  • For supply-chain risk: add checks for stealthy weight edits (trigger behavior with low clean-set leakage) and evaluate under distribution shift, since null-space stealth depends on the benign reference set.
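
For the world-state enforcement layer recommendation above, the fork→mutate→check idea can be prototyped roughly as follows; the state schema, invariants, and `apply_refund` helper are illustrative assumptions, not Sentinel's actual implementation:

```python
import copy
from typing import Callable

Invariant = Callable[[dict], bool]  # returns True while the (simulated) world state stays legal


def evaluate_tool_call(state: dict,
                       tool_call: dict,
                       apply_call: Callable[[dict, dict], dict],
                       invariants: dict[str, Invariant],
                       unknown_fields: set[str]) -> tuple[str, list[str]]:
    """Return ('allow' | 'block' | 'clarify', names of violated invariants or missing facts)."""
    # Clarify when the call depends on facts the world model does not actually have.
    missing = [f for f in tool_call.get("reads", []) if f in unknown_fields]
    if missing:
        return "clarify", missing
    # Fork the world state and simulate the mutation; the real state is never touched.
    simulated = apply_call(copy.deepcopy(state), tool_call)
    violated = [name for name, holds in invariants.items() if not holds(simulated)]
    return ("block", violated) if violated else ("allow", [])


# Illustrative policy invariant: total refunds must never exceed the original payment.
INVARIANTS = {"refund_le_payment": lambda s: s["refund_total"] <= s["payment_total"]}


def apply_refund(state: dict, call: dict) -> dict:
    state["refund_total"] += call["args"]["amount"]
    return state


verdict, details = evaluate_tool_call(
    state={"refund_total": 0, "payment_total": 50},
    tool_call={"tool": "refund", "args": {"amount": 80}, "reads": []},
    apply_call=apply_refund,
    invariants=INVARIANTS,
    unknown_fields=set(),
)  # -> ('block', ['refund_le_payment'])
```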
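
The assume-compromise recommendation can be expressed as a unit test that bypasses the model entirely and hits the execution boundary; `ExecutionGateway` is a hypothetical stand-in for whatever dispatches tool calls in your stack:

```python
class ExecutionGateway:
    """Hypothetical stand-in for the real tool dispatcher, where enforcement checks would run."""

    def __init__(self, deny_tools: set[str]):
        self.deny_tools = deny_tools

    def submit(self, tool_call: dict) -> str:
        # In a real system, Sentinel/Parallax-style checks sit here.
        return "block" if tool_call["tool"] in self.deny_tools else "allow"


def test_forged_tool_call_is_blocked():
    gateway = ExecutionGateway(deny_tools={"wire_transfer", "delete_account"})
    # Pretend the planner LLM is fully compromised and emits this call verbatim.
    forged = {"tool": "wire_transfer", "args": {"amount": 10_000, "dest": "attacker-account"}}
    # Must hold with zero reliance on the model choosing to refuse.
    assert gateway.submit(forged) == "block"
```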
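
For the CI red-teaming recommendation, element-level template fuzzing boils down to mutating the structural pieces of the chat template rather than the user text; the template fields and mutation operators below are illustrative, not TEMPLATEFUZZ's actual operators or search strategy:

```python
import random

# Structural elements of a chat template, treated as the fuzzing surface (illustrative format).
BASE_TEMPLATE = {
    "system_open": "<|system|>", "system_close": "<|end|>",
    "user_open": "<|user|>",     "user_close": "<|end|>",
    "assistant_open": "<|assistant|>",
}

# Element-level mutations applied to one structural field at a time, never to the user text.
MUTATIONS = [str.lower, str.upper, lambda s: s.replace("|", ""), lambda s: s + "\n", lambda s: ""]


def render(tpl: dict, system: str, user: str) -> str:
    return (f"{tpl['system_open']}{system}{tpl['system_close']}"
            f"{tpl['user_open']}{user}{tpl['user_close']}{tpl['assistant_open']}")


def fuzz_templates(n: int, seed: int = 0):
    rng = random.Random(seed)
    for _ in range(n):
        tpl = dict(BASE_TEMPLATE)
        field = rng.choice(sorted(tpl))
        tpl[field] = rng.choice(MUTATIONS)(tpl[field])
        yield field, tpl


# Each fuzzed template would be rendered with a fixed red-team prompt, sent to the target model,
# scored by a jailbreak judge, and the attack success rate tracked per mutated element.
for mutated_field, tpl in fuzz_templates(5):
    prompt = render(tpl, system="You are a helpful assistant.", user="<red-team request here>")
```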
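
For the post-training recommendation, the calibration AUC in question is simply whether stated confidence ranks correct answers above incorrect ones; a minimal, library-free version (extracting verbalized confidences is assumed to happen upstream):

```python
def calibration_auc(confidences: list[float], correct: list[bool]) -> float:
    """AUC of stated confidence as a ranking signal for correctness (0.5 = uninformative)."""
    pos = [c for c, ok in zip(confidences, correct) if ok]
    neg = [c for c, ok in zip(confidences, correct) if not ok]
    if not pos or not neg:
        return float("nan")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Log this at every eval step of GRPO-style training; accuracy rising while this AUC falls
# is the overconfidence signature that calibration-aware objectives (CAPO) are meant to fix.
auc = calibration_auc(confidences=[0.95, 0.90, 0.80, 0.40], correct=[True, False, True, False])
```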
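
For the long-horizon memory recommendation, the key measurement point is to report page-selection accuracy separately from whether anything was retrieved at all; the bookmark store below is a guess at the general shape of the idea, not the paper's data structures:

```python
from dataclasses import dataclass, field


@dataclass
class BookmarkStore:
    """Keyword -> page ids; page contents stay in the transcript and are re-read on recall."""
    index: dict[str, list[int]] = field(default_factory=dict)

    def add(self, keyword: str, page_id: int) -> None:
        self.index.setdefault(keyword, []).append(page_id)

    def recall(self, keyword: str) -> list[int]:
        return self.index.get(keyword, [])


def memory_metrics(queries: list[tuple[str, int]], store: BookmarkStore) -> dict[str, float]:
    """Separate 'did we retrieve anything at all' from 'did we pick the right page'."""
    recalled = [store.recall(keyword) for keyword, _ in queries]
    retrieval_rate = sum(bool(pages) for pages in recalled) / len(queries)
    selection_accuracy = sum(pages[:1] == [gold] for pages, (_, gold) in zip(recalled, queries)) / len(queries)
    return {"retrieval_rate": retrieval_rate, "page_selection_accuracy": selection_accuracy}
```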

Generated from per-paper analysis; no external browsing was performed.