AI Paper Daily (2026-04-15)

Published:

English version: /paper-news/2026-04-15/

Run statistics

  • Candidate papers: 306
  • Selected papers: 30
  • Deep reads completed: 30
  • Time window (UTC): 2026-04-13T00:00:00Z → 2026-04-14T00:00:00Z (arxiv_announce, expanded=0)

Papers used for the summary

arXiv ID | Title | Categories | Score | Why selected | Tags
2604.11790 | ClawGuard: A Runtime Security Framework for Tool-Augmented LLM Agents Against Indirect Prompt Injection | cs.CR, cs.AI | 95 | Runtime, deterministic guardrails at tool boundaries to stop indirect prompt injection in agents | agent-security, tool-use, prompt-injection, runtime-enforcement, auditing, sandboxing
2604.11072 | Hodoscope: Unsupervised Monitoring for AI Misbehaviors | cs.AI | 95 | Unsupervised monitoring to surface novel agent misbehaviors beyond predefined rules/judges. | agent-safety, monitoring, unsupervised, anomaly-detection, evaluation
2604.11806 | Detecting Safety Violations Across Many Agent Traces | cs.AI, cs.CL | 93 | Scalable auditing: finds rare/adversarial safety violations only visible across many agent traces | auditing, monitoring, agent-traces, red-teaming, clustering, safety-eval
2604.11322 | Do LLMs Know Tool Irrelevance? Demystifying Structural Alignment Bias in Tool Invocations | cs.CL, cs.AI | 93 | Finds tool-refusal flaw; introduces SABEval to isolate structural vs semantic tool relevance. | tool-use, agents, safety, evaluation, dataset, robustness
2604.11259 | Mobile GUI Agent Privacy Personalization with Trajectory Induced Preference Optimization | cs.AI, cs.CR | 92 | Preference optimization for privacy-personalized mobile GUI agents with heterogeneous trajectories. | agents, privacy, preference-optimization, mobile, security
2604.10988 | WebForge: Breaking the Realism-Reproducibility-Scalability Trilemma in Browser Agent Benchmark | cs.AI, cs.CV | 92 | Automated, scalable browser-agent benchmark resolving realism/reproducibility; strong eval utility. | agents, benchmarks, browser-agents, evaluation, automation, web
2604.11061 | Pando: Do Interpretability Methods Work When Models Won't Explain Themselves? | cs.LG, cs.AI | 91 | Benchmark isolates interpretability signal from elicitation; tests when models mis/avoid explaining | mechanistic-interpretability, evaluation, model-organisms, faithfulness, alignment-auditing
2604.11623 | Context Kubernetes: Declarative Orchestration of Enterprise Knowledge for Agentic AI Systems | cs.AI, cs.SE | 91 | Enterprise agent knowledge orchestration w/ permissions+freshness; shows leakage/phantom-content issues. | agentic-systems, permissions, governance, RAG, security, deployment
2604.11581 | Decomposing and Reducing Hidden Measurement Error in LLM Evaluation Pipelines | cs.CL | 90 | Decomposes hidden uncertainty in LLM eval pipelines; shows design-choice variance can flip rankings | evaluation, measurement-error, judge-models, prompt-variance, reproducibility, safety-standards
2604.11201 | CocoaBench: Evaluating Unified Digital Agents in the Wild | cs.CL, cs.AI | 90 | Benchmark for unified digital agents requiring long-horizon composition of vision/search/coding. | agents, benchmark, evaluation, long-horizon, tool-use
2604.11174 | EmbodiedGovBench: A Benchmark for Governance, Recovery, and Upgrade Safety in Embodied Agent Systems | cs.RO, cs.AI | 89 | Governance-focused embodied-agent benchmark: controllability, policy bounds, recovery, auditability | embodied-agents, governance, oversight, recovery, audit-trails, benchmark
2604.11304 | BankerToolBench: Evaluating AI Agents in End-to-End Investment Banking Workflows | cs.AI | 89 | High-fidelity benchmark for end-to-end investment banking workflows with real tools/deliverables. | agents, benchmark, evaluation, tool-use, real-world-tasks
2604.11641 | CodeTracer: Towards Traceable Agent States | cs.SE, cs.AI | 89 | Tracing architecture for agent state transitions/error chains; improves debugging, auditing, reliability. | agents, observability, tracing, debugging, code-agents, monitoring
2604.11307 | PaperScope: A Multi-Modal Multi-Document Benchmark for Agentic Deep Research Across Massive Scientific Papers | cs.AI | 88 | Multimodal multi-document benchmark for agentic deep research over papers incl. tables/figures. | agents, benchmark, multimodal, scientific-reasoning, retrieval
2604.11182 | Evaluating Memory Capability in Continuous Lifelog Scenario | cs.CL | 88 | LifeDialBench + online temporal-causality protocol for real lifelog memory; reduces temporal leakage. | memory, long-context, evaluation, benchmark, online-eval, agents
2604.11036 | Uncertainty-Aware Web-Conditioned Scientific Fact-Checking | cs.CL, cs.AI | 88 | Uncertainty-gated web retrieval for scientific fact-checking; targets hallucination and grounding. | fact-checking, uncertainty, grounding, retrieval, hallucinations, evaluation
2604.11120 | Persona Non Grata: Single-Method Safety Evaluation Is Incomplete for Persona-Imbued LLMs | cs.AI | 87 | Shows persona safety differs for prompting vs activation steering; single-method eval misses risks | safety-evaluation, personas, activation-steering, jailbreaks, robustness
2604.11662 | Hidden Failures in Robustness: Why Supervised Uncertainty Quantification Needs Better Evaluation | cs.CL | 87 | Large study shows uncertainty probes fail under shift; calls for better UQ/hallucination eval. | uncertainty, hallucinations, robustness, OOD, evaluation
2604.11309 | The Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems | cs.CR, cs.AI, cs.CL, cs.CV, cs.LG | 86 | Multi-turn jailbreak via cumulative 'salami slicing' risk; highlights covert escalation failures | jailbreaks, multi-turn-attacks, cumulative-risk, adversarial-prompting, security
2604.11784 | ClawGUI: A Unified Framework for Training, Evaluating, and Deploying GUI Agents | cs.LG, cs.AI, cs.CL, cs.CV | 86 | Open-source full-stack GUI agent framework: RL infra + stable eval + deployment to real devices. | GUI-agents, RL, evaluation, infrastructure, deployment, reproducibility
2604.10966 | You Only Judge Once: Multi-response Reward Modeling in a Single Forward Pass | cs.CV, cs.AI | 86 | Single-pass multi-response reward modeling + new N-way benchmarks; cheaper preference learning. | reward-modeling, RLHF, preference-learning, efficiency, benchmarks, multimodal
2604.11557 | UniToolCall: Unifying Tool-Use Representation, Data, and Evaluation for LLM Agents | cs.AI | 85 | Unifies tool-use representations + 22k tools + 390k instances; improves comparability for agents | tool-use, agents, datasets, evaluation, function-calling, standardization
2604.11523 | PAC-BENCH: Evaluating Multi-Agent Collaboration under Privacy Constraints | cs.AI, cs.MA | 84 | Benchmark for multi-agent collaboration under privacy constraints; surfaces failure modes + metrics | multi-agent, privacy, collaboration, benchmark, coordination-failures, hallucinations
2604.11419 | Beyond RAG for Cyber Threat Intelligence: A Systematic Evaluation of Graph-Based and Agentic Retrieval | cs.AI, cs.CR | 84 | Systematic eval of graph-based vs agentic retrieval for cyber threat intelligence QA. | RAG, retrieval, knowledge-graphs, cybersecurity, evaluation
2604.11094 | E2E-REME: Towards End-to-End Microservices Auto-Remediation via Experience-Simulation Reinforcement Fine-Tuning | cs.SE, cs.AI | 84 | MicroRemed benchmark + RL fine-tuning for end-to-end LLM remediation generating executable playbooks. | agents, autonomous-remediation, benchmark, RLHF/RLFT, reliability, devops
2604.11611 | Utilizing and Calibrating Hindsight Process Rewards via Reinforcement with Mutual Information Self-Evaluation | cs.CL, cs.LG | 84 | Formal + practical method for calibrated self-reward (hindsight) to densify RL for LLM agents. | LLM-agents, reinforcement-learning, self-reward, calibration, theory, mutual-information
2604.11778 | General365: Benchmarking General Reasoning in Large Language Models Across Diverse and Challenging Tasks | cs.CL, cs.AI | 83 | General365 benchmark targets broad 'general reasoning' decoupled from specialized knowledge. | reasoning, benchmark, evaluation, generalization
2604.11012 | Min-$k$ Sampling: Decoupling Truncation from Temperature Scaling via Relative Logit Dynamics | cs.AI, cs.CL, cs.LG | 83 | Min-k decoding reduces temperature sensitivity via logit-shape 'semantic cliffs'; practical gen quality lever. | decoding, sampling, generation, inference, calibration
2604.11666 | Playing Along: Learning a Double-Agent Defender for Belief Steering via Theory of Mind | cs.CL, cs.AI, cs.LG | 82 | ToM-based 'double-agent' defense task; frontier models struggle; useful for adversarial dialogue eval | theory-of-mind, adversarial-dialogue, privacy, social-engineering, evaluation, defense
2604.11258 | Dialectic-Med: Mitigating Diagnostic Hallucinations via Counterfactual Adversarial Multi-Agent Debate | cs.CL | 82 | Adversarial multi-agent debate with visual falsification to reduce diagnostic hallucinations. | hallucinations, multi-agent, healthcare, multimodal, robustness

AI Paper Insight Brief

2026-04-15

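0) Executive takeaways (read this first)

  • Evaluation is shifting from a "single score" to "diagnostic infrastructure": several new benchmarks and evaluation frameworks (WebForge, CocoaBench, BankerToolBench (BTB), PaperScope, EmbodiedGovBench, LifeDialBench, PAC-BENCH, Pando, CodeTracer) emphasize reproducibility, per-dimension breakdowns, and process- or trajectory-level evidence rather than aggregate accuracy.
  • Multi-turn and cross-trajectory risks have become first-class threat models: Salami Slicing demonstrates high-ASR gradual jailbreaks that evade per-turn refusals; Meerkat and Hodoscope show that repository- or group-level discovery can expose cheating, exploits, and novel misbehaviors with less human review.
  • Tool-augmented agents have two distinct safety gaps: (i) semantic attacks (indirect prompt injection), where deterministic boundary enforcement (ClawGuard) substantially reduces ASR; (ii) structural failures, where the model invokes an irrelevant tool because its interface matches (SABEval), which can be mitigated by rebalancing attention pathways.
  • Preference/reward modeling is becoming more listwise, more efficient, and better calibrated: single-forward-pass multi-response reward modeling cuts multimodal RM latency/FLOPs while improving ranking and GRPO stability; MISE adds calibration to hindsight process rewards to avoid self-evaluation bias.
  • Interpretability results are sobering but actionable: Pando finds that when explanations are missing or misleading, gradients/RelP are the only consistent white-box signal for predicting behavior; many common readouts mainly capture "task representations" rather than the decision computation.
  • Robustness work increasingly focuses on "measurement error" and OOD reality checks: TEE shows that pipeline design variance (prompt/judge interactions) can dominate and that naive confidence intervals under-cover; supervised UQ probes often collapse under OOD (especially on long-form text), and middle layers plus token averaging help but do not address the root cause.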

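2) Key themes (clusters)

Theme: Reproducible, diagnostic agent benchmarks (beyond aggregate success rate)

Theme: Monitoring and auditing for "unknown unknowns" (group/repository level)

  • Why it matters: many serious failures are sparse, novel, or distributed, so per-turn monitoring and per-category judges miss them; discovery pipelines must therefore focus human attention efficiently.
  • Representative papers
  • Common methods
    • Represent actions/trajectories in an embedding space, then surface over-represented clusters (cross-group density differentials; clustering + agentic search).
    • Optimize review efficiency (density-weighted farthest-point sampling; witness-set localization).
    • Turn findings into stronger supervised monitors (example augmentation) or benchmark patches (Commit0 vulnerability).
  • Open questions / failure modes
    • Requires a meaningful comparison group; misbehavior that is equally common across groups does not stand out (Hodoscope).
    • Vulnerable to adversarial dispersion in embedding space (Meerkat's AP drops under an embedding-prefix attack).
    • Extremely rare one-off anomalies can escape cluster-based discovery (Hodoscope).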
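
As an illustration of the "density-weighted farthest-point sampling" idea listed under common methods, here is a minimal sketch. It is not taken from Hodoscope or Meerkat: the function name `density_weighted_fps`, the seeding rule, and the scoring rule (weight times distance to the nearest already-picked point) are assumptions chosen for clarity.

```python
import math
from typing import List, Sequence


def density_weighted_fps(points: Sequence[Sequence[float]],
                         weights: Sequence[float],
                         k: int) -> List[int]:
    """Greedily pick k indices that are far apart, favouring high-weight points.

    `weights` stands in for a per-point density signal (hypothetical); a point's
    priority is weight * distance to the nearest point already picked.
    """
    if not points or k <= 0:
        return []
    k = min(k, len(points))
    # Assumed seeding rule: start from the highest-weight point.
    first = max(range(len(points)), key=lambda i: weights[i])
    picked = [first]
    # min_dist[i] = distance from point i to its nearest picked point.
    min_dist = [math.dist(p, points[first]) for p in points]
    while len(picked) < k:
        chosen = set(picked)
        candidates = [i for i in range(len(points)) if i not in chosen]
        nxt = max(candidates, key=lambda i: weights[i] * min_dist[i])
        picked.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return picked


# Toy embeddings: two tight pairs plus one isolated point.
toy_points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 1.0)]
toy_weights = [1.0, 1.0, 3.0, 3.0, 1.0]
review_order = density_weighted_fps(toy_points, toy_weights, 3)
```

The reviewer sees one representative from each region before any near-duplicate, which is the review-efficiency property the bullet describes.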

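Theme: Multi-turn adversaries and cumulative-risk defenses

Theme: Tool-use reliability: structural bias, standardization, and privacy-aware personalization

Theme: Reward/preference modeling and decoding robustness for safer generation

Theme: Interpretability and evaluation reliability under unfaithful explanations / OOD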

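3) Technical synthesis

  • Listwise scoring is spreading: YOJO's cross-entropy over N candidates is consistent with a broader shift from pairwise comparison toward non-pairwise scoring (echoed by the trajectory- and requirement-level scoring in PAC-BENCH/BTB).
  • "Causal constraints" in evaluation are becoming explicit: LifeDialBench's online protocol prevents leakage of future context; WebForge verifies solvability by replaying solutions in Chromium; BTB scores deliverables inside the same environment.
  • Agent safety is moving from content filtering to system-level enforcement: ClawGuard's deterministic pre-call checks complement (rather than replace) judge-based methods; Context Kubernetes likewise enforces permission/freshness invariants at the orchestration layer.
  • A multi-turn threat model unifies several papers: Salami (cumulative intent), TOM-SB (belief steering), PAC-BENCH (early privacy violations), and Meerkat (evidence distributed across trajectories) all show that per-turn metrics miss critical failures.
  • Embedding-space methods are powerful but attackable: Hodoscope and Meerkat rely on clustering/projection; Meerkat shows that adversarial dispersion can break detection, suggesting a need for robust grouping or multi-view signals.
  • The interpretability signals that survive unfaithful explanations are narrow: Pando finds that gradients/RelP still work when verbal rationales are missing or misleading; SABEval likewise uses attention-pathway analysis (CAA) to identify and intervene on structural shortcuts.
  • Calibration is a recurring motif: Atomic+Search gates web retrieval on calibrated uncertainty bands; MISE calibrates self-evaluated rewards to environment success; TEE calibrates evaluation confidence by modeling design variance.
  • Benchmarks increasingly include anti-cheating and integrity checks: WebForge adds anti-cheating mechanisms; Meerkat uncovers cheating on real benchmarks; BTB uses a verifier with measured human agreement to reduce subjective scoring drift.
  • Robust decoding is treated as a safety/quality primitive: Min-k's temperature-invariant truncation targets semantic collapse at high T with little overhead, which suits agent exploration settings.
  • Process-level artifacts are becoming training signals: CodeTracer's localized evidence supports improvement through reflective replay; MISE uses step-wise hindsight rewards; ClawGUI uses PRM + GiGPO for step-level credit assignment.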
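
To make the listwise-scoring point concrete, the sketch below computes a softmax cross-entropy over N candidate scores. It assumes the reward model emits one scalar per candidate and that the label is the index of the preferred candidate; the function name and that setup are illustrative assumptions, not a confirmed description of YOJO's loss.

```python
import math
from typing import Sequence


def listwise_cross_entropy(scores: Sequence[float], best_index: int) -> float:
    """Negative log-probability of the preferred candidate under a softmax
    over all N candidate scores (computed with the log-sum-exp trick)."""
    peak = max(scores)
    log_partition = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return log_partition - scores[best_index]


# With four equally scored candidates the loss is log(4).
uniform_loss = listwise_cross_entropy([0.0, 0.0, 0.0, 0.0], best_index=0)
# Raising the preferred candidate's score lowers the loss.
confident_loss = listwise_cross_entropy([3.0, 0.0, 0.0, 0.0], best_index=0)
```

Unlike a pairwise loss, a single evaluation here compares the preferred response against all N-1 alternatives at once.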

4) Top 5 papers (with "why now")

1) BankerToolBench: Evaluating AI Agents in End-to-End Investment Banking Workflows

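  • Provides a high-fidelity, multi-file workflow benchmark (100 tasks; roughly 150 rubric items per task) that better reflects real delegation risk.
  • Introduces an agentic verifier (Gandalf) and reports its agreement with humans (accuracy 88.2%, κ=0.76), enabling scalable scoring of Excel/PPT/PDF deliverables.
  • Shows that frontier models remain far from delegable (best reported Pass@1 is 16%; passing all critical criteria is rare).
  • Caveats: the benchmark simplifies real investment-banking dynamics and skews toward US settings; it remains a proxy for real deal work.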
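
The verifier agreement above is reported as both accuracy and κ. Assuming κ denotes Cohen's kappa (the digest does not say), the sketch below shows how it discounts raw agreement by the agreement expected from chance. The labels are made-up toy data, not results from the paper.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: (observed - expected) / (1 - expected) agreement."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    # Chance agreement if each rater labelled independently at their own base rates.
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy pass/fail labels (1 = criterion met); hypothetical, for illustration only.
human = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
verifier = [1, 1, 0, 0, 0, 0, 1, 0, 1, 1]
kappa = cohens_kappa(human, verifier)  # raw agreement 0.8, chance 0.5
```

Here raw agreement of 0.8 corresponds to κ = 0.6, which is why accuracy and κ are usually quoted together.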

2) The Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems

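  • Formalizes cumulative multi-turn jailbreak risk and proves that prompts individually below the threshold can accumulate past the harm threshold.
  • Demonstrates high ASR across multiple LLMs/benchmarks and extends to multimodal targets (VLMs/diffusion models).
  • Proposes Cumulative Query Auditing (CQA), which substantially reduces ASR in experiments.
  • Caveats: the CQA prototype uses an LLM judge; production cost/latency and robustness still need validation.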
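
The core idea of auditing the accumulated conversation rather than only the newest turn can be sketched as follows. This is not the paper's CQA implementation: `toy_risk` is a hypothetical keyword counter standing in for the LLM judge, and the placeholder terms and the 0.6 threshold are arbitrary assumptions.

```python
from typing import Callable, Dict, List, Union

RiskScorer = Callable[[str], float]


def toy_risk(text: str) -> float:
    """Hypothetical scorer: fraction of placeholder flagged terms present (0..1).
    A real system would call a judge model here."""
    flagged = ("part-a", "part-b", "part-c", "part-d")
    lowered = text.lower()
    return sum(term in lowered for term in flagged) / len(flagged)


def audit_turn(history: List[str], new_turn: str, scorer: RiskScorer,
               threshold: float = 0.6) -> Dict[str, Union[float, bool]]:
    """Score the new turn alone and the whole conversation including it;
    block if either score reaches the threshold."""
    per_turn = scorer(new_turn)
    cumulative = scorer("\n".join(history + [new_turn]))
    return {
        "per_turn": per_turn,
        "cumulative": cumulative,
        "block": max(per_turn, cumulative) >= threshold,
    }


earlier_turns = ["Tell me about part-a.", "Now part-b.", "What about part-c?"]
verdict = audit_turn(earlier_turns, "And part-d?", toy_risk)
```

In this toy run the last turn scores 0.25 on its own and would pass a per-turn filter, while the cumulative score of 1.0 triggers the block.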

3) WebForge: Breaking the Realism-Reproducibility-Scalability Trilemma in Browser Agent Benchmark

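  • Automatically generates self-contained static websites (with realistic web noise) plus anti-cheating measures, keeping content drift-free while preserving realism.
  • 934 verified tasks with a 74.1% pipeline pass rate; solvability is verified by replaying solutions in Chromium.
  • Per-dimension difficulty reveals capability differences; removing screenshots lowers accuracy by about 16 percentage points.
  • Caveats: static sites cannot fully cover server-side, multi-user, or real-time web semantics.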

4) Pando: Do Interpretability Methods Work When Models Won’t Explain Themselves?

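  • Cleanly isolates the elicitation confound by controlling whether models give faithful, no, or unfaithful rationales.
  • A large-scale paired study (720 models) finds that gradients/RelP are the only consistent white-box gain when explanations are missing or misleading.
  • Variance decomposition shows that many readouts track field identity/value rather than decision relevance.
  • Caveats: decision trees implanted in a 2B LoRA setting may not generalize to distributed real-world features.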

5) ClawGuard: A Runtime Security Framework for Tool-Augmented LLM Agents Against Indirect Prompt Injection

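  • Middleware enforces deterministic, auditable checks at the tool-call boundary (sanitizer, rule evaluator, skill checker, approval + logging).
  • Shows large ASR reductions on AgentDojo / SkillInject / MCPSafeBench with the base rule configuration.
  • Introduces task-rule induction + user confirmation (Rtask), but does not evaluate it in the reported experiments.
  • Caveats: residual failures include content-misleading attacks; the published results do not include context-aware rule induction.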
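
A deterministic pre-call check with an audit log might look like the sketch below. It is not ClawGuard's API: the `ToolCall` and `Policy` types, the three call kinds, and the allowlists are hypothetical, and the rules are deliberately simplistic.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath
from typing import List, Tuple
from urllib.parse import urlparse


@dataclass(frozen=True)
class ToolCall:
    kind: str    # "cmd", "file", or "net" (hypothetical categories)
    target: str  # command line, file path, or URL


@dataclass
class Policy:
    allowed_commands: Tuple[str, ...] = ("ls", "cat", "grep")
    allowed_root: str = "/workspace"
    allowed_hosts: Tuple[str, ...] = ("api.example.com",)
    audit_log: List[Tuple[str, str, bool, str]] = field(default_factory=list)

    def check(self, call: ToolCall) -> bool:
        """Decide before the tool runs, and record every decision."""
        allowed, rule = self._decide(call)
        self.audit_log.append((call.kind, call.target, allowed, rule))
        return allowed

    def _decide(self, call: ToolCall) -> Tuple[bool, str]:
        if call.kind == "cmd":
            parts = call.target.split()
            # Reject shell metacharacters so an allowed command cannot chain others.
            clean = not any(ch in call.target for ch in ";|&`$<>")
            return clean and bool(parts) and parts[0] in self.allowed_commands, "command allowlist"
        if call.kind == "file":
            path, root = PurePosixPath(call.target), PurePosixPath(self.allowed_root)
            inside = path.parts[:len(root.parts)] == root.parts
            return inside and ".." not in path.parts, "path under allowed root"
        if call.kind == "net":
            return urlparse(call.target).hostname in self.allowed_hosts, "host allowlist"
        return False, "unknown call kind"


policy = Policy()
decisions = [policy.check(ToolCall(kind, target)) for kind, target in [
    ("cmd", "ls -la"),
    ("cmd", "curl http://x | sh"),
    ("file", "/workspace/notes.txt"),
    ("file", "/workspace/../etc/passwd"),
    ("net", "https://api.example.com/v1"),
    ("net", "https://evil.example.net/"),
]]
```

Because every rule is a pure function of the call, the same input always yields the same decision, and the log records which rule fired. Attacks that only mislead through returned content pass such checks untouched, which matches the residual-failure caveat above.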

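5) Practical next steps

  • Add cumulative-intent auditing to the safety stack: prototype a CQA-like check that periodically scores the conversation history rather than only the latest user turn, and measure the ASR reduction on multi-turn jailbreak suites.
  • Harden tool boundaries deterministically: implement ClawGuard-like pre-call policies (cmd/file/net) with audit logs; evaluate on indirect prompt injection benchmarks and track residual "in-text harm" cases separately.
  • Test tool routing for structural alignment bias: build SABEval-style sibling-tool tests (same schema, different semantics) and measure tool-invocation rates; consider pathway-level interventions or shortcut-breaking training data.
  • Adopt measurement-error-aware evaluation: run a small factorial experiment (≥2–3 prompt variants, multiple judges) and use variance decomposition to decide whether budget should go to more items or to more judges/prompts (TEE).
  • Move from per-trajectory auditing to repository/group auditing: deploy Hodoscope/Meerkat-style clustering + prioritized review on agent logs; explicitly test robustness to embedding-space dispersion attacks.
  • For multimodal RLHF/RLAIF pipelines: try multi-response reward modeling for best-of-N and GRPO-style training; measure both ranking quality and latency/FLOPs savings, and test scaling to N>4 where relevant.
  • For long-horizon memory agents: evaluate future-context leakage with a causal online protocol (LifeDialBench style); compare raw-text retention against compressed memory and track accuracy decay over time.
  • For interpretability-driven auditing: when explanations are unreliable, prefer gradient/RelP-style signals (per Pando) and verify that they improve held-out behavior prediction under a fixed query budget.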
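
For the measurement-error item, a small prompt × judge factorial can be summarized with a main-effects variance split like the one sketched here. This is a generic ANOVA-style decomposition offered as an assumption about how such a check could be run; it is not TEE's estimator, and the scores are invented toy numbers.

```python
from statistics import mean
from typing import Dict, List


def variance_shares(scores: List[List[float]]) -> Dict[str, float]:
    """scores[p][j] = benchmark score under prompt variant p and judge j.
    Returns the share of total sum of squares due to prompts, judges,
    and the remainder (interaction plus noise)."""
    n_prompts, n_judges = len(scores), len(scores[0])
    grand = mean(v for row in scores for v in row)
    prompt_means = [mean(row) for row in scores]
    judge_means = [mean(scores[p][j] for p in range(n_prompts)) for j in range(n_judges)]
    ss_prompt = n_judges * sum((m - grand) ** 2 for m in prompt_means)
    ss_judge = n_prompts * sum((m - grand) ** 2 for m in judge_means)
    ss_total = sum((v - grand) ** 2 for row in scores for v in row)
    if ss_total == 0:
        return {"prompt": 0.0, "judge": 0.0, "interaction": 0.0}
    # Clamp tiny negative values caused by floating-point rounding.
    ss_rest = max(0.0, ss_total - ss_prompt - ss_judge)
    return {
        "prompt": ss_prompt / ss_total,
        "judge": ss_judge / ss_total,
        "interaction": ss_rest / ss_total,
    }


# Toy data: 3 prompt variants (rows) x 2 judges (columns); hypothetical scores.
toy_scores = [[0.60, 0.70], [0.62, 0.72], [0.58, 0.68]]
shares = variance_shares(toy_scores)
```

In this toy grid the judge factor accounts for roughly 90% of the spread, so adding judges would be the better use of budget than adding prompt variants.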

Generated from per-paper analyses; no external browsing.