AI Paper Daily (2026-05-16)

Published:

English version: /paper-news/2026-05-16/

Run statistics

  • Candidate papers: 371
  • Selected papers: 30
  • Deep reads completed: 30
  • Time window (UTC): 2026-05-14T00:00:00Z → 2026-05-15T00:00:00Z (arxiv_announce, expanded=0)
Papers used for the summary (arXiv ID | title (PDF) | categories | score | selection reason | tags):

  • 2605.15030 | WARD: Adversarially Robust Defense of Web Agents Against Prompt Injections (PDF) | cs.CR, cs.AI | 95 | Robust prompt-injection defense for web agents with large-scale dataset and deployment focus. | agent-safety, prompt-injection, web-agents, defense, benchmark, security
  • 2605.14421 | MemLineage: Lineage-Guided Enforcement for LLM Agent Memory (PDF) | cs.CR, cs.AI | 95 | Cryptographic provenance for agent memory directly targets persistent prompt-injection risks. | agent-safety, memory-security, provenance, prompt-injection, guardrails
  • 2605.14746 | Selective Safety Steering via Value-Filtered Decoding (PDF) | cs.LG | 94 | Decoding-time safety steering that reduces unnecessary interventions while improving safety. | safety, alignment, decoding, steering, reliability
  • 2605.14271 | Auditing Agent Harness Safety (PDF) | cs.CL, cs.CY | 93 | Audits full agent trajectories for permission and info-flow violations beyond final outputs. | agent-safety, auditing, execution-traces, permissions, information-flow, evaluation
  • 2605.14786 | Known By Their Actions: Fingerprinting LLM Browser Agents via UI Traces (PDF) | cs.CR, cs.AI, cs.HC, cs.LG | 93 | Shows passive fingerprinting of browser agents, enabling targeted attacks on known model weaknesses. | agent-security, browser-agents, fingerprinting, privacy, adversarial
  • 2605.14865 | Holistic Evaluation and Failure Diagnosis of AI Agents (PDF) | cs.AI, cs.CL | 93 | Strong agent evaluation/diagnosis framework with span-level localization and reported SOTA gains. | agents, evaluation, diagnosis, benchmarks, reliability
  • 2605.15188 | FutureSim: Replaying World Events to Evaluate Adaptive Agents (PDF) | cs.LG, cs.AI, cs.CL | 93 | Grounded benchmark for adaptive agents in evolving real-world settings; strong eval value. | agents, evaluation, benchmark, forecasting, real-world
  • 2605.14859 | Do Coding Agents Understand Least-Privilege Authorization? (PDF) | cs.CR, cs.AI | 92 | Least-privilege benchmark for coding agents targets a core real-world deployment safety gap. | agent-safety, coding-agents, authorization, least-privilege, benchmark, security
  • 2605.14605 | One Step to the Side: Defenses Against Malicious Finetuning Fail Under Adaptive Adversaries (PDF) | cs.CR, cs.AI, cs.LG | 91 | Strong adaptive-attack critique showing current anti-malicious-finetuning defenses broadly fail. | alignment, robustness, finetuning, adaptive-attacks, safety-evaluation, open-weights
  • 2605.14750 | EVA: Editing for Versatile Alignment against Jailbreaks (PDF) | cs.CR, cs.AI | 91 | Model-editing defense against jailbreaks for LLMs/VLMs with safety-utility focus. | jailbreak-defense, alignment, model-editing, VLM, robustness
  • 2605.15040 | Orchard: An Open-Source Agentic Modeling Framework (PDF) | cs.AI, cs.CL | 91 | Open-source agentic modeling stack with sandbox primitives and scalable training recipes. | agents, frameworks, open-source, sandboxing, training
  • 2605.15109 | Neighborhoods Matter: Traversal Context and Provenance in Agentic GraphRAG (PDF) | cs.AI, cs.IR | 91 | Important Agentic GraphRAG citation-faithfulness study with trajectory-level provenance framing. | RAG, agents, evaluation, citations, provenance, factuality
  • 2605.15134 | Training ML Models with Predictable Failures (PDF) | cs.LG | 91 | Targets deployment-scale failure prediction for safety assessment with concrete training objective. | safety, evaluation, reliability, failure-prediction, deployment
  • 2605.15118 | Talk is (Not) Cheap: A Taxonomy and Benchmark Coverage Audit for LLM Attacks (PDF) | cs.CR, cs.CL | 90 | Threat-surface taxonomy and coverage audit for LLM attack benchmarks; highly reusable evaluation lens. | llm-security, taxonomy, benchmarking, attacks, evaluation, agents
  • 2605.15138 | Forgetting That Sticks: Quantization-Permanent Unlearning via Circuit Attribution (PDF) | cs.LG, cs.CL, cs.ET | 90 | Important unlearning result: quantization can undo forgetting; proposes mechanistic fix. | unlearning, quantization, model-safety, mechanistic-interpretability, deployment
  • 2605.15000 | Quantifying and Mitigating Premature Closure in Frontier LLMs (PDF) | cs.CL, cs.AI | 90 | Directly studies unsafe premature commitment in frontier LLMs with mitigation on medical tasks. | llm-safety, reliability, uncertainty, abstention, evaluation
  • 2605.15152 | Widening the Gap: Exploiting LLM Quantization via Outlier Injection (PDF) | cs.LG, cs.AI | 89 | Practical attack on advanced quantization schemes exposes deployment-time LLM security risk. | quantization, model-security, backdoor-risk, deployment, adversarial
  • 2605.14498 | GroupMemBench: Benchmarking LLM Agent Memory in Multi-Party Conversations (PDF) | cs.CL | 89 | Useful benchmark for multi-user agent memory, belief tracking, and audience-aware responses. | benchmark, agents, memory, multi-party, evaluation
  • 2605.14404 | Knowledge Beyond Language: Bridging the Gap in Multilingual Machine Unlearning Evaluation (PDF) | cs.CL | 89 | Multilingual unlearning metrics for cross-lingual privacy leakage; highly relevant to LLM safety. | unlearning, privacy, multilingual, evaluation, llm-safety
  • 2605.14454 | LiSA: Lifelong Safety Adaptation via Conservative Policy Induction (PDF) | cs.LG, cs.CL, cs.CR | 88 | Practical framework for adapting guardrails from sparse deployment feedback in agent settings. | guardrails, agent-safety, online-adaptation, policy-learning, deployment, reliability
  • 2605.15128 | MemEye: A Visual-Centric Evaluation Framework for Multimodal Agent Memory (PDF) | cs.CV, cs.CL, cs.IR | 88 | Targets multimodal agent memory with visual-grounded benchmark for evidence preservation. | multimodal, agents, memory, evaluation, benchmark
  • 2605.15155 | Self-Distilled Agentic Reinforcement Learning (PDF) | cs.LG, cs.AI, cs.CL | 88 | Post-training method for LLM agents combining RL with self-distillation for long-horizon tasks. | llm-agents, reinforcement-learning, post-training, self-distillation, reasoning
  • 2605.14290 | Web Agents Should Adopt the Plan-Then-Execute Paradigm (PDF) | cs.CR, cs.AI, cs.CL, cs.SE | 87 | Argues plan-then-execute reduces web prompt-injection control-flow risk by design. | web-agents, prompt-injection, agent-architecture, security, plan-then-execute
  • 2605.14968 | GraphFlow: An Architecture for Formally Verifiable Visual Workflows Enabling Reliable Agentic AI Automation (PDF) | cs.AI | 87 | Formally verifiable workflows for reliable agentic automation in mission-critical settings. | agents, formal-methods, reliability, workflows, verification
  • 2605.15077 | Concurrency without Model Changes: Future-based Asynchronous Function Calling for LLMs (PDF) | cs.CL, cs.AI, cs.LG | 87 | Practical agent efficiency advance: async tool calling without model changes or retraining. | agents, tool-use, systems, efficiency, function-calling
  • 2605.14483 | LEMON: Learning Executable Multi-Agent Orchestration via Counterfactual Reinforcement Learning (PDF) | cs.AI | 87 | Learns executable multi-agent orchestration with counterfactual credit assignment; useful for agents. | multi-agent, orchestration, reinforcement-learning, agents, automation
  • 2605.14604 | Sycophancy is an Educational Safety Risk: LLM Tutors Need Sycophancy Benchmarks (PDF) | cs.AI, cs.HC | 86 | Sycophancy benchmark for LLM tutors highlights a concrete, underexplored safety failure mode. | sycophancy, benchmark, education, alignment, evaluation
  • 2605.14747 | Video2GUI: Synthesizing Large-Scale Interaction Trajectories for Generalized GUI Agent Pretraining (PDF) | cs.CL, cs.AI, cs.CV, cs.LG | 86 | Large-scale GUI-agent pretraining data from unlabeled videos could materially boost agent capability. | agents, gui, pretraining, datasets, multimodal
  • 2605.14570 | Uncertainty Quantification for Large Language Diffusion Models (PDF) | cs.CL | 86 | First systematic uncertainty quantification study for language diffusion models; safety-relevant. | uncertainty, reliability, diffusion-lm, hallucination, evaluation
  • 2605.14932 | Toward Securing AI Agents Like Operating Systems (PDF) | cs.CR | 85 | OS-inspired security framing for AI agents offers useful systems perspective on isolation and privilege. | agent-security, systems, sandboxing, privilege-separation, architecture, survey

AI Paper Insights Briefing

2026-05-16

0) Executive Takeaways (read this first)

  • Agent safety evaluation is shifting from scoring final answers to trajectory-level, harness-level, and provenance-level auditing. Multiple papers show that high task-completion rates can coexist with serious boundary violations, unsafe memory reuse, or misleading citations.
  • The strongest current safety pattern is architecturally separating control from untrusted content: plan-then-execute for web agents, OS-style runtime isolation, lineage-aware memory gating, and typed/verifiable workflows all aim to remove attack paths rather than merely detect bad outputs.
  • Several papers reveal deployment-stage safety reversals: unlearning can fail after quantization, malicious behavior may activate only after quantization, and finetuning defenses break under adaptive adversaries. Safety conclusions that ignore downstream deployment transformations are becoming increasingly unreliable.
  • Memory is emerging as a first-class failure surface. Benchmark and defense studies show that current systems lose speaker anchoring, temporal validity, visual evidence, and provenance; in some settings, simple retrieval baselines still outperform complex memory-ingestion pipelines.
  • Practical mitigations are becoming more selective and calibrated: value-filtered decoding constrains unnecessary interventions, LiSA conservatively adapts guardrails from sparse feedback, and SDAR stabilizes agent reinforcement learning via gated privileged distillation.
  • Infrastructure and model quality matter just as much: open environment layers, asynchronous tool execution, large-scale GUI pretraining data, and better orchestration learning all raise agent capability, but they also widen the need for stronger runtime controls and auditing.
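The control/content separation pattern above can be made concrete with a minimal plan-then-execute sketch. Everything here (`plan`, `execute_step`, the toy task) is a hypothetical illustration, not code from any listed paper; the point is that the action sequence is fixed from the trusted task description before any untrusted page content is read, so injected text can only ever flow back as data.

```python
# Minimal plan-then-execute sketch: the plan is derived from the trusted
# task description alone, so untrusted page content seen during execution
# can never redirect control flow.

def plan(task: str) -> list[tuple[str, str]]:
    """Derive a typed action plan from the trusted task only (toy planner)."""
    if task == "find cheapest laptop":
        return [("search", "laptop"), ("sort", "price_asc"), ("read", "first_result")]
    return []

def execute_step(action: tuple[str, str], page: dict) -> str:
    """Execute one typed action; page content is treated as data, never as an instruction."""
    op, arg = action
    if op == "read":
        return str(page.get(arg, ""))
    return f"{op}({arg}) ok"

def run(task: str, pages: list[dict]) -> list[str]:
    steps = plan(task)  # control flow is fixed before any page is seen
    return [execute_step(a, p) for a, p in zip(steps, pages)]

# Even if a page contains an injected instruction, it is only returned as data:
pages = [{}, {}, {"first_result": "IGNORE PREVIOUS INSTRUCTIONS"}]
results = run("find cheapest laptop", pages)
```

In a ReAct-style loop, by contrast, the injected string would sit in the same context that decides the next action, which is exactly the fragility the plan-then-execute papers argue against.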

2) Key Themes (clustered)

Theme: Agent safety is shifting from outputs to execution trajectories

Theme: Structural defenses are replacing prompt-only defenses for web and agent security

Theme: Prompt injection remains the core problem, but defenses are diversifying

Theme: Memory has become both a core capability and a core attack surface

Theme: Deployment transformations are breaking many safety assumptions

Theme: Better agent infrastructure is raising capability and clarifying the bottlenecks

3) Technical Synthesis

  • Multiple papers converge on a control/data separation principle: PTE isolates control flow from web content; MemLineage separates provenance from content; GraphFlow separates verifiable structure from runtime nondeterminism; OS-style agent security separates runtime enforcement from model intent.
  • Evaluation is becoming trajectory-native: HarnessAudit normalizes trajectories into a unified schema, TRAIL-style diagnosis scores leaf spans, and the GraphRAG provenance study tests answer dependence via graph ablation rather than merely checking citations.
  • Several works replace monolithic judgments with factored scoring: safety compliance vs. task completion, retrieval failure vs. reasoning failure, sufficiency vs. tightness, or broad vs. local policy memory.
  • A recurring failure mode is utility-safety misalignment: high task completion can coexist with boundary violations, over-broad permissions, stale memory retrieval, or unsafe intermediate actions.
  • Memory papers consistently show that ingestion is the bottleneck: GroupMemBench finds retrieval failures dominate; MemEye shows stale evidence selection and caption loss; MemLineage shows that losing provenance enables "laundering" attacks.
  • Deployment robustness increasingly requires post-transformation evaluation: quantization can change unlearning outcomes and activate hidden attacks, so it should be treated as part of the threat model, not as a downstream implementation detail.
  • Several methods adopt selective intervention over blanket steering: value-filtered decoding intervenes only above a threshold, LiSA gates broad policies with Beta-posterior confidence, and SDAR gates token-level distillation with a detached sigmoid.
  • There is a clear shift toward typed interfaces and structured artifacts: YAML orchestration specs, typed site APIs, future-valued function schemas, CBOR provenance entries, and explicit permission allowlists all make agent behavior more auditable.
  • Adaptive evaluation is becoming the baseline expectation: WARD uses attacker-guard co-evolution, SIDESTEPPER attacks MFT defenses with mixed objectives, and the browser-agent fingerprinting study considers retraining-aware adversaries.
  • Systems papers show that execution-layer changes deliver substantial gains without model changes: AsyncFC speeds up tool use, Orchard cuts rollout cost and latency, and PTE reframes web-agent safety as an architectural choice rather than a robustness patch.
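The selective-intervention pattern (intervene only above a threshold) can be sketched in a few lines. The risk scorer, threshold, and steering offsets below are invented stand-ins for illustration, not the mechanism of value-filtered decoding, LiSA, or SDAR specifically:

```python
# Selective steering sketch: adjust next-token scores only when a
# safety-value estimate crosses a threshold, leaving benign steps untouched.
# The scorer, threshold, and offsets are hypothetical stand-ins.

def steer_if_risky(logits: dict[str, float], risk: float,
                   steer: dict[str, float], threshold: float = 0.5) -> dict[str, float]:
    """Apply steering offsets only when estimated risk exceeds the threshold."""
    if risk < threshold:
        return dict(logits)  # no intervention on benign steps
    return {t: v + steer.get(t, 0.0) for t, v in logits.items()}

logits = {"sure": 2.0, "refuse": 0.5}
steer = {"refuse": 3.0}   # push probability mass toward refusal when risky

benign = steer_if_risky(logits, risk=0.1, steer=steer)
risky = steer_if_risky(logits, risk=0.9, steer=steer)
```

The design point the papers share is that the unnecessary-intervention rate (how often `benign` paths get modified) is measured alongside the safety gain, rather than steering every step.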

4) Top 5 Papers (with "why now")

  • Auditing Agent Harness Safety
    • Reframes agent safety as a trajectory-level harness problem rather than an output-level one.
    • Introduces HarnessAudit-Bench with 210 tasks across 8 domains and 525 perturbation cases.
    • Finds that task completion aligns poorly with safe execution, and that multi-agent settings amplify violations.
    • Useful now because many teams are shipping multi-agent/tool-use systems with little visibility into intermediate trajectory failures.
    • Caveats / limitations: the paper excels at exposing failures, but mitigation strategies are not its main focus.
  • Web Agents Should Adopt the Plan-Then-Execute Paradigm
    • Makes a strong architectural claim: on the web, ReAct is inherently fragile because untrusted content appears exactly where action decisions are made.
    • Shows that, under a trusted-API assumption, all 860 WebArena tasks are PTE-compatible, with 81.28% solvable without runtime LLM subroutines.
    • Useful now because prompt injection on the web is increasingly a deployment blocker, and this offers a structural alternative rather than yet another detector.
    • Caveats / limitations: deployability depends heavily on complete, trusted, typed APIs or well-maintained SDKs.
  • WARD: Adversarially Robust Defense of Web Agents Against Prompt Injections
    • Combines a large-scale multimodal dataset, guard-targeted attack training, and adaptive adversarial training into a compact guard model.
    • Reports strong OOD detection, 100% recall on guard-targeted injections after PIG training, low false-positive rates, and efficient parallel deployment.
    • Useful now because it is one of the more deployment-oriented web-agent defenses in this batch.
    • Caveats / limitations: explicitly does not cover pixel-level imperceptible attacks, and UI that masquerades as task-aligned remains a failure mode.
  • Forgetting That Sticks: Quantization-Permanent Unlearning via Circuit Attribution
    • Shows that standard unlearning can be undone by 4-bit quantization because the updates are too small to survive binning.
    • Proposes MANSU, combining circuit localization, constrained null-space projection, and a magnitude floor so that unlearning survives NF4 quantization.
    • Useful now because many release pipelines quantize models after safety work, making pre-quantization unlearning conclusions incomplete.
    • Caveats / limitations: the evidence is concentrated on 8B-scale models and factual-recall benchmarks, and the compute cost is nontrivial.
  • Widening the Gap: Exploiting LLM Quantization via Outlier Injection
    • Reveals a supply-chain-style attack: a seemingly benign full-precision model becomes malicious only after the user quantizes it.
    • Demonstrates high post-quantization ASR on practical quantizers including GPTQ and AWQ, while preserving full-precision utility.
    • Useful now because quantized model distribution is ubiquitous yet often treated as a benign compression step rather than an attack surface.
    • Caveats / limitations: requires white-box access and weight modification before release, so threat relevance depends on model provenance and distribution channels.
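The quantization failure mode behind the last two entries can be reproduced with a toy uniform 4-bit quantizer (an assumption for illustration; the papers target NF4, GPTQ, and AWQ): any update smaller than half a bin width rounds back to the original level and is silently erased.

```python
# Toy illustration of why small unlearning updates can vanish under
# quantization: uniform 4-bit rounding maps nearby values to the same bin.
# This uniform quantizer is a stand-in, not the NF4 scheme from the paper.

def quantize_4bit(w: float, scale: float = 1.0) -> float:
    """Round to one of 16 uniform levels spanning [-scale, scale]."""
    step = 2 * scale / 15  # 16 levels -> 15 intervals
    return round(w / step) * step

w = 0.40            # original weight
small_delta = 0.03  # unlearning update below half a bin width (~0.067)
large_delta = 0.08  # update above half a bin width

# The small update is erased by binning; the large one survives.
same_bin = quantize_4bit(w) == quantize_4bit(w + small_delta)
big_delta_survives = quantize_4bit(w) != quantize_4bit(w + large_delta)
```

This is also the flip side exploited by the outlier-injection attack: weight perturbations can be crafted so that full-precision behavior is benign while the post-binning values land in malicious configurations.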

5) Practical Next Steps

  • Add trajectory logging and hidden-policy audits to agent evaluations; do not treat final-answer success as a proxy for safety.
  • For web agents, prototype plan-then-execute on a set of high-value sites with typed APIs/SDKs, and compare safety and latency against ReAct.
  • Treat memory as a security boundary: add provenance metadata, trust labels, and sensitive-action gating before allowing memory-driven actions.
  • Evaluate all unlearning and safety edits after deployment transformations: at minimum, test post-quantization, post-distillation, and multilingual recovery paths.
  • Red-team MFT defenses with adaptive, mixed-objective attackers, not just finetuning that only optimizes a harmful loss.
  • Benchmark memory systems on speaker anchoring, temporal validity, and visual-evidence preservation; compare against simple BM25 or raw retrieval baselines before introducing complex ingestion.
  • Use selective, calibrated steering wherever possible: measure not only refusal/safety gains but also the rate of unnecessary interventions.
  • If building agent infrastructure, separate concerns explicitly: environment services, harness, planner, executor, and policy enforcement should be independently testable and replaceable.
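The trajectory-audit and allowlist recommendations above can be sketched as a minimal post-hoc check over logged steps. The trajectory schema and tool names are hypothetical; the point is that violations are caught even when the final answer looks fine:

```python
# Trajectory-audit sketch: flag any logged agent action that falls outside
# an explicit permission allowlist, independent of final-answer quality.
# The trajectory schema and tool names are hypothetical.

ALLOWED = {"search", "read_file"}  # explicit per-task allowlist

def audit_trajectory(steps: list[dict]) -> list[dict]:
    """Return the logged steps that violate the allowlist."""
    return [s for s in steps if s["tool"] not in ALLOWED]

trajectory = [
    {"tool": "search", "arg": "docs"},
    {"tool": "read_file", "arg": "README.md"},
    {"tool": "delete_file", "arg": "config.yaml"},  # boundary violation
]

violations = audit_trajectory(trajectory)
```

An output-level evaluation would score this trajectory as a success if the task completed; the audit surfaces the over-privileged `delete_file` call that the harness-safety papers argue must be caught.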

Generated from per-paper analysis; no external browsing was performed.