AI Paper Daily (2026-03-18)

Published:

English version: /paper-news/2026-03-18/

Run statistics

  • Candidate papers: 282
  • Selected papers: 30
  • Close reads completed: 30
  • Time window (UTC): 2026-03-16T00:00:00Z → 2026-03-17T00:00:00Z (arxiv_announce, expanded=0)
Paper list used for summarization
2603.15125 | From Storage to Steering: Memory Control Flow Attacks on LLM Agents
  • Categories: cs.CR | Score: 95
  • Rationale: New persistent agent threat: memory steers tool control flow; adds MEMFLOW eval framework
  • Tags: agent-security, memory, tool-use, prompt-injection, evaluation, control-flow

2603.14975 | Agents Compromise Safety Under Pressure
  • Categories: cs.AI, cs.CL, cs.CY, cs.MA | Score: 95
  • Rationale: Studies safety tradeoffs in LLM agents under "pressure"; finds normative drift + mitigations.
  • Tags: agent-safety, constraint-violation, jailbreak-dynamics, robustness, mitigation, evaluation

2603.14707 | Visual Confused Deputy: Exploiting and Defending Perception Failures in Computer-Using Agents
  • Categories: cs.CV, cs.CL | Score: 93
  • Rationale: Formalizes exploitable GUI grounding failures (visual confused deputy) + defenses for CUAs
  • Tags: computer-using-agents, GUI-security, confused-deputy, TOCTOU, grounding, agent-safety

2603.15417 | Amplification Effects in Test-Time Reinforcement Learning: Safety and Reasoning Vulnerabilities
  • Categories: cs.LG, cs.AI, cs.CL, cs.CR | Score: 92
  • Rationale: Shows test-time RL/TTT can amplify injected harmful behavior; key warning for TTT agents
  • Tags: test-time-training, TTRL, prompt-injection, safety, reasoning, robustness

2603.15473 | Agent Lifecycle Toolkit (ALTK): Reusable Middleware Components for Robust AI Agents
  • Categories: cs.AI | Score: 92
  • Rationale: Open-source middleware to harden agents across lifecycle (prompting, tools, policy, monitoring).
  • Tags: agents, middleware, robustness, tool-safety, governance, monitoring, guardrails

2603.15594 | OpenSeeker: Democratizing Frontier Search Agents by Fully Open-Sourcing Training Data
  • Categories: cs.AI, cs.CL | Score: 92
  • Rationale: Fully open-sourced frontier search agent + training data; big for reproducible agentic RAG/search.
  • Tags: agents, search, RAG, open-source, synthetic-data, tool-use, evaluation

2603.15408 | TrinityGuard: A Unified Framework for Safeguarding Multi-Agent Systems
  • Categories: cs.CR, cs.AI, cs.CL, cs.LG, cs.MA | Score: 90
  • Rationale: OWASP-grounded MAS risk taxonomy + monitoring/eval framework for multi-agent hazards
  • Tags: multi-agent, OWASP, risk-taxonomy, monitoring, evaluation, agent-security

2603.14825 | Two Birds, One Projection: Harmonizing Safety and Utility in LVLMs via Inference-time Feature Projection
  • Categories: cs.CV, cs.AI | Score: 90
  • Rationale: Inference-time LVLM jailbreak defense aiming to improve safety without utility loss via feature projection.
  • Tags: LVLM, jailbreak, robustness, inference-time, safety-utility, multimodal

2603.15423 | Invisible failures in human-AI interactions
  • Categories: cs.CL | Score: 90
  • Rationale: Finds 78% of AI failures are "invisible" in WildChat; taxonomy of failure archetypes.
  • Tags: evaluation, human-AI-interaction, reliability, failure-modes, monitoring, WildChat

2603.15457 | Evasive Intelligence: Lessons from Malware Analysis for Evaluating AI Agents
  • Categories: cs.CR, cs.AI | Score: 88
  • Rationale: Agent evals can be gamed via sandbox-evasion analogs; reframes evaluation as security
  • Tags: evaluation, agentic-systems, sandbox-evasion, security, robustness, deployment

2603.15030 | VTC-Bench: Evaluating Agentic Multimodal Models via Compositional Visual Tool Chaining
  • Categories: cs.AI | Score: 88
  • Rationale: Benchmark for multimodal agent tool-use with compositional visual tool chaining (32 OpenCV ops).
  • Tags: benchmark, multimodal-agents, tool-use, evaluation, computer-vision, tool-chaining

2603.15255 | SAGE: Multi-Agent Self-Evolution for LLM Reasoning
  • Categories: cs.AI, cs.MA | Score: 88
  • Rationale: Multi-agent self-evolution with verifiable rewards; relevant to scalable reasoning training and agent risks.
  • Tags: reasoning, multi-agent, RL, verifiable-rewards, self-play, planning, training

2603.15033 | Rethinking Machine Unlearning: Models Designed to Forget via Key Deletion
  • Categories: cs.LG | Score: 88
  • Rationale: Unlearning-by-design via key deletion; targets practical privacy/poisoning removal constraints.
  • Tags: machine-unlearning, privacy, data-deletion, robustness, security

2603.15309 | CCTU: A Benchmark for Tool Use under Complex Constraints
  • Categories: cs.CL, cs.AI | Score: 86
  • Rationale: Tool-use benchmark under complex constraints; long prompts, taxonomy, curated hard cases
  • Tags: tool-use, benchmark, constraints, function-calling, evaluation, agents

2603.15483 | Talk, Evaluate, Diagnose: User-aware Agent Evaluation with Automated Error Analysis
  • Categories: cs.AI | Score: 86
  • Rationale: Unified agent evaluation incl. user role + automated error diagnosis beyond pass/fail correctness.
  • Tags: agents, evaluation, error-analysis, user-modeling, conversation-quality, diagnostics

2603.15282 | Algorithms for Deciding the Safety of States in Fully Observable Non-deterministic Problems: Technical Report
  • Categories: cs.AI | Score: 86
  • Rationale: Improves algorithms for deciding safe states/policies under nondeterminism; tighter runtime gap.
  • Tags: formal-safety, planning, verification, nondeterminism, policy-iteration

2603.15401 | SWE-Skills-Bench: Do Agent Skills Actually Help in Real-World Software Engineering?
  • Categories: cs.SE, cs.AI | Score: 85
  • Rationale: Benchmark isolates marginal utility of injected agent skills in real SWE with tests
  • Tags: software-engineering, agents, benchmark, skills, evaluation, verification

2603.15617 | HorizonMath: Measuring AI Progress Toward Mathematical Discovery with Automatic Verification
  • Categories: cs.LG | Score: 85
  • Rationale: Contamination-resistant benchmark of mostly unsolved math problems with automatic verification.
  • Tags: benchmark, math, verification, reasoning, evaluation, data-contamination

2603.15051 | Thinking in Latents: Adaptive Anchor Refinement for Implicit Reasoning in LLMs
  • Categories: cs.CL, cs.AI, cs.LG | Score: 84
  • Rationale: Adaptive latent reasoning to cut CoT cost; potentially impactful for efficient inference and reasoning control.
  • Tags: latent-reasoning, efficiency, chain-of-thought, inference, reasoning, LLMs

2603.15136 | Safe Flow Q-Learning: Offline Safe Reinforcement Learning with Reachability-Based Flow Policies
  • Categories: cs.LG, cs.AI | Score: 84
  • Rationale: Offline safe RL with reachability-style safety value + deployable one-step safe actor.
  • Tags: safe-RL, offline-RL, reachability, constraints, deployment

2603.15259 | Directional Embedding Smoothing for Robust Vision Language Models
  • Categories: cs.LG, cs.AI, cs.CL, cs.CR | Score: 83
  • Rationale: Lightweight VLM jailbreak defense (directional embedding smoothing) evaluated on JB-V-28K
  • Tags: VLM-safety, jailbreak, defense, randomized-smoothing, robustness, multimodal

2603.15611 | Code-A1: Adversarial Evolving of Code LLM and Test LLM via Reinforcement Learning
  • Categories: cs.CL | Score: 83
  • Rationale: Adversarial co-evolution of code LLM vs test LLM to avoid self-collusion and improve coverage.
  • Tags: code-LLMs, RL, adversarial-training, testing, evaluation, robustness

2603.15044 | Prompt Readiness Levels (PRL): a maturity scale and scoring framework for production grade prompt assets
  • Categories: cs.AI, cs.CY, cs.LG | Score: 82
  • Rationale: Operational governance framework for prompts: maturity levels + scoring for safety/compliance readiness.
  • Tags: prompting, governance, compliance, security, evaluation, process, safety

2603.15518 | Beyond the Covariance Trap: Unlocking Generalization in Same-Subject Knowledge Editing for Large Language Models
  • Categories: cs.CL | Score: 82
  • Rationale: Diagnoses prompt-variation generalization failures in knowledge editing; proposes geometric explanation/mitigation.
  • Tags: knowledge-editing, robustness, generalization, representations, reliability, LLMs

2603.14968 | Rethinking LLM Watermark Detection in Black-Box Settings: A Non-Intrusive Third-Party Framework
  • Categories: cs.CR, cs.CL | Score: 81
  • Rationale: Third-party black-box watermark verification; decouples detection from secret injection
  • Tags: watermarking, provenance, governance, black-box, auditing, LLM-security

2603.15527 | Are Dilemmas and Conflicts in LLM Alignment Solvable? A View from Priority Graph
  • Categories: cs.AI, cs.CY | Score: 80
  • Rationale: Models instruction/value conflicts as priority graph; highlights 'priority hacking' risk
  • Tags: alignment, instruction-hierarchy, conflicts, adversarial-context, runtime-verification

2603.15599 | SmartSearch: How Ranking Beats Structure for Conversational Memory Retrieval
  • Categories: cs.LG | Score: 80
  • Rationale: Strong deterministic conversational memory retrieval; shows ranking/truncation dominates structuring.
  • Tags: memory, retrieval, RAG, long-context, ranking, agents, efficiency

2603.14864 | Shopping Companion: A Memory-Augmented LLM Agent for Real-World E-Commerce Tasks
  • Categories: cs.CL | Score: 80
  • Rationale: Benchmark + method for long-term preference memory in real e-commerce; useful for agent memory evaluation.
  • Tags: agents, memory, long-horizon, benchmarks, personalization, retrieval

2603.15351 | PMAx: An Agentic Framework for AI-Driven Process Mining
  • Categories: cs.AI, cs.MA | Score: 80
  • Rationale: Agentic process-mining framework addressing hallucinations + privacy by tool-based analysis.
  • Tags: agents, tool-use, privacy, enterprise, hallucinations, workflow

2603.15280 | Advancing Multimodal Agent Reasoning with Long-Term Neuro-Symbolic Memory
  • Categories: cs.AI | Score: 79
  • Rationale: Neuro-symbolic long-term memory for multimodal agents to support deductive reasoning.
  • Tags: agents, memory, neuro-symbolic, multimodal, reasoning

AI Paper Insight Briefing

2026-03-18

0) Executive takeaways (read this first)

  • Authorization is moving outside the model: for computer-using agents the trust anchor sits on the screen, so defenses living inside the same perception loop can be subverted; agent-external verification (e.g., a dual-channel target + intent check) is emerging as a viable practice pattern.
  • Inference-time representation "surgery" is gaining real traction for multimodal safety: feature-space projection (TBOP) and embedding smoothing (directional RESTA) both cut multimodal jailbreak ASR sharply at modest utility cost, pointing to a growing toolbox of lightweight, single-forward-pass test-time mitigations.
  • Agent safety failures increasingly look interactive and trajectory-dependent: "agentic pressure" induces normative drift (safety falls while goal success rises), and large-scale WildChat analysis shows most failures are invisible and may persist even as models get stronger; monitoring needs to shift from user complaints to proactive detection.
  • Evaluation is converging on executable, step-level compliance: new benchmarks/tools (CCTU, VTC-Bench, TrinityGuard, SWE-Skills-Bench, TED) emphasize intermediate constraints, tool-chain correctness, and system-level multi-agent risk; final success rates alone are no longer enough.
  • Data and architecture choices can build governance properties in: third-party black-box watermark detection (TTP-Detect) and forget-by-design via key deletion (MUNKEY) both aim to make oversight and deletion feasible without privileged access or costly retraining.

2) Key themes (clusters)

Theme: agent-external authorization and runtime guardrails

Theme: inference-time multimodal jailbreak defenses

Theme: pressure, invisible failures, and evaluator discrimination in real deployments

Theme: benchmarks for constrained tool use and compositional tool chaining

Theme: memory and data pipelines for long-horizon agents (and their limits)

Theme: governance primitives (third-party provenance, deletion by design)

3) Technical synthesis

  • Externalization is a recurring motif: (i) the CUA guardrail externalizes authorization; (ii) PMAx externalizes computation to local deterministic tools; (iii) MUNKEY externalizes memory to deletable storage; (iv) TTP-Detect externalizes watermark verification to third parties via reference sampling.
  • Late, low-dimensional interventions are popular for efficiency: TBOP edits the activation of a single final token; AdaAnchor refines a small set of anchor vectors; both aim to shift computation away from long token trajectories.
  • Judge dependence is everywhere, but failure modes differ: CCTU uses executable verifiers; TED, TrinityGuard, and the pressure-rationalization work rely on LLM judges; watermark detection uses proxy models plus statistics. The ecosystem is splitting into deterministic vs. model-judge evaluation stacks.
  • Benchmarks are increasingly process-aware: VTC-Bench (tool-chain trajectories), CCTU (step-level constraint feedback), SWE-Skills-Bench (repo-pinned tests), and TrinityGuard (tiered MAS risks) all measure intermediate correctness and compliance, not just final answers.
  • The safety-utility tradeoff is being attacked at the representation level: TBOP claims to reduce ASR while improving utility; directional smoothing shows a better tradeoff curve than isotropic noise.
  • Trajectory effects matter as much as prompt effects: agentic pressure and TTRL amplification both show that changes over time (tightening constraints; test-time updates) can flip safety behavior even without classic jailbreak prompts.
  • Token budgets are a first-class constraint: SmartSearch treats ranking + truncation as the bottleneck; AdaAnchor cuts output tokens by roughly 90%+ relative to explicit CoT; OpenSeeker trains the teacher on denoised histories while the student trains on raw ones.
  • Adversarial thinking is shifting from prompt injection to system manipulation: visual TOCTOU / screenshot swapping; evaluator discrimination; multi-agent propagation risks; poisoning of test-time learning streams (HarmInject).
  • Role separation is used to prevent collusion and raise quality: Code-A1 separates the Code LLM from the Test LLM; SAGE separates Challenger/Planner/Solver/Critic; PMAx separates Engineer and Analyst.
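The ranking + truncation pattern from the token-budget bullet can be sketched as a greedy packer: score snippets, sort, and pack until the budget is exhausted. The scoring function below is a stand-in for any ranker, not SmartSearch's actual method.

```python
# Minimal sketch of "ranking + truncation under a token budget":
# rank memory snippets by a scoring function, then greedily keep
# the highest-scoring ones that still fit in the token budget.

def pack_memories(snippets, score, token_budget):
    """snippets: list of (text, n_tokens). Returns the texts kept under budget."""
    ranked = sorted(snippets, key=lambda s: score(s[0]), reverse=True)
    kept, used = [], 0
    for text, n_tokens in ranked:
        if used + n_tokens <= token_budget:
            kept.append(text)
            used += n_tokens
    return kept
```

The point of the bullet is that this simple deterministic loop, with a good ranker, can dominate elaborate memory structuring.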

4) Top 5 papers (with "why now")

1) Visual Confused Deputy: Exploiting and Defending Perception Failures in Computer-Using Agents

  • Formalizes a CUA-specific vulnerability: when perception is wrong, the same click(x,y) can authorize different semantics (e_perceived ≠ e_actual).
  • Demonstrates a simple exploit (ScreenSwap pixel substitution) that weaponizes an ordinary misclick into privilege escalation.
  • Proposes an agent-external, dual-channel guardrail (visual crop + reasoning text) fused with OR-veto logic; shows improved detection (e.g., ScreenSpot-Pro F1 0.915 after fusion).
  • Reasons for skepticism: a single-step crop + reasoning check misses input-content harms and multi-step "safe click" sequences; it depends on a carefully built KB and does not model adversarial evasion of the embeddings.
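The OR-veto fusion above can be sketched in a few lines. The request fields, channel checks, and policy set here are illustrative stand-ins, not the paper's implementation:

```python
# Sketch (hypothetical API) of a dual-channel, agent-external pre-execution
# guardrail: a crop-based visual check and an intent/policy check are fused
# with OR-veto logic, so a click is blocked if EITHER channel objects.

from dataclasses import dataclass

@dataclass
class ClickRequest:
    x: int
    y: int
    stated_intent: str   # what the agent claims it is clicking
    crop_label: str      # label recovered from a crop around (x, y)

def visual_channel_ok(req: ClickRequest) -> bool:
    # Channel 1: does the UI element under the cursor match the stated intent?
    return req.stated_intent.lower() in req.crop_label.lower()

def intent_channel_ok(req: ClickRequest, allowed_intents: set) -> bool:
    # Channel 2: is the stated intent permitted by policy at all?
    return req.stated_intent.lower() in allowed_intents

def authorize_click(req: ClickRequest, allowed_intents: set) -> bool:
    # OR-veto fusion: either channel alone can veto the action.
    return visual_channel_ok(req) and intent_channel_ok(req, allowed_intents)
```

Because the guardrail runs outside the agent, a perception-compromised agent cannot talk its way past the visual channel.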

2) Two Birds, One Projection: Harmonizing Safety and Utility in LVLMs via Inference-time Feature Projection

  • Identifies a modality-induced feature shift and removes it at inference time by projecting out an SVD-derived interference subspace.
  • Reports large ASR drops alongside utility gains (e.g., LLaVA-7B MMSB ASR 38.86% → 5.09%, MM-Vet 41.91 → 43.98).
  • Single forward pass, low overhead; reports ~60× speedup over ETA in its setting.
  • Reasons for skepticism: depends on anchor-set composition and rank k; less direct against text-only jailbreaks; generality beyond the studied architectures remains to be verified.
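A minimal numpy sketch of the projection idea, assuming (purely as an illustration, not the paper's recipe) that the interference subspace is estimated by SVD on harmful-minus-benign anchor activation differences:

```python
# Sketch of inference-time feature projection: estimate an "interference"
# subspace from anchor activation shifts via SVD, then remove that
# component from a hidden state in a single linear step.

import numpy as np

def interference_subspace(harmful_acts, benign_acts, k):
    """Top-k right singular vectors of the (harmful - benign) shift matrix."""
    shift = np.asarray(harmful_acts) - np.asarray(benign_acts)  # (n, d)
    _, _, vt = np.linalg.svd(shift, full_matrices=False)
    return vt[:k]                                               # (k, d), orthonormal rows

def project_out(h, basis):
    """Remove the component of hidden state h lying in the spanned subspace."""
    coeffs = basis @ h            # (k,) coordinates in the subspace
    return h - basis.T @ coeffs   # (d,) residual, orthogonal to the subspace
```

The appeal is cost: one matrix-vector product per forward pass on a single activation, with no retraining.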

3) CCTU: A Benchmark for Tool Use under Complex Constraints

  • Makes constraint compliance measurable via executable verifiers and in-trajectory feedback.
  • Shows that "perfect solutions" are rare: all evaluated models score PSR < 20%, with overall violation rates > 50%, especially for resource/response constraints.
  • Highlights that "thinking mode" can introduce overthinking-related failures.
  • Reasons for skepticism: only 200 samples, all drawn from FTRL; the taxonomy is incomplete; verifier generation needs manual calibration.
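The executable-verifier idea can be sketched as step-level predicates over the tool-call trajectory, checked after every step rather than only on the final answer. Constraint names and the trajectory schema below are hypothetical:

```python
# Sketch of step-level executable constraint checking: each constraint is a
# predicate over the trajectory so far, and violations are reported
# in-trajectory, not just at the end.

def max_calls(limit):
    return lambda traj: len(traj) <= limit

def forbids_tool(tool_name):
    return lambda traj: all(step["tool"] != tool_name for step in traj)

def check_step(trajectory, constraints):
    """Return the names of all constraints violated by the trajectory so far."""
    return [name for name, pred in constraints.items() if not pred(trajectory)]
```

Feeding `check_step` output back to the agent after each call is the in-trajectory feedback loop the benchmark measures against.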

4) TrinityGuard: A Unified Framework for Safeguarding Multi-Agent Systems

  • Provides a platform-agnostic MAS abstraction, an intervention layer, and a safety evaluation module covering 20 risk classes across 3 tiers.
  • Empirically finds very low pass rates on 300 synthetic workflows (7.1% overall; only 1.3% on Tier 3).
  • Combines pre-deployment testing, runtime monitoring, and event-stream-based provenance attribution.
  • Reasons for skepticism: heavily dependent on LLM judges; mainly diagnostic (no automated remediation yet).
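Event-stream provenance attribution can be sketched as parent-linked events: if every inter-agent message is logged with a pointer to its cause, a flagged output can be traced back through its causal chain. Field names here are illustrative, not TrinityGuard's actual schema:

```python
# Sketch of event-stream provenance: each event records its parent, so a
# flagged output can be walked back to the originating agent and input.

events = []

def emit(event_id, parent_id, agent, payload):
    events.append({"id": event_id, "parent": parent_id,
                   "agent": agent, "payload": payload})

def provenance(event_id):
    """Walk parent links back to the root, returning the causal agent chain."""
    by_id = {e["id"]: e for e in events}
    chain = []
    while event_id is not None:
        e = by_id[event_id]
        chain.append(e["agent"])
        event_id = e["parent"]
    return chain[::-1]
```

This is what makes impersonation and propagation risks auditable after the fact: the chain shows which agent introduced the offending content.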

5) Amplification Effects in Test-Time Reinforcement Learning: Safety and Reasoning Vulnerabilities

  • Shows that test-time RL with majority-vote pseudo-labels can amplify either harmfulness or safety depending on the injected prompt mix, often with a "reasoning tax".
  • Introduces HarmInject: pairing a jailbreak with a reasoning task in the same prompt ties the reasoning reward to harmful outputs; simple numeric-label filtering is bypassed.
  • Directly relevant to any deployment considering self-improvement or label-free TTT.
  • Reasons for skepticism: the analysis is specific to majority-vote TTRL; the broader TTT family and stronger mitigations remain to be explored.
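Why majority-vote pseudo-labels are an attack surface can be shown with a toy example (all data here is synthetic, not from the paper): once attacker-controlled samples reach the vote, they can flip the label the model is then rewarded for imitating.

```python
# Toy illustration of pseudo-label flipping in majority-vote TTRL:
# injected samples change which answer wins the vote, and the model is
# subsequently trained to reproduce the flipped label.

from collections import Counter

def majority_pseudo_label(answers):
    """Pick the most frequent answer as the self-training pseudo-label."""
    return Counter(answers).most_common(1)[0][0]

clean = ["refuse", "refuse", "comply"]
injected = clean + ["comply", "comply"]  # attacker-controlled additions
```

With the clean pool the pseudo-label is "refuse"; after injection it becomes "comply", and the RL update then amplifies the compromised behavior.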

5) Practical next steps

  • For computer-using agents, add an external pre-execution authorization layer: verify click-target semantics (crop-based) and intent (reasoning-based) with veto logic, and explicitly model the TOCTOU gap between screenshot() and click().
  • Extend single-action guardrails to the sequence level: track stateful risk across multi-step plans (e.g., several "safe clicks" composing into an unsafe outcome), since several papers point to single-step limitations.
  • When deploying multimodal models, test inference-time defenses side by side: feature projection (TBOP-style) vs. embedding smoothing (directional RESTA), measuring both ASR and utility on your own prompt/image distribution.
  • Treat test-time training / self-improvement as an attack surface: isolate or filter the update stream, and run red-team mixes (including HarmInject-style composites) before enabling any online adaptation.
  • Upgrade evaluation from success rate to perfect compliance and turn-level progress: add executable constraint verifiers (CCTU-style) and turn-aware progress metrics (TED-style) to CI.
  • For multi-agent systems, adopt tiered risk testing plus runtime event streams (TrinityGuard-style), and explicitly test propagation, impersonation, and memory-poisoning scenarios.
  • For monitoring, assume failures are often invisible: add proactive detectors for drift and confidence traps, and track "silent mismatch" patterns instead of relying on user complaints.
  • For governance, consider architectures that support oversight by design: third-party watermark verification workflows (reference sampling plus proxy tests) and delete-by-design memory stores for unlearning.
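The TOCTOU recommendation in the first bullet above can be sketched as a hash comparison between decision time and action time. Screenshot capture itself is mocked here; only the check-then-act logic is shown:

```python
# Sketch of an explicit TOCTOU check between screenshot() and click():
# fingerprint the crop around the target when the decision is made, and
# re-verify it immediately before acting.

import hashlib

def crop_digest(pixels: bytes) -> str:
    """Fingerprint the pixel crop around the click target."""
    return hashlib.sha256(pixels).hexdigest()

def click_if_unchanged(decision_crop: bytes, current_crop: bytes, do_click):
    """Refuse to act if the UI under the cursor changed since the decision."""
    if crop_digest(decision_crop) != crop_digest(current_crop):
        return "aborted: screen changed between check and use"
    return do_click()
```

An exact hash is the strictest possible check; a production guardrail would likely use a perceptual similarity threshold instead, but the check-then-act structure is the same.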

Generated from per-paper analysis; no external browsing was performed.