AI Paper Daily Digest (2026-04-11)
Published:
English version: /paper-news/2026-04-11/
Run statistics
- Candidate papers: 285
- Selected papers: 30
- Deep reads completed: 30
- Time window (UTC): 2026-04-09T00:00:00Z → 2026-04-10T00:00:00Z (arxiv_announce, expanded=0)
Paper list used for the summary:
| arXiv ID | Title / Link | Categories | Score | Selection rationale | Tags |
|---|---|---|---|---|---|
| 2604.08407 | Your Agent Is Mine: Measuring Malicious Intermediary Attacks on the LLM Supply Chain | cs.CR | 95 | First systematic study of malicious LLM API routers: injection+secret exfiltration threat model & measurements | agent-security, supply-chain, tool-calling, api-routers, exfiltration, threat-model, measurement |
| 2604.07667 | From Debate to Decision: Conformal Social Choice for Safe Multi-Agent Deliberation | cs.AI, cs.MA, cs.SI | 95 | Calibrated act-vs-escalate layer for multi-agent debate; conformal guarantees against wrong consensus. | agent-safety, multi-agent, debate, calibration, conformal-prediction, decision-making, escalation |
| 2604.08499 | PIArena: A Platform for Prompt Injection Evaluation | cs.CR, cs.AI, cs.CL, cs.LG | 93 | Unified, extensible prompt-injection evaluation platform to compare attacks/defenses across datasets/tasks | prompt-injection, evaluation, benchmarking, security, defenses, attacks, platform |
| 2604.08523 | ClawBench: Can AI Agents Complete Everyday Online Tasks? | cs.CL, cs.AI | 93 | Realistic live-web agent benchmark (153 tasks/144 platforms); strong eval value for agentic systems | agents, benchmark, web, evaluation, tool-use, robustness |
| 2604.07988 | LogAct: Enabling Agentic Reliability via Shared Logs | cs.DC, cs.AI | 93 | Shared-log state-machine abstraction enables pre-execution veto, recovery, and auditing for agents. | agents, reliability, safety, auditing, execution-control, fault-tolerance, governance |
| 2604.07775 | ACIArena: Toward Unified Evaluation for Agent Cascading Injection | cs.AI, cs.CL, cs.CR | 92 | Unified evaluation for multi-agent cascading injection across surfaces/objectives; fills MAS security gap | multi-agent, agent-security, cascading-injection, evaluation-suite, robustness, exfiltration |
| 2604.07749 | Beyond Social Pressure: Benchmarking Epistemic Attack in Large Language Models | cs.CL | 92 | New benchmark for epistemic attacks beyond sycophancy; useful for robustness evals | LLM-safety, robustness, evaluation, jailbreaks, sociotechnical, benchmark |
| 2604.07695 | AITH: A Post-Quantum Continuous Delegation Protocol for Human-AI Trust Establishment | cs.CR, cs.AI | 91 | Cryptographic continuous delegation + revocation for AI agents; concrete protocol for bounded autonomy | agent-security, delegation, access-control, cryptography, post-quantum, governance |
| 2604.08064 | ImplicitMemBench: Measuring Unconscious Behavioral Adaptation in Large Language Models | cs.AI | 91 | New benchmark for implicit (procedural/priming/conditioning) memory in LLM agents; safety-relevant behavior drift. | benchmark, agent-memory, evaluation, behavior, implicit-learning, reliability |
| 2604.08178 | Aligning Agents via Planning: A Benchmark for Trajectory-Level Reward Modeling | cs.AI | 90 | Trajectory-level RM benchmark for tool-using agents; tests safety refusal & tool constraints beyond RLHF | reward-modeling, agents, benchmark, tool-use, trajectory-eval, alignment, safety |
| 2604.08423 | Synthetic Data for any Differentiable Target | cs.CL, cs.AI, cs.LG, stat.ML | 90 | Optimizes synthetic data to steer models via higher-order gradients; big safety+misuse implications | data-poisoning, model-steering, synthetic-data, data-attribution, security, alignment |
| 2604.08059 | Governed Capability Evolution for Embodied Agents: Safe Upgrade, Compatibility Checking, and Runtime Rollback for Embodied Capability Modules | cs.RO, cs.AI | 89 | Systems framework for safe capability upgrades with compatibility checks and runtime rollback in robots | embodied-agents, governance, safe-upgrades, rollback, runtime-safety, modularity |
| 2604.07776 | Structured Distillation of Web Agent Capabilities Enables Generalization | cs.LG | 89 | Structured synthetic trajectories distill web-agent skills into 9B; strong WebArena gains for open models. | web-agents, distillation, synthetic-data, tool-use, evaluation, open-weights |
| 2604.07831 | Are GUI Agents Focused Enough? Automated Distraction via Semantic-level UI Element Injection | cs.CR, cs.CL, cs.CV | 88 | Practical red-teaming for GUI agents via semantic UI overlay injection; no white-box access required | gui-agents, red-teaming, vision-attacks, visual-grounding, adversarial-ui, robustness |
| 2604.08395 | Phantasia: Context-Adaptive Backdoors in Vision Language Models | cs.CV, cs.AI | 88 | Shows stealth of VLM backdoors overestimated; proposes/assesses defenses for multimodal backdoors | security, backdoors, VLM, data-poisoning, adversarial, defenses |
| 2604.08525 | Ads in AI Chatbots? An Analysis of How Large Language Models Navigate Conflicts of Interest | cs.AI, cs.CL, cs.CY | 88 | Analyzes LLM conflicts of interest with ads; important for deployment incentives & alignment | alignment, deployment, conflicts-of-interest, ads, policy, AI-governance |
| 2604.08005 | Preference Redirection via Attention Concentration: An Attack on Computer Use Agents | cs.LG | 87 | Vision-side attack on computer-use agents by attention redirection to adversarial patch; preference hijack | computer-use-agents, multimodal-security, adversarial-patch, attention, manipulation, vision |
| 2604.08297 | Towards Identification and Intervention of Safety-Critical Parameters in Large Language Models | cs.CR | 86 | ESI identifies safety-critical parameters and enables targeted safety interventions across dense vs MoE | mechanistic-safety, parameter-intervention, MoE, interpretability, safety-control, robustness |
| 2604.07755 | An Empirical Analysis of Static Analysis Methods for Detection and Mitigation of Code Library Hallucinations | cs.CL, cs.SE | 86 | Quantifies how far static analysis can detect/mitigate library hallucinations; clear limits + upper bounds | hallucinations, code, static-analysis, reliability, evaluation, tooling |
| 2604.07927 | EigentSearch-Q+: Enhancing Deep Research Agents with Structured Reasoning Tools | cs.AI | 86 | Adds structured query/evidence tools to deep-research agents; reduces redundancy and improves evidence use. | agents, deep-research, web-search, tooling, evidence, reasoning |
| 2604.08527 | Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models | cs.CL, cs.LG | 86 | Identifies OPD length-inflation instability and proposes stabilization; relevant to post-training | LLM-training, distillation, RLHF, post-training, stability, repetition |
| 2604.08476 | Faithful GRPO: Improving Visual Spatial Reasoning in Multimodal Language Models via Constrained Policy Optimization | cs.CV, cs.AI | 85 | Constrained RL (Faithful GRPO) targets CoT faithfulness/visual grounding, not just accuracy, in MRMs | multimodal, RLVR, GRPO, faithfulness, grounding, reasoning-eval |
| 2604.08046 | Guaranteeing Knowledge Integration with Joint Decoding for Retrieval-Augmented Generation | cs.CL | 85 | Targets RAG integration bottleneck via joint decoding that forces evidence extraction over parametric priors. | RAG, grounding, faithfulness, decoding, hallucination, knowledge-integration |
| 2604.07754 | The Art of (Mis)alignment: How Fine-Tuning Methods Effectively Misalign and Realign LLMs in Post-Training | cs.CR, cs.CL | 84 | Systematic study of how fine-tuning can misalign/realign safety-aligned LLMs; relevant for model reuse | misalignment, post-training, fine-tuning, realignment, safety, open-models |
| 2604.07877 | MemReader: From Passive to Active Extraction for Long-Term Agent Memory | cs.CL | 84 | Active long-term memory extraction (GRPO/ReAct) to reduce memory pollution and improve consistency in agents. | agent-memory, personalization, RL, GRPO, information-extraction, reliability |
| 2604.08426 | KV Cache Offloading for Context-Intensive Tasks | cs.LG, cs.AI, cs.CL | 84 | Evaluates KV-cache offloading on context-intensive tasks; releases Text2JSON benchmark | long-context, systems, KV-cache, efficiency, benchmark, information-extraction |
| 2604.08169 | Activation Steering for Aligned Open-ended Generation without Sacrificing Coherence | cs.AI | 83 | Runtime activation steering methods to maintain aligned open-ended generation beyond first tokens | activation-steering, runtime-guardrails, alignment, robustness, representation, generation |
| 2604.07801 | TEMPER: Testing Emotional Perturbation in Quantitative Reasoning | cs.CL, cs.AI | 83 | TEMPER dataset shows emotion framing alone drops math accuracy 2–10pp; useful robustness stress test | robustness, evaluation, reasoning, emotion, dataset, GSM8K |
| 2604.07789 | ORACLE-SWE: Quantifying the Contribution of Oracle Information Signals on SWE Agents | cs.MA, cs.CL, cs.SE | 83 | Quantifies value of oracle signals for SWE agents; clarifies what context helps most | agents, software-engineering, evaluation, tool-use, oracles, benchmarks |
| 2604.07929 | Same Outcomes, Different Journeys: A Trace-Level Framework for Comparing Human and GUI-Agent Behavior in Production Search Systems | cs.IR, cs.AI | 82 | Trace-level human vs GUI-agent behavior comparison in production search; goes beyond success metrics | GUI-agents, evaluation, human-comparison, behavior-traces, search, deployment |
AI Paper Insights Brief
2026-04-11
0) Executive takeaways (read this first)
- "Safe to act" is becoming a first-class output of agent systems: conformal set-valued decisions for debate (risk-budgeted escalation) and log/vote-based execution gating both reduce catastrophic automated actions by turning uncertain outputs into structured refusal and review.
- Agent security is shifting from prompt injection to system- and supply-chain-level attack surfaces: cascading multi-agent injection, malicious API routers that rewrite tool calls, and visual/UI-level manipulation of GUI/CUA agents can all bypass classic text-only defenses.
- Post-training is a dual-use battleground: preference tuning (ORPO) can quickly misalign safety-aligned open models (even with LoRA on very little data), while parameter-targeted methods (ESI→SET/SPA) and activation steering can efficiently realign or preserve alignment given white-box access.
- Benchmarks are getting more realistic and more diagnostic: real online tasks (ClawBench) expose large gaps relative to sandboxed benchmarks; trajectory-level reward modeling (Plan-RewardBench) shows evaluators collapsing at long context; implicit memory (ImplicitMemBench) reveals "unconscious" adaptation failures that retrieval alone cannot fix.
- RAG and code reliability are moving beyond retrieval itself: joint decoding for evidence integration (GuarantRAG) targets "retrieved but ignored" failures; static analysis catches a substantial but bounded share of Python library hallucinations, with clear upper bounds.
- Training/inference infrastructure details affect robustness: OPD can collapse via length inflation, and KV-cache offloading can silently degrade accuracy on context-intensive tasks. Both are "systems" failure modes that look like model failures.
2) Key themes (clusters)
Theme: Risk-controlled autonomy (refusal, gating, recovery)
- Why it matters: once agents take real actions, the critical question is often when not to act. Turning model outputs into calibrated escalation, auditable gating, and recoverable execution reduces irreversible failures.
- Representative papers:
- Common methods:
- Turn point predictions into structured decisions (set-valued outputs; commit/abort; staged activation); see the conformal sketch after this theme.
- Add an explicit policy layer (voters/deciders; compatibility checkers; action tiers).
- Emphasize auditability plus rollback/recovery (durable logs; shadow deployments; rollback controllers).
- Open problems / failure modes:
- Guarantees are often marginal/population-level (e.g., split conformal) rather than conditional or per-instance.
- Distribution shift and environment changes can break calibration and governance assumptions.
- Safety layers can impose a utility/latency tax (more human review; more tool calls; slower execution).
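A minimal sketch of the set-valued act-vs-escalate pattern, assuming the system exposes a per-option probability (e.g., aggregated verbalized probabilities); the function names, toy calibration data, and option labels are illustrative, not taken from any paper:

```python
import numpy as np

def conformal_threshold(true_label_probs: np.ndarray, alpha: float) -> float:
    """Split-conformal calibration. true_label_probs[i] is the probability the
    system assigned to the *correct* option on held-out example i. Returns a
    nonconformity threshold giving marginal coverage >= 1 - alpha, assuming
    exchangeability between calibration and test points."""
    scores = 1.0 - true_label_probs                       # low prob = high nonconformity
    n = len(scores)
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)  # finite-sample correction
    return float(np.quantile(scores, level, method="higher"))

def act_or_escalate(option_probs: dict, qhat: float):
    """Build the conformal prediction set; act only when it is a singleton."""
    pred_set = {y for y, p in option_probs.items() if 1.0 - p <= qhat}
    if len(pred_set) == 1:
        return "act", pred_set.pop()   # unambiguous under the risk budget
    return "escalate", pred_set        # ambiguous or empty: route to review

# Calibrate once on held-out decisions (a real calibration set would be much
# larger than this toy one), then gate every new decision.
qhat = conformal_threshold(np.array([0.9, 0.7, 0.85, 0.6, 0.95]), alpha=0.2)
print(act_or_escalate({"approve": 0.95, "reject": 0.04}, qhat))
```

The finite-sample quantile correction is what yields the marginal guarantee; as the open problems above note, it holds on average over exchangeable inputs, not per instance.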
Theme: Agent attack surfaces beyond text (multi-agent, UI/vision, routers)
- Why it matters: many deployed agent stacks include intermediary layers (routers), multi-agent trust propagation, and visual grounding (GUI/CUA). These surfaces let attacks stay schema-valid and "benign-looking" while bypassing text-only filters.
- Representative papers:
- Your Agent Is Mine: Measuring Malicious Intermediary Attacks on the LLM Supply Chain
- ACIArena: Toward Unified Evaluation for Agent Cascading Injection
- Are GUI Agents Focused Enough? Automated Distraction via Semantic-level UI Element Injection
- Preference Redirection via Attention Concentration: An Attack on Computer Use Agents
- Common methods:
- Benchmark with standardized suites (ACIArena's 1,356 cases; router-marketplace measurements; GUI injection metrics L1/L2).
- Attacks that preserve surface validity (schema-valid JSON rewrites; safety-themed UI icons; small ℓ∞ patches).
- Evaluate transferability/persistence (cross-victim icon transfer; robustness to prompt variants; adaptive evasion triggers).
- Open problems / failure modes:
- Long-term fixes for routers likely require provider-endorsed integrity/provenance, not just client-side heuristics (a minimal gate-and-log sketch follows this theme).
- GUI/CUA defenses remain immature; existing work mostly demonstrates attacks, with limited evaluation of mitigations.
- Multi-agent defenses can cost utility; state pruning (ACI-SENTINEL) helps but is not universal.
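A minimal sketch of the client-side gate-and-log pattern, under the assumption that the router (not the model provider) is the adversary; the allowlist, key names, and entry schema are hypothetical:

```python
import hashlib, json, time

ALLOWED_TOOLS = {"search_docs", "read_file"}            # hypothetical allowlist
SECRET_ARG_KEYS = {"api_key", "password", "ssh_key"}    # cheap anomaly screen

def gate_tool_call(call: dict) -> bool:
    """Fail-closed policy: veto any call that is not explicitly allowed, is
    malformed, or carries secret-looking arguments outbound."""
    if call.get("name") not in ALLOWED_TOOLS:
        return False
    args = call.get("arguments")
    if not isinstance(args, dict):
        return False
    return not (SECRET_ARG_KEYS & {str(k).lower() for k in args})

def append_entry(log: list, event: dict) -> None:
    """Append-only transparency log: each entry commits to its predecessor,
    so tampering with earlier entries is detectable at audit time."""
    prev = log[-1]["digest"] if log else "genesis"
    body = json.dumps(event, sort_keys=True)
    digest = hashlib.sha256((prev + body).encode()).hexdigest()
    log.append({"ts": time.time(), "event": event, "digest": digest})

audit_log: list = []
call = {"name": "read_file", "arguments": {"path": "notes.txt"}}
append_entry(audit_log, {"call": call, "allowed": gate_tool_call(call)})
```

As the open problems note, this is heuristic: a hash chain detects client-side tampering but cannot prove what the router actually forwarded; that requires provider-endorsed provenance.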
Theme: Post-training alignment is fragile (and can be manipulated in targeted ways)
- Why it matters: open models can be misaligned quickly with readily available fine-tuning; defenders need efficient ways to restore or preserve safety alignment without destroying utility.
- Representative papers:
- Common methods:
- Compare fine-tuning methods as attack and defense tools (ORPO vs DPO; PEFT vs PFT).
- Use mechanistic/white-box levers (parameter ranking via ESI; token-gated steering in activation space; a minimal steering sketch follows this theme).
- Measure safety via ASR/unsafety and track utility regressions (MMLU/GSM8K etc.; coherence metrics).
- Open problems / failure modes:
- White-box methods (ESI, steering) do not transfer to closed APIs.
- Safety metrics often rely on LLM judges, which can be imperfect.
- Dual-use risk: the same tools can be used to realign, to misalign, or to evade.
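For the activation-steering lever, a minimal white-box sketch using a PyTorch forward hook; the layer choice, scale, and uniform (non-gated) application are simplifications of what the papers actually tune (real methods derive the vector from contrastive prompts and gate per token):

```python
import torch

def add_steering_hook(layer: torch.nn.Module, vector: torch.Tensor,
                      scale: float = 4.0):
    """Shift the layer's hidden states along a steering direction at inference
    time. Returns the hook handle; call handle.remove() to restore behavior."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + scale * vector.to(hidden.dtype)
        # Forward hooks may return a replacement output.
        return (steered, *output[1:]) if isinstance(output, tuple) else steered
    return layer.register_forward_hook(hook)

# Hypothetical usage on an HF-style decoder:
# handle = add_steering_hook(model.model.layers[20], steering_vector)
# ...generate...
# handle.remove()
```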
Theme: More realistic evaluation and long-horizon diagnostics for agents
- Why it matters: outcome-only metrics mask failure modes (navigation drift, evaluator length collapse, implicit-memory gaps). New benchmarks emphasize trajectories, long context, and real websites.
- Representative papers:
- ClawBench: Can AI Agents Complete Everyday Online Tasks?
- Aligning Agents via Planning: A Benchmark for Trajectory-Level Reward Modeling
- ImplicitMemBench: Measuring Unconscious Behavioral Adaptation in Large Language Models
- Same Outcomes, Different Journeys: A Trace-Level Framework for Comparing Human and GUI-Agent Behavior in Production Search Systems
- Common methods:
- Collect trajectory-level artifacts (multi-layer recording; state-transition graphs; tool traces).
- Stress-test long-context, multi-step settings where judges/RMs degrade.
- Include hard negatives and near-misses to prevent superficial scoring.
- Open problems / failure modes:
- Live-web benchmarks suffer reproducibility drift; evaluators can be model-dependent.
- Trajectory judging collapses beyond roughly 32k tokens; scaling evaluation remains unsolved (a chunked-judging sketch follows this theme).
- Reported results indicate implicit-memory failures persist even in memory-augmented agents.
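One pragmatic response to judge length-collapse is hierarchical scoring: judge bounded segments, then aggregate. A sketch, with `call_judge` as a hypothetical segment scorer returning a number:

```python
def chunk_steps(steps: list, max_chars: int = 8000) -> list:
    """Greedily pack trajectory steps into segments small enough that each
    judge prompt stays well below the regime where judging degrades."""
    segments, current, size = [], [], 0
    for step in steps:
        if current and size + len(step) > max_chars:
            segments.append(current)
            current, size = [], 0
        current.append(step)
        size += len(step)
    if current:
        segments.append(current)
    return segments

def score_trajectory(steps: list, call_judge) -> float:
    """Worst-segment aggregation: one bad segment sinks the whole run. Mean or
    summarize-then-judge aggregation are reasonable alternatives."""
    if not steps:
        return 0.0
    return min(call_judge("\n".join(seg)) for seg in chunk_steps(steps))
```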
Theme: Grounding and integration (RAG, memory, code)
- Why it matters: reliability failures often come from integration rather than retrieval: evidence gets ignored, polluted memories get written, or APIs get fabricated. Practical pipelines are emerging, with measurable upper bounds.
- Representative papers:
- Common methods:
- Separate concerns: reasoning vs evidence (Inner-Answer vs Refer-Answer; active memory actions).
- Use post-hoc or tool-based layers (joint-decoding interventions; search/add/buffer/ignore; static analyzers plus repair; an import-check sketch follows this theme).
- Quantify upper bounds and trade-offs (ceilings on what static analysis can catch; memory-update correctness; magnitude of hallucination reduction).
- Open problems / failure modes:
- Extra latency/compute (dual-path generation; multi-step memory management).
- Dependence on docstrings/type information limits static methods; dynamic languages remain hard.
- Long-term online stability of memory managers still needs validation beyond benchmarks.
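As a concrete instance of the static-analysis bound, a sketch that screens generated Python for import-level library hallucinations; it says nothing about wrong attributes or signatures, which is exactly where the measured upper bounds bite:

```python
import ast
from importlib.util import find_spec

def unresolvable_imports(code: str) -> set:
    """Return top-level modules imported by `code` that do not resolve in the
    current environment. Catches only import-level hallucinations, not wrong
    attribute or signature hallucinations."""
    mods = set()
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            mods |= {alias.name.split(".")[0] for alias in node.names}
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            mods.add(node.module.split(".")[0])
    return {m for m in mods if find_spec(m) is None}

print(unresolvable_imports("import json\nimport totally_madeup_lib"))
# -> {'totally_madeup_lib'}
```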
3) Technical synthesis
- Multiple papers converge on structured intermediates as the reliability lever: prediction sets (conformal), typed logs (AgentBus), explicit tool parameters (Q+), oracle signals (ORACLE-SWE), and trajectory pairs (Plan-RewardBench). A distilled shared-log sketch follows this list.
- Selection effects are exploited repeatedly: conformal singletons are accurate because the system abstains; judge-filtered synthetic trajectories beat larger unfiltered sets; shadow deployments catch regressions that sandboxes miss.
- LLM-as-judge is ubiquitous, but papers increasingly report judge validation (e.g., PPT-Bench human agreement; FGRPO κ=0.997 vs GPT-5) and/or add judge-independent signals (activation steering uses embedding similarity, cross-entropy, ELO).
- Robustness failures increasingly look non-adversarial: emotional framing degrades math performance; UI icons are safety-themed; router rewrites stay schema-valid; OPD collapse looks like "the model getting worse" but is actually training dynamics.
- A clear split is emerging between black-box deployable defenses (conformal layers; prompt mitigations; static analysis; client-side router gating/logging) and white-box mechanistic defenses (activation steering; ESI parameter interventions; PRAC patch construction).
- Long-horizon settings expose evaluator fragility: Plan-RewardBench shows pairwise LLM judges collapsing beyond roughly 32k tokens, motivating more robust discriminative RMs or hierarchical evaluation.
- "Memory" is splitting into explicit storage (MemReader's active writes) and implicit behavioral adaptation (ImplicitMemBench); the latter cannot be fixed by retrieval alone.
- Systems work (KV offloading, OPD stability) shows inference/training optimizations can silently change task accuracy, so robustness evaluation must include infrastructure variants.
- Safety evaluation is moving toward ecosystem measurement (router marketplaces, poisoning studies) rather than lab-only attacks, yielding concrete prevalence numbers and actionable mitigations.
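A distilled sketch of the shared-log structured intermediate (inspired by, but not reproducing, the LogAct/AgentBus designs): actions become typed entries that a policy must approve before execution, giving pre-execution veto plus an audit trail; the entry schema is hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class LogEntry:
    agent: str
    action: str
    args: dict
    status: str = "proposed"   # proposed -> approved | vetoed

@dataclass
class SharedLog:
    entries: list = field(default_factory=list)

    def propose(self, entry: LogEntry) -> LogEntry:
        self.entries.append(entry)     # the audit trail keeps vetoed actions too
        return entry

    def review(self, entry: LogEntry, policy) -> None:
        entry.status = "approved" if policy(entry) else "vetoed"

log = SharedLog()
e = log.propose(LogEntry("coder", "delete_file", {"path": "/tmp/x"}))
log.review(e, policy=lambda entry: entry.action != "delete_file")
assert e.status == "vetoed"            # pre-execution veto, recorded for audit
```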
4) Top 5 papers (with "why now")
1) Your Agent Is Mine: Measuring Malicious Intermediary Attacks on the LLM Supply Chain
- Quantifies a real but under-discussed risk: API routers terminate TLS and can rewrite executable tool-call JSON.
- Large-scale ecosystem measurement (28 paid + 400 free routers) observing active injection and credential touching, plus a poisoning study.
- Evaluates practical client-side mitigations (policy gating, anomaly screening, transparency logs) and shows the attack proxy works across multiple agent frameworks.
- Caveats: client-side defenses cannot provide cryptographic provenance; measurement may miss adaptive behavior that never triggers.
2) From Debate to Decision: Conformal Social Choice for Safe Multi-Agent Deliberation
- Reframes debate output as act vs escalate under a user-set risk budget α, with split-conformal marginal coverage guarantees.
- Empirically targets the key failure, wrong unanimous consensus: 23.9% of initially divergent cases converge to a unanimous but wrong answer by round 3; at α=0.05 the conformal layer intercepts 81.9% via escalation.
- Black-box and post hoc: deployable on proprietary models via verbalized probabilities plus aggregation.
- Caveats: guarantees are marginal and assume exchangeability; evaluated on closed-set multiple-choice tasks.
3) ClawBench: Can AI Agents Complete Everyday Online Tasks?
- A live-web benchmark with safety interception of final submissions and five-layer trajectory recording, bridging realism and safety.
- Shows large gaps versus sandboxed benchmarks: the best reported model (Claude Sonnet 4.6) reaches 33.3% SR; GPT-5.4 reaches 6.5%.
- Provides traceable failure diagnosis via agentic evaluators and comparisons against human trajectories.
- Caveats: live-web variability and manual endpoint annotation limit scalability and reproducibility.
4) ACIArena: Toward Unified Evaluation for Agent Cascading Injection
- Standardizes multi-agent cascading-injection evaluation covering 28 attacks, 3 surfaces, and 3 objectives, with 6 integrated MAS frameworks.
- Finds high vulnerability (90–100% ASR is common on code tasks; LLM Debate is cited at 100% hijack ASR), and some defenses fail or cost utility.
- Proposes ACI-SENTINEL (semantic-minimality pruning), which substantially reduces ASR in reported cases.
- Caveats: evaluation scale is limited by query cost; defenses introduce utility trade-offs and are not universally effective.
5) The Art of (Mis)alignment: How Fine-Tuning Methods Effectively Misalign and Realign LLMs in Post-Training
- Maps attack/defense dynamics across common methods: ORPO is strongest for misalignment; DPO is strongest for realignment (often at a utility cost).
- Shows misalignment can be data-efficient (LoRA with as few as 13 unsafe examples works in some settings).
- Highlights model-specific resistance patterns (Gemma2 resists SFT misalignment but not ORPO).
- Caveats: unsafety relies on an LLM-judge ensemble; proprietary models and full RLHF are not covered.
5) Practical next steps
- Add an act/escalate layer to any multi-agent or ensemble system: implement split conformal over aggregated probabilities (or similar scores) and measure the trade-off between reduced automation errors and escalation rate.
- For tool-using agents, treat routers as untrusted: deploy fail-closed policies for high-risk tools, add response anomaly screening, and implement append-only transparency logs for forensics.
- Red-team GUI/CUA stacks with non-textual attacks (semantic-level UI icon injection; visual preference redirection); measure persistence and cross-model transfer, not just one-shot success.
- If you release open-weight models, assume post-training misalignment is cheap: run ORPO/LoRA-style adversarial tuning on candidate releases; evaluate how well DPO or targeted interventions restore safety and at what utility cost.
- Upgrade evaluations to include live-web or trajectory-level metrics (navigation drift, tool-use effort deviation) plus long-context judge failure checks (e.g., >32k-token trajectories).
- For SWE agents, prioritize reproduction-test generation/extraction and richer execution context: ORACLE-SWE indicates reproduction tests dominate oracle gains, and combined signals approach full success.
- Audit infrastructure changes (KV offloading, OPD/RL pipelines) on context-intensive benchmarks and monitor length/repetition; treat "optimizations" as a potential source of accuracy regressions (a monitor sketch follows this list).
- For RAG, measure integration failures (parametric override; fragmented integration) and test dual-path plus fusion approaches; do not assume retrieval improvements automatically become factuality gains.
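A sketch of the suggested length/repetition monitor for infrastructure audits; the thresholds and the 4-gram choice are illustrative:

```python
def repetition_ratio(text: str, n: int = 4) -> float:
    """Share of repeated n-grams; rises sharply under degenerate looping."""
    toks = text.split()
    grams = [tuple(toks[i:i + n]) for i in range(max(len(toks) - n + 1, 0))]
    return 1.0 - len(set(grams)) / len(grams) if grams else 0.0

def mean(xs) -> float:
    xs = list(xs)
    return sum(xs) / len(xs) if xs else 0.0

def drift_alerts(baseline: list, candidate: list) -> list:
    """Compare outputs from a candidate config (e.g., KV offloading enabled)
    against a baseline config on the same fixed probe prompts."""
    alerts = []
    if mean(len(t.split()) for t in candidate) > 1.5 * mean(len(t.split()) for t in baseline):
        alerts.append("output length inflated >1.5x vs baseline")
    if mean(repetition_ratio(t) for t in candidate) > mean(repetition_ratio(t) for t in baseline) + 0.1:
        alerts.append("4-gram repetition up by >0.1 vs baseline")
    return alerts
```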
Generated from per-paper analysis; no external browsing.
