AI Paper Daily Digest (2026-04-11)
Published:
English version: /paper-news/2026-04-11/
Run statistics
- Candidate papers: 285
- Selected papers: 30
- Deep reads completed: 30
- Time window (UTC): 2026-04-09T00:00:00Z → 2026-04-10T00:00:00Z (arxiv_announce, expanded=0)
Paper list used for the summary:
| arXiv ID | Title / Link | Categories | Score | Selection rationale | Tags |
|---|---|---|---|---|---|
| 2604.08407 | Your Agent Is Mine: Measuring Malicious Intermediary Attacks on the LLM Supply Chain | cs.CR | 95 | First systematic study of malicious LLM API routers: injection+secret exfiltration threat model & measurements | agent-security, supply-chain, tool-calling, api-routers, exfiltration, threat-model, measurement |
| 2604.07667 | From Debate to Decision: Conformal Social Choice for Safe Multi-Agent Deliberation | cs.AI, cs.MA, cs.SI | 95 | Calibrated act-vs-escalate layer for multi-agent debate; conformal guarantees against wrong consensus. | agent-safety, multi-agent, debate, calibration, conformal-prediction, decision-making, escalation |
| 2604.08499 | PIArena: A Platform for Prompt Injection Evaluation | cs.CR, cs.AI, cs.CL, cs.LG | 93 | Unified, extensible prompt-injection evaluation platform to compare attacks/defenses across datasets/tasks | prompt-injection, evaluation, benchmarking, security, defenses, attacks, platform |
| 2604.08523 | ClawBench: Can AI Agents Complete Everyday Online Tasks? | cs.CL, cs.AI | 93 | Realistic live-web agent benchmark (153 tasks/144 platforms); strong eval value for agentic systems | agents, benchmark, web, evaluation, tool-use, robustness |
| 2604.07988 | LogAct: Enabling Agentic Reliability via Shared Logs | cs.DC, cs.AI | 93 | Shared-log state-machine abstraction enables pre-execution veto, recovery, and auditing for agents. | agents, reliability, safety, auditing, execution-control, fault-tolerance, governance |
| 2604.07775 | ACIArena: Toward Unified Evaluation for Agent Cascading Injection | cs.AI, cs.CL, cs.CR | 92 | Unified evaluation for multi-agent cascading injection across surfaces/objectives; fills MAS security gap | multi-agent, agent-security, cascading-injection, evaluation-suite, robustness, exfiltration |
| 2604.07749 | Beyond Social Pressure: Benchmarking Epistemic Attack in Large Language Models | cs.CL | 92 | New benchmark for epistemic attacks beyond sycophancy; useful for robustness evals | LLM-safety, robustness, evaluation, jailbreaks, sociotechnical, benchmark |
| 2604.07695 | AITH: A Post-Quantum Continuous Delegation Protocol for Human-AI Trust Establishment | cs.CR, cs.AI | 91 | Cryptographic continuous delegation + revocation for AI agents; concrete protocol for bounded autonomy | agent-security, delegation, access-control, cryptography, post-quantum, governance |
| 2604.08064 | ImplicitMemBench: Measuring Unconscious Behavioral Adaptation in Large Language Models | cs.AI | 91 | New benchmark for implicit (procedural/priming/conditioning) memory in LLM agents; safety-relevant behavior drift. | benchmark, agent-memory, evaluation, behavior, implicit-learning, reliability |
| 2604.08178 | Aligning Agents via Planning: A Benchmark for Trajectory-Level Reward Modeling | cs.AI | 90 | Trajectory-level RM benchmark for tool-using agents; tests safety refusal & tool constraints beyond RLHF | reward-modeling, agents, benchmark, tool-use, trajectory-eval, alignment, safety |
| 2604.08423 | Synthetic Data for any Differentiable Target | cs.CL, cs.AI, cs.LG, stat.ML | 90 | Optimizes synthetic data to steer models via higher-order gradients; big safety+misuse implications | data-poisoning, model-steering, synthetic-data, data-attribution, security, alignment |
| 2604.08059 | Governed Capability Evolution for Embodied Agents: Safe Upgrade, Compatibility Checking, and Runtime Rollback for Embodied Capability Modules | cs.RO, cs.AI | 89 | Systems framework for safe capability upgrades with compatibility checks and runtime rollback in robots | embodied-agents, governance, safe-upgrades, rollback, runtime-safety, modularity |
| 2604.07776 | Structured Distillation of Web Agent Capabilities Enables Generalization | cs.LG | 89 | Structured synthetic trajectories distill web-agent skills into 9B; strong WebArena gains for open models. | web-agents, distillation, synthetic-data, tool-use, evaluation, open-weights |
| 2604.07831 | Are GUI Agents Focused Enough? Automated Distraction via Semantic-level UI Element Injection | cs.CR, cs.CL, cs.CV | 88 | Practical red-teaming for GUI agents via semantic UI overlay injection; no white-box access required | gui-agents, red-teaming, vision-attacks, visual-grounding, adversarial-ui, robustness |
| 2604.08395 | Phantasia: Context-Adaptive Backdoors in Vision Language Models | cs.CV, cs.AI | 88 | Shows stealth of VLM backdoors overestimated; proposes/assesses defenses for multimodal backdoors | security, backdoors, VLM, data-poisoning, adversarial, defenses |
| 2604.08525 | Ads in AI Chatbots? An Analysis of How Large Language Models Navigate Conflicts of Interest | cs.AI, cs.CL, cs.CY | 88 | Analyzes LLM conflicts of interest with ads; important for deployment incentives & alignment | alignment, deployment, conflicts-of-interest, ads, policy, AI-governance |
| 2604.08005 | Preference Redirection via Attention Concentration: An Attack on Computer Use Agents | cs.LG | 87 | Vision-side attack on computer-use agents by attention redirection to adversarial patch; preference hijack | computer-use-agents, multimodal-security, adversarial-patch, attention, manipulation, vision |
| 2604.08297 | Towards Identification and Intervention of Safety-Critical Parameters in Large Language Models | cs.CR | 86 | ESI identifies safety-critical parameters and enables targeted safety interventions across dense vs MoE | mechanistic-safety, parameter-intervention, MoE, interpretability, safety-control, robustness |
| 2604.07755 | An Empirical Analysis of Static Analysis Methods for Detection and Mitigation of Code Library Hallucinations | cs.CL, cs.SE | 86 | Quantifies how far static analysis can detect/mitigate library hallucinations; clear limits + upper bounds | hallucinations, code, static-analysis, reliability, evaluation, tooling |
| 2604.07927 | EigentSearch-Q+: Enhancing Deep Research Agents with Structured Reasoning Tools | cs.AI | 86 | Adds structured query/evidence tools to deep-research agents; reduces redundancy and improves evidence use. | agents, deep-research, web-search, tooling, evidence, reasoning |
| 2604.08527 | Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models | cs.CL, cs.LG | 86 | Identifies OPD length-inflation instability and proposes stabilization; relevant to post-training | LLM-training, distillation, RLHF, post-training, stability, repetition |
| 2604.08476 | Faithful GRPO: Improving Visual Spatial Reasoning in Multimodal Language Models via Constrained Policy Optimization | cs.CV, cs.AI | 85 | Constrained RL (Faithful GRPO) targets CoT faithfulness/visual grounding, not just accuracy, in MRMs | multimodal, RLVR, GRPO, faithfulness, grounding, reasoning-eval |
| 2604.08046 | Guaranteeing Knowledge Integration with Joint Decoding for Retrieval-Augmented Generation | cs.CL | 85 | Targets RAG integration bottleneck via joint decoding that forces evidence extraction over parametric priors. | RAG, grounding, faithfulness, decoding, hallucination, knowledge-integration |
| 2604.07754 | The Art of (Mis)alignment: How Fine-Tuning Methods Effectively Misalign and Realign LLMs in Post-Training | cs.CR, cs.CL | 84 | Systematic study of how fine-tuning can misalign/realign safety-aligned LLMs; relevant for model reuse | misalignment, post-training, fine-tuning, realignment, safety, open-models |
| 2604.07877 | MemReader: From Passive to Active Extraction for Long-Term Agent Memory | cs.CL | 84 | Active long-term memory extraction (GRPO/ReAct) to reduce memory pollution and improve consistency in agents. | agent-memory, personalization, RL, GRPO, information-extraction, reliability |
| 2604.08426 | KV Cache Offloading for Context-Intensive Tasks | cs.LG, cs.AI, cs.CL | 84 | Evaluates KV-cache offloading on context-intensive tasks; releases Text2JSON benchmark | long-context, systems, KV-cache, efficiency, benchmark, information-extraction |
| 2604.08169 | Activation Steering for Aligned Open-ended Generation without Sacrificing Coherence | cs.AI | 83 | Runtime activation steering methods to maintain aligned open-ended generation beyond first tokens | activation-steering, runtime-guardrails, alignment, robustness, representation, generation |
| 2604.07801 | TEMPER: Testing Emotional Perturbation in Quantitative Reasoning | cs.CL, cs.AI | 83 | TEMPER dataset shows emotion framing alone drops math accuracy 2–10pp; useful robustness stress test | robustness, evaluation, reasoning, emotion, dataset, GSM8K |
| 2604.07789 | ORACLE-SWE: Quantifying the Contribution of Oracle Information Signals on SWE Agents | cs.MA, cs.CL, cs.SE | 83 | Quantifies value of oracle signals for SWE agents; clarifies what context helps most | agents, software-engineering, evaluation, tool-use, oracles, benchmarks |
| 2604.07929 | Same Outcomes, Different Journeys: A Trace-Level Framework for Comparing Human and GUI-Agent Behavior in Production Search Systems | cs.IR, cs.AI | 82 | Trace-level human vs GUI-agent behavior comparison in production search; goes beyond success metrics | GUI-agents, evaluation, human-comparison, behavior-traces, search, deployment |
AI Paper Insights Brief
2026-04-11
0) Executive takeaways (read this first)
- "Safe to act" is becoming a first-class output of agent systems: conformal set-valued decisions for debate (risk-budgeted escalation) and log/vote-based execution gating both reduce catastrophic automated actions by turning uncertain outputs into structured refusal and review.
- Agent security is shifting from prompt injection to system- and supply-chain-level attack surfaces: cascading multi-agent injection, malicious API routers that rewrite tool calls, and visual/UI-level manipulation of GUI/CUA agents can all bypass classic text-only defenses.
- Post-training is a dual-use battleground: preference tuning (ORPO) can quickly misalign safety-aligned open models (even with LoRA on very little data), while parameter-targeted methods (ESI→SET/SPA) and activation steering can efficiently realign or preserve alignment given white-box access.
- Benchmarks are getting more realistic and more diagnostic: real online tasks (ClawBench) expose large gaps relative to sandboxed benchmarks; trajectory-level reward modeling (Plan-RewardBench) shows evaluators collapsing at long context; implicit memory (ImplicitMemBench) reveals "unconscious" adaptation failures that retrieval alone cannot fix.
- RAG and code reliability are moving beyond retrieval itself: joint decoding for evidence integration (GuarantRAG) targets "retrieved but ignored" failures; static analysis catches a substantial but bounded share of Python library hallucinations, with clear upper bounds.
- Training/inference infrastructure details affect robustness: OPD can collapse via length inflation, and KV-cache offloading can silently degrade accuracy on context-intensive tasks. Both are "systems" failure modes that look like model failures.
2) Key themes (clusters)
Theme: Risk-controlled autonomy (refusal, gating, recovery)
- Why it matters: once agents take real actions, the critical question is often when not to act. Turning model outputs into calibrated escalation, auditable gating, and recoverable execution reduces irreversible failures.
- Representative papers:
- Common methods:
- Turn point predictions into structured decisions (set-valued outputs; commit/abort; staged activation); see the conformal sketch after this theme.
- Add an explicit policy layer (voters/deciders; compatibility checkers; action tiers).
- Emphasize auditability plus rollback/recovery (durable logs; shadow deployments; rollback controllers).
- Open problems / failure modes:
- Guarantees are often marginal/population-level (e.g., split conformal) rather than conditional or per-instance.
- Distribution shift and environment changes can break calibration and governance assumptions.
- Safety layers can impose a utility/latency tax (more human review; more tool calls; slower execution).
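A minimal sketch of the set-valued act-vs-escalate pattern, assuming the system exposes a per-option probability (e.g., aggregated verbalized probabilities); the function names, toy calibration data, and option labels are illustrative, not taken from any paper:

```python
import numpy as np

def conformal_threshold(true_label_probs: np.ndarray, alpha: float) -> float:
    """Split-conformal calibration. true_label_probs[i] is the probability the
    system assigned to the *correct* option on held-out example i. Returns a
    nonconformity threshold giving marginal coverage >= 1 - alpha, assuming
    exchangeability between calibration and test points."""
    scores = 1.0 - true_label_probs                       # low prob = high nonconformity
    n = len(scores)
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)  # finite-sample correction
    return float(np.quantile(scores, level, method="higher"))

def act_or_escalate(option_probs: dict, qhat: float):
    """Build the conformal prediction set; act only when it is a singleton."""
    pred_set = {y for y, p in option_probs.items() if 1.0 - p <= qhat}
    if len(pred_set) == 1:
        return "act", pred_set.pop()   # unambiguous under the risk budget
    return "escalate", pred_set        # ambiguous or empty: route to review

# Calibrate once on held-out decisions (a real calibration set would be much
# larger than this toy one), then gate every new decision.
qhat = conformal_threshold(np.array([0.9, 0.7, 0.85, 0.6, 0.95]), alpha=0.2)
print(act_or_escalate({"approve": 0.95, "reject": 0.04}, qhat))
```

The finite-sample quantile correction is what yields the marginal guarantee; as the open problems above note, it holds on average over exchangeable inputs, not per instance.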
Theme: Agent attack surfaces beyond text (multi-agent, UI/vision, routers)
- Why it matters: many deployed agent stacks include intermediary layers (routers), multi-agent trust propagation, and visual grounding (GUI/CUA). These surfaces let attacks stay schema-valid and "benign-looking" while bypassing text-only filters.
- Representative papers:
- Your Agent Is Mine: Measuring Malicious Intermediary Attacks on the LLM Supply Chain
- ACIArena: Toward Unified Evaluation for Agent Cascading Injection
- Are GUI Agents Focused Enough? Automated Distraction via Semantic-level UI Element Injection
- Preference Redirection via Attention Concentration: An Attack on Computer Use Agents
- Common methods:
- Benchmark with standardized suites (ACIArena's 1,356 cases; router-marketplace measurements; GUI injection metrics L1/L2).
- Attacks that preserve surface validity (schema-valid JSON rewrites; safety-themed UI icons; small ℓ∞ patches).
- Evaluate transferability/persistence (cross-victim icon transfer; robustness to prompt variants; adaptive evasion triggers).
- Open problems / failure modes:
- Long-term fixes for routers likely require provider-endorsed integrity/provenance, not just client-side heuristics (a minimal gate-and-log sketch follows this theme).
- GUI/CUA defenses remain immature; existing work mostly demonstrates attacks, with limited evaluation of mitigations.
- Multi-agent defenses can cost utility; state pruning (ACI-SENTINEL) helps but is not universal.
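A minimal sketch of the client-side gate-and-log pattern, under the assumption that the router (not the model provider) is the adversary; the allowlist, key names, and entry schema are hypothetical:

```python
import hashlib, json, time

ALLOWED_TOOLS = {"search_docs", "read_file"}            # hypothetical allowlist
SECRET_ARG_KEYS = {"api_key", "password", "ssh_key"}    # cheap anomaly screen

def gate_tool_call(call: dict) -> bool:
    """Fail-closed policy: veto any call that is not explicitly allowed, is
    malformed, or carries secret-looking arguments outbound."""
    if call.get("name") not in ALLOWED_TOOLS:
        return False
    args = call.get("arguments")
    if not isinstance(args, dict):
        return False
    return not (SECRET_ARG_KEYS & {str(k).lower() for k in args})

def append_entry(log: list, event: dict) -> None:
    """Append-only transparency log: each entry commits to its predecessor,
    so tampering with earlier entries is detectable at audit time."""
    prev = log[-1]["digest"] if log else "genesis"
    body = json.dumps(event, sort_keys=True)
    digest = hashlib.sha256((prev + body).encode()).hexdigest()
    log.append({"ts": time.time(), "event": event, "digest": digest})

audit_log: list = []
call = {"name": "read_file", "arguments": {"path": "notes.txt"}}
append_entry(audit_log, {"call": call, "allowed": gate_tool_call(call)})
```

As the open problems note, this is heuristic: a hash chain detects client-side tampering but cannot prove what the router actually forwarded; that requires provider-endorsed provenance.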
Theme: Post-training alignment is fragile (and can be manipulated in targeted ways)
- Why it matters: open models can be misaligned quickly with readily available fine-tuning; defenders need efficient ways to restore or preserve safety alignment without destroying utility.
- Representative papers:
- Common methods:
- Compare fine-tuning methods as attack and defense tools (ORPO vs DPO; PEFT vs PFT).
- Use mechanistic/white-box levers (parameter ranking via ESI; token-gated steering in activation space; a minimal steering sketch follows this theme).
- Measure safety via ASR/unsafety and track utility regressions (MMLU/GSM8K etc.; coherence metrics).
- Open problems / failure modes:
- White-box methods (ESI, steering) do not transfer to closed APIs.
- Safety metrics often rely on LLM judges, which can be imperfect.
- Dual-use risk: the same tools can be used to realign, to misalign, or to evade.
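For the activation-steering lever, a minimal white-box sketch using a PyTorch forward hook; the layer choice, scale, and uniform (non-gated) application are simplifications of what the papers actually tune (real methods derive the vector from contrastive prompts and gate per token):

```python
import torch

def add_steering_hook(layer: torch.nn.Module, vector: torch.Tensor,
                      scale: float = 4.0):
    """Shift the layer's hidden states along a steering direction at inference
    time. Returns the hook handle; call handle.remove() to restore behavior."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + scale * vector.to(hidden.dtype)
        # Forward hooks may return a replacement output.
        return (steered, *output[1:]) if isinstance(output, tuple) else steered
    return layer.register_forward_hook(hook)

# Hypothetical usage on an HF-style decoder:
# handle = add_steering_hook(model.model.layers[20], steering_vector)
# ...generate...
# handle.remove()
```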
Theme: More realistic evaluation and long-horizon diagnostics for agents
- Why it matters: outcome-only metrics mask failure modes (navigation drift, evaluator length collapse, implicit-memory gaps). New benchmarks emphasize trajectories, long context, and real websites.
- Representative papers:
- ClawBench: Can AI Agents Complete Everyday Online Tasks?
- Aligning Agents via Planning: A Benchmark for Trajectory-Level Reward Modeling
- ImplicitMemBench: Measuring Unconscious Behavioral Adaptation in Large Language Models
- Same Outcomes, Different Journeys: A Trace-Level Framework for Comparing Human and GUI-Agent Behavior in Production Search Systems
- Common methods:
- Collect trajectory-level artifacts (multi-layer recording; state-transition graphs; tool traces).
- Stress-test long-context, multi-step settings where judges/RMs degrade.
- Include hard negatives and near-misses to prevent superficial scoring.
- Open problems / failure modes:
- Live-web benchmarks suffer reproducibility drift; evaluators can be model-dependent.
- Trajectory judging collapses beyond roughly 32k tokens; scaling evaluation remains unsolved (a chunked-judging sketch follows this theme).
- Reported results indicate implicit-memory failures persist even in memory-augmented agents.
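One pragmatic response to judge length-collapse is hierarchical scoring: judge bounded segments, then aggregate. A sketch, with `call_judge` as a hypothetical segment scorer returning a number:

```python
def chunk_steps(steps: list, max_chars: int = 8000) -> list:
    """Greedily pack trajectory steps into segments small enough that each
    judge prompt stays well below the regime where judging degrades."""
    segments, current, size = [], [], 0
    for step in steps:
        if current and size + len(step) > max_chars:
            segments.append(current)
            current, size = [], 0
        current.append(step)
        size += len(step)
    if current:
        segments.append(current)
    return segments

def score_trajectory(steps: list, call_judge) -> float:
    """Worst-segment aggregation: one bad segment sinks the whole run. Mean or
    summarize-then-judge aggregation are reasonable alternatives."""
    if not steps:
        return 0.0
    return min(call_judge("\n".join(seg)) for seg in chunk_steps(steps))
```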
Theme: Grounding and integration (RAG, memory, code)
- Why it matters: reliability failures often come from integration rather than retrieval: evidence gets ignored, polluted memories get written, or APIs get fabricated. Practical pipelines are emerging, with measurable upper bounds.
- Representative papers:
- Common methods:
- Separate concerns: reasoning vs evidence (Inner-Answer vs Refer-Answer; active memory actions).
- Use post-hoc or tool-based layers (joint-decoding interventions; search/add/buffer/ignore; static analyzers plus repair; an import-check sketch follows this theme).
- Quantify upper bounds and trade-offs (ceilings on what static analysis can catch; memory-update correctness; magnitude of hallucination reduction).
- Open problems / failure modes:
- Extra latency/compute (dual-path generation; multi-step memory management).
- Dependence on docstrings/type information limits static methods; dynamic languages remain hard.
- Long-term online stability of memory managers still needs validation beyond benchmarks.
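As a concrete instance of the static-analysis bound, a sketch that screens generated Python for import-level library hallucinations; it says nothing about wrong attributes or signatures, which is exactly where the measured upper bounds bite:

```python
import ast
from importlib.util import find_spec

def unresolvable_imports(code: str) -> set:
    """Return top-level modules imported by `code` that do not resolve in the
    current environment. Catches only import-level hallucinations, not wrong
    attribute or signature hallucinations."""
    mods = set()
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            mods |= {alias.name.split(".")[0] for alias in node.names}
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            mods.add(node.module.split(".")[0])
    return {m for m in mods if find_spec(m) is None}

print(unresolvable_imports("import json\nimport totally_madeup_lib"))
# -> {'totally_madeup_lib'}
```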
3) Technical synthesis
- Multiple papers converge on structured intermediates as the reliability lever: prediction sets (conformal), typed logs (AgentBus), explicit tool parameters (Q+), oracle signals (ORACLE-SWE), and trajectory pairs (Plan-RewardBench). A distilled shared-log sketch follows this list.
- Selection effects are exploited repeatedly: conformal singletons are accurate because the system abstains; judge-filtered synthetic trajectories beat larger unfiltered sets; shadow deployments catch regressions that sandboxes miss.
- LLM-as-judge is ubiquitous, but papers increasingly report judge validation (e.g., PPT-Bench human agreement; FGRPO κ=0.997 vs GPT-5) and/or add judge-independent signals (activation steering uses embedding similarity, cross-entropy, ELO).
- Robustness failures increasingly look non-adversarial: emotional framing degrades math performance; UI icons are safety-themed; router rewrites stay schema-valid; OPD collapse looks like "the model getting worse" but is actually training dynamics.
- A clear split is emerging between black-box deployable defenses (conformal layers; prompt mitigations; static analysis; client-side router gating/logging) and white-box mechanistic defenses (activation steering; ESI parameter interventions; PRAC patch construction).
- Long-horizon settings expose evaluator fragility: Plan-RewardBench shows pairwise LLM judges collapsing beyond roughly 32k tokens, motivating more robust discriminative RMs or hierarchical evaluation.
- "Memory" is splitting into explicit storage (MemReader's active writes) and implicit behavioral adaptation (ImplicitMemBench); the latter cannot be fixed by retrieval alone.
- Systems work (KV offloading, OPD stability) shows inference/training optimizations can silently change task accuracy, so robustness evaluation must include infrastructure variants.
- Safety evaluation is moving toward ecosystem measurement (router marketplaces, poisoning studies) rather than lab-only attacks, yielding concrete prevalence numbers and actionable mitigations.
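A distilled sketch of the shared-log structured intermediate (inspired by, but not reproducing, the LogAct/AgentBus designs): actions become typed entries that a policy must approve before execution, giving pre-execution veto plus an audit trail; the entry schema is hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class LogEntry:
    agent: str
    action: str
    args: dict
    status: str = "proposed"   # proposed -> approved | vetoed

@dataclass
class SharedLog:
    entries: list = field(default_factory=list)

    def propose(self, entry: LogEntry) -> LogEntry:
        self.entries.append(entry)     # the audit trail keeps vetoed actions too
        return entry

    def review(self, entry: LogEntry, policy) -> None:
        entry.status = "approved" if policy(entry) else "vetoed"

log = SharedLog()
e = log.propose(LogEntry("coder", "delete_file", {"path": "/tmp/x"}))
log.review(e, policy=lambda entry: entry.action != "delete_file")
assert e.status == "vetoed"            # pre-execution veto, recorded for audit
```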
4) Top 5 papers (with "why now")
1) Your Agent Is Mine: Measuring Malicious Intermediary Attacks on the LLM Supply Chain
- Quantifies a real but under-discussed risk: API routers terminate TLS and can rewrite executable tool-call JSON.
- Large-scale ecosystem measurement (28 paid + 400 free routers) observing active injection and credential touching, plus a poisoning study.
- Evaluates practical client-side mitigations (policy gating, anomaly screening, transparency logs) and shows the attack proxy works across multiple agent frameworks.
- Caveats: client-side defenses cannot provide cryptographic provenance; measurement may miss adaptive behavior that never triggers.
2) From Debate to Decision: Conformal Social Choice for Safe Multi-Agent Deliberation
- Reframes debate output as act vs escalate under a user-set risk budget α, with split-conformal marginal coverage guarantees.
- Empirically targets the key failure, wrong unanimous consensus: 23.9% of initially divergent cases converge to a unanimous but wrong answer by round 3; at α=0.05 the conformal layer intercepts 81.9% via escalation.
- Black-box and post hoc: deployable on proprietary models via verbalized probabilities plus aggregation.
- Caveats: guarantees are marginal and assume exchangeability; evaluated on closed-set multiple-choice tasks.
3) ClawBench: Can AI Agents Complete Everyday Online Tasks?
- A live-web benchmark with safety interception of final submissions and five-layer trajectory recording, bridging realism and safety.
- Shows large gaps versus sandboxed benchmarks: the best reported model (Claude Sonnet 4.6) reaches 33.3% SR; GPT-5.4 reaches 6.5%.
- Provides traceable failure diagnosis via agentic evaluators and comparisons against human trajectories.
- Caveats: live-web variability and manual endpoint annotation limit scalability and reproducibility.
4) ACIArena: Toward Unified Evaluation for Agent Cascading Injection
- Standardizes multi-agent cascading-injection evaluation covering 28 attacks, 3 surfaces, and 3 objectives, with 6 integrated MAS frameworks.
- Finds high vulnerability (90–100% ASR is common on code tasks; LLM Debate is cited at 100% hijack ASR), and some defenses fail or cost utility.
- Proposes ACI-SENTINEL (semantic-minimality pruning), which substantially reduces ASR in reported cases.
- Caveats: evaluation scale is limited by query cost; defenses introduce utility trade-offs and are not universally effective.
5) The Art of (Mis)alignment: How Fine-Tuning Methods Effectively Misalign and Realign LLMs in Post-Training
- Maps attack/defense dynamics across common methods: ORPO is strongest for misalignment; DPO is strongest for realignment (often at a utility cost).
- Shows misalignment can be data-efficient (LoRA with as few as 13 unsafe examples works in some settings).
- Highlights model-specific resistance patterns (Gemma2 resists SFT misalignment but not ORPO).
- Caveats: unsafety relies on an LLM-judge ensemble; proprietary models and full RLHF are not covered.
5) Practical next steps
- Add an act/escalate layer to any multi-agent or ensemble system: implement split conformal over aggregated probabilities (or similar scores) and measure the trade-off between reduced automation errors and escalation rate.
- For tool-using agents, treat routers as untrusted: deploy fail-closed policies for high-risk tools, add response anomaly screening, and implement append-only transparency logs for forensics.
- Red-team GUI/CUA stacks with non-textual attacks (semantic-level UI icon injection; visual preference redirection); measure persistence and cross-model transfer, not just one-shot success.
- If you release open-weight models, assume post-training misalignment is cheap: run ORPO/LoRA-style adversarial tuning on candidate releases; evaluate how well DPO or targeted interventions restore safety and at what utility cost.
- Upgrade evaluations to include live-web or trajectory-level metrics (navigation drift, tool-use effort deviation) plus long-context judge failure checks (e.g., >32k-token trajectories).
- For SWE agents, prioritize reproduction-test generation/extraction and richer execution context: ORACLE-SWE indicates reproduction tests dominate oracle gains, and combined signals approach full success.
- Audit infrastructure changes (KV offloading, OPD/RL pipelines) on context-intensive benchmarks and monitor length/repetition; treat "optimizations" as a potential source of accuracy regressions (a monitor sketch follows this list).
- For RAG, measure integration failures (parametric override; fragmented integration) and test dual-path plus fusion approaches; do not assume retrieval improvements automatically become factuality gains.
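A sketch of the suggested length/repetition monitor for infrastructure audits; the thresholds and the 4-gram choice are illustrative:

```python
def repetition_ratio(text: str, n: int = 4) -> float:
    """Share of repeated n-grams; rises sharply under degenerate looping."""
    toks = text.split()
    grams = [tuple(toks[i:i + n]) for i in range(max(len(toks) - n + 1, 0))]
    return 1.0 - len(set(grams)) / len(grams) if grams else 0.0

def mean(xs) -> float:
    xs = list(xs)
    return sum(xs) / len(xs) if xs else 0.0

def drift_alerts(baseline: list, candidate: list) -> list:
    """Compare outputs from a candidate config (e.g., KV offloading enabled)
    against a baseline config on the same fixed probe prompts."""
    alerts = []
    if mean(len(t.split()) for t in candidate) > 1.5 * mean(len(t.split()) for t in baseline):
        alerts.append("output length inflated >1.5x vs baseline")
    if mean(repetition_ratio(t) for t in candidate) > mean(repetition_ratio(t) for t in baseline) + 0.1:
        alerts.append("4-gram repetition up by >0.1 vs baseline")
    return alerts
```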
Generated from per-paper analysis; no external browsing.
