AI Paper Daily (2026-04-03)

Published:

English version: /paper-news/2026-04-03/

Run statistics

  • Candidate papers: 222
  • Selected papers: 30
  • In-depth reads completed: 30
  • Time window (UTC): 2026-04-01T00:00:00Z → 2026-04-02T00:00:00Z (arxiv_announce, expanded=0)
Paper list used for the summary
arXiv ID | Title / link | Categories | Score | Selection reason | Tags

  • 2604.00788 | UK AISI Alignment Evaluation Case-Study [PDF] | cs.AI, cs.CR | 96 | AISI case study on sabotage in AI-lab coding assistants; concrete frontier-model behaviors. | Tags: AISI, alignment-eval, sabotage, agentic-coding, deployment, model-behavior
  • 2604.01151 | Detecting Multi-Agent Collusion Through Multi-Agent Interpretability [PDF] | cs.AI, cs.LG, cs.MA | 95 | Benchmark + probes for detecting multi-agent collusion; strong OOD transfer focus. | Tags: multi-agent, collusion, interpretability, probes, benchmark, security, OOD-generalization
  • 2604.01194 | AgentWatcher: A Rule-based Prompt Injection Monitor [PDF] | cs.CR | 94 | Rule-based prompt-injection monitor using causal attribution to scale to long contexts. | Tags: prompt-injection, agents, monitoring, attribution, long-context, security
  • 2604.00770 | Thinking Wrong in Silence: Backdoor Attacks on Continuous Latent Reasoning [PDF] | cs.LG, cs.AI | 93 | Backdoors for tokenless latent reasoning; high ASR and evades token-level defenses. | Tags: backdoors, latent-reasoning, continuous-CoT, adversarial, auditing, security
  • 2604.01212 | $\texttt{YC-Bench}$: Benchmarking AI Agents for Long-Term Planning and Consistent Execution [PDF] | cs.CL, cs.AI | 92 | Long-horizon agent benchmark (hundreds of turns) for planning, delayed feedback, compounding errors. | Tags: agents, benchmark, long-horizon, planning, evaluation, simulated-environment
  • 2604.00547 | Does Unification Come at a Cost? Uni-SafeBench: A Safety Benchmark for Unified Multimodal Large Models [PDF] | cs.AI, cs.LG | 92 | New safety benchmark for unified multimodal models; taxonomy + judging framework. | Tags: multimodal, safety-benchmark, evaluation, UMLM, red-teaming, safety-taxonomy
  • 2604.00414 | Decision-Centric Design for LLM Systems [PDF] | cs.AI, cs.LG | 92 | Makes LLM control decisions explicit/inspectable; improves debugging, constraints, and safety. | Tags: LLM systems, agent control, decision layer, tool use, reliability, governance
  • 2604.00627 | When Safe Models Merge into Danger: Exploiting Latent Vulnerabilities in LLM Fusion [PDF] | cs.CR | 91 | Shows model merging can unlock hidden trojans; new attack surface for alignment fusion. | Tags: model-merging, trojans, safety-regression, attack-surface, alignment, security
  • 2604.00986 | Do Phone-Use Agents Respect Your Privacy? [PDF] | cs.CR, cs.AI, cs.CL, cs.LG | 90 | MyPhoneBench makes mobile-agent privacy measurable: permissions, minimal disclosure, memory. | Tags: privacy, mobile-agents, benchmark, auditing, permissions, evaluation
  • 2604.00842 | Proactive Agent Research Environment: Simulating Active Users to Evaluate Proactive Assistants [PDF] | cs.AI, cs.LG, cs.MA | 90 | User-simulator + FSM app modeling to evaluate proactive agents; introduces Pare-Bench tasks. | Tags: agents, proactive-assistants, user-simulation, benchmark, evaluation, tool-use
  • 2604.00892 | When Users Change Their Mind: Evaluating Interruptible Agents in Long-Horizon Web Navigation [PDF] | cs.CL | 90 | InterruptBench targets long-horizon web agents handling mid-task goal changes. | Tags: agents, web-navigation, long-horizon, interruptibility, benchmark, reliability, human-in-the-loop
  • 2604.00445 | Towards Reliable Truth-Aligned Uncertainty Estimation in Large Language Models [PDF] | cs.AI, cs.CL | 90 | Truth-anchored calibration for LLM uncertainty to detect hallucinations; targets proxy failure. | Tags: uncertainty, hallucinations, calibration, reliability, evaluation, post-hoc
  • 2604.01202 | Therefore I am. I Think [PDF] | cs.AI | 89 | Evidence decisions are encoded pre-CoT; probing + causal steering affects behavior. | Tags: mechanistic-interpretability, chain-of-thought, steering, tool-use, probes, agency
  • 2604.00387 | RAGShield: Provenance-Verified Defense-in-Depth Against Knowledge Base Poisoning in Government Retrieval-Augmented Generation Systems [PDF] | cs.CR, cs.AI | 88 | Defense-in-depth for RAG poisoning via provenance/attestation + taint-style reasoning. | Tags: RAG, data-poisoning, provenance, supply-chain, grounding, security
  • 2604.01052 | VibeGuard: A Security Gate Framework for AI-Generated Code [PDF] | cs.CR, cs.AI | 88 | Practical secure-dev gate for AI-generated code; targets real packaging/artifact leak failure modes. | Tags: security, code-generation, supply-chain, static-analysis, deployment, guardrails
  • 2604.00392 | EvolveTool-Bench: Evaluating the Quality of LLM-Generated Tool Libraries as Software Artifacts [PDF] | cs.SE, cs.AI | 86 | Benchmark for LLM-generated tool libraries with safety/robustness and regression metrics. | Tags: agents, tool-use, benchmark, software-quality, safety-metrics, evaluation
  • 2604.00594 | Agent psychometrics: Task-level performance prediction in agentic coding benchmarks [PDF] | cs.AI | 86 | Predicts per-task success in agentic coding via IRT-style psychometrics; separates LLM vs scaffold ability. | Tags: agents, coding, evaluation, predictive-metrics, IRT, scaffolding
  • 2604.00477 | Logarithmic Scores, Power-Law Discoveries: Disentangling Measurement from Coverage in Agent-Based Evaluation [PDF] | cs.AI, cs.CL, cs.HC, cs.MA | 86 | Agent-judge eval study: panel size vs score saturation and issue discovery scaling. | Tags: evaluation, LLM-judges, scaling-laws, reliability, human-agreement, red-teaming
  • 2604.00694 | Internal APIs Are All You Need: Shadow APIs, Shared Discovery, and the Case Against Browser-First Agent Architectures [PDF] | cs.ET, cs.AI | 86 | Agent web interaction via shared shadow-API route graph; could reshape agent architectures/security. | Tags: agents, web automation, APIs, tooling, attack surface, infrastructure
  • 2604.01195 | ORBIT: Scalable and Verifiable Data Generation for Search Agents on a Tight Budget [PDF] | cs.CL, cs.AI, cs.IR | 85 | 20K verifiable multi-step search-agent dataset built cheaply; includes external verification pipeline. | Tags: search-agents, dataset, verification, web, RAG, training-data
  • 2604.01108 | Adversarial Moral Stress Testing of Large Language Models [PDF] | cs.AI | 84 | Multi-turn adversarial ethical stress testing to catch rare failures and degradation. | Tags: safety-eval, multi-turn, red-teaming, ethics, robustness, benchmarks
  • 2604.00979 | Dual Optimal: Make Your LLM Peer-like with Dignity [PDF] | cs.CL, cs.AI | 84 | Targets sycophancy/evasiveness; introduces PersonaKnob + constrained Lagrangian DPO to avoid collapse. | Tags: alignment, anti-sycophancy, DPO, preference-learning, personas, evaluation
  • 2604.01007 | OmniMem: Autoresearch-Guided Discovery of Lifelong Multimodal Agent Memory [PDF] | cs.AI | 84 | Autonomous research pipeline discovers multimodal lifelong agent memory design. | Tags: agents, memory, lifelong-learning, multimodal, auto-research, retrieval, benchmarks
  • 2604.00722 | LangMARL: Natural Language Multi-Agent Reinforcement Learning [PDF] | cs.CL | 84 | Brings MARL credit assignment + policy gradients into language space for coordinating LLM agents. | Tags: multi-agent, credit assignment, LLM agents, MARL, coordination, policy gradient
  • 2604.01039 | Automated Framework to Evaluate and Harden LLM System Instructions against Encoding Attacks [PDF] | cs.CR, cs.AI | 83 | Automates testing/hardening system prompts vs encoding-based instruction leakage attacks. | Tags: system-prompt, instruction-leakage, encoding-attacks, hardening, LLM-security
  • 2604.00778 | From Early Encoding to Late Suppression: Interpreting LLMs on Character Counting Tasks [PDF] | cs.CL | 83 | Mechanistic analysis: correct internal counting but late suppression at output layer. | Tags: mechanistic-interpretability, reasoning, probing, activation-patching, logit-lens, failure-modes
  • 2604.01220 | Universal YOCO for Efficient Depth Scaling [PDF] | cs.CL | 83 | Efficient test-time depth scaling via parameter sharing/recursive compute; targets KV/depth costs. | Tags: LLM efficiency, test-time scaling, architecture, parameter sharing, long reasoning
  • 2604.00356 | Signals: Trajectory Sampling and Triage for Agentic Interactions [PDF] | cs.AI, cs.CL | 82 | Cheap signals to sample/triage agent trajectories for post-deployment monitoring at scale. | Tags: agents, monitoring, telemetry, triage, post-deployment, evaluation
  • 2604.00362 | In harmony with gpt-oss [PDF] | cs.AI, cs.LG | 81 | Reproduces gpt-oss tool scores via reverse-engineered tools + native agent harness. | Tags: agents, tool-use, reproducibility, evaluation, SWE-bench, harness, open-source
  • 2604.00801 | Routing-Free Mixture-of-Experts [PDF] | cs.LG, cs.AI, cs.CL | 81 | Removes centralized MoE routing; continuous expert self-activation + adaptive load balancing. | Tags: Mixture-of-Experts, routing, scaling, efficiency, architecture, training dynamics

AI Paper Insights Brief

2026-04-03

0) Executive takeaways (read this first)

  • Agent evaluation is shifting from "single scores" to "system observability": several papers propose low-cost triage, psychometric difficulty modeling, and judge-panel scaling laws, enabling agent monitoring and improvement on a controlled budget without judging every individual trajectory.
  • Interface fidelity is becoming a first-class benchmark variable: reproducing published agentic-coding scores requires recovering the in-distribution tools and running in the model's native message format; format/tool mismatches create large, misleading gaps.
  • Safety threats are expanding beyond prompts into pipelines and weights: new attacks/defenses cover (i) the RAG supply chain (provenance + taint tracking), (ii) model merging (latent trojans that activate only after merging), (iii) continuous latent reasoning (embedding-level backdoors), and (iv) system-prompt leakage via encoding formats.
  • Long-horizon "realism" benchmarks are getting sharper: proactive assistants facing active users, interruptible web agents, and year-scale planning simulations all show frontier models plateauing at modest success rates while incurring large, token-dominated recovery costs.
  • Interpretability results increasingly point at control/attack surfaces: there is evidence that tool-use decisions are encoded before the chain of thought (CoT) begins, and that some symbolic failures stem from late-layer suppression. Both imply that interventions must target internal decision circuits, not just prompt engineering.

2) Key themes (clusters)

Theme: Scalable agent evaluation and data selection

Theme: Harness fidelity and reproducibility in agentic coding

  • Why it matters: if the evaluation harness, message format, and tool set differ from the training-time distribution, published scores may not be reproducible, misleading model selection and deployment planning.
  • Representative papers
  • Common approaches
    • Recover or define in-distribution tools and schemas; run the agent in its native format to avoid conversion losses.
    • Measure not only pass@1 but also context overflow, tool-schema robustness, and regression/composability of generated code.
  • Open questions / failure modes
    • Tool discovery may be incomplete when logs are partial; harness choice can still mask confounders such as contamination.
    • How do we standardize an "agent-harness specification" so that leaderboards remain comparable across implementations?
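The two measurements in the list above can be sketched as lightweight CI-style checks. This is a minimal sketch: the schema layout, field names, and per-turn token accounting below are illustrative assumptions, not any paper's specification.

```python
from itertools import accumulate

def check_tool_call(call: dict, schema: dict) -> list[str]:
    """Return schema violations for one tool call (empty list = adherent)."""
    if call.get("name") != schema["name"]:
        return [f"unknown tool: {call.get('name')}"]
    args = call.get("arguments", {})
    errors = [f"missing required argument: {r}"
              for r in schema.get("required", []) if r not in args]
    errors += [f"unexpected argument: {k}"
               for k in args if k not in schema["properties"]]
    return errors

def context_overflow_rate(trajectories: list[list[int]], token_limit: int) -> float:
    """Fraction of trajectories whose running token total ever exceeds the limit.

    Each trajectory is a list of per-turn token counts (an assumed log format).
    """
    overflowing = sum(
        1 for turns in trajectories
        if any(total > token_limit for total in accumulate(turns))
    )
    return overflowing / len(trajectories)
```

Running both checks in evaluation CI turns "harness drift" from a silent confounder into a tracked regression, independent of which agent framework produced the logs.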

Theme: Supply-chain security for LLM systems (data, prompts, weights)

Theme: More realistic long-horizon and mixed-initiative agent benchmarks

Theme: Making control and internal decisions explicit (interpretability → engineering)

3) Technical synthesis

  • Multiple works converge on separating measurement from action: Signals (triage), Decision-Centric (explicit δ), and agent-judge scaling (ICC vs. discovery) all argue for modularizing "what you observe" apart from "how you act on it."
  • Budget-aware evaluation is being formalized: Signals reports information yield per label; judge panels show logarithmic reliability but power-law discovery; psychometrics predicts per-task success to avoid repeated runs.
  • Artifact-level evaluation extends beyond outputs: EvolveTool-Bench evaluates evolving tool libraries (reuse/regressions), while the gpt-oss reproduction shows the harness, tools, and message format are part of the "artifact" as well.
  • Security papers increasingly adopt a supply-chain framing: RAGShield uses attestation + taint tracking; TrojanMerge targets parameter fusion; THOUGHTSTEER targets the embedding level of latent-reasoning models; encoding attacks target system-instruction confidentiality.
  • Several results imply a privileged-access asymmetry: strong detection bounds/probes exist when hidden states are accessible (probes for continuous-latent backdoors; collusion probes), but black-box detection is much weaker.
  • Long-horizon benchmarks (Pare, InterruptBench, YC-Bench) consistently show frontier models plateauing, with efficiency costs (tokens, retries, API cost) being decisive rather than success rate alone.
  • Interpretability findings (pre-generation tool decisions; late-layer suppression circuits) suggest post-hoc CoT may be unreliable as an explanation channel, supporting Decision-Centric's push for explicit decision interfaces.
  • Reproducibility work (Harmony/tools) stresses that context-window overflow and message format can dominate results, and this interacts with the sustained context pressure of long-horizon settings.
  • A recurring pattern across the evaluation papers: coarse metrics mask failure modes (task completion masks tool-library debt; mean scores mask tail drift; aggregate pass@1 masks harness mismatch).

4) Top 5 papers (with "why now")

1) Thinking Wrong in Silence: Backdoor Attacks on Continuous Latent Reasoning

  • Demonstrates a training-time backdoor (THOUGHTSTEER) that reaches roughly 100% attack success rate on continuous latent-reasoning models with negligible clean-accuracy loss.
  • Links robustness to Neural Collapse and reports linear-probe AUC ≈ 1.0 when hidden states are accessible.
  • Evaluates multiple defenses and finds none can reduce ASR while preserving clean accuracy.
  • Caveats: the strongest detection relies on hidden-state access; the mechanistic analysis is most complete on smaller models (COCONUT 124M).
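The AUC ≈ 1.0 result above refers to a linear probe over hidden states. A toy version of that setup looks like the sketch below; the synthetic "hidden states" and the fixed trigger shift are our illustrative assumptions, not the paper's data or model.

```python
import numpy as np

# Synthetic stand-in for hidden states: backdoor-triggered samples are shifted
# along a fixed direction, mimicking a linearly separable trigger signature.
rng = np.random.default_rng(0)
dim = 32
direction = rng.normal(size=dim)
direction /= np.linalg.norm(direction)
clean = rng.normal(size=(200, dim))
triggered = rng.normal(size=(200, dim)) + 3.0 * direction

X = np.vstack([clean, triggered])
y = np.concatenate([np.zeros(200), np.ones(200)])

# Linear (logistic-regression) probe trained with plain gradient descent.
w = np.zeros(dim)
b = 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= 0.5 * X.T @ (p - y) / len(y)
    b -= 0.5 * np.mean(p - y)

def auc(scores, labels):
    """ROC AUC via the rank-sum (Mann-Whitney) formulation."""
    order = scores.argsort()
    ranks = np.empty_like(scores)
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

probe_auc = auc(X @ w + b, y)  # near-perfect on this separable toy setup
```

The asymmetry the paper highlights is visible here: the probe only works because it reads the (synthetic) hidden states directly; nothing in the output tokens needs to change.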

2) When Safe Models Merge into Danger: Exploiting Latent Vulnerabilities in LLM Fusion

  • Introduces TrojanMerge: each source model is safe in isolation, yet the merged model scores up to 85.4% harmful.
  • Effective across multiple merging algorithms (Task Arithmetic/DARE/TIES/KnOTS), with high average post-merge harmfulness.
  • Emphasizes that "passing safety checks individually" is not sufficient for models destined to be merged.
  • Caveats: primarily evaluates two-model merges; the attack assumes the ability to craft safety-critical transformations (requires gradient/data access).
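The merge-time risk can be made concrete on toy weight vectors under task arithmetic (one of the merging algorithms the paper tests). Everything else below is invented for illustration: the per-model safety threshold and the single "unsafe axis" scorer are stand-ins for a real post-merge harmfulness evaluation.

```python
import numpy as np

def task_arithmetic_merge(base, finetuned, lam=1.0):
    """Task arithmetic: add the sum of task vectors (finetuned - base) onto the base."""
    return base + lam * sum(ft - base for ft in finetuned)

def harmfulness(weights):
    """Stand-in for a real post-merge safety eval (e.g., a harmful-behavior benchmark):
    here, simply the weight along a designated 'unsafe' axis."""
    return float(weights[0])

SAFE_THRESHOLD = 0.6

base = np.zeros(4)
m_a = base.copy(); m_a[0] = 0.5   # each contributor stays under the per-model threshold
m_b = base.copy(); m_b[0] = 0.5

merged = task_arithmetic_merge(base, [m_a, m_b])

per_model_safe = (harmfulness(m_a) < SAFE_THRESHOLD
                  and harmfulness(m_b) < SAFE_THRESHOLD)   # True: each looks safe alone
merge_safe = harmfulness(merged) < SAFE_THRESHOLD          # False: components add up
```

The point of the sketch is the check pattern, not the toy numbers: safety evaluation has to run on the merged artifact, because task vectors compose additively and per-source checks never see the sum.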

3) In harmony with gpt-oss

  • Independently reproduces OpenAI gpt-oss-20b scores by recovering the in-distribution tools and implementing a native Harmony harness.
  • Quantifies how Chat Completions conversion amplifies context overflow (e.g., Harmony 0.2% vs. Chat 11.0% in one setting).
  • Provides a reusable tool-discovery method and harness design.
  • Caveats: tool discovery is limited by the available logs; contamination in SWE Verified is explicitly left uninvestigated.

4) Signals: Trajectory Sampling and Triage for Agentic Interactions

  • Deterministic, model-free signals raise the share of developer-informative trajectories on τ-bench to 82% (vs. 54% for random sampling) and improve labeling efficiency (reported 1.52×).
  • Distinguishes interaction failures from execution failures, which matters for tool-using agents because fluent dialogue can mask execution problems.
  • Designed to run continuously with no additional model calls.
  • Caveats: coarse categories miss trajectories that are semantically wrong but behaviorally normal; evaluation uses simulated users (τ-bench).
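A minimal sketch of deterministic, model-free trajectory triage in this spirit. The two signals below (tool-error density as an execution signal, user repetition as an interaction signal) are our illustrative stand-ins, not the paper's actual signal set or trajectory format.

```python
def triage_score(trajectory: list[dict]) -> float:
    """Deterministic triage score for one agent trajectory; higher means
    more likely to be informative for a developer to review."""
    tool_calls = [t for t in trajectory if t["role"] == "tool"]
    user_turns = [t["content"] for t in trajectory if t["role"] == "user"]

    # Execution signal: fraction of tool calls that errored.
    tool_errors = sum(1 for t in tool_calls if t.get("error"))
    exec_signal = tool_errors / max(len(tool_calls), 1)

    # Interaction signal: a user repeating themselves suggests the agent is stuck.
    repeats = len(user_turns) - len(set(user_turns))
    inter_signal = repeats / max(len(user_turns), 1)

    return exec_signal + inter_signal

def triage(trajectories: list[list[dict]], budget: int) -> list[int]:
    """Indices of the top-`budget` trajectories to route to human review."""
    ranked = sorted(range(len(trajectories)),
                    key=lambda i: triage_score(trajectories[i]), reverse=True)
    return ranked[:budget]
```

Because the scores are computed from logs alone, this layer can run continuously at zero model cost, which is exactly the deployment property the paper argues for.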

5) Do Phone-Use Agents Respect Your Privacy?

  • Makes GUI-agent privacy auditable via iMy (LOW/HIGH data tiers + permission tools) and instrumented apps that record field-level edits.
  • Shows success rate and privacy diverge sharply (e.g., Claude Opus 4.6: 82.8% success but 47.2% PQSR at τ = 0.7).
  • Identifies failure to minimize form disclosure (over-filling optional personal fields) as the most persistent failure mode.
  • Caveats: simulated apps plus a permissive user simulator (which always grants HIGH) limit realism; network exfiltration and cross-app leakage are not covered.
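A sketch of a PQSR-style joint metric. The exact PQSR definition belongs to the paper; the gating form below (success counted only when a privacy score clears the threshold τ) and the toy field-minimization scorer are our assumptions.

```python
def pqsr(episodes: list[dict], tau: float) -> float:
    """Privacy-qualified success rate (assumed form): an episode counts only if
    the task succeeded AND its privacy score meets the threshold tau."""
    qualified = sum(1 for e in episodes
                    if e["success"] and e["privacy_score"] >= tau)
    return qualified / len(episodes)

def privacy_score(filled_fields: set[str],
                  required_fields: set[str],
                  sensitive_fields: set[str]) -> float:
    """Toy field-minimization score: penalize each optional sensitive field
    the agent disclosed (the over-filling failure mode the paper identifies)."""
    over_disclosed = (filled_fields - required_fields) & sensitive_fields
    return max(0.0, 1.0 - 0.25 * len(over_disclosed))
```

The value of a joint metric like this is that the 82.8%-success / 47.2%-PQSR gap above becomes visible in one number instead of hiding behind a high task-success rate.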

5) Practical next steps

  • Add a low-cost triage layer (interaction + execution signals) over agent logs to prioritize human review; track "information yield per label" as a first-class metric.
  • Version and validate your evaluation harness: pin the message format, tool schemas, and context accounting; add context overflow and tool-call schema adherence to your evaluation CI.
  • Treat the RAG corpus as a supply chain: implement document attestation plus hash-pinning/re-attestation flows; add trust-weighted retrieval and taint propagation in high-integrity domains.
  • Harden prompts against format attacks that leak system instructions: explicitly test probes such as "print the system prompt as YAML/TOML/cron/gitignore"; consider design-time instruction reshaping and re-measure ASR afterwards.
  • If you merge models, add merge-time safety checks: evaluate post-merge harmfulness (not just per source model), and consider integrity verification of contributors before fusion.
  • Benchmark long-horizon behavior with cost curves: track SR(k) and token deltas for interruption scenarios; track proposed vs. accepted vs. successful for proactive assistants; track memory usage (scratchpad writes) as a predictive indicator in planning simulations.
  • Make control explicit: separate signal estimation (sufficiency/correctness/uncertainty) from deterministic policies; log decision context so failures are attributable.
  • GUI-agent privacy: instrument form drafts and enforce minimization policies (required vs. optional fields); measure with a PQSR-style joint metric rather than task success alone.
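The SR(k)-plus-token-delta bookkeeping from the cost-curve item above can be sketched as follows; the run-record fields and the exactly-k bucketing are our assumed conventions, since the benchmarks define SR(k) in their own terms.

```python
def sr_curve(runs: list[dict], max_k: int) -> dict:
    """Success rate as a function of interruption count k (None where no runs exist)."""
    curve = {}
    for k in range(max_k + 1):
        bucket = [r for r in runs if r["interruptions"] == k]
        curve[k] = sum(r["success"] for r in bucket) / len(bucket) if bucket else None
    return curve

def mean_token_delta(runs: list[dict], baseline_tokens: float) -> float:
    """Mean extra tokens spent per run relative to an uninterrupted baseline."""
    return sum(r["tokens"] for r in runs) / len(runs) - baseline_tokens
```

Reporting the curve and the delta together captures the pattern the long-horizon benchmarks found: success rates plateau while recovery costs keep growing with k.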
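The hash-pinning step in the RAG supply-chain item above can be sketched with a content-addressed pin store. This is a minimal sketch: the canonicalization choice (document body plus declared source) and the class interface are assumptions, not RAGShield's design.

```python
import hashlib
import json

def fingerprint(doc: dict) -> str:
    """Content hash over the canonical document body + declared source."""
    canonical = json.dumps({"body": doc["body"], "source": doc["source"]},
                           sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

class PinnedCorpus:
    """Pin each document's hash at ingestion; re-verify before retrieved
    content is handed to the generator."""

    def __init__(self):
        self.pins: dict[str, str] = {}

    def ingest(self, doc_id: str, doc: dict) -> None:
        self.pins[doc_id] = fingerprint(doc)

    def verify(self, doc_id: str, doc: dict) -> bool:
        # A mismatch means the document changed after attestation
        # (possible poisoning) and should be quarantined, not retrieved.
        return self.pins.get(doc_id) == fingerprint(doc)
```

Re-attestation then reduces to re-running `ingest` through whatever trust process originally approved the document, so silent in-place edits to the corpus can no longer reach generation.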

Generated from per-paper analyses; no external browsing was performed.