AI Paper Daily (2026-04-03)

Published:

English version: /paper-news/2026-04-03/

Run statistics

  • Candidate papers: 222
  • Selected papers: 30
  • In-depth reads completed: 30
  • Time window (UTC): 2026-04-01T00:00:00Z → 2026-04-02T00:00:00Z (arxiv_announce, expanded=0)
Paper list used for the summary
arXiv ID | Title / link | Categories | Score | Selection reason | Tags

  • 2604.00788 | UK AISI Alignment Evaluation Case-Study [PDF] | cs.AI, cs.CR | 96 | AISI case study on sabotage in AI-lab coding assistants; concrete frontier-model behaviors. | Tags: AISI, alignment-eval, sabotage, agentic-coding, deployment, model-behavior
  • 2604.01151 | Detecting Multi-Agent Collusion Through Multi-Agent Interpretability [PDF] | cs.AI, cs.LG, cs.MA | 95 | Benchmark + probes for detecting multi-agent collusion; strong OOD transfer focus. | Tags: multi-agent, collusion, interpretability, probes, benchmark, security, OOD-generalization
  • 2604.01194 | AgentWatcher: A Rule-based Prompt Injection Monitor [PDF] | cs.CR | 94 | Rule-based prompt-injection monitor using causal attribution to scale to long contexts. | Tags: prompt-injection, agents, monitoring, attribution, long-context, security
  • 2604.00770 | Thinking Wrong in Silence: Backdoor Attacks on Continuous Latent Reasoning [PDF] | cs.LG, cs.AI | 93 | Backdoors for tokenless latent reasoning; high ASR and evades token-level defenses. | Tags: backdoors, latent-reasoning, continuous-CoT, adversarial, auditing, security
  • 2604.01212 | $\texttt{YC-Bench}$: Benchmarking AI Agents for Long-Term Planning and Consistent Execution [PDF] | cs.CL, cs.AI | 92 | Long-horizon agent benchmark (hundreds of turns) for planning, delayed feedback, compounding errors. | Tags: agents, benchmark, long-horizon, planning, evaluation, simulated-environment
  • 2604.00547 | Does Unification Come at a Cost? Uni-SafeBench: A Safety Benchmark for Unified Multimodal Large Models [PDF] | cs.AI, cs.LG | 92 | New safety benchmark for unified multimodal models; taxonomy + judging framework. | Tags: multimodal, safety-benchmark, evaluation, UMLM, red-teaming, safety-taxonomy
  • 2604.00414 | Decision-Centric Design for LLM Systems [PDF] | cs.AI, cs.LG | 92 | Makes LLM control decisions explicit/inspectable; improves debugging, constraints, and safety. | Tags: LLM systems, agent control, decision layer, tool use, reliability, governance
  • 2604.00627 | When Safe Models Merge into Danger: Exploiting Latent Vulnerabilities in LLM Fusion [PDF] | cs.CR | 91 | Shows model merging can unlock hidden trojans; new attack surface for alignment fusion. | Tags: model-merging, trojans, safety-regression, attack-surface, alignment, security
  • 2604.00986 | Do Phone-Use Agents Respect Your Privacy? [PDF] | cs.CR, cs.AI, cs.CL, cs.LG | 90 | MyPhoneBench makes mobile-agent privacy measurable: permissions, minimal disclosure, memory. | Tags: privacy, mobile-agents, benchmark, auditing, permissions, evaluation
  • 2604.00842 | Proactive Agent Research Environment: Simulating Active Users to Evaluate Proactive Assistants [PDF] | cs.AI, cs.LG, cs.MA | 90 | User-simulator + FSM app modeling to evaluate proactive agents; introduces Pare-Bench tasks. | Tags: agents, proactive-assistants, user-simulation, benchmark, evaluation, tool-use
  • 2604.00892 | When Users Change Their Mind: Evaluating Interruptible Agents in Long-Horizon Web Navigation [PDF] | cs.CL | 90 | InterruptBench targets long-horizon web agents handling mid-task goal changes. | Tags: agents, web-navigation, long-horizon, interruptibility, benchmark, reliability, human-in-the-loop
  • 2604.00445 | Towards Reliable Truth-Aligned Uncertainty Estimation in Large Language Models [PDF] | cs.AI, cs.CL | 90 | Truth-anchored calibration for LLM uncertainty to detect hallucinations; targets proxy failure. | Tags: uncertainty, hallucinations, calibration, reliability, evaluation, post-hoc
  • 2604.01202 | Therefore I am. I Think [PDF] | cs.AI | 89 | Evidence decisions are encoded pre-CoT; probing + causal steering affects behavior. | Tags: mechanistic-interpretability, chain-of-thought, steering, tool-use, probes, agency
  • 2604.00387 | RAGShield: Provenance-Verified Defense-in-Depth Against Knowledge Base Poisoning in Government Retrieval-Augmented Generation Systems [PDF] | cs.CR, cs.AI | 88 | Defense-in-depth for RAG poisoning via provenance/attestation + taint-style reasoning. | Tags: RAG, data-poisoning, provenance, supply-chain, grounding, security
  • 2604.01052 | VibeGuard: A Security Gate Framework for AI-Generated Code [PDF] | cs.CR, cs.AI | 88 | Practical secure-dev gate for AI-generated code; targets real packaging/artifact leak failure modes. | Tags: security, code-generation, supply-chain, static-analysis, deployment, guardrails
  • 2604.00392 | EvolveTool-Bench: Evaluating the Quality of LLM-Generated Tool Libraries as Software Artifacts [PDF] | cs.SE, cs.AI | 86 | Benchmark for LLM-generated tool libraries with safety/robustness and regression metrics. | Tags: agents, tool-use, benchmark, software-quality, safety-metrics, evaluation
  • 2604.00594 | Agent psychometrics: Task-level performance prediction in agentic coding benchmarks [PDF] | cs.AI | 86 | Predicts per-task success in agentic coding via IRT-style psychometrics; separates LLM vs scaffold ability. | Tags: agents, coding, evaluation, predictive-metrics, IRT, scaffolding
  • 2604.00477 | Logarithmic Scores, Power-Law Discoveries: Disentangling Measurement from Coverage in Agent-Based Evaluation [PDF] | cs.AI, cs.CL, cs.HC, cs.MA | 86 | Agent-judge eval study: panel size vs score saturation and issue discovery scaling. | Tags: evaluation, LLM-judges, scaling-laws, reliability, human-agreement, red-teaming
  • 2604.00694 | Internal APIs Are All You Need: Shadow APIs, Shared Discovery, and the Case Against Browser-First Agent Architectures [PDF] | cs.ET, cs.AI | 86 | Agent web interaction via shared shadow-API route graph; could reshape agent architectures/security. | Tags: agents, web automation, APIs, tooling, attack surface, infrastructure
  • 2604.01195 | ORBIT: Scalable and Verifiable Data Generation for Search Agents on a Tight Budget [PDF] | cs.CL, cs.AI, cs.IR | 85 | 20K verifiable multi-step search-agent dataset built cheaply; includes external verification pipeline. | Tags: search-agents, dataset, verification, web, RAG, training-data
  • 2604.01108 | Adversarial Moral Stress Testing of Large Language Models [PDF] | cs.AI | 84 | Multi-turn adversarial ethical stress testing to catch rare failures and degradation. | Tags: safety-eval, multi-turn, red-teaming, ethics, robustness, benchmarks
  • 2604.00979 | Dual Optimal: Make Your LLM Peer-like with Dignity [PDF] | cs.CL, cs.AI | 84 | Targets sycophancy/evasiveness; introduces PersonaKnob + constrained Lagrangian DPO to avoid collapse. | Tags: alignment, anti-sycophancy, DPO, preference-learning, personas, evaluation
  • 2604.01007 | OmniMem: Autoresearch-Guided Discovery of Lifelong Multimodal Agent Memory [PDF] | cs.AI | 84 | Autonomous research pipeline discovers multimodal lifelong agent memory design. | Tags: agents, memory, lifelong-learning, multimodal, auto-research, retrieval, benchmarks
  • 2604.00722 | LangMARL: Natural Language Multi-Agent Reinforcement Learning [PDF] | cs.CL | 84 | Brings MARL credit assignment + policy gradients into language space for coordinating LLM agents. | Tags: multi-agent, credit assignment, LLM agents, MARL, coordination, policy gradient
  • 2604.01039 | Automated Framework to Evaluate and Harden LLM System Instructions against Encoding Attacks [PDF] | cs.CR, cs.AI | 83 | Automates testing/hardening system prompts vs encoding-based instruction leakage attacks. | Tags: system-prompt, instruction-leakage, encoding-attacks, hardening, LLM-security
  • 2604.00778 | From Early Encoding to Late Suppression: Interpreting LLMs on Character Counting Tasks [PDF] | cs.CL | 83 | Mechanistic analysis: correct internal counting but late suppression at output layer. | Tags: mechanistic-interpretability, reasoning, probing, activation-patching, logit-lens, failure-modes
  • 2604.01220 | Universal YOCO for Efficient Depth Scaling [PDF] | cs.CL | 83 | Efficient test-time depth scaling via parameter sharing/recursive compute; targets KV/depth costs. | Tags: LLM efficiency, test-time scaling, architecture, parameter sharing, long reasoning
  • 2604.00356 | Signals: Trajectory Sampling and Triage for Agentic Interactions [PDF] | cs.AI, cs.CL | 82 | Cheap signals to sample/triage agent trajectories for post-deployment monitoring at scale. | Tags: agents, monitoring, telemetry, triage, post-deployment, evaluation
  • 2604.00362 | In harmony with gpt-oss [PDF] | cs.AI, cs.LG | 81 | Reproduces gpt-oss tool scores via reverse-engineered tools + native agent harness. | Tags: agents, tool-use, reproducibility, evaluation, SWE-bench, harness, open-source
  • 2604.00801 | Routing-Free Mixture-of-Experts [PDF] | cs.LG, cs.AI, cs.CL | 81 | Removes centralized MoE routing; continuous expert self-activation + adaptive load balancing. | Tags: Mixture-of-Experts, routing, scaling, efficiency, architecture, training dynamics

AI Paper Insights Brief

2026-04-03

0) Executive takeaways (read this first)

  • Agent evaluation is shifting from "single scores" to "system observability": several papers propose low-cost triage, psychometric difficulty modeling, and judge-panel scaling laws, enabling agent monitoring and improvement on a controlled budget without judging every individual trajectory.
  • Interface fidelity is becoming a first-class benchmark variable: reproducing published agentic-coding scores requires recovering the in-distribution tools and running in the model's native message format; format/tool mismatches create large, misleading gaps.
  • Safety threats are expanding beyond prompts into pipelines and weights: new attacks/defenses cover (i) the RAG supply chain (provenance + taint tracking), (ii) model merging (latent trojans that activate only after merging), (iii) continuous latent reasoning (embedding-level backdoors), and (iv) system-prompt leakage via encoding formats.
  • Long-horizon "realism" benchmarks are getting sharper: proactive assistants facing active users, interruptible web agents, and year-scale planning simulations all show frontier models plateauing at modest success rates while incurring large, token-dominated recovery costs.
  • Interpretability results increasingly point at control/attack surfaces: there is evidence that tool-use decisions are encoded before the chain of thought (CoT) begins, and that some symbolic failures stem from late-layer suppression. Both imply that interventions must target internal decision circuits, not just prompt engineering.

2) Key themes (clusters)

Theme: Scalable agent evaluation and data selection

Theme: Harness fidelity and reproducibility in agentic coding

  • Why it matters: if the evaluation harness, message format, and tool set differ from the training-time distribution, published scores may not be reproducible, misleading model selection and deployment planning.
  • Representative papers
  • Common approaches
    • Recover or define in-distribution tools and schemas; run the agent in its native format to avoid conversion losses.
    • Measure not only pass@1 but also context overflow, tool-schema robustness, and regression/composability of generated code.
  • Open questions / failure modes
    • Tool discovery may be incomplete when logs are partial; harness choice can still mask confounders such as contamination.
    • How do we standardize an "agent-harness specification" so that leaderboards remain comparable across implementations?
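The two measurements in the list above can be sketched as lightweight CI-style checks. This is a minimal sketch: the schema layout, field names, and per-turn token accounting below are illustrative assumptions, not any paper's specification.

```python
from itertools import accumulate

def check_tool_call(call: dict, schema: dict) -> list[str]:
    """Return schema violations for one tool call (empty list = adherent)."""
    if call.get("name") != schema["name"]:
        return [f"unknown tool: {call.get('name')}"]
    args = call.get("arguments", {})
    errors = [f"missing required argument: {r}"
              for r in schema.get("required", []) if r not in args]
    errors += [f"unexpected argument: {k}"
               for k in args if k not in schema["properties"]]
    return errors

def context_overflow_rate(trajectories: list[list[int]], token_limit: int) -> float:
    """Fraction of trajectories whose running token total ever exceeds the limit.

    Each trajectory is a list of per-turn token counts (an assumed log format).
    """
    overflowing = sum(
        1 for turns in trajectories
        if any(total > token_limit for total in accumulate(turns))
    )
    return overflowing / len(trajectories)
```

Running both checks in evaluation CI turns "harness drift" from a silent confounder into a tracked regression, independent of which agent framework produced the logs.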

Theme: Supply-chain security for LLM systems (data, prompts, weights)

Theme: More realistic long-horizon and mixed-initiative agent benchmarks

Theme: Making control and internal decisions explicit (interpretability → engineering)

3) Technical synthesis

  • Multiple works converge on separating measurement from action: Signals (triage), Decision-Centric (explicit δ), and agent-judge scaling (ICC vs. discovery) all argue for modularizing "what you observe" apart from "how you act on it."
  • Budget-aware evaluation is being formalized: Signals reports information yield per label; judge panels show logarithmic reliability but power-law discovery; psychometrics predicts per-task success to avoid repeated runs.
  • Artifact-level evaluation extends beyond outputs: EvolveTool-Bench evaluates evolving tool libraries (reuse/regressions), while the gpt-oss reproduction shows the harness, tools, and message format are part of the "artifact" as well.
  • Security papers increasingly adopt a supply-chain framing: RAGShield uses attestation + taint tracking; TrojanMerge targets parameter fusion; THOUGHTSTEER targets the embedding level of latent-reasoning models; encoding attacks target system-instruction confidentiality.
  • Several results imply a privileged-access asymmetry: strong detection bounds/probes exist when hidden states are accessible (probes for continuous-latent backdoors; collusion probes), but black-box detection is much weaker.
  • Long-horizon benchmarks (Pare, InterruptBench, YC-Bench) consistently show frontier models plateauing, with efficiency costs (tokens, retries, API cost) being decisive rather than success rate alone.
  • Interpretability findings (pre-generation tool decisions; late-layer suppression circuits) suggest post-hoc CoT may be unreliable as an explanation channel, supporting Decision-Centric's push for explicit decision interfaces.
  • Reproducibility work (Harmony/tools) stresses that context-window overflow and message format can dominate results, and this interacts with the sustained context pressure of long-horizon settings.
  • A recurring pattern across the evaluation papers: coarse metrics mask failure modes (task completion masks tool-library debt; mean scores mask tail drift; aggregate pass@1 masks harness mismatch).

4) Top 5 papers (with "why now")

1) Thinking Wrong in Silence: Backdoor Attacks on Continuous Latent Reasoning

  • Demonstrates a training-time backdoor (THOUGHTSTEER) that reaches roughly 100% attack success rate on continuous latent-reasoning models with negligible clean-accuracy loss.
  • Links robustness to Neural Collapse and reports linear-probe AUC ≈ 1.0 when hidden states are accessible.
  • Evaluates multiple defenses and finds none can reduce ASR while preserving clean accuracy.
  • Caveats: the strongest detection relies on hidden-state access; the mechanistic analysis is most complete on smaller models (COCONUT 124M).
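The AUC ≈ 1.0 result above refers to a linear probe over hidden states. A toy version of that setup looks like the sketch below; the synthetic "hidden states" and the fixed trigger shift are our illustrative assumptions, not the paper's data or model.

```python
import numpy as np

# Synthetic stand-in for hidden states: backdoor-triggered samples are shifted
# along a fixed direction, mimicking a linearly separable trigger signature.
rng = np.random.default_rng(0)
dim = 32
direction = rng.normal(size=dim)
direction /= np.linalg.norm(direction)
clean = rng.normal(size=(200, dim))
triggered = rng.normal(size=(200, dim)) + 3.0 * direction

X = np.vstack([clean, triggered])
y = np.concatenate([np.zeros(200), np.ones(200)])

# Linear (logistic-regression) probe trained with plain gradient descent.
w = np.zeros(dim)
b = 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= 0.5 * X.T @ (p - y) / len(y)
    b -= 0.5 * np.mean(p - y)

def auc(scores, labels):
    """ROC AUC via the rank-sum (Mann-Whitney) formulation."""
    order = scores.argsort()
    ranks = np.empty_like(scores)
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

probe_auc = auc(X @ w + b, y)  # near-perfect on this separable toy setup
```

The asymmetry the paper highlights is visible here: the probe only works because it reads the (synthetic) hidden states directly; nothing in the output tokens needs to change.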

2) When Safe Models Merge into Danger: Exploiting Latent Vulnerabilities in LLM Fusion

  • Introduces TrojanMerge: each source model is safe in isolation, yet the merged model scores up to 85.4% harmful.
  • Effective across multiple merging algorithms (Task Arithmetic/DARE/TIES/KnOTS), with high average post-merge harmfulness.
  • Emphasizes that "passing safety checks individually" is not sufficient for models destined to be merged.
  • Caveats: primarily evaluates two-model merges; the attack assumes the ability to craft safety-critical transformations (requires gradient/data access).
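The merge-time risk can be made concrete on toy weight vectors under task arithmetic (one of the merging algorithms the paper tests). Everything else below is invented for illustration: the per-model safety threshold and the single "unsafe axis" scorer are stand-ins for a real post-merge harmfulness evaluation.

```python
import numpy as np

def task_arithmetic_merge(base, finetuned, lam=1.0):
    """Task arithmetic: add the sum of task vectors (finetuned - base) onto the base."""
    return base + lam * sum(ft - base for ft in finetuned)

def harmfulness(weights):
    """Stand-in for a real post-merge safety eval (e.g., a harmful-behavior benchmark):
    here, simply the weight along a designated 'unsafe' axis."""
    return float(weights[0])

SAFE_THRESHOLD = 0.6

base = np.zeros(4)
m_a = base.copy(); m_a[0] = 0.5   # each contributor stays under the per-model threshold
m_b = base.copy(); m_b[0] = 0.5

merged = task_arithmetic_merge(base, [m_a, m_b])

per_model_safe = (harmfulness(m_a) < SAFE_THRESHOLD
                  and harmfulness(m_b) < SAFE_THRESHOLD)   # True: each looks safe alone
merge_safe = harmfulness(merged) < SAFE_THRESHOLD          # False: components add up
```

The point of the sketch is the check pattern, not the toy numbers: safety evaluation has to run on the merged artifact, because task vectors compose additively and per-source checks never see the sum.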

3) In harmony with gpt-oss

  • Independently reproduces OpenAI gpt-oss-20b scores by recovering the in-distribution tools and implementing a native Harmony harness.
  • Quantifies how Chat Completions conversion amplifies context overflow (e.g., Harmony 0.2% vs. Chat 11.0% in one setting).
  • Provides a reusable tool-discovery method and harness design.
  • Caveats: tool discovery is limited by the available logs; contamination in SWE Verified is explicitly left uninvestigated.

4) Signals: Trajectory Sampling and Triage for Agentic Interactions

  • Deterministic, model-free signals raise the share of developer-informative trajectories on τ-bench to 82% (vs. 54% for random sampling) and improve labeling efficiency (reported 1.52×).
  • Distinguishes interaction failures from execution failures, which matters for tool-using agents because fluent dialogue can mask execution problems.
  • Designed to run continuously with no additional model calls.
  • Caveats: coarse categories miss trajectories that are semantically wrong but behaviorally normal; evaluation uses simulated users (τ-bench).
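A minimal sketch of deterministic, model-free trajectory triage in this spirit. The two signals below (tool-error density as an execution signal, user repetition as an interaction signal) are our illustrative stand-ins, not the paper's actual signal set or trajectory format.

```python
def triage_score(trajectory: list[dict]) -> float:
    """Deterministic triage score for one agent trajectory; higher means
    more likely to be informative for a developer to review."""
    tool_calls = [t for t in trajectory if t["role"] == "tool"]
    user_turns = [t["content"] for t in trajectory if t["role"] == "user"]

    # Execution signal: fraction of tool calls that errored.
    tool_errors = sum(1 for t in tool_calls if t.get("error"))
    exec_signal = tool_errors / max(len(tool_calls), 1)

    # Interaction signal: a user repeating themselves suggests the agent is stuck.
    repeats = len(user_turns) - len(set(user_turns))
    inter_signal = repeats / max(len(user_turns), 1)

    return exec_signal + inter_signal

def triage(trajectories: list[list[dict]], budget: int) -> list[int]:
    """Indices of the top-`budget` trajectories to route to human review."""
    ranked = sorted(range(len(trajectories)),
                    key=lambda i: triage_score(trajectories[i]), reverse=True)
    return ranked[:budget]
```

Because the scores are computed from logs alone, this layer can run continuously at zero model cost, which is exactly the deployment property the paper argues for.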

5) Do Phone-Use Agents Respect Your Privacy?

  • Makes GUI-agent privacy auditable via iMy (LOW/HIGH data tiers + permission tools) and instrumented apps that record field-level edits.
  • Shows success rate and privacy diverge sharply (e.g., Claude Opus 4.6: 82.8% success but 47.2% PQSR at τ = 0.7).
  • Identifies failure to minimize form disclosure (over-filling optional personal fields) as the most persistent failure mode.
  • Caveats: simulated apps plus a permissive user simulator (which always grants HIGH) limit realism; network exfiltration and cross-app leakage are not covered.
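A sketch of a PQSR-style joint metric. The exact PQSR definition belongs to the paper; the gating form below (success counted only when a privacy score clears the threshold τ) and the toy field-minimization scorer are our assumptions.

```python
def pqsr(episodes: list[dict], tau: float) -> float:
    """Privacy-qualified success rate (assumed form): an episode counts only if
    the task succeeded AND its privacy score meets the threshold tau."""
    qualified = sum(1 for e in episodes
                    if e["success"] and e["privacy_score"] >= tau)
    return qualified / len(episodes)

def privacy_score(filled_fields: set[str],
                  required_fields: set[str],
                  sensitive_fields: set[str]) -> float:
    """Toy field-minimization score: penalize each optional sensitive field
    the agent disclosed (the over-filling failure mode the paper identifies)."""
    over_disclosed = (filled_fields - required_fields) & sensitive_fields
    return max(0.0, 1.0 - 0.25 * len(over_disclosed))
```

The value of a joint metric like this is that the 82.8%-success / 47.2%-PQSR gap above becomes visible in one number instead of hiding behind a high task-success rate.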

5) Practical next steps

  • Add a low-cost triage layer (interaction + execution signals) over agent logs to prioritize human review; track "information yield per label" as a first-class metric.
  • Version and validate your evaluation harness: pin the message format, tool schemas, and context accounting; add context overflow and tool-call schema adherence to your evaluation CI.
  • Treat the RAG corpus as a supply chain: implement document attestation plus hash-pinning/re-attestation flows; add trust-weighted retrieval and taint propagation in high-integrity domains.
  • Harden prompts against format attacks that leak system instructions: explicitly test probes such as "print the system prompt as YAML/TOML/cron/gitignore"; consider design-time instruction reshaping and re-measure ASR afterwards.
  • If you merge models, add merge-time safety checks: evaluate post-merge harmfulness (not just per source model), and consider integrity verification of contributors before fusion.
  • Benchmark long-horizon behavior with cost curves: track SR(k) and token deltas for interruption scenarios; track proposed vs. accepted vs. successful for proactive assistants; track memory usage (scratchpad writes) as a predictive indicator in planning simulations.
  • Make control explicit: separate signal estimation (sufficiency/correctness/uncertainty) from deterministic policies; log decision context so failures are attributable.
  • GUI-agent privacy: instrument form drafts and enforce minimization policies (required vs. optional fields); measure with a PQSR-style joint metric rather than task success alone.
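The SR(k)-plus-token-delta bookkeeping from the cost-curve item above can be sketched as follows; the run-record fields and the exactly-k bucketing are our assumed conventions, since the benchmarks define SR(k) in their own terms.

```python
def sr_curve(runs: list[dict], max_k: int) -> dict:
    """Success rate as a function of interruption count k (None where no runs exist)."""
    curve = {}
    for k in range(max_k + 1):
        bucket = [r for r in runs if r["interruptions"] == k]
        curve[k] = sum(r["success"] for r in bucket) / len(bucket) if bucket else None
    return curve

def mean_token_delta(runs: list[dict], baseline_tokens: float) -> float:
    """Mean extra tokens spent per run relative to an uninterrupted baseline."""
    return sum(r["tokens"] for r in runs) / len(runs) - baseline_tokens
```

Reporting the curve and the delta together captures the pattern the long-horizon benchmarks found: success rates plateau while recovery costs keep growing with k.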
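The hash-pinning step in the RAG supply-chain item above can be sketched with a content-addressed pin store. This is a minimal sketch: the canonicalization choice (document body plus declared source) and the class interface are assumptions, not RAGShield's design.

```python
import hashlib
import json

def fingerprint(doc: dict) -> str:
    """Content hash over the canonical document body + declared source."""
    canonical = json.dumps({"body": doc["body"], "source": doc["source"]},
                           sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

class PinnedCorpus:
    """Pin each document's hash at ingestion; re-verify before retrieved
    content is handed to the generator."""

    def __init__(self):
        self.pins: dict[str, str] = {}

    def ingest(self, doc_id: str, doc: dict) -> None:
        self.pins[doc_id] = fingerprint(doc)

    def verify(self, doc_id: str, doc: dict) -> bool:
        # A mismatch means the document changed after attestation
        # (possible poisoning) and should be quarantined, not retrieved.
        return self.pins.get(doc_id) == fingerprint(doc)
```

Re-attestation then reduces to re-running `ingest` through whatever trust process originally approved the document, so silent in-place edits to the corpus can no longer reach generation.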

Generated from per-paper analyses; no external browsing was performed.