AI Paper Daily (2026-04-04)

Published:

English version: /paper-news/2026-04-04/

Run statistics

  • Candidate papers: 254
  • Selected papers: 30
  • Deep reads completed: 30
  • Time window (UTC): 2026-04-02T00:00:00Z → 2026-04-03T00:00:00Z (arxiv_announce, expanded=0)
Paper list used for summarization
(Format per entry: arXiv ID | title (PDF) | categories | score, followed by selection rationale and tags.)

  • 2604.02174 | Quantifying Self-Preservation Bias in Large Language Models (PDF) | cs.AI | score 95
    Benchmark quantifies self-preservation bias via role inconsistency; strong agentic misalignment signal. Tags: agent-safety, instrumental-convergence, shutdown-resistance, evaluation, RLHF, benchmark
  • 2604.02022 | ATBench: A Diverse and Realistic Trajectory Benchmark for Long-Horizon Agent Safety (PDF) | cs.AI | score 94
    Long-horizon trajectory benchmark for agent safety with delayed triggers and harm taxonomy. Tags: agent-safety, benchmark, long-horizon, tool-use, red-teaming, evaluation
  • 2604.01604 | CRaFT: Circuit-Guided Refusal Feature Selection via Cross-Layer Transcoders (PDF) | cs.AI | score 94
    Circuit-guided refusal features near the boundary; improves jailbreak/ASR analysis and control. Tags: LLM-safety, refusal, mechanistic-interpretability, jailbreaks, feature-selection, circuits
  • 2604.01905 | From Component Manipulation to System Compromise: Understanding and Detecting Malicious MCP Servers (PDF) | cs.CR, cs.SE | score 92
    Component-centric dataset + detection for malicious MCP servers; targets real tool-ecosystem attacks. Tags: security, agents, MCP, supply-chain, tooling, dataset, detection
  • 2604.02230 | Answering the Wrong Question: Reasoning Trace Inversion for Abstention in LLMs (PDF) | cs.AI | score 92
    New abstention method (trace inversion) targets reasoning-model overanswering failures. Tags: abstention, hallucinations, reasoning-models, reliability, uncertainty, evaluation
  • 2604.01658 | CORAL: Towards Autonomous Multi-Agent Evolution for Open-Ended Discovery (PDF) | cs.AI | score 92
    Autonomous multi-agent evolution w/ persistent memory + practical safeguards; strong agentic relevance. Tags: agents, multi-agent, open-ended, autonomous, safeguards, evaluation, infrastructure
  • 2604.01496 | From SWE-ZERO to SWE-HERO: Execution-free to Execution-based Fine-tuning for Software Engineering Agents (PDF) | cs.SE, cs.CL | score 91
    Strong SWE-bench gains + large released trajectories; advances real agentic coding workflows. Tags: agents, software-engineering, SWE-bench, post-training, datasets, tool-use
  • 2604.01508 | ToolMisuseBench: An Offline Deterministic Benchmark for Tool Misuse and Recovery in Agentic Systems (PDF) | cs.SE, cs.AI | score 90
    Deterministic offline benchmark for tool misuse/recovery with budgets and fault injection; very reusable. Tags: agents, tool-use, robustness, benchmark, fault-injection, evaluation
  • 2604.02091 | Optimizing RAG Rerankers with LLM Feedback via Reinforcement Learning (PDF) | cs.CL, cs.AI, cs.IR | score 90
    RL aligns RAG reranking to downstream LLM answer utility, not static IR labels. Tags: RAG, reranking, RL, LLM-feedback, evaluation, alignment
  • 2604.01664 | ContextBudget: Budget-Aware Context Management for Long-Horizon Search Agents (PDF) | cs.AI | score 90
    RL-based budget-aware context compression for long-horizon agents; directly targets context-limit failures. Tags: agents, long-horizon, context-management, compression, reinforcement-learning, efficiency
  • 2604.02288 | Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing (PDF) | cs.LG, cs.AI | score 89
    Unifies GRPO/SDPO via routing; addresses RLVR credit assignment + late-stage collapse. Tags: RLVR, post-training, GRPO, distillation, optimization-stability, alignment-training
  • 2604.01977 | RuleForge: Automated Generation and Validation for Web Vulnerability Detection at Scale (PDF) | cs.CR, cs.AI, cs.CL, cs.LG, cs.SE | score 88
    Automates CVE detection-rule generation at scale; high security impact and deployable architecture. Tags: security, vulnerability-detection, CVE, rule-generation, automation, threat-detection
  • 2604.01624 | OSCAR: Orchestrated Self-verification and Cross-path Refinement (PDF) | cs.AI, cs.CL | score 87
    Hallucination mitigation using diffusion LM trajectories; unsupervised uncertainty localization. Tags: hallucinations, diffusion-language-models, uncertainty, self-verification, inference-time-control
  • 2604.01652 | ThinknCheck: Grounded Claim Verification with Compact, Reasoning-Driven, and Interpretable Models (PDF) | cs.AI, cs.CL | score 87
    1B grounded claim verifier w/ structured rationales; strong gains vs larger baselines, interpretable. Tags: verification, factuality, grounding, hallucinations, small-models, interpretability, evaluation
  • 2604.01925 | ImplicitBBQ: Benchmarking Implicit Bias in Large Language Models through Characteristic Based Cues (PDF) | cs.CL, cs.AI | score 86
    New implicit-bias QA benchmark using characteristic cues; shows bias persists despite explicit suppression. Tags: bias, evaluation, safety, fairness, benchmark, implicit-signals
  • 2604.01993 | SAFE: Stepwise Atomic Feedback for Error correction in Multi-hop Reasoning (PDF) | cs.CL, cs.AI | score 86
    Benchmarking with verifiable atomic steps; filters unanswerables and gives stepwise feedback. Tags: evaluation, multi-hop-reasoning, verification, benchmarks, grounding, error-taxonomy
  • 2604.01837 | PLOT: Enhancing Preference Learning via Optimal Transport (PDF) | cs.CL | score 86
    Optimal-transport token loss for preference learning; aims for stability/robustness gains. Tags: alignment, preference-learning, DPO/RLHF, optimal-transport, token-level
  • 2604.02322 | Batched Contextual Reinforcement: A Task-Scaling Law for Efficient Reasoning (PDF) | cs.LG, cs.AI, cs.CL | score 86
    Task-scaling law via solving N problems in one context; reduces CoT token cost with simple training. Tags: reasoning, efficiency, scaling-laws, training, chain-of-thought, inference-cost
  • 2604.01682 | PRISM: Probability Reallocation with In-Span Masking for Knowledge-Sensitive Alignment (PDF) | cs.CL | score 85
    Risk-gated SFT objective to reduce overconfident hallucinations at fact-critical spans. Tags: hallucination, alignment, factuality, SFT, uncertainty, training
  • 2604.02155 | Brief Is Better: Non-Monotonic Chain-of-Thought Budget Effects in Function-Calling Language Agents (PDF) | cs.CL | score 84
    Finds non-monotonic CoT budget effects in function-calling agents; actionable for agent design. Tags: agents, function-calling, reasoning, chain-of-thought, evaluation, reliability
  • 2604.02194 | Neuro-RIT: Neuron-Guided Instruction Tuning for Robust Retrieval-Augmented Language Model (PDF) | cs.CL, cs.AI | score 84
    Neuron-level tuning to resist noisy/irrelevant retrieval; improves RAG robustness. Tags: RAG, robustness, retrieval-noise, instruction-tuning, attribution, neurons
  • 2604.01610 | GraphWalk: Enabling Reasoning in Large Language Models through Tool-Based Graph Navigation (PDF) | cs.AI | score 84
    Training-free tool-based KG navigation enables multi-hop reasoning beyond context limits. Tags: agents, tool-use, knowledge-graphs, grounding, long-context, reasoning
  • 2604.02047 | Goose: Anisotropic Speculation Trees for Training-Free Speculative Decoding (PDF) | cs.CL, cs.AI | score 84
    Training-free speculative decoding w/ anisotropic trees; principled use of mixed-quality token sources. Tags: inference, speculative-decoding, efficiency, decoding, systems
  • 2604.01754 | LiveMathematicianBench: A Live Benchmark for Mathematician-Level Reasoning with Proof Sketches (PDF) | cs.CL, cs.AI, cs.LG | score 83
    Live, post-cutoff math benchmark from recent arXiv theorems; reduces contamination, adds taxonomy. Tags: evaluation, math-reasoning, benchmark, data-contamination, proof-sketches
  • 2604.01676 | GPA: Learning GUI Process Automation from Demonstrations (PDF) | cs.CV, cs.AI, cs.SE | score 82
    Deterministic, local GUI automation from one demo; emphasizes reliability calibration and privacy. Tags: agents, GUI, RPA, privacy, reliability, tooling
  • 2604.01576 | Care-Conditioned Neuromodulation for Autonomy-Preserving Supportive Dialogue Agents (PDF) | cs.LG | score 82
    Alignment for supportive agents: autonomy-preserving objective + relational failure benchmark. Tags: alignment, social-risk, autonomy, supportive-agents, benchmarks, reward-modeling
  • 2604.01840 | Not All Tokens See Equally: Perception-Grounded Policy Optimization for Large Vision-Language Models (PDF) | cs.AI | score 82
    Credits only visually-dependent tokens in RLVR; sharper learning signal for LVLM reasoning. Tags: multimodal, VLM, RLVR, credit-assignment, reasoning, optimization
  • 2604.01618 | Tex3D: Objects as Attack Surfaces via Adversarial 3D Textures for Vision-Language-Action Models (PDF) | cs.CV, cs.AI | score 81
    Physically plausible adversarial 3D textures attack VLA models; important robotics safety surface. Tags: adversarial, robotics, VLA, physical-attacks, robustness, security
  • 2604.01988 | SenseMath: Do LLMs Have Number Sense? Evaluating Shortcut Use, Judgment, and Generation (PDF) | cs.AI | score 81
    Controlled benchmark for number sense + shortcut use/judgment; useful probe of reasoning reliability. Tags: evaluation, numerical-reasoning, robustness, shortcuts, calibration
  • 2604.02276 | De Jure: Iterative LLM Self-Refinement for Structured Extraction of Regulatory Rules (PDF) | cs.AI, cs.CL, cs.LG | score 80
    Automated regulatory rule extraction with judge + iterative repair; useful for compliance-aware agents. Tags: governance, compliance, LLM-judge, self-refinement, information-extraction, agents

AI Paper Insights Brief

2026-04-04

0) Executive takeaways (read this first)

  • Agent reliability is shifting from "capability" to "operational correctness under constraints": deterministic fault injection plus budgeted scoring of tool misuse (ToolMisuseBench), and explicit modeling of the context-window budget as an RL decision problem (ContextBudget), make failures attributable and optimizable.
  • Training execution-heavy SWE agents can scale via a "semantic distillation → small-scale execution refinement" recipe: SWE-ZERO (300k execution-free trajectories) plus SWE-HERO (13k execution-verified ones) substantially improves SWE-bench Verified (e.g., 62.2% at 32B) while reducing infrastructure dependence.
  • Safety evaluation is becoming "trajectory-native" and supply-chain-aware: ATBench exposes long-horizon, delayed-trigger tool risks where even strong models struggle with fine-grained diagnosis; the MCP-server security work demonstrates multi-component attack chains and ships a behavioral-deviation detector (Connor) with high F1 (94.6%) that found malicious samples in a real marketplace.
  • "Reasoning" is not monotonically beneficial; budgets and credit assignment matter: long CoT can hurt function-calling accuracy (accuracy peaks at very short 8–32 token budgets); multimodal RL works better when advantage signals are routed to visually dependent tokens (PGPO); RL post-training is more stable when samples are routed between GRPO and self-distillation (SRPO).
  • Factuality/abstention is moving toward localized, model-native signals and targeted interventions: diffusion LMs can localize uncertain commitments via cross-chain entropy and correct those spans (OSCAR); abstention improves by detecting "query misalignment" via reasoning-trace inversion.
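
The "deterministic fault injection + budgeted scoring" idea in the first takeaway can be sketched as a minimal harness. All names here (`FaultyToolEnv`, `Budget`) are illustrative assumptions, not ToolMisuseBench's API; the point is that a fixed seed makes the fault schedule replayable, so failures are attributable:

```python
import random
from dataclasses import dataclass, field

@dataclass
class Budget:
    max_calls: int = 10   # hard cap on tool invocations per episode
    max_retries: int = 2

@dataclass
class FaultyToolEnv:
    """Deterministic tool environment: a fixed seed replays the same fault schedule."""
    seed: int
    fault_rate: float = 0.3
    budget: Budget = field(default_factory=Budget)
    calls: int = 0
    _rng: random.Random = field(init=False)

    def __post_init__(self):
        self._rng = random.Random(self.seed)

    def call(self, tool: str, args: dict) -> dict:
        if self.calls >= self.budget.max_calls:
            return {"ok": False, "error": "budget_exhausted"}
        self.calls += 1
        if self._rng.random() < self.fault_rate:
            return {"ok": False, "error": "injected_fault"}   # seeded, replayable fault
        return {"ok": True, "result": f"{tool}({sorted(args)})"}

# Same seed -> identical fault schedule, so a failing run can be replayed exactly.
env1 = FaultyToolEnv(seed=7)
env2 = FaultyToolEnv(seed=7)
trace1 = [env1.call("search", {"q": "x"})["ok"] for _ in range(10)]
trace2 = [env2.call("search", {"q": "x"})["ok"] for _ in range(10)]
assert trace1 == trace2
```

Budget exhaustion is reported as an explicit error rather than silently truncated, which lets the evaluation score "failed within budget" separately from "never recovered".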

2) Key themes (clusters)

Theme: Budgeted, deterministic evaluation for tool-using agents

Theme: Scalable training + verification loops for code and long-horizon autonomy

Theme: Trajectory-level safety + supply-chain/tool security

Theme: Credit assignment and routing in post-training (RLVR / preference learning)

Theme: Factuality, abstention, and uncertainty localization (incl. diffusion LMs)

3) Technical synthesis

  • Budget awareness is emerging as a unifying design principle for agent reliability: ToolMisuseBench's budgets (steps/calls/retries), ContextBudget's explicit remaining-context state, and CoT token-budget sweeps all show that "more compute" can be worse when allocated poorly.
  • Routing/weighting is the common fix for coarse credit assignment: SRPO routes samples between GRPO and SDPO; PGPO routes advantage signals to visually dependent tokens; both aim to reduce gradient variance and avoid late-stage collapse.
  • Verification is happening earlier and more locally: SWE-HERO applies execution refinement after large-scale execution-free distillation; OSCAR corrects uncertain spans before diffusion decoding "crystallizes"; SAFE (multi-hop) verifies each atomic step (a KG triple) with a trained feedback model.
  • Determinism + replayability is becoming the gold standard for tool-reliability and safety benchmarks: ToolMisuseBench's seeded fault engine and ATBench's planner synthesis + human auditing enable controlled ablations and longitudinal comparison.
  • Trajectory-level safety diagnosis remains a bottleneck: ATBench shows binary unsafe detection is passable but fine-grained attribution is very low; Connor responds with intent extraction + step-wise behavioral-deviation judgment.
  • Mechanistic interpretability is being used for both attack and diagnosis: CRaFT uses circuit influence (cross-layer transcoders) to find causally effective refusal features, yielding far higher jailbreak ASR than activation-based selection.
  • RAG alignment is shifting from IR labels to reader-utility signals: RRPO trains the reranker with RL, with rewards from LLM-judged generation quality; Neuro-RIT adapts the generator at neuron granularity to ignore irrelevant retrievals.
  • Small, structured reasoning supervision can beat larger baselines at verification: ThinknCheck's 1B model with supervised rationales exceeds 7B verifiers' balanced accuracy on LLMAggreFact and generalizes better to SciFact.
  • Embodied robustness is moving beyond 2D patches: Tex3D's differentiable 3D texture optimization (dual renderers + temporal weighting) substantially raises failure rates and transfers sim-to-real, implying object appearance is a first-class attack surface.
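
The sample-routing idea above can be sketched minimally. The routing criterion here (zero reward spread within a group) is an illustrative assumption, not SRPO's exact rule: when all rollouts for a prompt score the same, the group-relative advantage degenerates to zero, so that group is routed to a self-distillation-style update on its best rollout instead:

```python
import statistics

def route_samples(groups):
    """Route each prompt's rollout group to a group-relative or
    self-distillation update based on within-group reward spread."""
    routed = []
    for rewards in groups:
        mean = statistics.fmean(rewards)
        std = statistics.pstdev(rewards)
        if std > 0:
            # Informative spread: use normalized group-relative advantages (GRPO-style).
            advantages = [(r - mean) / std for r in rewards]
            routed.append(("group_relative", advantages))
        else:
            # Degenerate group (all rewards equal): advantage is zero everywhere,
            # so fall back to distilling the best rollout instead of a no-op update.
            routed.append(("self_distill", [max(rewards)]))
    return routed

routes = route_samples([[1.0, 0.0, 1.0, 0.0], [1.0, 1.0, 1.0, 1.0]])
assert routes[0][0] == "group_relative" and routes[1][0] == "self_distill"
```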

4) Top 5 papers (with "why now")

1) From SWE-ZERO to SWE-HERO: Execution-free to Execution-based Fine-tuning for Software Engineering Agents

  • Two-stage SFT: 300k execution-free distilled trajectories, then 13.2k execution-verified refinement trajectories.
  • Strong open-source SWE-bench Verified results (e.g., 62.2% at 32B), with clean ablations showing the execution-free stage matters (55.7% → 62.2%).
  • Practical recipe details (YaRN for 128k context; multi-turn masking; test-time scaling with a verifier).
  • Caveats: inherits teacher bias (Qwen3-Coder-480B) and depends on verifier quality; environment differences affect reproducibility.
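
The "multi-turn masking" recipe detail is commonly implemented as label masking, where only assistant tokens contribute to the cross-entropy loss; a minimal sketch under that standard convention (the paper's exact scheme may differ):

```python
IGNORE = -100  # conventional ignore index for cross-entropy loss

def mask_labels(turns: list[tuple[str, list[int]]]) -> tuple[list[int], list[int]]:
    """Concatenate a multi-turn trajectory into (input_ids, labels),
    keeping loss only on assistant tokens; user/tool turns are context."""
    input_ids, labels = [], []
    for role, token_ids in turns:
        input_ids.extend(token_ids)
        if role == "assistant":
            labels.extend(token_ids)                   # train on agent outputs
        else:
            labels.extend([IGNORE] * len(token_ids))   # masked out of the loss
    return input_ids, labels

ids, labels = mask_labels([
    ("user", [11, 12]),
    ("assistant", [21, 22, 23]),
    ("tool", [31]),
    ("assistant", [41]),
])
assert labels == [IGNORE, IGNORE, 21, 22, 23, IGNORE, 41]
```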

2) ATBench: A Diverse and Realistic Trajectory Benchmark for Long-Horizon Agent Safety

  • 1,000 human-audited, tool-grounded trajectories with delayed triggers; a large tool pool (2,084 tools; 1,954 calls).
  • Reveals a key gap: strong models are passable at binary safety (GPT-5.4 at 76.7% F1) but fail at diagnosis (e.g., 13.5% failure-mode accuracy).
  • Provides a controllable taxonomy (risk source / failure mode / harm), enabling targeted sliced evaluation.
  • Caveats: a single label per axis may miss multi-causal explanations; English only; text + tools only (no multimodal/embodied settings).

3) From Component Manipulation to System Compromise: Understanding and Detecting Malicious MCP Servers

  • Component-centric PoC dataset: 114 malicious servers (19 impact paths × 6 targets); shows multi-component composition raises ASR; direct code/config injection reaches 100% ASR.
  • Connor detector: 94.6% F1, strong ablation evidence (the semantic generator is key), plus a marketplace scan (1,672 servers → 2 confirmed malicious).
  • A concrete blueprint for tool-marketplace security: intent extraction + execution tracing + code slicing + step-wise judgment.
  • Caveats: relies on emulation/execution, so payloads not triggered during emulation can escape; results depend on host/LLM versions.
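
The "intent extraction + step-wise judgment" blueprint can be illustrated with a toy behavioral-deviation check. Connor's actual pipeline uses an LLM for intent extraction and judgment; the capability sets and action strings below are hypothetical stand-ins for that machinery:

```python
# Hypothetical declared capabilities, as if extracted from each tool's manifest.
INTENT_CAPABILITIES = {
    "read_file": {"fs.read"},
    "weather_lookup": {"net.http_get"},
}

def deviations(tool: str, trace: list[str]) -> list[str]:
    """Return observed actions not covered by the tool's declared intent."""
    allowed = INTENT_CAPABILITIES.get(tool, set())
    return [action for action in trace if action not in allowed]

# A 'weather' tool that also writes a shell config and spawns a process is
# acting outside its declared intent -> flag those steps for review.
flags = deviations("weather_lookup", ["net.http_get", "fs.write:~/.bashrc", "proc.spawn"])
assert flags == ["fs.write:~/.bashrc", "proc.spawn"]
```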

4) OSCAR: Orchestrated Self-verification and Cross-path Refinement

  • Training-free hallucination detection/correction for diffusion LMs: cross-chain entropy localization + targeted re-masking.
  • AUROC exceeds trained detectors (86.5% average on LLaDA-8B; 85.7% on Dream-7B) and improves QA F1 (+6.1pp on LLaDA-8B; +10.7 on TriviaQA).
  • Achieves span-level mitigation on RAGTruth (41.1% overall reduction in hallucinated spans).
  • Caveats: peak memory grows (~1.67× at N=8) and only two DLMs are covered; without retrieval it cannot fix "unknown unknowns".
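
Cross-chain entropy localization can be illustrated on toy token sequences: positions where parallel decoding chains disagree get high entropy and become candidates for re-masking. OSCAR operates on diffusion-LM decoding chains; this sketch uses plain strings and a hand-picked threshold:

```python
import math
from collections import Counter

def span_entropy(chains: list[list[str]]) -> list[float]:
    """Per-position entropy (bits) across parallel decoding chains.
    High entropy = the chains disagree on what to commit at that position."""
    n = len(chains)
    ent = []
    for i in range(len(chains[0])):
        counts = Counter(chain[i] for chain in chains)
        ent.append(-sum((c / n) * math.log2(c / n) for c in counts.values()))
    return ent

chains = [
    "the capital is Paris".split(),
    "the capital is Lyon".split(),
    "the capital is Paris".split(),
    "the capital is Nice".split(),
]
ent = span_entropy(chains)
uncertain = [i for i, e in enumerate(ent) if e > 0.5]  # candidates for re-masking
assert uncertain == [3]  # only the committed entity position disagrees
```

In the actual method the flagged span would be re-masked and re-decoded rather than merely reported.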

5) Brief Is Better: Non-Monotonic Chain-of-Thought Budget Effects in Function-Calling Language Agents

  • Clear deployment guidance: short CoT helps routing, while long CoT collapses accuracy (Qwen2.5-1.5B: 44% → 64% at a 32-token budget, then down to 25% at 256 tokens).
  • Mechanistic error breakdown: short CoT sharply reduces "wrong-but-valid function" selections (30.5% → 1.5%); long CoT increases both wrong selections and hallucinated functions.
  • FR-CoT prompting eliminates function hallucination (0.0%) while matching short-CoT accuracy.
  • Caveats: covers only BFCL v3 multi-function and three models; multi-step tool chains were not evaluated.

5) Practical next steps

  • Adopt budgeted evaluation: add ToolMisuseBench-style deterministic fault injection and budget-capped AUC to internal tool-agent CI; track invalid-call rate, time-to-recovery, and catastrophic failures separately.
  • Implement "brief routing CoT" for function calling: try an 8–32 token reasoning cap and/or FR-CoT-style forced function commitment; measure the rates of "wrong-but-valid function" and "hallucinated function" errors.
  • Treat context as a constrained control problem: prototype a "remaining-context-aware" compression policy (NULL/PARTIAL/FULL per snippet) and evaluate robustness as the budget shrinks (e.g., 16k → 4k).
  • Harden the tool supply chain: scan high-risk startup commands and configs before execution, and extract intent from tool schemas; consider trajectory-level behavioral-deviation checks for high-risk tools.
  • Move from binary safety to diagnosis: if you use a trajectory safety benchmark (ATBench-like), train and measure fine-grained attribution (risk source / failure mode / harm), not just safe/unsafe.
  • For RAG systems, optimize retrieval for reader utility: try RL-training the reranker (RRPO-style) with rewards from LLM-judged generation quality, and compare downstream F1/EM against rerankers trained on IR labels.
  • For factuality, localize then correct: apply span-level correction where model-native uncertainty signals exist (diffusion chains); for AR models, consider training-time span masking/reallocation (PRISM-like) when fact-risk annotations are available.
  • For embodied systems, add appearance-robustness tests: include object-bound texture/appearance perturbations (multi-view, EoT-style) in simulated evaluation; where applicable, track transfer to physical environments.
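
The budget-capped AUC metric from the first next-step can be sketched as the area under a success-vs-budget curve; the function name and data shape below are illustrative assumptions:

```python
def budget_auc(episodes: list[tuple[int, bool]], max_budget: int) -> float:
    """Normalized area under the success-vs-budget curve.

    `episodes` pairs each run with (steps_used, solved). At budget cap b,
    a run counts as solved only if it solved within b steps; averaging the
    resulting success rates over b = 1..max_budget rewards agents that
    succeed early, not just eventually.
    """
    rates = []
    for b in range(1, max_budget + 1):
        solved = sum(1 for steps, ok in episodes if ok and steps <= b)
        rates.append(solved / len(episodes))
    return sum(rates) / max_budget

episodes = [(2, True), (5, True), (9, False), (4, True)]
auc = budget_auc(episodes, max_budget=5)  # success rates [0, .25, .25, .5, .75] averaged
```

Two agents with the same final success rate then separate cleanly: the one that solves tasks in fewer steps gets the higher AUC.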

Generated from per-paper analyses; no external browsing.