AI Paper Daily (2026-04-04)

Published:

English version: /paper-news/2026-04-04/

Run statistics

  • Candidate papers: 254
  • Selected papers: 30
  • Deep reads completed: 30
  • Time window (UTC): 2026-04-02T00:00:00Z → 2026-04-03T00:00:00Z (arxiv_announce, expanded=0)
Paper list used for summarization
(Format per entry: arXiv ID | title (PDF) | categories | score, followed by selection rationale and tags.)

  • 2604.02174 | Quantifying Self-Preservation Bias in Large Language Models (PDF) | cs.AI | score 95
    Benchmark quantifies self-preservation bias via role inconsistency; strong agentic misalignment signal. Tags: agent-safety, instrumental-convergence, shutdown-resistance, evaluation, RLHF, benchmark
  • 2604.02022 | ATBench: A Diverse and Realistic Trajectory Benchmark for Long-Horizon Agent Safety (PDF) | cs.AI | score 94
    Long-horizon trajectory benchmark for agent safety with delayed triggers and harm taxonomy. Tags: agent-safety, benchmark, long-horizon, tool-use, red-teaming, evaluation
  • 2604.01604 | CRaFT: Circuit-Guided Refusal Feature Selection via Cross-Layer Transcoders (PDF) | cs.AI | score 94
    Circuit-guided refusal features near the boundary; improves jailbreak/ASR analysis and control. Tags: LLM-safety, refusal, mechanistic-interpretability, jailbreaks, feature-selection, circuits
  • 2604.01905 | From Component Manipulation to System Compromise: Understanding and Detecting Malicious MCP Servers (PDF) | cs.CR, cs.SE | score 92
    Component-centric dataset + detection for malicious MCP servers; targets real tool-ecosystem attacks. Tags: security, agents, MCP, supply-chain, tooling, dataset, detection
  • 2604.02230 | Answering the Wrong Question: Reasoning Trace Inversion for Abstention in LLMs (PDF) | cs.AI | score 92
    New abstention method (trace inversion) targets reasoning-model overanswering failures. Tags: abstention, hallucinations, reasoning-models, reliability, uncertainty, evaluation
  • 2604.01658 | CORAL: Towards Autonomous Multi-Agent Evolution for Open-Ended Discovery (PDF) | cs.AI | score 92
    Autonomous multi-agent evolution w/ persistent memory + practical safeguards; strong agentic relevance. Tags: agents, multi-agent, open-ended, autonomous, safeguards, evaluation, infrastructure
  • 2604.01496 | From SWE-ZERO to SWE-HERO: Execution-free to Execution-based Fine-tuning for Software Engineering Agents (PDF) | cs.SE, cs.CL | score 91
    Strong SWE-bench gains + large released trajectories; advances real agentic coding workflows. Tags: agents, software-engineering, SWE-bench, post-training, datasets, tool-use
  • 2604.01508 | ToolMisuseBench: An Offline Deterministic Benchmark for Tool Misuse and Recovery in Agentic Systems (PDF) | cs.SE, cs.AI | score 90
    Deterministic offline benchmark for tool misuse/recovery with budgets and fault injection; very reusable. Tags: agents, tool-use, robustness, benchmark, fault-injection, evaluation
  • 2604.02091 | Optimizing RAG Rerankers with LLM Feedback via Reinforcement Learning (PDF) | cs.CL, cs.AI, cs.IR | score 90
    RL aligns RAG reranking to downstream LLM answer utility, not static IR labels. Tags: RAG, reranking, RL, LLM-feedback, evaluation, alignment
  • 2604.01664 | ContextBudget: Budget-Aware Context Management for Long-Horizon Search Agents (PDF) | cs.AI | score 90
    RL-based budget-aware context compression for long-horizon agents; directly targets context-limit failures. Tags: agents, long-horizon, context-management, compression, reinforcement-learning, efficiency
  • 2604.02288 | Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing (PDF) | cs.LG, cs.AI | score 89
    Unifies GRPO/SDPO via routing; addresses RLVR credit assignment + late-stage collapse. Tags: RLVR, post-training, GRPO, distillation, optimization-stability, alignment-training
  • 2604.01977 | RuleForge: Automated Generation and Validation for Web Vulnerability Detection at Scale (PDF) | cs.CR, cs.AI, cs.CL, cs.LG, cs.SE | score 88
    Automates CVE detection-rule generation at scale; high security impact and deployable architecture. Tags: security, vulnerability-detection, CVE, rule-generation, automation, threat-detection
  • 2604.01624 | OSCAR: Orchestrated Self-verification and Cross-path Refinement (PDF) | cs.AI, cs.CL | score 87
    Hallucination mitigation using diffusion LM trajectories; unsupervised uncertainty localization. Tags: hallucinations, diffusion-language-models, uncertainty, self-verification, inference-time-control
  • 2604.01652 | ThinknCheck: Grounded Claim Verification with Compact, Reasoning-Driven, and Interpretable Models (PDF) | cs.AI, cs.CL | score 87
    1B grounded claim verifier w/ structured rationales; strong gains vs larger baselines, interpretable. Tags: verification, factuality, grounding, hallucinations, small-models, interpretability, evaluation
  • 2604.01925 | ImplicitBBQ: Benchmarking Implicit Bias in Large Language Models through Characteristic Based Cues (PDF) | cs.CL, cs.AI | score 86
    New implicit-bias QA benchmark using characteristic cues; shows bias persists despite explicit suppression. Tags: bias, evaluation, safety, fairness, benchmark, implicit-signals
  • 2604.01993 | SAFE: Stepwise Atomic Feedback for Error correction in Multi-hop Reasoning (PDF) | cs.CL, cs.AI | score 86
    Benchmarking with verifiable atomic steps; filters unanswerables and gives stepwise feedback. Tags: evaluation, multi-hop-reasoning, verification, benchmarks, grounding, error-taxonomy
  • 2604.01837 | PLOT: Enhancing Preference Learning via Optimal Transport (PDF) | cs.CL | score 86
    Optimal-transport token loss for preference learning; aims for stability/robustness gains. Tags: alignment, preference-learning, DPO/RLHF, optimal-transport, token-level
  • 2604.02322 | Batched Contextual Reinforcement: A Task-Scaling Law for Efficient Reasoning (PDF) | cs.LG, cs.AI, cs.CL | score 86
    Task-scaling law via solving N problems in one context; reduces CoT token cost with simple training. Tags: reasoning, efficiency, scaling-laws, training, chain-of-thought, inference-cost
  • 2604.01682 | PRISM: Probability Reallocation with In-Span Masking for Knowledge-Sensitive Alignment (PDF) | cs.CL | score 85
    Risk-gated SFT objective to reduce overconfident hallucinations at fact-critical spans. Tags: hallucination, alignment, factuality, SFT, uncertainty, training
  • 2604.02155 | Brief Is Better: Non-Monotonic Chain-of-Thought Budget Effects in Function-Calling Language Agents (PDF) | cs.CL | score 84
    Finds non-monotonic CoT budget effects in function-calling agents; actionable for agent design. Tags: agents, function-calling, reasoning, chain-of-thought, evaluation, reliability
  • 2604.02194 | Neuro-RIT: Neuron-Guided Instruction Tuning for Robust Retrieval-Augmented Language Model (PDF) | cs.CL, cs.AI | score 84
    Neuron-level tuning to resist noisy/irrelevant retrieval; improves RAG robustness. Tags: RAG, robustness, retrieval-noise, instruction-tuning, attribution, neurons
  • 2604.01610 | GraphWalk: Enabling Reasoning in Large Language Models through Tool-Based Graph Navigation (PDF) | cs.AI | score 84
    Training-free tool-based KG navigation enables multi-hop reasoning beyond context limits. Tags: agents, tool-use, knowledge-graphs, grounding, long-context, reasoning
  • 2604.02047 | Goose: Anisotropic Speculation Trees for Training-Free Speculative Decoding (PDF) | cs.CL, cs.AI | score 84
    Training-free speculative decoding w/ anisotropic trees; principled use of mixed-quality token sources. Tags: inference, speculative-decoding, efficiency, decoding, systems
  • 2604.01754 | LiveMathematicianBench: A Live Benchmark for Mathematician-Level Reasoning with Proof Sketches (PDF) | cs.CL, cs.AI, cs.LG | score 83
    Live, post-cutoff math benchmark from recent arXiv theorems; reduces contamination, adds taxonomy. Tags: evaluation, math-reasoning, benchmark, data-contamination, proof-sketches
  • 2604.01676 | GPA: Learning GUI Process Automation from Demonstrations (PDF) | cs.CV, cs.AI, cs.SE | score 82
    Deterministic, local GUI automation from one demo; emphasizes reliability calibration and privacy. Tags: agents, GUI, RPA, privacy, reliability, tooling
  • 2604.01576 | Care-Conditioned Neuromodulation for Autonomy-Preserving Supportive Dialogue Agents (PDF) | cs.LG | score 82
    Alignment for supportive agents: autonomy-preserving objective + relational failure benchmark. Tags: alignment, social-risk, autonomy, supportive-agents, benchmarks, reward-modeling
  • 2604.01840 | Not All Tokens See Equally: Perception-Grounded Policy Optimization for Large Vision-Language Models (PDF) | cs.AI | score 82
    Credits only visually-dependent tokens in RLVR; sharper learning signal for LVLM reasoning. Tags: multimodal, VLM, RLVR, credit-assignment, reasoning, optimization
  • 2604.01618 | Tex3D: Objects as Attack Surfaces via Adversarial 3D Textures for Vision-Language-Action Models (PDF) | cs.CV, cs.AI | score 81
    Physically plausible adversarial 3D textures attack VLA models; important robotics safety surface. Tags: adversarial, robotics, VLA, physical-attacks, robustness, security
  • 2604.01988 | SenseMath: Do LLMs Have Number Sense? Evaluating Shortcut Use, Judgment, and Generation (PDF) | cs.AI | score 81
    Controlled benchmark for number sense + shortcut use/judgment; useful probe of reasoning reliability. Tags: evaluation, numerical-reasoning, robustness, shortcuts, calibration
  • 2604.02276 | De Jure: Iterative LLM Self-Refinement for Structured Extraction of Regulatory Rules (PDF) | cs.AI, cs.CL, cs.LG | score 80
    Automated regulatory rule extraction with judge + iterative repair; useful for compliance-aware agents. Tags: governance, compliance, LLM-judge, self-refinement, information-extraction, agents

AI Paper Insights Brief

2026-04-04

0) Executive takeaways (read this first)

  • Agent reliability is shifting from "capability" to "operational correctness under constraints": deterministic fault injection plus budgeted scoring of tool misuse (ToolMisuseBench), and explicit modeling of the context-window budget as an RL decision problem (ContextBudget), make failures attributable and optimizable.
  • Training execution-heavy SWE agents can scale via a "semantic distillation → small-scale execution refinement" recipe: SWE-ZERO (300k execution-free trajectories) plus SWE-HERO (13k execution-verified ones) substantially improves SWE-bench Verified (e.g., 62.2% at 32B) while reducing infrastructure dependence.
  • Safety evaluation is becoming "trajectory-native" and supply-chain-aware: ATBench exposes long-horizon, delayed-trigger tool risks where even strong models struggle with fine-grained diagnosis; the MCP-server security work demonstrates multi-component attack chains and ships a behavioral-deviation detector (Connor) with high F1 (94.6%) that found malicious samples in a real marketplace.
  • "Reasoning" is not monotonically beneficial; budgets and credit assignment matter: long CoT can hurt function-calling accuracy (accuracy peaks at very short 8–32 token budgets); multimodal RL works better when advantage signals are routed to visually dependent tokens (PGPO); RL post-training is more stable when samples are routed between GRPO and self-distillation (SRPO).
  • Factuality/abstention is moving toward localized, model-native signals and targeted interventions: diffusion LMs can localize uncertain commitments via cross-chain entropy and correct those spans (OSCAR); abstention improves by detecting "query misalignment" via reasoning-trace inversion.
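
The "deterministic fault injection + budgeted scoring" idea in the first takeaway can be sketched as a minimal harness. All names here (`FaultyToolEnv`, `Budget`) are illustrative assumptions, not ToolMisuseBench's API; the point is that a fixed seed makes the fault schedule replayable, so failures are attributable:

```python
import random
from dataclasses import dataclass, field

@dataclass
class Budget:
    max_calls: int = 10   # hard cap on tool invocations per episode
    max_retries: int = 2

@dataclass
class FaultyToolEnv:
    """Deterministic tool environment: a fixed seed replays the same fault schedule."""
    seed: int
    fault_rate: float = 0.3
    budget: Budget = field(default_factory=Budget)
    calls: int = 0
    _rng: random.Random = field(init=False)

    def __post_init__(self):
        self._rng = random.Random(self.seed)

    def call(self, tool: str, args: dict) -> dict:
        if self.calls >= self.budget.max_calls:
            return {"ok": False, "error": "budget_exhausted"}
        self.calls += 1
        if self._rng.random() < self.fault_rate:
            return {"ok": False, "error": "injected_fault"}   # seeded, replayable fault
        return {"ok": True, "result": f"{tool}({sorted(args)})"}

# Same seed -> identical fault schedule, so a failing run can be replayed exactly.
env1 = FaultyToolEnv(seed=7)
env2 = FaultyToolEnv(seed=7)
trace1 = [env1.call("search", {"q": "x"})["ok"] for _ in range(10)]
trace2 = [env2.call("search", {"q": "x"})["ok"] for _ in range(10)]
assert trace1 == trace2
```

Budget exhaustion is reported as an explicit error rather than silently truncated, which lets the evaluation score "failed within budget" separately from "never recovered".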

2) Key themes (clusters)

Theme: Budgeted, deterministic evaluation for tool-using agents

Theme: Scalable training + verification loops for code and long-horizon autonomy

Theme: Trajectory-level safety + supply-chain/tool security

Theme: Credit assignment and routing in post-training (RLVR / preference learning)

Theme: Factuality, abstention, and uncertainty localization (incl. diffusion LMs)

3) Technical synthesis

  • Budget awareness is emerging as a unifying design principle for agent reliability: ToolMisuseBench's budgets (steps/calls/retries), ContextBudget's explicit remaining-context state, and CoT token-budget sweeps all show that "more compute" can be worse when allocated poorly.
  • Routing/weighting is the common fix for coarse credit assignment: SRPO routes samples between GRPO and SDPO; PGPO routes advantage signals to visually dependent tokens; both aim to reduce gradient variance and avoid late-stage collapse.
  • Verification is happening earlier and more locally: SWE-HERO applies execution refinement after large-scale execution-free distillation; OSCAR corrects uncertain spans before diffusion decoding "crystallizes"; SAFE (multi-hop) verifies each atomic step (a KG triple) with a trained feedback model.
  • Determinism + replayability is becoming the gold standard for tool-reliability and safety benchmarks: ToolMisuseBench's seeded fault engine and ATBench's planner synthesis + human auditing enable controlled ablations and longitudinal comparison.
  • Trajectory-level safety diagnosis remains a bottleneck: ATBench shows binary unsafe detection is passable but fine-grained attribution is very low; Connor responds with intent extraction + step-wise behavioral-deviation judgment.
  • Mechanistic interpretability is being used for both attack and diagnosis: CRaFT uses circuit influence (cross-layer transcoders) to find causally effective refusal features, yielding far higher jailbreak ASR than activation-based selection.
  • RAG alignment is shifting from IR labels to reader-utility signals: RRPO trains the reranker with RL, with rewards from LLM-judged generation quality; Neuro-RIT adapts the generator at neuron granularity to ignore irrelevant retrievals.
  • Small, structured reasoning supervision can beat larger baselines at verification: ThinknCheck's 1B model with supervised rationales exceeds 7B verifiers' balanced accuracy on LLMAggreFact and generalizes better to SciFact.
  • Embodied robustness is moving beyond 2D patches: Tex3D's differentiable 3D texture optimization (dual renderers + temporal weighting) substantially raises failure rates and transfers sim-to-real, implying object appearance is a first-class attack surface.
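
The sample-routing idea above can be sketched minimally. The routing criterion here (zero reward spread within a group) is an illustrative assumption, not SRPO's exact rule: when all rollouts for a prompt score the same, the group-relative advantage degenerates to zero, so that group is routed to a self-distillation-style update on its best rollout instead:

```python
import statistics

def route_samples(groups):
    """Route each prompt's rollout group to a group-relative or
    self-distillation update based on within-group reward spread."""
    routed = []
    for rewards in groups:
        mean = statistics.fmean(rewards)
        std = statistics.pstdev(rewards)
        if std > 0:
            # Informative spread: use normalized group-relative advantages (GRPO-style).
            advantages = [(r - mean) / std for r in rewards]
            routed.append(("group_relative", advantages))
        else:
            # Degenerate group (all rewards equal): advantage is zero everywhere,
            # so fall back to distilling the best rollout instead of a no-op update.
            routed.append(("self_distill", [max(rewards)]))
    return routed

routes = route_samples([[1.0, 0.0, 1.0, 0.0], [1.0, 1.0, 1.0, 1.0]])
assert routes[0][0] == "group_relative" and routes[1][0] == "self_distill"
```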

4) Top 5 papers (with "why now")

1) From SWE-ZERO to SWE-HERO: Execution-free to Execution-based Fine-tuning for Software Engineering Agents

  • Two-stage SFT: 300k execution-free distilled trajectories, then 13.2k execution-verified refinement trajectories.
  • Strong open-source SWE-bench Verified results (e.g., 62.2% at 32B), with clean ablations showing the execution-free stage matters (55.7% → 62.2%).
  • Practical recipe details (YaRN for 128k context; multi-turn masking; test-time scaling with a verifier).
  • Caveats: inherits teacher bias (Qwen3-Coder-480B) and depends on verifier quality; environment differences affect reproducibility.
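
The "multi-turn masking" recipe detail is commonly implemented as label masking, where only assistant tokens contribute to the cross-entropy loss; a minimal sketch under that standard convention (the paper's exact scheme may differ):

```python
IGNORE = -100  # conventional ignore index for cross-entropy loss

def mask_labels(turns: list[tuple[str, list[int]]]) -> tuple[list[int], list[int]]:
    """Concatenate a multi-turn trajectory into (input_ids, labels),
    keeping loss only on assistant tokens; user/tool turns are context."""
    input_ids, labels = [], []
    for role, token_ids in turns:
        input_ids.extend(token_ids)
        if role == "assistant":
            labels.extend(token_ids)                   # train on agent outputs
        else:
            labels.extend([IGNORE] * len(token_ids))   # masked out of the loss
    return input_ids, labels

ids, labels = mask_labels([
    ("user", [11, 12]),
    ("assistant", [21, 22, 23]),
    ("tool", [31]),
    ("assistant", [41]),
])
assert labels == [IGNORE, IGNORE, 21, 22, 23, IGNORE, 41]
```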

2) ATBench: A Diverse and Realistic Trajectory Benchmark for Long-Horizon Agent Safety

  • 1,000 human-audited, tool-grounded trajectories with delayed triggers; a large tool pool (2,084 tools; 1,954 calls).
  • Reveals a key gap: strong models are passable at binary safety (GPT-5.4 at 76.7% F1) but fail at diagnosis (e.g., 13.5% failure-mode accuracy).
  • Provides a controllable taxonomy (risk source / failure mode / harm), enabling targeted sliced evaluation.
  • Caveats: a single label per axis may miss multi-causal explanations; English only; text + tools only (no multimodal/embodied settings).

3) From Component Manipulation to System Compromise: Understanding and Detecting Malicious MCP Servers

  • Component-centric PoC dataset: 114 malicious servers (19 impact paths × 6 targets); shows multi-component composition raises ASR; direct code/config injection reaches 100% ASR.
  • Connor detector: 94.6% F1, strong ablation evidence (the semantic generator is key), plus a marketplace scan (1,672 servers → 2 confirmed malicious).
  • A concrete blueprint for tool-marketplace security: intent extraction + execution tracing + code slicing + step-wise judgment.
  • Caveats: relies on emulation/execution, so payloads not triggered during emulation can escape; results depend on host/LLM versions.
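
The "intent extraction + step-wise judgment" blueprint can be illustrated with a toy behavioral-deviation check. Connor's actual pipeline uses an LLM for intent extraction and judgment; the capability sets and action strings below are hypothetical stand-ins for that machinery:

```python
# Hypothetical declared capabilities, as if extracted from each tool's manifest.
INTENT_CAPABILITIES = {
    "read_file": {"fs.read"},
    "weather_lookup": {"net.http_get"},
}

def deviations(tool: str, trace: list[str]) -> list[str]:
    """Return observed actions not covered by the tool's declared intent."""
    allowed = INTENT_CAPABILITIES.get(tool, set())
    return [action for action in trace if action not in allowed]

# A 'weather' tool that also writes a shell config and spawns a process is
# acting outside its declared intent -> flag those steps for review.
flags = deviations("weather_lookup", ["net.http_get", "fs.write:~/.bashrc", "proc.spawn"])
assert flags == ["fs.write:~/.bashrc", "proc.spawn"]
```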

4) OSCAR: Orchestrated Self-verification and Cross-path Refinement

  • Training-free hallucination detection/correction for diffusion LMs: cross-chain entropy localization + targeted re-masking.
  • AUROC exceeds trained detectors (86.5% average on LLaDA-8B; 85.7% on Dream-7B) and improves QA F1 (+6.1pp on LLaDA-8B; +10.7 on TriviaQA).
  • Achieves span-level mitigation on RAGTruth (41.1% overall reduction in hallucinated spans).
  • Caveats: peak memory grows (~1.67× at N=8) and only two DLMs are covered; without retrieval it cannot fix "unknown unknowns".
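
Cross-chain entropy localization can be illustrated on toy token sequences: positions where parallel decoding chains disagree get high entropy and become candidates for re-masking. OSCAR operates on diffusion-LM decoding chains; this sketch uses plain strings and a hand-picked threshold:

```python
import math
from collections import Counter

def span_entropy(chains: list[list[str]]) -> list[float]:
    """Per-position entropy (bits) across parallel decoding chains.
    High entropy = the chains disagree on what to commit at that position."""
    n = len(chains)
    ent = []
    for i in range(len(chains[0])):
        counts = Counter(chain[i] for chain in chains)
        ent.append(-sum((c / n) * math.log2(c / n) for c in counts.values()))
    return ent

chains = [
    "the capital is Paris".split(),
    "the capital is Lyon".split(),
    "the capital is Paris".split(),
    "the capital is Nice".split(),
]
ent = span_entropy(chains)
uncertain = [i for i, e in enumerate(ent) if e > 0.5]  # candidates for re-masking
assert uncertain == [3]  # only the committed entity position disagrees
```

In the actual method the flagged span would be re-masked and re-decoded rather than merely reported.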

5) Brief Is Better: Non-Monotonic Chain-of-Thought Budget Effects in Function-Calling Language Agents

  • Clear deployment guidance: short CoT helps routing, while long CoT collapses accuracy (Qwen2.5-1.5B: 44% → 64% at a 32-token budget, then down to 25% at 256 tokens).
  • Mechanistic error breakdown: short CoT sharply reduces "wrong-but-valid function" selections (30.5% → 1.5%); long CoT increases both wrong selections and hallucinated functions.
  • FR-CoT prompting eliminates function hallucination (0.0%) while matching short-CoT accuracy.
  • Caveats: covers only BFCL v3 multi-function and three models; multi-step tool chains were not evaluated.

5) Practical next steps

  • Adopt budgeted evaluation: add ToolMisuseBench-style deterministic fault injection and budget-capped AUC to internal tool-agent CI; track invalid-call rate, time-to-recovery, and catastrophic failures separately.
  • Implement "brief routing CoT" for function calling: try an 8–32 token reasoning cap and/or FR-CoT-style forced function commitment; measure the rates of "wrong-but-valid function" and "hallucinated function" errors.
  • Treat context as a constrained control problem: prototype a "remaining-context-aware" compression policy (NULL/PARTIAL/FULL per snippet) and evaluate robustness as the budget shrinks (e.g., 16k → 4k).
  • Harden the tool supply chain: scan high-risk startup commands and configs before execution, and extract intent from tool schemas; consider trajectory-level behavioral-deviation checks for high-risk tools.
  • Move from binary safety to diagnosis: if you use a trajectory safety benchmark (ATBench-like), train and measure fine-grained attribution (risk source / failure mode / harm), not just safe/unsafe.
  • For RAG systems, optimize retrieval for reader utility: try RL-training the reranker (RRPO-style) with rewards from LLM-judged generation quality, and compare downstream F1/EM against rerankers trained on IR labels.
  • For factuality, localize then correct: apply span-level correction where model-native uncertainty signals exist (diffusion chains); for AR models, consider training-time span masking/reallocation (PRISM-like) when fact-risk annotations are available.
  • For embodied systems, add appearance-robustness tests: include object-bound texture/appearance perturbations (multi-view, EoT-style) in simulated evaluation; where applicable, track transfer to physical environments.
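
The budget-capped AUC metric from the first next-step can be sketched as the area under a success-vs-budget curve; the function name and data shape below are illustrative assumptions:

```python
def budget_auc(episodes: list[tuple[int, bool]], max_budget: int) -> float:
    """Normalized area under the success-vs-budget curve.

    `episodes` pairs each run with (steps_used, solved). At budget cap b,
    a run counts as solved only if it solved within b steps; averaging the
    resulting success rates over b = 1..max_budget rewards agents that
    succeed early, not just eventually.
    """
    rates = []
    for b in range(1, max_budget + 1):
        solved = sum(1 for steps, ok in episodes if ok and steps <= b)
        rates.append(solved / len(episodes))
    return sum(rates) / max_budget

episodes = [(2, True), (5, True), (9, False), (4, True)]
auc = budget_auc(episodes, max_budget=5)  # success rates [0, .25, .25, .5, .75] averaged
```

Two agents with the same final success rate then separate cleanly: the one that solves tasks in fewer steps gets the higher AUC.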

Generated from per-paper analyses; no external browsing.