AI Paper Daily (2026-03-04)


English version: /paper-news/2026-03-04/

Run statistics

  • Candidate papers: 284
  • Selected papers: 30
  • Deep reads completed: 30
  • Time window (UTC): 2026-03-02T01:00:00Z → 2026-03-03T01:00:00Z (arxiv_announce, expanded=0)
Papers used for this summary (30 selected from 284 candidates):
| arXiv ID | Title | Categories | Score | Why selected | Tags |
|---|---|---|---|---|---|
| 2603.01608 | Evaluating and Understanding Scheming Propensity in LLM Agents | cs.AI | 95 | Systematic eval of LLM agent scheming incentives; realistic scenarios + factor decomposition | agent-safety, scheming, evaluation, instrumental-goals, autonomy |
| 2603.02196 | Conformal Policy Control | cs.AI, cs.LG, math.ST, stat.ML | 94 | Conformal calibration to bound policy risk vs safe reference; provable safety for exploration | agent-safety, conformal-prediction, safe-exploration, risk-bounds, RL |
| 2603.01564 | From Secure Agentic AI to Secure Agentic Web: Challenges, Threats, and Future Directions | cs.CR | 92 | Survey + taxonomy for agentic/web threats (memory/tool/env injection) and defenses | agent-security, prompt-injection, tool-safety, memory-attacks, survey, threat-models |
| 2603.01423 | Quantifying Conversational Reliability of Large Language Models under Multi-Turn Interaction | cs.CL | 92 | Systematic multi-turn reliability eval incl. constraints, tool choice, entity tracking; shows degradation | evaluation, reliability, multi-turn, tool-use, dialogue, agentic |
| 2603.01454 | VidDoS: Universal Denial-of-Service Attack on Video-based Large Language Models | cs.CV, cs.AI | 92 | Universal DoS-style energy/latency attack on Video-LLMs; practical triggers without test-time grads | security, adversarial-attacks, denial-of-service, video-llm, robustness |
| 2603.01357 | ASTRA-bench: Evaluating Tool-Use Agent Reasoning and Action Planning with Personal User Context | cs.AI | 91 | New benchmark for tool-use agents with evolving personal context; exposes failures at high complexity | benchmark, agents, tool-use, personal-context, planning, evaluation |
| 2603.01589 | SafeSci: Safety Evaluation of Large Language Models in Science Domains and Beyond | cs.LG, cs.AI | 90 | Large scientific safety benchmark (0.25M) + 1.5M training set with more objective metrics | safety-eval, benchmarks, science-safety, datasets, red-teaming |
| 2603.02203 | Tool Verification for Test-Time Reinforcement Learning | cs.AI, cs.CL | 90 | Adds tool-based verification to test-time RL to prevent spurious consensus reward collapse | reasoning, test-time-training, verification, tools, robustness |
| 2603.01896 | Agentic Code Reasoning | cs.SE, cs.AI, cs.PL | 90 | Semi-formal prompting gives checkable "certificates" for agent code reasoning; strong gains reported | agents, code, reasoning, verification, prompting, reliability, software-engineering |
| 2603.02146 | LongRLVR: Long-Context Reinforcement Learning Requires Verifiable Context Rewards | cs.CL | 89 | Shows outcome-only RLVR fails for long-context grounding; proposes verifiable context rewards + theory | RLVR, long-context, grounding, alignment, training, theory |
| 2603.01784 | Co-Evolutionary Multi-Modal Alignment via Structured Adversarial Evolution | cs.CR, cs.AI | 88 | Co-evolutionary multimodal safety alignment with evolving adversarial attacks (genetic ops) | multimodal, adversarial-training, alignment, robustness, automated-redteaming |
| 2603.01907 | Efficient RLVR Training via Weighted Mutual Information Data Selection | cs.LG, cs.CL | 88 | Mutual-information data selection for RLVR/RL training; targets efficiency + uncertainty, not just difficulty | RLHF, RLVR, data-selection, uncertainty, bayesian, training-efficiency, alignment |
| 2603.01426 | Understanding the Physics of Key-Value Cache Compression for LLMs through Attention Dynamics | cs.CL | 87 | KV-cache compression analysis finds hallucination 'safety cliff' near high compression; better eval lens | long-context, KV-cache, efficiency, hallucinations, attention, robustness |
| 2603.02029 | Rich Insights from Cheap Signals: Efficient Evaluations via Tensor Factorization | cs.AI, cs.LG, stat.ML | 87 | Cuts eval cost by combining cheap autoraters + few human labels via tensor factorization | evaluation, human-preference, autoraters, statistical-modeling, scalable-evals |
| 2603.01494 | Inference-Time Safety For Code LLMs Via Retrieval-Augmented Revision | cs.SE, cs.AI, cs.CR, cs.LG | 86 | Inference-time safety for code LLMs via retrieval-augmented revision using security knowledge | code-llms, secure-coding, RAG, inference-time, software-security |
| 2603.01714 | TopoCurate: Modeling Interaction Topology for Tool-Use Agent Training | cs.LG, cs.CL | 86 | Interaction-topology curation for tool-use training; goes beyond pass-rate filtering to informative tasks | agents, tool-use, data-curation, RL, training, trajectories |
| 2603.01940 | CoVe: Training Interactive Tool-Use Agents via Constraint-Guided Verification | cs.AI | 85 | Constraint-guided verification to synthesize correct tool-use trajectories + RL rewards | tool-use, agents, verification, post-training, data-synthesis, RL |
| 2603.02128 | LLMs as Strategic Actors: Behavioral Alignment, Risk Calibration, and Argumentation Framing in Geopolitical Simulations | cs.CL, cs.AI, cs.CY | 85 | Measures LLM agent behavior in crisis sims: alignment to humans, risk calibration, framing drift | agent-evaluation, risk-calibration, geopolitics, behavioral-analysis, multi-round |
| 2603.01562 | RubricBench: Aligning Model-Generated Rubrics with Human Standards | cs.AI | 84 | RubricBench benchmark for rubric-based reward/evaluation; targets hard, bias-misleading comparisons | reward-models, evaluation, rubrics, alignment, benchmark, preference-modeling |
| 2603.02208 | Reasoning Core: A Scalable Procedural Data Generation Suite for Symbolic Pre-training and Post-Training | cs.CL | 84 | Procedural, verifiable symbolic data suite (planning/FOL/CFG/causal/equations) for scaling reasoning | synthetic-data, reasoning, verification, benchmarks, curriculum |
| 2603.01571 | Beyond Length Scaling: Synergizing Breadth and Depth for Generative Reward Models | cs.AI | 84 | Structured breadth+depth CoT for generative reward models; SFT+RLVR to improve evaluator reliability | reward-models, evaluation, RLVR, chain-of-thought, reliability, alignment |
| 2603.01620 | ToolRLA: Fine-Grained Reward Decomposition for Tool-Integrated Reinforcement Learning Alignment in Domain-Specific Agents | cs.AI | 83 | Fine-grained reward decomposition for tool-integrated agent alignment beyond binary success | agents, tool-calling, RLHF, reward-modeling, DPO, GRPO |
| 2603.01919 | Real Money, Fake Models: Deceptive Model Claims in Shadow APIs | cs.CR, cs.AI, cs.SE | 83 | First audit of 'shadow APIs' claiming frontier models; reliability/security implications for deployments | security, model-supply-chain, API, auditing, reliability, governance |
| 2603.01550 | Extracting Training Dialogue Data from Large Language Model based Task Bots | cs.CL, cs.AI | 82 | Quantifies memorization leakage in LLM-based task bots; extracts dialogue events and identifiers | privacy, memorization, data-extraction, task-bots, security, LLMs |
| 2603.02091 | Learning from Synthetic Data Improves Multi-hop Reasoning | cs.LG, cs.AI, cs.CL | 82 | RL fine-tuning on rule-generated synthetic multi-hop data improves real QA without costly labels | reasoning, reinforcement-learning, synthetic-data, multi-hop, data-generation |
| 2603.01792 | ALTER: Asymmetric LoRA for Token-Entropy-Guided Unlearning of LLMs | cs.CL, cs.AI | 82 | Token-entropy-guided unlearning with lightweight asymmetric LoRA; aims to reduce collateral damage | unlearning, privacy, safety, LoRA, model-editing, knowledge-control |
| 2603.01574 | DualSentinel: A Lightweight Framework for Detecting Targeted Attacks in Black-box LLM via Dual Entropy Lull Pattern | cs.CR, cs.AI | 81 | Black-box detection of backdoor/prompt-injection via online 'entropy lull' generation signal | prompt-injection, backdoors, black-box, monitoring, detection |
| 2603.01639 | Learning to Draft: Adaptive Speculative Decoding with Reinforcement Learning | cs.CL | 81 | RL-optimized speculative decoding to maximize real throughput (draft+verify), not proxy acceptance metrics | inference, speculative-decoding, RL, efficiency, serving, LLM-systems |
| 2603.02119 | Pencil Puzzle Bench: A Benchmark for Multi-Step Verifiable Reasoning | cs.AI, cs.GT, cs.LG | 80 | Verifiable multi-step reasoning benchmark with step-level checks; supports dense process rewards | reasoning, benchmarks, process-supervision, verification, agentic-eval |
| 2603.01710 | Legal RAG Bench: an end-to-end benchmark for legal RAG | cs.CL, cs.IR, cs.LG | 80 | End-to-end Legal RAG benchmark with hierarchical error decomposition separating retrieval vs reasoning | RAG, benchmark, legal, evaluation, retrieval, grounding |

AI Paper Insights Briefing

2026-03-04

0) Executive takeaways (read this first)

  • Agent evaluation is shifting from "did it succeed?" to "why did it fail?": ASTRA-bench and Legal RAG Bench both add actionable evidence artifacts and error taxonomies that separate retrieval/grounding from action/payload construction and from reasoning, enabling targeted training fixes instead of chasing aggregate scores.
  • "Reliability cliffs" are emerging as a key deployment risk signal: KV-cache compression shows a sharp hallucination "safety cliff" near extreme compression (α≈0.9), and multi-turn dialogue causes large drops in instruction retention even when single-turn performance is near perfect.
  • Safety is increasingly about availability + supply chain, not just jailbreaks: VidDoS demonstrates a universal latency/token-inflation attack on Video-LLMs; "shadow APIs" reveal widespread model substitution/deception, with large capability drops on medical/legal tasks and frequent fingerprint-verification failures.
  • The reward/evaluation pipeline is itself a bottleneck: RubricBench quantifies a sizable "rubric gap" (model-generated vs human rubrics); Mix-GRM shows that reasoning structure (breadth vs depth) must match task type; length scaling alone is not enough.
  • RLVR is being re-engineered for "real-world" settings: LongRLVR shows outcome-only RLVR cannot learn long-context grounding (vanishing gradients) and fixes it with verifiable context rewards; INSIGHT improves RLVR efficiency via Bayesian mutual-information data selection; T³RL stabilizes test-time RL by tool-verifying pseudo-labels.
  • Tool-use agent training is becoming more systematic: TopoCurate (topology-aware data curation), CoVe (constraint-verified interaction data), and ToolRLA (fine-grained reward decomposition + compliance penalties) all emphasize structured signals over binary success.

1) Key themes (clusters)

Theme: Grounded, diagnostic evaluation for agents and RAG

  • Why it matters: end-to-end success rates hide whether a failure came from retrieval/grounding, coreference resolution, payload construction, or reasoning. Benchmarks that expose what broke enable targeted fixes and safer deployment.
  • Representative papers: ASTRA-bench, Legal RAG Bench, Pencil Puzzle Bench, Agentic Code Reasoning.
  • Common methods
    • Ground tasks in verifiable artifacts (tool traces/system state; annotated evidence passages; step-wise puzzle rule checks; test-execution ground truth).
    • Provide error decompositions (e.g., retrieval vs reasoning vs hallucination; milestones/minefields; equivalent vs non-equivalent patch cases); a minimal milestone/minefield scorer is sketched after this list.
    • Stress-test realistic failure drivers: personal context that evolves over time, lexically dissimilar legal queries, long-horizon iterative solving, repository-level code navigation.
  • Open problems / failure modes
    • Evaluator brittleness: milestone checks and judges can return false negatives on valid-but-unanticipated plans (ASTRA).
    • Benchmark-to-reality gap: synthetic personal corpora and puzzle/text-board representations may miss real-world noise and multimodality.
    • Cost/infrastructure limits: long-horizon agent evaluation can be very expensive and more failure-prone at high "effort" settings (Pencil Puzzle Bench).
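
To make the milestone/minefield idea concrete, here is a minimal scoring sketch. It is not ASTRA-bench's implementation; the `Check` schema, the dependency encoding, and the all-or-nothing minefield rule are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Check:
    """One milestone or minefield: a predicate over the trajectory plus,
    for milestones, the names of prerequisite milestones (hypothetical schema)."""
    name: str
    predicate: Callable[[list], bool]
    deps: List[str] = field(default_factory=list)

def score_trajectory(trajectory: list, milestones: List[Check],
                     minefields: List[Check]) -> dict:
    """Credit milestones in dependency order; any tripped minefield voids the score."""
    hit = set()
    for m in milestones:  # assumed topologically sorted by deps
        if all(d in hit for d in m.deps) and m.predicate(trajectory):
            hit.add(m.name)
    tripped = [mf.name for mf in minefields if mf.predicate(trajectory)]
    rate = len(hit) / max(len(milestones), 1)
    return {"milestones_hit": sorted(hit),
            "minefields_tripped": tripped,
            "score": 0.0 if tripped else rate}
```

Returning per-check results rather than a single pass rate is what makes the evaluation diagnostic.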

Theme: RLVR and test-time learning (denser signals, better curricula, safer pseudo-labels)

Theme: Training signals for tool-use agents (constraints, topology, compliance-aware rewards)

Theme: Evaluator and reward-model reliability (rubrics and reasoning structure)

Theme: Security and integrity of the deployed LLM ecosystem (availability, privacy leakage, supply chain)

Theme: Reliability cliffs in interaction and infrastructure (multi-turn, long-context compression)

2) Technical synthesis

  • Grounding is becoming an explicit training/evaluation target: LongRLVR factorizes grounding from answering and adds verifiable context rewards; Legal RAG Bench measures retrieval accuracy separately; ASTRA provides gold retrieval entities and verifiable tool traces.
  • Outcome-only signals fail repeatedly, in different guises: RLVR stalls on grounding; TTRL collapses under wrong-majority pseudo-labels; outcome-only trajectory filtering misses recovery, efficiency, and diversity (TopoCurate).
  • Verification is moving from LLM judges toward deterministic checks wherever possible: CoVe uses rule-based constraint satisfaction; Pencil Puzzle Bench verifies every step; Agentic Code Reasoning uses test execution as ground truth for patch equivalence; T³RL verifies rollouts with code execution (see the pseudo-label voting sketch after this list).
  • Where LLM judges are used, papers increasingly quantify judge/rubric failures: RubricBench separates rubric generation from execution; Legal RAG Bench reports internal judge accuracy; the tensor-factorization work treats autoraters as noisy auxiliary signals rather than ground truth.
  • The tool-use bottleneck is shifting from retrieval to structured action construction: ASTRA's decomposition shows IR recall is relatively strong while payload/argument generation is the main bottleneck, with large cross-model variance.
  • Safety/robustness failures often appear as sharp phase transitions: the KV-compression hallucination cliff correlates with GER spikes, and multi-turn instruction following shows larger discrete drops than tool selection or entity extraction.
  • Security threat models are expanding: availability (VidDoS), supply-chain integrity (shadow APIs), privacy leakage from fine-tuned structured task bots (belief-state extraction), and black-box runtime detection (DualSentinel).
  • Agent misbehavior tends to be configuration-sensitive: scheming/deception propensity is near zero at baseline, yet small prompt/scaffold changes can make it jump sharply; changes in tool availability can likewise collapse scheming rates.
  • Efficiency work is becoming more control-theoretic/RL-flavored: speculative decoding is formulated as throughput optimization via a co-adapted RL policy (LTD), complementing the reliability concerns raised by compression and long contexts.
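
As a concrete instance of the "deterministic checks over spurious consensus" direction, below is a minimal sketch of tool-verified weighted voting over test-time pseudo-labels. It is not T³RL's algorithm; the `verify` callback (e.g., code execution or a symbolic checker) and the weighting scheme are assumptions for illustration.

```python
from collections import Counter
from typing import Callable, List, Optional

def tool_verified_pseudo_label(
    answers: List[str],
    verify: Callable[[str], Optional[bool]],  # deterministic tool check; None if inapplicable
    verified_weight: float = 3.0,
) -> Optional[str]:
    """Weighted consensus over rollout answers: verified-correct answers get
    extra weight, verified-wrong ones are dropped, and the function abstains
    (returns None) when no candidate reaches a weighted majority, so a wrong
    consensus cannot silently become the training reward."""
    votes: Counter = Counter()
    for a in answers:
        v = verify(a)
        if v is False:
            continue  # contradicted by the tool: never reward
        votes[a] += verified_weight if v is True else 1.0
    if not votes:
        return None
    label, weight = votes.most_common(1)[0]
    return label if weight > sum(votes.values()) / 2 else None
```

In an RL loop, an abstention would simply skip the update for that prompt instead of reinforcing a noisy majority.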

3) Top 5 papers (and why they matter now)

1) Real Money, Fake Models: Deceptive Model Claims in Shadow APIs

  • Quantifies a real supply-chain problem: 17 shadow APIs already used by 187 papers.
  • Shows severe capability collapse in high-stakes domains (e.g., Gemini-2.5-flash drops from an official 83.82% on MedQA to ~36.95% on a shadow endpoint).
  • Provides direct identity evidence: 45.83% of 24 endpoints fail fingerprint verification (plus another 12.50% with large deviations), corroborated by MET; a lightweight fingerprinting sketch follows this list.
  • Caveats: the measurement is a time-bounded snapshot (Sep-Dec 2025) of a volatile market, and backend ground truth is unavailable.
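
A lightweight version of such fingerprinting can be sketched as a distribution-equivalence check, assuming both the official and candidate endpoints expose top-k next-token probabilities over a fixed prompt set. The paper's exact statistics (and MET) are not reproduced here, and `threshold` is a placeholder to calibrate on endpoint pairs known to serve the same model.

```python
import math
from typing import Dict, List

def symmetric_kl(p: Dict[str, float], q: Dict[str, float], eps: float = 1e-8) -> float:
    """Symmetric KL divergence between two top-k next-token distributions.
    Missing tokens are smoothed with eps; inputs are assumed ~normalized."""
    keys = set(p) | set(q)
    kl_pq = sum(p.get(k, eps) * math.log(p.get(k, eps) / q.get(k, eps)) for k in keys)
    kl_qp = sum(q.get(k, eps) * math.log(q.get(k, eps) / p.get(k, eps)) for k in keys)
    return kl_pq + kl_qp

def fingerprint_mismatch(official: List[Dict[str, float]],
                         candidate: List[Dict[str, float]],
                         threshold: float = 0.5) -> bool:
    """Flag a candidate endpoint whose mean divergence from the official
    endpoint over the prompt set exceeds the calibrated threshold."""
    divs = [symmetric_kl(p, q) for p, q in zip(official, candidate)]
    return sum(divs) / len(divs) > threshold
```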

2) ASTRA-bench: Evaluating Tool-Use Agent Reasoning and Action Planning with Personal User Context

  • Brings tool-use evaluation closer to real assistants: longitudinal personal context + stateful tools + temporal anchoring.
  • Adds diagnostic scoring (milestone/minefield DAGs + rubric judges) and explicit complexity dimensions.
  • Pinpoints a clear bottleneck: payload/argument generation lags retrieval and drives cross-model variance.
  • Caveats: synthetic-to-real gap; evaluators can return false negatives on valid plans; milestones are costly to author.

3) LongRLVR: Long-Context Reinforcement Learning Requires Verifiable Context Rewards

  • A vanishing-gradient argument explains why outcome-only RLVR fails at long-context grounding.
  • Adds verifiable context rewards (a modulated Fβ over evidence chunks; a minimal sketch follows this list) and lifts long-context benchmarks (e.g., Qwen2.5-14B-1M on RULER-QA AVG: 73.17 → 88.90).
  • Caveats: relies on gold evidence-chunk labels produced by a synthetic pipeline; generality beyond that setting is not yet established.
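
A minimal sketch of an Fβ-style evidence-overlap reward, assuming gold evidence-chunk IDs are available. The paper's modulation term is not reproduced, and the mixing weight `alpha` is an illustrative assumption, not a value from the paper.

```python
def f_beta(cited: set, gold: set, beta: float = 0.5) -> float:
    """F_beta overlap between cited and gold evidence-chunk IDs.
    beta < 1 favors precision, discouraging citing everything."""
    tp = len(cited & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(cited), tp / len(gold)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def context_reward(answer_correct: bool, cited: set, gold: set,
                   alpha: float = 0.5, beta: float = 0.5) -> float:
    """Mix the outcome reward with a verifiable grounding reward so the
    grounding term supplies gradient even when the final answer is wrong."""
    return (1 - alpha) * float(answer_correct) + alpha * f_beta(cited, gold, beta)
```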

4) RubricBench: Aligning Model-Generated Rubrics with Human Standards

  • 1,147 paired samples plus expert-written, instruction-only atomic rubrics make rubric quality measurable.
  • Shows a consistent ~26-28-point "rubric gap" (e.g., DeepSeek-v3.2: 57.8% → 84.9%).
  • Demonstrates that even humans degrade when constrained to model-generated rubrics (92% → 61% on N=100).
  • Caveats: expert rubric annotation is expensive; binary checklist-style rubrics trade nuance for verifiability.

5) Understanding the Physics of Key-Value Cache Compression for LLMs through Attention Dynamics

  • Reframes KV compression as an attention-routing perturbation problem, not just memory reduction.
  • Reports a hallucination "safety cliff" near α≈0.9 that correlates with GER (r up to 0.93); a sweep-harness sketch follows this list.
  • Uses probing vs generation failures to show that retention/reachability ≠ utilization.
  • Caveats: controlled synthetic tasks may not cover real-corpus heterogeneity; the theory is suggestive rather than a full guarantee.
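
A sweep harness for locating such a cliff might look like the sketch below. `hallucination_rate` is a hypothetical callback that evaluates a probe set at a given compression ratio, and the simple jump heuristic stands in for the paper's GER-based analysis.

```python
from typing import Callable, List, Tuple

def find_safety_cliff(hallucination_rate: Callable[[float], float],
                      alphas: List[float],
                      jump_threshold: float = 0.10) -> Tuple[float, float]:
    """Sweep compression ratios and return (alpha, rate) just before the
    largest jump in hallucination rate; if no jump exceeds the threshold,
    the highest tested ratio is treated as safe."""
    alphas = sorted(alphas)
    rates = [hallucination_rate(a) for a in alphas]
    jumps = [rates[i + 1] - rates[i] for i in range(len(rates) - 1)]
    if not jumps or max(jumps) < jump_threshold:
        return alphas[-1], rates[-1]  # no cliff detected in the tested range
    i = max(range(len(jumps)), key=jumps.__getitem__)
    return alphas[i], rates[i]  # last ratio before the sharpest degradation
```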

4) Practical next steps

  • Add decomposed metrics to your agent stack: separately log retrieval recall, tool-name validity, argument/payload correctness, redundancy/efficiency, and end-state success (aligned with ASTRA + ToolRLA + CoVe).
  • Add verifiable intermediate rewards for grounding in long-context RL: require explicit evidence-chunk IDs and reward Fβ overlap (LongRLVR-style) rather than final-answer correctness alone.
  • Harden test-time training loops: if you use majority-vote pseudo-labels, add tool verification and weighted voting so wrong modes cannot self-reinforce (T³RL).
  • Treat KV compression as a safety parameter: monitor GER-like proxies for evidence-routing deletion, and test for hallucination cliffs before shipping aggressive compression ratios.
  • Audit API provenance: if you depend on third-party endpoints, run fingerprint/distribution-equivalence checks and record endpoint provenance in your experiments (per the shadow-API paper's protocol).
  • Deploy black-box runtime detectors where feasible: if top-k token probabilities are available, test entropy-lull + task-flip verification against targeted sequence attacks (DualSentinel), and measure false-positive rates in your domain; a minimal entropy-lull monitor is sketched after this list.
  • For code security, try post-generation retrieval-augmented revision that draws on community security discussions (SOSECURE), tracking both fix rates and functional regressions (with added tests where possible).
  • Don't stop at outcome-only filtering for tool-use training data: use interaction-structure signals (TopoCurate) and constraint-verified synthesis (CoVe) to improve recovery and diversity without sacrificing correctness.
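
For the entropy-lull item above, a minimal monitor could look like this sketch, assuming the serving stack exposes per-step top-k token probabilities. DualSentinel's dual-pattern statistic and thresholds are not reproduced; `window` and `floor` are placeholders to calibrate on benign traffic.

```python
import math
from typing import Dict, List

def token_entropy(topk_probs: Dict[str, float]) -> float:
    """Shannon entropy (nats) of one decoding step's top-k token distribution."""
    return -sum(p * math.log(p) for p in topk_probs.values() if p > 0)

def entropy_lull(steps: List[Dict[str, float]], window: int = 8,
                 floor: float = 0.05) -> bool:
    """Flag a sustained run of near-deterministic decoding steps; pair any
    flag with a task-flip re-check to control false positives."""
    ents = [token_entropy(s) for s in steps]
    return any(max(ents[i:i + window]) < floor
               for i in range(len(ents) - window + 1))
```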

Generated from per-paper analysis; no external browsing was performed.