AI Paper Daily (2026-03-04)

Published:

English version: /paper-news/2026-03-04/

Run statistics

  • Candidate papers: 236
  • Selected papers: 30
  • Deep reads completed: 32
  • Time window (UTC): 2026-03-03T01:00:00Z → 2026-03-04T01:00:00Z (arxiv_announce, expanded=0)
Paper list used for summarization

| arXiv ID | Score | Why selected | Tags |
| --- | --- | --- | --- |
| 2603.01608 | 95 | Systematic eval of LLM agent scheming incentives; realistic scenarios + factor decomposition | agent-safety, scheming, evaluation, instrumental-goals, autonomy |
| 2603.02196 | 94 | Conformal calibration to bound policy risk vs. safe reference; provable safety for exploration | agent-safety, conformal-prediction, safe-exploration, risk-bounds, RL |
| 2603.01564 | 92 | Survey + taxonomy for agentic/web threats (memory/tool/env injection) and defenses | agent-security, prompt-injection, tool-safety, memory-attacks, survey, threat-models |
| 2603.01423 | 92 | Systematic multi-turn reliability eval incl. constraints, tool choice, entity tracking; shows degradation | evaluation, reliability, multi-turn, tool-use, dialogue, agentic |
| 2603.01454 | 92 | Universal DoS-style energy/latency attack on Video-LLMs; practical triggers without test-time grads | security, adversarial-attacks, denial-of-service, video-llm, robustness |
| 2603.01357 | 91 | New benchmark for tool-use agents with evolving personal context; exposes failures at high complexity | benchmark, agents, tool-use, personal-context, planning, evaluation |
| 2603.01589 | 90 | Large scientific safety benchmark (0.25M) + 1.5M training set with more objective metrics | safety-eval, benchmarks, science-safety, datasets, red-teaming |
| 2603.02203 | 90 | Adds tool-based verification to test-time RL to prevent spurious consensus reward collapse | reasoning, test-time-training, verification, tools, robustness |
| 2603.01896 | 90 | Semi-formal prompting gives checkable "certificates" for agent code reasoning; strong gains reported | agents, code, reasoning, verification, prompting, reliability, software-engineering |
| 2603.02146 | 89 | Shows outcome-only RLVR fails for long-context grounding; proposes verifiable context rewards + theory | RLVR, long-context, grounding, alignment, training, theory |
| 2603.01784 | 88 | Co-evolutionary multimodal safety alignment with evolving adversarial attacks (genetic ops) | multimodal, adversarial-training, alignment, robustness, automated-redteaming |
| 2603.01907 | 88 | Mutual-information data selection for RLVR/RL training; targets efficiency + uncertainty, not just difficulty | RLHF, RLVR, data-selection, uncertainty, bayesian, training-efficiency, alignment |
| 2603.01426 | 87 | KV-cache compression analysis finds hallucination 'safety cliff' near high compression; better eval lens | long-context, KV-cache, efficiency, hallucinations, attention, robustness |
| 2603.02029 | 87 | Cuts eval cost by combining cheap autoraters + few human labels via tensor factorization | evaluation, human-preference, autoraters, statistical-modeling, scalable-evals |
| 2603.01494 | 86 | Inference-time safety for code LLMs via retrieval-augmented revision using security knowledge | code-llms, secure-coding, RAG, inference-time, software-security |
| 2603.01714 | 86 | Interaction-topology curation for tool-use training; goes beyond pass-rate filtering to informative tasks | agents, tool-use, data-curation, RL, training, trajectories |
| 2603.01940 | 85 | Constraint-guided verification to synthesize correct tool-use trajectories + RL rewards | tool-use, agents, verification, post-training, data-synthesis, RL |
| 2603.02128 | 85 | Measures LLM agent behavior in crisis sims: alignment to humans, risk calibration, framing drift | agent-evaluation, risk-calibration, geopolitics, behavioral-analysis, multi-round |
| 2603.01562 | 84 | RubricBench benchmark for rubric-based reward/evaluation; targets hard, bias-misleading comparisons | reward-models, evaluation, rubrics, alignment, benchmark, preference-modeling |
| 2603.02208 | 84 | Procedural, verifiable symbolic data suite (planning/FOL/CFG/causal/equations) for scaling reasoning | synthetic-data, reasoning, verification, benchmarks, curriculum |
| 2603.01571 | 84 | Structured breadth+depth CoT for generative reward models; SFT+RLVR to improve evaluator reliability | reward-models, evaluation, RLVR, chain-of-thought, reliability, alignment |
| 2603.01620 | 83 | Fine-grained reward decomposition for tool-integrated agent alignment beyond binary success | agents, tool-calling, RLHF, reward-modeling, DPO, GRPO |
| 2603.01919 | 83 | First audit of 'shadow APIs' claiming frontier models; reliability/security implications for deployments | security, model-supply-chain, API, auditing, reliability, governance |
| 2603.01550 | 82 | Quantifies memorization leakage in LLM-based task bots; extracts dialogue events and identifiers | privacy, memorization, data-extraction, task-bots, security, LLMs |
| 2603.02091 | 82 | RL fine-tuning on rule-generated synthetic multi-hop data improves real QA without costly labels | reasoning, reinforcement-learning, synthetic-data, multi-hop, data-generation |
| 2603.01792 | 82 | Token-entropy-guided unlearning with lightweight asymmetric LoRA; aims to reduce collateral damage | unlearning, privacy, safety, LoRA, model-editing, knowledge-control |
| 2603.01574 | 81 | Black-box detection of backdoor/prompt-injection via online 'entropy lull' generation signal | prompt-injection, backdoors, black-box, monitoring, detection |
| 2603.01639 | 81 | RL-optimized speculative decoding to maximize real throughput (draft+verify), not proxy acceptance metrics | inference, speculative-decoding, RL, efficiency, serving, LLM-systems |
| 2603.02119 | 80 | Verifiable multi-step reasoning benchmark with step-level checks; supports dense process rewards | reasoning, benchmarks, process-supervision, verification, agentic-eval |
| 2603.01710 | 80 | End-to-end Legal RAG benchmark with hierarchical error decomposition separating retrieval vs reasoning | RAG, benchmark, legal, evaluation, retrieval, grounding |

AI Paper Insights Brief

2026-03-04

0) Executive takeaways (read this first)

  • The agent reliability bottleneck lies less in "finding information" than in "acting correctly": in tool-use agents with personal context, information-retrieval recall is high, while payload/argument construction is the main failure point (ASTRA-bench).
  • Multi-turn interaction is a first-class robustness risk: in multi-turn settings, instruction maintenance collapses sharply (e.g., a global "≤5 sentences" constraint), while tool selection and slot extraction degrade less; the degradation correlates with model scale (Conversational Reliability).
  • Optimization shortcuts can mask cliffs: KV-cache compression looks fine on standard long-context benchmarks, but near extreme compression (~0.9) a hallucination "safety cliff" emerges, correlated with the deletion of attention routing (KV compression physics).
  • Availability attacks are now practical against Video-LLMs: a universal, offline-trained patch can induce 200× token inflation and >15 s of added latency, creating real-time safety risks (VidDoS).
  • Evaluation infrastructure is itself a bottleneck: rubric-guided judging improves results only when the rubric is correct; there is a large, stable "rubric gap" (~26–28 points) between self-generated and human rubrics (RubricBench).
  • RLVR is splitting into two regimes: (i) cheap, fully verifiable synthetic data transfers to real multi-hop QA (Synthetic→Real RLVR), but (ii) long-context grounding requires verifiable intermediate context rewards, or RLVR stalls (LongRLVR).

2) Key themes (clustered)

Theme: tool-use agents in realistic, stateful environments

Theme: verifiable multi-step reasoning + training signals (RLVR, process feedback)

Theme: evaluation reliability (rubrics, autoraters, and diagnostic benchmarks)

Theme: security and privacy of deployed LLM systems (availability, extraction, supply chain)

Theme: robustness cliffs in long-context and inference optimization

3) Technical synthesis

  • Grounded evaluation is converging on "trace-first" signals: ASTRA uses tool traces + milestone DAGs; CoVe uses deterministic constraint satisfaction; Pencil Puzzle Bench verifies every step; Legal RAG Bench separates retrieval vs. reasoning vs. hallucination.
  • A recurring bottleneck is "structured action correctness": ASTRA finds payload/argument generation weakest; ToolRLA explicitly gates correctness via tool name, coverage, and argument accuracy; CoVe filters for constraint satisfaction with zero redundancy.
  • Outcome-only RLVR is insufficient when success hinges on rare antecedent events: LongRLVR formalizes the vanishing gradient for evidence selection and fixes it with verifiable context rewards.
  • Yet in synthetic multi-hop settings, outcome-only RLVR still improves intermediate reasoning quality: training raises the inclusion rate of correct intermediate answers in trajectories (Synthetic multi-hop RLVR), suggesting task structure matters.
  • Test-time learning needs external grounding to avoid self-reinforcing errors: T³RL replaces majority-vote pseudo-labeling with tool-verified weighted voting, preventing "spurious popular-mode collapse" (a minimal voting sketch follows this list).
  • Evaluation reliability has become a measurable object: RubricBench separates rubric generation vs. execution; tensor-factorization evaluation treats autoraters as noisy sensors calibrated with scarce human labels.
  • Long-context optimization can create safety cliffs: KV compression shows a hallucination spike near α≈0.9, correlated with global eviction of answer-relevant routing (GER).
  • Multi-turn interaction is an independent robustness axis: instruction maintenance degrades far more than tool selection or slot extraction, and smaller models degrade more (Conversational Reliability).
  • Security threats are shifting toward system properties: availability (VidDoS), provenance/integrity (Shadow APIs), and structured-label memorization (task bots) are all "non-plain-text" failure modes.
  • Agent safety is being reframed as a whole-ecosystem problem: the Agentic Web survey argues that identity/authorization, provenance, and ecosystem-level response (isolation/revocation/recovery) are foundational primitives beyond single-agent defenses.
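To illustrate the tool-verified voting point above: a minimal sketch, assuming a caller-supplied `tool_verify` scorer (e.g., a symbolic math checker returning 1.0 on confirmation). The function name, the residual weight, and the aggregation rule are illustrative assumptions, not T³RL's actual implementation.

```python
from collections import defaultdict

def tool_weighted_vote(candidates, tool_verify):
    """Aggregate sampled answers with tool-verified weights instead of a raw
    majority vote, so a popular-but-wrong answer cannot dominate.

    candidates: answer strings sampled at test time.
    tool_verify: callable(answer) -> score in [0, 1] (hypothetical verifier,
                 e.g., 1.0 when a symbolic checker confirms the answer).
    """
    if not candidates:
        raise ValueError("need at least one candidate answer")
    weights = defaultdict(float)
    for ans in candidates:
        # Verified answers dominate; unverified ones keep a small residual
        # weight so the pool never collapses to nothing.
        weights[ans] += max(tool_verify(ans), 0.05)
    return max(weights, key=weights.get)
```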

4) Top 5 papers (with "why now")

1) Real Money, Fake Models: Deceptive Model Claims in Shadow APIs

  • Quantifies a real supply-chain problem: 17 shadow APIs used across 187 papers.
  • Shows large utility collapse in high-stakes domains (e.g., reported MedQA accuracy drops between shadow and official endpoints) along with divergent safety behavior.
  • Provides two complementary verification methods (LLMmap fingerprinting + MET) with controlled validation.
  • Caveats: the market is volatile; results are a snapshot (Sept–Dec 2025), and backend ground truth is unobtainable.

2) ASTRA-bench: Evaluating Tool-Use Agent Reasoning and Action Planning with Personal User Context

  • Brings tool-use evaluation closer to real assistants: longitudinal personal context + stateful tools + temporal anchoring.
  • A diagnostic decomposition identifies payload/argument generation as the main bottleneck relative to retrieval.
  • Stress tests quantify performance drops under wrong information or insufficient context.
  • Caveats: synthetic-to-real gap; milestone-authoring cost and evaluator false negatives on unanticipated-but-valid plans.

3) LongRLVR: Long-Context Reinforcement Learning Requires Verifiable Context Rewards

  • Explains why outcome-only RLVR stalls on long-context grounding (vanishing gradients) and provides a verifiable fix.
  • Demonstrates gains on long-context benchmarks (e.g., Qwen2.5-14B-1M: 73.17 → 88.90 on RULER-QA AVG).
  • Caveats: depends on ground-truth evidence-chunk labels from a synthetic pipeline; generality beyond the paper's settings is not yet established.

4) RubricBench: Aligning Model-Generated Rubrics with Human Standards

  • Expert instruction-style rubrics plus strict alignment metrics make rubric quality measurable.
  • Finds a stable ~26–28-point "rubric gap" between self-generated and human-injected rubrics.
  • Shows that test-time scaling cannot fix rubric generation; even humans degrade when constrained to generated rubrics.
  • Caveats: expert rubric annotation limits scale; binary checklist-style rubrics may miss nuance on subjective tasks.

5) VidDoS: Universal Denial-of-Service Attack on Video-based Large Language Models

  • Demonstrates a universal, deploy-anywhere patch that induces extreme generation/latency inflation (token ratio >200×; reported latency overhead >15 s).
  • Directly relevant to real-time pipeline safety (cumulative latency-threshold violations).
  • Caveats: no explicit limitations section in the provided content; real-world defenses/mitigations and transfer to diverse deployments need further study.

5) Practical next steps

  • Instrument tool-use agents with "payload correctness" metrics (argument validity, schema adherence, argument accuracy), reported separately from retrieval and planning; ASTRA shows this is the dominant bottleneck (see the metric sketch after this list).
  • Add deterministic verifiers wherever possible: constraint-based tool verification (CoVe), step-wise state checkers (Pencil Puzzle Bench), and tool-executed math verification (T³RL) to cut judge noise.
  • For long-context RLVR, reward grounding explicitly: implement chunk-selection outputs plus an Fβ-style context reward (LongRLVR), and track context recall to catch early stalls (see the reward sketch below).
  • Stress-test multi-turn reliability with paired single-turn vs. multi-turn tasks (global constraints, tool routing, slot extraction) to quantify the "conversation tax" before deployment (see the pairing sketch below).
  • Treat KV compression as a safety parameter: as compression increases, monitor routing-deletion proxies (GER-style metrics) alongside hallucination rates, and avoid operating near the reported cliff region without guardrails.
  • Add availability red-teaming for multimodal systems: include long-generation/latency-inflation tests (VidDoS-style) in CI for Video-LLMs and real-time pipelines (see the CI check below).
  • Audit API provenance in research and production: adopt fingerprinting / distributional-equivalence tests (Shadow APIs) and log endpoint provenance to guard against silent model swaps (see the distance sketch below).
  • If you use rubric-guided judging, measure rubric quality directly: track rubric recall/hallucination against human rubrics (RubricBench) rather than assuming rubric prompting suffices (see the alignment sketch below).
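For the payload-correctness item: a minimal sketch using the `jsonschema` package. The record fields (`tool`, `args`, `schema`) are assumed formats for illustration, not ASTRA's actual schema.

```python
from jsonschema import ValidationError, validate

def payload_metrics(call, gold):
    """Score one tool call's payload, separately from retrieval/planning.

    call: {"tool": str, "args": dict} emitted by the agent (assumed format).
    gold: {"tool": str, "args": dict, "schema": dict} reference record.
    """
    tool_ok = call["tool"] == gold["tool"]
    try:
        # Schema adherence: does the payload satisfy the tool's JSON Schema?
        validate(instance=call["args"], schema=gold["schema"])
        schema_ok = True
    except ValidationError:
        schema_ok = False
    # Argument accuracy: fraction of gold arguments reproduced exactly.
    hits = sum(1 for k, v in gold["args"].items() if call["args"].get(k) == v)
    arg_accuracy = hits / max(len(gold["args"]), 1)
    return {"tool_ok": tool_ok, "schema_ok": schema_ok, "arg_accuracy": arg_accuracy}
```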
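For the Fβ-style context reward: the standard formula is F_β = (1 + β²)·P·R / (β²·P + R) over selected vs. gold evidence chunks. The sketch below assumes comparable chunk IDs; β = 2 is an illustrative default, not LongRLVR's exact choice.

```python
def context_reward(selected, gold, beta=2.0):
    """F-beta reward over evidence-chunk IDs: beta > 1 weights recall above
    precision, matching the intuition that missing a gold chunk stalls RLVR.

    selected: iterable of chunk IDs the policy cited.
    gold: iterable of ground-truth evidence chunk IDs.
    """
    selected, gold = set(selected), set(gold)
    if not selected or not gold:
        return 0.0
    tp = len(selected & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(selected)
    recall = tp / len(gold)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```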
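For the paired single-turn vs. multi-turn stress test: a minimal scoring helper, assuming each task is run once in each mode. The pairing format is an assumption, not the Conversational Reliability protocol.

```python
def conversation_tax(paired_results):
    """Pass-rate gap between single-turn and multi-turn runs of the same tasks.

    paired_results: list of (passed_single_turn, passed_multi_turn) booleans,
    one pair per underlying task (the pairing scheme is assumed here).
    """
    n = max(len(paired_results), 1)
    single = sum(s for s, _ in paired_results) / n
    multi = sum(m for _, m in paired_results) / n
    return {"single_turn": single, "multi_turn": multi, "tax": single - multi}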
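For VidDoS-style availability red-teaming in CI: a minimal guardrail test. The `generate` interface and both thresholds are illustrative assumptions; tune them to your pipeline's real-time budget.

```python
import time

def check_generation_budget(generate, adversarial_input,
                            max_token_ratio=20.0, max_latency_s=5.0):
    """CI guardrail: fail when an adversarial input blows the output-length
    or latency budget (a VidDoS-style symptom).

    generate: callable(input) -> list of output tokens (interface assumed).
    """
    start = time.monotonic()
    output_tokens = generate(adversarial_input)
    latency = time.monotonic() - start
    # Crude inflation proxy: output tokens per (whitespace) input token.
    ratio = len(output_tokens) / max(len(str(adversarial_input).split()), 1)
    assert ratio <= max_token_ratio, f"token inflation {ratio:.1f}x over budget"
    assert latency <= max_latency_s, f"latency {latency:.2f}s over budget"
```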
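For API provenance auditing: a crude distributional check comparing two endpoints' answers on identical prompts via total-variation distance. This is a stand-in illustration, not the paper's LLMmap or MET methods.

```python
from collections import Counter

def tv_distance(answers_a, answers_b):
    """Total-variation distance between two endpoints' answer distributions
    on identical prompts. High distance on low-temperature prompts suggests
    the endpoints are not serving the same model.
    """
    ca, cb = Counter(answers_a), Counter(answers_b)
    na, nb = max(len(answers_a), 1), max(len(answers_b), 1)
    return 0.5 * sum(abs(ca[k] / na - cb[k] / nb) for k in set(ca) | set(cb))
```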
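For rubric quality measurement: a minimal recall/hallucination computation treating rubrics as sets of criterion strings. Exact string matching is a simplifying assumption; RubricBench's alignment metrics are stricter.

```python
def rubric_alignment(generated_criteria, human_criteria):
    """Recall and hallucination rate of a generated rubric against a human
    rubric, treating each rubric as a set of criterion strings.
    """
    gen, hum = set(generated_criteria), set(human_criteria)
    recall = len(gen & hum) / max(len(hum), 1)
    hallucination = len(gen - hum) / max(len(gen), 1)
    return {"recall": recall, "hallucination": hallucination}
```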

Generated from per-paper analyses; no external browsing.