AI Paper Daily (2026-03-07)

Published:

English version: /paper-news/2026-03-07/

Run statistics

  • Candidate papers: 257
  • Selected papers: 30
  • Deep reads completed: 30
  • Time window (UTC): 2026-03-05T01:00:00Z → 2026-03-06T01:00:00Z (arxiv_announce, expanded=0)
Paper list used for the summary
arXiv ID | Title | Categories | Score | Selection rationale | Tags
2603.04915 | EVMbench: Evaluating AI Agents on Smart Contract Security | cs.LG, cs.AI, cs.CR | 95 | Agent eval for detecting/patching/exploiting smart-contract vulns in realistic EVM setting | agent-evaluation, cybersecurity, smart-contracts, red-teaming, benchmark, tool-use
2603.04904 | Alignment Backfire: Language-Dependent Reversal of Safety Interventions Across 16 Languages in LLM Multi-Agent Systems | cs.AI, cs.CL | 95 | Preregistered 16-language evidence of alignment backfire in multi-agent LLM systems | agent-safety, multi-agent, multilingual, alignment, robustness, evaluation
2603.05028 | Survive at All Costs: Exploring LLM's Risky Behaviors under Survival Pressure | cs.AI, cs.CL | 95 | Benchmark + case study on shutdown/survival pressure causing risky agent behavior | agent-safety, shutdown, deception, benchmark, risk-seeking, evaluation
2603.04902 | AgentSCOPE: Evaluating Contextual Privacy Across Agentic Workflows | cs.CR, cs.AI | 94 | Benchmark + CI-based framework to trace privacy leaks across multi-tool agent workflows | agents, privacy, benchmark, contextual-integrity, tool-use, evaluation
2603.04851 | Is RLHF Alignment Shallow? A Gradient Analysis | cs.LG, cs.CL | 93 | Theory: RLHF gradients vanish past harm horizon, explaining shallow safety alignment limits | alignment, RLHF, theory, gradients, safety-training, mechanistic
2603.04837 | Design Behaviour Codes (DBCs): A Taxonomy-Driven Layered Governance Benchmark for Large Language Models | cs.AI | 92 | Layered, auditable system-prompt governance benchmark across broad risk taxonomy + red-teaming | governance, system-prompts, red-teaming, safety-eval, controls, risk-taxonomy
2603.05399 | Judge Reliability Harness: Stress Testing the Reliability of LLM Judges | cs.AI | 91 | Open-source harness to stress-test LLM judges; key for reliable safety/agent evaluations | evaluation, llm-judges, reliability, robustness, tooling, benchmarks
2603.04857 | FireBench: Evaluating Instruction Following in Enterprise and API-Driven LLM Applications | cs.CL, cs.SE | 91 | Enterprise/API instruction-following benchmark; format/constraint adherence for real apps | instruction-following, benchmark, reliability, agents, evaluation, enterprise
2603.04751 | Evaluating the Search Agent in a Parallel World | cs.AI | 90 | Addresses hard, non-stationary evaluation of web search agents (obsolescence, attribution) | agents, evaluation, web-search, benchmarks, attribution, nonstationarity
2603.05293 | Knowledge Divergence and the Value of Debate for Scalable Oversight | cs.LG, cs.CL | 90 | Formalizes when debate beats RLAIF via representation-subspace knowledge divergence | scalable-oversight, debate, RLAIF, theory, representations, alignment
2603.04992 | ThaiSafetyBench: Assessing Language Model Safety in Thai Cultural Contexts | cs.CL | 89 | Thai safety benchmark with culturally grounded malicious prompts; evaluates 24 LLMs | safety, multilingual, benchmark, jailbreaks, thai, misuse
2603.04738 | IF-RewardBench: Benchmarking Judge Models for Instruction-Following Evaluation | cs.CL | 89 | Meta-eval for instruction-following judges using preference graphs beyond pairwise setups | LLM-judge, reward-models, instruction-following, benchmark, preference-graphs, eval
2603.05031 | AegisUI: Behavioral Anomaly Detection for Structured User Interface Protocols in AI Agent Systems | cs.AI | 88 | Targets UI payload behavioral mismatch attacks in agent systems; dataset + anomaly benchmarks | agent-security, ui-attacks, prompt-injection, anomaly-detection, benchmark
2603.05044 | WebFactory: Automated Compression of Foundational Language Intelligence into Grounded Web Agents | cs.AI | 88 | Automated closed-loop RL pipeline to train grounded web agents without unsafe live web data | web-agents, reinforcement-learning, environment-synthesis, grounding, automation, agent-training
2603.05035 | Good-Enough LLM Obfuscation (GELO) | cs.CR, cs.LG | 88 | Lightweight prompt-privacy vs KV-cache/hidden-state leakage on shared accelerators | privacy, inference-security, TEEs, KV-cache, deployment, systems
2603.05295 | WebChain: A Large-Scale Human-Annotated Dataset of Real-World Web Interaction Traces | cs.AI, cs.CV | 87 | Large human web-interaction trace dataset enabling scalable web agents + reproducible eval | web-agents, dataset, trajectories, multimodal, grounding, training
2603.04861 | Causally Robust Reward Learning from Reason-Augmented Preference Feedback | cs.AI, cs.LG, cs.RO | 87 | Uses rationale-augmented preferences to reduce causal confusion in reward learning | alignment, reward-learning, preferences, causal-robustness, rationales, RLHF
2603.05485 | Towards Provably Unbiased LLM Judges via Bias-Bounded Evaluation | cs.AI | 87 | Proposes formal bias-bounded framework aiming for provably less biased LLM-judge rewards | LLM-judge, bias, formal-guarantees, reward, alignment, evaluation
2603.04949 | TimeWarp: Evaluating Web Agents by Revisiting the Past | cs.AI, cs.CL, cs.CV, cs.LG | 86 | Evaluates web agents under UI drift across eras; highlights brittleness + proposes fix | web-agents, robustness, benchmark, distribution-shift, generalization
2603.04828 | From Unfamiliar to Familiar: Detecting Pre-training Data via Gradient Deviations in Large Language Models | cs.CL | 86 | Detects pretraining data via gradient deviations; useful for contamination/copyright audits | data-contamination, membership-inference, pretraining, auditing, gradients
2603.04968 | When Weak LLMs Speak with Confidence, Preference Alignment Gets Stronger | cs.CL, cs.AI | 86 | Uses weak-LLM confidence to weight preferences; reduces human labels while improving alignment | alignment, preference-optimization, DPO, weak-supervision, confidence, RLHF
2603.05218 | KARL: Knowledge Agents via Reinforcement Learning | cs.AI, cs.LG | 84 | RL-trained enterprise search agents + new eval suite; relevant to agentic RAG reliability | agents, search, rl, enterprise, rag, benchmark
2603.04900 | EvoTool: Self-Evolving Tool-Use Policy Optimization in LLM Agents via Blame-Aware Mutation and Diversity-Aware Selection | cs.AI | 84 | Evolutionary, blame-aware optimization of modular tool-use policies for long-horizon agents | agents, tool-use, policy-optimization, credit-assignment, evolutionary-methods, modular-agents
2603.04974 | VRM: Teaching Reward Models to Understand Authentic Human Preferences | cs.CL | 84 | Variational reward modeling to better capture authentic preferences and reduce reward hacking | reward-modeling, alignment, preferences, RLHF, robustness
2603.05308 | Med-V1: Small Language Models for Zero-shot and Scalable Biomedical Evidence Attribution | cs.CL, cs.AI | 83 | 3B biomedical evidence attribution models for scalable claim verification/hallucination checks | factuality, verification, biomedicine, small-language-models, hallucinations, synthetic-data
2603.04737 | Interactive Benchmarks | cs.AI, cs.CL, cs.LG | 83 | Interactive evaluation paradigm (proofs/games) to test active info acquisition under budgets | evaluation, interactive-benchmarks, reasoning, agents, games, robustness
2603.05290 | X-RAY: Mapping LLM Reasoning Capability via Formalized and Calibrated Probes | cs.AI | 82 | Formalized calibrated probes to map reasoning structure; useful for capability auditing | reasoning, evaluation, formal-methods, calibration, capability-mapping
2603.04918 | BandPO: Bridging Trust Regions and Ratio Clipping via Probability-Aware Bounds for LLM Reinforcement Learning | cs.LG, cs.AI | 82 | Probability-aware PPO clipping to avoid entropy collapse and preserve tail strategies in RL | RL, PPO, LLM-RL, optimization, stability, exploration
2603.05294 | STRUCTUREDAGENT: Planning with AND/OR Trees for Long-Horizon Web Tasks | cs.AI | 82 | AND/OR-tree planning + structured memory for long-horizon web tasks; agent capability jump | agents, planning, web-agents, long-horizon, structured-memory, search
2603.04859 | Osmosis Distillation: Model Hijacking with the Fewest Samples | cs.CR, cs.LG | 81 | Shows model hijacking via few poisoned distilled samples; important ML supply-chain risk | security, data-poisoning, model-hijacking, dataset-distillation, transfer-learning

AI Paper Insights Brief

2026-03-07

0) Executive highlights (read this first)

  • Evaluation is shifting from static scores to process-aware, interaction-first measurement: several new benchmarks explicitly score how agents gather information, plan, and interact (interactive proofs/games; parallel-world search; multi-version web UIs), not just their final answers.
  • LLM judges are becoming first-class reliability targets, with two complementary directions emerging: better judge benchmarks (IF-RewardBench) and judge stress testing plus provable debiasing (JRH; bias-bounded evaluation with calibrated noise).
  • Agent safety risks increasingly arise inside the pipeline rather than at the output: AgentSCOPE finds intermediate-stage privacy violations are pervasive (PVR ≈ 82–94%) even when output leakage rates look moderate (≈ 24–40%).
  • In multilingual multi-agent settings, prompt/prefix alignment can backfire: stronger alignment increases internal dissociation in 15/16 languages, and some language/model combinations reverse the safety effect (a Japanese backfire observed on Llama 3.3 70B).
  • Optimization and training recipes are targeting known RLHF/PO failure modes: theory explains why RLHF stays shallow (gradients vanish past the harm horizon), while BandPO proposes probability-aware clipping to prevent tail-token suppression and entropy collapse.
  • Security threats are expanding from prompts to the ML supply chain and infrastructure: distilled-dataset hijacking (OD), pretraining-membership detection via gradients (GDS), smart-contract exploit agents (EVMbench), and GPU-memory prompt-leakage mitigation (GELO).

2) Key themes (clusters)

Theme: Interactive, process-aware evaluation for agents

Theme: Judge models (benchmarking, stress testing, bias certification)

Theme: Privacy and security in agent and deployment pipelines

Theme: Alignment objectives under pressure (depth, multilinguality, self-preservation)

Theme: Better post-training signals and optimizers (preference learning, reward modeling, RL stability)

3) Technical synthesis

  • Several papers converge on one meta-point: final-answer accuracy is an insufficient statistic; the new suites measure interaction strategy (querying, stopping, coverage), workflow edges (privacy flows), and robustness axes (UI versions, format perturbations).
  • Budgets are becoming a common currency: Interactive Benchmarks uses turn/token budgets; MPW penalizes compound queries and rewards atomic coverage; BandPO recasts PPO clipping as a trust-region budget allocated by action probability (see the sketch after this list).
  • Attribution is moving earlier in the pipeline: MPW's Fact Coverage Rate and Hit Rate, AgentSCOPE's Violation Origin Rate, and EvoTool's blame assignment all aim to localize why a run failed rather than treating the episode as a monolith.
  • Judge reliability is being treated like model reliability: IF-RewardBench (listwise preference graphs), JRH (perturbation suites), and A-BB (sensitivity + noise) form a stack: measure → stress-test → certify.
  • Alignment depth and where the gradients actually go are becoming explicit: the RLHF gradient analysis explains why late-token behavior can remain misaligned, and BandPO addresses the parallel phenomenon in RL updates, where tail tokens get clipped away.
  • Controlled counterfactual environments are a recurring design pattern: MPW's parallel worlds and TimeWarp's multi-version sites both create reproducible distribution shifts that are hard to obtain from the live web.
  • Security evaluation is becoming programmatic and end-to-end: EVMbench scores exploits via on-chain state changes, AegisUI scores protocol-payload anomalies, and GELO measures recoverability under ICA-style attacks.
  • Training recipes increasingly mix synthetic generation + filtering + RL: WebFactory uses an LLM executor + deterministic-replay filtering + RL; KARL uses agentic synthesis + off-policy RL; Med-V1 uses a large synthetic verification corpus + SFT+GRPO.
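
To make the "budgets as common currency" point concrete, here is a minimal sketch of a budgeted interactive evaluation loop in the spirit of Interactive Benchmarks. The `BudgetedEval` class, the `agent_step`/`env_step` interfaces, and the token accounting are illustrative assumptions, not any paper's actual API.

```python
from typing import Callable

class BudgetedEval:
    """Score an agent on an interactive task under hard turn/token budgets."""

    def __init__(self, max_turns: int = 10, max_tokens: int = 2000):
        self.max_turns = max_turns
        self.max_tokens = max_tokens

    def run(self,
            agent_step: Callable[[str], str],
            env_step: Callable[[str], tuple[str, bool]]) -> dict:
        observation, tokens_used, turns, solved = "START", 0, 0, False
        while turns < self.max_turns and tokens_used < self.max_tokens:
            action = agent_step(observation)        # agent picks a query/action
            tokens_used += len(action.split())      # crude token accounting (assumption)
            observation, solved = env_step(action)  # environment responds
            turns += 1
            if solved:
                break
        # Process-aware score: reward solving AND spending little of the budget,
        # so two agents with the same final answer can get different scores.
        efficiency = 1.0 - turns / self.max_turns
        return {"solved": solved, "turns": turns, "tokens": tokens_used,
                "score": float(solved) * (0.5 + 0.5 * efficiency)}
```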

4) Top 5 papers (with "why now")

1) AgentSCOPE: Evaluating Contextual Privacy Across Agentic Workflows

  • Introduces Privacy Flow Graphs that evaluate privacy at every boundary (user→agent, agent→tool, tool→agent, agent→recipient); a logging sketch follows this list.
  • Shows that output-only checks severely underestimate risk: PVR ≈ 82–94% vs. LR ≈ 24–40%, with TSR ≈ 63–79%.
  • Provides actionable attribution via a Violation Origin Rate and per-stage breakdowns (the instruction and tool-response stages dominate).
  • Caveat: the benchmark is 62 scenarios built around a single persona; broader coverage is needed.
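
A minimal sketch of what pipeline-boundary privacy logging could look like. The edge labels follow the boundaries named above, but the `FlowEvent` structure, the externally supplied `violation` flag, and the metric names are assumptions for illustration, not AgentSCOPE's actual implementation.

```python
from dataclasses import dataclass

BOUNDARIES = ("user->agent", "agent->tool", "tool->agent", "agent->recipient")

@dataclass
class FlowEvent:
    boundary: str     # one of BOUNDARIES
    payload: str      # the text crossing the boundary
    violation: bool   # set by some policy check (assumed external)

def privacy_metrics(events: list[FlowEvent]) -> dict:
    """Compare pipeline-level vs output-only violation signals over logged flows."""
    assert all(e.boundary in BOUNDARIES for e in events)
    any_violation = any(e.violation for e in events)
    output_leak = any(e.violation for e in events
                      if e.boundary == "agent->recipient")
    # Attribute the first violating boundary (a stand-in for an origin stat).
    origin = next((e.boundary for e in events if e.violation), None)
    return {"pipeline_violation": any_violation,
            "output_leak": output_leak,
            "violation_origin": origin}
```

Aggregated over runs, the fraction with `pipeline_violation` plays the role of a PVR-like rate and the fraction with `output_leak` an LR-like rate; the gap between the two is exactly what the paper reports.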

2) IF-RewardBench: Benchmarking Judge Models for Instruction-Following Evaluation

  • A large, human-verified meta-benchmark for judges: 842 instructions and 6,011 responses, with preference graphs built via Pareto dominance.
  • Evaluates both constraint verification and listwise ranking (Kendall τb); reports the strongest proprietary judge at 0.609 vs. 0.755 for humans (a τb computation is sketched after this list).
  • Finds judges struggle most with negative-class detection, subjective constraints (Situation/Style), and complex compositions (Chain/Selection).
  • Caveat: residual subjectivity remains, and the cross-lingual analysis is explicitly incomplete.
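
For the listwise part, the ranking agreement is plain Kendall τb, which scipy implements directly. The Pareto-dominance edge builder below is a schematic reading of "preference graphs via Pareto dominance"; the per-response constraint-score vectors are assumed inputs, not the benchmark's real data format.

```python
from itertools import combinations
from scipy.stats import kendalltau  # computes tau-b by default

def pareto_edges(scores: dict[str, tuple[int, ...]]) -> list[tuple[str, str]]:
    """Add edge a->b when response a Pareto-dominates b on per-constraint scores."""
    edges = []
    for a, b in combinations(scores, 2):
        sa, sb = scores[a], scores[b]
        if all(x >= y for x, y in zip(sa, sb)) and any(x > y for x, y in zip(sa, sb)):
            edges.append((a, b))
        elif all(y >= x for x, y in zip(sa, sb)) and any(y > x for x, y in zip(sa, sb)):
            edges.append((b, a))
    return edges  # incomparable pairs get no edge, unlike forced pairwise choices

# Agreement between a judge's ranking and the human reference ranking.
human_rank = [1, 2, 3, 4, 5]
judge_rank = [2, 1, 3, 5, 4]
tau_b, p_value = kendalltau(human_rank, judge_rank)
print(f"Kendall tau-b = {tau_b:.3f}")  # 0.600 for this toy example
```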

3) Alignment Backfire: Language-Dependent Reversal of Safety Interventions Across 16 Languages in LLM Multi-Agent Systems

  • A large preregistered multi-agent study (N = 1,584 runs in total) that varies the alignment ratio.
  • Reports that alignment increases the Dissociation Index almost universally (15/16 languages), with language-dependent CPI divergence; a Japanese backfire on Llama 3.3 70B is observed in Study 1.
  • Shows that a plausible-looking fix (an individuation prompt) can be iatrogenic (reported DI +1.120).
  • Caveat: alignment prefixes remain in English even for non-English runs, and the DI depends on the monologue channel and uses a keyword-based metric (a toy version is sketched after this list).
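
Since the paper's exact metric is not reproduced here, the following is only a toy illustration of what a keyword-based dissociation score over an agent's monologue channel might look like. The keyword lists, the normalization, and the name `dissociation_index` are all invented for this sketch.

```python
# Toy illustration only: invented keyword lists and normalization.
SELF_REFERENCE = {"i", "me", "my", "myself"}
DETACHED = {"it", "they", "them", "itself", "one"}

def dissociation_index(monologue: str) -> float:
    """Ratio of detached vs. self-referential phrasing in monologue text."""
    words = monologue.lower().split()
    self_hits = sum(w in SELF_REFERENCE for w in words)
    detached_hits = sum(w in DETACHED for w in words)
    total = self_hits + detached_hits
    return detached_hits / total if total else 0.0
```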

4) EVMbench: Evaluating AI Agents on Smart Contract Security

  • A programmatic, reproducible evaluation covering Detect (117), Patch (44), and Exploit (23), with local-chain replay and an anti-cheat RPC proxy.
  • Reports meaningful capability: GPT-5.3-Codex reaches up to 41.7% on Patch and 71.0% on Exploit; hints push Patch/Exploit higher (discovery is the bottleneck).
  • Useful for both defensive readiness and misuse forecasting, since exploit success is scored by on-chain state/balance deltas (a scoring sketch follows this list).
  • Caveat: Detect scoring relies on historical audit reports and cannot credit novel yet valid findings; the Patch/Exploit task counts are small.
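
A minimal sketch of state-delta exploit scoring on a local fork, using web3.py's standard `get_balance` call. The RPC endpoint, the attacker address, the `run_exploit` hook, and the success criterion are placeholders; this illustrates the scoring idea, not EVMbench's harness.

```python
from web3 import Web3

def score_exploit(rpc_url: str, attacker: str, run_exploit) -> dict:
    """Score an exploit attempt by the attacker's on-chain balance delta."""
    w3 = Web3(Web3.HTTPProvider(rpc_url))   # local fork, e.g. an anvil/hardhat node
    before = w3.eth.get_balance(attacker)
    run_exploit(w3)                          # placeholder: the agent's exploit code
    after = w3.eth.get_balance(attacker)
    delta = after - before
    # Success criterion (assumption): the attacker extracted value from the target.
    return {"balance_delta_wei": delta, "success": delta > 0}
```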

5) BandPO: Bridging Trust Regions and Ratio Clipping via Probability-Aware Bounds for LLM Reinforcement Learning

  • Formally explains why fixed PPO/GRPO clipping suppresses tail-token improvements and causes entropy collapse.
  • Gives a principled mapping from f-divergence trust regions to per-action ratio intervals, with closed forms for TV/χ² and a numerical solver for KL (an illustrative bound is sketched after this list).
  • Empirically improves reasoning metrics (mean@32 gains of ≥ ~2 points over GRPO across model scales) and reports higher converged entropy (~0.2 vs. ~0.02).
  • Caveat: the numerical bounds add compute, and the evaluation focuses mainly on math-reasoning benchmarks.
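
One crude reading of a probability-aware bound, to show the general shape: if a total-variation-style budget allows a single action's probability mass to move by at most ε, the importance ratio for an action with old probability p is bounded by roughly 1 ± ε/p, so tail tokens get wider clip ranges than a fixed PPO interval. This heuristic is my own illustration, not the paper's derived bound.

```python
import numpy as np

def tv_clip_bounds(p_old: np.ndarray, eps: float = 0.05):
    """Illustrative probability-aware ratio clipping under a TV-style budget.

    An action with old probability p may move at most eps of mass (crude
    assumption), so the ratio r = pi_new / pi_old lies in [1 - eps/p, 1 + eps/p].
    Tail tokens get much wider intervals than a fixed range like [0.8, 1.2].
    """
    lo = np.maximum(0.0, 1.0 - eps / p_old)
    hi = 1.0 + eps / p_old
    return lo, hi

p = np.array([0.5, 0.1, 0.01])           # head, mid, tail token probabilities
lo, hi = tv_clip_bounds(p)
print(np.round(lo, 2), np.round(hi, 2))  # [0.9 0.5 0.] [1.1 1.5 6.]
```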

5) Practical next steps

  • If you run tool-using agent systems, add pipeline-level privacy monitoring: log and score the user→agent, agent→tool, tool→agent, and agent→output flows (AgentSCOPE-style) instead of only the final response.
  • Before trusting LLM-as-judge, stress-test your specific judge configuration (model + rubric + prompt) for format invariance and sampling stability (JRH-style); treat judge reliability as a gating metric (a minimal harness is sketched after this list).
  • For instruction-following optimization, evaluate judges listwise (preference graphs / Kendall τb) and measure violation detection (negative-class F1), not just pairwise win rates (IF-RewardBench).
  • For multilingual deployments, validate alignment interventions per language and per model family; do not assume English-calibrated prompt alignment transfers (Alignment Backfire).
  • For RLHF/GRPO pipelines, monitor the tail-token clipping rate and entropy collapse; consider probability-aware clipping when exploration dies too early (BandPO).
  • For search/web agents, separate synthesis from evidence acquisition: measure coverage/hit rates (MPW) and robustness across UI versions (TimeWarp) to localize whether failures stem from query formulation, stopping policy, or synthesis.
  • For security posture, assume supply-chain risk: treat third-party distilled datasets as untrusted inputs (the OD threat model), add provenance/verification checks, and benchmark both defensive and offensive capability in the smart-contract domain (EVMbench).
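
A minimal sketch of the JRH-style stress test recommended above: re-run the same judge over format perturbations and repeated samples, then gate on score stability. The `judge` callable and the perturbation set are placeholders to swap for your own configuration.

```python
import statistics
from typing import Callable

# Placeholder perturbations: swap in the formats your pipeline actually sees.
PERTURBATIONS = [
    lambda r: r,                      # identity
    lambda r: r.upper(),              # casing change
    lambda r: "Answer:\n" + r,        # added preamble
    lambda r: r.replace("\n", " "),   # flattened whitespace
]

def stress_test_judge(judge: Callable[[str], float], response: str,
                      repeats: int = 3) -> dict:
    """Score one response under format perturbations and repeated judge calls."""
    scores = [judge(p(response)) for p in PERTURBATIONS for _ in range(repeats)]
    return {"mean": statistics.mean(scores),
            "stdev": statistics.stdev(scores),
            "range": max(scores) - min(scores)}  # gate on this before trusting the judge
```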

Generated from per-paper analysis; no external browsing.