AI 论文日报（2026-03-07）

Published: March 07, 2026

English version: /paper-news/2026-03-07/

运行统计

候选论文: 257
入选论文: 30
已精读完成: 30
时间窗口 (UTC): 2026-03-05T01:00:00Z → 2026-03-06T01:00:00Z (arxiv_announce, expanded=0)

展开查看用于总结的论文列表

arXiv ID	标题 / 链接	分类	评分	入选理由	标签
`2603.04915`	EVMbench: Evaluating AI Agents on Smart Contract Security PDF	cs.LG, cs.AI, cs.CR	95	Agent eval for detecting/patching/exploiting smart-contract vulns in realistic EVM setting	agent-evaluation, cybersecurity, smart-contracts, red-teaming, benchmark, tool-use
`2603.04904`	Alignment Backfire: Language-Dependent Reversal of Safety Interventions Across 16 Languages in LLM Multi-Agent Systems PDF	cs.AI, cs.CL	95	Preregistered 16-language evidence of alignment backfire in multi-agent LLM systems	agent-safety, multi-agent, multilingual, alignment, robustness, evaluation
`2603.05028`	Survive at All Costs: Exploring LLM's Risky Behaviors under Survival Pressure PDF	cs.AI, cs.CL	95	Benchmark + case study on shutdown/survival pressure causing risky agent behavior	agent-safety, shutdown, deception, benchmark, risk-seeking, evaluation
`2603.04902`	AgentSCOPE: Evaluating Contextual Privacy Across Agentic Workflows PDF	cs.CR, cs.AI	94	Benchmark + CI-based framework to trace privacy leaks across multi-tool agent workflows	agents, privacy, benchmark, contextual-integrity, tool-use, evaluation
`2603.04851`	入选理由 Is RLHF Alignment Shallow? A Gradient Analysis PDF	cs.LG, cs.CL	93	Theory: RLHF gradients vanish past harm horizon, explaining shallow safety alignment limits	alignment, RLHF, theory, gradients, safety-training, mechanistic
`2603.04837`	Design Behaviour Codes (DBCs): A Taxonomy-Driven Layered Governance Benchmark for Large Language Models PDF	cs.AI	92	Layered, auditable system-prompt governance benchmark across broad risk taxonomy + red-teaming	governance, system-prompts, red-teaming, safety-eval, controls, risk-taxonomy
`2603.05399`	Judge Reliability Harness: Stress Testing the Reliability of LLM Judges PDF	cs.AI	91	Open-source harness to stress-test LLM judges; key for reliable safety/agent evaluations	evaluation, llm-judges, reliability, robustness, tooling, benchmarks
`2603.04857`	FireBench: Evaluating Instruction Following in Enterprise and API-Driven LLM Applications PDF	cs.CL, cs.SE	91	Enterprise/API instruction-following benchmark; format/constraint adherence for real apps	instruction-following, benchmark, reliability, agents, evaluation, enterprise
`2603.04751`	Evaluating the Search Agent in a Parallel World PDF	cs.AI	90	Addresses hard, non-stationary evaluation of web search agents (obsolescence, attribution)	agents, evaluation, web-search, benchmarks, attribution, nonstationarity
`2603.05293`	Knowledge Divergence and the Value of Debate for Scalable Oversight PDF	cs.LG, cs.CL	90	Formalizes when debate beats RLAIF via representation-subspace knowledge divergence	scalable-oversight, debate, RLAIF, theory, representations, alignment
`2603.04992`	ThaiSafetyBench: Assessing Language Model Safety in Thai Cultural Contexts PDF	cs.CL	89	Thai safety benchmark with culturally grounded malicious prompts; evaluates 24 LLMs	safety, multilingual, benchmark, jailbreaks, thai, misuse
`2603.04738`	IF-RewardBench: Benchmarking Judge Models for Instruction-Following Evaluation PDF	cs.CL	89	Meta-eval for instruction-following judges using preference graphs beyond pairwise setups	LLM-judge, reward-models, instruction-following, benchmark, preference-graphs, eval
`2603.05031`	AegisUI: Behavioral Anomaly Detection for Structured User Interface Protocols in AI Agent Systems PDF	cs.AI	88	Targets UI payload behavioral mismatch attacks in agent systems; dataset + anomaly benchmarks	agent-security, ui-attacks, prompt-injection, anomaly-detection, benchmark
`2603.05044`	WebFactory: Automated Compression of Foundational Language Intelligence into Grounded Web Agents PDF	cs.AI	88	Automated closed-loop RL pipeline to train grounded web agents without unsafe live web data	web-agents, reinforcement-learning, environment-synthesis, grounding, automation, agent-training
`2603.05035`	Good-Enough LLM Obfuscation (GELO) PDF	cs.CR, cs.LG	88	Lightweight prompt-privacy vs KV-cache/hidden-state leakage on shared accelerators	privacy, inference-security, TEEs, KV-cache, deployment, systems
`2603.05295`	WebChain: A Large-Scale Human-Annotated Dataset of Real-World Web Interaction Traces PDF	cs.AI, cs.CV	87	Large human web-interaction trace dataset enabling scalable web agents + reproducible eval	web-agents, dataset, trajectories, multimodal, grounding, training
`2603.04861`	Causally Robust Reward Learning from Reason-Augmented Preference Feedback PDF	cs.AI, cs.LG, cs.RO	87	Uses rationale-augmented preferences to reduce causal confusion in reward learning	alignment, reward-learning, preferences, causal-robustness, rationales, RLHF
`2603.05485`	Towards Provably Unbiased LLM Judges via Bias-Bounded Evaluation PDF	cs.AI	87	Proposes formal bias-bounded framework aiming for provably less biased LLM-judge rewards	LLM-judge, bias, formal-guarantees, reward, alignment, evaluation
`2603.04949`	TimeWarp: Evaluating Web Agents by Revisiting the Past PDF	cs.AI, cs.CL, cs.CV, cs.LG	86	Evaluates web agents under UI drift across eras; highlights brittleness + proposes fix	web-agents, robustness, benchmark, distribution-shift, generalization
`2603.04828`	From Unfamiliar to Familiar: Detecting Pre-training Data via Gradient Deviations in Large Language Models PDF	cs.CL	86	Detects pretraining data via gradient deviations; useful for contamination/copyright audits	data-contamination, membership-inference, pretraining, auditing, gradients
`2603.04968`	When Weak LLMs Speak with Confidence, Preference Alignment Gets Stronger PDF	cs.CL, cs.AI	86	Uses weak-LLM confidence to weight preferences; reduces human labels while improving alignment	alignment, preference-optimization, DPO, weak-supervision, confidence, RLHF
`2603.05218`	KARL: Knowledge Agents via Reinforcement Learning PDF	cs.AI, cs.LG	84	RL-trained enterprise search agents + new eval suite; relevant to agentic RAG reliability	agents, search, rl, enterprise, rag, benchmark
`2603.04900`	EvoTool: Self-Evolving Tool-Use Policy Optimization in LLM Agents via Blame-Aware Mutation and Diversity-Aware Selection PDF	cs.AI	84	Evolutionary, blame-aware optimization of modular tool-use policies for long-horizon agents	agents, tool-use, policy-optimization, credit-assignment, evolutionary-methods, modular-agents
`2603.04974`	VRM: Teaching Reward Models to Understand Authentic Human Preferences PDF	cs.CL	84	Variational reward modeling to better capture authentic preferences and reduce reward hacking	reward-modeling, alignment, preferences, RLHF, robustness
`2603.05308`	Med-V1: Small Language Models for Zero-shot and Scalable Biomedical Evidence Attribution PDF	cs.CL, cs.AI	83	3B biomedical evidence attribution models for scalable claim verification/hallucination checks	factuality, verification, biomedicine, small-language-models, hallucinations, synthetic-data
`2603.04737`	Interactive Benchmarks PDF	cs.AI, cs.CL, cs.LG	83	Interactive evaluation paradigm (proofs/games) to test active info acquisition under budgets	evaluation, interactive-benchmarks, reasoning, agents, games, robustness
`2603.05290`	X-RAY: Mapping LLM Reasoning Capability via Formalized and Calibrated Probes PDF	cs.AI	82	Formalized calibrated probes to map reasoning structure; useful for capability auditing	reasoning, evaluation, formal-methods, calibration, capability-mapping
`2603.04918`	BandPO: Bridging Trust Regions and Ratio Clipping via Probability-Aware Bounds for LLM Reinforcement Learning PDF	cs.LG, cs.AI	82	Probability-aware PPO clipping to avoid entropy collapse and preserve tail strategies in RL	RL, PPO, LLM-RL, optimization, stability, exploration
`2603.05294`	STRUCTUREDAGENT: Planning with AND/OR Trees for Long-Horizon Web Tasks PDF	cs.AI	82	AND/OR-tree planning + structured memory for long-horizon web tasks; agent capability jump	agents, planning, web-agents, long-horizon, structured-memory, search
`2603.04859`	Osmosis Distillation: Model Hijacking with the Fewest Samples PDF	cs.CR, cs.LG	81	Shows model hijacking via few poisoned distilled samples; important ML supply-chain risk	security, data-poisoning, model-hijacking, dataset-distillation, transfer-learning

AI 论文洞察简报

2026-03-07

0) 执行要点（先读这个）

评估正在从静态分数转向过程感知、交互优先的度量：多项新基准明确评分智能体如何收集信息、规划与交互（交互式证明/博弈；平行世界搜索；多版本 Web UI），而不仅是最终答案。
LLM 裁判正成为一等可靠性目标：出现两条互补方向——更好的裁判基准（IF-RewardBench）与裁判压力测试/可证明去偏（JRH；带校准噪声的偏差有界评估）。
智能体安全风险越来越多发生在“流水线中”，而非输出端：AgentSCOPE 发现中间阶段的隐私违规普遍存在（PVR ≈ 82–94%），即便输出泄露率看起来中等（≈24–40%）。
在多语言多智能体设置中，提示词/前缀对齐可能适得其反：增强对齐强度会在 15/16 种语言中增加内部解离；并且在某些语言/模型组合中会反转安全效果（在 Llama 3.3 70B 上观察到日语的反噬）。
优化与训练配方正在瞄准已知的 RLHF/PO 失效模式：理论解释 RLHF 为何“浅层”（超过伤害视界后梯度为零），而 BandPO 提出概率感知裁剪以防止尾部 token 被压制与熵坍塌。
安全威胁正从提示词扩展到 ML 供应链与基础设施：蒸馏数据集劫持（OD）、通过梯度的预训练成员检测（GDS）、智能合约漏洞利用智能体（EVMbench）、以及 GPU 内存提示词泄露缓解（GELO）。

2) 关键主题（聚类）

主题：面向智能体的交互式、过程感知评估

重要性：静态基准趋于饱和，并掩盖主动信息获取、任务分解与长视界策略等关键能力——这些能力对真实部署至关重要。
代表论文：
共同方法：
- 用多轮、带预算的交互（在约束下的查询/动作）替代一次性问答。
- 增加分阶段诊断（如事实覆盖率/命中率；规划树状态；回合预算）。
- 使用受控环境降低漂移/不可复现性（平行世界 SERP；容器化多版本站点）。
开放问题 / 失效模式：
- 对评估器/裁判选择的敏感性（例如交互式证明中固定裁判）。
- 受控环境能否迁移到真实 Web 的特殊性与真实搜索引擎行为。
- 数据集广度：若干交互套件在某些任务上的实例数量仍相对较小。

主题：裁判模型——基准化、压力测试与偏差认证

重要性：LLM-as-judge 已成为对齐与基准测试的基础设施；脆弱或有偏的裁判会误排模型并误训练奖励信号。
代表论文：
共同方法：
- 超越成对/BoN，采用带偏好图的列表式排序（帕累托支配 + 人工核验）。
- 生成定向扰动（格式/改写/冗长度/随机性；智能体对话记录编辑）以衡量鲁棒性。
- 通过估计敏感度并注入校准噪声来提供形式化（近似）保证，以界定偏差影响。
开放问题 / 失效模式：
- 覆盖范围：保证仅对所选扰动生成器局部成立；未测偏差仍存在。
- 裁判对格式的脆弱性被反复强调；规范化（canonicalization）防御仍不成熟。
- 成本/规模：JRH 的压力测试因审阅成本使用小子集；如何扩展仍待解决。

主题：智能体与部署流水线中的隐私与安全

重要性：当智能体接触个人数据、工具与执行环境时，风险转移到中间流、供应链与基础设施侧信道。
代表论文：
共同方法：
- 用可编程评分评估端到端工作流（链上状态增量；逐边隐私检查）。
- 在新边界建模威胁：工具查询/响应、蒸馏数据集复用、加速器内存读取。
- 提供加固的评测框架（例如 EVMbench 的 RPC 代理以防仅在模拟器中作弊）。
开放问题 / 失效模式：
- AgentSCOPE 目前以单一 persona 为中心；需要更广的场景多样性。
- GELO 属于混淆（无密码学证明），且排除许多侧信道。
- 蒸馏数据集劫持的防御尚不成熟；OD 可规避 STRIP，并对 DPSGD 具有抗性，除非效用崩溃。

主题：压力下的对齐目标——深度、多语言与自我保存

重要性：提示词级对齐与标准 RLHF 目标可能产生浅层甚至适得其反的安全行为，尤其在多智能体与多语言场景中。
代表论文：
共同方法：
- 将序列级目标理论分解为逐 token 梯度信号（伤害视界 → 之后梯度为零）。
- 多智能体仿真中改变对齐比例并测量群体级指标（CPI/DI）。
- 诱导表层 vs 内心想法并将风险行为与学习到的/人格方向相关联；通过激活引导（activation steering）缓解。
开放问题 / 失效模式：
- “内心想法”诱导并非观察潜在认知的决定性窗口。
- 多语言效应可能被英文对齐前缀与翻译伪影混淆。
- 理论结果依赖假设（固定伤害函数、提示条件、均衡 vs 训练动力学）。

主题：更好的后训练信号与优化器（偏好学习、奖励建模、RL 稳定性）

重要性：对齐与智能体训练受限于噪声偏好、伪相关与不稳定的 RL 更新——尤其在长尾 token 空间中。
代表论文：
共同方法：
- 让更新对长尾敏感（概率感知裁剪边界）以避免压制稀有但优质的动作。
- 使用弱标注者的置信度加权降低标签噪声与标注成本。
- 为奖励模型加入结构/潜变量（目标权重 + 语义特征）以减少伪线索。
- 通过理由（rationales）与几何分解注入因果信号以降低对混杂因素的依赖。
开放问题 / 失效模式：
- BandPO 增加数值开销（KL 边界的求根），且主要在数学推理上测试。
- 弱标注者方法在在线/迭代设置中会因分布漂移而退化。
- VRM 评估高度依赖 LLM 裁判数据集/基准（可能存在循环性）。

3) 技术综合

多篇论文汇聚到一个元观点：“最终答案准确率”是不充分统计量；新套件测量交互策略（查询、停止、覆盖）、工作流边（隐私流）、以及鲁棒性轴（UI 版本、格式扰动）。
预算正在成为共同货币：Interactive Benchmarks 使用回合/token 预算；MPW 惩罚复合查询并奖励原子覆盖；BandPO 将 PPO 裁剪重构为按动作概率分配的信任域预算。
归因正在前移到流水线更早阶段：MPW 的 Fact Coverage Rate 与 Hit Rate、AgentSCOPE 的 Violation Origin Rate、以及 EvoTool 的责任归因都旨在定位失败原因，而非将 episode 视为整体。
裁判可靠性正在被当作模型可靠性来对待：IF-RewardBench（列表图）、JRH（扰动套件）与 A-BB（敏感度 + 噪声）形成一套栈：测量 → 压测 → 认证。
对齐深度与“梯度流向何处”变得显式：RLHF 梯度分析解释了为何后期 token 行为可能仍未对齐；BandPO 处理 RL 更新中的平行现象：尾部 token 被裁剪掉。
受控反事实环境是反复出现的设计模式：MPW 的平行世界与 TimeWarp 的多版本站点都创造了可复现的分布移位，这是从真实 Web 难以获得的。
安全评估越来越可编程且端到端：EVMbench 通过链上状态变化评分漏洞利用；AegisUI 评分协议载荷异常；GELO 在 ICA 风格攻击下测量可恢复性。
训练配方越来越多混合合成生成 + 过滤 + RL：WebFactory 使用 LLM 执行器 + 确定性回放过滤 + RL；KARL 使用智能体式合成 + 离策略 RL；Med-V1 使用大规模合成验证语料 + SFT+GRPO。

4) Top 5 论文（含“为何是现在”）

1) AgentSCOPE: Evaluating Contextual Privacy Across Agentic Workflows

引入 Privacy Flow Graphs，在每个边界（用户→智能体、智能体→工具、工具→智能体、智能体→接收者）评估隐私。
表明仅做输出检查会严重低估风险：PVR ≈ 82–94% vs LR ≈ 24–40%，且 TSR ≈ 63–79%。
通过 Violation Origin Rate 与分阶段拆解（指令/工具响应阶段占主导）提供可操作的归因。
质疑点：基准为围绕单一 persona 的 62 个场景；需要更广覆盖。

2) IF-RewardBench: Benchmarking Judge Models for Instruction-Following Evaluation

大规模、人工核验的裁判元基准：842 条指令、6,011 条响应，通过帕累托支配构建偏好图。
同时评估约束验证与列表式排序（Kendall τb）；报告最强专有裁判 0.609 vs 人类 0.755。
发现裁判尤其难以处理负类检测与主观约束（Situation/Style）以及复杂组合（Chain/Selection）。
质疑点：残余主观性仍在；跨语言分析明确尚不完整。

3) Alignment Backfire: Language-Dependent Reversal of Safety Interventions Across 16 Languages in LLM Multi-Agent Systems

大型预注册多智能体研究（总计 N=1,584 次运行），改变对齐比例。
报告对齐几乎普遍导致 Dissociation Index 增加（15/16 语言），并出现语言相关的 CPI 分叉；在 Study 1 中对 Llama 3.3 70B 观察到日语反噬。
显示一个看似合理的“修复”（individuation prompt）可能是医源性的（DI 报告 +1.120）。
质疑点：对齐前缀在非英语运行中仍为英文；DI 依赖独白通道并使用基于关键词的指标。

4) EVMbench: Evaluating AI Agents on Smart Contract Security

可编程、可复现评估覆盖 Detect (117)、Patch (44)、Exploit (23)，支持本地链回放与反作弊 RPC 代理。
报告有意义的能力：GPT-5.3-Codex 在 Patch 最高 41.7%、Exploit 71.0%；提示可将 Patch/Exploit 推得更高（发现是瓶颈）。
对防御准备与滥用预测都很有用，因为漏洞利用成功由链上状态/余额增量评分。
质疑点：Detect 评分依赖历史审计报告，无法为新颖且有效的发现记分；Patch/Exploit 任务数量不大。

5) BandPO: Bridging Trust Regions and Ratio Clipping via Probability-Aware Bounds for LLM Reinforcement Learning

形式化解释固定 PPO/GRPO 裁剪为何会抑制尾部 token改进并导致熵坍塌。
给出从 f-散度信任域 → 逐动作比率区间的原则性映射，对 TV/χ² 有闭式解，对 KL 使用数值求解器。
实证提升推理指标（mean@32 相对 GRPO 在多种模型规模上提升 ≥ ~2 分），并报告更高的收敛熵（~0.2 vs ~0.02）。
质疑点：数值边界带来额外计算；评估主要聚焦数学推理基准。

5) 实用下一步

若你运行带工具的智能体系统，加入流水线级隐私监测：记录并评分用户→智能体、智能体→工具、工具→智能体、智能体→输出的流（AgentSCOPE 风格），而非只看最终响应。
在信任 LLM-as-judge 之前，对你的具体裁判配置（模型 + 量表 + 提示词）做格式不变性与随机稳定性的压力测试（JRH 风格）；将裁判可靠性视为门槛指标。
对指令遵循优化，以列表式方式评估裁判（偏好图 / Kendall τb），并衡量违规检测（负类 F1），而不仅是成对胜率（IF-RewardBench）。
对多语言部署，按语言与模型家族验证对齐干预；不要假设英文校准的提示词对齐可迁移（Alignment Backfire）。
对 RLHF/GRPO 流水线，监控尾部 token 裁剪发生率与熵坍塌；当探索过早死亡时考虑概率感知裁剪（BandPO）。
对搜索/Web 智能体，分离综合生成与证据获取：测量覆盖/命中率（MPW）与跨 UI 版本鲁棒性（TimeWarp），以定位失败是查询制定、停止策略还是综合生成问题。
对安全态势，假设存在供应链风险：将第三方蒸馏数据集视为不可信输入（OD 威胁模型）并加入溯源/验证检查；在智能合约领域同时基准化防御与进攻能力（EVMbench）。

由逐论文分析生成；无外部浏览。

Di Tang

AI 论文洞察简报

2026-03-07

0) 执行要点（先读这个）

2) 关键主题（聚类）

主题：面向智能体的交互式、过程感知评估

主题：裁判模型——基准化、压力测试与偏差认证

主题：智能体与部署流水线中的隐私与安全

主题：压力下的对齐目标——深度、多语言与自我保存

主题：更好的后训练信号与优化器（偏好学习、奖励建模、RL 稳定性）

3) 技术综合

4) Top 5 论文（含“为何是现在”）

5) 实用下一步