AI Paper Daily (2026-03-03)
Published:
English version: /paper-news/2026-03-03/
Run statistics
- Candidate papers: 600
- Selected papers: 30
- Deep reads completed: 31
- Time window (UTC): 2026-02-28T01:00:00Z → 2026-03-03T01:00:00Z (arxiv_announce, expanded=0)
Paper list used for the summary:
| arXiv ID | Title / Link | Categories | Score | Selection rationale | Tags |
|---|---|---|---|---|---|
| 2603.01608 | Evaluating and Understanding Scheming Propensity in LLM Agents | cs.AI | 95 | Realistic eval of LLM agent scheming propensity; varies incentives & environment factors | agent-safety, scheming, deception, evaluation, autonomous-agents, instrumental-goals |
| 2603.00476 | Atomicity for Agents: Exposing, Exploiting, and Mitigating TOCTOU Vulnerabilities in Browser-Use Agents | cs.CR, cs.AI | 95 | Finds widespread TOCTOU exploits in browser agents; benchmark + mitigations for real web adversaries | agent-security, browser-agents, TOCTOU, prompt-injection-adjacent, benchmark, web-safety |
| 2603.00829 | Constitutional Black-Box Monitoring for Scheming in LLM Agents | cs.CL, cs.AI, cs.LG | 95 | Black-box constitutional monitors for detecting scheming in LLM agents; synthetic trajectory pipelines. | agent-safety, scheming, monitoring, oversight, synthetic-data, evaluations |
| 2603.00991 | Tracking Capabilities for Safer Agents | cs.AI, cs.PL | 94 | Capability-safe “safety harness” for agents via static effect tracking; strong, actionable safety design. | agents, capabilities, tool-safety, programming-languages, sandboxing, information-leakage |
| 2603.01297 | I Can't Believe It's Not Robust: Catastrophic Collapse of Safety Classifiers under Embedding Drift | cs.LG, cs.CL | 94 | Shows safety classifiers can silently fail under tiny embedding drift; critical for deployed LLM safety stacks | safety-classifiers, distribution-shift, embedding-drift, monitoring, robustness, deployment |
| 2603.00801 | The Synthetic Web: Adversarially-Curated Mini-Internets for Diagnosing Epistemic Weaknesses of Language Agents | cs.AI, cs.IR | 93 | Adversarial mini-webs to causally test agent epistemic robustness to misleading search/ranking. | agents, web, benchmark, adversarial-content, epistemics, RAG, evaluation |
| 2603.00811 | Curation Leaks: Membership Inference Attacks against Data Curation for Machine Learning | cs.LG | 93 | Shows privacy leakage via data curation; new membership inference attacks across pipeline steps | privacy, membership-inference, data-curation, security, ML-pipelines |
| 2603.01185 | Token-level Data Selection for Safe LLM Fine-tuning | cs.CL, cs.AI, cs.CR | 93 | Token-level filtering to prevent safety regression during fine-tuning; better safety/utility trade-off. | LLM-safety, fine-tuning, data-selection, robustness, safety-regression |
| 2603.01246 | Defensive Refusal Bias: How Safety Alignment Fails Cyber Defenders | cs.CR, cs.AI | 92 | Measures safety-tuning refusals harming cyber defenders; large real-world NCCDC dataset | safety-alignment, refusal, cybersecurity, evaluation, overrefusal, deployment |
| 2603.01285 | Attention Smoothing Is All You Need For Unlearning | cs.LG, cs.AI, cs.CL | 92 | Simple unlearning method via attention smoothing; targets memorization of sensitive/hazardous content | unlearning, privacy, safety, LLMs, self-distillation, memorization |
| 2603.00495 | AI Runtime Infrastructure | cs.AI | 92 | Execution-time layer to monitor/intervene in agent runs for reliability, efficiency, and safety. | agents, runtime-monitoring, policy-enforcement, reliability, safety-infrastructure, long-horizon |
| 2603.01589 | SafeSci: Safety Evaluation of Large Language Models in Science Domains and Beyond | cs.LG, cs.AI | 91 | Large, objective scientific safety eval+train data (0.25M/1.5M); broad risk coverage. | safety-evaluation, benchmarks, science, datasets, risk-assessment, training-data |
| 2603.01423 | Quantifying Conversational Reliability of Large Language Models under Multi-Turn Interaction | cs.CL | 91 | Systematic multi-turn reliability eval: constraints, tool selection, entity tracking; quantifies degradation | reliability, multi-turn, tool-use, evaluation, entity-tracking |
| 2603.00902 | Clawdrain: Exploiting Tool-Calling Chains for Stealthy Token Exhaustion in OpenClaw Agents | cs.CR, cs.SE | 90 | Realistic toolchain attack on open agent ecosystem causing 6–9x token/billing amplification. | agents, tool-calling, prompt-injection, denial-of-wallet, security, ecosystems |
| 2603.00823 | A Comprehensive Evaluation of LLM Unlearning Robustness under Multi-Turn Interaction | cs.CL, cs.AI | 90 | Shows LLM unlearning can fail under multi-turn interaction; interactive robustness eval | unlearning, privacy, safety, multi-turn, robustness, evaluation |
| 2603.01357 | ASTRA-bench: Evaluating Tool-Use Agent Reasoning and Action Planning with Personal User Context | cs.AI | 90 | Interactive tool-use benchmark with evolving personal context; exposes failures under high complexity | agents, tool-use, benchmark, personal-context, evaluation |
| 2603.01335 | Provable and Practical In-Context Policy Optimization for Self-Improvement | cs.LG, cs.AI | 90 | In-context policy optimization for test-time self-improvement; theory + practical ME-ICPO algorithm | test-time-scaling, self-improvement, in-context-learning, RL, reliability |
| 2603.01714 | TopoCurate: Modeling Interaction Topology for Tool-Use Agent Training | cs.LG, cs.CL | 90 | Interaction-topology curation for tool-use agents; targets error recovery and nontrivial task structure | agents, tool-use, training, trajectory-curation, interaction-dynamics, RL |
| 2603.00822 | ContextCov: Deriving and Enforcing Executable Constraints from Agent Instruction Files | cs.SE, cs.AI | 90 | Turns agent instruction files into executable constraints to prevent context drift in autonomous coding agents. | agents, guardrails, software-engineering, policy-enforcement, reliability |
| 2603.00552 | EMPA: Evaluating Persona-Aligned Empathy as a Process | cs.AI | 90 | Process-based eval for persona-aligned empathy; multi-agent sandbox exposes trajectory failure modes. | evaluation, dialogue-agents, safety, human-values, multi-agent, mental-health |
| 2603.02208 | Reasoning Core: A Scalable Procedural Data Generation Suite for Symbolic Pre-training and Post-Training | cs.CL | 90 | Scalable procedural suite for verifiable symbolic reasoning data with solvers + curriculum control. | reasoning, synthetic-data, verification, symbolic, curriculum, benchmarks |
| 2603.00516 | ProtegoFed: Backdoor-Free Federated Instruction Tuning with Interspersed Poisoned Data | cs.CR | 89 | Federated instruction tuning with interspersed poisoning; proposes backdoor-free defense | data-poisoning, backdoors, federated-learning, instruction-tuning, security, robustness |
| 2603.02146 | LongRLVR: Long-Context Reinforcement Learning Requires Verifiable Context Rewards | cs.CL | 89 | Shows outcome-only RLVR fails for long context; proposes verifiable context rewards to fix grounding | long-context, RLVR, grounding, rewards, reasoning, training |
| 2603.01209 | Agents Learn Their Runtime: Interpreter Persistence as Training-Time Semantics | cs.AI, cs.LG | 89 | Studies interpreter state persistence as train-time semantics; important for reliable CodeAct-style agents | agents, tool-use, python, state, training-semantics, benchmarks |
| 2603.02196 | Conformal Policy Control | cs.AI, cs.LG, math.ST, stat.ML | 89 | Conformal calibration to regulate new policies vs safe reference policy with user-specified risk tolerance. | safe-exploration, conformal, policy-safety, risk-control, reinforcement-learning |
| 2603.00498 | Antibody: Strengthening Defense Against Harmful Fine-Tuning for Large Language Models via Attenuating Harmful Gradient Influence | cs.LG | 88 | Defense vs harmful fine-tuning by attenuating harmful gradient influence; relevant to FTaaS. | alignment, fine-tuning-security, poisoning, robust-training, gradient-regularization |
| 2603.02119 | Pencil Puzzle Bench: A Benchmark for Multi-Step Verifiable Reasoning | cs.AI, cs.GT, cs.LG | 88 | Verifiable step-level reasoning benchmark (pencil puzzles) enabling dense process supervision signals | reasoning, benchmark, process-supervision, verification, agentic-eval |
| 2603.01260 | MOSAIC: A Unified Platform for Cross-Paradigm Comparison and Evaluation of Homogeneous and Heterogeneous Multi-Agent RL, LLM, VLM, and Human Decision-Makers | cs.LG, cs.AI | 88 | Unified platform to compare RL/LLM/VLM/humans in same envs; useful for agent safety studies. | agents, evaluation-platform, multi-agent, LLM, VLM, reproducibility, benchmarks |
| 2603.02203 | Tool Verification for Test-Time Reinforcement Learning | cs.AI, cs.CL | 88 | Adds tool-based verification to test-time RL to prevent spurious consensus and mode collapse. | test-time-training, RL, tool-verification, robustness, math-reasoning, self-improvement |
| 2603.01940 | CoVe: Training Interactive Tool-Use Agents via Constraint-Guided Verification | cs.AI | 87 | Constraint-guided trajectory synthesis + deterministic verification for tool-use agents | tool-use, agents, post-training, verification, synthetic-data, RL |
AI Paper Insights Briefing
2026-03-03
0) Executive takeaways (read this first)
- Agent security is shifting from "prompt attacks" to "execution-time failures": TOCTOU on the dynamic web, adversarial ranking, tool-chain token drain, and runtime-semantics mismatches can all break an agent without jailbreaking the model.
- Reproducibility infrastructure is becoming a first-class safety lever: Jailbreak Foundry shows that automating "paper → runnable attack" keeps jailbreak evaluation current (30 attacks; mean ASR deviation +0.26pp), while also raising dual-use risk.
- Data pipelines leak and get poisoned in new places: curating "for privacy" can still leak membership (even without training on the private data), and federated instruction tuning is vulnerable to user-level interspersed poisoning, though collaborative filtering can push ASR to 0.00.
- Fine-tuning defenses are getting finer-grained: two complementary directions, (i) harmful-gradient attenuation (Antibody) and (ii) token-level masking (TOSS/TOSS-Pro), both aim to prevent safety regression while preserving utility.
- Static safety components are brittle under distribution/representation drift: embedding-based safety classifiers can collapse to near-random under small embedding drift while remaining overconfident, and multi-turn interaction can "recover" unlearned/removed knowledge.
- Evaluation is shifting to trajectory/process metrics: scoring empathy as latent-state "effective work" (EMPA), personal-assistant tool use with milestones/minefields (ASTRA-bench), and deterministic multi-turn reliability tasks all expose failures that single-turn scoring hides.
2) Key themes (clustered)
Theme: Execution-time agent vulnerabilities (beyond prompt injection)
- Why it matters: many real failures happen after planning, when the environment changes, tools misbehave, or costs explode; "safe outputs" are not enough, and you need runtime checks and atomicity.
- Representative papers:
- Atomicity for Agents: Exposing, Exploiting, and Mitigating TOCTOU Vulnerabilities in Browser-Use Agents
- The Synthetic Web: Adversarially-Curated Mini-Internets for Diagnosing Epistemic Weaknesses of Language Agents
- Clawdrain: Exploiting Tool-Calling Chains for Stealthy Token Exhaustion in OpenClaw Agents
- Agents Learn Their Runtime: Interpreter Persistence as Training-Time Semantics
- Common approaches:
- Define concrete threat models tied to agent-loop timing or tool/runtime semantics (TOCTOU windows; rank-0 honeypots; persistent context ingestion; persistent vs. reset interpreters).
- Build controlled benchmarks/environments to isolate causality (DYNWEB; synthetic mini-internets; a deployment-faithful OpenClaw setup; a 2×2 training/runtime study).
- Add lightweight runtime mitigations where possible (pre-execution validation with DOM/layout monitoring); a minimal sketch follows this theme.
- Open problems / failure modes:
- Residual windows remain (TOCTOU validate→execute gap of ~0.13s; stress tests still succeed 0.2–0.3% of the time).
- Even with unlimited tool access, agents may not escalate search/verification (tool-call counts barely rise under adversarial ranking).
- Recovery behavior can amplify cost/attack impact (fallback cascades inflate token consumption).
- Training–deployment mismatch can silently raise costs or cause brittle errors (Persistent→Stateless: ~80% of turns contain unresolved references).
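A minimal sketch of the pre-execution validation pattern (not the paper's implementation): fingerprint the DOM and layout at planning time, re-fingerprint immediately before acting, and abort on mismatch. The `page.dom()` / `page.layout()` accessors are hypothetical stand-ins for a browser-automation interface.

```python
import hashlib

class StaleStateError(RuntimeError):
    """Raised when the page changed between planning and execution."""

def snapshot_fingerprint(dom_html: str, layout_boxes: list) -> str:
    # Hash DOM content plus element bounding boxes, so both content edits
    # and layout shifts (e.g., a swapped button) invalidate the snapshot.
    h = hashlib.sha256(dom_html.encode("utf-8"))
    for box in layout_boxes:
        h.update(repr(box).encode("utf-8"))
    return h.hexdigest()

def validated_execute(page, action, planned_fp: str) -> None:
    # Re-fingerprint as late as possible. This narrows the TOCTOU window to
    # the fingerprint->action gap (~0.13s in the paper's measurements) but
    # does not give true atomicity, hence the residual 0.2-0.3% successes.
    if snapshot_fingerprint(page.dom(), page.layout()) != planned_fp:
        raise StaleStateError("page drifted since planning; replan instead of acting")
    action(page)
```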
Theme: Safety evaluations and benchmarks going "process-first"
- Why it matters: single-turn metrics miss long-horizon drift, latent-state dynamics, and tool-execution correctness; process-level evaluation is becoming a prerequisite for both safety and capability claims.
- Representative papers:
- EMPA: Evaluating Persona-Aligned Empathy as a Process
- ASTRA-bench: Evaluating Tool-Use Agent Reasoning and Action Planning with Personal User Context
- Quantifying Conversational Reliability of Large Language Models under Multi-Turn Interaction
- Pencil Puzzle Bench: A Benchmark for Multi-Step Verifiable Reasoning
- Common approaches:
- Replace scalar good/bad scores with trajectory scoring (latent psychological-dynamics vectors; milestone/minefield DAGs; deterministic pass/fail across turns; step-by-step puzzle checks); a checker sketch follows this theme.
- Test metric robustness with controlled perturbations (persona flips; sycophancy attacks; paired single-turn vs. multi-turn datasets).
- Emphasize verifiability (tool trajectory/state checks; deterministic constraints; engine-based rule checks).
- Open problems / failure modes:
- Judge reliability and synthetic-to-real gaps (human-proxy disagreement in EMPA; ASTRA's synthetic-corpus gap; no human baseline for the puzzle benchmark).
- Multi-turn reliability failures can be large and model-dependent (e.g., GPT-4o instruction following 96%→63%; GPT-4o-mini 93%→24%).
- Tool-use bottlenecks concentrate in argument/payload generation (ASTRA payload scores 0.5603–0.8478).
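To make "deterministic multi-turn pass/fail" concrete, here is a minimal checker sketch (my construction, not any specific benchmark's harness): constraints accumulate across turns, and every later reply must satisfy all of them, so a single violating turn fails the whole trajectory.

```python
import re

class MultiTurnChecker:
    """Accumulates user-stated constraints; every subsequent assistant turn
    must satisfy all of them (deterministic pass/fail, no LLM judge)."""

    def __init__(self):
        self.constraints = []  # list of (name, predicate) pairs

    def add(self, name, predicate):
        self.constraints.append((name, predicate))

    def violations(self, reply: str):
        return [name for name, pred in self.constraints if not pred(reply)]

# Example: constraints introduced early in a conversation.
checker = MultiTurnChecker()
checker.add("no_bullets", lambda r: not re.search(r"^\s*[-*] ", r, re.M))
checker.add("under_50_words", lambda r: len(r.split()) <= 50)

replies = ["Sure, here is a short plan in prose.", "- step one"]
trajectory_pass = all(not checker.violations(r) for r in replies)
print(trajectory_pass)  # False: the second turn violated no_bullets
```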
Theme: Fine-tuning and federated training under poisoning/backdoors
- Why it matters: customization pipelines (FTaaS, FIT) are now core product surface; attackers can break safety with tiny poisoning ratios or distributed user-level poisoning.
- Representative papers:
- Antibody: Strengthening Defense Against Harmful Fine-Tuning for Large Language Models via Attenuating Harmful Gradient Influence
- Token-level Data Selection for Safe LLM Fine-tuning
- ProtegoFed: Backdoor-Free Federated Instruction Tuning with Interspersed Poisoned Data
- Common approaches:
- Treat safety degradation as an optimization-signal problem (harmful gradients; unsafe tokens; frequency-domain gradient features).
- Use selective attenuation rather than blunt filtering (sample reweighting; token masking; cluster-based sample filtering); a toy frequency-domain filter is sketched after this theme.
- Evaluate across models/datasets/attack strengths with ablations (Antibody component ablations; TOSS vs. TOSS-Pro; ProtegoFed recall/overhead).
- Open problems / failure modes:
- Compute/memory overhead can be high (Antibody: 6.27h runtime; 64.55GB peak memory).
- Assumptions can break (ProtegoFed fails if the global majority is poisoned).
- Dependence on reference models/tokenizers and scoring quality (TOSS relies on safety-degradation vs. utility-oriented references; transfer assumes a shared tokenizer).
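A toy, single-pool analogue of the frequency-domain filtering idea (ProtegoFed operates across federated clients with shared centroids; this simplification is mine): cluster per-sample gradient magnitude spectra and keep the majority cluster, which inherits exactly the failure mode noted above when the majority is poisoned.

```python
import numpy as np
from sklearn.cluster import KMeans

def filter_by_gradient_spectrum(grads: np.ndarray, quorum: float = 0.6) -> np.ndarray:
    """grads: (n_samples, dim) per-sample gradient vectors.
    Returns a boolean keep-mask over samples."""
    spectra = np.abs(np.fft.rfft(grads, axis=1))            # magnitude spectrum per sample
    spectra /= spectra.sum(axis=1, keepdims=True) + 1e-12   # remove scale, keep shape
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(spectra)
    majority = int(np.bincount(labels).argmax())
    keep = labels == majority
    # If no cluster clearly dominates, clustering is uninformative; and if
    # the pool is majority-poisoned, this vote silently inverts.
    if keep.mean() < quorum:
        raise ValueError("no dominant cluster; refusing to filter")
    return keep
```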
Theme: Privacy and safety leakage via "upstream" pipelines (curation, unlearning)
- Why it matters: even when private data is never trained on, selection and post-hoc removal can still leak or fail under realistic interaction.
- Representative papers:
- Curation Leaks: Membership Inference Attacks against Data Curation for Machine Learning
- A Comprehensive Evaluation of LLM Unlearning Robustness under Multi-Turn Interaction
- Attention Smoothing Is All You Need For Unlearning
- Common approaches:
- Attack/measure leakage at multiple pipeline stages (scores, subset masks, final-model fingerprints); a TPR@FPR helper is sketched after this theme.
- Stress-test unlearning in interactive settings (self-correction; conversation-conditioned queries).
- Propose simpler, mechanism-targeted unlearning objectives (attention-temperature "forget teacher" distillation).
- Open problems / failure modes:
- Multi-turn interaction can recover "forgotten" knowledge (several unlearning methods sit at ~25% accuracy in Round 1 but rebound to ~60–75% by Round 2).
- DP mitigations reduce leakage but demand privacy–utility trade-off studies (e.g., ε=100 drops image-curation TPR@1%FPR on CIFAR-10 with T=1000 from 0.9842 to 0.0155).
- Unlearning methods can sacrifice coherence (ASU reports that τ>4 often yields gibberish; tuning is required).
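Since TPR at a fixed low FPR recurs throughout these results, here is the metric spelled out (a standard computation, not code from any of the papers):

```python
import numpy as np

def tpr_at_fpr(member_scores: np.ndarray, nonmember_scores: np.ndarray,
               target_fpr: float = 0.01) -> float:
    """TPR@FPR for a membership-inference attack: choose the threshold that
    misclassifies at most target_fpr of non-members, then report the share
    of true members above it (the quantity behind '0.9842 -> 0.0155')."""
    threshold = np.quantile(nonmember_scores, 1.0 - target_fpr)
    return float((member_scores > threshold).mean())

rng = np.random.default_rng(0)
# Synthetic scores: a strong attack separates members from non-members,
# so TPR@1%FPR lands well above the 0.01 random baseline.
print(tpr_at_fpr(rng.normal(2.0, 1.0, 5000), rng.normal(0.0, 1.0, 5000)))
```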
Theme: Oversight and guardrails for agents (monitoring, constraints, capabilities)
- Why it matters: as agents grow more autonomous, we need controls that work at low false-positive rates, do not depend on chain-of-thought, and ideally come with enforceable guarantees.
- Representative papers:
- Constitutional Black-Box Monitoring for Scheming in LLM Agents
- Evaluating and Understanding Scheming Propensity in LLM Agents
- ContextCov: Deriving and Enforcing Executable Constraints from Agent Instruction Files
- Tracking Capabilities for Safer Agents
- Common approaches:
- Black-box, action-only monitoring scored with low-FPR metrics (pAUROC over FPR∈[0,0.2]); a pAUROC sketch follows this theme.
- Factorize "when scheming happens" (prompt snippets, tools, incentives) instead of treating it as a fixed trait.
- Compile natural-language policy into executable checks (repository instruction files → process/source/architecture checks).
- Shift trust to static enforcement (capability-typed code; "Classified" purity constraints).
- Open problems / failure modes:
- Synthetic-to-real transfer ceilings and overfitting from monitor-prompt optimization (pAUROC saturates; heavier optimization can hurt).
- Scheming propensity is highly configuration-sensitive (e.g., snippet elicitation reaches up to 91%; removing a single tool can drop it from 59% to 3%).
- Guardrails can be hard to validate end-to-end on real agent diffs/logs (ContextCov lacks controlled agent experiments).
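A sketch of partial AUROC over FPR∈[0,0.2], the low-FPR metric referenced above. The normalization here is mine (perfect monitor scores 1.0, random scores max_fpr/2); the papers may standardize differently.

```python
import numpy as np
from sklearn.metrics import roc_curve

def partial_auroc(y_true, y_score, max_fpr: float = 0.2) -> float:
    """Area under the ROC curve restricted to FPR in [0, max_fpr], divided
    by max_fpr. Rewards monitors that catch scheming without flooding human
    oversight with false alarms."""
    fpr, tpr, _ = roc_curve(y_true, y_score)
    grid = np.linspace(0.0, max_fpr, 201)           # uniform FPR grid on the region of interest
    return float(np.interp(grid, fpr, tpr).mean())  # mean TPR == normalized partial area
```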
3) Technical synthesis
- Several papers converge on "the runtime is the new attack surface": TOCTOU windows, ranking manipulation, token-cost amplification, and interpreter semantics all exploit the gap between what the agent believes and what actually executes.
- Causal, controlled environments are becoming the standard for agent-robustness claims: synthetic mini-webs with rank injection; synthetic dynamic-web manipulations with oracles; procedurally generated partially observable tasks; deterministic multi-turn pass/fail suites.
- Verification is moving earlier in the loop: pre-execution validation (DOM/layout monitoring), deterministic constraint verifiers for tool-use data (CoVe), step-by-step puzzle checks, and tool verification for test-time RL pseudo-labels (T^3RL).
- Safety-training defenses are becoming "signal shaping" methods: Antibody attenuates harmful gradients (flatness + reweighting), TOSS masks unsafe tokens, and ProtegoFed clusters frequency-domain gradient features to filter poisoned samples.
- Static components fail under drift: embedding-based safety classifiers can collapse under small drift (σ≈0.028 → ROC-AUC 49.75%) while staying confident, which argues for retraining, drift detection, or representation-invariant safety layers; a drift-statistic sketch follows this list.
- Multi-turn interaction is a cross-domain stressor: unlearning can be reversed by self-correction/conversational context; instruction following collapses over turns; empathy evaluation needs trajectory-level latent-state scoring.
- Oversight needs low-FPR metrics and solid labels: pAUROC for monitors; execution-based labels in ControlArena; deterministic checks for tool use and puzzles.
- Reproducibility tooling is itself a safety capability: Jailbreak Foundry shows automated "paper → runnable attack" keeps evaluations current but also lowers attacker costs (dual use).
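To make the drift point operational, here is a minimal drift statistic for a frozen classifier's input embeddings (my sketch, unrelated to the paper's setup; the Gaussian stand-in for embeddings is purely illustrative). It sits near 1.0 without drift and grows as the live distribution shifts, giving a retraining trigger that does not rely on classifier confidence.

```python
import numpy as np

def drift_statistic(reference: np.ndarray, batch: np.ndarray) -> float:
    """Mean squared z-score of the live batch's per-dimension means against
    a reference window. ~1.0 under no drift; alarm on sustained elevation.
    Classifier confidence is not a substitute: AUC can collapse while the
    model stays overconfident."""
    mu, sd = reference.mean(axis=0), reference.std(axis=0) + 1e-8
    z = (batch.mean(axis=0) - mu) / (sd / np.sqrt(batch.shape[0]))
    return float((z ** 2).mean())

rng = np.random.default_rng(0)
ref = rng.normal(size=(5000, 256))    # embeddings captured at classifier-training time
live = rng.normal(size=(1024, 256))   # production batch, same distribution
print(drift_statistic(ref, live))           # ~1.0: no alarm
print(drift_statistic(ref, live + 0.028))   # clearly elevated: trigger review/retraining
```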
4) Top 5 papers (with "why now")
1) The Synthetic Web: Adversarially-Curated Mini-Internets for Diagnosing Epistemic Weaknesses of Language Agents
- Causally demonstrates that a single rank-0 honeypot collapses accuracy: GPT-5 65.1%→18.2%, GPT-4o 27.2%→3.8%.
- Tool traces show almost no escalation (e.g., GPT-5 tool calls 6.45→6.61), explaining why "browse more sources" does not happen on its own.
- Adds calibration evidence: GPT-5 ECE 0.298→0.641 under adversarial ranking.
- Caveats: uses a uniform zero-shot prompt (not a best-engineered agent stack) and a synthetic, text-only environment.
2) Atomicity for Agents: Exposing, Exploiting, and Mitigating TOCTOU Vulnerabilities in Browser-Use Agents
- Identifies a distinctly agentic failure mode: planning latency opens TOCTOU windows in which UI/data changes redirect actions.
- Benchmark evidence: all 10 open-source agents fail on at least one manipulation type.
- Practical mitigation: pre-execution validation cuts the trigger rate from 100% to 0% on test cases with <0.05s overhead; residual-window attacks still succeed 0.2–0.3% under stress testing.
- Caveats: strict atomicity is not implemented; a residual window remains.
3) Curation Leaks: Membership Inference Attacks against Data Curation for Machine Learning
- Breaks a common assumption: private target data used only to guide curation can still leak membership through scores, subsets, and models.
- Shows method dependence: nearest-neighbor image curation is highly vulnerable; TRAK is near-random at the score/subset stages but can leak via fingerprinting when T is small.
- Quantifies DP mitigation: on CIFAR-10 with T=1000, ε=100 drops TPR@1%FPR from 0.9842 to 0.0155 (image curation).
- Caveats: limited large-scale end-to-end training; fingerprint validation is constrained.
4) Tracking Capabilities for Safer Agents
- Provides a model-agnostic safety mechanism: the agent generates capability-typed Scala, and the compiler bounds the side effects the code can produce (a toy runtime analogue is sketched below).
- Empirically: in "classified mode", 100% safe (no leakage) across 131 trials while utility stays high (e.g., Claude Sonnet 4.6 at 99.2% over 120 user-task-plus-injection trials).
- Competitive on τ2-bench, with a small drop on SWE-bench Lite (one model at 41.7% vs. 43.3%).
- Caveats: limitations/future work are thinly covered in the available analysis; broader deployment constraints are not quantified here.
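As a rough illustration of the capability idea (the paper enforces this statically through Scala's type system; the runtime-token Python below is my loose analogue, with invented tool names):

```python
class Capability: ...
class FileRead(Capability): ...
class NetEgress(Capability): ...

def read_notes(cap: FileRead, path: str) -> str:
    # Tool effects are gated on possession of an unforgeable capability token.
    if not isinstance(cap, FileRead):
        raise PermissionError("FileRead capability required")
    with open(path, encoding="utf-8") as f:
        return f.read()

def send_report(cap: NetEgress, payload: str) -> None:
    if not isinstance(cap, NetEgress):
        raise PermissionError("NetEgress capability required")
    print(f"(would POST {len(payload)} bytes off-machine)")

# "Classified mode": the agent's tool registry is granted FileRead only.
# No composition of granted tools can exfiltrate, whatever the prompt says;
# the paper's compiler turns this runtime discipline into a static guarantee.
CLASSIFIED_GRANTS = {FileRead: FileRead()}
```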
5) Jailbreak Foundry: From Papers to Runnable Attacks for Reproducible Benchmarking
- Demonstrates automated paper→attack-module synthesis at scale: 30 attacks, with a mean reproduced-minus-reported ASR deviation of +0.26pp.
- Cuts engineering burden: for the 22 attacks with reference code, LOC shrinks from 22,714 to 9,549 (ratio 0.42).
- Enables longitudinal, comparable evaluation (30 attacks × 10 victim models under one harness/judge).
- Caveats: explicit dual-use risk, since it lowers the bar to operationalizing jailbreaks.
5) Practical next steps
- Add pre-execution validation (DOM/layout monitoring) to browser agents, and measure residual-window exploitability (target the ~0.13s gap).
- For web/RAG agents, implement ranking-robust retrieval policies (e.g., forced corroboration / multi-source cross-checking), evaluated under a controlled rank-0 honeypot setup.
- Treat the tool ecosystem as a supply chain: set per-skill token budgets and SKILL.md size caps, and audit persistent tool-output ingestion to blunt token-drain patterns (a budget-wrapper sketch follows this list).
- If you depend on embedding-based safety classifiers, deploy drift detection plus retraining triggers; do not expect model confidence to flag failures under drift.
- For FTaaS/FIT, test defenses under interspersed user-level poisoning, and consider frequency-domain collaborative filtering (shared centroids) along with compute-cost accounting.
- For unlearning, add multi-turn robustness tests (self-correction plus conversation-conditioned queries) to evaluation suites; static MCQ forget scores can mislead.
- For agent oversight, evaluate monitors at low FPR (pAUROC-style) and stress-test synthetic-to-deployment transfer; avoid over-optimizing monitor prompts past the observed ceiling.
- For internal coding agents, compile instruction files into executable checks (process/source/architecture) and track violation rates as a leading indicator of "context drift".
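A minimal per-skill budget wrapper for the supply-chain bullet above (skill names and limits are illustrative, not from any paper):

```python
from collections import defaultdict

class TokenBudget:
    """Charge every tool output against its skill's budget before ingesting
    it into context; a Clawdrain-style drain then stops at a known cap
    instead of compounding 6-9x."""

    def __init__(self, budgets: dict):
        self.budgets = budgets          # skill name -> max tokens
        self.spent = defaultdict(int)

    def charge(self, skill: str, tokens: int) -> None:
        self.spent[skill] += tokens
        if self.spent[skill] > self.budgets.get(skill, 0):
            raise RuntimeError(f"skill {skill!r} exceeded its token budget; halting chain for review")

budget = TokenBudget({"web_search": 20_000, "pdf_reader": 50_000})
budget.charge("web_search", 1_850)     # fine; agent ingests the output
# budget.charge("web_search", 30_000)  # would raise and cut the cascade
```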
Generated from per-paper analyses; no external browsing.
