AI Paper Daily (2026-02-27)

Published:

English version: /paper-news/2026-02-27/

Run statistics

  • Candidate papers: 262
  • Selected papers: 30
  • Close reads completed: 30
  • Time window (UTC): 2026-02-26T01:00:00Z → 2026-02-27T01:00:00Z (arxiv_announce, expanded=0)
Papers used for the summary (arXiv ID · title · categories · score · selection rationale · tags):

  • 2602.22755 · AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors · cs.CL · 96 · Benchmark of hidden misalignment behaviors + agentic auditing; strong for eval & oversight research · tags: alignment auditing, benchmark, hidden behaviors, model evaluation, agent tools, deception
  • 2602.23329 · LLM Novice Uplift on Dual-Use, In Silico Biology Tasks · cs.AI, cs.CL, cs.CR, cs.CY, cs.HC · 96 · Careful human uplift study on bio dual-use tasks; quantifies novice capability jump with LLM access. · tags: dual-use, biosecurity, human-uplift, evaluation, risk-assessment, LLMs
  • 2602.22724 · AgentSentry: Mitigating Indirect Prompt Injection in LLM Agents via Temporal Causal Diagnostics and Context Purification · cs.CR, cs.AI · 94 · Directly targets indirect prompt injection in agents with trajectory-aware detection/mitigation · tags: agent security, prompt injection, tool outputs, inference-time defense, causal diagnostics, context sanitization
  • 2602.22525 · Systems-Level Attack Surface of Edge Agent Deployments on IoT · cs.CR · 94 · Empirical security analysis of edge LLM agents; defines measurable system security metrics + failures. · tags: agent-security, edge-agents, IoT, attack-surface, systems-security, provenance, MQTT
  • 2602.22557 · CourtGuard: A Model-Agnostic Framework for Zero-Shot Policy Adaptation in LLM Safety · cs.AI, cs.LG · 92 · Zero-shot safety policy adaptation via RAG + adversarial debate grounded in policy docs · tags: LLM safety, policy compliance, RAG, multi-agent debate, governance, zero-shot
  • 2602.22787 · Probing for Knowledge Attribution in Large Language Models · cs.CL, cs.AI · 92 · Probe predicts whether outputs rely on prompt vs internal knowledge; useful for hallucination mitigation. · tags: hallucinations, attribution, interpretability, faithfulness, factuality, probing
  • 2602.22953 · General Agent Evaluation · cs.AI · 92 · Proposes unified protocol + framework for general-agent evaluation; addresses benchmark integration bias. · tags: agent-evaluation, benchmarks, protocols, framework, general-agents, measurement
  • 2602.22603 · SideQuest: Model-Driven KV Cache Management for Long-Horizon Agentic Reasoning · cs.AI, cs.LG · 92 · LRM-driven KV cache compression for long-horizon agents; targets real bottleneck in agentic reasoning. · tags: agents, long-context, KV-cache, memory, efficiency, reasoning
  • 2602.22554 · Multilingual Safety Alignment Via Sparse Weight Editing · cs.LG · 90 · Training-free multilingual safety alignment via sparse weight editing; addresses cross-lingual guardrail gaps · tags: multilingual safety, weight editing, safety neurons, alignment, low-resource languages, robustness
  • 2602.22576 · Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training · cs.CL, cs.IR, cs.LG · 89 · Reward shaping for agentic RAG RL; extracts signal from failed trajectories to improve sample efficiency. · tags: agentic-RAG, reinforcement-learning, reward-shaping, retrieval, training, reasoning
  • 2602.23271 · Evaluating Stochasticity in Deep Research Agents · cs.AI · 89 · Formalizes and measures stochasticity/variance in research agents; identifies sources via MDP framing. · tags: agents, evaluation, stochasticity, reliability, deep-research, variance
  • 2602.22556 · Stable Adaptive Thinking via Advantage Shaping and Length-Aware Gradient Regulation · cs.LG, cs.AI, cs.CL · 89 · Adaptive thinking RL to curb overthinking while preserving hard-query reasoning; practical reliability gain. · tags: reasoning, RL, efficiency, overthinking, post-training, LRM
  • 2602.22775 · TherapyProbe: Generating Design Knowledge for Relational Safety in Mental Health Chatbots Through Adversarial Simulation · cs.HC, cs.AI, cs.CL · 88 · Multi-agent adversarial simulation to surface long-horizon relational safety failures in therapy chatbots · tags: conversational safety, mental health, multi-turn evaluation, adversarial simulation, relational harms, red teaming
  • 2602.22675 · Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization · cs.CL · 87 · Agentic search framework emphasizing parallel evidence gathering to cut cost and improve generalization. · tags: agents, search, efficiency, long-horizon, generalization, deep-research
  • 2602.22897 · OmniGAIA: Towards Native Omni-Modal AI Agents · cs.AI, cs.CL, cs.CV, cs.LG, cs.MM · 87 · Omni-modal agent benchmark (audio+video+image+tools) with multi-hop queries; useful for capability eval. · tags: multimodal, agents, benchmark, tool-use, evaluation, audio, video
  • 2602.22769 · AMA-Bench: Evaluating Long-Horizon Memory for Agentic Applications · cs.AI, cs.LG · 86 · Agent memory benchmark focused on real agent-environment trajectories, not just dialogue · tags: agent evaluation, long-horizon memory, benchmark, trajectories, agentic systems
  • 2602.22719 · Interpreting and Steering State-Space Models via Activation Subspace Bottlenecks · cs.LG · 86 · Mechanistic interpretability + test-time steering for Mamba SSMs with sizable benchmark gains. · tags: interpretability, steering, state-space-models, Mamba, mechanistic, control
  • 2602.22968 · Certified Circuits: Stability Guarantees for Mechanistic Circuits · cs.AI, cs.CV, cs.CY · 85 · Provable stability guarantees for mechanistic circuit discovery; improves interpretability reliability · tags: mechanistic interpretability, circuits, certification, robustness, theory, OOD stability
  • 2602.23200 · InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models · cs.LG, cs.CL · 85 · Hardware-aware KV-cache quantization reducing latency/memory for long-context decoding without accuracy loss. · tags: LLM-efficiency, KV-cache, quantization, long-context, inference, systems
  • 2602.23193 · ESAA: Event Sourcing for Autonomous Agents in LLM-Based Software Engineering · cs.AI · 84 · Event-sourcing architecture for LLM agents: separates intent from state mutation for reliability/auditing. · tags: agents, software-engineering, state, orchestration, auditability, reliability
  • 2602.23136 · Modality Collapse as Mismatched Decoding: Information-Theoretic Limits of Multimodal LLMs · cs.CL, cs.AI, cs.LG · 84 · Information-theoretic account of modality collapse as mismatched decoding; actionable framing for MLLMs. · tags: multimodal-LLMs, information-theory, decoding, modality-collapse, analysis
  • 2602.22871 · Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching · cs.CL, cs.AI · 84 · Step-level PRM scoring and stitching for diffusion LMs; improves test-time scaling beyond voting. · tags: test-time-scaling, diffusion-LM, process-reward-model, reasoning, self-consistency
  • 2602.22584 · Towards Faithful Industrial RAG: A Reinforced Co-adaptation Framework for Advertising QA · cs.CL · 82 · Industrial RAG reliability: jointly optimizes retrieval+generation with evidence-constrained RL · tags: RAG, hallucination reduction, faithfulness, reinforcement learning, retrieval optimization, enterprise QA
  • 2602.22585 · Correcting Human Labels for Rater Effects in AI Evaluation: An Item Response Theory Approach · cs.AI, cs.LG · 82 · Uses IRT/Rasch to correct rater effects in human evals; improves validity of AI comparisons and RLHF data. · tags: evaluation, human-ratings, psychometrics, RLHF, measurement, bias
  • 2602.22642 · Compress the Easy, Explore the Hard: Difficulty-Aware Entropy Regularization for Efficient LLM Reasoning · cs.LG · 82 · Difficulty-aware entropy regularization to compress CoT while preserving exploration on hard problems. · tags: reasoning, CoT, efficiency, entropy-regularization, RL, inference-cost
  • 2602.23262 · Decomposing Private Image Generation via Coarse-to-Fine Wavelet Modeling · cs.CV, cs.CR · 81 · DP image generation via wavelet coarse-to-fine; targets privacy-sensitive frequencies to reduce quality loss. · tags: privacy, differential-privacy, image-generation, wavelets, memorization
  • 2602.22758 · Decomposing Physician Disagreement in HealthBench · cs.AI, stat.AP · 81 · Finds most HealthBench label variance is irreducible case-level residual; important for eval design. · tags: evaluation, medical-AI, rater-disagreement, uncertainty, benchmarks, reliability
  • 2602.23258 · AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning · cs.AI, cs.CL · 80 · Test-time rectify-or-reject pruning to prevent error cascades in multi-agent systems · tags: multi-agent systems, test-time control, error correction, RAG, robustness, information flow
  • 2602.23079 · Assessing Deanonymization Risks with Stylometry-Assisted LLM Agent · cs.CL, cs.CR, cs.LG · 80 · Stylometry+LLM agent to assess/mitigate deanonymization risk; relevant to privacy leakage from text. · tags: privacy, deanonymization, stylometry, LLM-agents, security, text
  • 2602.22546 · Requesting Expert Reasoning: Augmenting LLM Agents with Learned Collaborative Intervention · cs.AI · 79 · Learned policy to query human experts as a tool; large gains in Minecraft on hard tasks (human-in-loop). · tags: human-in-the-loop, agents, tool-use, collaboration, planning, Minecraft

AI Paper Insights Brief

2026-02-27

0) Executive takeaways (read this first)

  • Agent security is sinking lower in the stack: several papers show that deployment architecture (edge IoT fleets, tool-return boundaries, KV/memory management) often dominates risk and robustness outcomes, and frequently bypasses prompt- or model-level defenses.
  • Inference-time, training-free interventions are maturing for both safety and efficiency: causal counterfactual defenses against indirect prompt injection (reporting 0% ASR), policy-text-grounded debate adjudication for zero-shot policy switching, and sparse weight editing for multilingual safety transfer.
  • GRPO is becoming the default backbone for both capability and safety/faithfulness tuning (adaptive thinking, agentic RAG reward shaping, industrial RAG faithfulness, human-AI collaboration modules), with new work focused on stabilizing gradients and rewards under length and path heterogeneity.
  • Long-horizon agents are hitting systems bottlenecks (KV cache growth, memory retrieval failures, cross-run stochasticity). New benchmarks and mechanisms (AMA-Bench, stochasticity variance metrics, SideQuest) make these failure modes measurable and optimizable.
  • Evaluation methodology is being actively repaired: rater-effect modeling (MFRM/IRT) and decomposition of physician disagreement show that raw human labels can reorder system rankings, and that much of the disagreement is case-specific, implying that "better judges" may require better task design, not just better models.
  • Evidence on biosecurity uplift has reached the human-subject, multi-model, long-horizon stage: the study reports that novices with LLM access were 4.16× more accurate (odds ratio) than internet-only novices, and most participants said bypassing safeguards took almost no effort, which raises the priority of realistic uplift evaluations.

2) Key themes (clusters)

Theme: Inference-time safety layers for agents (policy, prompt injection, edge systems)

Theme: RL (often GRPO) for agentic RAG, faithfulness, and collaboration

Theme: Reasoning efficiency without sacrificing accuracy (adaptive thinking, entropy/length control)

Theme: Long-horizon agent infrastructure: memory, KV cache, stochasticity, and evaluation

Theme: Evaluation reliability, auditing, and hidden behaviors

Theme: Privacy and dual-use risk in the agent era

3) Technical synthesis

  • GRPO runs through the batch as a unifying optimization primitive: adaptive thinking (CPAS/LAGR), agentic RAG (Search-P1), industrial faithfulness RL (advertising QA), and human-AI collaborative tool use (AHCE HFM).
  • A recurring stabilization pattern: when trajectory lengths and structures vary widely, methods add explicit normalization or weighting (LAGR length weighting; CPAS advantage shifting; path-centric rewards; difficulty-aware entropy).
  • "Boundary-centric" agent security is converging: AgentSentry defends at the tool-return boundary; the IoT edge paper emphasizes MQTT as a command boundary; CourtGuard anchors adjudication in retrieved policy text rather than parametric "intuition".
  • Retrieval is treated as a policy-learning problem rather than a fixed module: Search-P1 shapes rewards around plan execution and reference-step coverage; the industrial GraphRAG work co-adapts retrieval and generation with RL.
  • Long-horizon reliability is being operationalized with new metrics: stochasticity as normalized total variance over answers, findings, and citations; memory via recall, causal, state-update, and abstraction categories; systems security via execution-to-audit latency and failover blackout windows.
  • Model-driven systems optimization is moving beyond "better prompts": SideQuest uses a model for KV cache garbage collection; InnerQ aligns quantization grouping with decode-time vector-matrix access patterns.
  • Evaluation is moving toward "measurement models": IRT/MFRM corrects rater severity and centrality; the HealthBench disagreement decomposition shows the residual dominates; AuditBench measures end-to-end investigator success rather than tool signals alone.
  • Safety transfer is increasingly parameter-level or inference-time: sparse weight editing for multilingual safety; CourtGuard policy switching; AgentSentry inference-time counterfactual purification. All reduce dependence on large new datasets.
  • Benchmarks are getting closer to real agents: AMA-Bench uses action-observation logs and symbolic artifacts; OmniGAIA requires omni-modal tool use; General Agent Evaluation focuses on consistent, protocol-preserving comparison across environments.
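The stochasticity metric above (normalized variance over answers across repeated runs) can be sketched in a few lines. This is one plausible instantiation for categorical answers, not the paper's exact formula; `answer_variance` and its normalization are this digest's illustration.

```python
from collections import Counter


def answer_variance(runs):
    """Normalized dispersion of final answers across repeated runs:
    0.0 when every run agrees, 1.0 when all answers differ.
    (An illustrative normalization; the paper's may differ.)"""
    counts = Counter(runs)
    n = len(runs)
    # probability that two independently sampled runs disagree...
    disagree = 1.0 - sum((c / n) ** 2 for c in counts.values())
    # ...rescaled so the all-distinct case maps to exactly 1.0
    max_disagree = 1.0 - 1.0 / n
    return disagree / max_disagree if max_disagree > 0 else 0.0
```

The same computation extends to findings or citations by first canonicalizing each run's output into a comparable key.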

4) Top 5 papers (with "why now")

1) AgentSentry: Mitigating Indirect Prompt Injection in LLM Agents via Temporal Causal Diagnostics and Context Purification

  • Introduces boundary-anchored counterfactual re-execution (orig/mask/mask_sanitized/orig_sanitized) to estimate causal takeover (ACE/IE/DE).
  • Reports ASR = 0% on AgentDojo across three IPI families and multiple black-box models, with FPR = 0% in the reported tables.
  • Why now: tool-augmented agents are shipping; this is a concrete inference-time layer aimed at safe continuation rather than termination.
  • Caveats: overhead scales with the number of re-executions per boundary; the evaluation notes the benchmark may underestimate long-horizon, delayed takeovers.
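The core counterfactual idea can be illustrated with a minimal sketch. Here `agent_step` and `toy_agent_step` are hypothetical stand-ins for illustration, and a single masked re-execution replaces AgentSentry's full four-variant estimation of ACE/IE/DE:

```python
def causal_takeover_flag(agent_step, trusted_context, tool_output,
                         mask_token="[MASKED]"):
    """Boundary-anchored counterfactual check (illustrative sketch):
    re-execute the step with the untrusted tool output masked; if the
    agent's chosen action changes, the tool output causally drove the
    behavior and the boundary is flagged for purification."""
    action_orig = agent_step(trusted_context, tool_output)
    action_masked = agent_step(trusted_context, mask_token)
    return action_orig != action_masked


def toy_agent_step(context, tool_output):
    # Toy agent: obeys an injected command when present, else answers the task.
    if "send credentials" in tool_output:
        return "send_credentials"
    return "answer_user"
```

A `True` flag would then trigger the sanitize-and-continue path rather than terminating the episode.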

2) AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors

  • Provides 56 target models with 14 hidden behaviors, including anti-confession training, enabling systematic evaluation of auditing techniques.
  • Finds that scaffolded black-box tools outperform many white-box tools overall; documents a tool-to-agent gap.
  • Why now: auditing is becoming a deployment gate; this work supplies reproducible targets and end-to-end agent evaluation.
  • Caveats: targets are fine-tunes of a single base model (Llama 3.3 70B) and may be easier to audit than naturally emergent behaviors.

3) LLM Novice Uplift on Dual-Use, In Silico Biology Tasks

  • Human-subject evidence: LLM access raised novice accuracy by 4.16× (odds ratio) relative to internet-only access; the treatment group improved on 7 of 8 benchmarks.
  • The treatment group sometimes exceeded expert baselines (e.g., HPCT, VCT), and participants frequently reported minimal safeguard friction (89.6%).
  • Why now: policy discussions need uplift data from realistic multi-model, long-horizon use, not just in-model benchmarks.
  • Caveats: limited to in-silico tasks; model availability changed mid-study; not double-blind.
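Because the 4.16× figure is an odds ratio rather than a ratio of raw accuracies, it is worth being explicit about the conversion; the numbers below are hypothetical, not the study's:

```python
def odds_ratio(p_treatment, p_control):
    """Odds ratio between two success rates. An odds ratio near 4 can
    correspond to a much smaller ratio of raw accuracies once the rates
    move away from zero."""
    return (p_treatment / (1 - p_treatment)) / (p_control / (1 - p_control))
```

For example, `odds_ratio(0.5, 0.2)` is 4.0 even though accuracy only rose 2.5×, so odds ratios and accuracy multipliers should not be read interchangeably.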

4) SideQuest: Model-Driven KV Cache Management for Long-Horizon Agentic Reasoning

  • Uses a parallel auxiliary thread to judge which tool outputs are stale and deletes their KV entries without polluting the main context.
  • Reports substantial efficiency gains (peak token utilization down 56-65%, KV reads down 53-71%) plus +83.9% throughput on FRAMES on an H100.
  • Why now: deep-research and web agents are KV-bound; this is a practical serving-side lever.
  • Caveats: eviction is limited to tool outputs (not "thinking" content); some OOD accuracy degradation (BrowseComp).
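The eviction constraint can be sketched as follows. This is a minimal illustration, assuming the context is held as ordered (kind, text) segments and letting `is_stale` stand in for the auxiliary model's judgment; the real system frees the corresponding KV entries rather than rebuilding a Python list:

```python
def evict_stale_tool_outputs(segments, is_stale):
    """Keep every non-tool segment; drop tool outputs judged stale.
    Restricting eviction to tool outputs mirrors the constraint that
    reasoning and user turns are never evicted."""
    kept = []
    for kind, text in segments:
        if kind == "tool_output" and is_stale(text):
            continue  # in a real serving stack, free this span's KV entries
        kept.append((kind, text))
    return kept
```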

5) Multilingual Safety Alignment Via Sparse Weight Editing

  • Training-free sparse neuron editing that transfers English safety behavior to other languages via a closed-form low-rank update.
  • Introduces MULTI-STRONGREJECT (8 languages, 313 harmful prompts each) and shows lower unsafe counts across models; composable with MPO.
  • Why now: the multilingual jailbreak gap is a real deployment vulnerability, and weight editing is fast to iterate and deploy.
  • Caveats: evaluation relies on automatic guard models; the dataset is machine-translated and may miss natural low-resource-language jailbreaks.
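A minimal sketch of what a sparse low-rank edit looks like in general, illustrative only: `safety_rows`, `u`, and `v` are placeholders, and the paper's closed-form derivation of the update direction is more involved than a hand-picked rank-1 term.

```python
import numpy as np


def sparse_lowrank_edit(W, safety_rows, u, v, alpha=1.0):
    """Apply a rank-1 update alpha * u v^T to only the selected rows of W
    (the 'safety neurons'), leaving every other weight untouched. This is
    the general shape of a training-free sparse weight edit."""
    W_new = W.copy()
    update = alpha * np.outer(u, v)       # same shape as W
    W_new[safety_rows] += update[safety_rows]
    return W_new
```

Sparsity here is what keeps the edit cheap and reversible: only the rows identified as safety-relevant move.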

5) Practical next steps

  • Add boundary-level monitoring to agents: log tool-return boundaries and provenance metadata, and periodically run "shadow" counterfactual checks (AgentSentry-style) on high-risk tools and actions.
  • Treat the message broker as part of the edge/IoT security boundary: harden MQTT authentication/ACLs and replay protection, and measure execution-to-audit latency and failover blackout windows as first-class security SLOs.
  • If training agentic RAG with RL: try path-centric rewards (self-consistency plus reference-step coverage) with soft outcome scoring, and explicitly test evaluator sensitivity by swapping the judge model.
  • Cut long-horizon cost without breaking correctness: implement adaptive-thinking control tokens and stabilize RL with length-aware gradient regulation; separately test difficulty-aware entropy regularization to guard against entropy collapse.
  • Make research-agent reliability measurable: compute cross-run variance over answers, findings, and citations; then reduce stochasticity with structured outputs and early query-intersection ensembling while tracking accuracy.
  • For multilingual deployments: run multilingual harmful-prompt sweeps and consider sparse weight editing as a fast patch, validating with multiple harm classifiers (not just one guard model).
  • Upgrade human-evaluation pipelines: model rater severity and centrality (MFRM) and track disagreement decomposition; where disagreement is high, prioritize labels that target reducible uncertainty or annotations of missing context.
  • For auditing programs: evaluate tools end-to-end with investigator agents (AuditBench-style) rather than tool signals alone, and explicitly test the hardest target configurations (e.g., TD+KTO) to avoid overfitting to easy-to-audit targets.
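The rater-severity step above can be sketched with a crude mean-centering adjustment, far simpler than a fitted MFRM/Rasch model; the `scores` mapping keyed by (rater, system) is this digest's assumed data layout:

```python
def correct_rater_severity(scores):
    """Subtract each rater's average leniency relative to the grand mean,
    so a harsh rater no longer drags down the systems they happened to
    rate. `scores` maps (rater, system) -> score."""
    by_rater = {}
    for (rater, _system), s in scores.items():
        by_rater.setdefault(rater, []).append(s)
    grand = sum(scores.values()) / len(scores)
    # positive severity = lenient rater; negative = harsh rater
    severity = {r: sum(v) / len(v) - grand for r, v in by_rater.items()}
    return {(r, system): s - severity[r]
            for (r, system), s in scores.items()}
```

A real pipeline would fit severity and item difficulty jointly (e.g., with a many-facet Rasch model), but even this centering shows how raw labels can reorder system rankings.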

Generated from per-paper analyses; no external browsing was performed.