AI Paper Daily (2026-02-27)

Published:

English version: /paper-news/2026-02-27/

Run statistics

  • Candidate papers: 262
  • Selected papers: 30
  • Close reads completed: 30
  • Time window (UTC): 2026-02-26T01:00:00Z → 2026-02-27T01:00:00Z (arxiv_announce, expanded=0)
Papers used for the summary (arXiv ID · title · categories · score · selection rationale · tags):

  • 2602.22755 · AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors · cs.CL · 96 · Benchmark of hidden misalignment behaviors + agentic auditing; strong for eval & oversight research · tags: alignment auditing, benchmark, hidden behaviors, model evaluation, agent tools, deception
  • 2602.23329 · LLM Novice Uplift on Dual-Use, In Silico Biology Tasks · cs.AI, cs.CL, cs.CR, cs.CY, cs.HC · 96 · Careful human uplift study on bio dual-use tasks; quantifies novice capability jump with LLM access. · tags: dual-use, biosecurity, human-uplift, evaluation, risk-assessment, LLMs
  • 2602.22724 · AgentSentry: Mitigating Indirect Prompt Injection in LLM Agents via Temporal Causal Diagnostics and Context Purification · cs.CR, cs.AI · 94 · Directly targets indirect prompt injection in agents with trajectory-aware detection/mitigation · tags: agent security, prompt injection, tool outputs, inference-time defense, causal diagnostics, context sanitization
  • 2602.22525 · Systems-Level Attack Surface of Edge Agent Deployments on IoT · cs.CR · 94 · Empirical security analysis of edge LLM agents; defines measurable system security metrics + failures. · tags: agent-security, edge-agents, IoT, attack-surface, systems-security, provenance, MQTT
  • 2602.22557 · CourtGuard: A Model-Agnostic Framework for Zero-Shot Policy Adaptation in LLM Safety · cs.AI, cs.LG · 92 · Zero-shot safety policy adaptation via RAG + adversarial debate grounded in policy docs · tags: LLM safety, policy compliance, RAG, multi-agent debate, governance, zero-shot
  • 2602.22787 · Probing for Knowledge Attribution in Large Language Models · cs.CL, cs.AI · 92 · Probe predicts whether outputs rely on prompt vs internal knowledge; useful for hallucination mitigation. · tags: hallucinations, attribution, interpretability, faithfulness, factuality, probing
  • 2602.22953 · General Agent Evaluation · cs.AI · 92 · Proposes unified protocol + framework for general-agent evaluation; addresses benchmark integration bias. · tags: agent-evaluation, benchmarks, protocols, framework, general-agents, measurement
  • 2602.22603 · SideQuest: Model-Driven KV Cache Management for Long-Horizon Agentic Reasoning · cs.AI, cs.LG · 92 · LRM-driven KV cache compression for long-horizon agents; targets real bottleneck in agentic reasoning. · tags: agents, long-context, KV-cache, memory, efficiency, reasoning
  • 2602.22554 · Multilingual Safety Alignment Via Sparse Weight Editing · cs.LG · 90 · Training-free multilingual safety alignment via sparse weight editing; addresses cross-lingual guardrail gaps · tags: multilingual safety, weight editing, safety neurons, alignment, low-resource languages, robustness
  • 2602.22576 · Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training · cs.CL, cs.IR, cs.LG · 89 · Reward shaping for agentic RAG RL; extracts signal from failed trajectories to improve sample efficiency. · tags: agentic-RAG, reinforcement-learning, reward-shaping, retrieval, training, reasoning
  • 2602.23271 · Evaluating Stochasticity in Deep Research Agents · cs.AI · 89 · Formalizes and measures stochasticity/variance in research agents; identifies sources via MDP framing. · tags: agents, evaluation, stochasticity, reliability, deep-research, variance
  • 2602.22556 · Stable Adaptive Thinking via Advantage Shaping and Length-Aware Gradient Regulation · cs.LG, cs.AI, cs.CL · 89 · Adaptive thinking RL to curb overthinking while preserving hard-query reasoning; practical reliability gain. · tags: reasoning, RL, efficiency, overthinking, post-training, LRM
  • 2602.22775 · TherapyProbe: Generating Design Knowledge for Relational Safety in Mental Health Chatbots Through Adversarial Simulation · cs.HC, cs.AI, cs.CL · 88 · Multi-agent adversarial simulation to surface long-horizon relational safety failures in therapy chatbots · tags: conversational safety, mental health, multi-turn evaluation, adversarial simulation, relational harms, red teaming
  • 2602.22675 · Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization · cs.CL · 87 · Agentic search framework emphasizing parallel evidence gathering to cut cost and improve generalization. · tags: agents, search, efficiency, long-horizon, generalization, deep-research
  • 2602.22897 · OmniGAIA: Towards Native Omni-Modal AI Agents · cs.AI, cs.CL, cs.CV, cs.LG, cs.MM · 87 · Omni-modal agent benchmark (audio+video+image+tools) with multi-hop queries; useful for capability eval. · tags: multimodal, agents, benchmark, tool-use, evaluation, audio, video
  • 2602.22769 · AMA-Bench: Evaluating Long-Horizon Memory for Agentic Applications · cs.AI, cs.LG · 86 · Agent memory benchmark focused on real agent-environment trajectories, not just dialogue · tags: agent evaluation, long-horizon memory, benchmark, trajectories, agentic systems
  • 2602.22719 · Interpreting and Steering State-Space Models via Activation Subspace Bottlenecks · cs.LG · 86 · Mechanistic interpretability + test-time steering for Mamba SSMs with sizable benchmark gains. · tags: interpretability, steering, state-space-models, Mamba, mechanistic, control
  • 2602.22968 · Certified Circuits: Stability Guarantees for Mechanistic Circuits · cs.AI, cs.CV, cs.CY · 85 · Provable stability guarantees for mechanistic circuit discovery; improves interpretability reliability · tags: mechanistic interpretability, circuits, certification, robustness, theory, OOD stability
  • 2602.23200 · InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models · cs.LG, cs.CL · 85 · Hardware-aware KV-cache quantization reducing latency/memory for long-context decoding without accuracy loss. · tags: LLM-efficiency, KV-cache, quantization, long-context, inference, systems
  • 2602.23193 · ESAA: Event Sourcing for Autonomous Agents in LLM-Based Software Engineering · cs.AI · 84 · Event-sourcing architecture for LLM agents: separates intent from state mutation for reliability/auditing. · tags: agents, software-engineering, state, orchestration, auditability, reliability
  • 2602.23136 · Modality Collapse as Mismatched Decoding: Information-Theoretic Limits of Multimodal LLMs · cs.CL, cs.AI, cs.LG · 84 · Information-theoretic account of modality collapse as mismatched decoding; actionable framing for MLLMs. · tags: multimodal-LLMs, information-theory, decoding, modality-collapse, analysis
  • 2602.22871 · Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching · cs.CL, cs.AI · 84 · Step-level PRM scoring and stitching for diffusion LMs; improves test-time scaling beyond voting. · tags: test-time-scaling, diffusion-LM, process-reward-model, reasoning, self-consistency
  • 2602.22584 · Towards Faithful Industrial RAG: A Reinforced Co-adaptation Framework for Advertising QA · cs.CL · 82 · Industrial RAG reliability: jointly optimizes retrieval+generation with evidence-constrained RL · tags: RAG, hallucination reduction, faithfulness, reinforcement learning, retrieval optimization, enterprise QA
  • 2602.22585 · Correcting Human Labels for Rater Effects in AI Evaluation: An Item Response Theory Approach · cs.AI, cs.LG · 82 · Uses IRT/Rasch to correct rater effects in human evals; improves validity of AI comparisons and RLHF data. · tags: evaluation, human-ratings, psychometrics, RLHF, measurement, bias
  • 2602.22642 · Compress the Easy, Explore the Hard: Difficulty-Aware Entropy Regularization for Efficient LLM Reasoning · cs.LG · 82 · Difficulty-aware entropy regularization to compress CoT while preserving exploration on hard problems. · tags: reasoning, CoT, efficiency, entropy-regularization, RL, inference-cost
  • 2602.23262 · Decomposing Private Image Generation via Coarse-to-Fine Wavelet Modeling · cs.CV, cs.CR · 81 · DP image generation via wavelet coarse-to-fine; targets privacy-sensitive frequencies to reduce quality loss. · tags: privacy, differential-privacy, image-generation, wavelets, memorization
  • 2602.22758 · Decomposing Physician Disagreement in HealthBench · cs.AI, stat.AP · 81 · Finds most HealthBench label variance is irreducible case-level residual; important for eval design. · tags: evaluation, medical-AI, rater-disagreement, uncertainty, benchmarks, reliability
  • 2602.23258 · AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning · cs.AI, cs.CL · 80 · Test-time rectify-or-reject pruning to prevent error cascades in multi-agent systems · tags: multi-agent systems, test-time control, error correction, RAG, robustness, information flow
  • 2602.23079 · Assessing Deanonymization Risks with Stylometry-Assisted LLM Agent · cs.CL, cs.CR, cs.LG · 80 · Stylometry+LLM agent to assess/mitigate deanonymization risk; relevant to privacy leakage from text. · tags: privacy, deanonymization, stylometry, LLM-agents, security, text
  • 2602.22546 · Requesting Expert Reasoning: Augmenting LLM Agents with Learned Collaborative Intervention · cs.AI · 79 · Learned policy to query human experts as a tool; large gains in Minecraft on hard tasks (human-in-loop). · tags: human-in-the-loop, agents, tool-use, collaboration, planning, Minecraft

AI Paper Insights Brief

2026-02-27

0) Executive takeaways (read this first)

  • Agent security is sinking lower in the stack: several papers show that deployment architecture (edge IoT fleets, tool-return boundaries, KV/memory management) often dominates risk and robustness outcomes, and frequently bypasses prompt- or model-level defenses.
  • Inference-time, training-free interventions are maturing for both safety and efficiency: causal counterfactual defenses against indirect prompt injection (reporting 0% ASR), policy-text-grounded debate adjudication for zero-shot policy switching, and sparse weight editing for multilingual safety transfer.
  • GRPO is becoming the default backbone for both capability and safety/faithfulness tuning (adaptive thinking, agentic RAG reward shaping, industrial RAG faithfulness, human-AI collaboration modules), with new work focused on stabilizing gradients and rewards under length and path heterogeneity.
  • Long-horizon agents are hitting systems bottlenecks (KV cache growth, memory retrieval failures, cross-run stochasticity). New benchmarks and mechanisms (AMA-Bench, stochasticity variance metrics, SideQuest) make these failure modes measurable and optimizable.
  • Evaluation methodology is being actively repaired: rater-effect modeling (MFRM/IRT) and decomposition of physician disagreement show that raw human labels can reorder system rankings, and that much of the disagreement is case-specific, implying that "better judges" may require better task design, not just better models.
  • Evidence on biosecurity uplift has reached the human-subject, multi-model, long-horizon stage: the study reports that novices with LLM access were 4.16× more accurate (odds ratio) than internet-only novices, and most participants said bypassing safeguards took almost no effort, which raises the priority of realistic uplift evaluations.

2) Key themes (clusters)

Theme: Inference-time safety layers for agents (policy, prompt injection, edge systems)

Theme: RL (often GRPO) for agentic RAG, faithfulness, and collaboration

Theme: Reasoning efficiency without sacrificing accuracy (adaptive thinking, entropy/length control)

Theme: Long-horizon agent infrastructure: memory, KV cache, stochasticity, and evaluation

Theme: Evaluation reliability, auditing, and hidden behaviors

Theme: Privacy and dual-use risk in the agent era

3) Technical synthesis

  • GRPO runs through the batch as a unifying optimization primitive: adaptive thinking (CPAS/LAGR), agentic RAG (Search-P1), industrial faithfulness RL (advertising QA), and human-AI collaborative tool use (AHCE HFM).
  • A recurring stabilization pattern: when trajectory lengths and structures vary widely, methods add explicit normalization or weighting (LAGR length weighting; CPAS advantage shifting; path-centric rewards; difficulty-aware entropy).
  • "Boundary-centric" agent security is converging: AgentSentry defends at the tool-return boundary; the IoT edge paper emphasizes MQTT as a command boundary; CourtGuard anchors adjudication in retrieved policy text rather than parametric "intuition".
  • Retrieval is treated as a policy-learning problem rather than a fixed module: Search-P1 shapes rewards around plan execution and reference-step coverage; the industrial GraphRAG work co-adapts retrieval and generation with RL.
  • Long-horizon reliability is being operationalized with new metrics: stochasticity as normalized total variance over answers, findings, and citations; memory via recall, causal, state-update, and abstraction categories; systems security via execution-to-audit latency and failover blackout windows.
  • Model-driven systems optimization is moving beyond "better prompts": SideQuest uses a model for KV cache garbage collection; InnerQ aligns quantization grouping with decode-time vector-matrix access patterns.
  • Evaluation is moving toward "measurement models": IRT/MFRM corrects rater severity and centrality; the HealthBench disagreement decomposition shows the residual dominates; AuditBench measures end-to-end investigator success rather than tool signals alone.
  • Safety transfer is increasingly parameter-level or inference-time: sparse weight editing for multilingual safety; CourtGuard policy switching; AgentSentry inference-time counterfactual purification. All reduce dependence on large new datasets.
  • Benchmarks are getting closer to real agents: AMA-Bench uses action-observation logs and symbolic artifacts; OmniGAIA requires omni-modal tool use; General Agent Evaluation focuses on consistent, protocol-preserving comparison across environments.
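The stochasticity metric above (normalized variance over answers across repeated runs) can be sketched in a few lines. This is one plausible instantiation for categorical answers, not the paper's exact formula; `answer_variance` and its normalization are this digest's illustration.

```python
from collections import Counter


def answer_variance(runs):
    """Normalized dispersion of final answers across repeated runs:
    0.0 when every run agrees, 1.0 when all answers differ.
    (An illustrative normalization; the paper's may differ.)"""
    counts = Counter(runs)
    n = len(runs)
    # probability that two independently sampled runs disagree...
    disagree = 1.0 - sum((c / n) ** 2 for c in counts.values())
    # ...rescaled so the all-distinct case maps to exactly 1.0
    max_disagree = 1.0 - 1.0 / n
    return disagree / max_disagree if max_disagree > 0 else 0.0
```

The same computation extends to findings or citations by first canonicalizing each run's output into a comparable key.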

4) Top 5 papers (with "why now")

1) AgentSentry: Mitigating Indirect Prompt Injection in LLM Agents via Temporal Causal Diagnostics and Context Purification

  • Introduces boundary-anchored counterfactual re-execution (orig/mask/mask_sanitized/orig_sanitized) to estimate causal takeover (ACE/IE/DE).
  • Reports ASR = 0% on AgentDojo across three IPI families and multiple black-box models, with FPR = 0% in the reported tables.
  • Why now: tool-augmented agents are shipping; this is a concrete inference-time layer aimed at safe continuation rather than termination.
  • Caveats: overhead scales with the number of re-executions per boundary; the evaluation notes the benchmark may underestimate long-horizon, delayed takeovers.
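The core counterfactual idea can be illustrated with a minimal sketch. Here `agent_step` and `toy_agent_step` are hypothetical stand-ins for illustration, and a single masked re-execution replaces AgentSentry's full four-variant estimation of ACE/IE/DE:

```python
def causal_takeover_flag(agent_step, trusted_context, tool_output,
                         mask_token="[MASKED]"):
    """Boundary-anchored counterfactual check (illustrative sketch):
    re-execute the step with the untrusted tool output masked; if the
    agent's chosen action changes, the tool output causally drove the
    behavior and the boundary is flagged for purification."""
    action_orig = agent_step(trusted_context, tool_output)
    action_masked = agent_step(trusted_context, mask_token)
    return action_orig != action_masked


def toy_agent_step(context, tool_output):
    # Toy agent: obeys an injected command when present, else answers the task.
    if "send credentials" in tool_output:
        return "send_credentials"
    return "answer_user"
```

A `True` flag would then trigger the sanitize-and-continue path rather than terminating the episode.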

2) AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors

  • Provides 56 target models with 14 hidden behaviors, including anti-confession training, enabling systematic evaluation of auditing techniques.
  • Finds that scaffolded black-box tools outperform many white-box tools overall; documents a tool-to-agent gap.
  • Why now: auditing is becoming a deployment gate; this work supplies reproducible targets and end-to-end agent evaluation.
  • Caveats: targets are fine-tunes of a single base model (Llama 3.3 70B) and may be easier to audit than naturally emergent behaviors.

3) LLM Novice Uplift on Dual-Use, In Silico Biology Tasks

  • Human-subject evidence: LLM access raised novice accuracy by 4.16× (odds ratio) relative to internet-only access; the treatment group improved on 7 of 8 benchmarks.
  • The treatment group sometimes exceeded expert baselines (e.g., HPCT, VCT), and participants frequently reported minimal safeguard friction (89.6%).
  • Why now: policy discussions need uplift data from realistic multi-model, long-horizon use, not just in-model benchmarks.
  • Caveats: limited to in-silico tasks; model availability changed mid-study; not double-blind.
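Because the 4.16× figure is an odds ratio rather than a ratio of raw accuracies, it is worth being explicit about the conversion; the numbers below are hypothetical, not the study's:

```python
def odds_ratio(p_treatment, p_control):
    """Odds ratio between two success rates. An odds ratio near 4 can
    correspond to a much smaller ratio of raw accuracies once the rates
    move away from zero."""
    return (p_treatment / (1 - p_treatment)) / (p_control / (1 - p_control))
```

For example, `odds_ratio(0.5, 0.2)` is 4.0 even though accuracy only rose 2.5×, so odds ratios and accuracy multipliers should not be read interchangeably.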

4) SideQuest: Model-Driven KV Cache Management for Long-Horizon Agentic Reasoning

  • Uses a parallel auxiliary thread to judge which tool outputs are stale and deletes their KV entries without polluting the main context.
  • Reports substantial efficiency gains (peak token utilization down 56-65%, KV reads down 53-71%) plus +83.9% throughput on FRAMES on an H100.
  • Why now: deep-research and web agents are KV-bound; this is a practical serving-side lever.
  • Caveats: eviction is limited to tool outputs (not "thinking" content); some OOD accuracy degradation (BrowseComp).
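The eviction constraint can be sketched as follows. This is a minimal illustration, assuming the context is held as ordered (kind, text) segments and letting `is_stale` stand in for the auxiliary model's judgment; the real system frees the corresponding KV entries rather than rebuilding a Python list:

```python
def evict_stale_tool_outputs(segments, is_stale):
    """Keep every non-tool segment; drop tool outputs judged stale.
    Restricting eviction to tool outputs mirrors the constraint that
    reasoning and user turns are never evicted."""
    kept = []
    for kind, text in segments:
        if kind == "tool_output" and is_stale(text):
            continue  # in a real serving stack, free this span's KV entries
        kept.append((kind, text))
    return kept
```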

5) Multilingual Safety Alignment Via Sparse Weight Editing

  • Training-free sparse neuron editing that transfers English safety behavior to other languages via a closed-form low-rank update.
  • Introduces MULTI-STRONGREJECT (8 languages, 313 harmful prompts each) and shows lower unsafe counts across models; composable with MPO.
  • Why now: the multilingual jailbreak gap is a real deployment vulnerability, and weight editing is fast to iterate and deploy.
  • Caveats: evaluation relies on automatic guard models; the dataset is machine-translated and may miss natural low-resource-language jailbreaks.
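A minimal sketch of what a sparse low-rank edit looks like in general, illustrative only: `safety_rows`, `u`, and `v` are placeholders, and the paper's closed-form derivation of the update direction is more involved than a hand-picked rank-1 term.

```python
import numpy as np


def sparse_lowrank_edit(W, safety_rows, u, v, alpha=1.0):
    """Apply a rank-1 update alpha * u v^T to only the selected rows of W
    (the 'safety neurons'), leaving every other weight untouched. This is
    the general shape of a training-free sparse weight edit."""
    W_new = W.copy()
    update = alpha * np.outer(u, v)       # same shape as W
    W_new[safety_rows] += update[safety_rows]
    return W_new
```

Sparsity here is what keeps the edit cheap and reversible: only the rows identified as safety-relevant move.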

5) Practical next steps

  • Add boundary-level monitoring to agents: log tool-return boundaries and provenance metadata, and periodically run "shadow" counterfactual checks (AgentSentry-style) on high-risk tools and actions.
  • Treat the message broker as part of the edge/IoT security boundary: harden MQTT authentication/ACLs and replay protection, and measure execution-to-audit latency and failover blackout windows as first-class security SLOs.
  • If training agentic RAG with RL: try path-centric rewards (self-consistency plus reference-step coverage) with soft outcome scoring, and explicitly test evaluator sensitivity by swapping the judge model.
  • Cut long-horizon cost without breaking correctness: implement adaptive-thinking control tokens and stabilize RL with length-aware gradient regulation; separately test difficulty-aware entropy regularization to guard against entropy collapse.
  • Make research-agent reliability measurable: compute cross-run variance over answers, findings, and citations; then reduce stochasticity with structured outputs and early query-intersection ensembling while tracking accuracy.
  • For multilingual deployments: run multilingual harmful-prompt sweeps and consider sparse weight editing as a fast patch, validating with multiple harm classifiers (not just one guard model).
  • Upgrade human-evaluation pipelines: model rater severity and centrality (MFRM) and track disagreement decomposition; where disagreement is high, prioritize labels that target reducible uncertainty or annotations of missing context.
  • For auditing programs: evaluate tools end-to-end with investigator agents (AuditBench-style) rather than tool signals alone, and explicitly test the hardest target configurations (e.g., TD+KTO) to avoid overfitting to easy-to-audit targets.
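The rater-severity step above can be sketched with a crude mean-centering adjustment, far simpler than a fitted MFRM/Rasch model; the `scores` mapping keyed by (rater, system) is this digest's assumed data layout:

```python
def correct_rater_severity(scores):
    """Subtract each rater's average leniency relative to the grand mean,
    so a harsh rater no longer drags down the systems they happened to
    rate. `scores` maps (rater, system) -> score."""
    by_rater = {}
    for (rater, _system), s in scores.items():
        by_rater.setdefault(rater, []).append(s)
    grand = sum(scores.values()) / len(scores)
    # positive severity = lenient rater; negative = harsh rater
    severity = {r: sum(v) / len(v) - grand for r, v in by_rater.items()}
    return {(r, system): s - severity[r]
            for (r, system), s in scores.items()}
```

A real pipeline would fit severity and item difficulty jointly (e.g., with a many-facet Rasch model), but even this centering shows how raw labels can reorder system rankings.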

Generated from per-paper analyses; no external browsing was performed.