AI 论文日报(2026-02-28)

Published:

English version: /paper-news/2026-02-28/

运行统计

  • 候选论文: 262
  • 入选论文: 30
  • 已精读完成: 30
  • 时间窗口 (UTC): 2026-02-26T01:00:00Z → 2026-02-27T01:00:00Z (arxiv_announce, expanded=0)
展开查看用于总结的论文列表
arXiv ID标题 / 链接分类评分入选理由标签
2602.22755AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors
PDF
cs.CL96Audit benchmark w/ 56 models hiding 14 bad traits; evaluates auditing tools + autonomous investigator agent.alignment auditing, hidden behaviors, benchmarks, red-teaming, agent evaluation, model honesty
2602.23329LLM Novice Uplift on Dual-Use, In Silico Biology Tasks
PDF
cs.AI, cs.CL, cs.CR, cs.CY, cs.HC96Careful human uplift study on bio dual-use tasks; quantifies novice capability jump with LLMsdual-use, biosecurity, human-uplift, evaluation, risk-assessment, LLMs
2602.22724AgentSentry: Mitigating Indirect Prompt Injection in LLM Agents via Temporal Causal Diagnostics and Context Purification
PDF
cs.CR, cs.AI94Targets indirect prompt injection in tool/RAG agents with multi-turn causal diagnostics + context purification.agent security, prompt injection, tool use, RAG safety, inference-time defense, trajectory attacks
2602.22525Systems-Level Attack Surface of Edge Agent Deployments on IoT
PDF
cs.CR94Empirical security analysis of edge LLM agents on IoT; identifies concrete attack surfaces + metrics.agent-security, edge-deployment, IoT, attack-surface, systems-security, provenance, MQTT
2602.22557CourtGuard: A Model-Agnostic Framework for Zero-Shot Policy Adaptation in LLM Safety
PDF
cs.AI, cs.LG92Model-agnostic zero-shot safety policy adaptation via retrieval-grounded multi-agent evidentiary debate.policy compliance, RAG, multi-agent debate, governance, safety evaluation, zero-shot
2602.22787Probing for Knowledge Attribution in Large Language Models
PDF
cs.CL, cs.AI92Probe predicts whether outputs rely on prompt vs internal knowledge; useful for hallucination mitigationhallucinations, attribution, faithfulness, factuality, probes, evaluation
2602.22953General Agent Evaluation
PDF
cs.AI92Proposes unified protocol + framework for general agent evaluation; addresses benchmark integration gaps.agent-evaluation, benchmarks, evaluation-protocol, agentic-systems, framework
2602.22603SideQuest: Model-Driven KV Cache Management for Long-Horizon Agentic Reasoning
PDF
cs.AI, cs.LG92LRM-driven KV-cache compression for long-horizon agents; targets real bottleneck in agentic RAG.agents, long-context, kv-cache, efficiency, reasoning, memory-management, RAG
2602.22554Multilingual Safety Alignment Via Sparse Weight Editing
PDF
cs.LG90Training-free sparse weight editing to reduce multilingual safety gaps; claims closed-form cross-lingual mapping.multilingual safety, weight editing, safety neurons, alignment, low-resource languages, robustness
2602.23271Evaluating Stochasticity in Deep Research Agents
PDF
cs.AI90Formalizes and measures stochasticity/variance in deep research agents; identifies sources via MDP view.agents, evaluation, reliability, stochasticity, research-agents, variance
2602.22675Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization
PDF
cs.CL89Agentic search framework prioritizing parallel evidence over deep reasoning; targets cost+generalizationagents, search, efficiency, long-horizon, generalization, deep-research
2602.22556Stable Adaptive Thinking via Advantage Shaping and Length-Aware Gradient Regulation
PDF
cs.LG, cs.AI, cs.CL89RL method to curb overthinking while preserving hard-query reasoning; practical accuracy/latency tradeoff.reasoning, test-time-compute, RL, efficiency, adaptive-computation, alignment-adjacent
2602.22775TherapyProbe: Generating Design Knowledge for Relational Safety in Mental Health Chatbots Through Adversarial Simulation
PDF
cs.HC, cs.AI, cs.CL88Adversarial multi-agent simulation to surface multi-turn relational safety failures in mental health chatbots.relational safety, mental health, multi-agent simulation, evaluation, conversation dynamics, harm modes
2602.22576Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training
PDF
cs.CL, cs.IR, cs.LG88Reward shaping for RL-trained agentic RAG; extracts signal from failures to improve sample efficiencyRAG, agents, reinforcement-learning, reward-shaping, training, retrieval
2602.22897OmniGAIA: Towards Native Omni-Modal AI Agents
PDF
cs.AI, cs.CL, cs.CV, cs.LG, cs.MM88Omni-modal agent benchmark (video+audio+image) with tool use and multi-hop reasoning; likely reusable.multimodal, agents, benchmark, tool-use, evaluation, video, audio
2602.23136Modality Collapse as Mismatched Decoding: Information-Theoretic Limits of Multimodal LLMs
PDF
cs.CL, cs.AI, cs.LG87Info-theoretic account of modality collapse as mismatched decoding; actionable framing for multimodal LLMs.multimodal-llms, information-theory, decoding, representation, modality-collapse, theory
2602.22871Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching
PDF
cs.CL, cs.AI87Step-level PRM scoring + stitching across diffusion CoTs; strong test-time scaling idea for reasoning.reasoning, process-reward-model, test-time-scaling, diffusion-LM, self-consistency
2602.22968Certified Circuits: Stability Guarantees for Mechanistic Circuits
PDF
cs.AI, cs.CV, cs.CY86Provable stability guarantees for mechanistic circuit discovery via randomized subsampling certification.mechanistic interpretability, circuits, robustness, certification, auditing, OOD stability
2602.22638MobilityBench: A Benchmark for Evaluating Route-Planning Agents in Real-World Mobility Scenarios
PDF
cs.AI86Real-world route-planning agent benchmark with deterministic API-replay sandbox for reproducibilityagents, benchmark, evaluation, tool-use, reproducibility, planning
2602.22769AMA-Bench: Evaluating Long-Horizon Memory for Agentic Applications
PDF
cs.AI, cs.LG85AMA-Bench evaluates long-horizon agent memory on real agent-environment trajectories (not just dialogue).agent memory, benchmarks, long-horizon, evaluation, trajectories, agentic applications
2602.22719Interpreting and Steering State-Space Models via Activation Subspace Bottlenecks
PDF
cs.LG85Mechanistic interpretability for Mamba SSMs + simple activation steering yields broad gains.interpretability, steering, state-space-models, Mamba, mechanistic-interpretability, reliability
2602.23193ESAA: Event Sourcing for Autonomous Agents in LLM-Based Software Engineering
PDF
cs.AI84Event-sourcing architecture for LLM agents: structured intentions + deterministic state/loggingagents, software-engineering, orchestration, state, reliability, audit-logs
2602.23200InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models
PDF
cs.LG, cs.CL84Hardware-aware KV-cache quantization reducing latency/memory for long-context decoding without accuracy loss.efficiency, KV-cache, quantization, long-context, inference, systems
2602.22758Decomposing Physician Disagreement in HealthBench
PDF
cs.AI, stat.AP83Analyzes physician disagreement in HealthBench; highlights irreducible uncertainty in medical evals.evaluation, medical-AI, uncertainty, human-judgment, benchmarking, reliability
2602.22689No Caption, No Problem: Caption-Free Membership Inference via Model-Fitted Embeddings
PDF
cs.CV, cs.CR82Caption-free membership inference for diffusion models using model-fitted synthetic conditioning inputs.privacy, membership inference, diffusion models, data memorization, auditing, security
2602.22585Correcting Human Labels for Rater Effects in AI Evaluation: An Item Response Theory Approach
PDF
cs.AI, cs.LG82Uses IRT/Rasch to correct rater effects in human evals; improves reliability of AI conclusionsevaluation, human-ratings, psychometrics, IRT, RLHF, measurement
2602.22642Compress the Easy, Explore the Hard: Difficulty-Aware Entropy Regularization for Efficient LLM Reasoning
PDF
cs.LG82Difficulty-aware entropy regularization to compress CoT while avoiding entropy collapse on hard problems.reasoning, CoT, efficiency, entropy-regularization, inference-cost, RL
2602.23262Decomposing Private Image Generation via Coarse-to-Fine Wavelet Modeling
PDF
cs.CV, cs.CR81DP image generation via coarse-to-fine wavelet modeling to reduce quality loss; privacy-relevant technique.privacy, differential-privacy, image-generation, wavelets, memorization, DP-SGD
2602.22699DPSQL+: A Differentially Private SQL Library with a Minimum Frequency Rule
PDF
cs.CR, cs.DB, cs.LG80DP SQL library enforcing user-level DP plus minimum-frequency rule; practical governance-aligned privacy.differential privacy, governance, SQL, data release, minimum frequency rule, privacy engineering
2602.23079Assessing Deanonymization Risks with Stylometry-Assisted LLM Agent
PDF
cs.CL, cs.CR, cs.LG80Stylometry+LLM agent for authorship inference; highlights and mitigates deanonymization risksprivacy, deanonymization, stylometry, LLM-agents, security, risk

AI 论文洞察简报

2026-02-28

0) 执行要点(先读这个)

  • 智能体安全正在从“提示词层”转向“系统层”:边缘/混合部署引入了可测量的新失效窗口(审计延迟、故障切换黑窗、静默云端回退)以及可绕过模型行为防线的协议层欺骗风险。
  • 动态、以政策文本为依据的安全机制正在成为“权重锁定式护栏”的可行替代:基于检索的“裁决”(CourtGuard)在基准上表现强劲,并可零样本切换政策,但会带来延迟并依赖底座模型对格式/指令的遵循。
  • 面向智能体 RAG 与推理效率的 RL 正在收敛到“过程/路径塑形”:对轨迹进行奖励塑形(Search-P1)以及针对长度异质性的稳定性修复(自适应思考;难度感知熵)同时报告了准确率提升与大幅 token 降低。
  • 评测更贴近真实——也更令人警醒:新基准覆盖智能体记忆(AMA-Bench)、移动出行工具使用(MobilityBench)、全模态工具智能体(OmniGAIA)、隐蔽行为审计(AuditBench)与 DRA 随机性,常揭示当前系统因结构性原因失败(上下文/记忆丢失、工具误用、运行间方差)。
  • 隐私/安全研究正在超越经典文本 MIA:无字幕扩散成员推断(MOFIT)、带最小频次治理规则的 DP SQL(DPSQL+)、基于小波由粗到细的 DP 文生图(DP-Wavelet)、以及风格计量辅助的去匿名化智能体,展示了新的攻击面与可部署的缓解手段。
  • 双重用途风险越来越关乎“人类能力提升”,而非模型分数:一项人体实验发现,LLM 访问使新手在与生物安全相关的 in silico 任务上准确率约提升 4.16×,且多数参与者表示安全护栏带来的阻力很小。

2) 关键主题(聚类)

主题:系统层智能体安全与治理(超越提示词)

主题:政策可适配性与隐蔽行为审计

主题:通过过程/路径塑形实现高效推理与智能体 RAG

主题:更真实的智能体评测:记忆、工具、多模态与随机性

主题:隐私与双重用途:新型审计攻击、带治理约束的 DP、以及人类能力提升

3) 技术综合

  • 多篇论文在以 GRPO 风格 RL 为基础上收敛,并加入稳定性/信用分配修复:针对长度异质性的 CPAS+LAGR;难度门控熵的 CEEH;以及 Search-P1 的路径级稠密奖励。
  • 一个反复出现的模式是“重过程而非结果”:路径中心奖励(Search-P1)、扩散拼接中的步骤级打分与复用、以及 AgentSentry 的因果边界诊断都从中间结构提取信号。
  • 工具边界正在成为安全与评测的天然控制点:AgentSentry 的边界锚定反事实、ESAA 的契约校验意图、以及 IoT MQTT 主题强制的缺口都位于工具/传输层。
  • 基准越来越通过确定性来强制可复现性(MobilityBench API 回放;DRA 缓存搜索),以区分模型方差与环境方差。
  • 多项工作强调测量建模是一等组件:IRT/MFRM 处理评分者效应;随机性作为对规范化发现/引用的总方差;系统安全作为时序/外流指标。
  • 记忆/上下文管理正在分化为两条路线:语义驱逐/压缩(SideQuest 的模型驱动 KV 驱逐工具输出)与结构化外部记忆(AMA-Agent 因果图 + 工具增强检索)。
  • 安全对齐正在超越微调:用于多语种安全的免训练权重编辑(稀疏低秩编辑)与用于审核的政策文本替换(CourtGuard)。
  • 隐私审计正走向优化式、模型拟合攻击(MOFIT)与具治理意识的 DP 接口(DPSQL+),提示防守方需要 ML 与系统双重缓解。
  • 在多模态与智能体场景中,一个共同失败是“信息存在但不可用”:模态坍塌被表述为解码不匹配(GMI vs MI),以及智能体记忆失败中构建/检索丢失关键状态。

4) Top 5 论文(含“为何是现在”)

1) LLM Novice Uplift on Dual-Use, In Silico Biology Tasks

  • 量化人类能力提升:LLM 访问使新手准确率约提升 ~4.16×(优势比;校正后准确率约 5% → >17%)。
  • Treatment 在 7/8 个基准上优于 Control,并在部分任务上可超过“仅互联网”的专家基线(如 HPCT、VCT)。
  • 增加行为信号(更长、更结构化的回答;更高自信),并报告 89.6% 参与者表示克服安全护栏没有困难。
  • 质疑点:研究执行中途因模型可用性改变了流程;部分任务存在泄漏(参与者在网上找到题目);并非完全双盲。

2) AgentSentry: Mitigating Indirect Prompt Injection…

  • 推理时、黑盒兼容防御:使用边界锚定的反事实重执行与因果效应估计(ACE/IE/DE)。
  • 报告在 AgentDojo 套件与多种底座上 ASR = 0% 且保持较高效用;消融显示“净化后的反事实”至关重要。
  • 强调通过上下文净化 + 最小动作修订实现安全续写,而非一概拒答。
  • 质疑点:轻量配置(如 K=1)可能依赖注入点靠近边界;工具/运行时被攻破不在范围内。

3) CourtGuard: Zero-Shot Policy Adaptation in LLM Safety

  • 基于检索的“证据辩论(Evidentiary Debate)”实现无需微调的政策切换;报告强劲的宏平均基准表现。
  • 展示对维基百科破坏政策的零样本适配(在平衡子集上 90%)以及带专家评审对齐的法律落地变体。
  • 提供可解释、带政策引用的轨迹,并声称可用于数据集标签噪声审计。
  • 质疑点:增加推理延迟;依赖底座对指令/格式的遵循;受限于政策语料广度。

4) AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors

  • 提供缺失的基准原语:56 个模型、14 种隐蔽行为,并被设计为被问及时不自曝。
  • 在不同工具配置下评估自主调查智能体,发现带脚手架的黑盒工具常优于白盒工具。
  • 提出关键警示:工具到智能体的鸿沟——静态证据不保证智能体成功。
  • 质疑点:目标是单一底座模型(Llama 3.3 70B)上的窄幅微调;植入行为可能不同于真实世界的涌现问题。

5) Systems-Level Attack Surface of Edge Agent Deployments on IoT

  • 在架构层把“智能体安全”具体化:测量执行到审计延迟、溯源完整性、数据外流与故障切换窗口。
  • 发现 MQTT broker 默认接受伪造/重放/直接向安全主题发布;强制回退可触发静默云端路由,仅能通过 DNS/tcpdump 观察。
  • 量化故障切换:端到端 WiFi 丢失到回退路径 35.7s,而 MQTT 重连本身仅毫秒级——凸显真正的窗口所在。
  • 质疑点:单一测试床/拓扑;云端外流对比未做工作负载匹配;缓解措施未实现/未评估。

5) 实用下一步

  • 对使用工具的智能体,增加边界级安全观测:记录工具返回边界、缓存工具输出以便回放,并在自有工作流上用受控反事实重执行(AgentSentry 风格)测量接管风险。
  • 若部署边缘/混合智能体,定义并监控系统安全 SLO:执行到审计延迟、故障切换黑窗、溯源链完整性,以及对任何云端回退/外流的显式告警。
  • 对审核/治理,原型化政策文本 RAG 裁决与明确评分量表(监管 vs 实际威胁),并跨底座测量延迟与格式失败率。
  • 对智能体 RAG 的 RL 训练,用轨迹/路径奖励(自一致性 + 参考对齐)替代仅二元奖励,并为“接近命中”提供部分分;跟踪收敛速度与冗余工具动作。
  • 对推理效率,测试模式控制 token(/think vs /no_think),并用长度感知梯度加权稳定 RL;另可尝试难度门控熵以避免在难题上熵坍塌。
  • 对评测,加入随机性审计:每个查询运行 k 次,计算发现/引用的方差,并在调温前将方差定位到模块(查询 vs 总结 vs 更新)。
  • 对人工标注评测,在用原始均值做模型选择前,考虑评分者效应校正(MFRM/IRT)与评分者诊断。
  • 对隐私,假设更强审计者:在无字幕 MIA设置下评估扩散模型;对分析系统同时强制 DP 与治理约束(最小频次)并集成记账;对文本评估风格计量/去匿名化风险并测试引导式改写。

由逐篇分析生成;未进行外部浏览。