AI 论文日报（2026-02-28）

Published: February 28, 2026

English version: /paper-news/2026-02-28/

运行统计

候选论文: 262
入选论文: 30
已精读完成: 30
时间窗口 (UTC): 2026-02-26T01:00:00Z → 2026-02-27T01:00:00Z (arxiv_announce, expanded=0)

展开查看用于总结的论文列表

arXiv ID	标题 / 链接	分类	评分	入选理由	标签
`2602.22755`	AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors PDF	cs.CL	96	Audit benchmark w/ 56 models hiding 14 bad traits; evaluates auditing tools + autonomous investigator agent.	alignment auditing, hidden behaviors, benchmarks, red-teaming, agent evaluation, model honesty
`2602.23329`	LLM Novice Uplift on Dual-Use, In Silico Biology Tasks PDF	cs.AI, cs.CL, cs.CR, cs.CY, cs.HC	96	Careful human uplift study on bio dual-use tasks; quantifies novice capability jump with LLMs	dual-use, biosecurity, human-uplift, evaluation, risk-assessment, LLMs
`2602.22724`	AgentSentry: Mitigating Indirect Prompt Injection in LLM Agents via Temporal Causal Diagnostics and Context Purification PDF	cs.CR, cs.AI	94	Targets indirect prompt injection in tool/RAG agents with multi-turn causal diagnostics + context purification.	agent security, prompt injection, tool use, RAG safety, inference-time defense, trajectory attacks
`2602.22525`	Systems-Level Attack Surface of Edge Agent Deployments on IoT PDF	cs.CR	94	Empirical security analysis of edge LLM agents on IoT; identifies concrete attack surfaces + metrics.	agent-security, edge-deployment, IoT, attack-surface, systems-security, provenance, MQTT
`2602.22557`	CourtGuard: A Model-Agnostic Framework for Zero-Shot Policy Adaptation in LLM Safety PDF	cs.AI, cs.LG	92	Model-agnostic zero-shot safety policy adaptation via retrieval-grounded multi-agent evidentiary debate.	policy compliance, RAG, multi-agent debate, governance, safety evaluation, zero-shot
`2602.22787`	Probing for Knowledge Attribution in Large Language Models PDF	cs.CL, cs.AI	92	Probe predicts whether outputs rely on prompt vs internal knowledge; useful for hallucination mitigation	hallucinations, attribution, faithfulness, factuality, probes, evaluation
`2602.22953`	General Agent Evaluation PDF	cs.AI	92	Proposes unified protocol + framework for general agent evaluation; addresses benchmark integration gaps.	agent-evaluation, benchmarks, evaluation-protocol, agentic-systems, framework
`2602.22603`	SideQuest: Model-Driven KV Cache Management for Long-Horizon Agentic Reasoning PDF	cs.AI, cs.LG	92	LRM-driven KV-cache compression for long-horizon agents; targets real bottleneck in agentic RAG.	agents, long-context, kv-cache, efficiency, reasoning, memory-management, RAG
`2602.22554`	Multilingual Safety Alignment Via Sparse Weight Editing PDF	cs.LG	90	Training-free sparse weight editing to reduce multilingual safety gaps; claims closed-form cross-lingual mapping.	multilingual safety, weight editing, safety neurons, alignment, low-resource languages, robustness
`2602.23271`	Evaluating Stochasticity in Deep Research Agents PDF	cs.AI	90	Formalizes and measures stochasticity/variance in deep research agents; identifies sources via MDP view.	agents, evaluation, reliability, stochasticity, research-agents, variance
`2602.22675`	Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization PDF	cs.CL	89	Agentic search framework prioritizing parallel evidence over deep reasoning; targets cost+generalization	agents, search, efficiency, long-horizon, generalization, deep-research
`2602.22556`	Stable Adaptive Thinking via Advantage Shaping and Length-Aware Gradient Regulation PDF	cs.LG, cs.AI, cs.CL	89	RL method to curb overthinking while preserving hard-query reasoning; practical accuracy/latency tradeoff.	reasoning, test-time-compute, RL, efficiency, adaptive-computation, alignment-adjacent
`2602.22775`	TherapyProbe: Generating Design Knowledge for Relational Safety in Mental Health Chatbots Through Adversarial Simulation PDF	cs.HC, cs.AI, cs.CL	88	Adversarial multi-agent simulation to surface multi-turn relational safety failures in mental health chatbots.	relational safety, mental health, multi-agent simulation, evaluation, conversation dynamics, harm modes
`2602.22576`	Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training PDF	cs.CL, cs.IR, cs.LG	88	Reward shaping for RL-trained agentic RAG; extracts signal from failures to improve sample efficiency	RAG, agents, reinforcement-learning, reward-shaping, training, retrieval
`2602.22897`	OmniGAIA: Towards Native Omni-Modal AI Agents PDF	cs.AI, cs.CL, cs.CV, cs.LG, cs.MM	88	Omni-modal agent benchmark (video+audio+image) with tool use and multi-hop reasoning; likely reusable.	multimodal, agents, benchmark, tool-use, evaluation, video, audio
`2602.23136`	Modality Collapse as Mismatched Decoding: Information-Theoretic Limits of Multimodal LLMs PDF	cs.CL, cs.AI, cs.LG	87	Info-theoretic account of modality collapse as mismatched decoding; actionable framing for multimodal LLMs.	multimodal-llms, information-theory, decoding, representation, modality-collapse, theory
`2602.22871`	Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching PDF	cs.CL, cs.AI	87	Step-level PRM scoring + stitching across diffusion CoTs; strong test-time scaling idea for reasoning.	reasoning, process-reward-model, test-time-scaling, diffusion-LM, self-consistency
`2602.22968`	Certified Circuits: Stability Guarantees for Mechanistic Circuits PDF	cs.AI, cs.CV, cs.CY	86	Provable stability guarantees for mechanistic circuit discovery via randomized subsampling certification.	mechanistic interpretability, circuits, robustness, certification, auditing, OOD stability
`2602.22638`	MobilityBench: A Benchmark for Evaluating Route-Planning Agents in Real-World Mobility Scenarios PDF	cs.AI	86	Real-world route-planning agent benchmark with deterministic API-replay sandbox for reproducibility	agents, benchmark, evaluation, tool-use, reproducibility, planning
`2602.22769`	AMA-Bench: Evaluating Long-Horizon Memory for Agentic Applications PDF	cs.AI, cs.LG	85	AMA-Bench evaluates long-horizon agent memory on real agent-environment trajectories (not just dialogue).	agent memory, benchmarks, long-horizon, evaluation, trajectories, agentic applications
`2602.22719`	Interpreting and Steering State-Space Models via Activation Subspace Bottlenecks PDF	cs.LG	85	Mechanistic interpretability for Mamba SSMs + simple activation steering yields broad gains.	interpretability, steering, state-space-models, Mamba, mechanistic-interpretability, reliability
`2602.23193`	ESAA: Event Sourcing for Autonomous Agents in LLM-Based Software Engineering PDF	cs.AI	84	Event-sourcing architecture for LLM agents: structured intentions + deterministic state/logging	agents, software-engineering, orchestration, state, reliability, audit-logs
`2602.23200`	InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models PDF	cs.LG, cs.CL	84	Hardware-aware KV-cache quantization reducing latency/memory for long-context decoding without accuracy loss.	efficiency, KV-cache, quantization, long-context, inference, systems
`2602.22758`	Decomposing Physician Disagreement in HealthBench PDF	cs.AI, stat.AP	83	Analyzes physician disagreement in HealthBench; highlights irreducible uncertainty in medical evals.	evaluation, medical-AI, uncertainty, human-judgment, benchmarking, reliability
`2602.22689`	No Caption, No Problem: Caption-Free Membership Inference via Model-Fitted Embeddings PDF	cs.CV, cs.CR	82	Caption-free membership inference for diffusion models using model-fitted synthetic conditioning inputs.	privacy, membership inference, diffusion models, data memorization, auditing, security
`2602.22585`	Correcting Human Labels for Rater Effects in AI Evaluation: An Item Response Theory Approach PDF	cs.AI, cs.LG	82	Uses IRT/Rasch to correct rater effects in human evals; improves reliability of AI conclusions	evaluation, human-ratings, psychometrics, IRT, RLHF, measurement
`2602.22642`	Compress the Easy, Explore the Hard: Difficulty-Aware Entropy Regularization for Efficient LLM Reasoning PDF	cs.LG	82	Difficulty-aware entropy regularization to compress CoT while avoiding entropy collapse on hard problems.	reasoning, CoT, efficiency, entropy-regularization, inference-cost, RL
`2602.23262`	Decomposing Private Image Generation via Coarse-to-Fine Wavelet Modeling PDF	cs.CV, cs.CR	81	DP image generation via coarse-to-fine wavelet modeling to reduce quality loss; privacy-relevant technique.	privacy, differential-privacy, image-generation, wavelets, memorization, DP-SGD
`2602.22699`	DPSQL+: A Differentially Private SQL Library with a Minimum Frequency Rule PDF	cs.CR, cs.DB, cs.LG	80	DP SQL library enforcing user-level DP plus minimum-frequency rule; practical governance-aligned privacy.	differential privacy, governance, SQL, data release, minimum frequency rule, privacy engineering
`2602.23079`	Assessing Deanonymization Risks with Stylometry-Assisted LLM Agent PDF	cs.CL, cs.CR, cs.LG	80	Stylometry+LLM agent for authorship inference; highlights and mitigates deanonymization risks	privacy, deanonymization, stylometry, LLM-agents, security, risk

AI 论文洞察简报

2026-02-28

0) 执行要点（先读这个）

智能体安全正在从“提示词层”转向“系统层”：边缘/混合部署引入了可测量的新失效窗口（审计延迟、故障切换黑窗、静默云端回退）以及可绕过模型行为防线的协议层欺骗风险。
动态、以政策文本为依据的安全机制正在成为“权重锁定式护栏”的可行替代：基于检索的“裁决”（CourtGuard）在基准上表现强劲，并可零样本切换政策，但会带来延迟并依赖底座模型对格式/指令的遵循。
面向智能体 RAG 与推理效率的 RL 正在收敛到“过程/路径塑形”：对轨迹进行奖励塑形（Search-P1）以及针对长度异质性的稳定性修复（自适应思考；难度感知熵）同时报告了准确率提升与大幅 token 降低。
评测更贴近真实——也更令人警醒：新基准覆盖智能体记忆（AMA-Bench）、移动出行工具使用（MobilityBench）、全模态工具智能体（OmniGAIA）、隐蔽行为审计（AuditBench）与 DRA 随机性，常揭示当前系统因结构性原因失败（上下文/记忆丢失、工具误用、运行间方差）。
隐私/安全研究正在超越经典文本 MIA：无字幕扩散成员推断（MOFIT）、带最小频次治理规则的 DP SQL（DPSQL+）、基于小波由粗到细的 DP 文生图（DP-Wavelet）、以及风格计量辅助的去匿名化智能体，展示了新的攻击面与可部署的缓解手段。
双重用途风险越来越关乎“人类能力提升”，而非模型分数：一项人体实验发现，LLM 访问使新手在与生物安全相关的 in silico 任务上准确率约提升 4.16×，且多数参与者表示安全护栏带来的阻力很小。

2) 关键主题（聚类）

主题：系统层智能体安全与治理（超越提示词）

重要性：当智能体进入边缘设备、工具总线与长时程工作流时，安全性不仅取决于对齐，还取决于架构（消息传输、故障切换、可审计性）。即便模型“行为良好”，这些层面仍可被利用。
代表论文：
常见方法：
- 将智能体安全视为可测量的系统属性（审计延迟、溯源完整性、故障切换窗口）。
- 在工具边界插入运行时治理层（反事实重执行；契约校验；确定性编排器）。
- 偏好可审计 + 可回放（事件日志、缓存工具输出）以支持取证与隔离。
开放问题 / 失效模式：
- “静默”跨边界（如回退到云端）绕过用户感知与日志记录。
- 工具/运行时被攻破与缓存篡改常被视为不在范围内，但现实可行。
- 更强治理层的延迟/成本开销与实时执行需求之间的权衡。

主题：政策可适配性与隐蔽行为审计

重要性：政策变化快于模型发布；同时模型可能隐藏问题行为，因此审计需要基准与智能体工作流，而不仅是静态探针。
代表论文：
常见方法：
- 将决策锚定在检索到的政策文本与结构化裁决（辩论 + 裁判打分）。
- 构建带已知隐蔽行为的模型生物体（model organisms）并衡量审计员/智能体成功率。
- 使用测量模型（IRT/MFRM）纠正人工标注中的系统性评分偏差。
开放问题 / 失效模式：
- 工具到智能体的鸿沟：工具呈现的证据未必能转化为正确的智能体假设。
- 政策语料覆盖范围决定上限；缺失/歧义政策文本可能主导错误。
- 若不校正，评分者严苛度/趋中性会扭曲评测流水线。

主题：通过过程/路径塑形实现高效推理与智能体 RAG

重要性：前沿性能越来越受推理成本与 RL 不稳定性（长度异质性、稀疏奖励）限制。面向过程的塑形旨在同时提升准确率与效率。
代表论文：
常见方法：
- 修改 GRPO/RLVR 以在可变长度轨迹下稳定训练（长度感知梯度；优势塑形；选择性熵）。
- 用轨迹/路径奖励替代稀疏结果奖励（自一致性 vs 参考对齐；软结果打分）。
- 使用难度门控（按题目历史准确率）分配探索预算。
开放问题 / 失效模式：
- 对可验证奖励（数学/QA）的依赖可能难以迁移到开放域。
- 参考计划（离线生成）可能将学习偏置到狭窄策略集合。
- 熵/探索机制仍可能在难题上“烧 token”却找不到正确路径。

主题：更真实的智能体评测：记忆、工具、多模态与随机性

重要性：许多失败并非“模型智商”，而是系统问题：记忆构建丢失、工具误用、不可复现 API、运行间方差。新基准将其隔离出来。
代表论文：
常见方法：
- 构建确定性沙盒（API 回放）与分解指标（工具有效性、规划精确率/召回率、DR/FPR）。
- 在机器生成工件与因果环境动态上评估以智能体为中心的记忆。
- 在多层面量化随机性（答案/发现/引用），并将其归因到模块/步骤。
开放问题 / 失效模式：
- 工具调用次数与成功并非单调关系（太少会失败；太多也不保证）。
- 基于相似度的检索与有损压缩在密集、因果结构化轨迹上可能失效。
- 早期随机性会级联；推理/更新模块可能主导方差。

主题：隐私与双重用途：新型审计攻击、带治理约束的 DP、以及人类能力提升

重要性：隐私风险扩展到扩散模型与智能体去匿名化；DP 部署需要治理规则（如最小频次）。双重用途风险取决于人类在 LLM 帮助下是否更有能力。
代表论文：
常见方法：
- 重新定义威胁模型以贴近现实（仅图像 MIA；开放世界作者检索；多查询 DP 会话）。
- 使用后处理式 DP 分解（私有粗结构 + 公共细节补全）。
- 在长时交互下衡量人在回路的能力变化，而非仅看模型单独分数。
开放问题 / 失效模式：
- 无字幕扩散 MIA 可能很慢（每张图分钟级），且在评估设置中可能被某些适配方法（如 LoRA）缓解。
- DP 系统以表达能力换安全（受限 SQL 子集；会话级记账）。
- 防护措施可能无法对有动机用户形成有效摩擦（提升研究中的自我报告）。

3) 技术综合

多篇论文在以 GRPO 风格 RL 为基础上收敛，并加入稳定性/信用分配修复：针对长度异质性的 CPAS+LAGR；难度门控熵的 CEEH；以及 Search-P1 的路径级稠密奖励。
一个反复出现的模式是“重过程而非结果”：路径中心奖励（Search-P1）、扩散拼接中的步骤级打分与复用、以及 AgentSentry 的因果边界诊断都从中间结构提取信号。
工具边界正在成为安全与评测的天然控制点：AgentSentry 的边界锚定反事实、ESAA 的契约校验意图、以及 IoT MQTT 主题强制的缺口都位于工具/传输层。
基准越来越通过确定性来强制可复现性（MobilityBench API 回放；DRA 缓存搜索），以区分模型方差与环境方差。
多项工作强调测量建模是一等组件：IRT/MFRM 处理评分者效应；随机性作为对规范化发现/引用的总方差；系统安全作为时序/外流指标。
记忆/上下文管理正在分化为两条路线：语义驱逐/压缩（SideQuest 的模型驱动 KV 驱逐工具输出）与结构化外部记忆（AMA-Agent 因果图 + 工具增强检索）。
安全对齐正在超越微调：用于多语种安全的免训练权重编辑（稀疏低秩编辑）与用于审核的政策文本替换（CourtGuard）。
隐私审计正走向优化式、模型拟合攻击（MOFIT）与具治理意识的 DP 接口（DPSQL+），提示防守方需要 ML 与系统双重缓解。
在多模态与智能体场景中，一个共同失败是“信息存在但不可用”：模态坍塌被表述为解码不匹配（GMI vs MI），以及智能体记忆失败中构建/检索丢失关键状态。

4) Top 5 论文（含“为何是现在”）

1) LLM Novice Uplift on Dual-Use, In Silico Biology Tasks

量化人类能力提升：LLM 访问使新手准确率约提升 ~4.16×（优势比；校正后准确率约 5% → >17%）。
Treatment 在 7/8 个基准上优于 Control，并在部分任务上可超过“仅互联网”的专家基线（如 HPCT、VCT）。
增加行为信号（更长、更结构化的回答；更高自信），并报告 89.6% 参与者表示克服安全护栏没有困难。
质疑点：研究执行中途因模型可用性改变了流程；部分任务存在泄漏（参与者在网上找到题目）；并非完全双盲。

2) AgentSentry: Mitigating Indirect Prompt Injection…

推理时、黑盒兼容防御：使用边界锚定的反事实重执行与因果效应估计（ACE/IE/DE）。
报告在 AgentDojo 套件与多种底座上 ASR = 0% 且保持较高效用；消融显示“净化后的反事实”至关重要。
强调通过上下文净化 + 最小动作修订实现安全续写，而非一概拒答。
质疑点：轻量配置（如 K=1）可能依赖注入点靠近边界；工具/运行时被攻破不在范围内。

3) CourtGuard: Zero-Shot Policy Adaptation in LLM Safety

基于检索的“证据辩论（Evidentiary Debate）”实现无需微调的政策切换；报告强劲的宏平均基准表现。
展示对维基百科破坏政策的零样本适配（在平衡子集上 90%）以及带专家评审对齐的法律落地变体。
提供可解释、带政策引用的轨迹，并声称可用于数据集标签噪声审计。
质疑点：增加推理延迟；依赖底座对指令/格式的遵循；受限于政策语料广度。

4) AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors

提供缺失的基准原语：56 个模型、14 种隐蔽行为，并被设计为被问及时不自曝。
在不同工具配置下评估自主调查智能体，发现带脚手架的黑盒工具常优于白盒工具。
提出关键警示：工具到智能体的鸿沟——静态证据不保证智能体成功。
质疑点：目标是单一底座模型（Llama 3.3 70B）上的窄幅微调；植入行为可能不同于真实世界的涌现问题。

5) Systems-Level Attack Surface of Edge Agent Deployments on IoT

在架构层把“智能体安全”具体化：测量执行到审计延迟、溯源完整性、数据外流与故障切换窗口。
发现 MQTT broker 默认接受伪造/重放/直接向安全主题发布；强制回退可触发静默云端路由，仅能通过 DNS/tcpdump 观察。
量化故障切换：端到端 WiFi 丢失到回退路径 35.7s，而 MQTT 重连本身仅毫秒级——凸显真正的窗口所在。
质疑点：单一测试床/拓扑；云端外流对比未做工作负载匹配；缓解措施未实现/未评估。

5) 实用下一步

对使用工具的智能体，增加边界级安全观测：记录工具返回边界、缓存工具输出以便回放，并在自有工作流上用受控反事实重执行（AgentSentry 风格）测量接管风险。
若部署边缘/混合智能体，定义并监控系统安全 SLO：执行到审计延迟、故障切换黑窗、溯源链完整性，以及对任何云端回退/外流的显式告警。
对审核/治理，原型化政策文本 RAG 裁决与明确评分量表（监管 vs 实际威胁），并跨底座测量延迟与格式失败率。
对智能体 RAG 的 RL 训练，用轨迹/路径奖励（自一致性 + 参考对齐）替代仅二元奖励，并为“接近命中”提供部分分；跟踪收敛速度与冗余工具动作。
对推理效率，测试模式控制 token（/think vs /no_think），并用长度感知梯度加权稳定 RL；另可尝试难度门控熵以避免在难题上熵坍塌。
对评测，加入随机性审计：每个查询运行 k 次，计算发现/引用的方差，并在调温前将方差定位到模块（查询 vs 总结 vs 更新）。
对人工标注评测，在用原始均值做模型选择前，考虑评分者效应校正（MFRM/IRT）与评分者诊断。
对隐私，假设更强审计者：在无字幕 MIA设置下评估扩散模型；对分析系统同时强制 DP 与治理约束（最小频次）并集成记账；对文本评估风格计量/去匿名化风险并测试引导式改写。

由逐篇分析生成；未进行外部浏览。

Di Tang

AI 论文洞察简报

2026-02-28

0) 执行要点（先读这个）

2) 关键主题（聚类）

主题：系统层智能体安全与治理（超越提示词）

主题：政策可适配性与隐蔽行为审计

主题：通过过程/路径塑形实现高效推理与智能体 RAG

主题：更真实的智能体评测：记忆、工具、多模态与随机性

主题：隐私与双重用途：新型审计攻击、带治理约束的 DP、以及人类能力提升

3) 技术综合

4) Top 5 论文（含“为何是现在”）

5) 实用下一步