AI 论文日报（2026-03-01）

Published: March 01, 2026

English version: /paper-news/2026-03-01/

运行统计

候选论文: 262
入选论文: 30
已精读完成: 30
时间窗口 (UTC): 2026-02-26T01:00:00Z → 2026-02-28T01:00:00Z (arxiv_announce, expanded=1)

展开查看用于总结的论文列表

arXiv ID	标题 / 链接	分类	评分	入选理由	标签
`2602.23329`	LLM Novice Uplift on Dual-Use, In Silico Biology Tasks PDF	cs.AI, cs.CL, cs.CR, cs.CY, cs.HC	96	Careful human study shows large LLM uplift on bio dual-use tasks; key for risk assessment.	dual-use, biosecurity, human-uplift, evaluation, misuse-risk
`2602.22755`	AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors PDF	cs.CL	95	Benchmark of hidden misalignment behaviors + agentic auditing tools; strong for eval & oversight.	alignment auditing, benchmarks, hidden behaviors, agent evaluators, model honesty
`2602.22724`	AgentSentry: Mitigating Indirect Prompt Injection in LLM Agents via Temporal Causal Diagnostics and Context Purification PDF	cs.CR, cs.AI	93	Directly targets indirect prompt injection in agents with trajectory-aware diagnostics + mitigation.	agent security, prompt injection, tool outputs, inference-time defense, causal diagnostics
`2602.22525`	Systems-Level Attack Surface of Edge Agent Deployments on IoT PDF	cs.CR	93	Empirical security analysis of edge LLM agents; concrete attack surfaces + measurable security metrics.	agent-security, edge-deployment, IoT, attack-surface, systems-security, provenance
`2602.22603`	SideQuest: Model-Driven KV Cache Management for Long-Horizon Agentic Reasoning PDF	cs.AI, cs.LG	92	LRM-driven KV cache compression for long-horizon agents; targets real bottleneck in agentic reasoning.	agents, long-context, memory, KV-cache, efficiency, reasoning
`2602.22557`	CourtGuard: A Model-Agnostic Framework for Zero-Shot Policy Adaptation in LLM Safety PDF	cs.AI, cs.LG	91	Model-agnostic zero-shot safety policy adaptation via RAG multi-agent debate grounded in policies.	policy compliance, RAG, multi-agent debate, governance, safety evaluation
`2602.22787`	Probing for Knowledge Attribution in Large Language Models PDF	cs.CL, cs.AI	91	Probe predicts whether outputs rely on prompt vs internal knowledge; useful for hallucination control.	hallucinations, attribution, faithfulness, factuality, interpretability
`2602.22953`	General Agent Evaluation PDF	cs.AI	91	Proposes unified protocol + framework for general agent evaluation; addresses benchmark integration bias.	agent-evaluation, benchmarks, general-agents, protocols, reproducibility
`2602.22775`	TherapyProbe: Generating Design Knowledge for Relational Safety in Mental Health Chatbots Through Adversarial Simulation PDF	cs.HC, cs.AI, cs.CL	90	Adversarial multi-agent simulation to surface long-horizon relational safety failures in therapy bots.	mental health, conversational safety, multi-turn evaluation, red teaming, agent simulation
`2602.22576`	Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training PDF	cs.CL, cs.IR, cs.LG	89	Reward shaping for agentic RAG RL improves sample efficiency using trajectory-level signals.	agentic-RAG, reinforcement-learning, reward-shaping, retrieval, reasoning
`2602.22897`	OmniGAIA: Towards Native Omni-Modal AI Agents PDF	cs.AI, cs.CL, cs.CV, cs.LG, cs.MM	89	Omni-modal agent benchmark (audio+video+image+tools) with event-graph construction; high reuse potential.	multimodal, agents, benchmark, tool-use, evaluation, long-horizon
`2602.22556`	Stable Adaptive Thinking via Advantage Shaping and Length-Aware Gradient Regulation PDF	cs.LG, cs.AI, cs.CL	89	RL framework to curb overthinking while preserving correctness; practical for reliable reasoning models.	reasoning, RL, efficiency, adaptive-compute, alignment, robustness
`2602.22554`	Multilingual Safety Alignment Via Sparse Weight Editing PDF	cs.LG	88	Training-free sparse weight editing to reduce multilingual safety gaps; practical alignment lever.	multilingual safety, weight editing, safety neurons, alignment, low-resource languages
`2602.22675`	Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization PDF	cs.CL	87	Agentic search framework prioritizing parallel evidence over deep reasoning; targets cost and generalization.	agents, search, efficiency, long-horizon, deep-research
`2602.23271`	Evaluating Stochasticity in Deep Research Agents PDF	cs.AI	87	Formalizes and measures stochasticity/variance in deep research agents; identifies sources via MDP framing.	agents, evaluation, stochasticity, reliability, research-agents, variance
`2602.22769`	AMA-Bench: Evaluating Long-Horizon Memory for Agentic Applications PDF	cs.AI, cs.LG	86	AMA-Bench evaluates long-horizon agent memory on real agent trajectories beyond dialogue setups.	agent memory, benchmarks, long-horizon, evaluation, trajectories
`2602.23136`	Modality Collapse as Mismatched Decoding: Information-Theoretic Limits of Multimodal LLMs PDF	cs.CL, cs.AI, cs.LG	86	Theory for multimodal 'modality collapse' as mismatched decoding; probes + info-theoretic limits (GMI).	multimodal-LLMs, information-theory, decoding, representation, robustness
`2602.22719`	Interpreting and Steering State-Space Models via Activation Subspace Bottlenecks PDF	cs.LG	86	Mechanistic interpretability + test-time steering for Mamba/SSMs; notable gains via simple intervention.	interpretability, steering, SSM, Mamba, mechanistic, reliability
`2602.22968`	Certified Circuits: Stability Guarantees for Mechanistic Circuits PDF	cs.AI, cs.CV, cs.CY	85	Provable stability guarantees for mechanistic circuit discovery; improves interpretability reliability.	mechanistic interpretability, circuits, certification, robustness, auditing
`2602.22638`	MobilityBench: A Benchmark for Evaluating Route-Planning Agents in Real-World Mobility Scenarios PDF	cs.AI	84	Real-world route-planning benchmark with deterministic API-replay sandbox for reproducible agent eval.	agents, benchmark, tool-use, evaluation, sandbox
`2602.23200`	InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models PDF	cs.LG, cs.CL	84	Hardware-aware KV-cache quantization reducing latency/memory for long-context decoding; practical impact.	LLM-efficiency, KV-cache, quantization, long-context, inference
`2602.22871`	Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching PDF	cs.CL, cs.AI	84	Step-level PRM-guided stitching for diffusion LMs; improves test-time scaling beyond trace voting.	test-time-scaling, diffusion-LM, process-reward-model, reasoning, self-consistency
`2602.22689`	No Caption, No Problem: Caption-Free Membership Inference via Model-Fitted Embeddings PDF	cs.CV, cs.CR	82	Caption-free membership inference for diffusion models; strengthens privacy auditing realism.	privacy, membership inference, diffusion models, data memorization, security
`2602.23193`	ESAA: Event Sourcing for Autonomous Agents in LLM-Based Software Engineering PDF	cs.AI	82	Event-sourcing architecture for LLM agents: structured intentions + deterministic state/logging.	agents, software-engineering, state, reliability, orchestration
`2602.22642`	Compress the Easy, Explore the Hard: Difficulty-Aware Entropy Regularization for Efficient LLM Reasoning PDF	cs.LG	82	Reasoning compression via difficulty-aware entropy regularization to avoid exploration collapse on hard tasks.	LLM-reasoning, CoT, efficiency, entropy-regularization, RL
`2602.22758`	Decomposing Physician Disagreement in HealthBench PDF	cs.AI, stat.AP	82	Analyzes physician disagreement in HealthBench; highlights irreducible uncertainty in medical evals.	evaluation, medical-AI, uncertainty, human-judgment, benchmarks, reliability
`2602.23262`	Decomposing Private Image Generation via Coarse-to-Fine Wavelet Modeling PDF	cs.CV, cs.CR	81	DP image generation via wavelet coarse-to-fine; targets privacy/utility tradeoff with spectral hypothesis.	privacy, differential-privacy, image-generation, wavelets, memorization
`2602.22699`	DPSQL+: A Differentially Private SQL Library with a Minimum Frequency Rule PDF	cs.CR, cs.DB, cs.LG	80	DP SQL system enforcing minimum frequency rule; relevant for governance-grade privacy releases.	differential privacy, data governance, SQL, minimum frequency rule, privacy engineering
`2602.22585`	Correcting Human Labels for Rater Effects in AI Evaluation: An Item Response Theory Approach PDF	cs.AI, cs.LG	80	Uses IRT/Rasch to correct rater effects in human eval; improves reliability of AI conclusions.	evaluation, human-raters, psychometrics, RLHF, measurement
`2602.22983`	Obscure but Effective: Classical Chinese Jailbreak Prompt Optimization via Bio-Inspired Search PDF	cs.AI, cs.CR	79	Shows classical Chinese as jailbreak vector + automated black-box prompt search; useful for red-teaming.	jailbreaks, multilingual attacks, adversarial prompts, red teaming, prompt optimization

AI 论文洞察简报

2026-03-01

0) 执行要点（先读这个）

智能体安全正在从“提示词层面”转向“系统层面”：边缘 IoT 群体表明，协调总线（MQTT）、故障切换行为以及静默云端回退可能主导真实风险——即使模型行为本身不变。
推理时、以政策为依据的安全更易更新：CourtGuard 展示了通过 RAG + 对抗式辩论实现零样本政策替换，并在基准上表现强劲，提示一种无需再训练即可降低“对齐滞后”的路径。
多轮智能体攻防正在变得因果化与时间化：AgentSentry 通过在工具返回边界用反事实重执行定位接管点，然后仅净化不可信的中介上下文以安全继续执行，报告在 AgentDojo 上攻击成功率为 0%。
效率工作正收敛到“自适应计算”并配套稳定性修复：多篇论文通过 GRPO 稳定器（CPAS/LAGR）、难度感知熵正则（CEEH）与步骤级复用（扩散拼接）来解决过度思考/长时程成本，而非粗暴的长度惩罚。
评测正在走向方差/噪声感知的测量：评分者效应校正（IRT）可改变系统排名；HealthBench 的分歧多为个案特异；深度研究智能体存在可测的运行间方差，并可进行模块归因与缓解。
双重用途风险证据更直接：一项人体研究发现，LLM 访问使新手在 in silico 生物任务上的准确率提高 4.16×，且多数用户表示绕过防护几乎不费力。

2) 关键主题（聚类）

主题：超越提示词的工具型智能体安全（系统 + 时间性防御）

重要性：当智能体通过工具与物理设备行动时，主要漏洞越来越来自协调底座、上下文持久化与运行时回退——而不只是单轮提示注入。
代表论文：
共同方法：
- 将安全属性视为系统指标（审计延迟、溯源完整性、外联、故障切换窗口），而非仅看模型行为。
- 在工具返回/状态迁移点（不可信内容进入处）插入边界检查。
- 倾向采用可审计的状态内核（只追加日志、确定性回放、契约）以减少状态漂移并支持治理。
开放问题 / 失效模式：
- 如何在边缘约束下为协调层（如 MQTT）加入加密溯源/ACL 而不破坏时延。
- 反事实诊断带来开销；对长时程延迟接管的鲁棒性在现有基准之外仍不清楚。
- 事件溯源内核可验证合规/回放，但不直接衡量软件质量或更广泛的安全侧信道。

主题：动态、以政策为依据的对齐与多语言安全迁移

重要性：线上政策变化快于再训练周期，且跨语言安全缺口仍可被利用；两者都推动可更新、模块化对齐。
代表论文：
共同方法：
- 将安全决策锚定在检索到的政策文本上，并产出可审计的理由（CourtGuard）。
- 通过在识别出的“安全神经元”上做稀疏、低秩编辑来迁移安全，避免全量再训练（Sparse Weight Editing）。
- 用分布偏移语言与自动化黑盒优化对护栏进行压力测试（Classical Chinese FOA search）。
开放问题 / 失效模式：
- RAG+辩论式评估器以准确性换取时延/成本；较小骨干可能无法稳定输出结构化结果。
- 权重编辑在自适应越狱或语言间安全表征差异很大时可能脆弱。
- 文言文越狱结果表明，针对现代语言表面形式调参的防御可能灾难性失效。

主题：面向推理与智能体 RAG 的稳定效率扩展

重要性：服务成本已是一等约束；朴素长度惩罚会压垮探索或准确率。趋势转向稳定性优先的效率机制。
代表论文：
共同方法：
- 修改 GRPO/RLVR 以避免在长度异质性下模式坍塌（CPAS/LAGR；难度感知熵）。
- 用轨迹/过程信号替代稀疏/二值结果奖励（路径中心评分；步骤级 PRM 评分）。
- 复用部分进度（跨扩散轨迹拼接步骤），而非只选择整条轨迹。
开放问题 / 失效模式：
- 依赖可验证奖励（数学/QA）限制领域迁移；开放式任务的验证仍困难。
- PRM 误排序可能丢弃关键步骤，或在拼接中对错误锚点过度信任。
- 难度估计与选择性熵可能把过多预算投入到仍无法解决的难例。

主题：长时程智能体记忆 + 推理基础设施

重要性：长时程智能体会遭遇上下文/KV 瓶颈与记忆检索失败；改进越来越需要系统 + 模型协同设计。
代表论文：
共同方法：
- 将工具输出视为缓存的一等公民；用语义、模型驱动决策进行驱逐（SideQuest）。
- 将量化布局与解码时内核对齐（内维分组）以减少内存流量（InnerQ）。
- 在智能体–环境轨迹上评测记忆，包含机器生成工件与因果状态迁移（AMA-Bench）。
开放问题 / 失效模式：
- SideQuest 目前只驱逐工具响应（不含“思考”token），并出现一定 OOD 退化。
- KV 量化结果展示在 GSM8K few-shot；更广任务影响及与驱逐策略的交互仍待研究。
- 记忆基准部分评分依赖 LLM-as-judge；判分鲁棒性仍是担忧。

主题：评测可靠性、随机性与分歧作为一等信号

重要性：当模型趋同，测量误差（评分者效应、分歧、运行间方差）可能主导“进步”观感并导致系统误排名。
代表论文：
共同方法：
- 显式建模标注者（MFRM）以调整分数并诊断严苛度/中心化倾向。
- 将方差分解为评分者/量表/残差或模块/时间步贡献。
- 在智能体使用情境中评测工具，暴露工具到智能体的差距，而非静态工具质量。
开放问题 / 失效模式：
- HealthBench 分歧多为残差/个案特异；从嵌入预测几乎接近随机，限制自动化。
- 即使温度为 0，API 非确定性仍可能存在，影响可复现性。
- 审计工具在更难目标上可能变差；效果强依赖目标训练方式（TD vs SDF；KTO vs SFT）。

3) 技术综合

以边界为中心的思路反复出现：AgentSentry 的工具返回边界、ESAA 的意图/效果边界、以及边缘 IoT 的 MQTT 协调边界都把“状态变化发生处”视为衡量/控制风险的关键位置。
GRPO 正成为通用底座：既用于推理效率（自适应思考；CEEH），也用于智能体 RAG 训练（Search-P1），论文重点在异质性下稳定梯度/奖励。
过程信号正在替代二值结果：Search-P1 的路径中心评分与扩散步骤拼接都从部分正确轨迹中提取学习/选择信号。
“模型作为系统组件”的范围在扩大：SideQuest 用 LRM 管理自身 KV cache；AgentSentry 在受控重执行中使用模型；CourtGuard 用多角色（攻击者/防御者/裁判）结构化评估。
评测工作正收敛到方差分解：评分者效应（IRT）、医生分歧 ICC、以及 DRA 随机性都在形式化“方差来自哪里”，而非将其视为噪声。
语言分布偏移仍是主要越狱向量：文言文优化显示对多个闭源模型几乎完全攻破；Sparse Weight Editing 试图在不再训练下弥合多语言缺口。
隐私审计正在扩展威胁模型：MOFIT 去除了扩散 MIA 的“真实 caption”假设；DP-Wavelet 与 DPSQL+ 聚焦可部署 DP 与实际约束（后处理、最小频次规则）。
智能体基准更贴近环境且更可复现：MobilityBench 的 API 回放沙箱与 General Agent Evaluation 的 Unified Protocol 都面向可复现与跨系统可比。
可解释性越来越与干预绑定：SSM 瓶颈引导（Mamba）与可认证电路稳定性都旨在让机制性产物可操作且可靠。

4) Top 5 论文（含“为何现在”）

1) AgentSentry: Mitigating Indirect Prompt Injection in LLM Agents via Temporal Causal Diagnostics and Context Purification

引入以边界为锚的反事实诊断（orig/mask/mask_sanitized/orig_sanitized）以定位中介接管。
报告在 AgentDojo 上跨多种攻击家族与骨干模型 ASR = 0%，且在攻击下保持高效用。
通过仅将不可信中介内容重写为“仅证据”形式来缓解，使系统可继续而非终止。
需保持怀疑：反事实运行带来的额外推理开销；基准可能低估长时程延迟接管。

2) Systems-Level Attack Surface of Edge Agent Deployments on IoT

让智能体安全可度量：执行到审计延迟（某路径均值约 23 ms）、溯源完整性、外联、故障切换窗口。
显示 MQTT broker 在无加密强制下接受欺骗/重放/直接发布安全主题。
演示通过强制回退（DNS 到 api.anthropic.com）实现静默主权边界跨越，且应用层无异常。
需保持怀疑：单一测试床/拓扑；云外联对比未做工作负载匹配。

3) CourtGuard: A Model-Agnostic Framework for Zero-Shot Policy Adaptation in LLM Safety

基于检索的对抗辩论产出可解释裁决，包含威胁分数与政策引用。
报告强基准表现（macro Acc 0.87 / F1 0.86）且在人类验证套件上高召回。
通过更换政策语料展示 零样本政策替换（如 Wikipedia vandalism）。
需保持怀疑：RAG + 多轮辩论的时延/成本；对骨干模型格式遵循的依赖。

4) AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors

提供 56 个加固目标模型，含 14 种隐藏行为，并降低“自白率”（KTO 比 SFT 更难）。
智能体式评估显示脚手架化黑盒工具优于许多白盒工具；效果依赖目标训练方式。
揭示具体“工具到智能体差距”（使用不足、噪声干扰、假设失败）。
需保持怀疑：目标是基于单一底座模型微调的“模型生物体”；可能不匹配自然涌现的隐藏行为。

5) LLM Novice Uplift on Dual-Use, In Silico Biology Tasks

人体受试证据：LLM 访问使新手准确率提高 4.16×；Treatment 在 8 个基准中赢 7 个。
Treatment 新手有时超过专家基线；但独立 LLM 往往超过 LLM 辅助新手（引导/提取差距）。
报告多数 Treatment 参与者表示绕过防护不困难（89.6%）。
需保持怀疑：研究限制包括模型可用性变化、可能的信息泄漏（部分问题可在网上找到）、以及未完全盲法。

5) 实用下一步

对工具型智能体，加入工具返回边界的监测：记录中介内容、拟执行动作与轻量“接管风险”代理指标；衡量高影响动作被归因于中介的频率。
在边缘/IoT 部署中，将消息总线安全视为安全关键：在 MQTT（或等价）中测试欺骗/重放/直接主题发布；测量执行到审计延迟与故障切换黑窗。
若需要快速政策更新，原型化一个policy-RAG 评估器，提供明确引用与确定性裁决映射；对比静态分类器的时延。
对多语言安全，评估语言偏移越狱（含风格偏移）并考虑稀疏干预；衡量非安全任务上的效用漂移。
对推理效率，避免粗暴长度惩罚：尝试难度感知的探索控制（仅对难例加熵）或在长度异质性下做优势/梯度调节；跟踪模式坍塌。
对长时程智能体，结合语义 KV 驱逐（工具响应垃圾回收）与硬件对齐的 KV 量化；测量吞吐与未完成/解析失败。
升级评测流水线：(i) 使用人工标签时建模评分者效应，(ii) 报告分歧感知指标，(iii) 对研究型智能体报告答案/发现/引用的运行间方差及模块归因。
对双重用途治理，将人类+LLM 提升研究纳入风险评估（而非仅 LLM-only 基准），并明确测试防护是否能实质性减慢任务完成。

由逐篇论文分析生成；无外部浏览。

Di Tang

AI 论文洞察简报

2026-03-01

0) 执行要点（先读这个）

2) 关键主题（聚类）

主题：超越提示词的工具型智能体安全（系统 + 时间性防御）

主题：动态、以政策为依据的对齐与多语言安全迁移

主题：面向推理与智能体 RAG 的稳定效率扩展

主题：长时程智能体记忆 + 推理基础设施

主题：评测可靠性、随机性与分歧作为一等信号

3) 技术综合

4) Top 5 论文（含“为何现在”）

5) 实用下一步