AI Paper Daily (2026-03-01)

Published:

English version: /paper-news/2026-03-01/

Run statistics

  • Candidate papers: 262
  • Selected papers: 30
  • Deep reads completed: 30
  • Time window (UTC): 2026-02-26T01:00:00Z → 2026-02-28T01:00:00Z (arxiv_announce, expanded=1)
Paper list used for summarization (format: arXiv ID | title | categories | score | selection rationale | tags):

  • 2602.23329 | LLM Novice Uplift on Dual-Use, In Silico Biology Tasks | cs.AI, cs.CL, cs.CR, cs.CY, cs.HC | 96 | Careful human study shows large LLM uplift on bio dual-use tasks; key for risk assessment. | dual-use, biosecurity, human-uplift, evaluation, misuse-risk
  • 2602.22755 | AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors | cs.CL | 95 | Benchmark of hidden misalignment behaviors + agentic auditing tools; strong for eval & oversight. | alignment auditing, benchmarks, hidden behaviors, agent evaluators, model honesty
  • 2602.22724 | AgentSentry: Mitigating Indirect Prompt Injection in LLM Agents via Temporal Causal Diagnostics and Context Purification | cs.CR, cs.AI | 93 | Directly targets indirect prompt injection in agents with trajectory-aware diagnostics + mitigation. | agent security, prompt injection, tool outputs, inference-time defense, causal diagnostics
  • 2602.22525 | Systems-Level Attack Surface of Edge Agent Deployments on IoT | cs.CR | 93 | Empirical security analysis of edge LLM agents; concrete attack surfaces + measurable security metrics. | agent-security, edge-deployment, IoT, attack-surface, systems-security, provenance
  • 2602.22603 | SideQuest: Model-Driven KV Cache Management for Long-Horizon Agentic Reasoning | cs.AI, cs.LG | 92 | LRM-driven KV cache compression for long-horizon agents; targets real bottleneck in agentic reasoning. | agents, long-context, memory, KV-cache, efficiency, reasoning
  • 2602.22557 | CourtGuard: A Model-Agnostic Framework for Zero-Shot Policy Adaptation in LLM Safety | cs.AI, cs.LG | 91 | Model-agnostic zero-shot safety policy adaptation via RAG multi-agent debate grounded in policies. | policy compliance, RAG, multi-agent debate, governance, safety evaluation
  • 2602.22787 | Probing for Knowledge Attribution in Large Language Models | cs.CL, cs.AI | 91 | Probe predicts whether outputs rely on prompt vs internal knowledge; useful for hallucination control. | hallucinations, attribution, faithfulness, factuality, interpretability
  • 2602.22953 | General Agent Evaluation | cs.AI | 91 | Proposes unified protocol + framework for general agent evaluation; addresses benchmark integration bias. | agent-evaluation, benchmarks, general-agents, protocols, reproducibility
  • 2602.22775 | TherapyProbe: Generating Design Knowledge for Relational Safety in Mental Health Chatbots Through Adversarial Simulation | cs.HC, cs.AI, cs.CL | 90 | Adversarial multi-agent simulation to surface long-horizon relational safety failures in therapy bots. | mental health, conversational safety, multi-turn evaluation, red teaming, agent simulation
  • 2602.22576 | Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training | cs.CL, cs.IR, cs.LG | 89 | Reward shaping for agentic RAG RL improves sample efficiency using trajectory-level signals. | agentic-RAG, reinforcement-learning, reward-shaping, retrieval, reasoning
  • 2602.22897 | OmniGAIA: Towards Native Omni-Modal AI Agents | cs.AI, cs.CL, cs.CV, cs.LG, cs.MM | 89 | Omni-modal agent benchmark (audio+video+image+tools) with event-graph construction; high reuse potential. | multimodal, agents, benchmark, tool-use, evaluation, long-horizon
  • 2602.22556 | Stable Adaptive Thinking via Advantage Shaping and Length-Aware Gradient Regulation | cs.LG, cs.AI, cs.CL | 89 | RL framework to curb overthinking while preserving correctness; practical for reliable reasoning models. | reasoning, RL, efficiency, adaptive-compute, alignment, robustness
  • 2602.22554 | Multilingual Safety Alignment Via Sparse Weight Editing | cs.LG | 88 | Training-free sparse weight editing to reduce multilingual safety gaps; practical alignment lever. | multilingual safety, weight editing, safety neurons, alignment, low-resource languages
  • 2602.22675 | Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization | cs.CL | 87 | Agentic search framework prioritizing parallel evidence over deep reasoning; targets cost and generalization. | agents, search, efficiency, long-horizon, deep-research
  • 2602.23271 | Evaluating Stochasticity in Deep Research Agents | cs.AI | 87 | Formalizes and measures stochasticity/variance in deep research agents; identifies sources via MDP framing. | agents, evaluation, stochasticity, reliability, research-agents, variance
  • 2602.22769 | AMA-Bench: Evaluating Long-Horizon Memory for Agentic Applications | cs.AI, cs.LG | 86 | AMA-Bench evaluates long-horizon agent memory on real agent trajectories beyond dialogue setups. | agent memory, benchmarks, long-horizon, evaluation, trajectories
  • 2602.23136 | Modality Collapse as Mismatched Decoding: Information-Theoretic Limits of Multimodal LLMs | cs.CL, cs.AI, cs.LG | 86 | Theory for multimodal 'modality collapse' as mismatched decoding; probes + info-theoretic limits (GMI). | multimodal-LLMs, information-theory, decoding, representation, robustness
  • 2602.22719 | Interpreting and Steering State-Space Models via Activation Subspace Bottlenecks | cs.LG | 86 | Mechanistic interpretability + test-time steering for Mamba/SSMs; notable gains via simple intervention. | interpretability, steering, SSM, Mamba, mechanistic, reliability
  • 2602.22968 | Certified Circuits: Stability Guarantees for Mechanistic Circuits | cs.AI, cs.CV, cs.CY | 85 | Provable stability guarantees for mechanistic circuit discovery; improves interpretability reliability. | mechanistic interpretability, circuits, certification, robustness, auditing
  • 2602.22638 | MobilityBench: A Benchmark for Evaluating Route-Planning Agents in Real-World Mobility Scenarios | cs.AI | 84 | Real-world route-planning benchmark with deterministic API-replay sandbox for reproducible agent eval. | agents, benchmark, tool-use, evaluation, sandbox
  • 2602.23200 | InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models | cs.LG, cs.CL | 84 | Hardware-aware KV-cache quantization reducing latency/memory for long-context decoding; practical impact. | LLM-efficiency, KV-cache, quantization, long-context, inference
  • 2602.22871 | Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching | cs.CL, cs.AI | 84 | Step-level PRM-guided stitching for diffusion LMs; improves test-time scaling beyond trace voting. | test-time-scaling, diffusion-LM, process-reward-model, reasoning, self-consistency
  • 2602.22689 | No Caption, No Problem: Caption-Free Membership Inference via Model-Fitted Embeddings | cs.CV, cs.CR | 82 | Caption-free membership inference for diffusion models; strengthens privacy auditing realism. | privacy, membership inference, diffusion models, data memorization, security
  • 2602.23193 | ESAA: Event Sourcing for Autonomous Agents in LLM-Based Software Engineering | cs.AI | 82 | Event-sourcing architecture for LLM agents: structured intentions + deterministic state/logging. | agents, software-engineering, state, reliability, orchestration
  • 2602.22642 | Compress the Easy, Explore the Hard: Difficulty-Aware Entropy Regularization for Efficient LLM Reasoning | cs.LG | 82 | Reasoning compression via difficulty-aware entropy regularization to avoid exploration collapse on hard tasks. | LLM-reasoning, CoT, efficiency, entropy-regularization, RL
  • 2602.22758 | Decomposing Physician Disagreement in HealthBench | cs.AI, stat.AP | 82 | Analyzes physician disagreement in HealthBench; highlights irreducible uncertainty in medical evals. | evaluation, medical-AI, uncertainty, human-judgment, benchmarks, reliability
  • 2602.23262 | Decomposing Private Image Generation via Coarse-to-Fine Wavelet Modeling | cs.CV, cs.CR | 81 | DP image generation via wavelet coarse-to-fine; targets privacy/utility tradeoff with spectral hypothesis. | privacy, differential-privacy, image-generation, wavelets, memorization
  • 2602.22699 | DPSQL+: A Differentially Private SQL Library with a Minimum Frequency Rule | cs.CR, cs.DB, cs.LG | 80 | DP SQL system enforcing minimum frequency rule; relevant for governance-grade privacy releases. | differential privacy, data governance, SQL, minimum frequency rule, privacy engineering
  • 2602.22585 | Correcting Human Labels for Rater Effects in AI Evaluation: An Item Response Theory Approach | cs.AI, cs.LG | 80 | Uses IRT/Rasch to correct rater effects in human eval; improves reliability of AI conclusions. | evaluation, human-raters, psychometrics, RLHF, measurement
  • 2602.22983 | Obscure but Effective: Classical Chinese Jailbreak Prompt Optimization via Bio-Inspired Search | cs.AI, cs.CR | 79 | Shows classical Chinese as jailbreak vector + automated black-box prompt search; useful for red-teaming. | jailbreaks, multilingual attacks, adversarial prompts, red teaming, prompt optimization

AI Paper Insights Briefing

2026-03-01

0) Executive takeaways (read this first)

  • Agent security is shifting from the prompt level to the systems level: an edge-IoT study shows that coordination buses (MQTT), failover behavior, and silent cloud fallback can dominate real-world risk, even when model behavior itself is unchanged.
  • Inference-time, policy-grounded safety is easier to update: CourtGuard demonstrates zero-shot policy swapping via RAG plus adversarial debate, with strong benchmark results, suggesting a path to reducing "alignment lag" without retraining.
  • Multi-turn agent attack and defense is becoming causal and temporal: AgentSentry localizes takeover points with counterfactual re-execution at tool-return boundaries, then purifies only the untrusted intermediary context so execution can continue safely, reporting a 0% attack success rate on AgentDojo.
  • Efficiency work is converging on adaptive compute paired with stability fixes: several papers address overthinking and long-horizon cost via GRPO stabilizers (CPAS/LAGR), difficulty-aware entropy regularization (CEEH), and step-level reuse (diffusion stitching) rather than blunt length penalties.
  • Evaluation is moving toward variance- and noise-aware measurement: rater-effect correction (IRT) can change system rankings; HealthBench disagreement is largely case-specific; deep research agents show measurable run-to-run variance that can be attributed to modules and mitigated.
  • Dual-use risk evidence is getting more direct: one human study finds that LLM access raises novice accuracy 4.16× on in silico biology tasks, and most users report that bypassing safeguards takes little effort.

2) Key themes (clusters)

Theme: Tool-using agent security beyond prompts (systems + temporal defenses)

Theme: Dynamic, policy-grounded alignment and multilingual safety transfer

Theme: Stable, efficient scaling for reasoning and agentic RAG

Theme: Long-horizon agent memory + inference infrastructure

Theme: Evaluation reliability, stochasticity, and disagreement as first-class signals

3) Technical synthesis

  • Boundary-centric thinking recurs: AgentSentry's tool-return boundary, ESAA's intention/effect boundary, and the edge-IoT MQTT coordination boundary all treat "where state changes happen" as the key place to measure and control risk.
  • GRPO is becoming a general-purpose substrate: it is used both for reasoning efficiency (adaptive thinking; CEEH) and for agentic RAG training (Search-P1), with papers focusing on stabilizing gradients/rewards under heterogeneity.
  • Process signals are replacing binary outcomes: Search-P1's path-centric scoring and diffusion step stitching both extract learning/selection signal from partially correct trajectories.
  • The scope of "model as a system component" is expanding: SideQuest uses an LRM to manage its own KV cache; AgentSentry uses the model in controlled re-execution; CourtGuard structures evaluation with multiple roles (attacker/defender/judge).
  • Evaluation work is converging on variance decomposition: rater effects (IRT), physician-disagreement ICC, and deep-research-agent stochasticity all formalize where variance comes from rather than treating it as noise.
  • Language distribution shift remains a major jailbreak vector: classical-Chinese prompt optimization nearly fully breaks several closed-source models, while sparse weight editing tries to close multilingual gaps without retraining.
  • Privacy auditing is expanding its threat models: MOFIT drops the "ground-truth caption" assumption for diffusion MIA; DP-Wavelet and DPSQL+ focus on deployable DP under practical constraints (post-processing, minimum frequency rules).
  • Agent benchmarks are getting closer to real environments and more reproducible: MobilityBench's API-replay sandbox and General Agent Evaluation's unified protocol both target reproducibility and cross-system comparability.
  • Interpretability is increasingly tied to intervention: SSM bottleneck steering (Mamba) and certified circuit stability both aim to make mechanistic artifacts actionable and reliable.
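Several items above lean on GRPO-style training. The core group-relative advantage is simple to state; a minimal sketch of that idea (not any paper's exact implementation, which adds clipping, KL terms, and the stabilizers discussed above):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: score each sampled completion relative
    to the other completions drawn for the same prompt, so no separate
    value model is needed."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]

# Completions that beat the group mean get positive advantage.
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

The length-heterogeneity problems the papers target arise precisely because this normalization ignores how many tokens each completion spent earning its reward.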

4) Top 5 papers (with "why now")

1) AgentSentry: Mitigating Indirect Prompt Injection in LLM Agents via Temporal Causal Diagnostics and Context Purification

  • Introduces boundary-anchored counterfactual diagnostics (orig/mask/mask_sanitized/orig_sanitized) to localize intermediary takeover.
  • Reports ASR = 0% on AgentDojo across multiple attack families and backbone models, while preserving high utility under attack.
  • Mitigates by rewriting only untrusted intermediary content into an "evidence-only" form, so the system can continue rather than abort.
  • Stay skeptical: counterfactual runs add inference overhead; the benchmark may underestimate long-horizon, delayed takeovers.
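The counterfactual idea can be illustrated with a simplified sketch. This is a reduced three-variant version of AgentSentry's four-run diagnostic, with a hypothetical `agent_step` interface (the paper's actual API and decision rule will differ):

```python
def diagnose_takeover(agent_step, trusted_ctx, untrusted_ctx, sanitize):
    """Boundary-anchored counterfactual check: rerun the agent step with
    the untrusted intermediary content present, masked, and sanitized.
    `agent_step` maps a context string to a proposed action (hypothetical)."""
    variants = {
        "orig": trusted_ctx + untrusted_ctx,
        "mask": trusted_ctx,  # untrusted intermediary content removed
        "orig_sanitized": trusted_ctx + sanitize(untrusted_ctx),
    }
    actions = {name: agent_step(ctx) for name, ctx in variants.items()}
    # If the action flips only when untrusted content is present, and
    # sanitization restores the masked behavior, flag a takeover.
    takeover = (actions["orig"] != actions["mask"]
                and actions["orig_sanitized"] == actions["mask"])
    return takeover, actions
```

With a toy `agent_step` that emails files whenever an injected instruction appears in context, the `orig` run flips to `send_email` while `mask` and `orig_sanitized` agree on the benign action, so the takeover is attributed to the intermediary content.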

2) Systems-Level Attack Surface of Edge Agent Deployments on IoT

  • Makes agent security measurable: execution-to-audit latency (mean ~23 ms on one path), provenance integrity, egress, failover windows.
  • Shows the MQTT broker accepts spoofing, replay, and direct publishes to safety topics when encryption is not enforced.
  • Demonstrates a silent sovereignty-boundary crossing via forced fallback (DNS to api.anthropic.com) with no application-layer anomaly.
  • Stay skeptical: single testbed/topology; the cloud-egress comparison is not workload-matched.
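The execution-to-audit latency metric above is straightforward to compute from event logs. A minimal sketch, assuming events are `(event_id, timestamp_ms)` tuples (the paper's actual log schema is not specified here):

```python
def execution_to_audit_latencies(exec_events, audit_events):
    """Match each actuation event to the earliest audit record sharing
    its event id and return per-event latency in milliseconds.
    Unmatched executions are returned as audit gaps, a fail-safe
    signal in their own right."""
    first_audit = {}
    for eid, ts in sorted(audit_events, key=lambda e: e[1]):
        first_audit.setdefault(eid, ts)  # keep only the earliest record
    latencies, gaps = [], []
    for eid, ts in exec_events:
        if eid in first_audit and first_audit[eid] >= ts:
            latencies.append(first_audit[eid] - ts)
        else:
            gaps.append(eid)
    return latencies, gaps
```

Tracking the gap list alongside the latency distribution catches the failure mode the paper highlights: actuation that never produces an audit record at all.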

3) CourtGuard: A Model-Agnostic Framework for Zero-Shot Policy Adaptation in LLM Safety

  • Retrieval-grounded adversarial debate produces interpretable rulings with threat scores and policy citations.
  • Reports strong benchmark results (macro Acc 0.87 / F1 0.86) and high recall on a human-validated suite.
  • Demonstrates zero-shot policy swapping by replacing the policy corpus (e.g., Wikipedia vandalism).
  • Stay skeptical: latency/cost of RAG plus multi-turn debate; dependence on the backbone model's format compliance.
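The deterministic, citation-grounded ruling step can be sketched independently of the debate machinery. The thresholds and rule below are illustrative assumptions, not CourtGuard's published mapping:

```python
def map_verdict(threat_score, citations, block_at=0.7, review_at=0.4):
    """Deterministic verdict mapping in the spirit of CourtGuard's
    interpretable rulings: a debate-produced threat score plus cited
    policy clauses becomes an auditable allow/review/block decision."""
    if not citations:
        # No grounding in the policy corpus: never block without a citation.
        return "allow"
    if threat_score >= block_at:
        return "block"
    if threat_score >= review_at:
        return "review"
    return "allow"
```

Making the final mapping deterministic is what keeps the system auditable: the only free-form components are the retrieved clauses and the debate transcript, both of which are logged with the verdict.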

4) AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors

  • Provides 56 hardened target models covering 14 hidden behaviors, with reduced "confession rates" (KTO-trained targets are harder than SFT).
  • Agentic evaluation shows scaffolded black-box tools outperform many white-box tools; effectiveness depends on how the targets were trained.
  • Surfaces concrete "tool-to-agent gaps" (underuse, noise interference, assumption failures).
  • Stay skeptical: targets are "model organisms" fine-tuned from a single base model and may not match naturally emerging hidden behaviors.

5) LLM Novice Uplift on Dual-Use, In Silico Biology Tasks

  • Human-subjects evidence: LLM access raises novice accuracy 4.16×; the Treatment group wins on 7 of 8 benchmarks.
  • Treatment novices sometimes exceed expert baselines, but standalone LLMs often exceed LLM-assisted novices (a prompting/extraction gap).
  • Reports that most Treatment participants (89.6%) found bypassing safeguards easy.
  • Stay skeptical: limitations include shifting model availability, possible information leakage (some questions are findable online), and incomplete blinding.

5) Practical next steps

  • For tool-using agents, add monitoring at tool-return boundaries: log intermediary content, proposed actions, and a lightweight "takeover risk" proxy; measure how often high-impact actions are attributable to intermediary content.
  • In edge/IoT deployments, treat message-bus security as safety-critical: test spoofing, replay, and direct topic publishes on MQTT (or its equivalent); measure execution-to-audit latency and failover blind windows.
  • If fast policy updates are needed, prototype a policy-RAG evaluator with explicit citations and a deterministic verdict mapping; compare its latency against a static classifier.
  • For multilingual safety, evaluate language-shift jailbreaks (including stylistic shifts) and consider sparse interventions; measure utility drift on non-safety tasks.
  • For reasoning efficiency, avoid blunt length penalties: try difficulty-aware exploration control (add entropy only on hard cases) or advantage/gradient regulation under length heterogeneity; track mode collapse.
  • For long-horizon agents, combine semantic KV eviction (garbage-collecting stale tool responses) with hardware-aligned KV quantization; measure throughput and incompletion/parse-failure rates.
  • Upgrade evaluation pipelines: (i) model rater effects when using human labels, (ii) report disagreement-aware metrics, and (iii) for research agents, report run-to-run variance of answers, findings, and citations with module-level attribution.
  • For dual-use governance, include human+LLM uplift studies in risk assessments (not just LLM-only benchmarks), and explicitly test whether safeguards meaningfully slow task completion.
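The rater-effect modeling step above can be prototyped without a psychometrics package. A minimal Rasch-style sketch, assuming binary ratings and fitting rater severity and item quality by plain gradient ascent (a real analysis would use a proper IRT library with standard errors):

```python
import math

def fit_rasch(ratings, n_items, n_raters, steps=500, lr=0.1):
    """Fit item quality (theta) and rater severity (beta) for binary
    ratings [(item, rater, y)] under P(y=1) = sigmoid(theta - beta).
    Severities are mean-centered for identifiability."""
    theta = [0.0] * n_items
    beta = [0.0] * n_raters
    for _ in range(steps):
        g_t = [0.0] * n_items
        g_b = [0.0] * n_raters
        for i, r, y in ratings:
            p = 1.0 / (1.0 + math.exp(-(theta[i] - beta[r])))
            g_t[i] += y - p          # log-likelihood gradient wrt theta
            g_b[r] -= y - p          # and wrt beta (opposite sign)
        theta = [t + lr * g for t, g in zip(theta, g_t)]
        beta = [b + lr * g for b, g in zip(beta, g_b)]
        mean_b = sum(beta) / len(beta)
        beta = [b - mean_b for b in beta]  # center severities
    return theta, beta
```

Once severities are estimated, system comparisons can be run on the severity-adjusted item parameters instead of raw label means, which is the mechanism by which rankings can change.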

Generated from per-paper analysis; no external browsing.