AI Paper Daily (2026-03-02)

Published:

English version: /paper-news/2026-03-02/

Run statistics

  • Candidate papers: 262
  • Selected papers: 30
  • In-depth reads completed: 30
  • Time window (UTC): 2026-02-26T01:00:00Z → 2026-02-28T01:00:00Z (arxiv_announce, expanded=1)
Paper list used for the summary (one row per paper: arXiv ID | Title | Categories | Score | Selection rationale | Tags):

2602.23329 | LLM Novice Uplift on Dual-Use, In Silico Biology Tasks | cs.AI, cs.CL, cs.CR, cs.CY, cs.HC | 96 | Human uplift study on biosecurity-relevant tasks; quantifies dual-use risk from LLM access | dual-use, biosecurity, human-uplift, evaluation, risk-assessment, LLMs
2602.22755 | AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors | cs.CL | 95 | Benchmark of hidden misalignment behaviors + agentic auditing; strong for real-world evals | alignment auditing, benchmarks, hidden behaviors, agent tools, evaluation, deception
2602.22724 | AgentSentry: Mitigating Indirect Prompt Injection in LLM Agents via Temporal Causal Diagnostics and Context Purification | cs.CR, cs.AI | 93 | Directly targets indirect prompt injection in tool/RAG agents with inference-time mitigation | agent security, prompt injection, tool use, RAG safety, inference-time defense, causal diagnostics
2602.22525 | Systems-Level Attack Surface of Edge Agent Deployments on IoT | cs.CR | 93 | Empirical security analysis of edge LLM agents; concrete attack surfaces + measurable security metrics | agent-security, edge-deployment, IoT, attack-surface, systems-security, provenance, MQTT
2602.22787 | Probing for Knowledge Attribution in Large Language Models | cs.CL, cs.AI | 92 | Probe predicts whether outputs rely on prompt vs internal knowledge; useful for hallucination mitigation | hallucinations, attribution, faithfulness, factuality, probes, interpretability
2602.22953 | General Agent Evaluation | cs.AI | 91 | Proposes unified protocol + framework for general-agent evaluation; high reuse for benchmarking agents | agent-evaluation, benchmarks, protocols, framework, general-agents, tool-use
2602.22557 | CourtGuard: A Model-Agnostic Framework for Zero-Shot Policy Adaptation in LLM Safety | cs.AI, cs.LG | 90 | Model-agnostic zero-shot safety policy adaptation via RAG multi-agent debate over policies | policy compliance, RAG, multi-agent debate, governance, safety evaluation, zero-shot
2602.22576 | Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training | cs.CL, cs.IR, cs.LG | 90 | Reward shaping for agentic RAG RL; extracts signal from failures, improves stability/sample efficiency | agentic-RAG, reinforcement-learning, reward-shaping, retrieval, training, efficiency
2602.22603 | SideQuest: Model-Driven KV Cache Management for Long-Horizon Agentic Reasoning | cs.AI, cs.LG | 90 | LRM-driven KV-cache compression for long-horizon agents; targets real bottleneck in agentic reasoning | agents, long-context, kv-cache, memory, efficiency, reasoning, systems
2602.22775 | TherapyProbe: Generating Design Knowledge for Relational Safety in Mental Health Chatbots Through Adversarial Simulation | cs.HC, cs.AI, cs.CL | 89 | Multi-turn relational safety failures in therapy chatbots via adversarial simulation method | mental health, conversational safety, multi-turn evaluation, red teaming, agent simulation
2602.22675 | Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization | cs.CL | 88 | Agentic search framework emphasizing parallel evidence over deep reasoning; targets cost + generalization | agents, search, efficiency, long-horizon, generalization, evidence-gathering
2602.22897 | OmniGAIA: Towards Native Omni-Modal AI Agents | cs.AI, cs.CL, cs.CV, cs.LG, cs.MM | 88 | Omni-modal agent benchmark requiring multi-turn tool use across video/audio/image; likely impactful eval | multimodal, agents, benchmark, tool-use, evaluation, long-horizon, reasoning
2602.22556 | Stable Adaptive Thinking via Advantage Shaping and Length-Aware Gradient Regulation | cs.LG, cs.AI, cs.CL | 88 | RL method to reduce overthinking while preserving hard-query reasoning; practical accuracy/latency tradeoff | reasoning, test-time, rl, efficiency, adaptive-compute, alignment-adjacent
2602.22968 | Certified Circuits: Stability Guarantees for Mechanistic Circuits | cs.AI, cs.CV, cs.CY | 87 | Provable stability guarantees for mechanistic circuit discovery; improves interpretability rigor | mechanistic interpretability, circuits, certification, robustness, auditing
2602.22554 | Multilingual Safety Alignment Via Sparse Weight Editing | cs.LG | 86 | Training-free sparse weight editing to reduce multilingual safety gaps; high practical value | multilingual safety, weight editing, alignment, low-resource languages, guardrails
2602.22638 | MobilityBench: A Benchmark for Evaluating Route-Planning Agents in Real-World Mobility Scenarios | cs.AI | 86 | Real-world route-planning agent benchmark with deterministic API-replay sandbox for reproducibility | benchmarks, agents, tool-use, evaluation, reproducibility, mobility
2602.23271 | Evaluating Stochasticity in Deep Research Agents | cs.AI | 86 | Formalizes and measures stochasticity/variance in research agents; targets deployment reliability | agents, reliability, evaluation, stochasticity, variance, MDP, citations
2602.22719 | Interpreting and Steering State-Space Models via Activation Subspace Bottlenecks | cs.LG | 86 | Mechanistic interp for Mamba SSMs + simple test-time steering via bottleneck scaling; broad gains claimed | interpretability, steering, state-space-models, mamba, mechanistic, test-time
2602.23136 | Modality Collapse as Mismatched Decoding: Information-Theoretic Limits of Multimodal LLMs | cs.CL, cs.AI, cs.LG | 85 | Info-theoretic account of modality collapse as mismatched decoding; useful theory for multimodal LLMs | multimodal-llm, information-theory, decoding, modality-collapse, representation, GMI
2602.22769 | AMA-Bench: Evaluating Long-Horizon Memory for Agentic Applications | cs.AI, cs.LG | 84 | New benchmark for long-horizon agent memory on machine-generated trajectories, not chat | agent memory, benchmarks, long-horizon, evaluation, trajectories
2602.23193 | ESAA: Event Sourcing for Autonomous Agents in LLM-Based Software Engineering | cs.AI | 84 | Event-sourcing architecture for LLM agents: structured intentions + deterministic orchestrator/state | agent-architecture, state, reliability, software-engineering, orchestration, logging
2602.22871 | Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching | cs.CL, cs.AI | 84 | Step-level PRM scoring + stitching across diffusion CoTs; improves test-time scaling beyond voting | reasoning, diffusion-lm, process-reward-model, self-consistency, test-time-scaling
2602.23200 | InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models | cs.LG, cs.CL | 83 | Hardware-aware KV-cache quantization reducing latency/memory for long-context decoding; practical gain | efficiency, KV-cache, quantization, long-context, inference, hardware-aware
2602.23258 | AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning | cs.AI, cs.CL | 82 | Test-time rectify-or-reject pruning to stop error cascades in multi-agent systems | multi-agent systems, robustness, test-time methods, error correction, RAG
2602.22585 | Correcting Human Labels for Rater Effects in AI Evaluation: An Item Response Theory Approach | cs.AI, cs.LG | 82 | IRT/Rasch correction for rater effects in human evals; improves validity of AI evaluation conclusions | evaluation, human-ratings, psychometrics, item-response-theory, measurement, RLHF
2602.22642 | Compress the Easy, Explore the Hard: Difficulty-Aware Entropy Regularization for Efficient LLM Reasoning | cs.LG | 82 | Difficulty-aware entropy regularization to compress CoT without entropy collapse; targets efficient reasoning | reasoning, CoT, efficiency, entropy-regularization, inference-cost, exploration
2602.22758 | Decomposing Physician Disagreement in HealthBench | cs.AI, stat.AP | 82 | Analyzes physician disagreement sources in HealthBench; highlights irreducible uncertainty in eval labels | evaluation, medical-ai, uncertainty, benchmarking, human-judgment, reliability
2602.22584 | Towards Faithful Industrial RAG: A Reinforced Co-adaptation Framework for Advertising QA | cs.CL | 80 | Industrial RAG reliability: joint retrieval + generation optimization to reduce hallucinated URLs | RAG, hallucinations, faithfulness, reinforcement learning, industrial deployment
2602.23079 | Assessing Deanonymization Risks with Stylometry-Assisted LLM Agent | cs.CL, cs.CR, cs.LG | 80 | Stylometry + LLM agent for authorship inference; highlights and mitigates deanonymization/privacy risks | privacy, deanonymization, stylometry, LLM-agents, security, authorship-attribution
2602.23262 | Decomposing Private Image Generation via Coarse-to-Fine Wavelet Modeling | cs.CV, cs.CR | 80 | DP image generation via wavelet coarse-to-fine; aims to preserve quality while improving privacy guarantees | privacy, differential-privacy, image-generation, wavelets, memorization, DP-SGD

AI Paper Insights Brief

2026-03-02

0) Executive takeaways (read this first)

  • Agent security is migrating lower in the stack: several papers show that system architecture (edge IoT deployment, event-sourced orchestration, KV-cache/memory management) often dominates risk and reliability, and frequently bypasses prompt- and model-level defenses.
  • Inference-time, model-agnostic safety is getting sharper: retrieval-grounded policy adjudication (CourtGuard) and counterfactual causal diagnostics for indirect prompt injection (AgentSentry) both report strong results without updating weights, at the cost of extra inference overhead.
  • RL for agents is moving from sparse outcome rewards to structured process signals: path-centric reward shaping for agentic RAG (Search-P1) and difficulty-aware entropy/length control for reasoning compression (CEEH) target the stability and sample-efficiency failures of GRPO-style training.
  • Evaluation is becoming more operational: new benchmarks and toolchains emphasize reproducibility and decomposability (MobilityBench's API replay; AuditBench for hidden behaviors; AMA-Bench for agent memory; General Agent Evaluation's unified protocol), while other work quantifies rater noise (IRT rater effects; decomposing physician disagreement).
  • Compute efficiency for long-horizon agents is now a first-class research axis: semantic KV eviction (SideQuest) and hardware-aware KV quantization (InnerQ) report substantial throughput/latency gains at limited accuracy loss, directly enabling longer agent horizons under a fixed budget.
  • Dual-use risk is being measured on humans, not just models: a longitudinal uplift study finds that LLM access makes novices significantly more accurate on biosecurity-relevant in silico tasks (OR 4.16), and most users reported "no difficulty" bypassing safeguards.

2) Key themes (clusters)

Theme: Systems-level agent security and governance (beyond prompts)

Theme: Dynamic policy enforcement and hidden-behavior auditing

Theme: RL stabilization for reasoning and agentic RAG (process signals over sparse outcomes)

Theme: Long-horizon agent efficiency (KV cache, memory, and parallel search)

Theme: Evaluation reliability and reproducibility (humans, APIs, and protocols)

3) Technical synthesis

  • Boundary control is converging: AgentSentry's tool-return boundary diagnostics, ESAA's intention/effect boundaries, and the edge-IoT MQTT boundary all treat "where state crosses a trust domain" as the key security lever.
  • GRPO is the shared substrate, but the papers diverge on how to fix its pathologies: CPAS/LAGR targets length heterogeneity and mode collapse; Search-P1 densifies rewards via plan/path scoring; CEEH counters entropy collapse with difficulty-aware entropy regularization.
  • "Process supervision" is being operationalized without full supervision: Search-P1 uses an offline reference planner plus step coverage; diffusion stitching uses PRM step scores; industrial RAG uses a multi-dimensional reward that includes URL-validity checks.
  • RAG is splitting into two concerns: (i) retrieval quality and coverage (GraphRAG plus parallel channels; agentic multi-step search), and (ii) faithful use of evidence (URL validity, faithfulness rewards, knowledge-attribution probes).
  • Agent reliability is increasingly measured by variance, not just the mean: stochasticity metrics for deep research agents (total variance over answers/findings/citations) complement success-rate leaderboards and reveal how early-step randomness is amplified.
  • Memory and the KV cache are treated as first-class optimization targets: SideQuest reduces peak token usage and KV reads; InnerQ targets decode-phase matmul layout to cut latency, not just memory footprint.
  • Evaluation infrastructure is itself becoming a research contribution: deterministic replay (MobilityBench), unified-protocol toolchains (Exgentic), and audit benchmarks with non-confessing targets (AuditBench) aim to prevent overfitting to benchmark quirks.
  • Human factors are entering capability/risk measurement: the bio-uplift study shows novices improve with LLM help yet may still trail a pure-LLM baseline; safety governance must model the human-LLM system, not just the model.
  • Interpretability is forking: neuron/subspace editing for multilingual safety (sparse weight editing), formal robustness guarantees for circuits (Certified Circuits), and SSM-specific bottleneck steering (Mamba subspace bottlenecks).
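The variance-over-mean point above is straightforward to operationalize. The sketch below is a minimal, illustrative take, not the paper's exact formulation: repeat an agent run several times, then report final-answer disagreement plus the dispersion of citation sets via pairwise Jaccard distance.

```python
from statistics import mean

def run_variance(runs):
    """Toy stochasticity metrics over repeated agent runs.

    Each run is a dict with a final 'answer' string and a set of 'citations'.
    Returns (answer_disagreement, mean_pairwise_citation_jaccard_distance).
    """
    answers = [r["answer"] for r in runs]
    majority = max(set(answers), key=answers.count)
    # Fraction of runs whose final answer deviates from the majority answer.
    disagreement = sum(a != majority for a in answers) / len(answers)

    # Pairwise Jaccard distance between citation sets measures how much the
    # supporting evidence drifts across runs even when answers agree.
    dists = []
    for i in range(len(runs)):
        for j in range(i + 1, len(runs)):
            a, b = runs[i]["citations"], runs[j]["citations"]
            union = a | b
            dists.append(1 - len(a & b) / len(union) if union else 0.0)
    return disagreement, mean(dists) if dists else 0.0
```

Tracked over repeated runs of the same task, both numbers complement a plain success rate: high citation dispersion with low answer disagreement, for example, signals evidence-level stochasticity that a leaderboard mean would hide.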

4) Top 5 papers (with "why now")

1) AgentSentry: Mitigating Indirect Prompt Injection in LLM Agents via Temporal Causal Diagnostics and Context Purification

  • Introduces a counterfactual, temporal-causal framework for multi-turn indirect prompt injection (IPI), focused on the tool-return boundary.
  • Reports ASR = 0% on AgentDojo across three IPI families and three black-box LLMs, while preserving high utility (UA reported per model).
  • Ablations show the purified counterfactual and temporal diagnostics are essential (removing them raises ASR to roughly 22% on one task suite).
  • Skeptic's note: total overhead may be substantial (up to four mechanisms per boundary, with sampling); the benchmark setup uses a lightweight K/w.
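A drastically simplified version of the boundary-anchored idea can be sketched: compare the agent's next planned action given the raw tool return versus a purified version; a divergence suggests the tool return, rather than the user goal, is steering the agent. The `plan` callable, the purification regex, and the decision rule below are all illustrative assumptions, not AgentSentry's actual mechanisms.

```python
import re

# Illustrative purifier: strip imperative-looking injected lines from a
# tool return. A real system would need far more robust rewriting.
INJECTION_PATTERN = re.compile(
    r"^(ignore|disregard|you must|new instruction).*$",
    re.IGNORECASE | re.MULTILINE,
)

def purify(tool_output: str) -> str:
    return INJECTION_PATTERN.sub("", tool_output).strip()

def injection_flag(plan, user_goal: str, tool_output: str) -> bool:
    """Counterfactual check at the tool-return boundary.

    `plan(goal, observation)` is any deterministic planner stub returning
    the agent's next action. If the action changes once suspicious lines
    are removed, the raw return was driving the agent's behavior.
    """
    action_raw = plan(user_goal, tool_output)
    action_purified = plan(user_goal, purify(tool_output))
    return action_raw != action_purified
```

Even this toy version makes the tradeoff in the skeptic's note concrete: every tool-return boundary costs at least one extra planner call.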

2) CourtGuard: A Model-Agnostic Framework for Zero-Shot Policy Adaptation in LLM Safety

  • Makes safety policies updatable at inference time via RAG-based adversarial debate and citation-grounded adjudication.
  • Reports strong benchmark performance (macro Acc 0.87 / F1 0.86) and high recall on a human-verified suite.
  • Demonstrates policy/domain switching (Wikipedia vandalism) without updating weights.
  • Skeptic's note: added inference latency and cost; performance is bounded by policy-corpus coverage and the backbone model's format-following ability.
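CourtGuard's debate and adjudication stages are LLM-driven and cannot be reproduced deterministically here, but the retrieval substrate they rest on can be sketched. The clause store, clause IDs, and token-overlap scoring below are illustrative assumptions, not the paper's implementation.

```python
def retrieve_clauses(policy_corpus, query, k=2):
    """Rank policy clauses by token overlap with the query.

    `policy_corpus` maps clause IDs to clause text. Returns the top-k
    clause IDs, which a downstream judge would be required to cite in
    its verdict (the citation-grounding idea in miniature).
    """
    q = set(query.lower().split())
    scored = sorted(
        policy_corpus.items(),
        key=lambda kv: -len(q & set(kv[1].lower().split())),
    )
    return [cid for cid, _ in scored[:k]]
```

Swapping the clause store is what makes the policy "updatable at inference time": switching domains (say, to vandalism rules) is a corpus change, not a weight update.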

3) AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors

  • Provides 56 non-confessing target models covering 14 hidden behaviors, with multiple instillation and anti-confession training recipes.
  • Finds that scaffolded black-box tools perform best for agentic auditing, and reveals a tool-to-agent gap (a static tool signal does not translate into agentic success).
  • Shows auditability depends heavily on the training recipe (TD+KTO is hardest).
  • Skeptic's note: targets are LoRA fine-tunes with explicit behaviors, which may be easier to audit than diffuse emergent behaviors.

4) Systems-Level Attack Surface of Edge Agent Deployments on IoT

  • Empirically shows that an edge-local swarm can reduce egress (0 external IPs/bytes in one session) but introduces MQTT C2 vulnerabilities and failover blind spots.
  • Shows provenance metadata is trivially bypassable (missing or forged senders, replays, and direct publishes to secure topics are all accepted).
  • Measures concrete operational metrics (execution-to-audit latency; blackout windows; DNS-only sovereignty crossings).
  • Skeptic's note: a single small testbed; no mitigation prototype evaluated; the cloud comparison is not workload-matched.

5) LLM Novice Uplift on Dual-Use, In Silico Biology Tasks

  • Human-subject evidence: compared with internet-only access, LLM access made novices significantly more accurate on a biology task suite (OR 4.16; on 7 of 8 benchmarks).
  • Highlights a governance-relevant detail: an LLM alone often beats LLM-assisted novices, so uplift depends on user strategy and task structure.
  • Reports that most users had no difficulty bypassing safeguards, which matters for dual-use risk assessment.
  • Skeptic's note: not double-blind; model availability changed mid-study; in silico tasks only (transfer to wet-lab work is unknown).
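A note on reading OR 4.16: an odds ratio compares odds, not accuracies. The worked example below uses invented counts (not the study's data) to show how an OR near 4.16 can coexist with a much smaller raw accuracy multiplier.

```python
def odds_ratio(success_a, fail_a, success_b, fail_b):
    """Odds ratio for group A vs group B from a 2x2 contingency table."""
    return (success_a / fail_a) / (success_b / fail_b)

# Invented counts: 25/40 correct with LLM access vs 10/35 correct without.
# Odds are 25:15 vs 10:25, so OR = (25/15) / (10/25) = 25/6, about 4.17,
# while raw accuracy only rises from about 0.29 to 0.625 (roughly 2.2x).
```

This is why an odds ratio should not be read as "accuracy improved 4x"; the two quantities diverge whenever the baseline rate is not tiny.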

5) Practical next steps

  • Agent security: add encryption enforcement and ACLs to the agent coordination plane (e.g., MQTT), and measure whether provenance becomes non-bypassable under adversarial publishes and replays.
  • Monitor sovereignty boundaries: treat "fallback to cloud inference" as a security event; log and alert on DNS/API boundary crossings and correlate them with agent-level traces.
  • Adopt boundary-anchored defenses: prototype AgentSentry-style tool-return checks (even simplified ones), and measure the ASR/utility tradeoff under multi-turn IPI.
  • Make policy updates operational: build a CourtGuard-like policy RAG store over organizational governance documents; measure latency and failure modes (formatting/parsing) on smaller backbone models.
  • Train agents with process rewards: if you use GRPO/RLVR, test path-centric or difficulty-aware shaping (in the spirit of Search-P1/CEEH), and explicitly monitor entropy and mode-collapse metrics.
  • Optimize long-horizon cost: evaluate SideQuest-style semantic eviction and/or InnerQ-style KV quantization on your agent workloads; track KV reads, throughput, and task completion rate.
  • Benchmark memory more realistically: run your memory system on agent-trajectory benchmarks (AMA-Bench style), and add a needle protocol to separate construction loss from retrieval loss.
  • Harden evaluation pipelines: apply IRT/MFRM corrections when human ratings are involved; for tool/API evaluations, prefer replayable sandboxes (the MobilityBench pattern) to reduce variance.
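The IRT/MFRM suggestion in the list above has a crude, dependency-free approximation: on a fully-crossed design, estimate each rater's severity as their mean offset from the grand mean and subtract it. This is a facets-style adjustment in spirit only, not a fitted Rasch model, and the data layout is an assumption for illustration.

```python
from statistics import mean

def adjust_for_severity(ratings):
    """Crude rater-severity correction over a fully-crossed design.

    `ratings[rater][item]` holds raw scores. Each rater's severity is
    their mean offset from the grand mean; adjusted scores subtract it,
    so a uniformly harsh rater no longer drags down every item.
    """
    grand = mean(s for per_item in ratings.values() for s in per_item.values())
    adjusted = {}
    for rater, per_item in ratings.items():
        severity = mean(per_item.values()) - grand
        adjusted[rater] = {item: s - severity for item, s in per_item.items()}
    return adjusted
```

For serious use (sparse rating designs, nonlinearity in the score scale, rater-by-item interactions), a proper many-facet Rasch fit is the right tool; this sketch only removes the additive severity component.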

Generated from per-paper analysis; no external browsing.