AI Paper Daily (2026-03-26)

Published:

English version: /paper-news/2026-03-26/

Run statistics

  • Candidate papers: 232
  • Selected papers: 30
  • Deep reads completed: 30
  • Time window (UTC): 2026-03-24T00:00:00Z → 2026-03-25T00:00:00Z (arxiv_announce, expanded=0)
Paper list used for the summary (arXiv ID, title/link, categories, score, selection reason, tags):

  • 2603.22928 (cs.CR · 96): SoK: The Attack Surface of Agentic AI -- Tools, and Autonomy [PDF]
    Why selected: SoK mapping agentic AI attack surface (RAG/tools/multi-agent); strong taxonomy + synthesis for defenders
    Tags: agentic-ai, security, prompt-injection, tool-security, rag-poisoning, multi-agent, survey
  • 2603.23064 (cs.CR, cs.AI, cs.SI · 95): Mind Your HEARTBEAT! Claw Background Execution Inherently Enables Silent Memory Pollution [PDF]
    Why selected: Finds agent memory pollution via background "heartbeat"; concrete vuln model for personal agents
    Tags: agent-security, memory-poisoning, prompt-injection, tool-agents, provenance, background-execution
  • 2603.22853 (cs.CR, cs.AI · 93): Agent Audit: A Security Analysis System for LLM Agent Applications [PDF]
    Why selected: Practical static/security analysis for LLM agent apps (code + configs + creds + privileges); deployable outputs
    Tags: agents, appsec, static-analysis, mcp, credentials, tooling, deployment-security
  • 2603.23171 (cs.CR, cs.AI, cs.CY, cs.LG · 92): Robust Safety Monitoring of Language Models via Activation Watermarking [PDF]
    Why selected: Frames robust LLM monitoring as a security game; targets adaptive evasion with activation watermarking
    Tags: monitoring, misuse-detection, adaptive-adversary, watermarking, inference-security, safety
  • 2603.23269 (cs.CR, cs.AI, cs.LG · 92): Not All Tokens Are Created Equal: Query-Efficient Jailbreak Fuzzing for LLMs [PDF]
    Why selected: Query-efficient jailbreak fuzzing via token importance; enables stronger red-teaming under budgets
    Tags: jailbreaks, fuzzing, adversarial-prompts, red-teaming, surrogate-models, evaluation
  • 2603.23184 (cs.CL, cs.AI, stat.AP · 92): ImplicitRM: Unbiased Reward Modeling from Implicit Preference Data for LLM alignment [PDF]
    Why selected: Unbiased reward modeling from implicit feedback; tackles bias + missing negatives for RLHF at scale
    Tags: alignment, RLHF, reward-modeling, implicit-feedback, debiasing, preference-learning
  • 2603.22934 (cs.AI · 90): ProGRank: Probe-Gradient Reranking to Defend Dense-Retriever RAG from Corpus Poisoning [PDF]
    Why selected: Training-free retriever-side defense for RAG corpus poisoning using probe gradients + perturbation tests
    Tags: rag, retrieval-security, data-poisoning, dense-retriever, defense, robustness
  • 2603.22767 (cs.AI, cs.CL · 90): Can LLM Agents Generate Real-World Evidence? Evaluating Observational Studies in Medical Databases [PDF]
    Why selected: RWE-bench tests long-horizon agents executing real DB observational studies; structured evidence eval
    Tags: agent-evaluation, benchmarks, tool-use, long-horizon, databases, healthcare
  • 2603.22829 (cs.AI · 90): Improving Safety Alignment via Balanced Direct Preference Optimization [PDF]
    Why selected: Targets safety-alignment overfitting in DPO; proposes balanced objective to improve robustness
    Tags: alignment, DPO, RLHF, safety, overfitting, preference-learning
  • 2603.23268 (cs.LG, cs.AI · 88): SafeSeek: Universal Attribution of Safety Circuits in Language Models [PDF]
    Why selected: Optimization-based attribution of safety circuits; aims for generalizable mechanistic safety interpretability
    Tags: mechanistic-interpretability, safety-circuits, jailbreaks, backdoors, attribution, sparse-masks
  • 2603.22744 (cs.AI · 88): Beyond Binary Correctness: Scaling Evaluation of Long-Horizon Agents on Subjective Enterprise Tasks [PDF]
    Why selected: LH-Bench evaluates subjective enterprise long-horizon workflows with rubrics + artifact-based signals
    Tags: agent-evaluation, benchmarks, enterprise, rubrics, long-horizon, LLM-judges
  • 2603.23047 (cs.CL, cs.AI, cs.CE · 88): Parametric Knowledge and Retrieval Behavior in RAG Fine-Tuning for Electronic Design Automation [PDF]
    Why selected: RAG fine-tuning analysis + new factual attribution eval (TriFEX) and parametric-knowledge metric (PKP)
    Tags: RAG, evaluation, factuality, attribution, metrics, fine-tuning
  • 2603.23114 (cs.AI, cs.CL, cs.CY, cs.HC · 88): Between Rules and Reality: On the Context Sensitivity of LLM Moral Judgment [PDF]
    Why selected: Contextual moral dilemmas dataset; shows LLM moral sensitivity differs from humans; control problem
    Tags: safety, alignment, evaluation, moral-judgment, dataset, context-sensitivity, human-comparison
  • 2603.22868 (cs.CR, cs.AI · 87): Agent-Sentry: Bounding LLM Agents via Execution Provenance [PDF]
    Why selected: Execution provenance to bound/validate agent behavior vs irrelevant/compromised actions; security + privacy angle
    Tags: agents, provenance, runtime-verification, policy-bounding, auditability, security
  • 2603.22882 (cs.LG, cs.CV · 86): TreeTeaming: Autonomous Red-Teaming of Vision-Language Models via Hierarchical Strategy Exploration [PDF]
    Why selected: Autonomous red-teaming for VLMs via hierarchical strategy exploration; seeks novel/diverse exploits
    Tags: red-teaming, vlm-safety, adversarial-testing, automation, evaluation, attack-discovery
  • 2603.23355 (cs.LG, cs.CL · 86): Off-Policy Value-Based Reinforcement Learning for Large Language Models [PDF]
    Why selected: Off-policy value-based RL for LLMs with replay; could improve sample efficiency for reasoning RL
    Tags: RL-for-LLMs, off-policy, value-learning, reasoning, verification-signals, sample-efficiency
  • 2603.22751 (cs.CR · 85): CIPL: A Target-Independent Framework for Channel-Inversion Privacy Leakage in Agents [PDF]
    Why selected: General framework for privacy leakage in agents as channel inversion; broadens beyond memory leakage
    Tags: privacy, agents, information-leakage, attack-framework, side-channels, threat-modeling
  • 2603.23117 (cs.CR · 84): TRAP: Hijacking VLA CoT-Reasoning via Adversarial Patches [PDF]
    Why selected: Shows CoT in VLA robots enables targeted control hijacking via adversarial patches; important embodied risk
    Tags: robotics, vla, chain-of-thought, adversarial-patches, embodied-security, attack
  • 2603.22717 (cs.CR, cs.SE · 84): Does Teaming-Up LLMs Improve Secure Code Generation? A Comprehensive Evaluation with Multi-LLMSecCodeEval [PDF]
    Why selected: Evaluates multi-LLM collaboration + static analysis for secure codegen; practical security pipeline data
    Tags: secure-code-generation, LLM-ensembles, static-analysis, software-security, evaluation
  • 2603.23292 (cs.AI, cs.CL · 84): LLM Olympiad: Model Evaluation Needs a Sealed Exam [PDF]
    Why selected: Proposes sealed-exam 'LLM Olympiad' to reduce benchmark leakage/chasing and improve trust in evals
    Tags: evaluation, benchmarks, data-contamination, leaderboards, reproducibility, governance
  • 2603.23485 (cs.CL, cs.AI, cs.CY · 84): Failure of contextual invariance in gender inference with large language models [PDF]
    Why selected: Finds large instability under contextually equivalent prompts in gender inference; evaluation warning
    Tags: reliability, robustness, bias, evaluation, prompt-sensitivity, context-invariance
  • 2603.23501 (cs.CV, cs.AI, cs.CL · 83): MedObvious: Exposing the Medical Moravec's Paradox in VLMs via Clinical Triage [PDF]
    Why selected: Benchmark for medical VLM input-validity sanity checks; targets a safety-critical failure mode
    Tags: evaluation, medical-ai, vlm, robustness, input-validation, benchmark
  • 2603.22714 (cs.CY, cs.AI · 83): PopResume: Causal Fairness Evaluation of LLM/VLM Resume Screeners with Population-Representative Dataset [PDF]
    Why selected: PopResume dataset enables causal/path-specific fairness audits for LLM/VLM resume screeners at scale
    Tags: fairness, auditing, datasets, causal-evaluation, hiring, VLM
  • 2603.22812 (cs.CL · 83): Efficient Hallucination Detection: Adaptive Bayesian Estimation of Semantic Entropy with Guided Semantic Exploration [PDF]
    Why selected: Adaptive semantic-entropy hallucination detection cuts sampling cost by adjusting budget to uncertainty
    Tags: hallucination, uncertainty, semantic-entropy, Bayesian, efficient-eval, reliability
  • 2603.23483 (cs.CV, cs.CL · 82): SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning [PDF]
    Why selected: Speculative planning to cut agentic multimodal tool-loop latency; system-level speedups for MLLM agents
    Tags: agentic-MLLM, speculative-decoding, tool-use, latency, planning, efficiency
  • 2603.22754 (cs.CL · 82): PRISM: A Dual View of LLM Reasoning through Semantic Flow and Latent Computation [PDF]
    Why selected: Joint step + layer reasoning diagnostics; identifies failure modes like verification loops/divergence
    Tags: interpretability, reasoning, analysis, hidden-states, failure-modes, diagnostics
  • 2603.22823 (cs.AI · 81): Empirical Comparison of Agent Communication Protocols for Task Orchestration [PDF]
    Why selected: Benchmark comparing tool-only vs delegation vs hybrid multi-agent protocols; useful for agent design
    Tags: agents, multi-agent, tool-use, orchestration, benchmarks, protocols
  • 2603.23013 (cs.CL · 81): Knowledge Access Beats Model Size: Memory Augmented Routing for Persistent AI Agents [PDF]
    Why selected: Memory-augmented routing for persistent agents; big cost cuts without training; strong deployment angle
    Tags: agents, memory, efficiency, routing, long-term-interaction, serving
  • 2603.23149 (cs.AI · 80): Describe-Then-Act: Proactive Agent Steering via Distilled Language-Action World Models [PDF]
    Why selected: Fast 'describe-then-act' steering layer predicts outcomes from latents + actions; aims at proactive safety
    Tags: agents, world-models, steering, safety, latents, planning
  • 2603.22879 (cs.LG, cs.AI · 79): Confidence Calibration under Ambiguous Ground Truth [PDF]
    Why selected: Calibration breaks with annotator disagreement; proposes ambiguity-aware post-hoc calibrators
    Tags: calibration, uncertainty, evaluation, ambiguous-labels, reliability, post-hoc

AI Paper Insights Brief

2026-03-26

0) Executive takeaways (read this first)

  • Agent security is shifting from "prompt injection" to the "system surface": several papers show the dominant risks come from channels (tool arguments/returns, trajectories), architecture (heartbeat background execution), and runtime behavior (execution provenance), not just the final text.
  • Practical defenses are becoming more systematic and measurable: retriever-side reranking against RAG poisoning (without calling the generator), provenance-graph runtime constraints on tool calls, and static analysis of agents' MCP configurations all report strong security/utility trade-offs.
  • Evaluation is moving beyond single-number accuracy: long-horizon subjective enterprise tasks (rubrics + artifact contracts + human verification), end-to-end medical observational studies against real database backends, and "input-validity sanity checks" for medical VLMs reveal failures that classic benchmarks miss.
  • Alignment training and monitoring are becoming more distribution-aware: B-DPO targets imbalanced comprehension of preference pairs; ImplicitRM makes reward modeling from implicit feedback unbiased under missing negatives and propensity bias; activation watermarking adds keyed, adversary-aware monitoring.
  • Context sensitivity is a recurring failure mode: tiny context changes can flip gender-pronoun inference; moral judgments shift with subtle cues and diverge from humans; these effects are steerable (activation steering) but not free (small capability drops).
  • Cost/latency optimization increasingly relies on "gating + verification": speculative "tool-free" bypasses for agentic MLLMs and memory-augmented routing deliver large speedups and cost cuts, but depend on confidence/calibration and retrieval fidelity.

2) Key themes (clusters)

Theme: Agent security across channels, tools, and autonomy

Theme: RAG and memory: robustness, poisoning, and "knowledge vs model size"

Theme: Evaluation for long-horizon, subjective, and high-stakes workflows

Theme: Alignment and monitoring under distribution shift and adversaries

Theme: Context sensitivity (and steerability) in fairness and moral judgment

3) Technical synthesis

  • Several works converge on a "gate + fallback" architecture: SpecEyes answers tool-free behind a separability gate; memory routing escalates via logprob-confidence gating; Agent-Sentry gates tool calls with a provenance graph + intent review; hallucination detection gates sampling with an entropy-variance stopping rule.
  • Judge reliance is everywhere, but usage differs: LH-Bench uses multiple judges + human verification; RWE-bench uses gating questions and cohort review; TreeTeaming and MedObvious highlight format/judge-sensitivity risks; TriFEX uses LLM attribution with measured accuracy (~80%).
  • Safety metrics are becoming rate-based and lifecycle-wide: secure-code rates across generation/detection/patching; ASR vs utility; CER/AER for leakage; retrieval-stage poisoning hit/recall plus downstream ASR.
  • Training-free defenses are favored for deployment: ProGRank reranks without retraining; Agent Audit is static; SpecEyes is routing; memory routing is retrieval + confidence; this contrasts with fine-tuning-based monitoring (activation watermarking) and alignment (B-DPO).
  • Causal/structural decompositions are spreading: PopResume decomposes protected-attribute effects into direct vs mediated (business necessity vs red-line/proxy); ImplicitRM decomposes implicit feedback into preference vs behavioral propensity; the calibration paper decomposes "true-label" vs voted-label calibration targets.
  • Sparse structure keeps reappearing: TriageFuzz finds refusals dominated by sparse token regions; SafeSeek finds extremely sparse safety/backdoor circuits; both suggest defenses and attacks can focus on small substructures.
  • Context and modality add direct access to sensitive attributes: PopResume shows photos increase the direct effect (NDE) in VLM screeners; TRAP shows CoT can dominate action generation and be hijacked via visual patches.
  • Cost is becoming a first-class metric: multi-LLM secure coding quantifies CodeQL runtime dominance; MCP vs A2A shows crossover points in token inflation; SpecEyes formalizes throughput speedup; memory routing reports ~96% effective cost reduction relative to a large model.
  • Reliability hinges on intermediate artifacts: cohort audit tables, checklists/screenshots, tool-call provenance, and triple-based attribution are increasingly used as "contracts" for evaluation and debugging.
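
The gate + fallback pattern above can be sketched generically. A minimal, hypothetical illustration (the threshold, the logprob-based confidence signal, and both paths are invented for exposition, not taken from any of the papers):

```python
import math

def logprob_confidence(token_logprobs):
    """Geometric-mean token probability as a crude confidence signal."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def gated_answer(query, fast_path, slow_path, confidence_fn, threshold=0.8):
    """Generic gate + fallback: take the cheap path when the gate's
    confidence signal clears the threshold, otherwise escalate."""
    draft, token_logprobs = fast_path(query)
    if confidence_fn(token_logprobs) >= threshold:
        return draft, "fast"
    return slow_path(query), "fallback"

# Toy paths: the fast path returns a draft plus its token logprobs.
fast = lambda q: ("draft answer", [-0.05, -0.10, -0.02])
slow = lambda q: "verified answer"

answer, route = gated_answer("q", fast, slow, logprob_confidence)
```

The same skeleton covers all four instances in the list: only the gate signal (separability, logprob, provenance match, entropy variance) and the fallback (tool call, larger model, intent judge, more samples) change.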

4) Top 5 papers (with "why now")

1) Agent-Sentry: Bounding LLM Agents via Execution Provenance

  • Builds a capability graph from trajectories (benign/adversarial/ambiguous) and intercepts tool calls at runtime.
  • Uses an intent-alignment judge restricted to trusted inputs only (prompt + tool specs + tool history), explicitly excluding retrieved content.
  • Reports a strong security/utility trade-off on a new 6,733-trajectory benchmark (94.61% utility and 9.46% ASR at full coverage).
  • Caveats: depends on coverage, and mimicry attacks (benign path + malicious arguments) may bypass it.
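
A much-simplified sketch of the allow/review/block idea, reducing the paper's provenance graph to an edge allow/deny list with invented tool names (Agent-Sentry's actual mechanism is richer):

```python
def learn_edges(trajectories):
    """Collect (previous_tool, next_tool) transitions from logged runs."""
    edges = set()
    for calls in trajectories:
        for prev, nxt in zip(["<start>"] + calls, calls):
            edges.add((prev, nxt))
    return edges

def gate(prev_tool, next_tool, benign_edges, adversarial_edges):
    """Block known-bad transitions, allow known-good ones, and send
    anything unseen to an intent-alignment reviewer."""
    if (prev_tool, next_tool) in adversarial_edges:
        return "block"
    if (prev_tool, next_tool) in benign_edges:
        return "allow"
    return "review"

# Invented tool names; real edge sets would come from audited logs.
benign = learn_edges([["search", "read_file", "summarize"]])
adversarial = learn_edges([["read_file", "send_email"]])
```

The mimicry caveat is visible even here: a malicious call that reuses a benign (prev, next) edge with poisoned arguments passes the gate, which is why the paper's judge also inspects arguments.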

2) ProGRank: Probe-Gradient Reranking to Defend Dense-Retriever RAG from Corpus Poisoning

  • Training-free, retriever-side defense: exploits probe-gradient instability under perturbation plus score gating.
  • Reduces poisoned Top-K exposure and reports strong downstream robustness (macro-averaged, judge-based ASR at Top-5 reported as 0.000 in its evaluation).
  • Much faster than expensive baselines (4.73 s/query on average vs 118.17 s/query for RAGuard).
  • Caveats: compute overhead depends on probe repetitions and the candidate buffer; the clean-utility trade-off is dataset-dependent.
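
One way to picture a perturbation-based reranking defense. This is only loosely inspired by ProGRank: it replaces probe gradients with random embedding noise and a simple instability penalty, so treat every detail as an assumption:

```python
import numpy as np

def rerank(query_emb, cand_embs, n_probes=8, noise=0.05, penalty=1.0, seed=0):
    """Score candidates by dot product, then subtract a penalty for how
    unstable that score is under small random embedding perturbations."""
    rng = np.random.default_rng(seed)
    scored = []
    for idx, cand in enumerate(cand_embs):
        base = float(query_emb @ cand)
        probes = [float(query_emb @ (cand + rng.normal(0.0, noise, cand.shape)))
                  for _ in range(n_probes)]
        instability = float(np.std(probes))
        scored.append((base - penalty * instability, idx))
    return [idx for _, idx in sorted(scored, reverse=True)]

# Toy embeddings: candidate 0 is clearly on-topic for the query.
ranking = rerank(np.array([1.0, 0.0]),
                 [np.array([0.9, 0.1]), np.array([0.1, 0.9])])
```

The cost caveat is explicit in the signature: total extra work scales with n_probes times the candidate buffer size, which is the same trade-off the paper flags.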

3) Agent Audit: A Security Analysis System for LLM Agent Applications

  • Static analysis over Python agents plus MCP configuration semantics, with tool-boundary taint propagation and confidence tiers.
  • New benchmark AVB (22 samples, 42 vulnerabilities); reports 95.24% recall (40/42) vs roughly 24–30% recall for Semgrep/Bandit.
  • CI/IDE-friendly output (SARIF), with sub-second scans on 22k LOC.
  • Caveats: intra-procedural taint only; MCP heuristics introduce false positives; limited JS/TS support.
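
For intuition, a toy static check in the same spirit, far simpler than Agent Audit's tool-boundary taint propagation; the rule set and variable names are invented:

```python
import ast

SECRET_NAMES = {"api_key", "token", "password", "secret"}

def scan(source):
    """Flag eval/exec calls and string constants assigned to
    credential-looking variable names in agent source code."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                and node.func.id in {"eval", "exec"}):
            findings.append((node.lineno, f"dynamic-execution: {node.func.id}"))
        if isinstance(node, ast.Assign):
            for target in node.targets:
                if (isinstance(target, ast.Name)
                        and target.id.lower() in SECRET_NAMES
                        and isinstance(node.value, ast.Constant)
                        and isinstance(node.value.value, str)):
                    findings.append((node.lineno,
                                     f"hard-coded-credential: {target.id}"))
    return findings

issues = scan("api_key = 'sk-123'\nresult = eval(tool_output)\n")
```

Each finding carries a line number and a rule label, which is exactly the shape that serializes cleanly into SARIF for CI consumption.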

4) PopResume: Causal Fairness Evaluation of LLM/VLM Resume Screeners with Population-Representative Dataset

  • Provides a demographically grounded resume dataset (60,884) and a path-specific causal decomposition: direct vs mediated, with mediation split into business necessity (BIE) vs red-line/proxy (RIE).
  • Finds discrimination patterns that outcome-only metrics mask (e.g., cancellation: TE ≈ 0 with nonzero NDE/NIE; mixed mediation in 53 of 120 cases).
  • Shows that adding photos increases the magnitude of direct discrimination in VLMs (NDE increases in 8 of 20 pairs).
  • Caveats: synthetic rendering and US-specific demographic assumptions; the mediator grouping (B vs R) depends on context and jurisdiction.
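
The direct/indirect split can be made concrete with counterfactual scoring. A schematic example with an invented screener and mediator, not PopResume's estimator:

```python
def decompose(f, mediators, a0, a1):
    """Split the total effect of switching attribute a0 -> a1 into a
    direct path (attribute changes, mediators frozen at a0) and an
    indirect path (mediators change, attribute held at a1), so that
    TE = NDE + NIE."""
    te = f(a1, mediators(a1)) - f(a0, mediators(a0))
    nde = f(a1, mediators(a0)) - f(a0, mediators(a0))
    nie = f(a1, mediators(a1)) - f(a1, mediators(a0))
    return te, nde, nie

# Invented screener: a direct penalty on the attribute plus a penalty on
# a correlated mediator (a gap-year flag).
mediators = lambda a: {"gap_year": 1 if a == "g1" else 0}
screener = lambda a, m: 0.8 - (0.1 if a == "g1" else 0.0) - 0.2 * m["gap_year"]

te, nde, nie = decompose(screener, mediators, "g0", "g1")
```

The cancellation pattern the paper highlights falls out of the same arithmetic: if the direct and mediated terms have opposite signs, TE can sit near zero while both NDE and NIE are substantial.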

5) Robust Safety Monitoring of Language Models via Activation Watermarking

  • Keyed internal monitoring: fine-tunes harmful activations to align with a secret direction; detects at inference via cosine similarity.
  • Reports higher AUROC across jailbreak families (e.g., AutoDAN AUROC 0.9048) and improved low-FPR operation; adds a "secret extraction" attribution game (~80% diagonal accuracy).
  • Low inference overhead (a projection) rather than extra forward passes through a guard model.
  • Caveats: assumes a black-box attacker; no provable guarantees; some utility drop (notably −7.13 pp on GSM8K).
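
The detection side reduces to a thresholded projection. A toy sketch with random vectors standing in for real model activations (the threshold and dimensionality are arbitrary assumptions):

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def flag_harmful(activation, secret_direction, threshold=0.7):
    """Flag an activation whose cosine similarity with the secret
    watermark direction exceeds the threshold."""
    return cosine(activation, secret_direction) >= threshold

rng = np.random.default_rng(0)
secret = rng.normal(size=64)
secret /= np.linalg.norm(secret)

aligned = 5.0 * secret + 0.1 * rng.normal(size=64)  # watermark-aligned activation
benign = rng.normal(size=64)                        # unrelated activation
```

This is why the inference overhead is small: monitoring costs one dot product per layer probed, not a separate guard-model forward pass.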

5) Practical next steps

  • Monitor agent leakage per channel: enumerate observable channels (final text, tool arguments/returns, trajectories) and measure leakage with CIPL-style metrics (AER/CER) instead of only sanitizing outputs.
  • Add agent-aware static checks before launch: scan for tool-boundary taint flows, prompt-construction risks, and MCP over-authorization or unverified servers (Agent Audit-style SARIF wired into CI).
  • Deploy runtime tool-call constraints: log execution provenance and allow/block based on learned benign and adversarial paths; route ambiguous calls to an intent judge and exclude untrusted retrieved content (the Agent-Sentry pattern).
  • Harden RAG against poisoning at the retrieval stage: try retriever-side reranking or penalization over the top-B candidates; track poisoning hit/recall and downstream ASR, along with the clean-EM trade-off (ProGRank-style evaluation).
  • Treat memory as a security boundary: separate background "heartbeat" context from user-facing context; require provenance and explicit user visibility before promotion to long-term memory (HEARTBEAT's E→M→B).
  • Upgrade fairness audits from outcomes to mechanisms: for high-stakes scoring (hiring), estimate path-specific effects (direct vs mediated; business necessity vs proxy/red-line) and test for photo-induced direct effects (PopResume).
  • Stress-test context sensitivity: add "context-irrelevant" steering sentences to fairness and safety probes; measure invariance failures (gender inference) and situational drift (moral judgment) before deployment.
  • If you use preference optimization or implicit feedback: check for imbalanced comprehension of preference pairs (the B-DPO idea) and for propensity bias or missing negatives (ImplicitRM) before trusting the reward model.
  • Adopt long-horizon evaluation contracts: require intermediate artifacts (checklists, cohort audit tables, screenshots) and evaluate subjective enterprise tasks with rubric scoring plus human verification (the LH-Bench/RWE-bench pattern).
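
The context-sensitivity stress test in the list above can be operationalized as a flip-rate harness; the prefixes and the deliberately brittle stand-in model below are invented for illustration:

```python
IRRELEVANT_PREFIXES = [
    "",                                # baseline: no added context
    "The weather is lovely today. ",
    "As mentioned in the meeting, ",
    "FYI: ",
]

def flip_rate(model, probes):
    """Fraction of probes whose label changes under any context-irrelevant
    prefix relative to the unprefixed baseline."""
    flipped = 0
    for probe in probes:
        base = model(IRRELEVANT_PREFIXES[0] + probe)
        if any(model(prefix + probe) != base
               for prefix in IRRELEVANT_PREFIXES[1:]):
            flipped += 1
    return flipped / len(probes)

# Brittle stand-in model: its label leaks input length, which any
# invariance harness should catch; the stable model never flips.
brittle = lambda text: "she" if len(text) % 2 == 0 else "he"
stable = lambda text: "they"

probes = ["The nurse prepared the chart.", "The engineer fixed the bug."]
```

A nonzero flip rate on semantically equivalent contexts is the invariance failure the gender-inference paper warns about, and the same harness can wrap moral-judgment probes.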

Generated from per-paper analyses; no external browsing was performed.