AI 论文日报（2026-03-26）

Published: March 26, 2026

English version: /paper-news/2026-03-26/

运行统计

候选论文: 232
入选论文: 30
已精读完成: 30
时间窗口 (UTC): 2026-03-24T00:00:00Z → 2026-03-25T00:00:00Z (arxiv_announce, expanded=0)

展开查看用于总结的论文列表

arXiv ID	标题 / 链接	分类	评分	入选理由	标签
`2603.22928`	SoK: The Attack Surface of Agentic AI -- Tools, and Autonomy PDF	cs.CR	96	SoK mapping agentic AI attack surface (RAG/tools/multi-agent); strong taxonomy + synthesis for defenders	agentic-ai, security, prompt-injection, tool-security, rag-poisoning, multi-agent, survey
`2603.23064`	Mind Your HEARTBEAT! Claw Background Execution Inherently Enables Silent Memory Pollution PDF	cs.CR, cs.AI, cs.SI	95	Finds agent memory pollution via background “heartbeat”; concrete vuln model for personal agents	agent-security, memory-poisoning, prompt-injection, tool-agents, provenance, background-execution
`2603.22853`	Agent Audit: A Security Analysis System for LLM Agent Applications PDF	cs.CR, cs.AI	93	Practical static/security analysis for LLM agent apps (code+configs+creds+privileges); deployable outputs	agents, appsec, static-analysis, mcp, credentials, tooling, deployment-security
`2603.23171`	Robust Safety Monitoring of Language Models via Activation Watermarking PDF	cs.CR, cs.AI, cs.CY, cs.LG	92	Frames robust LLM monitoring as a security game; targets adaptive evasion with activation watermarking	monitoring, misuse-detection, adaptive-adversary, watermarking, inference-security, safety
`2603.23269`	Not All Tokens Are Created Equal: Query-Efficient Jailbreak Fuzzing for LLMs PDF	cs.CR, cs.AI, cs.LG	92	Query-efficient jailbreak fuzzing via token-importance; enables stronger red-teaming under budgets	jailbreaks, fuzzing, adversarial-prompts, red-teaming, surrogate-models, evaluation
`2603.23184`	ImplicitRM: Unbiased Reward Modeling from Implicit Preference Data for LLM alignment PDF	cs.CL, cs.AI, stat.AP	92	Unbiased reward modeling from implicit feedback; tackles bias + missing negatives for RLHF at scale	alignment, RLHF, reward-modeling, implicit-feedback, debiasing, preference-learning
`2603.22934`	ProGRank: Probe-Gradient Reranking to Defend Dense-Retriever RAG from Corpus Poisoning PDF	cs.AI	90	Training-free retriever-side defense for RAG corpus poisoning using probe gradients + perturbation tests	rag, retrieval-security, data-poisoning, dense-retriever, defense, robustness
`2603.22767`	Can LLM Agents Generate Real-World Evidence? Evaluating Observational Studies in Medical Databases PDF	cs.AI, cs.CL	90	RWE-bench tests long-horizon agents executing real DB observational studies; structured evidence eval	agent-evaluation, benchmarks, tool-use, long-horizon, databases, healthcare
`2603.22829`	Improving Safety Alignment via Balanced Direct Preference Optimization PDF	cs.AI	90	Targets safety-alignment overfitting in DPO; proposes balanced objective to improve robustness.	alignment, DPO, RLHF, safety, overfitting, preference-learning
`2603.23268`	SafeSeek: Universal Attribution of Safety Circuits in Language Models PDF	cs.LG, cs.AI	88	Optimization-based attribution of safety circuits; aims for generalizable mechanistic safety interpretability	mechanistic-interpretability, safety-circuits, jailbreaks, backdoors, attribution, sparse-masks
`2603.22744`	Beyond Binary Correctness: Scaling Evaluation of Long-Horizon Agents on Subjective Enterprise Tasks PDF	cs.AI	88	LH-Bench evaluates subjective enterprise long-horizon workflows with rubrics + artifact-based signals	agent-evaluation, benchmarks, enterprise, rubrics, long-horizon, LLM-judges
`2603.23047`	Parametric Knowledge and Retrieval Behavior in RAG Fine-Tuning for Electronic Design Automation PDF	cs.CL, cs.AI, cs.CE	88	RAG fine-tuning analysis + new factual attribution eval (TriFEX) and parametric-knowledge metric (PKP).	RAG, evaluation, factuality, attribution, metrics, fine-tuning
`2603.23114`	Between Rules and Reality: On the Context Sensitivity of LLM Moral Judgment PDF	cs.AI, cs.CL, cs.CY, cs.HC	88	Contextual moral dilemmas dataset; shows LLM moral sensitivity differs from humans; control problem	safety, alignment, evaluation, moral-judgment, dataset, context-sensitivity, human-comparison
`2603.22868`	Agent-Sentry: Bounding LLM Agents via Execution Provenance PDF	cs.CR, cs.AI	87	Execution provenance to bound/validate agent behavior vs irrelevant/compromised actions; security+privacy angle	agents, provenance, runtime-verification, policy-bounding, auditability, security
`2603.22882`	TreeTeaming: Autonomous Red-Teaming of Vision-Language Models via Hierarchical Strategy Exploration PDF	cs.LG, cs.CV	86	Autonomous red-teaming for VLMs via hierarchical strategy exploration; seeks novel/diverse exploits	red-teaming, vlm-safety, adversarial-testing, automation, evaluation, attack-discovery
`2603.23355`	Off-Policy Value-Based Reinforcement Learning for Large Language Models PDF	cs.LG, cs.CL	86	Off-policy value-based RL for LLMs with replay; could improve sample efficiency for reasoning RL	RL-for-LLMs, off-policy, value-learning, reasoning, verification-signals, sample-efficiency
`2603.22751`	CIPL: A Target-Independent Framework for Channel-Inversion Privacy Leakage in Agents PDF	cs.CR	85	General framework for privacy leakage in agents as channel inversion; broadens beyond memory leakage	privacy, agents, information-leakage, attack-framework, side-channels, threat-modeling
`2603.23117`	TRAP: Hijacking VLA CoT-Reasoning via Adversarial Patches PDF	cs.CR	84	Shows CoT in VLA robots enables targeted control hijacking via adversarial patches; important embodied risk	robotics, vla, chain-of-thought, adversarial-patches, embodied-security, attack
`2603.22717`	Does Teaming-Up LLMs Improve Secure Code Generation? A Comprehensive Evaluation with Multi-LLMSecCodeEval PDF	cs.CR, cs.SE	84	Evaluates multi-LLM collaboration + static analysis for secure codegen; practical security pipeline data	secure-code-generation, LLM-ensembles, static-analysis, software-security, evaluation
`2603.23292`	LLM Olympiad: 入选理由 Model Evaluation Needs a Sealed Exam PDF	cs.AI, cs.CL	84	Proposes sealed-exam 'LLM Olympiad' to reduce benchmark leakage/chasing and improve trust in evals.	evaluation, benchmarks, data-contamination, leaderboards, reproducibility, governance
`2603.23485`	Failure of contextual invariance in gender inference with large language models PDF	cs.CL, cs.AI, cs.CY	84	Finds large instability under contextually equivalent prompts in gender inference; evaluation warning	reliability, robustness, bias, evaluation, prompt-sensitivity, context-invariance
`2603.23501`	MedObvious: Exposing the Medical Moravec's Paradox in VLMs via Clinical Triage PDF	cs.CV, cs.AI, cs.CL	83	Benchmark for medical VLM input-validity sanity checks; targets a safety-critical failure mode	evaluation, medical-ai, vlm, robustness, input-validation, benchmark
`2603.22714`	PopResume: Causal Fairness Evaluation of LLM/VLM Resume Screeners with Population-Representative Dataset PDF	cs.CY, cs.AI	83	PopResume dataset enables causal/path-specific fairness audits for LLM/VLM resume screeners at scale	fairness, auditing, datasets, causal-evaluation, hiring, VLM
`2603.22812`	Efficient Hallucination Detection: Adaptive Bayesian Estimation of Semantic Entropy with Guided Semantic Exploration PDF	cs.CL	83	Adaptive semantic-entropy hallucination detection cuts sampling cost by adjusting budget to uncertainty.	hallucination, uncertainty, semantic-entropy, Bayesian, efficient-eval, reliability
`2603.23483`	SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning PDF	cs.CV, cs.CL	82	Speculative planning to cut agentic multimodal tool-loop latency; system-level speedups for MLLM agents	agentic-MLLM, speculative-decoding, tool-use, latency, planning, efficiency
`2603.22754`	PRISM: A Dual View of LLM Reasoning through Semantic Flow and Latent Computation PDF	cs.CL	82	Joint step+layer reasoning diagnostics; identifies failure modes like verification loops/divergence	interpretability, reasoning, analysis, hidden-states, failure-modes, diagnostics
`2603.22823`	Empirical Comparison of Agent Communication Protocols for Task Orchestration PDF	cs.AI	81	Benchmark comparing tool-only vs delegation vs hybrid multi-agent protocols; useful for agent design.	agents, multi-agent, tool-use, orchestration, benchmarks, protocols
`2603.23013`	Knowledge Access Beats Model Size: Memory Augmented Routing for Persistent AI Agents PDF	cs.CL	81	Memory-augmented routing for persistent agents; big cost cuts without training; strong deployment angle	agents, memory, efficiency, routing, long-term-interaction, serving
`2603.23149`	Describe-Then-Act: Proactive Agent Steering via Distilled Language-Action World Models PDF	cs.AI	80	Fast 'describe-then-act' steering layer predicts outcomes from latent+actions; aims at proactive safety.	agents, world-models, steering, safety, latents, planning
`2603.22879`	Confidence Calibration under Ambiguous Ground Truth PDF	cs.LG, cs.AI	79	Calibration breaks with annotator disagreement; proposes ambiguity-aware post-hoc calibrators	calibration, uncertainty, evaluation, ambiguous-labels, reliability, post-hoc

AI 论文洞察简报

2026-03-26

0) 执行要点（先读这个）

智能体安全正从“提示注入”转向“系统表面”：多篇论文显示，主导风险来自通道（工具参数/返回、轨迹）、架构（heartbeat 后台执行）与运行时行为（执行溯源），而不只是最终文本。
实用防御正变得更“系统化”且可度量：面向 RAG 投毒的检索端重排（不调用生成器）、基于溯源图的运行时工具调用约束、面向 MCP 配置的智能体静态分析，都报告了强的安全/效用权衡。
评估正在超越单一数字的正确率：长时程主观企业任务（量表 + 工件契约 + 人工验证）、基于真实数据库后端的端到端医疗观察性研究、以及医疗 VLM 的“输入合理性检查”揭示了经典基准遗漏的失败。
对齐训练与监控更“分布感知”：B-DPO 针对偏好对理解不均衡；ImplicitRM 使隐式反馈奖励建模在缺失负样本与倾向性偏差下无偏；激活水印加入带密钥、对抗者感知的监控。
上下文敏感性是反复出现的失败模式：极小的上下文变化就能翻转性别代词推断；道德判断会随细微线索变化且与人类不同；这些效应可控（激活引导），但并非零代价（能力小幅下降）。
成本/时延优化越来越依赖“门控 + 验证”：面向智能体 MLLM 的推测式“免工具”绕行与记忆增强路由带来大幅提速/降本，但依赖置信度/校准与检索保真度。

2) 关键主题（聚类）

主题：跨通道、工具与自治的智能体安全

重要性：真实智能体部署会通过可观测通道（工具参数/返回、轨迹）、工具边界与自治后台行为发生泄露与劫持——往往不需要明显的提示注入。
代表论文：
共同方法：
- 定义系统级威胁模型（流水线、阶段、通道、溯源），而非只看输出文本。
- 用显式比率度量安全（CER/AER、ASR、效用成功率）并做受控消融/覆盖率实验。
- 在边界处加入执行/分析：对工具代码 + MCP 配置做静态扫描；运行时拦截工具调用。
开放问题 / 失败模式：
- 覆盖缺口：未见过的良性行为会变得“模糊”，需要意图检查（Agent-Sentry）。
- 模仿攻击：遵循良性工具序列但操纵参数（Agent-Sentry）。
- 玩具/有限目标与字符串匹配式泄露指标可能低估语义泄露（CIPL）。
- 产品级设计缺陷（共享会话 heartbeat）制造“类零点击”的记忆污染路径（HEARTBEAT）。

主题：RAG 与记忆：鲁棒性、投毒与“知识 vs 模型规模”

重要性：检索与记忆可能主导正确性与成本，但也引入投毒与泄露表面；防御与指标正转向检索端与归因感知。
代表论文：
共同方法：
- 尽可能保持防御免训练、免生成器（检索端重排；不额外调用 LLM）。
- 区分“模型知道什么”与“从上下文复制了什么”（三元组归因；PKP、PR、SK 分解）。
- 使用混合检索（dense + BM25）并分析不同类型的增益与回退（如时间推理）。
开放问题 / 失败模式：
- 探测重复/候选缓冲带来的干净效用权衡与计算开销（ProGRank）。
- 记忆保真风险：存储摘要可能污染检索；非结构化轮次对会削弱时间推理（Memory routing）。
- 归因噪声：基于 LLM 的三元组归因约 80% 准确率；固定 top-k 证据可能漏掉支撑（TRIFEX/PKP）。

主题：面向长时程、主观与高风险工作流的评估

重要性：上线智能体的失败方式并非单元测试与最终答案准确率能捕捉——过程质量、中间工件、队列/队列群体正确性与输入有效性成为一等公民。
代表论文：
共同方法：
- 构建带工件契约与运行时验证钩子的环境（容器、工具服务器、清单）。
- 使用多评委/量表评分，并用人工成对偏好验证（LH-Bench）。
- 将端到端成功门控在中间正确性上（队列审计表；带验证的数值结果遮蔽）。
开放问题 / 失败模式：
- 超出少数环境/数据库的泛化（LH-Bench：2 个环境；RWE-bench：仅 MIMIC-IV）。
- 自动筛查/评审成本高（RWE-bench 队列筛查约 2 亿 tokens）。
- 医疗 VLM 门控中的格式敏感与误报（MedObvious）。

主题：分布漂移与对抗者下的对齐与监控

重要性：安全失败常来自过拟合（偏好优化）、有偏反馈通道（隐式信号）与自适应攻击者（越狱）；更鲁棒的监控与更好的目标函数正在出现。
代表论文：
共同方法：
- 建模数据生成过程（隐式反馈 = 偏好 × 行为倾向；基于 MI 的理解不均衡）。
- 在自适应/低预算攻击设定下评估（AutoDAN；查询预算；超低 FPR 监控）。
- 加入带密钥或内部信号（激活水印），而非只依赖外部文本守卫。
开放问题 / 失败模式：
- 水印假设黑盒攻击者；无可证明保证；数学任务效用下降（ActWM）。
- 越狱模糊测试依赖本地替代模型与自动成功评审（TriageFuzz）。
- ImplicitRM 假设正向行为总表示正向偏好（未处理误点噪声）。

主题：公平与道德判断中的上下文敏感性（及可控性）

重要性：小的、“理论上无关”的上下文就能主导输出，削弱偏差审计并使行为可被对抗性引导；部分敏感性可控，但存在权衡。
代表论文：
共同方法：
- 受控扰动以隔离上下文效应（引导句；道德情境变体；简历格式/照片）。
- 超越结果差异到机制：因果路径分解（BIE vs RIE）与形式化情境性测试（CbD）。
- 探索干预：激活引导调节上下文敏感性；带置信区间的因果估计用于公平路径。
开放问题 / 失败模式：
- 数据集真实度与中介变量选择影响法律/因果解释（PopResume）。
- 性别推断任务是二元代词与极简上下文；缺少人类不变性基线（contextual invariance 论文）。
- 引导的泛化性（超出单一模型家族）与生态有效性（超出二元困境）（道德判断）。

3) 技术综合

多项工作收敛到“门控 + 回退”架构：SpecEyes 通过可分离性门控免工具回答；记忆路由通过 logprob 置信度门控升级；Agent-Sentry 通过溯源图 + 意图评审门控工具调用；幻觉检测通过熵-方差停止门控采样。
评审器依赖无处不在，但用法不同：LH-Bench 用多评审 + 人工验证；RWE-bench 用门控问题与队列评审；TreeTeaming 与 MedObvious 强调格式/评审敏感风险；TRIFEX 用 LLM 归因并测得准确率（~80%）。
安全度量正变为基于比率且覆盖生命周期：跨生成/检测/修补的安全代码率；ASR vs 效用；泄露的 CER/AER；检索阶段的投毒命中/召回与下游 ASR。
免训练防御更受部署青睐：ProGRank 无需再训练即可重排；Agent Audit 是静态；SpecEyes 是路由；记忆路由是检索 + 置信度；与之对比的是基于微调的监控（激活水印）与对齐（B-DPO）。
因果/结构分解正在扩散：PopResume 将受保护属性效应分解为直接 vs 中介（业务必要性 vs 红线/代理）；ImplicitRM 将隐式反馈分解为偏好 vs 行为倾向；校准论文分解“真标签” vs 投票标签的校准目标。
稀疏结构反复出现：TriageFuzz 发现拒答由稀疏 token 区域主导；SafeSeek 发现极稀疏的安全/后门电路；两者暗示防御/攻击可聚焦小子结构。
上下文与模态增加对敏感属性的直接访问：PopResume 显示照片会增加 VLM 筛选器中的直接效应（NDE）；TRAP 显示 CoT 可主导动作生成并可被视觉补丁劫持。
成本成为一等指标：多 LLM 安全编码量化 CodeQL 运行时主导；MCP vs A2A 显示 token 膨胀的交叉点；SpecEyes 形式化吞吐提速；记忆路由报告相对大模型约 96% 的有效成本降低。
可靠性取决于中间工件：队列审计表、清单/截图、工具调用溯源与三元组归因越来越被用作评估与调试的“契约”。

4) Top 5 论文（含“为何现在”）

1) Agent-Sentry: Bounding LLM Agents via Execution Provenance

从轨迹中引入功能图（良性/对抗/模糊）并进行运行时拦截工具调用。
使用意图对齐评审器，且限制为仅可信输入（提示 + 工具规格 + 工具历史），明确排除检索内容。
在新的 6,733 条轨迹基准上报告强的安全/效用权衡（全覆盖下效用 94.61%，ASR 9.46%）。
质疑点：依赖覆盖率，且模仿攻击（良性路径 + 恶意参数）可能绕过。

2) ProGRank: Probe-Gradient Reranking to Defend Dense-Retriever RAG from Corpus Poisoning

免训练、检索端防御：利用扰动下 probe-gradient 不稳定性 + 分数门控。
降低投毒 Top-K 暴露并报告强的下游鲁棒性（其评估中 Top-5 的宏平均、基于评审的 ASR 报告为 0.000）。
比高成本基线快得多（平均 4.73 s/查询 vs RAGuard 的 118.17 s/查询）。
质疑点：计算开销取决于探测重复与候选缓冲；干净效用权衡依赖数据集。

3) Agent Audit: A Security Analysis System for LLM Agent Applications

面向 Python 智能体 + MCP 配置语义的静态分析，带工具边界污点传播与置信分层。
新基准 AVB（22 样本，42 漏洞）；报告 95.24% 召回率（40/42），而 Semgrep/Bandit 约 24–30% 召回。
CI/IDE 友好输出（SARIF），在 22k LOC 上亚秒级扫描。
质疑点：仅过程内污点；MCP 启发式带来误报；JS/TS 支持有限。

4) PopResume: Causal Fairness Evaluation of LLM/VLM Resume Screeners with Population-Representative Dataset

提供具有人口学基础的简历数据集（60,884）与路径特定因果分解：直接 vs 中介，并将中介分为业务必要性（BIE）vs 红线/代理（RIE）。
发现仅看结果指标会掩盖的歧视模式（如抵消：TE≈0 但 NDE/NIE 非零；120 个案例中 53 个存在混合中介）。
显示加入照片会在 VLM 中增大直接歧视幅度（20 组配对中 8 组 NDE 增加）。
质疑点：合成渲染 + 美国特定人口假设；中介分组（B vs R）依赖情境/司法辖区。

5) Robust Safety Monitoring of Language Models via Activation Watermarking

带密钥的内部监控：微调使有害激活对齐到秘密方向；推理时用余弦相似度检测。
报告在越狱家族上更高 AUROC（如 AutoDAN AUROC 0.9048）并改进低 FPR 运行；加入“秘密提取”归因博弈（~80% 对角线）。
推理开销低（投影）而非守卫模型的额外前向。
质疑点：假设黑盒攻击者；无可证明保证；部分效用下降（尤其 GSM8K −7.13 pp）。

5) 实用下一步

为智能体做通道级泄露监测：枚举可观测通道（最终文本、工具参数/返回、轨迹），用 CIPL 风格指标（AER/CER）度量泄露，而非只做输出脱敏。
上线前加入面向智能体的静态检查：扫描工具边界污点流、提示构造风险、MCP 过度授权/未验证服务器（Agent Audit 风格 SARIF 接入 CI）。
部署运行时工具调用约束：记录执行溯源并基于学习到的良性/对抗路径允许/阻断；将模糊调用路由到意图评审器，且排除不可信检索内容（Agent-Sentry 模式）。
在检索阶段加固 RAG 抗投毒：尝试对 top-B 候选做检索端重排/惩罚；跟踪投毒命中/召回与下游 ASR，以及干净 EM 的权衡（ProGRank 风格评估）。
将记忆视为安全边界：分离后台“heartbeat”上下文与面向用户上下文；在提升到长期记忆前要求溯源 + 明确用户可见性（HEARTBEAT E→M→B）。
将公平审计从结果升级到机制：对高风险评分（招聘）估计路径特定效应（直接 vs 中介；业务必要性 vs 代理/红线）并测试照片诱发的直接效应（PopResume）。
压力测试上下文敏感性：在公平与安全探针中加入“上下文无关”的引导句；在部署前测量不变性失败（性别推断）与情境漂移（道德判断）。
若使用偏好优化或隐式反馈：在信任奖励模型前检查偏好对理解不均衡（B-DPO 思路）与倾向性偏差/缺失负样本（ImplicitRM）。
采用长时程评估契约：要求中间工件（清单、队列审计表、截图）并用量表评分 + 人工验证评估主观企业任务（LH-Bench/RWE-bench 模式）。

由逐篇分析生成；未进行外部浏览。

Di Tang

AI 论文洞察简报

2026-03-26

0) 执行要点（先读这个）

2) 关键主题（聚类）

主题：跨通道、工具与自治的智能体安全

主题：RAG 与记忆：鲁棒性、投毒与“知识 vs 模型规模”

主题：面向长时程、主观与高风险工作流的评估

主题：分布漂移与对抗者下的对齐与监控

主题：公平与道德判断中的上下文敏感性（及可控性）

3) 技术综合

4) Top 5 论文（含“为何现在”）

5) 实用下一步