AI Paper Daily (2026-03-22)

Published: March 22, 2026

English version: /paper-news/2026-03-22/

Run statistics

Candidate papers: 1253
Selected papers: 30
Deep reads completed: 30
Time window (UTC): 2026-03-20T00:00:00Z → 2026-03-21T00:00:00Z (weekend_backlog_unknown, expanded=0)

Papers used for the summary

arXiv ID	Title / Link	Categories	Score	Why selected	Tags
`2603.14987`	Beyond Benchmark Islands: Toward Representative Trustworthiness Evaluation for Agentic AI PDF	cs.CL, cs.DB	93	Argues for representative trustworthiness eval for agentic AI; proposes HAA framework.	agent-evaluation, trustworthiness, sociotechnical, benchmarks, agents
`2603.19011`	Security awareness in LLM agents: the NDAI zone case PDF	cs.CR, cs.AI	92	Measures whether LLM agents can infer secure vs insecure execution; key for TEE/tool-use safety.	agent-security, TEE, situational-awareness, evaluation, tool-use
`2603.18577`	MedForge: Interpretable Medical Deepfake Detection via Forgery-aware Reasoning PDF	cs.AI	92	Large benchmark + grounded reasoning for medical deepfake detection; strong safety relevance	deepfake-detection, multimodal, benchmark, grounded-reasoning, medical-safety, localization
`2603.15542`	InterveneBench: Benchmarking LLMs for Intervention Reasoning and Causal Study Design in Real Social Systems PDF	cs.CY, cs.AI	92	InterveneBench: 744 real studies to test LLM causal intervention & design reasoning; strong eval gap.	benchmark, evaluation, causal-reasoning, interventions, social-science, LLM
`2603.14761`	BrainBench: Exposing the Commonsense Reasoning Gap in Large Language Models PDF	cs.AI	92	New commonsense benchmark; shows big gaps on brainteasers even for frontier LLMs.	evaluation, commonsense, reasoning, benchmark, robustness
`2603.17623`	ARES: Scalable and Practical Gradient Inversion Attack in Federated Learning through Activation Recovery PDF	cs.LG, cs.CR	92	Practical gradient inversion attack (no arch mods) reconstructs data from large FL batches.	federated-learning, privacy, gradient-inversion, security, data-leakage, attack
`2603.14730`	GNNVerifier: Graph-based Verifier for LLM Task Planning PDF	cs.LG	91	Non-LLM graph verifier for LLM plans; targets structural hallucinations & dependency errors in agents	agents, planning, verification, hallucinations, graph-methods, robustness
`2603.15397`	SFCoT: Safer Chain-of-Thought via Active Safety Evaluation and Calibration PDF	cs.CR, cs.AI	90	Monitors/calibrates unsafe intermediate CoT steps to resist jailbreaks, not just final output.	jailbreaks, chain-of-thought, safety-monitoring, calibration, defense
`2603.15615`	Mechanistic Origin of Moral Indifference in Language Models PDF	cs.CL, cs.AI	90	Mechanistic study of moral concept collapse + latent “moral indifference”; proposes representation fix.	mechanistic-interpretability, alignment, representations, moral-reasoning, safety
`2603.18895`	From Accuracy to Readiness: Metrics and Benchmarks for Human-AI Decision-Making PDF	cs.HC, cs.AI, cs.LG	90	Practical readiness metrics for human-AI teaming; targets miscalibrated reliance & safety signals.	human-AI teaming, evaluation, calibration, reliance, safety-metrics, deployment
`2603.17948`	VideoAtlas: Navigating Long-Form Video in Logarithmic Compute PDF	cs.CV, cs.AI	90	Hierarchical lossless video representation enabling long-video navigation with log compute.	long-context, video, agents, memory, efficient-inference, multimodal
`2603.18767`	A Concept is More Than a Word: Diversified Unlearning in Text-to-Image Diffusion Models PDF	cs.AI	89	Improves diffusion concept unlearning beyond keywords; reduces brittle/over-forgetting in safety edits	diffusion, unlearning, content-safety, model-editing, robustness
`2603.15364`	CRASH: Cognitive Reasoning Agent for Safety Hazards in Autonomous Driving PDF	cs.AI, cs.CL	89	LLM agent for AV incident analysis + curated 2,168-case dataset; practical safety auditing	agent, autonomous-driving, safety, incident-analysis, dataset, LLM
`2603.15372`	SKILLS: Structured Knowledge Injection for LLM-Driven Telecommunications Operations PDF	cs.SE, cs.AI, cs.CR	88	Tool-using LLM agent benchmark with live mock APIs + deterministic rubrics for telecom ops.	agents, tool-use, benchmark, evaluation, enterprise, APIs
`2603.14778`	$p^2$RAG: Privacy-Preserving RAG Service Supporting Arbitrary Top-$k$ Retrieval PDF	cs.CR, cs.AI	88	Privacy-preserving RAG enabling arbitrary top-k without costly secure sorting; practical for LLM apps.	RAG, privacy, secure-retrieval, cryptography, deployment
`2603.17759`	Harm or Humor: A Multimodal, Multilingual Benchmark for Overt and Covert Harmful Humor PDF	cs.CL, cs.AI	88	Multimodal+multilingual benchmark for harmful humor incl. covert harm; strong safety eval value	AI safety, benchmark, harmful content, multimodal, multilingual, humor, toxicity detection, Arabic
`2603.17683`	Sensi: Learn One Thing at a Time -- Curriculum-Based Test-Time Learning for LLM Game Agents PDF	cs.AI, cs.LG	88	Structured test-time learning for LLM game agents; curriculum + steerable context control-plane.	llm-agents, test-time-learning, curriculum-learning, agent-architecture, memory, evaluation
`2603.18680`	Revisiting Label Inference Attacks in Vertical Federated Learning: Why They Are Vulnerable and How to Defend PDF	cs.LG, cs.CR	88	Reframes label inference in VFL via mutual info; explains vulnerabilities and proposes defenses.	vertical-federated-learning, privacy, label-inference, mutual-information, defense
`2603.18793`	Functional Subspace Watermarking for Large Language Models PDF	cs.CR, cs.AI	86	LLM watermarking robust to fine-tune/quantize/distill by anchoring signals in functional subspace.	watermarking, model-ownership, robustness, LLMs, security
`2603.14756`	Towards Privacy-Preserving Machine Translation at the Inference Stage: A New Task and Benchmark PDF	cs.CL, cs.AI	86	Defines inference-time privacy task+benchmark for MT; fills evaluation gap for privacy-preserving NLP	privacy, machine-translation, benchmark, inference, evaluation
`2603.14771`	OpenHospital: A Thing-in-itself Arena for Evolving and Benchmarking LLM-based Collective Intelligence PDF	cs.AI	86	Interactive arena to evolve/benchmark multi-agent collective intelligence; strong eval framing.	agents, multi-agent, collective-intelligence, benchmark, evaluation, healthcare
`2603.14911`	Fine-tuning RoBERTa for CVE-to-CWE Classification: A 125M Parameter Model Competitive with LLMs PDF	cs.CR, cs.CL	86	CVE→CWE classifier competitive with LLMs; large dataset + strong macro-F1 on rare classes.	cybersecurity, vulnerability-classification, CVE, CWE, robustness, dataset
`2603.14855`	PCodeTrans: Translate Decompiled Pseudocode to Compilable and Executable Equivalent PDF	cs.SE, cs.AI	86	Feedback + dynamic validation to prevent semantic hallucinations in decompiled code recovery.	code, verification, hallucinations, program-synthesis, security
`2603.15566`	Lore: Repurposing Git Commit Messages as a Structured Knowledge Protocol for AI Coding Agents PDF	cs.SE, cs.AI, eess.SY	86	Practical protocol to preserve agent coding rationale in git; improves auditability & safer agent workflows	coding-agents, software-engineering, auditability, agent-workflows, knowledge-management, tooling
`2603.09253`	Efficient Reasoning at Fixed Test-Time Cost via Length-Aware Attention Priors and Gain-Aware Training PDF	cs.LG	86	Training-only priors for efficient reasoning at fixed test-time compute; broadly reusable.	efficient-reasoning, test-time-compute, attention, training-tricks, transformers
`2603.18570`	Attack by Unlearning: Unlearning-Induced Adversarial Attacks on Graph Neural Networks PDF	cs.LG, cs.CR	85	Shows approximate unlearning can be weaponized into attacks; introduces unlearning corruption.	machine-unlearning, adversarial-attacks, privacy, GNNs, security
`2603.17522`	Detecting the Machine: A Comprehensive Benchmark of AI-Generated Text Detectors Across Architectures, Domains, and Adversarial Conditions PDF	cs.CL, cs.AI	84	Broad benchmark of AI-text detectors across domains/LLMs with adversarial conditions; useful for eval.	evaluation, AI-generated-text, robustness, adversarial, benchmark
`2603.18538`	Beyond Passive Aggregation: Active Auditing and Topology-Aware Defense in Decentralized Federated Learning PDF	cs.LG, stat.ME	84	Active auditing metrics + topology-aware defenses for decentralized FL backdoors; practical security angle	federated-learning, backdoors, auditing, anomaly-detection, security, graph-topology
`2603.19182`	Box Maze: A Process-Control Architecture for Reliable LLM Reasoning PDF	cs.AI, cs.CL	84	Process-control architecture to reduce hallucination/adversarial failures; safety-oriented framing	LLM-safety, hallucination, robustness, process-supervision, architecture, adversarial
`2603.15421`	CLAG: Adaptive Memory Organization via Agent-Driven Clustering for Small Language Model Agents PDF	cs.CL, cs.AI	84	Agent memory clustering to reduce irrelevant/corrupt context; practical for small-model agents	agents, memory, retrieval, small language models, RAG, context management, robustness

AI Paper Insight Brief

2026-03-22

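0) Executive highlights (read this first)

Verification is shifting from “ask another LLM” to structured, checkable signals: graph-based plan verification with node/edge risk (GNNVerifier) and stepwise CoT safety scoring plus intervention (SFCoT) both show significant robustness gains over prompt-only baselines.
Privacy/security research is becoming more system-realistic: private RAG now efficiently supports arbitrarily large top‑k (p²RAG); federated learning attacks drop the assumption that the architecture must be modified (ARES); VFL defenses exploit where label information actually concentrates (moving the cut layer).
Benchmarks are becoming more diagnostic (and more multidimensional): BrainBench separates accuracy from consistency (stochasticity); the harmful humor benchmark adds multimodality, Arabic, and covert harm; AI-text detection is stress-tested under length matching, domain shift, and adversarial rewriting.
Agent reliability is increasingly bottlenecked by representation and memory organization: CLAG's in-cluster local memory evolution improves SLM robustness and latency; the “moral indifference” work shows that behavioral alignment can leave the latent geometry misaligned, and demonstrates that SAE-based steering can improve adversarial safety metrics.
Execution-grounded feedback loops beat static checks (code/security pipelines): PCodeTrans drives LLM repair with in-situ binary substitution, ASan, and differential tracing, achieving near-perfect function-level equivalence on coreutils/binutils.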

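2) Key themes (clusters)

Theme: Structured verification and process-oriented agent safety

Why it matters: Agent failures often arise from cross-step structure (plans) or intermediate reasoning (CoT), which final-answer filters miss. Verifiers that expose where things go wrong enable targeted repair and safer autonomy.
Representative papers:
Common approaches:
- Convert unstructured agent artifacts into structured objects (plan graphs; stepwise CoT segments; scenario distributions).
- Produce localized diagnostics (node/edge risk; stepwise safety scores) and gate edits/continuations on the verification signal.
- Use synthetic supervision / controlled perturbations when real fine-grained labels are missing (plan-graph perturbations; scenario suites).
Open questions / failure modes:
- Synthetic perturbations may not match real planning errors (GNNVerifier's distribution gap).
- Runtime overhead and scalability of stepwise CoT evaluation plus paraphrase-variance checks (SFCoT reports no latency figures).
- “Representative scenario sampling” remains unvalidated at scale (the HAAF demo covers 24 scenarios and a single model).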
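The gating idea above (apply a local edit only when the verification signal improves) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not GNNVerifier's implementation: the plan format, `toy_score`, and `toy_edits` are hypothetical stand-ins, and a real verifier would be a trained model rather than a set lookup.

```python
from typing import Callable, Iterable, List, Tuple

Plan = List[str]  # hypothetical plan format: an ordered list of tool names


def gated_repair(
    plan: Plan,
    score: Callable[[Plan], float],
    propose_edits: Callable[[Plan], Iterable[Plan]],
    max_rounds: int = 5,
) -> Tuple[Plan, float]:
    """Greedy local repair that accepts an edit only if the score strictly improves."""
    best, best_score = list(plan), score(plan)
    for _ in range(max_rounds):
        # Score every candidate edit of the current best plan.
        scored = [(score(c), c) for c in propose_edits(best)]
        if not scored:
            break
        top_score, top_plan = max(scored, key=lambda pair: pair[0])
        if top_score <= best_score:
            break  # acceptance test failed: keep the current plan
        best, best_score = list(top_plan), top_score
    return best, best_score


# Toy stand-ins (assumptions for illustration only).
KNOWN_TOOLS = {"search", "fetch", "summarize"}


def toy_score(plan: Plan) -> float:
    # Fraction of steps that name a known tool.
    return sum(step in KNOWN_TOOLS for step in plan) / len(plan) if plan else 0.0


def toy_edits(plan: Plan) -> List[Plan]:
    # Propose replacing each unknown step with "fetch", one step at a time.
    return [plan[:i] + ["fetch"] + plan[i + 1:]
            for i, step in enumerate(plan) if step not in KNOWN_TOOLS]


repaired, final_score = gated_repair(
    ["search", "teleport", "summarize"], toy_score, toy_edits
)
print(repaired, final_score)
```

The strict-improvement check is the acceptance test: an edit that leaves the score unchanged or lowers it is discarded, so the loop can never make the plan worse according to the verifier.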

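Theme: Privacy-preserving inference and leakage-aware ML systems

Why it matters: As LLM services and federated learning move into sensitive domains, practical privacy depends on inference-time leakage (RAG/MT) and realistic attacker capabilities (FL/VFL).
Representative papers:
Common approaches:
- Replace one-size-fits-all cryptographic primitives with protocol redesign (p²RAG replaces secure sorting with a bisection threshold search).
- Define tasks and metrics that make privacy measurable (PPMT + PER).
- Explain leakage with information-theoretic / mechanistic analysis (mutual information by depth in VFL; FL activation recovery formulated as sparse recovery).
Open questions / failure modes:
- Deployment assumptions: a trusted dealer plus two non-colluding servers (p²RAG); a malicious server that can set weights/biases (ARES).
- In implicit MT scenarios, NER quality dominates privacy exposure (PPMT).
- Defenses may shift leakage elsewhere (the VFL cut-layer defense may increase feature-leakage risk).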
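To see why a threshold search can stand in for sorting, here is a plaintext sketch of top-k selection by bisection. It is only an analogy under stated assumptions: the function name and iteration count are hypothetical, nothing here is encrypted or secret-shared, and it does not model the actual p²RAG protocol. The point is that each round needs only a count of scores above a threshold, never a full ordering.

```python
from typing import List, Sequence


def threshold_top_k(scores: Sequence[float], k: int, iters: int = 40) -> List[int]:
    """Return indices of (at least) the k highest scores without sorting.

    Bisects on a threshold t; each round only counts how many scores are >= t.
    """
    if not 0 < k <= len(scores):
        raise ValueError("k must be between 1 and len(scores)")
    lo, hi = min(scores), max(scores)
    # Invariant: at least k scores are >= lo.
    for _ in range(iters):
        mid = (lo + hi) / 2
        if sum(s >= mid for s in scores) >= k:
            lo = mid  # threshold can still be raised
        else:
            hi = mid  # threshold is too high
    return [i for i, s in enumerate(scores) if s >= lo]


print(threshold_top_k([0.1, 0.9, 0.4, 0.7, 0.3], k=2))
```

Because the invariant only guarantees "at least k", tied scores at the cutoff can return a few more than k items; a deployment would need to decide how to handle that surplus.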

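Theme: Memory, long-context navigation, and fixed-compute efficiency

Why it matters: Many failures are “context management” failures: retrieving distractors, losing long-range dependencies, or paying linear cost for long inputs (text/video). New work focuses on structure and budgets under fixed compute.
Representative papers:
Common approaches:
- Introduce hierarchy: cluster first, then retrieve (CLAG); a recursive K×K video grid (VideoAtlas); length-aware attention priors (RPA).
- Keep inference cost fixed or bounded via cached biases / depth budgets / two-stage retrieval.
- Use training-only controllers or external orchestration to retain gains without adding inference overhead (Guardian; Master–Worker; cluster evolution).
Open questions / failure modes:
- Generalization beyond the demonstrated settings (RPA+Guardian only on WikiText-2; VideoAtlas mainly on MCQ QA).
- Routing/prompt sensitivity and distribution shift for clustered memory (CLAG).
- The claim of “no measurable latency change” lacks microbenchmarks (RPA).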
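The "cluster first, then retrieve" pattern can be illustrated with a small two-stage retrieval sketch. It swaps in plain centroid cosine similarity for cluster routing, which is an assumption made for illustration and not CLAG's agent-driven clustering; the memory texts, vectors, and function names are all hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]
Memory = Tuple[str, Vector]  # (text, embedding); embeddings here are toy values


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def centroid(vectors: List[Vector]) -> List[float]:
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def two_stage_retrieve(
    query: Vector,
    clusters: Dict[str, List[Memory]],
    top_clusters: int = 1,
    top_items: int = 2,
) -> List[str]:
    # Stage 1: route the query to the closest cluster(s) by centroid similarity.
    ranked = sorted(
        clusters,
        key=lambda name: cosine(query, centroid([v for _, v in clusters[name]])),
        reverse=True,
    )
    # Stage 2: rank only the memories inside the selected cluster(s).
    candidates = [m for name in ranked[:top_clusters] for m in clusters[name]]
    candidates.sort(key=lambda m: cosine(query, m[1]), reverse=True)
    return [text for text, _ in candidates[:top_items]]


clusters = {
    "cooking": [("likes spicy food", [1.0, 0.1]), ("allergic to nuts", [0.9, 0.2])],
    "travel": [("visited Kyoto", [0.1, 1.0]), ("prefers trains", [0.2, 0.9])],
}
print(two_stage_retrieve([1.0, 0.0], clusters))
```

Stage 2 touches only the memories in the selected clusters, which is where the reduced search space comes from; the cost is that a routing mistake in stage 1 hides relevant memories entirely.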

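Theme: Benchmarks that expose reliability gaps (stochasticity, shift, covert harm)

Why it matters: Many “solved” metrics mask fragility: stochastic reasoning, domain shift, covert harm, and adversarial rewriting. New benchmarks isolate these dimensions.
Representative papers:
Common approaches:
- Measure more than accuracy: consistency vs. accuracy (BrainBench), multi-metric detector evaluation (AUROC/AUPRC/EER/Brier/FPR@95), covert vs. overt recall (harmful humor), rubric-based causal-design scoring (InterveneBench).
- Emphasize distribution shift (cross-domain detectors; Chinese translation; Arabic + multimodal).
- Use multi-run protocols to quantify variance and reliability (BrainBench: 10 runs per question).
Open questions / failure modes:
- Some diagnostic sets are small (BrainBench: 100 questions).
- Benchmarks depend on automated judges and proprietary models (harmful humor videos; BrainBench judge).
- Detectors still generalize poorly when domain and generator change together (AI-text detection).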
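A multi-run protocol makes it possible to report accuracy and consistency separately. The sketch below uses one plausible pair of definitions, which are assumptions and may differ from BrainBench's: accuracy is the mean per-run correctness, and consistency is the share of questions answered correctly in every run. The question IDs and outcomes are made up.

```python
from typing import Dict, List, Tuple


def accuracy_and_consistency(runs: Dict[str, List[bool]]) -> Tuple[float, float]:
    """runs maps a question ID to the correctness of each repeated run."""
    # Accuracy: average fraction of correct runs per question.
    accuracy = sum(sum(r) / len(r) for r in runs.values()) / len(runs)
    # Consistency (assumed definition): questions correct in *all* runs.
    consistency = sum(all(r) for r in runs.values()) / len(runs)
    return accuracy, consistency


# Hypothetical outcomes: 4 questions, 10 runs each.
runs = {
    "q1": [True] * 10,
    "q2": [True] * 7 + [False] * 3,  # right most of the time, but unstable
    "q3": [False] * 10,
    "q4": [True] * 10,
}
acc, cons = accuracy_and_consistency(runs)
print(f"accuracy={acc:.3f} consistency={cons:.3f} gap={acc - cons:.3f}")
```

In this toy data, `q2` contributes to accuracy but not to consistency, so the gap between the two numbers isolates how much of the score depends on sampling luck.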

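Theme: Security and provenance for models and ML pipelines

Why it matters: When models are fine-tuned, quantized, distilled, and redistributed, provenance and robustness must hold under real transformations; meanwhile new attack surfaces appear (unlearning, decentralized FL).
Representative papers:
Common approaches:
- Focus on transformation invariance (FSW finds a stable subspace via a Fisher-vs-compression GEVP).
- Model dynamics and triggers rather than static metrics (DFL diffusion + active probing; unlearning treated as a “trigger”).
- Combine theory with empirical stress tests covering attack and defense (FSW robustness tables; DFL ASR/ACC; unlearning collapse).
Open questions / failure modes:
- Watermark threat-model limits: robustness assumes the functional backbone is preserved; payload capacity is about 16 bits (FSW).
- Active-probing cost and non-IID limits in decentralized FL (the DFL defense notes that extreme non-IID breaks separability).
- Unlearning attacks depend on surrogate-model fidelity and are shown mainly on academic graph datasets (Cora/Citeseer/Pubmed/Flickr).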

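3) Technical synthesis

“Structure-first” is a recurring pattern: plans → graphs (GNNVerifier), CoT → steps (SFCoT), memory → clusters (CLAG), video → recursive grids (VideoAtlas). The shared bet is that explicit structure yields better diagnostics, gating, and compute control.
Synthetic supervision is becoming the default when fine-grained labels are missing: plan perturbations (REPLACE/DROP/COMPRESS), sandbox scenarios (HAAF), synthetic patients (OpenHospital), medical forgery generation (MedForge-90K).
Verification loops increasingly require acceptance criteria: GNNVerifier accepts an edit only if the graph score improves; SFCoT rewrites/truncates based on stepwise safety scores; PCodeTrans iterates until tests + ASan/BP-Diff pass.
Compute budgets are formalized as first-class knobs: VideoAtlas's depth bound d; RPA's cached biases + training-only controller; CLAG's two-stage retrieval, which reduces search space and latency.
“Information localization” matters in privacy: the VFL work shows that label information concentrates in deeper/upper layers; defenses can be structural (cut-layer placement) rather than noise-only.
Attack realism is increasing: ARES assumes the attacker can set weights/biases (no architecture changes) and uses sparse recovery; unlearning corruption is triggered by legally mandated deletion; p²RAG targets arbitrary top‑k (closer to long-context practice).
Reliability is being measured by variance, not just the mean: BrainBench's accuracy–consistency gap (10.3 percentage points on average) highlights stochastic reasoning as a safety/reliability dimension.
“Judge models” are everywhere but play different roles: scoring (InterveneBench), disclosure scoring (NDAI-zone study), reasoning quality (MedForge), answer adjudication (BrainBench). This raises cross-cutting concerns about judge bias and reproducibility.
Execution-grounded evaluation is a strong differentiator: PCodeTrans uses the original binary + official test suites as the oracle, which offers a template for reducing “semantic hallucinations” in code transformation.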

4) Top 5 papers (with “why now”)

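1) GNNVerifier: Graph-based Verifier for LLM Task Planning

Introduces a graph-structured verifier that scores the overall plan and localizes high-risk nodes/edges (tool/step mismatches, dependency issues).
Uses synthetic perturbations to construct node/edge supervision when real labels are unavailable, which makes it possible to train the diagnostic heads.
Demonstrates verification-guided local edits (replace/insert), accepted only when the verifier score improves; reports consistent gains over VeriPlan across multiple datasets/planners.
Points of skepticism: the synthetic error distribution may not match real planning failures; no online tool-execution evaluation.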

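2) $p^2$RAG: Privacy-Preserving RAG Service Supporting Arbitrary Top-$k$ Retrieval

Replaces secure sorting with interactive bisection, efficiently supporting arbitrary/large k, which fits the long-context LLM trend.
Uses standard MPC primitives (Shamir sharing, Beaver triples, DCFs) and reports a 3–300× speedup over PRAG for k=16–1024.
Gives explicit leakage bounds: physical leakage O(log²N) + functional leakage k+ξ.
Points of skepticism: assumes a trusted dealer plus two non-colluding semi-honest servers; PIR and the offline phase are not benchmarked.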

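3) SFCoT: Safer Chain-of-Thought via Active Safety Evaluation and Calibration

Moves safety forward from final-output filtering to stepwise CoT monitoring, with lexical/semantic/policy scoring and gray-zone calibration.
Reports a substantial jailbreak reduction: ASR 58.97% → 12.31%, while retaining about 91.2% average utility on MMLU/GSM8K/MBPP.
Ablations attribute the gains to the consistency verifier and the rewrite intervention.
Points of skepticism: runtime/latency overhead is not reported; evaluated on a single model only (Qwen3-8B).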
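Stepwise gating of this kind can be sketched as a loop over CoT steps with two thresholds: steps in the gray zone are rewritten, and a clearly unsafe step truncates the chain. Everything specific below is an assumption for illustration: the thresholds, the keyword-based `toy_risk`, and `toy_rewrite` are hypothetical placeholders for SFCoT's scorers and interventions, which are not reproduced here.

```python
from typing import Callable, List


def gate_cot(
    steps: List[str],
    risk: Callable[[str], float],
    rewrite: Callable[[str], str],
    low: float = 0.3,   # below this: keep the step unchanged (assumed value)
    high: float = 0.7,  # at or above this: stop the chain (assumed value)
) -> List[str]:
    gated: List[str] = []
    for step in steps:
        r = risk(step)
        if r >= high:
            gated.append("[truncated: unsafe step]")
            break  # nothing after an unsafe step is emitted
        # Gray zone: keep the chain going but replace the risky step.
        gated.append(rewrite(step) if r >= low else step)
    return gated


# Toy scorer and rewriter (hypothetical; a real system would use learned models).
RISK_WEIGHTS = {"bypass": 0.5, "weapon": 0.9}


def toy_risk(step: str) -> float:
    text = step.lower()
    return max((w for kw, w in RISK_WEIGHTS.items() if kw in text), default=0.0)


def toy_rewrite(step: str) -> str:
    return "[rewritten] Decline to elaborate on this step."


steps = [
    "Restate the question.",
    "Explain how to bypass the filter.",
    "Describe the weapon design.",
    "State the final answer.",
]
for line in gate_cot(steps, toy_risk, toy_rewrite):
    print(line)
```

Logging the per-step `risk` values alongside the gated output is what makes it possible to compare ASR and utility with and without gating.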

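4) PCodeTrans: Translate Decompiled Pseudocode to Compilable and Executable Equivalent

Proposes in-situ substitutable execution: hot-swaps the repaired function into the original binary and uses real execution as the equivalence oracle.
Uses ASan (on the substituted part only) + breakpoint-matched differential tracing to produce actionable runtime differences that drive iterative LLM repair.
Achieves 100% function-level compilation and ~99.6–99.9% behavioral equivalence on coreutils/binutils (unstripped).
Points of skepticism: platform-specific (Linux ELF/x86_64); indirect-call signature recovery and standalone recompilation remain hard.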
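The differential-tracing idea can be illustrated by comparing two recorded traces and reporting the first point where they diverge. This is a simplified sketch, not PCodeTrans's tooling: the trace format (a list of breakpoint labels with captured variable values) and the function name are assumptions, and no debugger is involved.

```python
from typing import Any, Dict, List, Optional, Tuple

# Hypothetical trace format: one (breakpoint label, captured state) pair per hit.
Trace = List[Tuple[str, Dict[str, Any]]]


def first_divergence(reference: Trace, candidate: Trace) -> Optional[str]:
    """Return a human-readable description of the first mismatch, or None."""
    for i, ((bp_ref, st_ref), (bp_cand, st_cand)) in enumerate(zip(reference, candidate)):
        if bp_ref != bp_cand:
            return f"step {i}: reached {bp_cand!r}, expected {bp_ref!r}"
        for name in sorted(st_ref):
            if st_cand.get(name) != st_ref[name]:
                return (f"step {i} at {bp_ref!r}: "
                        f"{name}={st_cand.get(name)!r}, expected {st_ref[name]!r}")
    if len(reference) != len(candidate):
        return f"trace length {len(candidate)}, expected {len(reference)}"
    return None


reference = [("entry", {"n": 3}), ("loop", {"i": 0, "acc": 0}), ("exit", {"ret": 6})]
candidate = [("entry", {"n": 3}), ("loop", {"i": 0, "acc": 1}), ("exit", {"ret": 7})]
print(first_divergence(reference, candidate))
```

Reporting the earliest mismatch rather than the final wrong return value is what makes the difference actionable: it points a repair step at the state where behavior first departs from the original.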

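5) Mechanistic Origin of Moral Indifference in Language Models

Diagnoses “moral indifference” as a latent-geometry problem (category/gradient/structure/dimension) and analyzes it against prototype-based moral-vector ground truth.
Uses SAEs + targeted feature fine-tuning + additive steering to improve adversarial safety results on Flames (e.g., PSC1 908→953; peak win rate 75.4%).
Connects mechanistic interpretability to alignment by demonstrating causal interventions on internal features.
Points of skepticism: interventions are shown mainly on Qwen3-8B; only very few SAE features correlate with moral dimensions; steering is sensitive to α.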
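Additive steering in its generic form adds a scaled direction to a hidden state, h' = h + α·v. The sketch below shows that operation on toy vectors; the direction stands in for an SAE feature direction, and the vectors, the unit-normalization step, and the function names are assumptions rather than the paper's setup.

```python
import math
from typing import List, Sequence


def unit(direction: Sequence[float]) -> List[float]:
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0:
        raise ValueError("steering direction must be non-zero")
    return [d / norm for d in direction]


def additive_steer(hidden: Sequence[float], direction: Sequence[float], alpha: float) -> List[float]:
    """Return hidden + alpha * unit(direction)."""
    u = unit(direction)
    return [h + alpha * x for h, x in zip(hidden, u)]


def projection(hidden: Sequence[float], direction: Sequence[float]) -> float:
    """Component of hidden along the (normalized) direction."""
    return sum(h * x for h, x in zip(hidden, unit(direction)))


hidden = [1.0, 2.0, 0.0]      # toy hidden state
direction = [0.0, 3.0, 4.0]   # toy feature direction (hypothetical)
steered = additive_steer(hidden, direction, alpha=2.0)
print(projection(hidden, direction), projection(steered, direction))
```

With a unit-normalized direction, the projection along that direction shifts by exactly α, which gives a direct handle for sweeping α and observing where behavior starts to degrade.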

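5) Practical next steps

If you are building tool-calling agents: prototype a plan-graph verifier that outputs node/edge risk, and use it to drive local edits with an acceptance test (the score must improve), in line with GNNVerifier.
For jailbreak defense in CoT-enabled systems: compare ASR with and without stepwise CoT gating; log stepwise safety scores and quantify utility retention on core tasks (SFCoT-style).
For private RAG: assess whether the product needs dynamic/large top‑k; if so, benchmark threshold/bisection retrieval against sort-based secure top‑k under realistic RTT and PIR costs (p²RAG indicates what to measure).
For federated/vertical FL deployments: run an MI-by-layer diagnostic to locate where label information concentrates, then test moving the cut layer earlier as a zero-overhead mitigation, while also measuring feature-leakage risk (the trade-off noted in the VFL paper).
For long-context memory in small agents: try in-cluster local memory evolution + two-stage retrieval while tracking answer quality and latency; ablate “local evolution vs. global retrieval” (CLAG).
For evaluation: add multi-run consistency (not just accuracy) to internal reasoning benchmarks (the BrainBench protocol), and add domain shift + adversarial rewriting whenever you rely on AI-text detectors.
For provenance/IP: if distributed models may be quantized/distilled, test subspace-watermark robustness under the actual transformation pipeline and keep the payload modest (FSW suggests a practical capacity of ~16 bits).

Generated from per-paper analyses; no external browsing was performed.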

Di Tang
