AI 论文日报（2026-03-12）

Published: March 12, 2026

English version: /paper-news/2026-03-12/

运行统计

候选论文: 252
入选论文: 30
已精读完成: 30
时间窗口 (UTC): 2026-03-10T00:00:00Z → 2026-03-11T00:00:00Z (arxiv_announce, expanded=0)

展开查看用于总结的论文列表

arXiv ID	标题 / 链接	分类	评分	入选理由	标签
`2603.09772`	Removing the Trigger, Not the Backdoor: Alternative Triggers and Latent Backdoors PDF	cs.CV, cs.CR	95	Shows backdoors persist via alternative triggers; defenses removing known triggers can fail	backdoors, adversarial-ML, security, representation-learning, robustness
`2603.09706`	OOD-MMSafe: Advancing MLLM Safety from Harmful Intent to Hidden Consequences PDF	cs.AI	94	Consequence-driven MLLM safety benchmark; finds causal blindness and alignment ceiling in frontier models	multimodal-safety, benchmark, agent-safety, causal-reasoning, evaluation, robustness
`2603.09884`	Benchmarking Political Persuasion Risks Across Frontier Large Language Models PDF	cs.CL, cs.CY	94	Large-N benchmark shows frontier LLMs beat ads at persuasion; high societal misuse relevance	political-persuasion, misuse, evaluation, survey-experiments, frontier-models
`2603.09246`	Reasoning-Oriented Programming: Chaining Semantic Gadgets to Jailbreak Large Vision Language Models PDF	cs.CR	93	New jailbreak class for LVLMs via compositional reasoning; ROP-style chaining of benign premises into harm	jailbreaks, multimodal, adversarial-attacks, compositionality, security, red-teaming
`2603.09046`	FlexServe: A Fast and Secure LLM Serving System for Mobile Devices with Flexible Resource Isolation PDF	cs.CR, cs.LG, cs.OS	93	Secure on-device LLM serving w/ TrustZone-style isolation; strong systems+security relevance.	llm-serving, mobile, trusted-execution, isolation, confidential-inference, systems-security
`2603.09781`	CLIOPATRA: Extracting Private Information from LLM Insights PDF	cs.CR	92	Privacy attack on “privacy-preserving” LLM insights; shows realistic data insertion can induce leakage	privacy, data-exfiltration, security, LLM-systems, auditing, attack
`2603.09203`	Evaluate-as-Action: Self-Evaluated Process Rewards for Retrieval-Augmented Agents PDF	cs.AI	91	Process rewards for retrieval-augmented agents; explicit eval action + GRPO variant for reliability.	agents, rag, process-supervision, reinforcement-learning, reliability, evaluation
`2603.09957`	Think Before You Lie: How Reasoning Improves Honesty PDF	cs.AI, cs.CL, cs.LG	91	Finds reasoning increases LLM honesty; probes mechanisms via representation geometry	honesty, deception, reasoning, mechanistic-analysis, behavior
`2603.09157`	Real-Time Trust Verification for Safe Agentic Actions using TrustBench PDF	cs.AI	90	TrustBench verifies agent actions pre-execution; shifts from post-hoc eval to real-time safety gating	agents, runtime-guardrails, verification, evaluation, trustworthiness, tool-use
`2603.09036`	SCALAR: Learning and Composing Skills through LLM Guided Symbolic Planning and Deep RL Grounding PDF	cs.LG	90	LLM-guided symbolic planning + RL grounding with feedback loop; strong agent skill learning	agents, LLM-planning, reinforcement-learning, skill-learning, tool-use
`2603.09875`	The Bureaucracy of Speed: Structural Equivalence Between Memory Consistency Models and Multi-Agent Authorization Revocation PDF	cs.MA, cs.CR, cs.DC	89	Agent auth revocation framed as coherence; bounds unauthorized ops under fast agent execution	agent-security, access-control, capabilities, revocation, distributed-systems
`2603.09337`	Beyond Scaling: Assessing Strategic Reasoning and Rapid Decision-Making Capability of LLMs in Zero-sum Environments PDF	cs.CV, cs.AI	88	STAR benchmark tests LLMs as adversarial agents in zero-sum, real-time/turn-based settings.	agent-evals, adversarial, multi-agent, benchmark, strategic-reasoning, red-teaming
`2603.09731`	EXPLORE-Bench: Egocentric Scene Prediction with Long-Horizon Reasoning PDF	cs.CV, cs.AI, cs.CL	88	New benchmark for long-horizon egocentric action→scene prediction; targets embodied MLLM limits	benchmark, embodied-agents, multimodal, long-horizon-reasoning, evaluation
`2603.09134`	AgenticCyOps: Securing Multi-Agentic AI Integration in Enterprise Cyber Operations PDF	cs.CR, cs.MA, cs.SE	86	Enterprise multi-agent security framework; decomposes attack surfaces around tool orchestration & memory	agent-security, enterprise, threat-modeling, memory, tooling, architecture
`2603.09309`	Rescaling Confidence: What Scale Design Reveals About LLM Metacognition PDF	cs.AI	86	Shows confidence scales distort metacognition; proposes better scale improving meta-d'	uncertainty, calibration, metacognition, evaluation, confidence
`2603.09127`	Chaotic Dynamics in Multi-LLM Deliberation PDF	cs.AI, cs.MA	85	Shows multi-LLM committees can be chaotic even at T=0; empirical Lyapunov analysis of instability routes	multi-agent, deliberation, stability, evaluation, reproducibility, dynamics
`2603.09065`	Learning Adaptive LLM Decoding PDF	cs.LG	85	Learns adaptive decoding policies via RL without model finetune; compute-aware inference gains	decoding, inference, test-time, rl, efficiency
`2603.09452`	CyberThreat-Eval: Can Large Language Models Automate Real-World Threat Research? PDF	cs.CR, cs.CL	84	Realistic CTI/OSINT workflow benchmark for LLM threat research (triage→search→draft), beyond MCQ metrics	cybersecurity, evaluation, agents, OSINT, benchmark, workflows
`2603.09435`	AI Act Evaluation Benchmark: An Open, Transparent, and Reproducible Evaluation Dataset for NLP and RAG Systems PDF	cs.AI	84	Open dataset for evaluating NLP/RAG systems against EU AI Act-style compliance requirements.	rag, evaluation, governance, compliance, dataset, auditability
`2603.09297`	TA-Mem: Tool-Augmented Autonomous Memory Retrieval for LLM in Long-Term Conversational QA PDF	cs.IR, cs.CL	84	Tool-augmented autonomous memory retrieval for long-term conversational QA beyond top-k	memory, agents, retrieval, long-context, conversational-QA
`2603.09906`	Thinking to Recall: How Reasoning Unlocks Parametric Knowledge in LLMs PDF	cs.CL	84	Controlled study: reasoning boosts single-hop factual recall; proposes buffer/mechanism story	parametric-knowledge, reasoning, factuality, mechanisms, elicitation
`2603.09951`	Towards a Neural Debugger for Python PDF	cs.LG, cs.AI, cs.SE	83	Neural debugger trained on execution traces; enables interactive stepping/breakpoints for code LMs.	code-llms, tooling, debugging, execution-traces, reliability, agents
`2603.09184`	Latent-DARM: Bridging Discrete Diffusion And Autoregressive Models For Reasoning PDF	cs.LG, cs.AI	83	Bridges diffusion planners with AR executors; improves reasoning via latent communication	planning, diffusion-lm, agents, reasoning, coordination
`2603.09434`	Common Sense vs. Morality: The Curious Case of Narrative Focus Bias in LLMs PDF	cs.CL, cs.AI	82	CoMoral benchmark reveals narrative focus bias: models favor moral reasoning over commonsense consistency	reliability, benchmark, bias, commonsense, moral-reasoning, evaluation
`2603.09544`	Compartmentalization-Aware Automated Program Repair PDF	cs.CR	82	LLM-based automated program repair targeting cross-compartment interface security vulnerabilities.	cybersecurity, program-repair, llm-for-security, compartmentalization, vulnerability-mitigation
`2603.09296`	Diagnosing and Repairing Citation Failures in Generative Engine Optimization PDF	cs.IR, cs.CL	82	Taxonomy + agentic diagnosis/repair of citation failures in GEO; practical grounding/citations	citations, RAG, evaluation, agentic-systems, information-retrieval
`2603.09970`	CREATE: Testing LLMs for Associative Creativity PDF	cs.CL	82	New benchmark for associative creativity with objective grading; useful for capability evals	benchmark, creativity, evaluation, associative-reasoning, knowledge
`2603.09249`	Social-R1: Towards Human-like Social Reasoning in LLMs PDF	cs.AI	81	Adversarial social-reasoning benchmark + RL framework to reduce shortcuts in ToM-style tasks.	alignment, social-reasoning, theory-of-mind, rl, benchmark, robustness
`2603.09652`	MiniAppBench: Evaluating the Shift from Text to Interactive HTML Responses in LLM-Powered Assistants PDF	cs.AI	80	MiniAppBench targets interactive HTML app generation; 500 tasks distilled from 10M+ real generations	benchmark, code-generation, HCI, agents, web, evaluation
`2603.09206`	MM-Zero: Self-Evolving Multi-Model Vision Language Models From Zero Data PDF	cs.CV, cs.LG	80	Zero-data self-evolving VLM reasoning via multi-role RL framework; potentially high-impact method.	vlm, self-improvement, rl, synthetic-data, reasoning, frontier

AI 论文洞察简报

2026-03-12

0) 执行要点（先读这个）

“闭环（Closed-loop）”是当天的主导模式：多篇论文从一次性提示转向执行反馈闭环——由轨迹驱动精炼的 LLM 规划（SCALAR）、将检索自评变成可执行动作并用 RL 进行信用分配（EVALACT）、以及用于网络威胁研究的人类/真值在环验证（CyberThreat-Eval/TRA）。
智能体安全正从“意图”转向“后果”与“运行时门控（runtime gating）”：OOD-MMSafe/CASPO 针对多模态场景中的潜在下游危害；TrustBench 提出亚 200ms 的执行前信任验证，据称可将有害动作减少 87%。
多智能体系统即使在 T=0 也存在治理级不稳定性：多 LLM 协商可能呈现正的经验 Lyapunov 指数；角色结构（尤其是 Chair）与记忆窗口会实质性改变发散程度（Chaotic Dynamics in Multi-LLM Deliberation）。
安全研究凸显“潜在（latent）”失效面：LVLM 越狱可由良性“语义小工具（semantic gadgets）”组合而成（VROP）；后门在移除触发器后仍可通过特征空间中的替代触发器持续存在（Removing the Trigger, Not the Backdoor）；“隐私保护”的洞察流水线可通过投毒 + 提示注入被外泄（CLIOPATRA）。
推理时控制正成为一等优化目标：学习到的预算感知解码适配器可将 MATH 的 Pass@1 提升最高约 ~10.2 点（Learning Adaptive LLM Decoding）；而置信度诱导本身被证明对量表尺度敏感（在 [0,20] 上的元认知效率优于 [0,100]）（Rescaling Confidence）。
基准测试越来越“交互式、端到端”：从实时零和博弈（STAR）到 Playwright 测试的 HTML MiniApps（MiniAppBench）再到长时域第一视角场景预测（EXPLORE-Bench），评估正走向更像智能体的设置，其中延迟与动态性至关重要。

2) 关键主题（聚类）

主题：闭环智能体（反馈、验证与信用分配）

重要性：一次性 LLM 规划/检索在隐藏前置条件、噪声证据与稀疏奖励下容易失败。通过执行反馈或验证闭环可提升鲁棒性与样本/算力分配效率。
代表论文：
共同方法：
- 将中间判断转化为显式工件（符号算子；Evaluate 动作；结构化量表；信任向量）。
- 使用迭代闭环（轨迹分析、GRPO/PCAR 重标定、人类/真值检查如 VirusTotal）。
- 将算力重新分配到当前瓶颈（Frontier Checkpointing；过程优势重标定；智能体工具迭代）。
开放问题 / 失效模式：
- 自评可能失准或被“刷分”；严格协议（如 Search→Evaluate）可能降低自主性。
- 验证组件成为新的攻击面（验证器被攻破、延迟/成本爆炸）。
- 超出已评估领域（Craftax/RAG QA/CTI 数据集）的泛化仍不确定。

主题：多模态与多智能体安全，超越“明显的坏输入”

重要性：安全失败越来越来自组合（良性部件 → 有害整体）与后果投射（良性查询 + 上下文 → 潜在危害），而不仅是显式恶意意图。
代表论文：
共同方法：
- 为细微失效模式构建定向基准（后果驱动危害；组合式越狱；道德+常识矛盾；稳定性/混沌指标）。
- 使用超越通过/失败的结构化评分（R/S/E 三分安全；ASR；Lyapunov 风格发散）。
- 测试会改变动力学的干预（CASPO 混合 token/结果优化；角色消融；缩短记忆窗口）。
开放问题 / 失效模式：
- 改进是否保持可用性（例如降低不稳定性但不损害决策质量；避免“拒答坍塌”）。
- 依赖自动评审与“宪法”可能引入偏差/噪声。
- 攻击者可适应：小工具挖掘 + 进化式提示搜索；由于系统非确定性，委员会不稳定性在 T=0 仍会持续。

主题：面向智能体/LLM 部署的系统安全与隐私

重要性：随着 LLM 走向端侧与企业工作流，威胁模型扩展：内核被攻破、工具/记忆集成面、以及聚合“洞察”产品都成为高价值目标。
代表论文：
共同方法：
- 重新设计隔离边界（页粒度安全内存；NPU 沙箱；能力范围限定；可验证执行闭环）。
- 用运维术语量化风险（TTFT；信任边界计数；未授权操作随 v·TTL 的缩放）。
- 用具体原型或结构化分析评估（RK3588 原型；边界枚举；带 RCC 边界的模拟器）。
开放问题 / 失效模式：
- 启发式隐私审计可能灾难性失效（在 CLIOPATRA 成功泄露下，审计器报告 0 违规）。
- 集中式强制机制可能成为单点故障（AgenticCyOps Host；RCC 框架中的权威服务）。
- 可移植性/泛化：单 SoC 评估；闭源加速器运行时；部署摩擦。

主题：推理时控制、元认知与“推理作为旋钮”

重要性：许多增益来自控制推理时过程（解码、置信度诱导、推理轨迹）而非重训基座模型——对部署约束与安全门控很有价值。
代表论文：
共同方法：
- 在冻结 LLM 上学习轻量控制器（REINFORCE 解码适配器；量表设计提示）。
- 用 pass@k / 校准 / meta-d’ 衡量能力与不确定性质量。
- 机制分析：compute-buffer vs 语义启动；欺骗的表征亚稳态。
开放问题 / 失效模式：
- 训练稳定性与有限动作空间（token 级适配器仅用温度；mask 启发式）。
- 推理轨迹可能引入幻觉中间事实，降低最终正确性。
- 置信度报告存在锚定偏差；在离散化下 ECE 可能具有误导性。

主题：面向交互式工件与长时域动态的基准

重要性：静态 QA 式基准会错过交互、时间约束或长时域设置中的失效模式，而这些更接近真实智能体部署。
代表论文：
共同方法：
- 用更丰富指标评估端到端行为（PWER；Intention/Static/Dynamic；对象/属性/关系评分；创造性效用）。
- 纳入测试时扩展分析（实时 vs 回合制；多轮分段；提示变体）。
- 使用智能体式评估器（Playwright 探索；结构化评审）。
开放问题 / 失效模式：
- 在实时模式中成本/延迟权衡占主导（策略–执行鸿沟；多轮开销）。
- 评估器可靠性与评审依赖（多个基准使用 LLM-as-judge）。
- 基准提升是否能迁移到真实部署。

3) 技术综合

执行反馈正在被“编译”为训练信号：SCALAR 从成功轨迹中精炼类 STRIPS 算子；EVALACT 将检索评估变成动作并重标定 GRPO 优势（PCAR）。
运行时门控正在收敛到结构化、多信号评分：TrustBench 将校准后的置信度映射（等渗回归）与运行时检查结合，输出 allow/warn/deny 决策；AgenticCyOps 使用可验证执行 + 记忆完整性原则，尽早拦截攻击链。
稳定性/鲁棒性越来越被视为可度量的系统属性：Lyapunov 风格发散（多 LLM 委员会）与其他“动态感知”评估相呼应（STAR 的实时 vs 回合制重排）。
“潜在空间（latent-space）”框架在多领域反复出现：潜在后门区域支持替代触发器；潜在规划器→执行器通信（Latent-DARM）以避免文本流畅性瓶颈；后果驱动安全聚焦潜在危害。
推理时策略正在用可验证奖励学习：自适应解码适配器通过 REINFORCE 在正确性检查上训练；Social-R1 使用轨迹级奖励（SIP 阶段）与评审/RM；MM-Zero 使用 RLVR/GRPO 与执行、自一致信号。
推理 token 既是能力放大器也是风险因子：推理通过 compute-buffer + 事实启动提升事实回忆，但幻觉中间事实与更差的最终正确性相关；推理也提升诚实性并揭示欺骗的亚稳态。
安全评估强调攻击者适应性与流水线级攻击：VROP 使用进化式提示优化；CLIOPATRA 将投毒 + 提示注入串联穿过抽取→聚类→摘要→审计；移除触发器并不能移除后门。
延迟与资源隔离是一等约束：FlexServe 表明页粒度安全内存与 NPU 沙箱可相对 TrustZone 稻草人显著降低 TTFT；STAR 显示实时模式会因推理延迟而改写排行榜。
基准正在加入“文档中心”与“工件中心”的泛化：AgentGEO 的 MIMIQ 在按文档留出查询上评估引用率；MiniAppBench 在探索下评估可执行 HTML 行为而非静态正确性。

4) Top 5 论文（含“为何是现在”）

1) SCALAR: Learning and Composing Skills through LLM Guided Symbolic Planning and Deep RL Grounding

用从执行反馈精炼的类 STRIPS 算子，弥合 LLM 规划 ↔ RL 控制鸿沟。
在 Craftax 上的强长时域结果：Craftax-Classic 88.2% diamond，对比最佳基线 46.9%；在完整 Craftax 上 9.1% Gnomish Mines，而此前方法为 0%。
Frontier Checkpointing 将帧数重新分配到深层前置条件；轨迹分析至关重要（移除后 Mines 成功率降至 0%）。
质疑点：需要预定义符号抽象/词表；checkpointing 假设状态可序列化，且可能牺牲多样性。

2) FlexServe: A Fast and Secure LLM Serving System for Mobile Devices with Flexible Resource Isolation

在内核被攻破的情况下，通过页粒度 Flex-Mem + Flex-NPU 沙箱实现端侧机密性/完整性。
显著延迟收益：相对 TrustZone 稻草人 10.05× TTFT；8GB 分配 568 ms vs 6441 ms CMA 基线；多模型工作流最高 24.30× 更快。
按需保护可在空闲时移除虚拟化开销。
质疑点：单 SoC 原型；无侧信道/物理/DoS 防护；normal-world 客户端 I/O 不受内核被攻破场景保护。

3) OOD-MMSafe: Advancing MLLM Safety from Harmful Intent to Hidden Consequences

将多模态安全重构为后果投射（consequence projection）；引入 OOD-MMSafe (455) 与 R/S/E 三分评分。
CASPO（token 级宪法条件自蒸馏 + 结果奖励）据称将失败率 R0 降至 7.3% / 5.7%（两种 backbone）。
诊断静态偏好对齐的“偏好天花板”与负迁移（例如某模型 DPO 后据称 −1.5%）。
质疑点：基准规模有限；依赖自动评审（据称与人类一致性 86.5%）且对超参敏感（λ 极端会导致熵坍塌）。

4) CLIOPATRA: Extracting Private Information from LLM Insights

对“隐私保护”洞察流水线（分面抽取 → 聚类 → 摘要 → 审计）的端到端黑盒攻击。
报告在最少先验知识（年龄/性别/1 个症状）下 39% 疾病抽取，对比 22% 基线；更多知识/更强模型下可接近近 100%。
显示 LLM 隐私审计器可能完全失效（泄露的聚类中 0 个被检测为违规）。
质疑点：评估使用合成医疗聊天混合 WildChat；真实世界运维约束（账号摩擦/检测）未充分建模。

5) Chaotic Dynamics in Multi-LLM Deliberation

提供用经验 Lyapunov 估计量 hatλ 审计委员会稳定性的具体方法，显示即使在 T=0 也会发散。
识别两条可分离的不稳定路径（角色、异质性）与一个关键放大器（Chair）；更短记忆窗口可降低发散。
强调 T=0 下的服务端非确定性（≈40–50% 调用出现非零解析方差）。
质疑点：降低不稳定性尚未与决策质量建立联系；部分场景效应不确定性较大，且工件省略了一些失败类型日志。

5) 实用下一步

如果你在做智能体：在工具闭环中加入显式 Evaluate/Verify 步骤（如 Search→Evaluate），并记录逐步自评分；测试优势重标定（PCAR 风格）是否提升多跳可靠性。
多模态安全：在后果驱动案例（OOD-MMSafe 风格）上评估，而非仅看意图；分别度量 R/S/E 以检测“安全但无效”的坍塌。
多智能体治理：在 T=0 下做重复运行的稳定性审计；消融角色（尤其是 “Chair”）并缩短记忆窗口，量化发散（hatλ）的变化。
隐私/洞察产品：将聚类+摘要流水线视为对抗面；测试投毒攻击，不要仅依赖 LLM 审计器——考虑正式隐私机制（DP），并在定向攻击下测量泄露。
后门防御：评估防御对替代触发器（特征引导攻击）的抵抗，而不只是已发现触发器的 ASR；加入潜在空间诊断（方向插值）。
推理时优化：在你的领域尝试预算感知解码适配器；另外，如需诱导置信度，测试 [0,20] vs [0,100] 量表并报告 meta-d’/Mratio（不只 ECE）。
实时智能体部署：在无约束与时间约束两种模式下做基准（STAR 风格）以暴露策略–执行鸿沟；将延迟作为一等指标跟踪。
端侧部署：若 TrustZone 过于僵硬，评估页粒度隔离设计与加速器沙箱；在内存压力与多模型调度下测量 TTFT。

由逐篇论文分析生成；无外部浏览。

Di Tang

AI 论文洞察简报

2026-03-12

0) 执行要点（先读这个）

2) 关键主题（聚类）

主题：闭环智能体（反馈、验证与信用分配）

主题：多模态与多智能体安全，超越“明显的坏输入”

主题：面向智能体/LLM 部署的系统安全与隐私

主题：推理时控制、元认知与“推理作为旋钮”

主题：面向交互式工件与长时域动态的基准

3) 技术综合

4) Top 5 论文（含“为何是现在”）

5) 实用下一步