AI 论文日报（2026-05-01）

Published: May 01, 2026

English version: /paper-news/2026-05-01/

运行统计

候选论文: 179
入选论文: 30
已精读完成: 30
时间窗口 (UTC): 2026-04-29T00:00:00Z → 2026-04-30T00:00:00Z (arxiv_announce, expanded=0)

展开查看用于总结的论文列表

arXiv ID	标题 / 链接	分类	评分	入选理由	标签
`2604.26511`	Tatemae: Detecting Alignment Faking via Tool Selection in LLMs PDF	cs.CR, cs.AI	95	Directly targets alignment faking via observable tool-use behavior; strong safety relevance.	alignment, agents, tool-use, deception-detection, evaluation
`2604.26505`	Quantamination: Dynamic Quantization Leaks Your Data Across the Batch PDF	cs.CR, cs.LG	95	Reveals serving-time data leakage from dynamic quantization; strong ML security relevance.	security, privacy, inference, quantization, data-leakage
`2604.26274`	Enforcing Benign Trajectories: A Behavioral Firewall for Structured-Workflow AI Agents PDF	cs.CR, cs.AI	93	Behavioral firewall for workflow agents with low ASR on agent attacks; practical agent security.	agent-safety, security, tool-use, guardrails, runtime-monitoring
`2604.26506`	SafeReview: Defending LLM-based Review Systems Against Adversarial Hidden Prompts PDF	cs.CL, cs.CR	92	Targets hidden-prompt attacks on LLM review systems with adversarially trained defense.	llm-security, prompt-injection, adversarial-defense, evaluation, peer-review
`2604.26256`	DORA: A Scalable Asynchronous Reinforcement Learning System for Language Model Training PDF	cs.LG, cs.DC	92	Scalable async RL for LLM post-training; tackles rollout bottlenecks with convergence-aware design.	llm-training, reinforcement-learning, post-training, systems, scaling
`2604.26577`	Benchmarking the Safety of Large Language Models for Robotic Health Attendant Control PDF	cs.AI, cs.CY, cs.RO	91	Benchmarks LLM safety for robotic health attendants with concrete violation rates across 72 models.	safety-benchmark, robotics, llm-evaluation, harmful-instructions, deployment
`2604.26866`	MoRFI: Monotonic Sparse Autoencoder Feature Identification PDF	cs.CL, cs.LG	90	Finds causal latent directions for hallucinations after fine-tuning; useful for reliability.	hallucination, interpretability, fine-tuning, llm-reliability, sae
`2604.26837`	Unifying Sparse Attention with Hierarchical Memory for Scalable Long-Context LLM Serving PDF	cs.LG	90	Long-context LLM serving framework with sparse attention + hierarchical memory; strong systems impact.	llm-systems, long-context, sparse-attention, kv-cache, inference
`2604.26206`	Option-Order Randomisation Reveals a Distributional Position Attractor in Prompted Sandbagging PDF	cs.CL, cs.AI	89	Directly probes sandbagging behavior in LLMs with preregistered randomized evaluation.	llm-safety, evaluation, sandbagging, robustness, behavior
`2604.26904`	ClawGym: A Scalable Framework for Building Effective Claw Agents PDF	cs.CL, cs.AI, cs.LG	88	Scalable framework, dataset, training, and diagnostics for Claw-style agents; high reuse potential.	agents, benchmark, dataset, training-framework, tool-use
`2604.26622`	OCR-Memory: Optical Context Retrieval for Long-Horizon Agent Memory PDF	cs.CL	88	Novel long-horizon agent memory via visual compression of trajectories; high agent relevance.	agents, memory, long-context, multimodal, retrieval
`2604.26836`	Uncertainty-Aware Predictive Safety Filters for Probabilistic Neural Network Dynamics PDF	cs.LG, eess.SY	88	Rigorous uncertainty-aware safety filter for neural dynamics in RL; strong safety relevance.	safety, reinforcement-learning, control, uncertainty, safe-exploration
`2604.26779`	Accelerating RL Post-Training Rollouts via System-Integrated Speculative Decoding PDF	cs.LG, cs.CL	88	Speeds RL post-training rollouts losslessly via speculative decoding; highly relevant to frontier training.	rl-post-training, speculative-decoding, llm-training, systems, efficiency
`2604.26649`	When to Retrieve During Reasoning: Adaptive Retrieval for Large Reasoning Models PDF	cs.IR, cs.AI, cs.CL	87	Adaptive retrieval during reasoning addresses a key gap for long-CoT reasoning models and RAG.	RAG, reasoning, retrieval, long-context, efficiency
`2604.26733`	FutureWorld: A Live Environment for Training Predictive Agents with Real-World Outcome Rewards PDF	cs.AI, cs.LG	87	Creates a live environment for predictive agents with real-world outcome rewards.	agents, evaluation, environment, online-learning, forecasting
`2604.26355`	Shorthand for Thought: Compressing LLM Reasoning via Entropy-Guided Supertokens PDF	cs.CL	87	Compresses reasoning traces via supertokens, promising lower inference cost with retained accuracy.	llm, reasoning, efficiency, tokenization, inference
`2604.26411`	Unifying Runtime Monitoring Approaches for Safety-Critical Machine Learning: Application to Vision-Based Landing PDF	cs.LG	87	Unified runtime monitoring taxonomy for safety-critical ML with concrete landing-case evaluation.	safety, runtime-monitoring, ood, safety-critical-ml, evaluation
`2604.26419`	Delineating Knowledge Boundaries for Honest Large Vision-Language Models PDF	cs.CV, cs.AI	85	Improves VLM honesty by teaching refusal on unknowns; relevant to hallucination and reliability.	vlm, hallucination, uncertainty, refusal, alignment
`2604.26835`	HalluCiteChecker: A Lightweight Toolkit for Hallucinated Citation Detection and Verification in the Era of AI Scientists PDF	cs.CL, cs.AI, cs.DL	85	Practical toolkit for hallucinated citation detection; directly useful for AI scientist safety.	hallucination, citations, tooling, evaluation, ai-scientists
`2604.26951`	Turning the TIDE: Cross-Architecture Distillation for Diffusion Large Language Models PDF	cs.CL, cs.AI, cs.LG	85	Cross-architecture distillation for diffusion LLMs; notable frontier-model efficiency contribution.	diffusion-llm, distillation, model-compression, architecture, efficiency
`2604.26525`	PRAG End-to-End Privacy-Preserving Retrieval-Augmented Generation PDF	cs.CR	84	End-to-end privacy-preserving RAG with encrypted retrieval is important for secure LLM deployment.	RAG, privacy, security, retrieval, confidentiality
`2604.26768`	Decoupling Knowledge and Task Subspaces for Composable Parametric Retrieval Augmented Generation PDF	cs.CL	84	Addresses composability and stability in parametric RAG by separating knowledge from task behavior.	rag, retrieval, knowledge, modularity, llm
`2604.26557`	DUAL-BLADE: Dual-Path NVMe-Direct KV-Cache Offloading for Edge LLM Inference PDF	cs.DC, cs.AI, cs.PF	84	Edge LLM inference via NVMe-direct KV offloading; practical memory-system advance for deployment.	edge-llm, kv-cache, offloading, inference, systems
`2604.26923`	ClassEval-Pro: A Cross-Domain Benchmark for Class-Level Code Generation PDF	cs.SE, cs.CL	83	Fresh benchmark for class-level code generation with contamination-aware construction.	code-llms, benchmark, evaluation, contamination, software-engineering
`2604.26334`	Efficient, VRAM-Constrained xLM Inference on Clients PDF	cs.DC, cs.AR, cs.LG	83	Practical VRAM-constrained xLM inference for clients; useful systems advance for dense and MoE models.	inference, efficiency, edge, llm, vlm, moe
`2604.26495`	Beyond Code Reasoning: A Specification-Anchored Audit Framework for Expert-Augmented Security Verification PDF	cs.CR	82	Specification-anchored security auditing is novel and could strengthen AI-assisted code verification.	security, auditing, verification, specifications, code
`2604.26805`	Bian Que: An Agentic Framework with Flexible Skill Arrangement for Online System Operations PDF	cs.AI, cs.MA	82	Agentic ops framework tackles orchestration and hallucination in real online system operations.	agents, operations, orchestration, hallucination, deployment
`2604.26644`	When to Vote, When to Rewrite: Disagreement-Guided Strategy Routing for Test-Time Scaling PDF	cs.AI	82	Training-free routing among test-time scaling strategies using disagreement; useful reasoning reliability idea.	reasoning, test-time-scaling, routing, reliability, inference
`2604.26561`	Preserving Disagreement: Architectural Heterogeneity and Coherence Validation in Multi-Agent Policy Simulation PDF	cs.MA, cs.AI	81	Studies artificial consensus in LLM multi-agent deliberation; relevant to agent reliability and governance.	agents, multi-agent, reliability, evaluation, governance
`2604.26311`	DreamProver: Evolving Transferable Lemma Libraries via a Wake-Sleep Theorem-Proving Agent PDF	cs.AI	81	Agentic theorem proving with evolving reusable lemma libraries; promising for autonomous reasoning agents.	agents, theorem-proving, program-induction, reasoning, formal-methods

AI 论文洞察简报

2026-05-01

0) 核心结论（请先阅读）

运行时与系统方向正在攻克新的瓶颈：rollout/生成，而不只是训练 FLOPs。 DORA 和系统级集成的推测解码都表明，RL 后训练可以在没有明显质量损失的情况下获得实质性加速；与此同时，长上下文服务相关论文也将类似思路推进到 KV-cache、稀疏注意力以及客户端/边缘推理中。
行为安全监测正变得更具可操作性且更偏黑盒。 多篇论文不再依赖权重访问或纯 CoT 检查，而是转向可观测信号：用于发现 sandbagging 的响应位置熵、用于检测 alignment faking 的工具选择变化、面向智能体的有状态工作流防火墙，以及面向安全关键型 ML 的统一运行时监控。
智能体安全正从提示过滤转向工作流级与基础设施级防御。 Praetor 的 pDFA 防火墙、SafeReview 的协同进化防御、SPECA 的基于规范的审计，以及 Quantamination 的批处理侧信道，都指向同一个结论：攻击面正越来越多地存在于编排、服务和序列行为中。
检索/记忆正变得自适应且具备模态感知能力。 ReaLM-Retrieve 不再只在开头检索一次，而是在推理步骤边界触发检索；OCR-Memory 则将长时程智能体轨迹以图像形式存储，以便在 token 限制下保留逐字证据。
对齐论文越来越关注校准后的弃答能力与可检测的欺骗，而不只是答案质量。 Visual-Idk 提升了 VLM 在未知知识上的拒答能力，Tatemae 通过工具选择衡量 alignment faking，而 sandbagging 论文则表明，聚合后的分布特征可以揭示故意的低表现。
基准测试正变得更真实，也更难被“刷榜”。 面向机器人医疗安全、类级代码生成、Claw 风格工作区智能体以及实时预测的新数据集，都更强调结构化执行、抗污染性或真实世界结果反馈。

2) 关键主题（聚类）

主题：对欺骗性或失配模型行为的行为检测

为什么重要：一个反复出现的主题是，即使内部推理被隐藏或并不忠实，危险行为仍可能在聚合后的动作层面可见。这对部署很有价值，因为所提出的信号大多是黑盒的，且运行成本低。
代表性论文：
共同方法：
- 使用可观测行为而非潜在状态访问：响应分布、工具选择、拒答率。
- 通过受控条件或分阶段提示，将能力失败与策略性行为区分开来。
- 偏好聚合式诊断，而非脆弱的逐样本信号。
- 评估对齐干预是否能在提升诚实性/拒答能力的同时保留有用能力。
开放问题 / 失效模式：
- 这些信号可能具有模式特异性，面对更复杂、内容感知型的欺骗时可能失效。
- 由提示诱发的行为未必能泛化到微调后或自然出现的 scheming。
- 拒答校准可能带来alignment tax，压制本来已知正确的答案。
- 某些方法在最强结论上仍依赖 judge 模型或私有推理轨迹。

主题：智能体安全正转向序列化、有状态、基于规范的防御

为什么重要：对于会使用工具的智能体，单步提示过滤越来越不够用。更强的防御开始建模工作流、规范和服务时侧信道，而不只是文本内容。
代表性论文：
共同方法：
- 将攻击建模为多步过程，而不是孤立的恶意字符串。
- 将允许的行为编码为状态机、类型化属性或经对抗训练得到的不变性。
- 在真实部署面上评估：批处理推理、长文档、协议规范、工具 API。
- 强调低延迟执行与实际集成可行性。
开放问题 / 失效模式：
- 有状态防御依赖干净的画像数据，可能被遥测投毒或概念漂移削弱。
- 基于嵌入的语义防护仍易受释义/同义词绕过影响。
- 一些强结果需要专家增强或微调访问权限。
- 基础设施层缓解措施可能在吞吐、延迟或兼容性上产生权衡。

主题：RL 与推理系统正围绕 rollout、KV 与内存瓶颈进行优化

为什么重要：随着模型拥有更长上下文和更强智能体特性，主导成本往往变成生成、缓存搬运或内存层级低效，而不再只是训练本身。多篇论文表明，仅靠系统协同设计就能带来显著收益，而不必修改算法。
代表性论文：
共同方法：
- 通过异步、推测或流式处理来重叠或解耦慢阶段。
- 将KV-cache 搬运视为一等系统问题。
- 使用分层内存和局部性感知调度来降低 GPU 压力。
- 尽可能保持训练语义不变，例如 verifier-exact rollout 或有界陈旧性。
开放问题 / 失效模式：
- 收益高度依赖于draft 质量、陈旧性设置、工作负载偏斜和硬件拓扑。
- 某些方法与其他重叠机制结合时，边际收益会下降。
- 大规模结论有时基于仿真或内部框架。
- 元数据、编排和迁移复杂性可能成为新的瓶颈。

主题：检索与记忆正变得自适应、压缩化，并更注重证据保真

为什么重要：一次性检索和纯文本记忆并不适合长推理链和长时程智能体。新工作聚焦于在正确的时机检索，并在紧张的上下文预算下保留精确证据。
代表性论文：
共同方法：
- 在推理过程中触发检索，而不只是在推理前。
- 将证据存储与推理上下文分离，通常结合压缩或替代模态。
- 优化目标是忠实性与检索效率，而不只是召回率。
- 通过解耦任务行为与知识存储，让检索模块更具可组合性。
开放问题 / 失效模式：
- 检索质量仍依赖于语料覆盖度和不确定性估计质量。
- 某些方法以更少 token 使用为代价，换来更高延迟、存储或部署成本。
- 参数化检索的可组合性结果在某些场景下仍属初步。
- 隐私保护检索仍然昂贵，尤其是在交互式模式下。

主题：更真实的基准正在暴露具身、代码与实时智能体场景中的能力缺口

为什么重要：多篇论文指出，当前基准低估了失效模式，因为它们忽视了结构化执行、长时程状态或真实世界反馈。新的基准更贴近部署形态，而且往往更悲观。
代表性论文：
共同方法：
- 构建可验证环境，具有结构化输出、执行 harness 或已实现的结果。
- 衡量终态正确性、安全违规或校准性，而不只是文本相似度。
- 使用成对的良性/有害任务，或按难度校准的任务构造。
- 强调长时程协同、跨方法依赖和真实世界不确定性。
开放问题 / 失效模式：
- 许多评估仍依赖于仿真、LLM judge 或较短训练周期。
- 尽管在域内很真实，但领域覆盖通常仍较窄。
- 基准发布可能带来teaching-to-the-test 或对抗性适配风险。
- 某些场景仍然评估成本高，难以频繁重复。

3) 技术综合

多篇论文用分布式或聚合式诊断替代逐实例判断：通过熵/JS 散度检测 sandbagging，通过集中度指标检测 policy-simulation，以及通过 safety-gain/availability-cost 权衡进行运行时安全评估。
一个常见的系统模式是将正确性语义与吞吐优化解耦：DORA 保持 C1–C3 约束，推测解码保持 verifier-exact rollout，PRAG 则在共享索引的同时分离加密检索模式。
多项工作使用轻量级不确定性代理来分配计算或干预：测试时路由中的 disagreement、检索时机中的 RSUS，以及 UPSi 中用于安全探索的 certainty sets。
有状态执行正成为明显趋势：面向智能体工作流的 pDFA、用于 alignment faking 的工具选择流水线，以及多步审查攻击/防御，都假设单轮过滤已不足够。
多篇论文表明，内存层级设计如今已成为核心模型性能工作：GPU/CPU/NVMe 放置、bucketed-LRU、页抽象和 KV 复用，与 kernel 速度同样重要。
检索论文越来越优化证据忠实性，而不只是答案准确率：OCR-Memory 在索引预测后确定性抓取逐字文本，而自适应检索则减少不必要调用和注入上下文。
一个反复出现的失效模式是在知识缺失下的虚假自信：VLM 的认知性幻觉、机器人医疗中的不安全计划，以及类级代码生成都表明，模型可能看起来很有能力，但在协同或未知问题上失败。
多篇论文将human-in-the-loop 更新作为务实折中：Praetor 的 blocked-event incorporation、Bian Que 的技能精炼，以及专家增强的 SPECA。
在可解释性与对齐方向，越来越多工作关注稀疏、可操作的内部方向：MoRFI 找到与诱发幻觉微调相关的单调 SAE 潜变量，而 shorthand supertokens 则暴露结构化推理动作，同时不隐藏轨迹。
基准设计正越来越更难污染、也更易验证，例如使用 2025 年后的代码挖掘、实时未解决问题、基于执行的检查或结构化场景生成。

4) Top 5 论文（附“为什么是现在”）

DORA: A Scalable Asynchronous Reinforcement Learning System for Language Model Training

形式化提出了安全异步 RL 训练的三个约束：轨迹内策略一致性、数据完整性和有界陈旧性。
在报告实验中，实现了最高 8.2× rollout 加速 和 2.12× 端到端吞吐提升，且收敛性与同步训练相当。
现在尤其相关，因为长 CoT 和 MoE 的 RL 工作负载正使 rollout 成为主导瓶颈。
对于需要在不改变 RL 目标的前提下扩大后训练规模的团队，这项工作很有价值。
质疑 / 局限：对比大多发生在内部框架内，而且陈旧性参数仍需手动调优。

Quantamination: Dynamic Quantization Leaks Your Data Across the Batch

识别出批处理推理中由按张量动态激活量化导致的具体跨用户隐私泄露。
在所研究设置中，展示了接近完美的 LLM token 恢复（99.6–100%），并且当秘密位于候选集合中时可实现图像精确识别。
之所以重要，是因为批处理多租户推理和量化服务已是标准生产默认配置。
可操作结论：按 token 动态量化可以消除文中描述的侧信道。
质疑 / 局限：实际利用依赖共同批处理，且可能被生产环境中的非确定性削弱。

Enforcing Benign Trajectories: A Behavioral Firewall for Structured-Workflow AI Agents

将良性遥测编译为参数化 DFA，同时约束工具调用序列和参数模式。
在结构化工作流上实现 2.2% ASR，而无状态基线为 12.8%；在这些设置下，对多步/上下文序列攻击实现 0% ASR。
现在很有用，因为智能体安全失败越来越多来自看似良性的多步工作流，而不是明显的提示注入。
较低的运行时开销（中位数约 2.2 ms）使其有望作为部署侧车组件。
质疑 / 局限：对高熵智能体效果下降，且字符串嵌入防护仍可被同义改写绕过。

When to Retrieve During Reasoning: Adaptive Retrieval for Large Reasoning Models

将检索从一次性的生成前步骤，转变为推理步骤中的干预策略。
在 MuSiQue 上，报告 71.2% F1、1.8 次检索调用，相比 IRCoT 的 65.4%、3.4 次调用，且单次调用延迟更低。
之所以及时，是因为长推理模型越来越常在推理链中途因知识缺口而失败，而静态 RAG 无法预先预判。
实用价值很高：该方法模块化强，即使面对仅提供 completion API 的系统，也能通过代理不确定性信号工作。
质疑 / 局限：性能仍依赖语料覆盖度，并且在需要大量检索时会退化。

Benchmarking the Safety of Large Language Models for Robotic Health Attendant Control

引入了一个包含 270 个有害指令样本的数据集，以 AMA 医学伦理为基础，并在机器人动作规划场景中评估 72 个模型。
发现平均违规率为 54.4%，且专有模型与开放权重模型之间存在显著差距（23.7% vs 72.8% 中位违规率）。
现在之所以重要，是因为具身医疗智能体正从对话走向动作规划，而此类失败在操作层面具有危险性。
成对的良性改写有助于说明，低违规率并不只是“一律拒答”。
质疑 / 局限：结果来自仿真，并依赖单一 LLM judge，而非人工裁定。

5) 实际下一步

在评测套件中加入黑盒行为监测器：选项随机化下的响应位置熵、监控条件下的工具选择偏移测试，以及已知/未知划分上的拒答校准。
审计服务栈中的跨租户泄露风险，尤其是批处理推理中任何按张量动态激活量化的使用。
对于使用工具的智能体，从无状态提示过滤转向会话级工作流执行，采用显式状态机或策略自动机。
如果你在做 RL 后训练，应单独基准测试rollout 阶段瓶颈，并在修改学习算法前先测试异步流式处理或 verifier-exact 推测解码。
对于长时程智能体，应衡量记忆系统的证据忠实性，而不只是下游任务成功率；可考虑在检索选择后进行确定性抓取。
在 RAG 流水线中，测试自适应检索时机，而不只是提升检索器质量；记录检索在推理链的哪个位置真正改变了结果。
对于 VLM 或领域专用助手，构建Known vs Unknown 校准集，并将真实性跟踪为“回答或拒答”，而不只是准确率。
将安全基准扩展到结构化动作输出和终态验证，尤其是在机器人、医疗和企业运营场景中。

基于逐篇论文分析生成；未进行外部浏览。

Di Tang

AI 论文洞察简报

2026-05-01

0) 核心结论（请先阅读）

2) 关键主题（聚类）

主题：对欺骗性或失配模型行为的行为检测

主题：智能体安全正转向序列化、有状态、基于规范的防御

主题：RL 与推理系统正围绕 rollout、KV 与内存瓶颈进行优化

主题：检索与记忆正变得自适应、压缩化，并更注重证据保真

主题：更真实的基准正在暴露具身、代码与实时智能体场景中的能力缺口

3) 技术综合

4) Top 5 论文（附“为什么是现在”）

DORA: A Scalable Asynchronous Reinforcement Learning System for Language Model Training

Quantamination: Dynamic Quantization Leaks Your Data Across the Batch

Enforcing Benign Trajectories: A Behavioral Firewall for Structured-Workflow AI Agents

When to Retrieve During Reasoning: Adaptive Retrieval for Large Reasoning Models

Benchmarking the Safety of Large Language Models for Robotic Health Attendant Control

5) 实际下一步