AI 论文日报(2026-04-22)

Published:

English version: /paper-news/2026-04-22/

运行统计

  • 候选论文: 311
  • 入选论文: 30
  • 已精读完成: 30
  • 时间窗口 (UTC): 2026-04-20T00:00:00Z → 2026-04-21T00:00:00Z (arxiv_announce, expanded=0)
展开查看用于总结的论文列表
arXiv ID标题 / 链接分类评分入选理由标签
2604.18463Using large language models for embodied planning introduces systematic safety risks
PDF
cs.AI, cs.LG, cs.RO96DESPITE benchmark shows LLM planning can be highly capable yet systematically unsafe in robotics tasksagent-safety, embodied-agents, robotics, planning, benchmark, risk-evaluation
2604.18487Adversarial Humanities Benchmark: Results on Stylistic Robustness in Frontier Model Safety
PDF
cs.CL, cs.AI95Large jailbreak benchmark; big ASR jump under stylistic obfuscation across 31 frontier modelsjailbreaks, robustness, benchmark, red-teaming, safety-eval, stylistic-attacks
2604.18519LLM Safety From Within: Detecting Harmful Content with Internal Representations
PDF
cs.AI94Guardrail via internal-layer features; big gains with tiny params; better OOD generalizationsafety, harmful-content-detection, internal-representations, interpretability, guard-models
2604.18510Different Paths to Harmful Compliance: Behavioral Side Effects and Mechanistic Divergence Across LLM Jailbreaks
PDF
cs.CR, cs.AI, cs.CL93Compares jailbreak routes; shows mechanistic/behavioral divergence despite similar harmful compliancejailbreaks, mechanistic-analysis, RLVR, SFT, abliteration, safety-failure-modes
2604.17860TitanCA: Lessons from Orchestrating LLM Agents to Discover 100+ CVEs
PDF
cs.CR93Real-world multi-agent vuln discovery; 203 zero-days/118 CVEs; strong security lessonsagentic-security, vulnerability-discovery, LLM-agents, cybersecurity, red-teaming, software-security
2604.18179Committed SAE-Feature Traces for Audited-Session Substitution Detection in Hosted LLMs
PDF
cs.CR, cs.AI93Commit-open protocol using SAE feature traces to detect hosted LLM silent model substitutionsecurity, auditing, model-integrity, SAE, verification, hosted-llms
2604.17691SafeAnchor: Preventing Cumulative Safety Erosion in Continual Domain Adaptation of Large Language Models
PDF
cs.LG, cs.AI92Targets safety erosion under continual domain adaptation; anchors safety subspaces during LoRA updatesalignment, continual-learning, safety-preservation, fine-tuning, LoRA
2604.18248Beyond Pattern Matching: Seven Cross-Domain Techniques for Prompt Injection Detection
PDF
cs.CR, cs.CL90Seven cross-domain prompt-injection detection ideas aimed at adaptive adversaries beyond regex/classifiersprompt-injection, agent-security, detection, adversarial-robustness, LLM-security
2604.17730MHSafeEval: Role-Aware Interaction-Level Evaluation of Mental Health Safety in Large Language Models
PDF
cs.CL, cs.AI, cs.HC89Interaction-level mental health safety eval with role-aware harm taxonomy for multi-turn counselingmental-health, safety-eval, multi-turn, harm-taxonomy, clinical-safety, agents
2604.18231AgenTEE: Confidential LLM Agent Execution on Edge Devices
PDF
cs.CR, cs.OS88TEE-based confidential execution for LLM agents on edge; reduces attack surface and protects prompts/stateagent-security, TEE, confidential-computing, edge, system-prompts, privacy
2604.18362ArbGraph: Conflict-Aware Evidence Arbitration for Reliable Long-Form Retrieval-Augmented Generation
PDF
cs.CL, cs.IR88Pre-generation conflict arbitration for long-form RAG; explicit support/contradiction claim graphRAG, factuality, hallucinations, evidence-arbitration, long-form-generation
2604.18164MM-JudgeBias: A Benchmark for Evaluating Compositional Biases in MLLM-as-a-Judge
PDF
cs.CL, cs.AI, cs.CV88Benchmark for compositional bias in MLLM-as-judge; controlled perturbations + metricsevaluation, judge-models, multimodal, bias, robustness, benchmarks
2604.18103Stability Implies Redundancy: Delta Attention Selective Halting for Efficient Long-Context Prefilling
PDF
cs.AI88Training-free selective halting for long-context prefilling; big speedups while keeping accuracyllm-efficiency, long-context, attention, inference-optimization, flashattention-compatible
2604.17768When Vision-Language Models Judge Without Seeing: Exposing Informativeness Bias
PDF
cs.AI87Shows VLM judges ignore images (informativeness bias) and proposes a mitigation methodevaluation, VLM-as-judge, multimodal, bias, grounding, reliability
2604.18240AJ-Bench: Benchmarking Agent-as-a-Judge for Environment-Aware Evaluation
PDF
cs.AI86Benchmark for Agent-as-a-Judge that interacts with tools/envs to verify behavior beyond static judgingevaluation, agentic-systems, LLM-judge, verification, benchmarks, tool-use
2604.17943Domain-oriented RAG Assessment (DoRA): Synthetic Benchmarking for RAG-based Question Answering on Defense Documents
PDF
cs.CL86Defense-doc RAG benchmark with auditable evidence; reports large gains + hallucination reductionRAG, benchmark, attribution, hallucinations, domain-eval
2604.17843Learning from AVA: Early Lessons from a Curated and Trustworthy Generative AI for Policy and Development Research
PDF
cs.HC, cs.AI86Evidence-based multi-agent system with citations + abstention; large in-the-wild evalRAG, epistemic-humility, abstention, citations, deployment, misinformation
2604.17866Latent Abstraction for Retrieval-Augmented Generation
PDF
cs.CL, cs.AI86Unifies RAG in latent space: LLM generates dense retrieval vectors instead of text queriesRAG, retrieval, latent-retrieval, grounding, hallucinations, architecture
2604.18109FLiP: Towards understanding and interpreting multimodal multilingual sentence embeddings
PDF
cs.CL, cs.SD86Shows lexical content recoverable from embeddings; strong privacy/interpretability diagnostic for encoders.embeddings, interpretability, privacy-leakage, multilingual, multimodal, representation-analysis
2604.17803Adversarial Arena: Crowdsourcing Data Generation through Interactive Competition
PDF
cs.AI, cs.LG84Adversarial competition framework to generate diverse safety-alignment conversation data at scaledata-generation, red-teaming, alignment-data, crowdsourcing, adversarial-training
2604.17948RAVEN: Retrieval-Augmented Vulnerability Exploration Network for Memory Corruption Analysis in User Code and Binary Programs
PDF
cs.CR, cs.AI, cs.MA84LLM-agent + RAG for vulnerability root-cause reports; structured template and curated security KBcybersecurity, agents, RAG, vulnerability-analysis, software-security
2604.18235Negative Advantage Is a Double-Edged Sword: Calibrating Advantage in GRPO for Deep Search
PDF
cs.CL, cs.AI84Analyzes GRPO instability for deep-search agents; proposes advantage calibration fixagents, RLHF, GRPO, training-stability, search-agents, credit-assignment
2604.17761Contrastive Attribution in the Wild: An Interpretability Analysis of LLM Failures on Realistic Benchmarks
PDF
cs.AI, cs.CL84Contrastive attribution framework to analyze real benchmark failures; cross-layer graphs for long contextinterpretability, attribution, debugging, llm-failures, evaluation
2604.17957Process Reward Models Meet Planning: Generating Precise and Scalable Datasets for Step-Level Rewards
PDF
cs.CL83Scales PRM data via PDDL planning; ~1M step-level rewards beyond math; reusable for reasoning evalprocess-reward-models, reasoning, datasets, planning, evaluation
2604.18224WebCompass: Towards Multimodal Web Coding Evaluation for Code Language Models
PDF
cs.SE, cs.AI83WebCompass benchmark for multimodal web coding lifecycle (gen/edit/repair); human-in-loopcode-agents, evaluation, multimodal, benchmarks, web-development, repair
2604.17739Tool Learning Needs Nothing More Than a Free 8B Language Model
PDF
cs.LG, cs.CL83Data-free tool-agent training with simulated environments from free 8B LMs + adaptive curriculumtool-use, agents, rl, synthetic-environments, open-models, training
2604.17769Reverse Constitutional AI: A Framework for Controllable Toxic Data Generation via Probability-Clamped RLAIF
PDF
cs.CL, cs.AI82Automated toxic data synthesis via inverted constitution; probability-clamped RLAIF to curb reward hackingadversarial-data, RLAIF, toxicity, red-teaming, reward-hacking, safety-training
2604.17886Latent Preference Modeling for Cross-Session Personalized Tool Calling
PDF
cs.CL, cs.AI82Benchmark + method for cross-session personalized tool calling; big token savings vs full historyagents, tool-use, personalization, memory, benchmarks
2604.17817Do LLMs Need to See Everything? A Benchmark and Study of Failures in LLM-driven Smartphone Automation using Screentext vs. Screenshots
PDF
cs.HC, cs.AI, cs.MA82DailyDroid benchmark + failure analysis for smartphone agents; compares text vs screenshotsmobile-agents, evaluation, HCI, multimodal, failure-analysis, automation
2604.18584MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval
PDF
cs.AI, cs.DL, cs.IR, cs.LG82Large multilingual multimodal Olympiad math benchmark + paired retrieval set for equivalence/similaritybenchmark, math-reasoning, multimodal, multilingual, retrieval, evaluation

AI 论文洞察简报

2026-04-22

0) 执行要点(先读这个)

  • 评估正在从“单次回答”转向“交互 + 环境”:多轮、角色条件化的心理健康红队评测(MHSafeEval)与可回放的 agent-judge 验证(AJ-Bench)都显示出静态评判遗漏的巨大差距。
  • 自动化评审存在可证实的偏差,并可通过更好的协议(部分)修复:VLM 评审常忽略图像并过度奖励“信息量”;BIRCH 将准确率提升约 9–10% 并降低偏差,但推理时间翻倍。
  • 在真实后训练流水线下,安全失败会叠加:顺序 LoRA 领域适配可能导致累积性安全退化;SafeAnchor 在保持领域性能接近标准 LoRA 的同时,保留约 93% 的原始安全性。
  • 安全正在变得更“代理化 + 运营化”,而非仅基准测试:TitanCA 通过编排式 LLM-agent 流水线报告了 118 个 CVE;Adversarial Arena 表明锦标赛生成的多轮数据在微调后可实质提升安全编码/拒答指标。
  • RAG 可靠性工作正在前移到流水线更早阶段:ArbGraph 在生成之前仲裁矛盾证据并提升长文事实召回(如 83.3–84.9% FR);DoRA 显示领域落地的合成基准 + 轻量 LoRA SFT 可在国防文档 QA 场景将幻觉减半。
  • 两类互补的安全原语正在形成:(a) 内部表征防护(SIREN),以更少可训练参数优于开源 guard 模型;(b) 服务时审计(承诺式 SAE 轨迹 + Merkle)以 ≤2.1% 开销检测托管模型替换。

2) 关键主题(聚类)

主题:顺序适配下的持续对齐

主题:超越单轮提示的高保真安全评估

主题:评审可靠性与偏差(文本 + 多模态)

主题:通过领域落地与冲突仲裁提升 RAG 鲁棒性

主题:大规模 agent 训练与数据生成(模拟 + 竞赛)

主题:面向真实部署的安全与隐私原语

3) 技术综合

  • “闭环搜索”正在成为发现失败的默认方式:MHSafeEval 使用类似 MAP-Elites 的档案;AJ-Bench 使用交互式验证;Adversarial Arena 使用锦标赛——都比静态提示覆盖更广。
  • 评判流水线被当作具有可测偏差的系统来对待:信息量偏差(IB)与图像依赖(IRS)量化评审失败;BIRCH 通过真实锚点缓解,而非仅做长度均衡。
  • 安全保持正从“一次性微调”走向“持续控制”:SafeAnchor 结合基于 Fisher 的子空间识别 + 正交梯度投影 + 监控触发修复。
  • 表征层面的安全既是攻击面也是防御面:越狱路径在机制上分化(RLVR vs SFT vs abliteration),而 SIREN 利用内部层实现更好的有害性检测。
  • RAG 可靠性正在分裂为 (a) 基准真实感 与 (b) 证据仲裁:DoRA 聚焦污染感知、意图多样的领域 QA;ArbGraph 聚焦生成前的矛盾消解。
  • 运营安全流水线强调校准与精度:TitanCA 的置信校准在保持不平衡下召回的同时降低误报(28%→20%);这呼应了更广泛的“保信任”工具趋势。
  • 效率工作瞄准 prefill 瓶颈,而非仅解码:DASH 在某个起始层后剪枝稳定 token,并保持与 FlashAttention 兼容,实现随长度增长的加速(如 16k token 理论 1.83×)。
  • 基准越来越将成本/延迟作为一等指标:DailyDroid 量化多模态成本膨胀;BIRCH 报告约 2× 推理时间;AgenTEE 报告相对进程 <5.15% 开销。

4) Top 5 论文(含“为何现在”)

1) SafeAnchor: Preventing Cumulative Safety Erosion in Continual Domain Adaptation of Large Language Models

  • 表明安全会在顺序领域 LoRA中发生复合式退化;SafeAnchor 在 base 91.4 的情况下保留 85.2±0.9 的安全性(≈93.2% 保留率),同时保持领域性能接近标准 LoRA。
  • 实用配方:基于 Fisher 的“安全子空间” + 正交梯度投影 + 探针触发修复。
  • 提升对抗鲁棒性(GCG 拒答 78.4±2.1 vs 54.6±2.6 最佳基线)。
  • 质疑点:主要在 7B 与短序列(3 个领域;部分扩展到 T=5)评估;依赖探针质量(LlamaGuard)与 Fisher 近似。

2) MHSafeEval: Role-Aware Interaction-Level Evaluation of Mental Health Safety in Large Language Models

  • 将心理健康安全重构为轨迹级伤害发现,并提供角色×类别分类体系(28 种行为)。
  • 闭环搜索相较仅用种子显著提高攻击成功率(如 GPT-3.5 ASR 0.603→0.943)。
  • 发现关系性伤害(依赖诱导、煤气灯效应、过度病理化)即使在理解力较高时也易被诱发。
  • 质疑点:依赖模拟交互与基于 LLM 的临床评审(gpt-4o-mini);受成本限制,前沿规模覆盖有限。

3) When Vision-Language Models Judge Without Seeing: Exposing Informativeness Bias

  • 量化 VLM 评审往往几乎不使用图像(IRS 通常 <3–5%),并偏好“信息量大”但错误的答案。
  • BIRCH 通过真实且信息充分的锚点缓解:提升评审准确率(如 GPT-4o 66.45%→75.78%)并降低 IB(如 Llama-3.2 IB 52.9%→35.9%)。
  • 质疑点:锚点错误可能传播;计算成本约翻倍,偏差降低但未消除。

4) Committed SAE-Feature Traces for Audited-Session Substitution Detection in Hosted LLMs

  • 引入 commit-open 协议(对 SAE top-k 特征草图做 Merkle 承诺),弥补“先返回再探针验证”中的“并行服务”漏洞。
  • 报告对多类替换的检测与 SVIP 对比(SVIP 漏检 11/11;commit-open 在复跑集合中检出 11/11)。
  • 服务开销低(batch 32 时 ≤2.1%;224 字节负载)。
  • 质疑点:限定于特定主干/SAE(1.7–9B)与威胁模型;旗舰规模与更强白盒自适应仍待研究。

5) Using large language models for embodied planning introduces systematic safety risks (DESPITE)

  • 确定性 PDDL 基准(12,279 任务)将可行性与安全意图分离;显示安全意识随规模增长更慢(βSI=4.5)而可行性更快(βF=26.8)。
  • 典型例子:Gemini-3-Pro-Preview 不可行仅 0.4%,但产生危险计划 28.7%。
  • 给出清晰分解:Safety ≈ Feasibility × Safety Intention(R²≈0.99)。
  • 质疑点:符号化/确定性设定(无感知、无连续动力学);应将其视为真实机器人场景的下界。

5) 实用下一步

  • 若你做持续专用化:为 LoRA 流水线加入适配后安全监控(探针集 + 阈值),并测试(SafeAnchor 风格的)正交梯度约束;跨多个顺序领域跟踪安全保留率,而非只看单次。
  • 若你依赖 LLM/VLM 评审:测量并报告偏差切分(信息量驱动 vs 正确性驱动)与图像依赖(IRS);当正确性必须优先时,考虑锚点式评判(BIRCH)。
  • 用于 agent 评估:至少在一个你关心的领域采用可环境回放的评审设置(AJ-Bench 风格);比较 LLM-as-judge 与 agent-as-judge 的 F1 与预算敏感性。
  • 敏感领域的 RAG:从私有语料构建类似 DoRA 的合成、证据链接回归集;在固定检索器下测试轻量 LoRA SFT 是否同时提升任务指标与幻觉诊断。
  • 长文 RAG:在小范围原型化生成前主张仲裁(ArbGraph 风格);对比你当前“检索-再生成”基线,测量事实召回/幻觉。
  • 托管模型完整性:评估在你的服务栈中 commit-before-open 轨迹(如 SAE 草图 + Merkle)是否可行;量化开销并明确需要覆盖的攻击者类别。
  • 红队覆盖:将风格混淆变换(AHB 风格)加入单轮安全套件;将修辞位移下的 ∆ASR 作为鲁棒性 KPI 跟踪。

由逐篇论文分析生成;未进行外部浏览。