AI Paper Daily (2026-04-25)

Published:

English version: /paper-news/2026-04-25/

Run statistics

  • Candidate papers: 221
  • Selected papers: 30
  • Deep reads completed: 30
  • Time window (UTC): 2026-04-23T00:00:00Z → 2026-04-24T00:00:00Z (arxiv_announce, expanded=0)
Paper list used for summarization:
arXiv ID | Title | Categories | Score | Selection rationale | Tags
2604.21477 | MCP Pitfall Lab: Exposing Developer Pitfalls in MCP Tool Server Security under Multi-Vector Attacks | cs.CR | 95 | Protocol-aware MCP security testbed w/ reproducible pitfalls, traces, validators; multi-vector attacks | agents, MCP, tool-security, prompt-injection, supply-chain, benchmark, evaluation
2604.21860 | Transient Turn Injection: Exposing Stateless Multi-Turn Vulnerabilities in Large Language Models | cs.CR, cs.AI | 93 | New multi-turn jailbreak exploiting stateless moderation; broad eval across frontier & OSS models | jailbreaks, multi-turn, moderation, adversarial, red-teaming, security
2604.21211 | Subject-level Inference for Realistic Text Anonymization Evaluation | cs.CL | 93 | New benchmark shows span-masking can still leak identity via subject-level inference. | privacy, anonymization, PII, evaluation, inference-attacks, benchmarks
2604.21308 | CI-Work: Benchmarking Contextual Integrity in Enterprise LLM Agents | cs.CR, cs.CL | 92 | Enterprise agent privacy benchmark grounded in contextual integrity; shows utility–leakage trade-off | agents, privacy, information-flow, benchmark, RAG, enterprise, evaluation
2604.21255 | When Agents Look the Same: Quantifying Distillation-Induced Similarity in Tool-Use Behaviors | cs.CL | 92 | New metrics quantify distillation-driven homogenization in agent tool-use; useful for auditing ecosystem risk | agents, tool-use, distillation, behavioral-similarity, evaluation, model-auditing
2604.21827 | Alignment has a Fantasia Problem | cs.AI, cs.HC | 91 | Alignment framing: users lack fixed goals; proposes intent-formation support to avoid failures. | alignment, HCI, goal-ambiguity, agent-assistants, human-factors
2604.21829 | Black-Box Skill Stealing Attack from Proprietary LLM Agents: An Empirical Study | cs.CR | 90 | First empirical black-box study of stealing proprietary agent skills; taxonomy + attack surface | agents, model-extraction, prompt-stealing, IP, security, threat-model
2604.21564 | Measuring Opinion Bias and Sycophancy via LLM-based Coercion | cs.CL | 90 | Open-source bench to elicit latent opinions/sycophancy in realistic multi-turn coercion settings | sycophancy, bias, evaluation, multi-turn, benchmarks, red-teaming
2604.21229 | EngramaBench: Evaluating Long-Term Conversational Memory with Structured Graph Retrieval | cs.CL, cs.AI | 90 | Benchmark for long-term conversational memory + compares graph vs vector vs full-context; includes adversarial abstention | long-term-memory, benchmarks, RAG, graph-retrieval, evaluation, assistants
2604.21840 | TraceScope: Interactive URL Triage via Decoupled Checklist Adjudication | cs.CR, cs.AI | 88 | Sandboxed operator+adjudicator agents for safe interactive phishing URL triage; evidence bundling | agentic-systems, sandboxing, cybersecurity, phishing, tool-use, evaluation
2604.21523 | Seeing Isn't Believing: Uncovering Blind Spots in Evaluator Vision-Language Models | cs.CV, cs.CL | 88 | Benchmark exposes reliability blind spots of VLMs used as evaluators across I2T/T2I perturbations | VLM, LLM-as-judge, evaluation, robustness, hallucinations, benchmarks
2604.21334 | Ideological Bias in LLMs' Economic Causal Reasoning | cs.AI, cs.CE, cs.CL, cs.LG, econ.GN | 88 | Large-scale eval of ideological bias in economic causal reasoning; ideology-contested subset from verified effects | bias, causal-reasoning, evaluation, economics, benchmarks, LLMs
2604.21794 | Learning to Communicate: Toward End-to-End Optimization of Multi-Agent Language Systems | cs.AI, cs.CL, cs.MA | 88 | End-to-end learned latent inter-agent communication; could reshape multi-agent LLM system design. | multi-agent, communication, latent-interfaces, training, LLM-agents
2604.21700 | Stealthy Backdoor Attacks against LLMs Based on Natural Style Triggers | cs.CR, cs.AI, cs.CL | 86 | Stealthy LLM backdoors via natural style triggers; clearer end-to-end threat model & pipeline | backdoors, data-poisoning, LLM-security, style-triggers, supply-chain
2604.21911 | When Prompts Override Vision: Prompt-Induced Hallucinations in LVLMs | cs.CV, cs.AI, cs.CL, cs.LG | 86 | HalluScope isolates prompt-induced LVLM hallucinations; highlights instruction priors as key driver | LVLM, hallucinations, prompting, robustness, benchmark, grounding
2604.21590 | AgenticQwen: Training Small Agentic Language Models with Dual Data Flywheels for Industrial-Scale Tool Use | cs.CL | 86 | Industrial small agentic LMs trained with multi-round RL + dual data flywheels for tool use; high practical impact | agents, tool-use, reinforcement-learning, small-models, synthetic-data, post-training
2604.21593 | Language as a Latent Variable for Reasoning Optimization | cs.CL | 86 | Polyglot prompting/RL idea: language as latent variable can improve reasoning accuracy. | reasoning, multilingual, RLHF, GRPO, inference-strategies
2604.21816 | Tool Attention Is All You Need: Dynamic Tool Gating and Lazy Schema Loading for Eliminating the MCP/Tools Tax in Scalable Agentic Workflows | cs.AI | 85 | Cuts MCP/tools token overhead via dynamic tool gating + lazy schema loading; claims big token savings | agents, tool-use, efficiency, long-context, MCP, systems
2604.21375 | VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation | cs.CL, cs.AI, cs.SE | 85 | GUI agent framework with mandatory verifier + loop breaker to prevent premature stops and loops | agents, GUI automation, verification, reliability, tool-use, agent safety
2604.21327 | Understanding and Mitigating Spurious Signal Amplification in Test-Time Reinforcement Learning for Math Reasoning | cs.LG, cs.AI, cs.CL | 85 | Analyzes spurious reward signals in test-time RL for math; proposes debias/denoise framework to reduce noise | test-time-training, reinforcement-learning, reasoning, robustness, math, optimization
2604.21199 | ARFBench: Benchmarking Time Series Question Answering Ability for Software Incident Response | cs.LG, cs.CV | 85 | ARFBench TSQA for incident response; evaluates FMs on telemetry anomaly reasoning. | evaluation, benchmarks, time-series, incident-response, multimodal, ops
2604.21571 | Separable Expert Architecture: Toward Privacy-Preserving LLM Personalization via Composable Adapters and Deletable User Proxies | cs.AI, cs.LG | 84 | Personalization w/ deletable per-user proxies enabling deterministic unlearning; reduces cross-user leak | privacy, unlearning, personalization, LoRA, adapters, data-deletion
2604.21214 | SQLyzr: A Comprehensive Benchmark and Evaluation Platform for Text-to-SQL | cs.DB, cs.AI | 84 | Text-to-SQL evaluation platform with realistic workload alignment + fine-grained metrics beyond single score | text-to-sql, evaluation, benchmarks, databases, LLMs, metrics
2604.21716 | From If-Statements to ML Pipelines: Revisiting Bias in Code-Generation | cs.CL, cs.SE | 83 | Shows codegen bias is underestimated: ML pipeline generation includes sensitive attrs in 87.7% of cases | code generation, bias, fairness, evaluation, ML pipelines, safety
2604.21421 | Differentially Private De-identification of Dutch Clinical Notes: A Comparative Evaluation | cs.CR, cs.AI, cs.CL | 83 | Comparative study of DP vs NER vs LLMs for clinical note de-ID (Dutch); directly relevant to privacy in LLM pipelines | privacy, differential-privacy, de-identification, clinical-NLP, LLMs, security
2604.21344 | Beyond Single Plots: A Benchmark for Question Answering on Multi-Charts | cs.CL, cs.AI, cs.CV, cs.LG, cs.MA | 83 | PolyChartQA benchmark exposes large drop for VLMs on multi-chart reasoning. | multimodal, VLM, benchmark, chart-QA, reasoning, evaluation
2604.21197 | Toward Efficient Membership Inference Attacks against Federated Large Language Models: A Projection Residual Approach | cs.LG | 81 | Membership inference tailored to federated LLM fine-tuning; projection-residual method on gradients | privacy, membership-inference, federated-learning, LLMs, security
2604.21309 | When Bigger Isn't Better: A Comprehensive Fairness Evaluation of Political Bias in Multi-News Summarisation | cs.CL | 81 | Large fairness eval of political bias in multi-news summarization across 13 LLMs + metrics. | fairness, bias, summarization, evaluation, politics, LLMs
2604.21854 | Bounding the Black Box: A Statistical Certification Framework for AI Risk Regulation | cs.AI | 80 | Proposes statistical certification to quantify/verify acceptable risk for AI regulation compliance | AI regulation, risk certification, assurance, governance, deployment safety
2604.21769 | Who Defines "Best"? Towards Interactive, User-Defined Evaluation of LLM Leaderboards | cs.AI, cs.CY, cs.HC | 80 | Shows leaderboard rankings depend on prompt slices; proposes interactive user-defined evaluation of LLM leaderboards | evaluation, leaderboards, LMArena, benchmarking, human-preferences, governance

AI Paper Insights Brief

2026-04-25

0) Executive takeaways (read this first)

  • "Gradient-only" and "federated" are not privacy shields for LLM fine-tuning: a single round of PEFT gradients enables near-perfect membership inference via a simple projection-residual test (ProjRes), and lightweight defenses only work when they also crush utility.
  • Enterprise agent privacy is failing in realistic dense-retrieval workflows: CI-Work shows substantial leakage/violation rates and a clear privacy–utility coupling; "trying harder / bigger models" can increase leakage (inverse scaling), and user pressure makes it worse.
  • Tool/agent security is shifting from prompt injection toward protocols + developer pitfalls + trace auditing: MCP Pitfall Lab shows deterministic static checks can cheaply eliminate many server-side pitfalls, while black-box "skill stealing" and stateless multi-turn attacks (TTI) show that substantial information can leak through entirely normal interfaces.
  • Evaluation itself is becoming a larger attack surface and failure point: evaluator VLMs miss obvious degradations (FOCUS); multi-chart QA and time-series incident QA benchmarks reveal large capability gaps in real-world compositional, cross-context reasoning.
  • Reliability gains come more from systems than from the model alone: GUI automation improves via mandatory completion verification + loop recovery (VLAA-GUI); multi-agent systems improve by learning latent communication (DiffMAS) rather than only exchanging text.
  • Bias/fairness findings are increasingly non-monotonic in scale and task-dependent: mid-sized models may be best at political fairness in summarization, while code-generation bias looks worse when evaluated on real ML pipelines (feature selection) rather than toy if-statements.

2) Key themes (clusters)

Theme: Federated and personalized LLM privacy is fragile (new primitives needed)

Theme: Contextual privacy for agents in enterprise/tool ecosystems

Theme: Benchmarks are getting more realistic, and models look worse on compositional, cross-context tasks

Theme: Reliability gains via explicit verification, recovery, and learned coordination

Theme: Bias/fairness measurement is shifting to mechanism-relevant tasks (scale is not the fix)

3) Technical synthesis

  • Several papers converge on auditability via traces and evidence: MCP Pitfall Lab validates through MCP traces; TraceScope (URL triage) uses immutable evidence + checklist adjudication; EngramaBench annotates evidence IDs. This reflects a broader shift from trusting model narratives to evidence-based verification.
  • Single-round / low-history attacks are getting stronger: ProjRes needs only one round of gradients; skill stealing claims extraction from only a few interactions; TTI exploits per-turn stateless moderation.
  • The privacy–utility coupling has now been quantified empirically in agent settings (the correlation between conveyance and leakage/violation in CI-Work), echoing the DP trade-offs in the federated and clinical de-identification evaluations.
  • Decompose + verify is a recurring reliability pattern: VDSP for multi-chart QA, completion verifier + loop breaker for GUI agents, consensus-style off-policy refinement for test-time RL.
  • "Bigger models" are not a universal fix: inverse scaling of leakage in CI-Work; the best fairness trade-off at mid scale in FairNews; evaluator VLMs still have large blind spots (FOCUS).
  • Preference/judge-style evaluation is itself unreliable: FOCUS shows evaluator VLM failures; the interactive leaderboard analysis shows preference rankings vary with prompt slices, and humans pick the wrong answer 26% of the time on deterministic math questions.
  • Latent interfaces are becoming a performance lever: DiffMAS trains KV-trace communication, echoing other work that treats non-textual internals as optimizable rather than fixed.
  • Synthetic data is heavily used but in different roles: ARFBench uses synthetic post-training + a small real set; AgenticQwen uses dual flywheels; HalluVL-DPO uses large volumes of synthetic preference data. All of this raises common questions about bias/transfer and evaluation realism.
  • Security threat models are expanding from prompt injection to supply chain + protocol + tool metadata + multimodal vectors (BADSTYLE style triggers; MCP Pitfall Lab; skill stealing).

4) Top 5 papers (with the "why now")

1) Toward Efficient Membership Inference Attacks against Federated Large Language Models: A Projection Residual Approach

  • Demonstrates a single-round, shadow-model-free membership inference attack on FedLLMs/PEFT using projection residuals over hidden embeddings.
  • Reports near-perfect AUC (often 1.00) across multiple LLMs and datasets, clearly outperforming prior FL MIAs.
  • Evaluates defenses and finds DP only helps at utility-destroying noise levels; pruning is only partially effective.
  • Caveats / limitations: non-trivial runtime overhead (the attack runs per layer), and no utility-preserving defense is proposed.
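The core projection-residual idea can be illustrated with a minimal numerical sketch: member update vectors lie (approximately) in the low-rank subspace spanned by the fine-tuning updates observed in one round, so their residual after projection onto that subspace is small. Everything here, the `projection_residual_scores` helper, the synthetic data, and the rank choice, is an illustrative assumption, not the paper's implementation.

```python
import numpy as np

def projection_residual_scores(reference, queries, rank=8):
    """Toy projection-residual membership score.

    reference: (n, d) update vectors (e.g. PEFT gradients) observed in
               one federated round, assumed to span the members' subspace.
    queries:   (m, d) candidate vectors to test for membership.
    Returns one residual norm per query; smaller = more member-like.
    """
    # Orthonormal basis of the top-`rank` right singular directions.
    _, _, vt = np.linalg.svd(reference, full_matrices=False)
    basis = vt[:rank]                     # (rank, d)
    # Residual = component of each query outside the spanned subspace.
    proj = queries @ basis.T @ basis
    return np.linalg.norm(queries - proj, axis=1)

# Synthetic demo: members live in a low-dim subspace, outsiders do not.
rng = np.random.default_rng(0)
d, k = 64, 4
subspace = rng.normal(size=(k, d))
members = rng.normal(size=(100, k)) @ subspace
reference = rng.normal(size=(50, k)) @ subspace
outsiders = rng.normal(size=(100, d)) * 3.0

scores_in = projection_residual_scores(reference, members, rank=k)
scores_out = projection_residual_scores(reference, outsiders, rank=k)
```

In a real attack the reference matrix would come from intercepted PEFT updates and the queries from candidate examples; the synthetic data only demonstrates the separation the score exploits (`scores_in` is near zero, `scores_out` is large).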

2) CI-Work: Benchmarking Contextual Integrity in Enterprise LLM Agents

  • Introduces an enterprise CI benchmark with dense retrieval traces and explicit Essential vs Sensitive items.
  • Finds substantial violations/leakage and a measurable privacy–utility trade-off, plus inverse scaling: larger models can leak more.
  • Shows that user pressure significantly increases leakage and can even reduce conveyance (a lose–lose).
  • Caveats / limitations: synthetic scenarios and LLM-judge false negatives mean reported leakage is likely a lower bound; organization-specific norms are not covered.
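The conveyance/leakage trade-off such a benchmark measures can be sketched with a toy scorer, assuming per-scenario annotations of essential vs. sensitive items. Substring matching stands in for the LLM judge used in practice, and the function name and example data are hypothetical:

```python
def ci_scores(response, essential, sensitive):
    """Toy contextual-integrity scoring.

    conveyance: share of task-essential items the response conveys.
    leakage:    share of context-inappropriate items it discloses.
    Substring matching is a crude stand-in for a semantic judge.
    """
    def contains(item):
        return item.lower() in response.lower()

    conveyance = sum(map(contains, essential)) / max(len(essential), 1)
    leakage = sum(map(contains, sensitive)) / max(len(sensitive), 1)
    return conveyance, leakage

response = ("The Q3 launch slips to November; vendor quotes are attached. "
            "Note: Alice is on medical leave until December.")
c, l = ci_scores(response,
                 essential=["Q3 launch", "vendor quotes"],
                 sensitive=["medical leave", "salary band"])
# Fully useful (conveyance 1.0) but leaks one of two sensitive items (0.5).
```

Tracking both numbers per scenario, rather than a single utility score, is what makes the inverse-scaling and user-pressure effects visible.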

3) MCP Pitfall Lab: Exposing Developer Pitfalls in MCP Tool Server Security under Multi-Vector Attacks

  • Operationalizes a developer-pitfall taxonomy and provides trace-based validators for confidentiality/integrity goals.
  • Tier-1 static analyzers reach F1=1.0 on statically checkable pitfall classes and are CI-friendly (~5.2 ms).
  • Hardening drives findings from 29→0 with only ~27 lines of code changed on average; also documents common trace–narrative divergence.
  • Caveats / limitations: small evaluation scope (few scenarios; preliminary corpus); multimodal analysis remains underdeveloped.
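A deterministic tier-1 check of this flavor might look like the sketch below, run over an MCP-style tool manifest. The two pitfall rules and the `lint_tool` helper are illustrative assumptions, not the paper's actual taxonomy or validators:

```python
import re

def lint_tool(tool):
    """Toy deterministic checks over an MCP-style tool definition."""
    findings = []
    name = tool.get("name", "?")
    props = tool.get("inputSchema", {}).get("properties", {})
    # Illustrative pitfall: unconstrained string fed to a command-like tool.
    if re.search(r"exec|shell|cmd", name, re.I):
        for pname, p in props.items():
            if p.get("type") == "string" and "enum" not in p:
                findings.append(f"{name}.{pname}: unconstrained string "
                                "argument to a command-like tool")
    # Illustrative pitfall: tool description that tries to steer the model.
    if re.search(r"ignore (all|previous)|always call", tool.get("description", ""), re.I):
        findings.append(f"{name}: suspicious instructions in description")
    return findings

risky = {"name": "run_shell",
         "description": "Run a command. Always call this tool first.",
         "inputSchema": {"type": "object",
                         "properties": {"command": {"type": "string"}}}}
report = lint_tool(risky)   # flags both pitfalls
```

Because such checks are pure static pattern matching over the manifest, they run in milliseconds and slot naturally into a CI gate, which is the property the paper emphasizes.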

4) Seeing Isn’t Believing: Uncovering Blind Spots in Evaluator Vision-Language Models

  • Releases FOCUS: >4,000 human-verified perturbed instances for meta-evaluating evaluator VLMs on I2T and T2I.
  • Finds high evaluator failure rates, especially in single-answer scoring; pairwise comparison is more reliable.
  • Shows that reasoning budget does not reliably help; evaluators may name the error in their rationale yet not reflect it in the score.
  • Caveats / limitations: gold outputs are model-generated (though human-reviewed); only four evaluator VLMs were tested.

5) VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation

  • Targets the two dominant GUI-agent failures, premature completion and loops, via completion gating + an independent verifier + a multi-level loop breaker + search.
  • Reports 77.45% success on OSWorld-Verified (Opus 4.6), above the reported human level (72.4%), with strong results on WAA.
  • Provides ablations showing which modules reduce false completions and wasted steps.
  • Caveats / limitations: tooling overhead can hurt weaker backbones under tight budgets; for some models, false completion remains the dominant failure mode.
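The completion-gate + loop-breaker pattern can be sketched as a small control loop, assuming a toy `policy`/`verifier` interface (hypothetical names, not VLAA-GUI's actual API): a "DONE" claim is only accepted if an independent verifier agrees, and a repeating (state, action) pair triggers escalation instead of another identical attempt.

```python
from collections import Counter

def run_agent(policy, verifier, max_steps=50, loop_threshold=3):
    """Toy completion gate + loop breaker.

    policy(state)   -> (action, next_state); action "DONE" claims completion.
    verifier(state) -> True only if the task is actually complete.
    """
    seen = Counter()
    state, trace = "start", []
    for _ in range(max_steps):
        action, nxt = policy(state)
        if action == "DONE":
            if verifier(state):               # completion gate
                trace.append("DONE")
                return trace
            action, nxt = "RECHECK", state    # premature stop rejected
        seen[(state, action)] += 1
        if seen[(state, action)] > loop_threshold:
            action, nxt = "ESCALATE", "recovered"  # loop breaker
        trace.append(action)
        state = nxt
    return trace

# Hypothetical policy that clicks forever until the loop breaker escalates,
# then correctly claims completion.
def demo_policy(state):
    if state == "recovered":
        return ("DONE", "recovered")
    return ("click", "stuck")

trace = run_agent(demo_policy, verifier=lambda s: s == "recovered")
```

Without the breaker, `demo_policy` would click until the step budget ran out; with it, the run ends with an escalation followed by a verified "DONE".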

5) Practical next steps

  • For federated/PEFT deployments: add pre-launch red-team audits that explicitly test single-round gradient leakage (ProjRes-style); treat "we never share raw data" as insufficient.
  • For enterprise agents: measure Leakage/Violation/Conveyance under dense retrieval and user pressure (CI-Work style), not just on clean prompts; track whether scaling up increases leakage.
  • Adopt trace-based security QA for tool servers: integrate tier-1 static checks (MCP Pitfall Lab) into CI, and require protocol trace logging so validators can detect exfiltration and integrity violations.
  • Harden against black-box extraction: probe skill/package leakage with automated prompt suites; consider output filtering and inference hardening, and evaluate semantic leakage (not just exact match).
  • Close the stateless-moderation gap: implement session-level aggregation or risk scoring to detect distributed multi-turn intent (TTI), and benchmark with stateless multi-turn attacks.
  • Do not trust evaluator VLMs by default: validate your evaluators with perturbation suites (FOCUS-like); prefer pairwise protocols where feasible, and monitor rationale–score mismatches.
  • Improve GUI/agent reliability: add explicit completion criteria + an independent verifier and loop-escalation mechanisms; log false-completion rate and wasted-step ratio as first-class metrics (VLAA-GUI).
  • Fairness audits: evaluate on mechanism-relevant tasks (e.g., ML-pipeline feature selection, multi-document viewpoint preservation, directional causal signs); do not assume bigger models reduce bias.
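The session-level risk aggregation suggested in the next steps above can be sketched as follows: per-turn moderation scores are combined with exponential decay, so a sequence of individually benign turns can still trip the gate. The decay constant, threshold, and scores are illustrative assumptions, not values from the TTI paper:

```python
def stateless_flag(turn_scores, threshold=0.8):
    """Per-turn moderation: fires only if some single turn is risky."""
    return any(s >= threshold for s in turn_scores)

def session_flag(turn_scores, threshold=0.8, decay=0.7):
    """Session-level aggregation: exponentially decayed cumulative risk,
    so many mildly risky turns can still exceed the threshold."""
    risk = 0.0
    for s in turn_scores:
        risk = decay * risk + s
        if risk >= threshold:
            return True
    return False

# A transient-turn-injection-style session: intent distributed across
# turns, each individually below the per-turn threshold.
turns = [0.3, 0.35, 0.3, 0.4, 0.35]
```

A stateless per-turn check never fires on this session (`stateless_flag(turns)` is False) while the aggregated score does, which is exactly the gap TTI-style attacks exploit.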

Generated from per-paper analyses; no external browsing was performed.