AI Paper Daily (2026-05-05)

Published:

English version: /paper-news/2026-05-05/

Run Statistics

  • Candidate papers: 4818
  • Selected papers: 30
  • Deep reads completed: 30
  • Time window (UTC): 2026-05-01T00:00:00Z → 2026-05-02T00:00:00Z (weekend_backlog_sun, expanded=0)
Paper list used for the summary
Fields: arXiv ID | categories | score | title [PDF]; each entry's second line gives the selection rationale and tags.

  • 2604.26467 | cs.CR | 91 | Differentially Private Contrastive Learning via Bounding Group-level Contribution [PDF]
    DP contrastive learning method tackles privacy-utility tradeoff with principled dependency reduction. Tags: privacy, differential-privacy, representation-learning, security
  • 2604.19471 | cs.CR | 90 | API Security Based on Automatic OpenAPI Mapping [PDF]
    Unsupervised API mapping plus anomaly detection with strong security results and deployment relevance. Tags: security, API, anomaly-detection, unsupervised, OpenAPI
  • 2604.26573 | cs.LG | 89 | PAINT: Partial-Solution Adaptive Interpolated Training for Self-Distilled Reasoners [PDF]
    Reasoning training method for self-distilled LLMs with token-level supervision and verified context. Tags: llm-reasoning, post-training, self-distillation, training
  • 2604.24372 | cs.CL, cs.AI, cs.NE | 88 | SeaEvo: Advancing Algorithm Discovery with Strategy Space Evolution [PDF]
    LLM-guided algorithm discovery with explicit strategy-space evolution; strong novelty for agentic search. Tags: llm, agents, algorithm-discovery, evolution, reasoning
  • 2604.25840 | cs.CL, cs.AI | 88 | PSI-Bench: Towards Clinically Grounded and Interpretable Evaluation of Depression Patient Simulators [PDF]
    Clinically grounded eval for LLM patient simulators; reduces judge opacity and measures diversity. Tags: evaluation, llm, safety, benchmark, mental-health
  • 2604.24703 | cs.SE, cs.AI | 88 | Defective Task Descriptions in LLM-Based Code Generation: Detection and Analysis [PDF]
    Targets LLM codegen reliability by detecting defective prompts; practical safety relevance and strong reported gains. Tags: llm-reliability, code-generation, input-quality, evaluation, small-models
  • 2604.25231 | cs.CV, cs.AI, cs.CL | 88 | DRAGON: A Benchmark for Evidence-Grounded Visual Reasoning over Diagrams [PDF]
    Benchmark for evidence-grounded diagram reasoning, targeting faithfulness beyond answer accuracy. Tags: benchmark, vlm, grounding, evaluation, faithfulness
  • 2604.19526 | cs.CR, cs.LG, cs.SE | 87 | Evaluating LLM-Generated Obfuscated XSS Payloads for Machine Learning-Based Detection [PDF]
    LLM-generated XSS obfuscation benchmark pipeline is directly relevant to adversarial security evaluation. Tags: LLM-security, XSS, adversarial-evaluation, red-teaming, cybersecurity
  • 2604.25264 | cs.CR, cs.SE | 86 | MARD: A Multi-Agent Framework for Robust Android Malware Detection [PDF]
    LLM multi-agent malware detection is security-relevant and targets robustness under concept drift. Tags: llm, multi-agent, cybersecurity, malware-detection, robustness
  • 2604.21679 | cs.CR | 86 | A Sociotechnical, Practitioner-Centered Approach to Technology Adoption in Cybersecurity Operations: An LLM Case [PDF]
    Practitioner-grounded study of LLM deployment in SOCs; directly relevant to trust, reliability, and security ops. Tags: llm-security, deployment, trust, cybersecurity, human-factors
  • 2604.25122 | cs.CV, cs.AI | 86 | M$^3$-VQA: A Benchmark for Multimodal, Multi-Entity, Multi-Hop Visual Question Answering [PDF]
    New benchmark for multimodal multi-entity multi-hop reasoning with evidence; useful for MLLM evaluation. Tags: benchmark, multimodal, reasoning, evaluation, retrieval
  • 2604.26479 | stat.ME, cs.LG | 86 | Recipes for Calibration Checks in Safety-Critical Applications [PDF]
    Calibration testing framework for safety-critical probabilistic systems; strong reliability relevance. Tags: calibration, reliability, safety-critical, evaluation, uncertainty
  • 2604.24396 | cs.CV, cs.AI | 86 | Global Context or Local Detail? Adaptive Visual Grounding for Hallucination Mitigation [PDF]
    Inference-time hallucination mitigation for VLMs via decoding intervention to boost visual fidelity. Tags: vlm, hallucination, decoding, reliability
  • 2604.26645 | cs.AI, cs.LG | 84 | SciHorizon-DataEVA: An Agentic System for AI-Readiness Evaluation of Heterogeneous Scientific Data [PDF]
    Agentic framework for trustworthiness and AI-readiness evaluation of scientific data; reusable criteria system. Tags: agents, evaluation, trustworthiness, data-governance, ai-for-science
  • 2604.24350 | cs.LG, cs.AI, cs.CR | 84 | Unveiling the Backdoor Mechanism Hidden Behind Catastrophic Overfitting in Fast Adversarial Training [PDF]
    Backdoor-based explanation of catastrophic overfitting could unify robustness failures and defenses. Tags: adversarial-robustness, backdoors, training-dynamics, security
  • 2604.01635 | cs.CR | 84 | Diffusion-Guided Adversarial Perturbation Injection for Generalizable Defense Against Facial Manipulations [PDF]
    Concrete defense against deepfake facial manipulation with claimed generalization beyond white-box GAN settings. Tags: security, deepfakes, adversarial-defense, privacy, robustness
  • 2604.24052 | cs.CV, cs.AI | 84 | QEVA: A Reference-Free Evaluation Metric for Narrative Video Summarization with Multimodal Question Answering [PDF]
    Reference-free multimodal eval for video summaries targeting coverage, factuality, chronology. Tags: evaluation, multimodal, factuality, video, benchmark
  • 2604.20079 | cs.LG, cs.CL | 84 | On the Quantization Robustness of Diffusion Language Models in Coding Benchmarks [PDF]
    LLM efficiency study with concrete quantization results; relevant to deployable coding models. Tags: llm, diffusion-lm, quantization, efficiency, coding
  • 2604.18552 | cs.CR, cs.SE | 84 | Do Privacy Policies Match with the Logs? An Empirical Study of Privacy Disclosure in Android Application Logs [PDF]
    Large empirical privacy study linking policies to actual app logs; concrete, scalable privacy auditing angle. Tags: privacy, auditing, mobile-security, empirical-study, logging
  • 2604.24468 | cs.CR, cs.CL, cs.DC, cs.LG | 84 | A Survey on Split Learning for LLM Fine-Tuning: Models, Systems, and Privacy Optimizations [PDF]
    Timely survey on privacy-preserving split learning for LLM fine-tuning, including defenses and attacks. Tags: llm, privacy, split-learning, fine-tuning, survey
  • 2604.12545 | cs.AI, cs.CY | 84 | Cross-Cultural Simulation of Citizen Emotional Responses to Bureaucratic Red Tape Using LLM Agents [PDF]
    Evaluates LLM agents against human emotional responses across cultures; useful for agent realism and evals. Tags: llm-agents, evaluation, cross-cultural, simulation, alignment
  • 2604.25149 | cs.AI | 84 | Semantic Layers for Reliable LLM-Powered Data Analytics: A Paired Benchmark of Accuracy and Hallucination Across Three Frontier Models [PDF]
    Paired benchmark shows semantic context sharply improves NL-to-data accuracy and reduces hallucination. Tags: hallucination, data-analytics, benchmark, grounding, llm-evaluation
  • 2604.19070 | cs.CL, cs.LG | 83 | TRN-R1-Zero: Text-rich Network Reasoning via LLMs with Reinforcement Learning Only [PDF]
    RL-only post-training for LLM reasoning on text-rich networks; notable frontier LLM training angle. Tags: llm, reinforcement-learning, reasoning, post-training, graphs
  • 2604.25584 | cs.AI | 82 | DualFact+: A Multimodal Fact Verification Framework for Procedural Video Understanding [PDF]
    Multimodal factuality benchmark/framework exposing fluent-but-wrong model outputs in procedural video tasks. Tags: multimodal, factuality, evaluation, benchmark, video-understanding
  • 2604.20596 | cs.LG, cs.CR | 82 | Differentially Private Clustered Federated Learning with Privacy-Preserving Initialization and Normality-Driven Aggregation [PDF]
    Combines clustered FL with differential privacy; practical privacy-preserving learning under heterogeneity. Tags: differential-privacy, federated-learning, privacy, security, distributed-ml
  • 2603.09691 | cs.CL, cs.AI | 82 | ESAinsTOD: A Unified End-to-End Schema-Aware Instruction-Tuning Framework for Task-Oriented Dialog Modeling [PDF]
    Unified instruction- and schema-aware tuning for task-oriented dialog; reusable LLM adaptation framework. Tags: llm, instruction-tuning, task-oriented-dialog, schema, alignment
  • 2604.25318 | cs.GR, cs.AI, cs.CL | 82 | Cutscene Agent: An LLM Agent Framework for Automated 3D Cutscene Generation [PDF]
    LLM agent framework with MCP-based tool integration for end-to-end 3D cutscene generation. Tags: llm-agents, tool-use, mcp, automation, multimodal
  • 2603.11392 | cs.NI, cs.AI | 82 | Agentic AI for Embodied-enhanced Beam Prediction in Low-Altitude Economy Networks [PDF]
    Multi-agent LLM reasoning for embodied comms; agentic design is relevant though safety claims are limited. Tags: agents, multi-agent, LLM, reasoning, embodied-ai
  • 2604.17718 | cs.CL, cs.SI | 82 | Do LLMs Use Cultural Knowledge Without Being Told? A Multilingual Evaluation of Implicit Pragmatic Adaptation [PDF]
    Evaluates implicit pragmatic adaptation across languages; useful for reliability and culturally aware LLM behavior. Tags: LLM-evaluation, multilingual, pragmatics, reliability, culture
  • 2604.24114 | cs.CL | 82 | IRIS: Interleaved Reinforcement with Incremental Staged Curriculum for Cross-Lingual Mathematical Reasoning [PDF]
    Cross-lingual math reasoning with staged RL curriculum and released dataset; useful post-training signal. Tags: llm, reasoning, rl, multilingual, dataset

AI Paper Insight Briefing

2026-05-05

0) Executive Takeaways (read this first)

  • A strong pattern today: inference-time structure beats pure model scaling. Whether it is semantic layers for text-to-SQL, schema-aware prompting for task-oriented dialog, conflict-driven visual verification, or evidence-grounded evaluation, adding the right external structure markedly improves reliability.
  • Several papers are pushing agentic, tool-using systems from demos toward operational workflows: Android malware triage, SOC copilots, beam prediction, cutscene generation, and scientific-data readiness all rely on decomposing the task into specialized agents plus deterministic tools, rather than end-to-end prompting alone.
  • On privacy and security, the most substantive progress comes from structural fixes to known bottlenecks: group-bounded DP contrastive learning, privacy-preserving initialization for clustered federated learning, unsupervised API schema induction, and large-scale analysis of app-log/privacy-policy mismatches.
  • Evaluation is becoming more diagnostic and evidence-grounded rather than purely score-oriented: QEVA, DualFact+, DRAGON, PSI-Bench, and M$^3$-VQA all decompose failure into interpretable sub-dimensions such as chronology, evidence support, persona consistency, or group-level realism.
  • RL and post-training work increasingly asks where supervision should land, not just whether to use RL: PAINT, TRN-R1-Zero, and IRIS all reshape rewards or curricula around more informative positions, neighbor influence, or partial-solution continuations.
  • A recurring caveat: many of the gains come with latency, toolchain, or annotation overhead. The practical frontier is no longer "does this work?" but "does it work within the deployment budget, with a stable judge, and without fragile external dependencies?"

2) Key Themes (clustered)

Theme: Structured context is a reliability multiplier

Theme: Agentic systems are turning into tool-orchestration systems

Theme: Privacy and security progress is moving from point defenses to pipeline design

Theme: Evaluation is moving toward evidence-grounded, interpretable diagnostics

Theme: RL and post-training are becoming more targeted

3) Technical Synthesis

  • A common system pattern is LLM plus deterministic backbone: MARD uses Soot/FlowDroid, the API-security work uses graph validation plus autoencoders, Cutscene Agent uses engine-native MCP tools, and Active-Look uses external grounding experts.
  • Several papers replace monolithic inference with selective verification loops: Active-Look re-checks disputed regions, M$^3$-VQA's agentic retrieval decomposes multi-hop queries, QEVA verifies summaries through QA, and DualFact checks extracted facts against the video.
  • In at least one tightly controlled setting, context engineering beats model selection: in text-to-SQL, semantic-layer context yields statistically distinct high- and low-accuracy clusters, while model differences within a cluster are not significant.
  • Privacy work repeatedly attacks sensitivity at the structural level: DP-GCL bounds each group's contribution by grouping negatives; PINA compresses and privatizes sparse LoRA sketches before secure aggregation to complete initialization.
  • Evaluation papers increasingly adopt human-aligned decompositions: chronology, evidence localization, NEP progression, conceptual versus situational facts, and top-3 emotion overlap all make failures inspectable.
  • Several robustness papers show that naive aggregation can hurt: merging visual detectors degrades grounding, full-solution conditioning over-sharpens privileged distillation, and flat URL/payload modeling misses API structure.
  • Budget-aware inference design is clearly on the rise: Active-Look allocates visual tokens to disputed boxes, agentic beam prediction switches modality paths, and PAINT sparsifies teacher interpolation to the positions with the highest entropy mismatch.
  • Much of this work surfaces silent failure modes rather than overt errors: semantically wrong but executable SQL, privacy leaks not disclosed in policies, culturally mismatched pragmatics absent explicit instruction, and correct answers lacking diagram-evidence support.
  • A recurring empirical pattern is strong benchmark gains paired with deployment caveats: the latency overhead of multimodal verification, AEGIS's black-box query cost, Active-Look's fixed preprocessing overhead, and the single-site qualitative validation in the SOC adoption study.
  • Across modalities, the field is converging on evidence-first reliability: if a model cannot point to the right schema, region, page, fact, or trajectory, answer quality alone is no longer considered sufficient.
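
Several of the bullets above hinge on sparse, targeted supervision. As a toy illustration of the entropy-mismatch idea attributed to PAINT, the sketch below ranks token positions by the gap between student and teacher predictive entropy; the distributions, `top_k`, and function names are hypothetical stand-ins, not the paper's method.

```python
# Sketch: choose where to place teacher supervision by ranking positions
# on |H(student) - H(teacher)|. All inputs and names are illustrative.
import math

def entropy(p):
    """Shannon entropy (nats) of a probability vector."""
    return -sum(x * math.log(x) for x in p if x > 0)

def mismatch_positions(student_dists, teacher_dists, top_k=2):
    """Return indices of the top_k positions with the largest
    student/teacher entropy gap; supervise only those positions."""
    gaps = [abs(entropy(s) - entropy(t))
            for s, t in zip(student_dists, teacher_dists)]
    return sorted(range(len(gaps)), key=gaps.__getitem__, reverse=True)[:top_k]
```

Positions where the student is confidently wrong (or uncertain where the teacher is sure) surface first, so a fixed supervision budget lands where it is most informative.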

4) Top 5 Papers (with "why now")

Semantic Layers for Reliable LLM-Powered Data Analytics: A Paired Benchmark of Accuracy and Hallucination Across Three Frontier Models

  • Shows that a small hand-written semantic layer lifts one-shot analytics pass rates by +17.2 to +23.2 points across three frontier models.
  • A strong paired design isolates context as the main driver: runs with the semantic layer cluster together, and raw-schema runs cluster separately.
  • Useful now because many teams are deciding whether an analytics copilot should invest in model upgrades or in semantic modeling.
  • Caveat: the evidence comes from a single retail dataset and one prompt format; generality across domains and runtime semantic systems remains unestablished.
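
A minimal sketch of what such a paired run can look like, assuming a toy retail-style schema; `build_prompt`, `paired_eval`, and the semantic-layer text are illustrative stand-ins, not the paper's artifacts.

```python
# Sketch: the same NL question is prompted twice, once with only the raw
# schema and once with a small semantic layer, then scored pairwise.
# Schema, metric definitions, and function names are invented for illustration.

SEMANTIC_LAYER = """\
Metric: revenue = SUM(orders.qty * orders.unit_price)
Dimension: month = DATE_TRUNC('month', orders.ordered_at)
Note: exclude rows where orders.status = 'cancelled'
"""

SCHEMA = "orders(id, qty, unit_price, ordered_at, status)"

def build_prompt(question: str, with_layer: bool) -> str:
    """Assemble the paired prompts; only the context block differs."""
    context = SCHEMA + ("\n" + SEMANTIC_LAYER if with_layer else "")
    return f"Context:\n{context}\nQuestion: {question}\nSQL:"

def paired_eval(results):
    """results = [(passed_raw, passed_with_layer), ...] per question."""
    n = len(results)
    return {
        "raw_pass": sum(r for r, _ in results) / n,
        "layer_pass": sum(l for _, l in results) / n,
        "flips_to_pass": sum((not r) and l for r, l in results),
    }
```

The `flips_to_pass` count is the pairwise signal: questions that fail on the raw schema but pass with the layer attribute the gain to context rather than to the model.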

Differentially Private Contrastive Learning via Bounding Group-level Contribution

  • Fixes the sensitivity of InfoNCE training to 2C via in-group negatives and per-group clipping.
  • Reports significant gains over prior DP contrastive learning methods on classification and image-text retrieval, with better large-batch scaling.
  • Useful now because privacy-preserving representation learning has long been limited by poor DP utility under contrastive objectives.
  • Caveat: no billion-scale pretraining results yet, and a meaningful gap to non-private training remains.
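
A generic sketch of the per-group clipping mechanism this family of methods builds on; `C`, `sigma`, and `private_group_update` are hypothetical names, and the paper's exact estimator and 2C sensitivity analysis are not reproduced here.

```python
# Sketch: clip each group's gradient contribution to L2 norm C, sum,
# then add Gaussian noise calibrated to the clipped sensitivity.
# Values and function names are illustrative, not the paper's algorithm.
import math
import random

def clip_to_norm(vec, C):
    """Scale vec so its L2 norm is at most C."""
    norm = math.sqrt(sum(x * x for x in vec))
    scale = min(1.0, C / max(norm, 1e-12))
    return [x * scale for x in vec]

def private_group_update(group_grads, C=1.0, sigma=0.8, seed=0):
    """One noisy update: per-group clipping bounds any single group's
    influence, which is what makes the noise scale meaningful."""
    rng = random.Random(seed)
    clipped = [clip_to_norm(g, C) for g in group_grads]
    total = [sum(col) for col in zip(*clipped)]
    return [t + rng.gauss(0.0, sigma * C) for t in total], clipped
```

The design point is that clipping happens at the group level, not per example, so the unit whose contribution is bounded matches the unit the privacy guarantee protects.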

Global Context or Local Detail? Adaptive Visual Grounding for Hallucination Mitigation

  • Proposes a practical, training-free hallucination mitigation: arbitrating between global highlighting and selective zoom-in based on detector disagreement.
  • Achieves consistent gains on POPE, MME, and CHAIR across multiple LVLMs, with strong ablations explaining why a naive detector union fails.
  • Useful now because inference-time mitigation is one of the few deployable levers for existing multimodal models.
  • Caveat: depends on external detector recall and adds considerable runtime/token overhead.
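
A toy sketch of disagreement-triggered routing between a cheap global pass and a selective zoom-in; the IoU threshold and `route` helper are illustrative, and the paper's actual decoding intervention works differently.

```python
# Sketch: compare two detectors' boxes; if any box from detector A has no
# IoU match in detector B, route those disputed boxes to a zoom-in pass.
# Threshold and routing rule are invented for illustration.

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def route(det_a, det_b, agree_iou=0.5):
    """Return ('global', []) when the detectors agree on every box,
    else ('zoom', disputed) listing det_a boxes with no match in det_b."""
    disputed = [box for box in det_a
                if not any(iou(box, other) >= agree_iou for other in det_b)]
    return ("zoom", disputed) if disputed else ("global", [])
```

Spending the extra visual tokens only on disputed boxes is what keeps the mitigation budget-aware instead of reprocessing the whole image.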

MARD: A Multi-Agent Framework for Robust Android Malware Detection

  • Combines manifest-level risk screening, ReAct-style static-analysis forensics, and a final LLM verdict into an interpretable zero-shot malware pipeline.
  • Reports strong F1 on CICMalDroid and AndroZoo, temporal robustness under concept drift, and a cost below $0.10 per APK.
  • Useful now because security teams need interpretable, drift-resistant LLM-assisted triage, not just classifiers that score well on benchmarks.
  • Caveat: packed and dynamically loaded apps remain a weakness, and production throughput is unverified.
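
The staged structure can be sketched with deterministic stubs; all stage logic, permission lists, and thresholds below are invented placeholders, and MARD's real agents drive Soot/FlowDroid plus an LLM adjudicator.

```python
# Sketch: manifest screen -> static-analysis forensics -> final verdict.
# Each stage is a stub standing in for a specialized agent; the rules and
# RISKY_PERMS set are illustrative only.

RISKY_PERMS = {"SEND_SMS", "READ_CONTACTS", "SYSTEM_ALERT_WINDOW"}

def manifest_screen(manifest: dict) -> dict:
    """Stage 1: cheap rule-based risk screen over declared permissions."""
    hits = RISKY_PERMS & set(manifest.get("permissions", []))
    return {"risky_perms": sorted(hits), "escalate": bool(hits)}

def static_forensics(taint_flows) -> dict:
    """Stage 2 stub: flag (source, sink) flows leaking contacts to network."""
    leaks = [(s, k) for s, k in taint_flows
             if s == "contacts" and k == "network"]
    return {"leak_flows": leaks}

def verdict(screen: dict, forensics: dict) -> str:
    """Stage 3 stub: deterministic stand-in for the LLM adjudicator."""
    if screen["escalate"] and forensics["leak_flows"]:
        return "malicious"
    return "needs_review" if screen["escalate"] else "benign"
```

Because each stage emits structured evidence (permissions hit, flows found), the final verdict stays auditable, which is the interpretability property the pipeline is after.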

QEVA: A Reference-Free Evaluation Metric for Narrative Video Summarization with Multimodal Question Answering

  • Provides a reference-free evaluation metric that decomposes summary quality into coverage, factuality, and chronology via multimodal QA.
  • Correlates with human judgment better than a wide set of baselines on a new 800-summary benchmark.
  • Useful now because video summarization is advancing faster than human references can be built, and teams need scalable evaluation that catches temporal and factual errors.
  • Caveat: still relies on strong LLM/VLM components, so judge hallucination and API cost remain practical concerns.
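
A toy sketch of reference-free QA scoring along the three axes named above; the exact-match answer comparison and event-order bookkeeping are crude stand-ins for QEVA's multimodal QA models, and all names are hypothetical.

```python
# Sketch: score a summary without references by (1) how many source-derived
# questions it can answer (coverage), (2) how often its answers agree with
# answers grounded in the source (factuality), and (3) how many event pairs
# it keeps in source order (chronology). Inputs are toy stand-ins.

def qa_scores(summary_answers, source_answers, summary_order, source_order):
    """summary_answers[i] is None when question i is unanswerable from
    the summary; *_order are event ids in narration order."""
    answered = sum(a is not None for a in summary_answers)
    coverage = answered / max(1, len(summary_answers))
    factuality = sum(a == b for a, b in zip(summary_answers, source_answers)
                     if a is not None) / max(1, answered)
    n = len(summary_order)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    in_order = sum(source_order.index(summary_order[i])
                   < source_order.index(summary_order[j]) for i, j in pairs)
    chronology = in_order / max(1, len(pairs))
    return {"coverage": coverage, "factuality": factuality,
            "chronology": chronology}
```

Keeping the three subscores separate is the point: a fluent summary can score high on coverage while the chronology term exposes reordered events.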

5) Practical Next Steps

  • Add a structured context layer before scaling the model: semantic-layer documentation for analytics, explicit schemas for dialog, and evidence retrieval for multimodal QA.
  • For agentic systems, prefer tool-first task decomposition: leave planning to the LLM, but put verification, retrieval, static analysis, and execution into deterministic, logged modules.
  • Measure silent error rates, not just task accuracy: executable-but-wrong SQL, unsupported visual assertions, ungrounded evidence boxes, and policy/log mismatches.
  • In multimodal systems, implement budget-constrained selective verification rather than full reprocessing; detector disagreement and retrieval uncertainty are useful triggers.
  • For RL and post-training, test sparse, targeted supervision: reward informative positions, partial continuations, or structurally important context, not only final outcomes.
  • In privacy-sensitive representation learning, evaluate whether structural sensitivity controls (grouping, bounded contribution, secure clustered aggregation) beat standard DP-SGD baselines on utility.
  • When deploying LLMs in security- or policy-relevant settings, add distributional and cultural diagnostics instead of relying on average judge scores alone.
  • Build evaluation stacks that return actionable sub-scores, such as grounding, chronology, omission, salience, calibration, or realism, so failures can feed back into training and product design.
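
For the silent-error item in the list above, a minimal sketch of measuring executable-but-wrong SQL against a gold query, using sqlite3 as a stand-in engine; the schema, queries, and function name are illustrative.

```python
# Sketch: separate overt failures (query raises an error) from silent ones
# (query executes but returns the wrong result set). sqlite3 is only a
# convenient in-memory stand-in for a real analytics engine.
import sqlite3

def silent_error_rate(candidates, gold_sql, setup_sql):
    """Return (executable_rate, silent_error_rate) over candidate queries."""
    con = sqlite3.connect(":memory:")
    con.executescript(setup_sql)
    gold = con.execute(gold_sql).fetchall()
    executable = wrong = 0
    for sql in candidates:
        try:
            rows = con.execute(sql).fetchall()
        except sqlite3.Error:
            continue  # overt failure: visible, so not a silent error
        executable += 1
        wrong += rows != gold
    n = len(candidates)
    return executable / n, wrong / n
```

In a toy run, a query computing the wrong aggregate still executes cleanly; that gap between the two rates is exactly what task accuracy alone misses.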

Generated from per-paper analyses; no external browsing was performed.