Daily AI Paper Report (2026-05-05)

Run stats

  • Candidates: 4818
  • Selected: 30
  • Deepread completed: 30
  • Window (UTC): 2026-05-01T00:00:00Z → 2026-05-02T00:00:00Z (weekend_backlog_sun, expanded=0)
Selected papers

arXiv ID | Title | Categories | Score | Why | Tags
2604.26467 | Differentially Private Contrastive Learning via Bounding Group-level Contribution | cs.CR | 91 | DP contrastive learning method tackles privacy-utility tradeoff with principled dependency reduction. | privacy, differential-privacy, representation-learning, security
2604.19471 | API Security Based on Automatic OpenAPI Mapping | cs.CR | 90 | Unsupervised API mapping plus anomaly detection with strong security results and deployment relevance. | security, API, anomaly-detection, unsupervised, OpenAPI
2604.26573 | PAINT: Partial-Solution Adaptive Interpolated Training for Self-Distilled Reasoners | cs.LG | 89 | Reasoning training method for self-distilled LLMs with token-level supervision and verified context. | llm-reasoning, post-training, self-distillation, training
2604.24372 | SeaEvo: Advancing Algorithm Discovery with Strategy Space Evolution | cs.CL, cs.AI, cs.NE | 88 | LLM-guided algorithm discovery with explicit strategy-space evolution; strong novelty for agentic search. | llm, agents, algorithm-discovery, evolution, reasoning
2604.25840 | PSI-Bench: Towards Clinically Grounded and Interpretable Evaluation of Depression Patient Simulators | cs.CL, cs.AI | 88 | Clinically grounded eval for LLM patient simulators; reduces judge opacity and measures diversity. | evaluation, llm, safety, benchmark, mental-health
2604.24703 | Defective Task Descriptions in LLM-Based Code Generation: Detection and Analysis | cs.SE, cs.AI | 88 | Targets LLM codegen reliability by detecting defective prompts; practical safety relevance and strong reported gains. | llm-reliability, code-generation, input-quality, evaluation, small-models
2604.25231 | DRAGON: A Benchmark for Evidence-Grounded Visual Reasoning over Diagrams | cs.CV, cs.AI, cs.CL | 88 | Benchmark for evidence-grounded diagram reasoning, targeting faithfulness beyond answer accuracy. | benchmark, vlm, grounding, evaluation, faithfulness
2604.19526 | Evaluating LLM-Generated Obfuscated XSS Payloads for Machine Learning-Based Detection | cs.CR, cs.LG, cs.SE | 87 | LLM-generated XSS obfuscation benchmark pipeline is directly relevant to adversarial security evaluation. | LLM-security, XSS, adversarial-evaluation, red-teaming, cybersecurity
2604.25264 | MARD: A Multi-Agent Framework for Robust Android Malware Detection | cs.CR, cs.SE | 86 | LLM multi-agent malware detection is security-relevant and targets robustness under concept drift. | llm, multi-agent, cybersecurity, malware-detection, robustness
2604.21679 | A Sociotechnical, Practitioner-Centered Approach to Technology Adoption in Cybersecurity Operations: An LLM Case | cs.CR | 86 | Practitioner-grounded study of LLM deployment in SOCs; directly relevant to trust, reliability, and security ops. | llm-security, deployment, trust, cybersecurity, human-factors
2604.25122 | M$^3$-VQA: A Benchmark for Multimodal, Multi-Entity, Multi-Hop Visual Question Answering | cs.CV, cs.AI | 86 | New benchmark for multimodal multi-entity multi-hop reasoning with evidence; useful for MLLM evaluation. | benchmark, multimodal, reasoning, evaluation, retrieval
2604.26479 | Recipes for Calibration Checks in Safety-Critical Applications | stat.ME, cs.LG | 86 | Calibration testing framework for safety-critical probabilistic systems; strong reliability relevance. | calibration, reliability, safety-critical, evaluation, uncertainty
2604.24396 | Global Context or Local Detail? Adaptive Visual Grounding for Hallucination Mitigation | cs.CV, cs.AI | 86 | Inference-time hallucination mitigation for VLMs via decoding intervention to boost visual fidelity. | vlm, hallucination, decoding, reliability
2604.26645 | SciHorizon-DataEVA: An Agentic System for AI-Readiness Evaluation of Heterogeneous Scientific Data | cs.AI, cs.LG | 84 | Agentic framework for trustworthiness and AI-readiness evaluation of scientific data; reusable criteria system. | agents, evaluation, trustworthiness, data-governance, ai-for-science
2604.24350 | Unveiling the Backdoor Mechanism Hidden Behind Catastrophic Overfitting in Fast Adversarial Training | cs.LG, cs.AI, cs.CR | 84 | Backdoor-based explanation of catastrophic overfitting could unify robustness failures and defenses. | adversarial-robustness, backdoors, training-dynamics, security
2604.01635 | Diffusion-Guided Adversarial Perturbation Injection for Generalizable Defense Against Facial Manipulations | cs.CR | 84 | Concrete defense against deepfake facial manipulation with claimed generalization beyond white-box GAN settings. | security, deepfakes, adversarial-defense, privacy, robustness
2604.24052 | QEVA: A Reference-Free Evaluation Metric for Narrative Video Summarization with Multimodal Question Answering | cs.CV, cs.AI | 84 | Reference-free multimodal eval for video summaries targeting coverage, factuality, chronology. | evaluation, multimodal, factuality, video, benchmark
2604.20079 | On the Quantization Robustness of Diffusion Language Models in Coding Benchmarks | cs.LG, cs.CL | 84 | LLM efficiency study with concrete quantization results; relevant to deployable coding models. | llm, diffusion-lm, quantization, efficiency, coding
2604.18552 | Do Privacy Policies Match with the Logs? An Empirical Study of Privacy Disclosure in Android Application Logs | cs.CR, cs.SE | 84 | Large empirical privacy study linking policies to actual app logs; concrete, scalable privacy auditing angle. | privacy, auditing, mobile-security, empirical-study, logging
2604.24468 | A Survey on Split Learning for LLM Fine-Tuning: Models, Systems, and Privacy Optimizations | cs.CR, cs.CL, cs.DC, cs.LG | 84 | Timely survey on privacy-preserving split learning for LLM fine-tuning, including defenses and attacks. | llm, privacy, split-learning, fine-tuning, survey
2604.12545 | Cross-Cultural Simulation of Citizen Emotional Responses to Bureaucratic Red Tape Using LLM Agents | cs.AI, cs.CY | 84 | Evaluates LLM agents against human emotional responses across cultures; useful for agent realism and evals. | llm-agents, evaluation, cross-cultural, simulation, alignment
2604.25149 | Semantic Layers for Reliable LLM-Powered Data Analytics: A Paired Benchmark of Accuracy and Hallucination Across Three Frontier Models | cs.AI | 84 | Paired benchmark shows semantic context sharply improves NL-to-data accuracy and reduces hallucination. | hallucination, data-analytics, benchmark, grounding, llm-evaluation
2604.19070 | TRN-R1-Zero: Text-rich Network Reasoning via LLMs with Reinforcement Learning Only | cs.CL, cs.LG | 83 | RL-only post-training for LLM reasoning on text-rich networks; notable frontier LLM training angle. | llm, reinforcement-learning, reasoning, post-training, graphs
2604.25584 | DualFact+: A Multimodal Fact Verification Framework for Procedural Video Understanding | cs.AI | 82 | Multimodal factuality benchmark/framework exposing fluent-but-wrong model outputs in procedural video tasks. | multimodal, factuality, evaluation, benchmark, video-understanding
2604.20596 | Differentially Private Clustered Federated Learning with Privacy-Preserving Initialization and Normality-Driven Aggregation | cs.LG, cs.CR | 82 | Combines clustered FL with differential privacy; practical privacy-preserving learning under heterogeneity. | differential-privacy, federated-learning, privacy, security, distributed-ml
2603.09691 | ESAinsTOD: A Unified End-to-End Schema-Aware Instruction-Tuning Framework for Task-Oriented Dialog Modeling | cs.CL, cs.AI | 82 | Unified instruction- and schema-aware tuning for task-oriented dialog; reusable LLM adaptation framework. | llm, instruction-tuning, task-oriented-dialog, schema, alignment
2604.25318 | Cutscene Agent: An LLM Agent Framework for Automated 3D Cutscene Generation | cs.GR, cs.AI, cs.CL | 82 | LLM agent framework with MCP-based tool integration for end-to-end 3D cutscene generation. | llm-agents, tool-use, mcp, automation, multimodal
2603.11392 | Agentic AI for Embodied-enhanced Beam Prediction in Low-Altitude Economy Networks | cs.NI, cs.AI | 82 | Multi-agent LLM reasoning for embodied comms; agentic design is relevant though safety claims are limited. | agents, multi-agent, LLM, reasoning, embodied-ai
2604.17718 | Do LLMs Use Cultural Knowledge Without Being Told? A Multilingual Evaluation of Implicit Pragmatic Adaptation | cs.CL, cs.SI | 82 | Evaluates implicit pragmatic adaptation across languages; useful for reliability and culturally aware LLM behavior. | LLM-evaluation, multilingual, pragmatics, reliability, culture
2604.24114 | IRIS: Interleaved Reinforcement with Incremental Staged Curriculum for Cross-Lingual Mathematical Reasoning | cs.CL | 82 | Cross-lingual math reasoning with staged RL curriculum and released dataset; useful post-training signal. | llm, reasoning, rl, multilingual, dataset

AI Paper Insight Brief

2026-05-05

0) Executive takeaways (read this first)

  • A strong pattern today is that inference-time structure beats raw model scaling: semantic layers for text-to-SQL, schema-aware prompting for task-oriented dialog, conflict-driven visual verification, and evidence-grounded evaluation all show that adding the right external structure materially improves reliability.
  • Several papers push agentic/tool-using systems from demo to operational workflow: Android malware triage, SOC copilots, beam prediction, cutscene generation, and scientific-data readiness all rely on decomposition into specialist agents plus deterministic tools rather than end-to-end prompting alone.
  • On privacy/security, the most substantive advances are structural fixes to known bottlenecks: group-bounded DP contrastive learning, privacy-preserving clustered FL initialization, unsupervised API schema induction, and large-scale app-log/policy mismatch analysis.
  • Evaluation is getting more diagnostic and grounded, not just benchmark-score oriented: QEVA, DualFact+, DRAGON, PSI-Bench, and M3-VQA all decompose failure into interpretable subdimensions like chronology, evidence grounding, role consistency, or population-level realism.
  • RL/post-training work is increasingly targeting where supervision should land, not just whether to use RL: PAINT, TRN-R1-Zero, and IRIS all reshape rewards or curricula around informative positions, neighbor influence, or partial-solution continuation.
  • A recurring caution: many gains come with latency, tooling, or annotation overhead. The practical frontier is no longer “can this work?” but “can it work under deployment budgets, with stable judges, and without brittle external dependencies?”

2) Key themes (clusters)

  • Theme: Structured context as a reliability multiplier
  • Theme: Agentic systems are becoming tool orchestration systems
  • Theme: Privacy and security progress is shifting from point defenses to pipeline design
  • Theme: Evaluation is moving toward grounded, interpretable diagnostics
  • Theme: RL and post-training are getting more targeted

3) Technical synthesis

  • A common systems pattern is LLM + deterministic substrate: MARD uses Soot/FlowDroid, API security uses graph validation + autoencoder, Cutscene Agent uses engine-native MCP tools, and Active-Look uses external grounding experts.
  • Several papers replace monolithic inference with selective verification loops: Active-Look re-checks disputed regions, M3-VQA's agentic retrieval decomposes multi-hop queries, QEVA verifies summaries via QA, and DualFact+ verifies extracted facts against video.
  • Context engineering outperformed model choice in at least one tightly controlled setup: in text-to-SQL, semantic-layer context created statistically distinct high-accuracy and low-accuracy clusters, with within-cluster model differences insignificant.
  • Privacy work repeatedly attacks sensitivity at the structural level: DP-GCL bounds contribution by grouping negatives; PINA compresses and privatizes sparse LoRA sketches for initialization before secure aggregation.
  • Evaluation papers increasingly use human-aligned decompositions: chronology, evidence localization, NEP progression, conceptual vs contextual facts, and top-3 emotion overlap all make failures inspectable.
  • Multiple robustness papers show that naive aggregation can hurt: unioning visual detectors degrades grounding, full-solution conditioning can oversharpen privileged distillation, and flat URL/payload modeling misses API structure.
  • There is a notable rise in budget-aware inference design: Active-Look allocates visual tokens to disputed boxes, agentic beam prediction switches modality paths, and PAINT sparsifies teacher interpolation to top-entropy-mismatch positions.
  • Several works expose silent failure modes rather than overt errors: semantically wrong but executable SQL, privacy leaks absent from policies, culturally mismatched pragmatic behavior without explicit instruction, and correct answers without grounded diagram evidence.
  • A recurring empirical pattern is strong benchmark gains with deployment caveats: latency overhead in multimodal verification, black-box query cost in AEGIS, fixed preprocessing overhead in Active-Look, and single-site qualitative validation in SOC adoption.
  • Across modalities, the field is converging on evidence-first reliability: if a model cannot point to the right schema, region, page, fact, or trajectory, answer quality alone is no longer treated as sufficient.
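The selective-verification pattern running through several of these systems can be sketched generically: answer once cheaply, re-check only the parts where cheap verifiers disagree, and stop under a fixed budget. The sketch below is a toy illustration of that control flow under invented names, not any paper's actual API:

```python
# Generic selective-verification loop: keep the first-pass answer where
# verifiers agree, and spend the re-check budget only on disputed claims.
# All functions and the toy verifiers are hypothetical stand-ins.

def selective_verify(claims, verifiers, recheck, budget=3):
    """Return claims after re-checking any claim the verifiers dispute."""
    claims = list(claims)
    for _ in range(budget):                                 # bounded passes
        disputed = [i for i, c in enumerate(claims)
                    if len({v(c) for v in verifiers}) > 1]  # disagreement trigger
        if not disputed:
            break                                           # consensus: stop early
        for i in disputed:
            claims[i] = recheck(claims[i])                  # costly step, only where needed
    return claims

# Toy demo: two "verifiers" disagree on the first claim; re-checking amends it.
verifiers = [lambda c: len(c) > 5, lambda c: "evidence" in c]
fixed = selective_verify(
    ["cat on mat", "the report cites evidence for each claim"],
    verifiers,
    recheck=lambda c: c + " (verified with evidence)",
)
```

The design choice this illustrates is that the trigger (verifier disagreement) is cheap while the repair is expensive, which is why the papers above report gains without reprocessing every input.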

4) Top 5 papers (with “why now”)

Semantic Layers for Reliable LLM-Powered Data Analytics: A Paired Benchmark of Accuracy and Hallucination Across Three Frontier Models

  • Shows that a small hand-authored semantic layer boosts first-shot analytical pass rates by 17.2 to 23.2 points across three frontier models.
  • Strong paired design isolates context as the main driver; semantic-layer runs cluster together and raw-schema runs cluster together.
  • Useful now because many teams are deciding whether to invest in model upgrades or semantic modeling for analytics copilots.
  • Skeptical take: evidence comes from one retail dataset and one prompt form; generality across domains and runtime semantic systems is still open.
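What a "semantic layer" adds over a raw schema can be made concrete with a toy text-to-SQL prompt builder. The table, column descriptions, and prompt shape below are invented for illustration; the paper's actual prompt format may differ:

```python
# Toy contrast between prompting with a raw schema vs. a semantic layer.
# The "orders" table and its column semantics are invented examples.

RAW_SCHEMA = "orders(id, cust_id, amt, ts, st)"

SEMANTIC_LAYER = {
    "orders.amt": "order value in USD, pre-tax; refunds appear as negatives",
    "orders.ts": "UTC order timestamp; use for 'last month' style filters",
    "orders.st": "status code: 'C' complete, 'X' cancelled (exclude from revenue)",
}

def build_prompt(question, schema, semantics=None):
    lines = [f"Schema: {schema}"]
    if semantics:  # the semantic layer: business meaning attached to columns
        lines.append("Column semantics:")
        lines += [f"  {col}: {desc}" for col, desc in semantics.items()]
    lines.append(f"Question: {question}")
    lines.append("Write one SQL query.")
    return "\n".join(lines)

prompt = build_prompt("What was last month's revenue?", RAW_SCHEMA, SEMANTIC_LAYER)
```

Without the semantics block, a model must guess that `st = 'X'` rows should be excluded from revenue; that guess is exactly the kind of executable-but-wrong query the paper's hallucination measure targets.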

Differentially Private Contrastive Learning via Bounding Group-level Contribution

  • Reworks InfoNCE training so sensitivity is fixed at 2C via within-group negatives and per-group clipping.
  • Reports strong gains over prior DP contrastive methods in both classification and image-text retrieval, plus better large-batch scaling.
  • Useful now because privacy-preserving representation learning has been bottlenecked by poor DP utility under contrastive objectives.
  • Skeptical take: no billion-scale pretraining results yet, and a meaningful gap to non-private training remains.
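The core mechanics of bounding group-level contribution can be sketched in a DP-SGD style: clip each group's summed gradient to a norm bound C, then noise the aggregate at a scale calibrated to that bound. This is a generic illustration, not the paper's algorithm; its fixed 2C sensitivity comes from the specific within-group negative design, which is omitted here:

```python
# Generic sketch of per-group clipping for DP training: each group's
# gradient contribution is clipped to norm C before noisy aggregation.
# Illustrative only; not the paper's InfoNCE-specific construction.
import numpy as np

def dp_group_update(group_grads, C=1.0, noise_multiplier=1.0, rng=None):
    rng = rng or np.random.default_rng(0)
    clipped = []
    for g in group_grads:                       # one summed gradient per group
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, C / max(norm, 1e-12)))  # clip group to C
    total = np.sum(clipped, axis=0)
    # Noise scaled to the clip bound: any one group shifts the sum by at most C.
    noise = rng.normal(0.0, noise_multiplier * C, size=total.shape)
    return (total + noise) / len(group_grads)

grads = [np.array([3.0, 4.0]), np.array([0.1, -0.2])]  # norms 5.0 and ~0.22
update = dp_group_update(grads, C=1.0)
```

Clipping at the group level rather than per example is what lets the noise scale stay independent of batch size, which is consistent with the large-batch scaling the paper reports.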

Global Context or Local Detail? Adaptive Visual Grounding for Hallucination Mitigation

  • Introduces a practical training-free hallucination mitigation method that arbitrates between global highlighting and selective zoom based on detector disagreement.
  • Delivers consistent gains on POPE, MME, and CHAIR across multiple LVLMs, with strong ablations explaining why naive detector union fails.
  • Useful now because inference-time mitigation is one of the few deployable levers for existing multimodal models.
  • Skeptical take: depends on external detector recall and adds substantial runtime/token overhead.
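The arbitration idea (global highlighting when detectors agree, selective zoom when they disagree) can be sketched with a simple IoU-based decision rule. The rule, thresholds, and names below are hypothetical illustrations of the idea, not the paper's implementation:

```python
# Sketch of disagreement-triggered arbitration between a cheap global pass
# and a costly zoom on specific regions. Boxes are (x0, y0, x1, y1).

def boxes_agree(box_a, box_b, iou_thresh=0.5):
    """Rough agreement check between two detectors' boxes via IoU."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    ix = max(0, min(ax1, bx1) - max(ax0, bx0))
    iy = max(0, min(ay1, by1) - max(ay0, by0))
    inter = ix * iy
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return union > 0 and inter / union >= iou_thresh

def choose_strategy(det_a, det_b):
    """Zoom only on objects the two detectors localize differently."""
    disputed = [name for name in det_a
                if name in det_b and not boxes_agree(det_a[name], det_b[name])]
    return ("zoom", disputed) if disputed else ("global", [])

# Toy example: the detectors agree on "dog" but place "cup" far apart.
a = {"dog": (10, 10, 50, 50), "cup": (0, 0, 10, 10)}
b = {"dog": (12, 11, 52, 49), "cup": (80, 80, 95, 95)}
strategy, regions = choose_strategy(a, b)
```

This also shows why naive detector union fails, per the paper's ablations: unioning would keep both "cup" boxes and pollute the grounding signal, whereas arbitration treats the conflict itself as the signal to look closer.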

MARD: A Multi-Agent Framework for Robust Android Malware Detection

  • Combines manifest-level risk screening, ReAct-style static-analysis forensics, and final LLM adjudication into an interpretable zero-shot malware pipeline.
  • Reports strong F1 on CICMalDroid and AndroZoo, plus temporal robustness under concept drift and per-APK cost under $0.10.
  • Useful now because security teams want LLM-assisted triage that is explainable and resilient to distribution shift, not just benchmarked classifiers.
  • Skeptical take: packed/dynamically loaded apps remain a weakness, and production throughput is not established.
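The staged screen, analyze, adjudicate shape of this pipeline can be sketched as below. Stage logic, permission lists, thresholds, and verdict labels are all illustrative stand-ins; in the paper the screening uses manifests, the forensics stage uses static-analysis tools, and an LLM plays the adjudicator:

```python
# Staged triage sketch: a cheap deterministic screen gates the expensive
# forensics stage, and a final adjudicator combines the evidence.
# All stages here are toy stand-ins, not the paper's components.

DANGEROUS_PERMS = {"SEND_SMS", "READ_CONTACTS", "SYSTEM_ALERT_WINDOW"}

def screen_manifest(manifest):
    """Cheap deterministic risk screen over declared permissions."""
    hits = sorted(set(manifest["permissions"]) & DANGEROUS_PERMS)
    return {"risk_perms": hits, "suspicious": len(hits) >= 2}

def static_forensics(manifest):
    """Stand-in for the expensive analysis stage (e.g. taint tracking)."""
    return {"leaks_found": "SEND_SMS" in manifest["permissions"]}

def adjudicate(screen, forensics):
    """Final verdict from gathered evidence (an LLM's role in the paper)."""
    if screen["suspicious"] and forensics.get("leaks_found"):
        return "malicious"
    return "suspicious" if screen["suspicious"] else "benign"

def triage(manifest):
    screen = screen_manifest(manifest)
    # Escalate to the costly stage only when the screen flags the app.
    forensics = static_forensics(manifest) if screen["suspicious"] else {}
    return adjudicate(screen, forensics)

verdict = triage({"permissions": ["SEND_SMS", "READ_CONTACTS", "INTERNET"]})
```

Gating the expensive stage behind the cheap screen is also what keeps per-APK cost low: most benign apps never reach the forensics step.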

QEVA: A Reference-Free Evaluation Metric for Narrative Video Summarization with Multimodal Question Answering

  • Provides a reference-free metric that decomposes summary quality into coverage, factuality, and chronology using multimodal QA.
  • Achieves higher correlation with human judgments than a broad set of baselines on a new 800-summary benchmark.
  • Useful now because video summarization is moving faster than human-reference creation, and teams need scalable evaluation that catches temporal/factual errors.
  • Skeptical take: still relies on strong LLM/VLM components, so judge hallucination and API cost remain real concerns.
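The QA-based scoring idea can be illustrated with a toy version: facts extracted from the source become question/answer pairs, and a summary is scored on how many it answers. The matching below is string containment purely for illustration; the paper uses multimodal QA models, and its chronology dimension (comparing event order) is omitted here:

```python
# Toy sketch of QA-based, reference-free summary scoring: coverage is the
# fraction of source-derived QA pairs the summary can answer.
# Containment matching is a placeholder for a real QA model.

def qa_score(source_facts, summary):
    answered = [f for f in source_facts
                if f["answer"].lower() in summary.lower()]
    coverage = len(answered) / len(source_facts)
    return {"coverage": round(coverage, 2), "n_answered": len(answered)}

# Invented example facts for a short narrative video.
facts = [
    {"question": "Who wins the race?", "answer": "Ada"},
    {"question": "Where is it held?", "answer": "Berlin"},
    {"question": "When does it start?", "answer": "dawn"},
]
score = qa_score(facts, "Ada wins the race in Berlin.")
```

Because the questions are derived from the source rather than from a human-written reference summary, the metric needs no gold summary, which is the "reference-free" property the paper emphasizes.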

5) Practical next steps

  • Add structured context layers before scaling models: semantic-layer docs for analytics, explicit schemas for dialog, and evidence retrieval for multimodal QA.
  • For agent systems, prefer tool-first decomposition: keep planning in the LLM, but move verification, retrieval, static analysis, and execution into deterministic modules with logs.
  • Measure silent-error rates, not just task accuracy: executable-but-wrong SQL, unsupported visual claims, ungrounded evidence boxes, or policy/log mismatches.
  • In multimodal systems, implement selective verification under a budget rather than full reprocessing; detector disagreement or retrieval uncertainty is a useful trigger.
  • For RL/post-training, test sparse, targeted supervision: reward informative positions, partial continuations, or structurally important context instead of only final outcomes.
  • In privacy-sensitive representation learning, evaluate whether structural sensitivity control (grouping, bounded contribution, secure clustered aggregation) gives better utility than standard DP-SGD baselines.
  • If deploying LLMs in safety- or policy-adjacent settings, add distributional and cultural diagnostics rather than relying on average judge scores.
  • Build evaluation stacks that return actionable sub-scores—grounding, chronology, omission, salience, calibration, or realism—so failures can feed back into training and product design.
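The silent-error accounting recommended above can be operationalized for generated SQL with a three-way split: queries that crash, queries that are correct, and queries that execute but return the wrong result. A minimal sketch using an in-memory SQLite table (schema and queries invented for illustration):

```python
# Sketch of silent-error accounting for generated SQL: separate loud
# failures (exceptions) from executable-but-wrong results.
import sqlite3

def classify(conn, query, expected):
    try:
        got = conn.execute(query).fetchall()
    except sqlite3.Error:
        return "error"                     # loud failure: easy to catch
    return "correct" if got == expected else "silent_wrong"  # runs, wrong answer

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (amt REAL, st TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(10.0, "C"), (5.0, "X"), (7.0, "C")])
expected = [(17.0,)]                       # revenue = completed ('C') orders only

results = [classify(conn, q, expected) for q in [
    "SELECT SUM(amt) FROM orders WHERE st = 'C'",  # correct
    "SELECT SUM(amt) FROM orders",                 # executes, silently wrong
    "SELECT SUM(amt) FROM missing_table",          # loud error
]]
```

Tracking the `silent_wrong` rate separately matters because standard execution-success metrics count the second query as a pass, which is precisely the failure mode the semantic-layer and grounding papers above target.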

Generated from per-paper analyses; no external browsing.