Daily AI Paper Report (2026-05-05)
Run stats
- Candidates: 4818
- Selected: 30
- Deepread completed: 30
- Window (UTC): 2026-05-01T00:00:00Z → 2026-05-02T00:00:00Z (weekend_backlog_sun, expanded=0)
Selected papers
| arXiv ID | Title / Links | Categories | Score | Why | Tags |
|---|---|---|---|---|---|
| 2604.26467 | Differentially Private Contrastive Learning via Bounding Group-level Contribution | cs.CR | 91 | DP contrastive learning method tackles privacy-utility tradeoff with principled dependency reduction. | privacy, differential-privacy, representation-learning, security |
| 2604.19471 | API Security Based on Automatic OpenAPI Mapping | cs.CR | 90 | Unsupervised API mapping plus anomaly detection with strong security results and deployment relevance. | security, API, anomaly-detection, unsupervised, OpenAPI |
| 2604.26573 | PAINT: Partial-Solution Adaptive Interpolated Training for Self-Distilled Reasoners | cs.LG | 89 | Reasoning training method for self-distilled LLMs with token-level supervision and verified context. | llm-reasoning, post-training, self-distillation, training |
| 2604.24372 | SeaEvo: Advancing Algorithm Discovery with Strategy Space Evolution | cs.CL, cs.AI, cs.NE | 88 | LLM-guided algorithm discovery with explicit strategy-space evolution; strong novelty for agentic search. | llm, agents, algorithm-discovery, evolution, reasoning |
| 2604.25840 | PSI-Bench: Towards Clinically Grounded and Interpretable Evaluation of Depression Patient Simulators | cs.CL, cs.AI | 88 | Clinically grounded eval for LLM patient simulators; reduces judge opacity and measures diversity. | evaluation, llm, safety, benchmark, mental-health |
| 2604.24703 | Defective Task Descriptions in LLM-Based Code Generation: Detection and Analysis | cs.SE, cs.AI | 88 | Targets LLM codegen reliability by detecting defective prompts; practical safety relevance and strong reported gains. | llm-reliability, code-generation, input-quality, evaluation, small-models |
| 2604.25231 | DRAGON: A Benchmark for Evidence-Grounded Visual Reasoning over Diagrams | cs.CV, cs.AI, cs.CL | 88 | Benchmark for evidence-grounded diagram reasoning, targeting faithfulness beyond answer accuracy. | benchmark, vlm, grounding, evaluation, faithfulness |
| 2604.19526 | Evaluating LLM-Generated Obfuscated XSS Payloads for Machine Learning-Based Detection | cs.CR, cs.LG, cs.SE | 87 | LLM-generated XSS obfuscation benchmark pipeline is directly relevant to adversarial security evaluation. | LLM-security, XSS, adversarial-evaluation, red-teaming, cybersecurity |
| 2604.25264 | MARD: A Multi-Agent Framework for Robust Android Malware Detection | cs.CR, cs.SE | 86 | LLM multi-agent malware detection is security-relevant and targets robustness under concept drift. | llm, multi-agent, cybersecurity, malware-detection, robustness |
| 2604.21679 | A Sociotechnical, Practitioner-Centered Approach to Technology Adoption in Cybersecurity Operations: An LLM Case | cs.CR | 86 | Practitioner-grounded study of LLM deployment in SOCs; directly relevant to trust, reliability, and security ops. | llm-security, deployment, trust, cybersecurity, human-factors |
| 2604.25122 | M$^3$-VQA: A Benchmark for Multimodal, Multi-Entity, Multi-Hop Visual Question Answering | cs.CV, cs.AI | 86 | New benchmark for multimodal multi-entity multi-hop reasoning with evidence; useful for MLLM evaluation. | benchmark, multimodal, reasoning, evaluation, retrieval |
| 2604.26479 | Recipes for Calibration Checks in Safety-Critical Applications | stat.ME, cs.LG | 86 | Calibration testing framework for safety-critical probabilistic systems; strong reliability relevance. | calibration, reliability, safety-critical, evaluation, uncertainty |
| 2604.24396 | Global Context or Local Detail? Adaptive Visual Grounding for Hallucination Mitigation | cs.CV, cs.AI | 86 | Inference-time hallucination mitigation for VLMs via decoding intervention to boost visual fidelity. | vlm, hallucination, decoding, reliability |
| 2604.26645 | SciHorizon-DataEVA: An Agentic System for AI-Readiness Evaluation of Heterogeneous Scientific Data | cs.AI, cs.LG | 84 | Agentic framework for trustworthiness and AI-readiness evaluation of scientific data; reusable criteria system. | agents, evaluation, trustworthiness, data-governance, ai-for-science |
| 2604.24350 | Unveiling the Backdoor Mechanism Hidden Behind Catastrophic Overfitting in Fast Adversarial Training | cs.LG, cs.AI, cs.CR | 84 | Backdoor-based explanation of catastrophic overfitting could unify robustness failures and defenses. | adversarial-robustness, backdoors, training-dynamics, security |
| 2604.01635 | Diffusion-Guided Adversarial Perturbation Injection for Generalizable Defense Against Facial Manipulations | cs.CR | 84 | Concrete defense against deepfake facial manipulation with claimed generalization beyond white-box GAN settings. | security, deepfakes, adversarial-defense, privacy, robustness |
| 2604.24052 | QEVA: A Reference-Free Evaluation Metric for Narrative Video Summarization with Multimodal Question Answering | cs.CV, cs.AI | 84 | Reference-free multimodal eval for video summaries targeting coverage, factuality, chronology. | evaluation, multimodal, factuality, video, benchmark |
| 2604.20079 | On the Quantization Robustness of Diffusion Language Models in Coding Benchmarks | cs.LG, cs.CL | 84 | LLM efficiency study with concrete quantization results; relevant to deployable coding models. | llm, diffusion-lm, quantization, efficiency, coding |
| 2604.18552 | Do Privacy Policies Match with the Logs? An Empirical Study of Privacy Disclosure in Android Application Logs | cs.CR, cs.SE | 84 | Large empirical privacy study linking policies to actual app logs; concrete, scalable privacy auditing angle. | privacy, auditing, mobile-security, empirical-study, logging |
| 2604.24468 | A Survey on Split Learning for LLM Fine-Tuning: Models, Systems, and Privacy Optimizations | cs.CR, cs.CL, cs.DC, cs.LG | 84 | Timely survey on privacy-preserving split learning for LLM fine-tuning, including defenses and attacks. | llm, privacy, split-learning, fine-tuning, survey |
| 2604.12545 | Cross-Cultural Simulation of Citizen Emotional Responses to Bureaucratic Red Tape Using LLM Agents | cs.AI, cs.CY | 84 | Evaluates LLM agents against human emotional responses across cultures; useful for agent realism and evals. | llm-agents, evaluation, cross-cultural, simulation, alignment |
| 2604.25149 | Semantic Layers for Reliable LLM-Powered Data Analytics: A Paired Benchmark of Accuracy and Hallucination Across Three Frontier Models | cs.AI | 84 | Paired benchmark shows semantic context sharply improves NL-to-data accuracy and reduces hallucination. | hallucination, data-analytics, benchmark, grounding, llm-evaluation |
| 2604.19070 | TRN-R1-Zero: Text-rich Network Reasoning via LLMs with Reinforcement Learning Only | cs.CL, cs.LG | 83 | RL-only post-training for LLM reasoning on text-rich networks; notable frontier LLM training angle. | llm, reinforcement-learning, reasoning, post-training, graphs |
| 2604.25584 | DualFact+: A Multimodal Fact Verification Framework for Procedural Video Understanding | cs.AI | 82 | Multimodal factuality benchmark/framework exposing fluent-but-wrong model outputs in procedural video tasks. | multimodal, factuality, evaluation, benchmark, video-understanding |
| 2604.20596 | Differentially Private Clustered Federated Learning with Privacy-Preserving Initialization and Normality-Driven Aggregation | cs.LG, cs.CR | 82 | Combines clustered FL with differential privacy; practical privacy-preserving learning under heterogeneity. | differential-privacy, federated-learning, privacy, security, distributed-ml |
| 2603.09691 | ESAinsTOD: A Unified End-to-End Schema-Aware Instruction-Tuning Framework for Task-Oriented Dialog Modeling | cs.CL, cs.AI | 82 | Unified instruction- and schema-aware tuning for task-oriented dialog; reusable LLM adaptation framework. | llm, instruction-tuning, task-oriented-dialog, schema, alignment |
| 2604.25318 | Cutscene Agent: An LLM Agent Framework for Automated 3D Cutscene Generation | cs.GR, cs.AI, cs.CL | 82 | LLM agent framework with MCP-based tool integration for end-to-end 3D cutscene generation. | llm-agents, tool-use, mcp, automation, multimodal |
| 2603.11392 | Agentic AI for Embodied-enhanced Beam Prediction in Low-Altitude Economy Networks | cs.NI, cs.AI | 82 | Multi-agent LLM reasoning for embodied comms; agentic design is relevant though safety claims are limited. | agents, multi-agent, LLM, reasoning, embodied-ai |
| 2604.17718 | Do LLMs Use Cultural Knowledge Without Being Told? A Multilingual Evaluation of Implicit Pragmatic Adaptation | cs.CL, cs.SI | 82 | Evaluates implicit pragmatic adaptation across languages; useful for reliability and culturally aware LLM behavior. | LLM-evaluation, multilingual, pragmatics, reliability, culture |
| 2604.24114 | IRIS: Interleaved Reinforcement with Incremental Staged Curriculum for Cross-Lingual Mathematical Reasoning | cs.CL | 82 | Cross-lingual math reasoning with staged RL curriculum and released dataset; useful post-training signal. | llm, reasoning, rl, multilingual, dataset |
AI Paper Insight Brief
2026-05-05
0) Executive takeaways (read this first)
- A strong pattern today is that inference-time structure beats raw model scaling: semantic layers for text-to-SQL, schema-aware prompting for task-oriented dialog, conflict-driven visual verification, and evidence-grounded evaluation all show that adding the right external structure materially improves reliability.
- Several papers push agentic/tool-using systems from demo to operational workflow: Android malware triage, SOC copilots, beam prediction, cutscene generation, and scientific-data readiness all rely on decomposition into specialist agents plus deterministic tools rather than end-to-end prompting alone.
- On privacy/security, the most substantive advances are structural fixes to known bottlenecks: group-bounded DP contrastive learning, privacy-preserving clustered FL initialization, unsupervised API schema induction, and large-scale app-log/policy mismatch analysis.
- Evaluation is getting more diagnostic and grounded, not just benchmark-score oriented: QEVA, DualFact+, DRAGON, PSI-Bench, and M3-VQA all decompose failure into interpretable subdimensions like chronology, evidence grounding, role consistency, or population-level realism.
- RL/post-training work is increasingly targeting where supervision should land, not just whether to use RL: PAINT, TRN-R1-Zero, and IRIS all reshape rewards or curricula around informative positions, neighbor influence, or partial-solution continuation.
- A recurring caution: many gains come with latency, tooling, or annotation overhead. The practical frontier is no longer “can this work?” but “can it work under deployment budgets, with stable judges, and without brittle external dependencies?”
2) Key themes (clusters)
Theme: Structured context as a reliability multiplier
- Why it matters: Multiple papers show that failures often come from missing task structure rather than insufficient model capacity. Injecting schemas, semantic layers, or targeted evidence can reduce silent errors more effectively than swapping to a larger model.
- Representative papers:
- ESAinsTOD: A Unified End-to-End Schema-Aware Instruction-Tuning Framework for Task-Oriented Dialog Modeling
- Semantic Layers for Reliable LLM-Powered Data Analytics: A Paired Benchmark of Accuracy and Hallucination Across Three Frontier Models
- Global Context or Local Detail? Adaptive Visual Grounding for Hallucination Mitigation
- M$^3$-VQA: A Benchmark for Multimodal, Multi-Entity, Multi-Hop Visual Question Answering
- Common approach:
- Add explicit task structure at inference or fine-tuning time: schemas, semantic-layer docs, gold evidence, or detector-backed visual regions.
- Use selective retrieval/verification rather than trusting a single-pass generation.
- Diagnose failures as missing disambiguation or grounding, not just weak reasoning.
- Open questions / failure modes:
- External structure can become a new bottleneck: schema curation, semantic-layer authoring, detector recall, and retrieval quality.
- Gains may not transfer when labels are abbreviated, evidence is incomplete, or the external tool misses the relevant object/page.
- Added context often improves correctness but increases latency and system complexity.
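The "structured context plus selective verification" pattern above can be sketched in a few lines. This is a minimal illustration, not any paper's method: `SEMANTIC_LAYER`, `build_prompt`, `verify`, and the `generate` callable are all hypothetical stand-ins for a real semantic layer and model call.

```python
# Sketch of "structured context as a reliability multiplier":
# prepend an explicit semantic layer to the prompt, then verify the
# draft answer against that layer instead of trusting a single pass.
# All names here (SEMANTIC_LAYER, generate, verify) are illustrative stubs.

SEMANTIC_LAYER = {
    # metric name -> business definition the model must respect
    "net_revenue": "gross_revenue - returns - discounts",
    "active_user": "user with >=1 session in the trailing 28 days",
}

def build_prompt(question: str) -> str:
    """Inject the semantic layer as explicit context, not an afterthought."""
    defs = "\n".join(f"- {k}: {v}" for k, v in SEMANTIC_LAYER.items())
    return f"Semantic layer:\n{defs}\n\nQuestion: {question}"

def verify(answer: str) -> bool:
    """Cheap structural check: the answer must name at least one defined metric."""
    return any(k in answer for k in SEMANTIC_LAYER)

def answer_with_structure(question: str, generate) -> str:
    """Single retry loop: regenerate with a stricter instruction when the
    structural check fails, rather than silently returning the draft."""
    draft = generate(build_prompt(question))
    if verify(draft):
        return draft
    return generate(build_prompt(question) + "\nUse only defined metrics.")
```

The design point is that the verifier is deterministic and cheap, so the retry is triggered by a structural signal rather than by another model call.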
Theme: Agentic systems are becoming tool orchestration systems
- Why it matters: The most credible agent papers today are not “LLM does everything” stories; they pair LLM planning with deterministic tools, constrained roles, and explicit state. That is a more realistic template for production agents in security, communications, and creative tooling.
- Representative papers:
- MARD: A Multi-Agent Framework for Robust Android Malware Detection
- A Sociotechnical, Practitioner-Centered Approach to Technology Adoption in Cybersecurity Operations: An LLM Case
- Agentic AI for Embodied-enhanced Beam Prediction in Low-Altitude Economy Networks
- Cutscene Agent: An LLM Agent Framework for Automated 3D Cutscene Generation
- Common approach:
- Split work into planner / analyst / verifier roles with explicit handoffs.
- Keep hard operations in tools: static analysis, retrieval, rendering, engine APIs, or signal models.
- Use iterative observe-think-act loops with audit trails or structured outputs.
- Open questions / failure modes:
- Tool quality and access constraints dominate outcomes; weak tools or blocked APIs cap agent performance.
- Multi-agent systems often improve capability at the cost of runtime, token use, and operational fragility.
- Many evaluations remain offline or single-site, leaving real deployment robustness underexplored.
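The tool-first decomposition above can be sketched as a loop where the planner only chooses the next tool and every handoff lands in an audit trail. The tool names and the fixed planner policy are illustrative assumptions, not taken from any single paper.

```python
# Sketch of tool-first agent decomposition: the "planner" only picks the
# next action; hard operations live in deterministic tool functions; every
# step is appended to an auditable trail. Tool names are hypothetical.
from typing import Callable

TOOLS: dict[str, Callable[[str], str]] = {
    "static_scan": lambda apk: f"permissions({apk}): INTERNET,SMS",
    "retrieve_iocs": lambda apk: f"iocs({apk}): none-known",
}

def plan(observation: str) -> str:
    """Stand-in for the LLM planner: a fixed policy over observations."""
    if "permissions" not in observation:
        return "static_scan"
    if "iocs" not in observation:
        return "retrieve_iocs"
    return "done"

def run_agent(apk: str, max_steps: int = 5):
    """Observe-think-act loop with an explicit audit trail of handoffs."""
    trail, observation = [], ""
    for _ in range(max_steps):
        action = plan(observation)
        if action == "done":
            break
        result = TOOLS[action](apk)
        trail.append((action, result))  # auditable record of each step
        observation += result
    return trail
```

In a real system the planner would be an LLM and the tools would wrap static analyzers, retrievers, or engine APIs; the shape of the loop and the trail stays the same.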
Theme: Privacy and security progress is shifting from point defenses to pipeline design
- Why it matters: The strongest security/privacy papers redesign the training or monitoring pipeline itself—bounding sensitivity, preserving cluster structure under DP, learning API schemas from traffic, or measuring disclosure gaps at scale—rather than adding a thin defense layer.
- Representative papers:
- Differentially Private Contrastive Learning via Bounding Group-level Contribution
- Differentially Private Clustered Federated Learning with Privacy-Preserving Initialization and Normality-Driven Aggregation
- API Security Based on Automatic OpenAPI Mapping
- Do Privacy Policies Match with the Logs? An Empirical Study of Privacy Disclosure in Android Application Logs
- Common approach:
- Re-architect the unit of aggregation or validation: groups instead of samples, clusters instead of global models, graphs instead of flat URLs.
- Combine formal guarantees or structural checks with practical heuristics and tooling.
- Emphasize deployment constraints such as unlabeled traffic, secure aggregation, or large-batch utility.
- Open questions / failure modes:
- Many methods still lack field-scale validation, especially under adversarial adaptation or production traffic diversity.
- Privacy-preserving methods retain nontrivial utility gaps and often add communication or compute overhead.
- Some empirical studies expose inconsistencies in reporting or limited interaction coverage, which matters for compliance claims.
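The "re-architect the unit of aggregation" idea can be made concrete with a generic per-group clipping sketch. This illustrates the sensitivity bound only; it is not the exact DP-GCL or clustered-FL procedure from the papers above, and the noise calibration is a simplification.

```python
# Generic sketch of bounding group-level contribution: clip each group's
# summed gradient to norm C, then add Gaussian noise calibrated to
# sensitivity 2C (swapping one whole group moves the clipped sum by at
# most 2C). Illustration only, not any paper's exact algorithm.
import math
import random

def clip(vec, C):
    """Scale a vector down so its L2 norm is at most C."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm <= C:
        return list(vec)
    return [x * C / norm for x in vec]

def private_group_sum(group_grads, C, sigma, rng):
    """Sum per-group gradients after clipping each group to norm C,
    then add N(0, (sigma * 2C)^2) noise to every coordinate."""
    dim = len(group_grads[0])
    total = [0.0] * dim
    for g in group_grads:
        for i, x in enumerate(clip(g, C)):
            total[i] += x
    return [t + rng.gauss(0.0, sigma * 2 * C) for t in total]
```

The key contrast with per-sample DP-SGD is that clipping happens once per group, so within-group dependencies (e.g. a group's negatives in a contrastive batch) no longer inflate sensitivity.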
Theme: Evaluation is moving toward grounded, interpretable diagnostics
- Why it matters: Several papers argue that answer-only or overlap-only metrics hide the real failure modes. Better evaluation now means checking evidence, chronology, role assignment, grounding, or distributional realism.
- Representative papers:
- QEVA: A Reference-Free Evaluation Metric for Narrative Video Summarization with Multimodal Question Answering
- DualFact+: A Multimodal Fact Verification Framework for Procedural Video Understanding
- DRAGON: A Benchmark for Evidence-Grounded Visual Reasoning over Diagrams
- PSI-Bench: Towards Clinically Grounded and Interpretable Evaluation of Depression Patient Simulators
- Common approach:
- Break evaluation into interpretable sub-scores rather than one scalar.
- Compare model outputs against source evidence, not just references or judges.
- Validate automatic metrics against humans or experts where possible.
- Open questions / failure modes:
- Many frameworks still depend on LLM/VLM judges, which can hallucinate or encode bias.
- Annotation cost is high, and some benchmarks remain narrow in domain or language coverage.
- Better diagnostics do not automatically yield better training signals unless integrated into optimization.
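The "sub-scores instead of one scalar" approach can be sketched with a small report type. The three dimensions mirror those discussed above (coverage, factuality, chronology); the equal-weight aggregate and the threshold are illustrative assumptions.

```python
# Sketch of interpretable, decomposed evaluation: return named sub-scores
# plus an aggregate, so failures stay inspectable instead of being hidden
# in a single number. Dimension names and weights are illustrative.
from dataclasses import dataclass

@dataclass
class EvalReport:
    coverage: float    # fraction of key events mentioned
    factuality: float  # fraction of claims supported by evidence
    chronology: float  # fraction of event pairs in correct order

    def aggregate(self) -> float:
        """Equal-weight mean; a real metric would validate weights
        against human judgments."""
        return (self.coverage + self.factuality + self.chronology) / 3

def diagnose(report: EvalReport, threshold: float = 0.7) -> list[str]:
    """Name the failing dimensions rather than hiding them in the mean."""
    return [name for name, value in vars(report).items() if value < threshold]
```

A summary can then score 0.73 overall while `diagnose` still flags factuality as the failing dimension, which is exactly the kind of signal a scalar metric throws away.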
Theme: RL and post-training are getting more targeted
- Why it matters: Rather than generic reward-on-final-answer, newer post-training methods try to identify where signal is most useful: informative graph neighborhoods, partial reasoning suffixes, or curriculum stages. This is a more promising direction for smaller models and transfer settings.
- Representative papers:
- PAINT: Partial-Solution Adaptive Interpolated Training for Self-Distilled Reasoners
- TRN-R1-Zero: Text-rich Network Reasoning via LLMs with Reinforcement Learning Only
- IRIS: Interleaved Reinforcement with Incremental Staged Curriculum for Cross-Lingual Mathematical Reasoning
- Common approach:
- Use structured reward shaping tied to local informativeness or continuation quality.
- Combine SFT warmup/curriculum with RL rather than treating them as substitutes.
- Optimize for efficiency by reducing rollout count or using smaller backbones.
- Open questions / failure modes:
- Many gains remain domain-specific and depend on strong base-model priors.
- Reward design is brittle; sparse or misaligned shaping can still collapse or overfit.
- Evidence for transfer beyond the tested domains is still limited.
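The idea of landing supervision where it is most informative can be sketched with a simple position-selection rule. This is a generic illustration of sparse, targeted supervision; the actual selection and interpolation rules in the papers above differ, and the entropy-gap criterion here is an assumption.

```python
# Generic sketch of sparse, position-targeted supervision: apply the
# teacher signal only at the top-k positions with the largest
# student/teacher disagreement (measured here by an entropy-gap proxy),
# instead of supervising every token uniformly. Illustration only.

def top_mismatch_positions(student_ent, teacher_ent, k):
    """Rank positions by |student entropy - teacher entropy|, keep top-k."""
    gaps = [abs(s - t) for s, t in zip(student_ent, teacher_ent)]
    order = sorted(range(len(gaps)), key=lambda i: gaps[i], reverse=True)
    return sorted(order[:k])  # return selected positions in sequence order

def masked_loss(per_token_loss, positions):
    """Sum the training loss only over the selected positions."""
    return sum(per_token_loss[i] for i in positions)
```

Concentrating the signal this way is what makes the approach attractive for smaller models: most positions contribute little beyond what the base model already predicts, so supervising them mainly adds noise and cost.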
3) Technical synthesis
- A common systems pattern is LLM + deterministic substrate: MARD uses Soot/FlowDroid, API security uses graph validation + autoencoder, Cutscene Agent uses engine-native MCP tools, and Active-Look uses external grounding experts.
- Several papers replace monolithic inference with selective verification loops: Active-Look re-checks disputed regions, M3-VQA agentic retrieval decomposes multi-hop queries, QEVA verifies summaries via QA, and DualFact verifies extracted facts against video.
- Context engineering outperformed model choice in at least one tightly controlled setup: in text-to-SQL, semantic-layer context created statistically distinct high-accuracy and low-accuracy clusters, with within-cluster model differences insignificant.
- Privacy work repeatedly attacks sensitivity at the structural level: DP-GCL bounds contribution by grouping negatives; PINA compresses and privatizes sparse LoRA sketches for initialization before secure aggregation.
- Evaluation papers increasingly use human-aligned decompositions: chronology, evidence localization, NEP progression, conceptual vs contextual facts, and top-3 emotion overlap all make failures inspectable.
- Multiple robustness papers show that naive aggregation can hurt: unioning visual detectors degrades grounding, full-solution conditioning can oversharpen privileged distillation, and flat URL/payload modeling misses API structure.
- There is a notable rise in budget-aware inference design: Active-Look allocates visual tokens to disputed boxes, agentic beam prediction switches modality paths, and PAINT sparsifies teacher interpolation to top-entropy-mismatch positions.
- Several works expose silent failure modes rather than overt errors: semantically wrong but executable SQL, privacy leaks absent from policies, culturally mismatched pragmatic behavior without explicit instruction, and correct answers without grounded diagram evidence.
- A recurring empirical pattern is strong benchmark gains with deployment caveats: latency overhead in multimodal verification, black-box query cost in AEGIS, fixed preprocessing overhead in Active-Look, and single-site qualitative validation in SOC adoption.
- Across modalities, the field is converging on evidence-first reliability: if a model cannot point to the right schema, region, page, fact, or trajectory, answer quality alone is no longer treated as sufficient.
4) Top 5 papers (with “why now”)
Semantic Layers for Reliable LLM-Powered Data Analytics: A Paired Benchmark of Accuracy and Hallucination Across Three Frontier Models
- Shows a small hand-authored semantic layer boosts first-shot analytical pass rate by +17.2 to +23.2 points across three frontier models.
- Strong paired design isolates context as the main driver; semantic-layer runs cluster together and raw-schema runs cluster together.
- Useful now because many teams are deciding whether to invest in model upgrades or semantic modeling for analytics copilots.
- Skeptical take: evidence comes from one retail dataset and one prompt form; generality across domains and runtime semantic systems is still open.
Differentially Private Contrastive Learning via Bounding Group-level Contribution
- Reworks InfoNCE training so sensitivity is fixed at 2C via within-group negatives and per-group clipping.
- Reports strong gains over prior DP contrastive methods in both classification and image-text retrieval, plus better large-batch scaling.
- Useful now because privacy-preserving representation learning has been bottlenecked by poor DP utility under contrastive objectives.
- Skeptical take: no billion-scale pretraining results yet, and a meaningful gap to non-private training remains.
Global Context or Local Detail? Adaptive Visual Grounding for Hallucination Mitigation
- Introduces a practical training-free hallucination mitigation method that arbitrates between global highlighting and selective zoom based on detector disagreement.
- Delivers consistent gains on POPE, MME, and CHAIR across multiple LVLMs, with strong ablations explaining why naive detector union fails.
- Useful now because inference-time mitigation is one of the few deployable levers for existing multimodal models.
- Skeptical take: depends on external detector recall and adds substantial runtime/token overhead.
MARD: A Multi-Agent Framework for Robust Android Malware Detection
- Combines manifest-level risk screening, ReAct-style static-analysis forensics, and final LLM adjudication into an interpretable zero-shot malware pipeline.
- Reports strong F1 on CICMalDroid and AndroZoo, plus temporal robustness under concept drift and per-APK cost under $0.10.
- Useful now because security teams want LLM-assisted triage that is explainable and resilient to distribution shift, not just benchmarked classifiers.
- Skeptical take: packed/dynamically loaded apps remain a weakness, and production throughput is not established.
QEVA: A Reference-Free Evaluation Metric for Narrative Video Summarization with Multimodal Question Answering
- Provides a reference-free metric that decomposes summary quality into coverage, factuality, and chronology using multimodal QA.
- Achieves higher correlation with human judgments than a broad set of baselines on a new 800-summary benchmark.
- Useful now because video summarization is moving faster than human-reference creation, and teams need scalable evaluation that catches temporal/factual errors.
- Skeptical take: still relies on strong LLM/VLM components, so judge hallucination and API cost remain real concerns.
5) Practical next steps
- Add structured context layers before scaling models: semantic-layer docs for analytics, explicit schemas for dialog, and evidence retrieval for multimodal QA.
- For agent systems, prefer tool-first decomposition: keep planning in the LLM, but move verification, retrieval, static analysis, and execution into deterministic modules with logs.
- Measure silent-error rates, not just task accuracy: executable-but-wrong SQL, unsupported visual claims, ungrounded evidence boxes, or policy/log mismatches.
- In multimodal systems, implement selective verification under a budget rather than full reprocessing; detector disagreement or retrieval uncertainty is a useful trigger.
- For RL/post-training, test sparse, targeted supervision: reward informative positions, partial continuations, or structurally important context instead of only final outcomes.
- In privacy-sensitive representation learning, evaluate whether structural sensitivity control (grouping, bounded contribution, secure clustered aggregation) gives better utility than standard DP-SGD baselines.
- If deploying LLMs in safety- or policy-adjacent settings, add distributional and cultural diagnostics rather than relying on average judge scores.
- Build evaluation stacks that return actionable sub-scores—grounding, chronology, omission, salience, calibration, or realism—so failures can feed back into training and product design.
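The silent-error point above can be sketched with sqlite3: a generated query can execute cleanly yet return the wrong answer, so the check compares result sets against a gold query rather than stopping at "it ran". The table and queries are illustrative.

```python
# Sketch of measuring silent errors in generated SQL: "executes" is not
# "correct", so compare each candidate's result set against a gold query.
# The schema and queries below are hypothetical examples.
import sqlite3

def silent_error_rate(conn, gold_sql, candidate_sqls):
    """Fraction of candidates that execute without error but return a
    result set different from the gold query's."""
    gold = conn.execute(gold_sql).fetchall()
    silent = 0
    for sql in candidate_sqls:
        try:
            result = conn.execute(sql).fetchall()
        except sqlite3.Error:
            continue  # loud failure: visible to the user, not silent
        if result != gold:
            silent += 1  # ran fine, answered wrong: the silent case
    return silent / len(candidate_sqls)

# Tiny in-memory fixture to exercise the check.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, net REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("eu", 10.0), ("eu", 5.0), ("us", 7.0)])
```

The same shape applies beyond SQL: any task with an executable output and a cheap oracle (gold query, reference render, unit test) supports this kind of silent-error audit.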
Generated from per-paper analyses; no external browsing.
