Daily AI Paper Report (2026-04-27)


Run stats

  • Candidates: 4394
  • Selected: 30
  • Deepread completed: 30
  • Window (UTC): 2026-04-24T00:00:00Z → 2026-04-25T00:00:00Z (weekend_backlog_sat, expanded=0)
Selected papers

  • 2604.17745 HiRAS: A Hierarchical Multi-Agent Framework for Paper-to-Code Generation and Execution [PDF] (cs.CL, score 90)
    Why: Hierarchical multi-agent paper-to-code reproduction + improved Paper2Code eval protocol (P2C-Ex).
    Tags: agents, paper-to-code, reproducibility, evaluation, multi-agent, automation
  • 2604.21510 OptiVerse: A Comprehensive Benchmark towards Optimization Problem Solving [PDF] (cs.CL, score 90)
    Why: OptiVerse: 1k optimization problems; 22-LLM eval shows a big drop on hard tasks; strong benchmark value.
    Tags: benchmark, LLM-evaluation, optimization, reasoning, tool-use
  • 2604.19633 Time Series Augmented Generation for Financial Applications [PDF] (cs.AI, cs.CE, score 90)
    Why: Benchmark for LLM agents doing verifiable financial time-series tool use; strong eval focus.
    Tags: agents, tool-use, evaluation, benchmarks, finance, verifiable-tools
  • 2604.21882 Revisiting Non-Verbatim Memorization in Large Language Models: The Role of Entity Surface Forms [PDF] (cs.CL, score 90)
    Why: RedirectQA probes factual recall vs name/surface-form access; key for reliability and memorization evals.
    Tags: LLMs, memorization, factuality, evaluation, datasets, entity-linking, robustness
  • 2604.20572 Ask Only When Needed: Proactive Retrieval from Memory and Skills for Experience-Driven Lifelong Agents [PDF] (cs.CL, score 90)
    Why: Proactive memory/skill retrieval for lifelong agents; strong agentic relevance and reusable framework.
    Tags: agents, lifelong-learning, memory, retrieval, tool-use, online-learning
  • 2604.20621 SoK: The Next Frontier in AV Security: Systematizing Perception Attacks and the Emerging Threat of Multi-Sensor Fusion [PDF] (cs.CR, score 88)
    Why: SoK of AV perception attacks incl. multi-sensor fusion threats; taxonomy + gaps for defenses.
    Tags: security, autonomous-vehicles, perception-attacks, sensor-fusion, survey, robustness
  • 2604.21192 How VLAs (Really) Work In Open-World Environments [PDF] (cs.RO, cs.AI, score 88)
    Why: Critiques VLA evals for hiding unsafe behaviors; proposes safety-relevant analysis in open-world settings.
    Tags: robotics, VLA, safety-evaluation, open-world, long-horizon, deployment
  • 2604.20711 Participatory provenance as representational auditing for AI-mediated public consultation [PDF] (cs.AI, cs.HC, score 88)
    Why: Audits input fidelity of AI summarization for public consultation via provenance/optimal-transport metrics.
    Tags: auditing, summarization, governance, evaluation, optimal-transport, causal-inference, public-policy
  • 2604.21725 AEL: Agent Evolving Learning for Open-Ended Environments [PDF] (cs.CL, cs.AI, cs.CE, score 88)
    Why: Two-timescale learning for long-horizon LLM agents: adaptive memory retrieval + reflection updates.
    Tags: llm-agents, continual-learning, memory, retrieval-policy, reflection, bandits
  • 2604.19016 AlignCultura: Towards Culturally Aligned Large Language Models? [PDF] (cs.CL, score 86)
    Why: UNESCO-grounded cultural alignment dataset/pipeline for HHH evaluation; useful for safety+fairness audits.
    Tags: cultural-alignment, evaluation, dataset, HHH, fairness, safety
  • 2604.21193 Trust but Verify: Introducing DAVinCI -- A Framework for Dual Attribution and Verification in Claim Inference for Language Models [PDF] (cs.AI, score 86)
    Why: DAVinCI attributes + verifies claims with calibration; targets hallucinations and interpretability in LMs.
    Tags: factuality, hallucinations, verification, attribution, calibration, reliability
  • 2604.21579 A Metamorphic Testing Approach to Diagnosing Memorization in LLM-Based Program Repair [PDF] (cs.SE, cs.AI, score 86)
    Why: Metamorphic testing to expose memorization/data leakage in LLM program-repair evaluations.
    Tags: data-leakage, memorization, evaluation, program-repair, metamorphic-testing, software-engineering
  • 2603.17883 SoK: From Silicon to Netlist and Beyond -- Two Decades of Hardware Reverse Engineering Research [PDF] (cs.CR, score 86)
    Why: Comprehensive SoK on hardware reverse engineering; strong security relevance and reusable overview.
    Tags: security, SoK, hardware, reverse-engineering, supply-chain, verification
  • 2604.12596 KumoRFM-2: Scaling Foundation Models for Relational Learning [PDF] (cs.LG, cs.AI, score 86)
    Why: Foundation model for relational DBs; avoids flattening, supports ICL + finetuning, temporal consistency.
    Tags: foundation-models, relational-learning, in-context-learning, databases, tabular, pretraining
  • 2604.19087 OLLM: Options-based Large Language Models [PDF] (cs.AI, score 86)
    Why: Latent “options” for next-token prediction; controllable diversity/search with minimal params on pretrained LLMs.
    Tags: LLM, latent-variable, decoding, reasoning, controllability, efficient-adaptation
  • 2604.21232 ReCAPA: Hierarchical Predictive Correction to Mitigate Cascading Failures [PDF] (cs.AI, score 86)
    Why: Hierarchical predictive correction for VLA agents to prevent cascading multi-step failures.
    Tags: agents, VLA, planning, robustness, error-correction, multimodal
  • 2604.18356 ComPASS: Towards Personalized Agentic Social Support via Tool-Augmented Companionship [PDF] (cs.CL, score 86)
    Why: Tool-augmented companionship + new benchmark for personalized social support; relevant to agent eval/safety.
    Tags: agents, tool-use, evaluation, benchmarks, personalization, social-support
  • 2604.20511 CHASM: Unveiling Covert Advertisements on Chinese Social Media [PDF] (cs.LG, cs.AI, cs.CL, cs.CV, cs.CY, score 85)
    Why: CHASM dataset for multimodal covert-ad detection; concrete, safety-adjacent eval data from a real platform.
    Tags: datasets, evaluation, multimodal, content-moderation, security, adversarial
  • 2604.21917 CrossCommitVuln-Bench: A Dataset of Multi-Commit Python Vulnerabilities Invisible to Per-Commit Static Analysis [PDF] (cs.CR, cs.SE, score 84)
    Why: Benchmark of multi-commit CVEs that evade per-commit SAST; strong for secure code/agent tooling evals.
    Tags: security, benchmark, vulnerabilities, SAST, software-engineering, datasets
  • 2604.17816 Privacy-Preserving Product-Quantized Approximate Nearest Neighbor Search Framework for Large-scale Datasets via A Hybrid of Fully Homomorphic Encryption and Trusted Execution Environment [PDF] (cs.CR, score 84)
    Why: Privacy-preserving ANN for embeddings using FHE + TEE; relevant to secure RAG/vector DBs.
    Tags: privacy, security, ANN, vector-search, FHE, TEE, RAG
  • 2604.19172 Reasoning-Aware AIGC Detection via Alignment and Reinforcement [PDF] (cs.AI, score 84)
    Why: New multi-domain AIGC detection dataset + reasoning-chain detector trained with RL for robustness.
    Tags: AIGC-detection, datasets, robustness, reasoning, RL, misinformation, evaluation
  • 2604.21396 VG-CoT: Towards Trustworthy Visual Reasoning via Grounded Chain-of-Thought [PDF] (cs.CV, cs.AI, score 84)
    Why: Automated dataset linking each visual reasoning step to image regions for trustworthy LVLM eval.
    Tags: vision-language, grounding, chain-of-thought, trustworthiness, dataset, evaluation
  • 2604.11529 TempusBench: An Evaluation Framework for Time-Series Forecasting [PDF] (cs.LG, score 84)
    Why: Much-needed TS foundation-model eval framework; tackles dataset leakage/metadata issues and standardization.
    Tags: evaluation, benchmarks, time-series, foundation-models, data-contamination, forecasting
  • 2604.18459 Progressive Online Video Understanding with Evidence-Aligned Timing and Transparent Decisions [PDF] (cs.CV, cs.AI, score 84)
    Why: Streaming video agent: evidence-aligned response timing + transparent decision making under compute limits.
    Tags: video-LLM, online-inference, agent, transparency, evaluation, multimodal
  • 2604.05966 FinReporting: An Agentic Workflow for Localized Reporting of Cross-Jurisdiction Financial Disclosures [PDF] (cs.CL, score 84)
    Why: Auditable agentic workflow with ontology mapping + anomaly logging for verified financial reporting.
    Tags: agents, auditing, verification, ontology, information-extraction, LLM-workflows
  • 2604.21916 MathDuels: Evaluating LLMs as Problem Posers and Solvers [PDF] (cs.CL, cs.SE, score 84)
    Why: Self-play math benchmark where models both pose and solve; better capability separation than static tests.
    Tags: evaluation, benchmarks, math, self-play, adversarial-testing, LLMs
  • 2604.08352 Security Concerns in Generative AI Coding Assistants: Insights from Online Discussions on GitHub Copilot [PDF] (cs.SE, cs.CR, cs.HC, score 82)
    Why: Empirical security concerns for GenAI coding assistants drawn from GitHub Copilot discussions.
    Tags: LLM-security, coding-assistants, developer-practice, secure-software, human-factors
  • 2604.21416 CSC: Turning the Adversary's Poison against Itself [PDF] (cs.CR, cs.AI, score 82)
    Why: Backdoor defense using latent-space cluster dynamics; targets poisoning without heavy utility loss.
    Tags: backdoors, data-poisoning, robustness, defense, security, representation-learning
  • 2604.18543 ClawEnvKit: Automatic Environment Generation for Claw-Like Agents [PDF] (cs.AI, cs.CL, score 82)
    Why: Auto-generates and validates agent environments from NL; scalable eval/training infra for agents.
    Tags: agents, environment-generation, evaluation, benchmarks, validation, tool-interfaces
  • 2604.19090 Dual-Guard: Dual-Channel Latent Watermarking for Provenance and Tamper Localization in Diffusion Images [PDF] (cs.CR, score 82)
    Why: Dual-channel watermarking for diffusion provenance + tamper localization; practical integrity angle.
    Tags: provenance, watermarking, diffusion, content-integrity, tamper-detection, forensics

AI Paper Insight Brief

2026-04-27

0) Executive takeaways (read this first)

  • Evaluation is becoming the bottleneck—and papers are responding with “auditability-first” frameworks: multiple works introduce benchmarks + protocols that explicitly target leakage, hallucinated evaluation, or missing safety signals (TempusBench, HiRAS/P2C-Ex, TSAG, OptiVerse/DVA-Agent, MathDuels, CHASM, B1K safety Q-scores).
  • Agent progress is shifting from “more tools” to “better control of when/why to use tools/memory”: proactive retrieval as an explicit action with RL supervision (PROACTAGENT) and bandit-selected retrieval policies + reflection (AEL) both show large gains and strong ablation evidence that using experience is the key lever.
  • Security research is emphasizing “system-level gaps” over isolated attacks: AV perception attacks are under-studied at the fusion layer (AV SoK + cross-modal spoofing PoC), and software security datasets highlight blind spots in common pipelines (CrossCommitVuln-Bench shows per-commit SAST misses multi-commit chains; Copilot discourse surfaces licensing/provenance and insecure-suggestion concerns).
  • Provenance and integrity are converging on practical, deployable signals: diffusion-image watermarking moves beyond global provenance to tamper localization via dual latent channels (Dual-Guard), while public-consultation summarization is audited via input representational fidelity metrics (participatory provenance).
  • Privacy-preserving retrieval is getting closer to interactive scale: a hybrid FHE+TEE + PQ design reports >50 QPS sequential encrypted ANN at Recall@10 > 0.9 on million-scale datasets (PPPQ-ANN), but still leaves access-pattern leakage open.
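
The product-quantization layer such hybrids build on is standard; below is a minimal plaintext sketch of PQ encoding plus asymmetric-distance search (NumPy only, toy sizes; nothing here is the paper's encrypted pipeline, and the tiny k-means is a stand-in for a tuned trainer):

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, D = 4, 16, 32           # subspaces, centroids per subspace, vector dim
sub = D // M

def train_codebook(X, K, iters=10):
    """Tiny Lloyd's k-means; real systems use a tuned implementation."""
    C = X[rng.choice(len(X), K, replace=False)]
    for _ in range(iters):
        assign = ((X[:, None, :] - C[None]) ** 2).sum(-1).argmin(1)
        for k in range(K):
            if (assign == k).any():
                C[k] = X[assign == k].mean(0)
    return C

db = rng.normal(size=(1000, D))
codebooks = np.stack([train_codebook(db.reshape(-1, M, sub)[:, m], K)
                      for m in range(M)])

def pq_encode(x):
    """Compress a vector to M one-byte centroid indices."""
    parts = x.reshape(M, sub)
    return np.array([((codebooks[m] - parts[m]) ** 2).sum(1).argmin()
                     for m in range(M)], dtype=np.uint8)

def adc_distance(query, code):
    """Asymmetric distance: exact query subvectors vs quantized codes."""
    parts = query.reshape(M, sub)
    return sum(((codebooks[m, code[m]] - parts[m]) ** 2).sum()
               for m in range(M))

codes = np.array([pq_encode(v) for v in db])
q = db[42] + 0.01 * rng.normal(size=D)        # query near item 42
dists = np.array([adc_distance(q, c) for c in codes])
print(42 in np.argsort(dists)[:10])           # → True
```

A deployed FHE+TEE variant would evaluate the per-subspace distance lookups under encryption; the sketch only illustrates the compression/accuracy trade that makes encrypted ANN tractable at all.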

2) Key themes (clusters)

Theme: Benchmarks & evaluation that resist leakage, hallucinated scoring, and missing safety signals

Theme: Proactive memory/retrieval as a learned action in lifelong agents

  • Why it matters: Long-horizon agents fail from context overload or missing key experience; learning when to retrieve can improve both success and efficiency.
  • Representative papers:
    Ask Only When Needed (PROACTAGENT/PROACTRL); AEL: Agent Evolving Learning for Open-Ended Environments.
  • Common approach:
    • Treat retrieval as explicit control rather than passive RAG (PROACTAGENT adds retrieval to the action space).
    • Provide step-level supervision for retrieval via counterfactual comparisons (paired-branch rollouts with/without retrieval in PROACTRL).
    • Maintain typed memory/skills (facts, episodes, success/failure skills; AEL’s episodic→semantic→procedural tiers).
    • Use lightweight online adaptation (AEL uses Thompson Sampling bandit over retrieval policies).
  • Open questions / failure modes:
    • PROACTRL assumes prefix replayability for paired rollouts; may break in stochastic/non-replayable environments.
    • Memory growth and eviction policies are underdeveloped (PROACTAGENT notes no capacity cap / learned eviction).
    • Credit assignment across modules remains hard (AEL finds more complex credit methods degrade performance in noisy regimes).
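
The "lightweight online adaptation" bullet above can be made concrete; here is a hypothetical sketch of a Thompson Sampling bandit choosing among retrieval policies (the policy names, Bernoulli reward model, and rates are illustrative assumptions, not AEL's code):

```python
import random

random.seed(0)

class RetrievalBandit:
    """Thompson Sampling over a discrete set of retrieval policies."""
    def __init__(self, policies):
        # Beta(1, 1) prior per policy, stored as [successes+1, failures+1].
        self.stats = {p: [1, 1] for p in policies}

    def choose(self):
        # Sample a plausible success rate per policy; act on the best draw.
        draws = {p: random.betavariate(a, b) for p, (a, b) in self.stats.items()}
        return max(draws, key=draws.get)

    def update(self, policy, success):
        self.stats[policy][0 if success else 1] += 1

bandit = RetrievalBandit(["episodic", "semantic", "procedural", "no_retrieval"])
true_rate = {"episodic": 0.7, "semantic": 0.5, "procedural": 0.4, "no_retrieval": 0.2}

for _ in range(2000):
    policy = bandit.choose()
    bandit.update(policy, random.random() < true_rate[policy])

pulls = {p: sum(v) - 2 for p, v in bandit.stats.items()}
print(max(pulls, key=pulls.get))
```

The appeal in this setting is exactly the "lightweight" part: no gradients, a handful of counters per policy, and exploration that decays automatically as evidence accumulates.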

Theme: Agent environment generation & scalable agent evaluation

  • Why it matters: Manual environment/benchmark creation doesn’t scale and becomes stale; auto-generation enables continuous, leakage-resistant evaluation.
  • Representative papers:
    ClawEnvKit (Auto-ClawEval); MathDuels.
  • Common approach:
    • Generate tasks from natural language specs with validation loops (ClawEnvKit Parser/Generator/Validator → executable sandbox tasks).
    • Use self-play / co-evolving difficulty to avoid benchmark ceilings (MathDuels: models author + solve; Rasch/IRT ranking).
    • Prefer deterministic checks where possible; cap LLM-judge influence (ClawEnvKit uses 15 deterministic checks + capped llm_judge weight).
  • Open questions / failure modes:
    • Sim-to-real gaps: mock services omit auth/rate limits/schema drift (ClawEnvKit limitation).
    • Self-play still yields many non-discriminative items (MathDuels reports ~39% solved by all non-authors).
    • Verifier dependence: automated verification still relies on model backbones in edge cases (MathDuels).
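
The "deterministic checks + capped llm_judge weight" design is easy to sketch; the combiner below is a toy illustration of the idea (the 0.2 cap, check count, and function names are my assumptions, not ClawEnvKit's implementation):

```python
def task_score(deterministic_results, judge_score, judge_cap=0.2):
    """Combine pass/fail deterministic checks with an LLM-judge score
    whose influence is capped so it can never dominate the verdict.

    deterministic_results: list of booleans from programmatic checks.
    judge_score: float in [0, 1] from an LLM judge.
    judge_cap: maximum share of the final score the judge controls.
    """
    if not 0.0 <= judge_score <= 1.0:
        raise ValueError("judge_score must be in [0, 1]")
    det = sum(deterministic_results) / len(deterministic_results)
    return (1 - judge_cap) * det + judge_cap * judge_score

# Even a maximally generous judge cannot rescue failing checks:
print(task_score([False] * 15, judge_score=1.0))   # → 0.2
# And a hostile judge cannot sink a fully passing task below 0.8:
print(task_score([True] * 15, judge_score=0.0))    # → 0.8
```

The design choice worth copying is the bound itself: whatever the judge hallucinates, its contribution to the score is provably limited.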

Theme: Provenance, integrity, and representational auditing

  • Why it matters: Trust in AI outputs increasingly requires traceability—either to detect tampering (media) or to ensure summaries don’t erase minority views (governance).
  • Representative papers:
    Dual-Guard; Participatory provenance as representational auditing for AI-mediated public consultation.
  • Common approach:
    • Move beyond binary provenance to localized evidence (Dual-Guard produces block heatmaps via fused latent evidence).
    • Audit transformations from inputs→outputs with distributional metrics (participatory provenance uses per-participant coverage + Wasserstein-2 gap).
    • Provide operational tooling (Co-creation Provenance Lab; platform-side “Full mode” watermark verification).
  • Open questions / failure modes:
    • Dual-Guard depends on owner-side round-trip reference latents; adaptive white-box attacks are out of scope.
    • Embedding dependence and proxy nature of coverage metrics can affect individual-level conclusions (participatory provenance reports rank correlation variability).
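
The coverage and Wasserstein-2 metrics named above can be sketched in a few lines; everything here is a toy stand-in (random embeddings, a hand-made "summary"), assuming coverage is the best cosine match per participant and W2 is computed via empirical quantile coupling:

```python
import numpy as np

rng = np.random.default_rng(1)

def coverage(participant_embs, summary_embs):
    """Per-participant coverage: best cosine similarity between a
    participant's comment embedding and any summary-sentence embedding."""
    P = participant_embs / np.linalg.norm(participant_embs, axis=1, keepdims=True)
    S = summary_embs / np.linalg.norm(summary_embs, axis=1, keepdims=True)
    return (P @ S.T).max(axis=1)

def w2_gap(cov_a, cov_b):
    """Empirical Wasserstein-2 distance between two equal-sized 1-D
    coverage distributions, via the sorted (quantile) coupling."""
    return float(np.sqrt(np.mean((np.sort(cov_a) - np.sort(cov_b)) ** 2)))

# Toy data: 100 participants; a narrow summary that echoes 5 voices
# vs a baseline built from 20 randomly sampled participant comments.
participants = rng.normal(size=(100, 16))
summary = participants[:5] + 0.1 * rng.normal(size=(5, 16))
baseline = participants[rng.choice(100, 20, replace=False)]

cov_s = coverage(participants, summary)
cov_b = coverage(participants, baseline)
print(cov_b.mean() > cov_s.mean())   # broader baseline → higher mean coverage
print(round(w2_gap(cov_s, cov_b), 3))
```

This mirrors the paper's headline comparison shape (official summary vs random baseline on coverage); the real audit additionally uses causal attribution and concept fidelity, which are not sketched here.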

Theme: Security & privacy gaps in modern AI pipelines (hardware, AV fusion, code, retrieval)

3) Technical synthesis

  • Multiple papers converge on “auditable decomposition”: break systems into stages with explicit status labels/logs (FinReporting’s OK/MISSING/PARSE_ERROR; ClawEnvKit audit logs; participatory provenance metrics; DAVinCI attribution+verification).
  • Counterfactual evaluation is emerging as a core tool: PROACTRL’s paired rollouts (retrieve vs not) mirrors OptiVerse’s dual-view (text→math vs code→math) and HiRAS’ focus on execution vs code-only scoring gaps.
  • Benchmarks increasingly include hard negatives that look like positives (CHASM includes product-sharing non-ads; CrossCommitVuln-Bench requires individually benign commits; AIGC-text-bank includes AI-Polish).
  • There’s a clear trend toward measuring what classic metrics miss:
    • Timing deviation and transparency in streaming video (Thinking-QwenVL).
    • Safety violations and non-target-object interactions in embodied tasks (sQ/seQ for B1K).
    • Representational exclusion in summarization (coverage + W2 + concept recall/precision).
  • Several works show that execution/environment is the dominant failure mode once code/text is plausible (HiRAS: import/env issues can drop scores dramatically).
  • “Leakage-aware” evaluation appears in multiple domains: time-series (TempusBench), APR (metamorphic testing + NLL), and optimization (OptiVerse filtering web-accessible solutions + strict numeric verification).
  • Tool-using systems are moving toward bounded decision spaces to reduce unsafe autonomy (FinReporting’s KEEP/REPAIR/NEED_REVIEW; ClawEnvKit’s deterministic checks + capped judge weight).
  • Security SoKs highlight a meta-problem shared with ML: artifact scarcity and fragility (HRE reproducibility 4%) parallels concerns in agent benchmarks about staleness/leakage and in evaluation about hallucinated judges.
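
The "auditable decomposition" pattern from the first bullet (FinReporting's OK/MISSING/PARSE_ERROR labels are the cue) reduces to stages that record a status instead of silently guessing; the stage name, record schema, and field names below are illustrative, not the paper's:

```python
from dataclasses import dataclass, field
from enum import Enum

class Status(Enum):
    OK = "OK"
    MISSING = "MISSING"
    PARSE_ERROR = "PARSE_ERROR"

@dataclass
class AuditLog:
    entries: list = field(default_factory=list)

    def record(self, stage, item, status):
        self.entries.append((stage, item, status))

def extract_field(doc: dict, key: str, log: AuditLog):
    """One pipeline stage: pull a numeric field, labelling every outcome
    so downstream review sees exactly where the pipeline degraded."""
    if key not in doc:
        log.record("extract", key, Status.MISSING)
        return None
    try:
        value = float(doc[key])
    except (TypeError, ValueError):
        log.record("extract", key, Status.PARSE_ERROR)
        return None
    log.record("extract", key, Status.OK)
    return value

log = AuditLog()
doc = {"revenue": "120.5", "liabilities": "n/a"}
for key in ("revenue", "liabilities", "assets"):
    extract_field(doc, key, log)

print([s.value for _, _, s in log.entries])   # → ['OK', 'PARSE_ERROR', 'MISSING']
```

The same shape generalizes to the other systems listed: each stage emits a typed verdict plus a log entry, which is what makes the overall run auditable rather than merely observable.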

4) Top 5 papers (with “why now”)

1) ClawEnvKit: Automatic Environment Generation for Claw-Like Agents

  • Automates environment creation from NL specs into executable tasks E=(P,M,C) with validation/regeneration loops.
  • Releases Auto-ClawEval (1,040 tasks) and shows no model saturates completion (34%–76%), making it useful for frontier tracking.
  • Demonstrates that the harness matters: structured harnesses beat ReAct by up to 15.7 points.
  • Skepticism: mock services may not transfer to real APIs (auth, schema drift, rate limits).

2) Ask Only When Needed: Proactive Retrieval from Memory and Skills for Experience-Driven Lifelong Agents

  • Makes retrieval a first-class action and trains it with paired-branch rollouts to get step-level supervision.
  • Large ablation signal: removing PROACTRL drops SciWorld SR 73.50% → 26.50%.
  • Shows efficiency gains (fewer rounds/tokens) alongside success improvements.
  • Skepticism: relies on replayability assumptions; memory scaling/eviction not solved.
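
The paired-branch idea can be sketched end to end: from the same replayed prefix, roll out once with retrieval and once without, and label the retrieval action by the return difference. The toy environment, reward shape, and margin below are my assumptions; only the branching protocol follows the summary above:

```python
import random

def rollout(env_seed, prefix, retrieve):
    """Toy rollout; reseeding makes the prefix replayable, which is
    exactly the assumption the skepticism bullet flags."""
    random.seed(env_seed)
    base = 0.3 + 0.1 * len(prefix)
    bonus = 0.4 if retrieve else 0.0   # retrieval helps in this toy env
    return min(1.0, base + bonus + random.uniform(-0.05, 0.05))

def retrieval_label(env_seed, prefix, margin=0.05):
    """+1 if retrieving beats not retrieving by `margin`, -1 if it is
    worse by `margin`, else 0 (no training signal for that step)."""
    delta = rollout(env_seed, prefix, True) - rollout(env_seed, prefix, False)
    if delta > margin:
        return 1
    if delta < -margin:
        return -1
    return 0

labels = [retrieval_label(seed, prefix=["obs"] * 3) for seed in range(20)]
print(sum(l == 1 for l in labels))   # → 20: retrieval is consistently credited
```

Because both branches share the seed, the noise cancels and the label isolates the causal effect of the retrieval action, which is the point of the counterfactual design; in stochastic, non-replayable environments this cancellation is lost.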

3) Participatory provenance as representational auditing for AI-mediated public consultation

  • Introduces concrete metrics (coverage, W2 gap, AIPW causal attribution, concept fidelity) to audit input→summary representational fidelity.
  • Empirical finding: official summaries underperform random baselines on coverage and exclude ~15–17% of participants, concentrated in dissent clusters.
  • Provides an operational tool (Co-creation Provenance Lab) for audit-and-revision workflows.
  • Skepticism: embedding-based proxies and embedding choice sensitivity can affect individual-level rankings.

4) Dual-Guard: Dual-Channel Latent Watermarking for Provenance and Tamper Localization in Diffusion Images

  • Combines a robust global provenance anchor (GS in z_T) with a learned spatial codec (in z0) to localize tampering.
  • Reports near-perfect closed-set provenance discrimination and ≥99.9% detection across multiple edit/tamper scenarios; provides coarse localization (16×16 blocks).
  • Clear complementarity: GS-only misses local edits; dual-channel fixes this.
  • Skepticism: depends on owner-side round-trip reference latents; adaptive white-box attacks not evaluated.
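
The localization half of such schemes reduces to comparing a reference latent against the extracted one block by block; a toy 16×16-block heatmap in plain NumPy (images stand in for latents, and the 3-sigma threshold is my choice, not Dual-Guard's codec):

```python
import numpy as np

rng = np.random.default_rng(7)

def block_heatmap(reference, probe, block=16):
    """Mean absolute difference per (block x block) tile, as a coarse
    tamper-localization heatmap."""
    h, w = reference.shape
    diff = np.abs(reference - probe)
    return diff.reshape(h // block, block, w // block, block).mean(axis=(1, 3))

clean = rng.normal(size=(256, 256))
tampered = clean.copy()
tampered[64:96, 128:160] += rng.normal(scale=2.0, size=(32, 32))  # local edit

heat = block_heatmap(clean, tampered)
flagged = np.argwhere(heat > heat.mean() + 3 * heat.std())
print(sorted(map(tuple, flagged.tolist())))   # → [(4, 8), (4, 9), (5, 8), (5, 9)]
```

This also makes the skepticism bullet concrete: the scheme needs a trustworthy owner-side reference to diff against, and an adaptive attacker who can shape the residual can in principle evade a fixed threshold.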

5) OptiVerse: A Comprehensive Benchmark towards Optimization Problem Solving

  • Broadens optimization evaluation to six domains and reveals substantial headroom (top models at ~25–27% on Hard).
  • Identifies dominant failure mode as semantic modeling/logic errors (code runs but is wrong).
  • DVA-Agent’s dual-view auditing improves Hard accuracy (e.g., 16.67% → 24.33% for Qwen3-235B-Instruct) while triggering edits in only ~23–32% of cases.
  • Skepticism: textbook/exam distribution may not reflect messy industrial optimization; evaluation pipeline is compute-heavy.

5) Practical next steps

  • For agent builders: implement retrieval-as-action and train it with counterfactual rollouts (retrieve vs not) to get step-level supervision; track both success and efficiency (rounds/tokens).
  • For benchmark owners: add metamorphic variants (semantics-preserving transforms) and report robustness deltas; pair with familiarity proxies where available (e.g., NLL for open models).
  • For evaluation pipelines: adopt dual-view auditing for semantic correctness (spec→formalization vs code→formalization) in any domain where “runs” ≠ “correct” (optimization, data pipelines, ETL, policy rules).
  • For safety in embodied/VLA: extend metrics beyond terminal success—log handling/placement violations and non-target interactions (sQ/seQ-style) and require reporting of trial-to-trial variance.
  • For provenance/integrity: if you operate a platform, consider closed-set verification designs (registered artifacts) and add localization signals, not just binary provenance.
  • For privacy-preserving retrieval: prototype hybrid designs (PQ + cryptography + enclave) but explicitly measure what remains leaked (e.g., access patterns) and document threat model boundaries.
  • For security tooling: evaluate SAST/CI on cross-commit chains (not just snapshots) and add history-aware checks; use datasets like CrossCommitVuln-Bench to quantify gaps.
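
The "metamorphic variants" bullet is the easiest of these to prototype. Below, a semantics-preserving identifier rename built on Python's `ast` module; the renamer is generic machinery, while the probe protocol (compare a model's repairs on the original vs the variant to expose memorization) is the paper's idea as summarized above:

```python
import ast

class Renamer(ast.NodeTransformer):
    """Semantics-preserving identifier renaming: one metamorphic transform."""
    def __init__(self, mapping):
        self.mapping = mapping

    def visit_FunctionDef(self, node):
        node.name = self.mapping.get(node.name, node.name)
        self.generic_visit(node)
        return node

    def visit_arg(self, node):
        node.arg = self.mapping.get(node.arg, node.arg)
        return node

    def visit_Name(self, node):
        node.id = self.mapping.get(node.id, node.id)
        return node

def rename(src, mapping):
    return ast.unparse(Renamer(mapping).visit(ast.parse(src)))

buggy = "def mid(a, b):\n    return a + b // 2\n"   # bug: missing parentheses
variant = rename(buggy, {"mid": "midpoint", "a": "lo", "b": "hi"})

ns1, ns2 = {}, {}
exec(buggy, ns1)
exec(variant, ns2)
# The transform preserves the (buggy) semantics exactly:
print(ns1["mid"](2, 10) == ns2["midpoint"](2, 10))   # → True
```

In a repair eval, both snippets should elicit behaviorally equivalent fixes; a large accuracy delta between them is the robustness gap (and, for open models, a familiarity proxy such as NLL on the original can corroborate memorization).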

Generated from per-paper analyses; no external browsing.