Daily AI Paper Report (2026-05-11)

Chinese version: [中文]

Run stats

  • Candidates: 5420
  • Selected: 30
  • Deepread completed: 30
  • Window (UTC): 2026-05-08T00:00:00Z → 2026-05-09T00:00:00Z (weekend_backlog_sat, expanded=0)
Selected papers (arXiv ID, title, categories, score, why, tags)

  • 2605.02236 · Perturbation Dose Responses in Recursive LLM Loops: Raw Switching, Stochastic Floors, and Persistent Escape under Append, Replace, and Dialog Updates [PDF]
    Categories: cs.AI, cs.CL, cs.LG · Score: 90
    Why: Studies persistence/escape in recursive LLM loops; relevant to agent stability and prompt-induced drift.
    Tags: llm, agents, safety, robustness, evaluation

  • 2604.19734 · UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling [PDF]
    Categories: cs.RO, cs.AI · Score: 88
    Why: Humanoid foundation-model direction: unified latent action language for human-to-robot transfer.
    Tags: robotics, foundation-models, world-models, policy-learning, transfer-learning

  • 2605.02372 · Privacy Preserving Machine Learning Workflow: from Anonymization to Personalized Differential Privacy Budgets in Federated Learning [PDF]
    Categories: cs.CR, cs.AI · Score: 88
    Why: Privacy-preserving FL workflow with poisoning detection and personalized DP budgets.
    Tags: privacy, federated-learning, differential-privacy, poisoning, security

  • 2605.03426 · Replacing Parameters with Preferences: Federated Alignment of Heterogeneous Vision-Language Models [PDF]
    Categories: cs.AI · Score: 88
    Why: Federated preference-based alignment for heterogeneous VLMs; strong privacy/alignment relevance.
    Tags: federated-learning, alignment, VLM, preference-modeling, privacy

  • 2603.15506 · Seeking SOTA: Time-Series Forecasting Must Adopt Taxonomy-Specific Evaluation to Dispel Illusory Gains [PDF]
    Categories: cs.LG, cs.AI · Score: 88
    Why: Calls out misleading TSF benchmarks; strong evaluation critique with broad ML relevance.
    Tags: evaluation, benchmarking, time-series, methodology, robustness

  • 2605.02351 · MolViBench: Evaluating LLMs on Molecular Vibe Coding [PDF]
    Categories: cs.CL · Score: 87
    Why: New benchmark for LLM molecular code generation; useful eval for domain agents and executable reasoning.
    Tags: llm, benchmark, code-generation, agents, evaluation

  • 2605.02669 · An explainable hypothesis-driven approach to Drug-Induced Liver Injury with HADES [PDF]
    Categories: cs.AI · Score: 86
    Why: Agentic, auditable biomedical reasoning plus benchmark for a high-stakes domain.
    Tags: agents, llm, safety-critical, benchmark, explainability, biomed

  • 2605.03941 · A Benchmark for Interactive World Models with a Unified Action Generation Framework [PDF]
    Categories: cs.CV, cs.AI · Score: 86
    Why: Large benchmark for interactive world models with unified action evaluation; reusable for agent capability testing.
    Tags: world-models, benchmark, agents, evaluation, multimodal

  • 2605.02110 · Adversarial Update-Based Federated Unlearning for Poisoned Model Recovery [PDF]
    Categories: cs.LG, cs.CR · Score: 86
    Why: Targets poisoned federated models with efficient unlearning/recovery; concrete security relevance.
    Tags: federated-learning, security, unlearning, poisoning, robustness

  • 2604.24001 · CT-FineBench: A Diagnostic Fidelity Benchmark for Fine-Grained Evaluation of CT Report Generation [PDF]
    Categories: cs.AI · Score: 86
    Why: Fine-grained factuality benchmark for CT report generation; strong eval utility and reuse potential.
    Tags: evaluation, benchmark, factuality, medical-ai, report-generation

  • 2605.04491 · An Evaluation of Chat Safety Moderations in Roblox [PDF]
    Categories: cs.CY, cs.CR · Score: 85
    Why: Large-scale independent evaluation of chat moderation on a child-heavy platform; concrete safety relevance.
    Tags: safety, moderation, evaluation, platforms, cybersecurity

  • 2605.03821 · RoboAlign-R1: Distilled Multimodal Reward Alignment for Robot Video World Models [PDF]
    Categories: cs.RO, cs.AI · Score: 85
    Why: Reward-aligned robot world models plus new benchmark/judge; relevant to alignment of embodied generative models.
    Tags: alignment, robotics, world-models, reward-modeling, benchmark

  • 2605.05045 · When Relations Break: Analyzing Relation Hallucination in Vision-Language Model Under Rotation and Noise [PDF]
    Categories: cs.CV, cs.CL · Score: 85
    Why: Targets VLM relation hallucination under perturbations; useful robustness evaluation.
    Tags: vlm, hallucination, robustness, evaluation, multimodal

  • 2603.22219 · Noise Titration: Exact Distributional Benchmarking for Probabilistic Time Series Forecasting [PDF]
    Categories: cs.LG, stat.ML · Score: 85
    Why: Exact statistical benchmark for probabilistic forecasting; reusable eval framework.
    Tags: evaluation, benchmark, probabilistic-modeling, time-series, robustness

  • 2605.02374 · Fight Poison with Poison: Enhancing Robustness in Few-shot Machine-Generated Text Detection with Adversarial Training [PDF]
    Categories: cs.CR, cs.CL · Score: 84
    Why: Adversarial training for robust machine-generated text detection; concrete black-box threat model.
    Tags: llm-security, adversarial-training, text-detection, evaluation, robustness

  • 2604.11734 · Multi-ORFT: Stable Online Reinforcement Fine-Tuning for Multi-Agent Diffusion Planning in Cooperative Driving [PDF]
    Categories: cs.RO, cs.AI · Score: 84
    Why: Online RL post-training for multi-agent diffusion driving planners with explicit safety/efficiency aims.
    Tags: reinforcement-learning, autonomous-driving, multi-agent, diffusion, safety

  • 2604.20719 · ONOTE: Benchmarking Omnimodal Notation Processing for Expert-level Music Intelligence [PDF]
    Categories: cs.SD, cs.AI, cs.MM, eess.AS · Score: 84
    Why: Benchmark targets omnimodal reasoning and explicitly critiques hallucination-prone LLM-as-judge evals.
    Tags: benchmark, multimodal, evaluation, hallucinations, reasoning

  • 2605.03544 · DALPHIN: Benchmarking Digital Pathology AI Copilots Against Pathologists on an Open Multicentric Dataset [PDF]
    Categories: cs.CV, cs.AI · Score: 84
    Why: Open multicentric benchmark comparing pathology copilots to experts; strong real-world LLM/VLM evaluation value.
    Tags: benchmark, multimodal, medical-ai, evaluation, copilots

  • 2604.10996 · When Valid Signals Fail: Regime Boundaries Between LLM Features and RL Trading Policies [PDF]
    Categories: cs.CL, cs.AI, cs.CE · Score: 84
    Why: LLM-generated features help RL trading only in some regimes; useful reliability lesson with concrete IC results.
    Tags: llm, rl, reliability, evaluation, representation

  • 2605.03986 · From Intent to Execution: Composing Agentic Workflows with Agent Recommendation [PDF]
    Categories: cs.AI · Score: 84
    Why: Automates multi-agent workflow composition and agent recommendation; useful agentic systems infra.
    Tags: agents, multi-agent, workflow, orchestration, LLM

  • 2604.19724 · Benign Overfitting in Adversarial Training for Vision Transformers [PDF]
    Categories: cs.LG, cs.AI · Score: 84
    Why: Theoretical analysis of adversarial training in ViTs; robustness results could inform secure model design.
    Tags: adversarial-robustness, vision-transformers, theory, security, generalization

  • 2605.05121 · Beyond Semantics: An Evidential Reasoning-Aware Multi-View Learning Framework for Trustworthy Mental Health Prediction [PDF]
    Categories: cs.CL · Score: 83
    Why: Targets trustworthy prediction with uncertainty and reasoning-aware views in a high-stakes language setting.
    Tags: trustworthiness, uncertainty, nlp, reliability, evaluation

  • 2604.20382 · Graph2Counsel: Clinically Grounded Synthetic Counseling Dialogue Generation from Client Psychological Graphs [PDF]
    Categories: cs.CL · Score: 82
    Why: LLM data generation for counseling with structured grounding in a high-risk domain.
    Tags: llm, synthetic-data, mental-health, safety-critical, grounding

  • 2603.21597 · A Multidisciplinary AI Board for Multimodal Dementia Characterization and Risk Assessment [PDF]
    Categories: cs.AI, cs.CV · Score: 82
    Why: Interactive multi-agent clinical AI with privacy-preserving deployment and clinician-facing reasoning tools.
    Tags: agents, healthcare, multimodal, privacy, decision-support

  • 2604.20166 · Aligning Human-AI-Interaction Trust for Mental Health Support: Survey and Position for Multi-Stakeholders [PDF]
    Categories: cs.CL, cs.HC · Score: 82
    Why: Trust/safety framework for mental-health AI; strong multi-stakeholder lens on reliability and deployment.
    Tags: AI-safety, trust, mental-health, survey, evaluation

  • 2604.26498 · Do Larger Models Really Win in Drug Discovery? A Benchmark Assessment of Model Scaling in AI-Driven Molecular Property and Activity Prediction [PDF]
    Categories: cs.LG, q-bio.QM · Score: 82
    Why: Useful scaling reality check: larger models often do not win in drug discovery across many endpoints.
    Tags: scaling-laws, benchmark, foundation-models, evaluation, drug-discovery

  • 2603.15185 · What Matters for Scalable and Robust Learning in End-to-End Driving Planners? [PDF]
    Categories: cs.RO, cs.AI, cs.CV · Score: 82
    Why: Systematic study of what actually improves closed-loop end-to-end driving robustness and scalability.
    Tags: autonomy, robustness, evaluation, scaling, planning

  • 2604.25472 · SciEval: A Benchmark for Automatic Evaluation of K-12 Science Instructional Materials [PDF]
    Categories: cs.AI · Score: 82
    Why: New benchmark for LLM-based evaluation of AI-generated science materials with evidence.
    Tags: benchmark, evaluation, llm, education, reliability

  • 2605.03788 · Say the Mission, Execute the Swarm: Agent-Enhanced LLM Reasoning in the Web-of-Drones [PDF]
    Categories: cs.AI, cs.NI, cs.RO · Score: 82
    Why: Grounded LLM agent framework for real-time drone swarms; notable agent execution/safety setting.
    Tags: agents, LLM, robotics, tool-use, cyber-physical-systems

  • 2604.19357 · FairTree: Subgroup Fairness Auditing of Machine Learning Models with Bias-Variance Decomposition [PDF]
    Categories: cs.LG · Score: 82
    Why: Subgroup fairness auditing with bias-variance decomposition; practical auditing tool with broad applicability.
    Tags: fairness, auditing, evaluation, bias, reliability

AI Paper Insight Brief

2026-05-11

0) Executive takeaways (read this first)

  • Evaluation is the dominant theme today: several papers argue current benchmarks overstate progress, then replace them with more falsifiable or fine-grained protocols—taxonomy-aware forecasting evaluation, exact noise-titration for probabilistic TSF, attribute-level CT report scoring, deterministic music-notation evaluation, and multicentric pathology/VLM benchmarking.
  • Robustness failures are increasingly traced to interface design rather than raw model scale: BEV compression improves closed-loop driving, memory/update rules determine recursive-LLM “fragility,” and simple preprocessing only partially fixes VLM relation hallucination under rotation/noise.
  • Post-training is becoming more targeted and modular: diffusion planners get online RL with variance-gated optimization, robot world models get distilled multimodal reward alignment plus inference-time re-encoding, and federated VLM alignment shifts from parameter sharing to reward-routing.
  • Bigger models do not reliably win in specialized domains: simple/classical methods remain competitive in time-series forecasting and molecular prediction, while pathology-specific or task-specific systems often outperform general-purpose multimodal models on domain tasks.
  • In high-stakes domains, the strongest papers pair performance gains with workflow-aware interpretability: dementia risk assessment, DILI hypothesis generation, subgroup fairness auditing, and mental-health prediction all emphasize evidence traces, uncertainty, or mechanistic explanations rather than raw scores alone.
  • For agentic systems, the practical lesson is to harden scaffolding, not just the base model: typed tools, guardrails, routing, retrieval, and explicit memory policies repeatedly determine whether systems remain reliable under shift or long-horizon execution.

2) Key themes (clusters)

  • Evaluation is shifting from leaderboard scores to falsifiable diagnostics
  • Closed-loop robustness depends on representation bottlenecks and post-training
  • Agent reliability is mostly a systems problem
  • High-stakes AI is moving toward evidence-bearing, uncertainty-aware outputs
  • Domain-specific benchmarks are exposing where general models fail

3) Technical synthesis

  • A recurring pattern is benchmark redesign around causal structure: known DGPs in forecasting, attribute schemas in radiology, canonical pitch mappings in music, and sequestered answers in pathology all reduce ambiguity in what “correct” means.
  • Several papers show open-loop or feature-level validity does not imply closed-loop utility: driving planners with strong BEV features fail in closed loop, LLM-derived trading features improve IC but not policy robustness, and visually plausible world models remain task-misaligned.
  • Compression/bottlenecking appears as a robustness tool: scene tokenization in driving, shared latent action tokens in humanoid transfer, and lightweight distilled reward models in robot world models all improve scalability while reducing brittle dependence on raw high-dimensional inputs.
  • Post-training is becoming more structured than generic RLHF: VG-GRPO for diffusion planners, GRPO with routed rewards for federated VLMs, and reward-distilled RL for world models all tailor optimization to model class and deployment constraints.
  • Multiple papers emphasize paired or counterfactual evaluation: treatment-vs-control recursive loops, paraphrase-vs-adversarial CT reports, and benchmark splits by taxonomy or chemical similarity all aim to isolate real gains from artifacts.
  • Simple baselines remain surprisingly strong in periodic forecasting and molecular property prediction, reinforcing that benchmark composition and split design can dominate perceived progress.
  • Inference-time fixes matter: orientation correction, denoising, sliding-window re-encoding, helper tools, and guardrails often recover more reliability than prompt tweaks alone.
  • Uncertainty is increasingly operationalized as triage signal, not just calibration score: evidential mental-health prediction, modality-aware dementia fusion, and fairness auditing all aim to identify when humans should inspect or intervene.
  • Agent systems are converging on modular orchestration: routers, recommenders, typed tool gateways, and critique loops repeatedly outperform monolithic “give the model everything” designs.
  • Across safety-relevant domains, the strongest papers combine task-specific structure + human-auditable outputs, suggesting that frontier progress is currently more about system design and evaluation discipline than raw model scaling.
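Several of the points above hinge on paired or control-vs-control evaluation: before claiming a treatment effect, first measure how much the metric moves between identical runs. A minimal sketch of that protocol, with a hypothetical `run_metric(seed=...)` interface and toy names throughout:

```python
import random
import statistics

def stochastic_floor(run_metric, n_runs: int = 20, seed: int = 0):
    """Control-vs-control floor: mean and spread of a metric across
    reruns of the *same* configuration, differing only in random seed."""
    rng = random.Random(seed)
    scores = [run_metric(seed=rng.randrange(10**6)) for _ in range(n_runs)]
    return statistics.mean(scores), statistics.stdev(scores)

def paired_effect(run_treatment, run_control, n_runs: int = 20, seed: int = 0):
    """Treatment-vs-control gap, reported alongside the control-only floor.
    A gap that does not clearly exceed the floor is indistinguishable
    from sampling variance and should not be claimed as an effect."""
    t_mean, _ = stochastic_floor(run_treatment, n_runs, seed)
    c_mean, c_floor = stochastic_floor(run_control, n_runs, seed + 1)
    return t_mean - c_mean, c_floor
```

Usage would be `gap, floor = paired_effect(run_treatment, run_control)`, trusting the effect only when `gap` is a clear multiple of `floor`.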

4) Top 5 papers (with “why now”)

  • What Matters for Scalable and Robust Learning in End-to-End Driving Planners?
    • Shows that high-resolution BEV features can hurt closed-loop driving via causal confusion; a simple tokenizer bottleneck materially improves driving score and success rate.
    • Separates the roles of disentangled outputs and diffusion planning: one reduces static infractions, the other dynamic infractions, and the combination works best.
    • Demonstrates data-scaling advantages for diffusion planners and reports SOTA closed-loop Bench2Drive results plus gains on NAVSIM.
    • Skeptical about: compression may fail in long-range/high-speed scenarios, and diffusion still carries runtime trade-offs.
  • A Multidisciplinary AI Board for Multimodal Dementia Characterization and Risk Assessment
    • Strong example of workflow-aware medical AI: modality agents, propose-and-critique fusion, and a clinician-facing dashboard.
    • Beats single-modality and LLM baselines across prediction, diagnosis, and survival tasks, and improves clinician accuracy in a reader study by +17.5 percentage points.
    • Handles missing modalities gracefully and adds a Dynamic Medical Notebook for iterative correction.
    • Skeptical about: labels are retrospective EHR-derived proxies, and the system still depends on general-purpose LLM reasoning components.
  • Noise Titration: Exact Distributional Benchmarking for Probabilistic Time Series Forecasting
    • Reframes forecasting robustness as an exact statistical problem by controlling the DGP and injected noise, enabling sharper claims than standard historical benchmarks.
    • Introduces a probabilistic Fern model with full Gaussian beliefs and rich calibration diagnostics.
    • Exposes failure modes of zero-shot foundation models and conformal methods under non-stationarity.
    • Skeptical about: evidence is synthetic and Gaussian-noise-based, so real-world transfer remains unproven.
  • RoboAlign-R1: Distilled Multimodal Reward Alignment for Robot Video World Models
    • Practical recipe for aligning robot world models to task-level criteria rather than pixel similarity alone.
    • Distills an 8B multimodal judge into a ~98M reward model fast enough for online RL, then adds sliding-window re-encoding to reduce rollout drift.
    • Reports +10.1% aggregate judge improvement over the strongest baseline and better long-horizon fidelity with minimal runtime overhead.
    • Skeptical about: gains are shown on tabletop manipulation and not yet tied to downstream closed-loop control improvements.
  • DALPHIN: Benchmarking Digital Pathology AI Copilots Against Pathologists on an Open Multicentric Dataset
    • High-value benchmark release: multicentric, pathologist-curated, sequestered evaluation, and direct comparison to 31 human readers.
    • Shows pathology-specific PathChat+ is much closer to expert performance than general-purpose VLMs on several tasks.
    • Useful now because pathology copilots are moving fast and leakage-resistant benchmarking is badly needed.
    • Skeptical about: evaluation uses selected ROIs rather than full WSIs and lacks broader clinical context or ancillary tests.
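The teacher-to-student pattern behind distilled reward models like RoboAlign-R1's can be sketched in miniature. The code below is not that paper's method, only an illustrative pure-Python stand-in: an expensive `teacher` judge labels samples offline, and a tiny linear student is fit to those labels so alignment signals can be queried cheaply online. All names and the linear-model choice are assumptions for illustration.

```python
import random

def distill_reward(teacher, dims: int, n_samples: int = 500,
                   lr: float = 0.1, epochs: int = 300, seed: int = 0):
    """Fit a tiny linear 'student' reward model to scores from an
    expensive 'teacher' judge (queried only offline, on random probes)."""
    rng = random.Random(seed)
    xs = [[rng.uniform(-1, 1) for _ in range(dims)] for _ in range(n_samples)]
    ys = [teacher(x) for x in xs]          # expensive labels, gathered offline
    w = [0.0] * dims
    b = 0.0
    for _ in range(epochs):                # full-batch gradient descent on MSE
        gw = [0.0] * dims
        gb = 0.0
        for x, y in zip(xs, ys):
            err = (sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for i, xi in enumerate(x):
                gw[i] += err * xi
            gb += err
        w = [wi - lr * gi / n_samples for wi, gi in zip(w, gw)]
        b -= lr * gb / n_samples
    # the returned student is cheap enough to call inside an online RL loop
    return lambda x: sum(wi * xi for wi, xi in zip(w, x)) + b
```

The design point is the asymmetry: teacher calls happen once, at dataset-construction time; only the distilled student sits on the hot path.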

5) Practical next steps

  • Audit your evaluation stack for artifact-driven gains: add simple baselines, taxonomy-aware splits, and perturbation tests before trusting leaderboard improvements.
  • For agentic systems, explicitly test memory/update policies (append vs replace vs summarized context) because scaffold mechanics can dominate robustness.
  • In closed-loop planning or control, add representation bottlenecks and compare open-loop vs closed-loop metrics; don’t assume richer latent state helps.
  • If using expensive judges or reward models, try teacher→student distillation so alignment signals can be used online rather than only offline.
  • Add paired-control experiments to robustness work: compare the treatment-vs-control gap against a control-vs-control stochastic floor to separate real effects from sampling variance.
  • For multimodal or medical systems, require outputs to include evidence traces, uncertainty, or mechanism hypotheses that a human can inspect.
  • In federated or privacy-sensitive settings, consider sharing preferences/rewards/routing signals instead of full parameters when clients are heterogeneous.
  • For VLM deployment, benchmark relation reasoning under rotation/noise and test preprocessing pipelines; prompt-only fixes are unlikely to be enough.
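One concrete way to run the memory-policy test suggested above: fold the same message stream into context under append, replace, and a bounded sliding window (a crude stand-in for summarization), then compare downstream metrics per policy. The class below is a hypothetical harness, not any paper's implementation.

```python
from collections import deque

class ContextPolicy:
    """Toy A/B harness for agent memory policies. 'append' keeps
    everything, 'replace' keeps only the latest message, and 'window'
    keeps the last `window` messages (deque handles eviction)."""

    def __init__(self, mode: str, window: int = 4):
        assert mode in {"append", "replace", "window"}
        self.mode = mode
        self.buf = deque(maxlen=window if mode == "window" else None)

    def update(self, message: str) -> None:
        if self.mode == "replace":
            self.buf.clear()          # discard prior context entirely
        self.buf.append(message)

    def context(self) -> str:
        return "\n".join(self.buf)
```

Feeding an identical transcript to one instance per mode and scoring the agent on each resulting context makes scaffold effects directly measurable, separate from base-model quality.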

Generated from per-paper analyses; no external browsing.