Daily AI Paper Report (2026-04-26)

Chinese version: [中文]

Run stats

  • Candidates: 4233
  • Selected: 30
  • Deepread completed: 30
  • Window (UTC): 2026-04-24T00:00:00Z → 2026-04-25T00:00:00Z (weekend_backlog_unknown, expanded=0)
Selected papers
| arXiv ID | Title | Categories | Score | Why | Tags |
| --- | --- | --- | --- | --- | --- |
| 2604.19457 | Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents | cs.AI | 92 | Clear eval framework for long-horizon enterprise agents: factuality, reasoning, compliance, abstention axes | agents, alignment, evaluation, compliance, abstention, enterprise |
| 2604.18292 | Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence | cs.AI, cs.CL | 92 | Self-evolving arena to synthesize verifiable real-world tool envs for training lifelong agents. | agents, environment-generation, tool-use, lifelong-learning, evaluation, MCP |
| 2604.18401 | StepPO: Step-Aligned Policy Optimization for Agentic Reinforcement Learning | cs.CL | 92 | Agentic RL post-training with step-aligned optimization for tool-using LLM agents. | agentic-RL, LLM-agents, post-training, tool-use, credit-assignment, long-horizon |
| 2604.17870 | GraSP: Graph-Structured Skill Compositions for LLM Agents | cs.CL | 92 | Executable skill graphs w/ verification+repair for LLM agents; tackles skill overload/orchestration. | llm-agents, tool-use, planning, skill-composition, verification, reliability |
| 2604.17771 | SPENCE: A Syntactic Probe for Detecting Contamination in NL2SQL Benchmarks | cs.CL, cs.AI, cs.DB | 90 | Practical framework to detect/quantify NL2SQL benchmark contamination; important for trustworthy evals | data-contamination, evaluation, NL2SQL, synthetic-variants, benchmarking, LLM-reliability |
| 2604.19548 | Taming Actor-Observer Asymmetry in Agents via Dialectical Alignment | cs.CL, cs.AI, cs.CY | 90 | Identifies bias in multi-agent reflection/auditing; introduces Ambiguous Failure Benchmark + mitigation. | agents, multi-agent, reliability, evaluation, cognitive-bias, self-critique |
| 2604.17950 | CADMAS-CTX: Contextual Capability Calibration for Multi-Agent Delegation | cs.AI | 90 | Risk-aware contextual capability calibration for delegation; reduces misdelegation via uncertainty. | multi-agent, delegation, calibration, uncertainty, risk-aware, agent-safety |
| 2604.20300 | FSFM: A Biologically-Inspired Framework for Selective Forgetting of Agent Memory | cs.AI | 90 | Selective forgetting for LLM agents targets efficiency + security (forget malicious/sensitive memory). | agents, memory, security, privacy, robustness |
| 2604.17821 | WebUncertainty: Dual-Level Uncertainty Driven Planning and Reasoning For Autonomous Web Agent | cs.AI | 90 | Web agent planning+reasoning with uncertainty-aware adaptive planning and MCTS to curb hallucinations | web-agents, planning, uncertainty, MCTS, hallucinations, agent-reliability |
| 2604.17894 | Automatic Slide Updating with User-Defined Dynamic Templates and Natural Language Instructions | cs.CL | 89 | DynaSlide benchmark + SlideAgent for NL-driven slide edits; strong real-world agentic eval. | benchmarks, agents, tool-use, multimodal, document-ai, evaluation |
| 2604.20211 | Towards Secure Logging: Characterizing and Benchmarking Logging Code Security Issues with LLMs | cs.SE, cs.AI, cs.CR | 88 | Taxonomy + real-world benchmark (101 cases) for insecure logging; enables LLM detection/repair evaluation | security, benchmark, software-engineering, LLM, privacy-leakage, log-injection |
| 2604.19300 | HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language Models | cs.SD, cs.AI | 88 | Large benchmark for hallucination detection in audio-language models; fills key multimodal safety gap | hallucinations, multimodal, audio-language-models, benchmark, evaluation, reliability |
| 2604.18576 | Agentic Forecasting using Sequential Bayesian Updating of Linguistic Beliefs | cs.AI | 88 | Agentic forecasting with Bayesian linguistic belief updates + calibration; SOTA on ForecastBench. | agents, calibration, uncertainty, tool-use, forecasting, evaluation |
| 2604.19502 | Beyond Rating: A Comprehensive Evaluation and Benchmark for AI Reviews | cs.CL | 88 | Benchmark/framework for AI peer reviews beyond ratings; multi-dimension eval + dataset. | evaluation, benchmarks, LLM-reviews, metrics, faithfulness, argumentation |
| 2604.17842 | QuickScope: Certifying Hard Questions in Dynamic LLM Benchmarks | cs.CL | 88 | Method/tool to efficiently find hard cases in dynamic LLM benchmarks; useful for red-teaming. | evaluation, benchmarks, adversarial-testing, bayesian-optimization, robustness |
| 2604.20658 | Cooperative Profiles Predict Multi-Agent LLM Team Performance in AI for Science Workflows | cs.CL | 88 | Cooperation-game profiles predict multi-agent LLM team performance; useful for agent evaluation/safety. | multi-agent, evaluation, cooperation, AI-for-science, benchmarks |
| 2604.17827 | Learning to Seek Help: Dynamic Collaboration Between Small and Large Language Models | cs.CL | 88 | SLM learns when/how to query LLM under privacy/cost constraints; useful for safe, efficient agent designs | small-models, LLM-collaboration, privacy, cost-control, multi-step-reasoning, agent-architectures |
| 2604.19015 | FedProxy: Federated Fine-Tuning of LLMs via Proxy SLMs and Heterogeneity-Aware Fusion | cs.LG, cs.AI | 88 | Federated LLM fine-tuning tackling IP+privacy+heterogeneity via proxy SLM and fusion; practical deployment value | federated-learning, llm-finetuning, privacy, ip-protection, heterogeneous-data, compression, edge |
| 2604.18038 | First, Do No Harm (With LLMs): Mitigating Racial Bias via Agentic Workflows | cs.CY, cs.AI | 86 | Measures and mitigates racial bias in clinical LLM workflows; governance lens + multi-model eval. | safety, bias, healthcare, evaluation, agentic-workflows, governance |
| 2604.19005 | Debating the Unspoken: Role-Anchored Multi-Agent Reasoning for Half-Truth Detection | cs.CL | 86 | Multi-agent debate to detect misleading half-truths under noisy retrieval; strong eval claim. | misinformation, fact-checking, multi-agent, debate, RAG, robustness |
| 2604.20443 | DialToM: A Theory of Mind Benchmark for Forecasting State-Driven Dialogue Trajectories | cs.CL, cs.AI, cs.LG | 86 | DialToM benchmark separates ToM recognition vs using states to forecast dialogue; exposes reasoning gaps. | benchmark, theory-of-mind, dialogue, evaluation, reasoning |
| 2604.19012 | Security Is Relative: Training-Free Vulnerability Detection via Multi-Agent Behavioral Contract Synthesis | cs.CR, cs.SE | 85 | Training-free multi-agent vuln detection via behavioral contract synthesis; addresses dedup collapse issue | cybersecurity, vulnerability-detection, agents, specification-synthesis, robust-evaluation, software-security |
| 2604.18176 | QuantumQA: Enhancing Scientific Reasoning via Physics-Consistent Dataset and Verification-Aware Reinforcement Learning | cs.AI, quant-ph | 85 | Physics-consistent QuantumQA dataset + verification-aware RLVR reward modeling for rigor. | RLVR, verifiable-rewards, scientific-reasoning, dataset, reward-modeling, reliability |
| 2604.20495 | Towards Certified Malware Detection: Provable Guarantees Against Evasion Attacks | cs.CR, cs.LG | 84 | Certified robustness for malware detection via randomized smoothing; formal guarantees vs evasion attacks | cybersecurity, malware, certified-robustness, adversarial, randomized-smoothing |
| 2604.18071 | Architectural Design Decisions in AI Agent Harnesses | cs.AI | 84 | Empirical taxonomy of agent harness architecture decisions across 70 projects; reusable patterns. | agents, systems, orchestration, tooling, safety-controls, survey |
| 2604.20755 | V-tableR1: Process-Supervised Multimodal Table Reasoning with Critic-Guided Policy Optimization | cs.AI, cs.LG | 84 | Process-supervised RL for multimodal table reasoning with step-level critic feedback. | multimodal, process-supervision, RL, verifiable-reasoning, tables, critics |
| 2604.07712 | CausalVAE as a Plug-in for World Models: Towards Reliable Counterfactual Dynamics | cs.LG | 84 | CausalVAE plug-in boosts counterfactual world models; better intervention robustness & interpretability. | causal-representation-learning, world-models, counterfactuals, robustness, distribution-shift, interpretability |
| 2604.17966 | TPS-CalcBench: A Benchmark and Diagnostic Evaluation Framework for LLM Analytical Calculation Competence in Hypersonic Thermal Protection System Engineering | cs.AI | 84 | Safety-critical engineering calc benchmark; penalizes plausible-but-physically-wrong LLM answers. | evaluation, safety-critical, reasoning, numerical-robustness, aerospace |
| 2604.19281 | Beyond Semantic Similarity: A Component-Wise Evaluation Framework for Medical Question Answering Systems with Health Equity Implications | cs.HC, cs.AI, cs.CL, cs.LG | 84 | VB-Score evaluates medical QA via verification components; highlights health equity and safety risks. | evaluation, factuality, medical, health-equity, reliability |
| 2603.17820 | Federated Distributional Reinforcement Learning with Distributional Critic Regularization | cs.LG | 84 | Fed distributional RL preserves tail risks via Wasserstein barycenter trust region; safety-critical fit | federated-learning, distributional-rl, risk-sensitive, trust-region, safety-critical |

AI Paper Insight Brief

2026-04-26

0) Executive takeaways (read this first)

  • “Evaluation is the bottleneck” is becoming concrete: multiple papers propose diagnostic benchmarks/metrics that expose hidden failure modes (contamination via syntactic brittleness in NL2SQL; “right answer, wrong reasoning” in TPS engineering; entity-level failures in medical QA; functional vs literal ToM in dialogue).
  • Uncertainty- and risk-aware control is moving from theory to agent practice: web agents use dual-level uncertainty to switch planning modes and drive MCTS; federated RL uses CVaR-weighted distributional critics + trust regions to reduce tail risk and drift under heterogeneity.
  • Structured intermediate representations are repeatedly the lever for robustness: typed skill DAGs with local repair (GraSP), Gherkin behavioral contracts for vuln detection (Phoenix), semi-structured belief states for forecasting (BLF), and verifiable step traces for multimodal table reasoning (V-tableR1).
  • Verification + RL is converging on “process supervision”: quantum reasoning uses deterministic verifiers fused into a verification-aware reward model (VRM) for RLVR; table reasoning uses a critic VLM to score step fidelity and gate rewards (PGPO).
  • Hybrid deployment patterns (edge+cloud, multi-agent teams) are getting principled: SLMs learn when/how to ask LLMs for help under privacy/efficiency constraints; delegation is calibrated with context-conditioned Beta posteriors to reduce misrouting on GAIA/SWE-bench.
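The delegation pattern in the last takeaway can be sketched in a few lines. This is a minimal illustration, not CADMAS-CTX's actual method: `beta_lcb`, `delegate`, the normal-approximation bound, and the margin value are all assumptions introduced here.

```python
import math

def beta_lcb(successes, failures, z=1.0):
    """Lower confidence bound on an agent's success rate under a
    Beta(1+successes, 1+failures) posterior (normal approximation)."""
    a, b = 1 + successes, 1 + failures
    mean = a / (a + b)
    var = (a * b) / ((a + b) ** 2 * (a + b + 1))
    return mean - z * math.sqrt(var)

def delegate(candidates, delta=0.05):
    """Pick the agent whose LCB beats the runner-up by margin delta;
    otherwise abstain rather than misdelegate. Assumes >= 2 candidates.
    candidates maps agent name -> (successes, failures)."""
    scored = sorted(((beta_lcb(s, f), name)
                     for name, (s, f) in candidates.items()), reverse=True)
    best, runner_up = scored[0], scored[1]
    if best[0] - runner_up[0] >= delta:
        return best[1]
    return None  # too close to call: abstain / escalate

# Toy usage: a well-tested specialist vs a barely-observed generalist.
history = {"sql_agent": (18, 2), "generalist": (3, 1)}
choice = delegate(history, delta=0.05)
```

The key design point is that the LCB penalizes agents with few observations, so an untested agent cannot win a delegation on a lucky streak.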

2) Key themes (clusters)

Theme: Uncertainty & risk as first-class control signals (agents + RL)

Theme: Structured representations + local repair beat flat prompting

Theme: Benchmarking that targets hidden failure modes (contamination, omission, process errors, equity)

Theme: Verification-aware RL / process supervision (text + multimodal)

  • Why it matters: “Plausible but wrong” is often a process failure; domains with verifiers (physics, tables) allow dense feedback that reduces hallucination-like errors.
  • Representative papers:
  • Common approach:
    • Replace sparse outcome rewards with richer signals: deterministic verifiers + semantic scoring (VRM); critic-scored step fidelity (V-tableR1).
    • Align optimization granularity to agent causality: step-level MDP + step-level GAE/PPO (StepPO).
    • Show ablations where removing verifiers/process reward degrades performance.
  • Open questions / failure modes:
    • Compute and complexity: large critics (e.g., 32B critic VLM) and verifier suites increase cost.
    • Generality: quantum/table domains are unusually verifiable; transfer to open-world tasks is unclear.
    • Empirical breadth: StepPO's evidence is currently limited to a single HotpotQA curve, without broader task coverage.
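The "richer signals" bullet can be made concrete with a toy reward fusion. Everything here is illustrative: `fused_reward`, the example checks, and the 0.5/0.5 weights are assumptions, not the VRM or V-tableR1 reward definitions.

```python
def fused_reward(answer, verifier_checks, semantic_score, gate=True):
    """Fuse deterministic verifier results with a semantic score.
    With gating on, any hard verifier failure zeroes the reward, so the
    policy cannot trade physical correctness for fluent-sounding text."""
    hard_ok = all(check(answer) for check in verifier_checks)
    if gate and not hard_ok:
        return 0.0
    # Verified answers earn half for passing checks, half for semantic
    # quality (weights are illustrative, not tuned).
    return 0.5 * float(hard_ok) + 0.5 * semantic_score

# Hypothetical unit/positivity checks for a physics-style answer.
checks = [lambda a: a["units"] == "J", lambda a: a["value"] > 0]
r_good = fused_reward({"units": "J", "value": 3.2}, checks, semantic_score=0.9)
r_bad = fused_reward({"units": "kg", "value": 3.2}, checks, semantic_score=0.9)
```

The gate is what makes the reward "verification-aware": a fluent answer in the wrong units gets zero, matching the ablation story where removing verifiers hurts.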

Theme: Security & privacy in real pipelines (hybrid inference, code security, certified robustness, memory forgetting)

3) Technical synthesis

  • Intermediate representations are the recurring “control surface”: DAGs (GraSP), contracts (Phoenix), belief JSON (BLF), rubric dimensions (TPS-CalcBench), axis decompositions (Four-Axis Decision Alignment), and verifiable step traces (V-tableR1) all create places to attach checks, rewards, and repairs.
  • Risk/uncertainty is being operationalized as gating: planning-mode switching (WebUncertainty), delegation margin δ with LCB-style scores (CADMAS-CTX), and trust-region shrink–squash constraints (FedDistRL) all implement “act only when confident enough.”
  • Evaluation is shifting from averages to diagnostics with guarantees: QuickScope uses COUP-style adaptive sampling + certification; SPENCE uses Kendall τ sensitivity; TPS-CalcBench uses noise-sensitivity and quadrant analysis (outcome-high/process-low).
  • Retrieval is not a universal fix: VB-Score shows RAG improves composite scores but entity F1 stays <10%; RADAR shows retrieval noise is exactly where multi-agent debate helps; WebUncertainty uses uncertainty-aware search rather than “more retrieval.”
  • Process supervision via verifiers/critics is converging across modalities: VRM fuses deterministic SES signals with semantic scores; V-tableR1 uses critic-gated rewards; both report ablations where removing verifiers hurts.
  • Multi-agent systems are being made more statistical: CADMAS-CTX uses Beta posteriors + variance penalties; QuickScope uses confidence bounds; BLF uses multi-trial aggregation + hierarchical calibration.
  • “Right answer, wrong reasoning” is now measurable in multiple domains: TPS-CalcBench explicitly scores formula selection/assumptions; V-tableR1 uses grounding scores; medical QA shows semantic similarity can hide entity failures.
  • Cost controls are increasingly built-in: RADAR early stopping; WebUncertainty reports inference-time reductions vs a baseline; QuickScope batching; BLF uses K=5 trials but calibrates/shrinks; FSFM caps memory to 70%.
  • Federation and decentralization trends: FedDistRL keeps policies local and federates critics; CADMAS-CTX keeps beliefs agent-local (no global store), emphasizing scalable decentralized coordination.
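The "act only when confident enough" gating above can be sketched with a simple entropy threshold. The function names, the threshold, and the mode labels are illustrative stand-ins for the dual-level uncertainty estimates these papers actually use.

```python
import math

def entropy(probs):
    """Shannon entropy (nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def choose_mode(action_probs, threshold=0.8):
    """Task-level uncertainty gate: a peaked predictive distribution
    triggers cheap implicit planning; a diffuse one triggers explicit
    search (e.g., MCTS) before committing to an action."""
    return "explicit_search" if entropy(action_probs) > threshold else "implicit_plan"
```

Usage: `choose_mode([0.9, 0.05, 0.05])` stays in the fast path, while `choose_mode([0.4, 0.3, 0.3])` escalates to search, which is the mechanism the synthesis credits for reducing cascading errors.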

4) Top 5 papers (with “why now”)

1) Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence

  • Builds ~1,978 executable environments and ~19,822 tools, plus verifiable task synthesis (graph-based + programmatic).
  • Trains agents with multi-environment GRPO and a self-evolving loop (diagnose failures → generate targeted tasks/environments).
  • Shows monotonic gains across evolution rounds (e.g., τ2-Bench 60.2→63.5→65.4 for 14B).
  • Skepticism: heavy reliance on LLM-driven mining/synthesis/diagnosis; diminishing returns beyond ~500 environments.

2) GraSP: Graph-Structured Skill Compositions for LLM Agents

  • Compiles retrieved skills into a typed DAG (state/data/order edges) with verifiers and local repair operators (REBIND/INSERTPREREQ/SUBSTITUTE/REWIRE/BYPASS).
  • Reports up to +19 reward points and up to 41% fewer environment steps; robust to over-retrieval and degraded skill quality.
  • “Compilation layer” framing is actionable for anyone building skill libraries.
  • Skepticism: DAG acyclicity limits loops/iterative refinement; relies on LLM compilation quality and tuned routing thresholds.
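The verify-then-locally-repair loop can be sketched on a precondition/effect model. This is a toy in the spirit of GraSP's INSERTPREREQ operator, not its implementation: the skill encoding, `verify`, and `repair_insert_prereq` are all assumptions made here.

```python
def verify(plan, skills, start_state):
    """Walk the plan accumulating effects; return the index of the
    first step whose preconditions are unmet, or None if valid.
    skills maps name -> (preconditions, effects) as sets."""
    state = set(start_state)
    for i, name in enumerate(plan):
        pre, eff = skills[name]
        if not pre <= state:
            return i
        state |= eff
    return None

def repair_insert_prereq(plan, skills, start_state):
    """Local repair: when a precondition is missing, splice in a skill
    that provides it, instead of replanning from scratch."""
    while (i := verify(plan, skills, start_state)) is not None:
        state = set(start_state)
        for name in plan[:i]:
            state |= skills[name][1]
        missing = skills[plan[i]][0] - state
        provider = next((n for n, (pre, eff) in skills.items()
                         if missing <= eff and pre <= state), None)
        if provider is None:
            raise ValueError(f"no repair for step {plan[i]}")
        plan = plan[:i] + [provider] + plan[i:]
    return plan

# Toy skill library: (preconditions, effects).
skills = {
    "login":  (set(),        {"session"}),
    "fetch":  ({"session"},  {"data"}),
    "report": ({"data"},     {"report"}),
}
fixed = repair_insert_prereq(["fetch", "report"], skills, set())
```

A plan that forgot to log in gets patched to `["login", "fetch", "report"]`; the point is that repair is local to the first violated edge, which is what makes the approach robust to over-retrieval.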

3) Learning to Seek Help: Dynamic Collaboration Between Small and Large Language Models

  • RL-trained SLM decides when/how to query an LLM; reward mixes EM quality, efficiency, and privacy leakage penalties.
  • Reports quality gains (+14.5%–17.4% over SLM-CoT; +2.8%–9.9% over static interaction) while reducing turns and lowering leakage (24.3%–32.4%).
  • Shows transfer: policies trained with one LLM generalize to stronger unseen LLMs.
  • Skepticism: weak instruction-following SLMs need supervised cold-start; privacy evaluation depends on an LLM judge.
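The reward mix described in the first bullet can be written down directly. The weights and the linear form are assumptions for illustration; the paper's actual reward shaping is not reproduced here.

```python
def help_seek_reward(exact_match, n_llm_calls, leaked_fields,
                     w_cost=0.05, w_priv=0.5):
    """Composite reward for a small model deciding when to call a large
    one: answer quality, minus a per-call efficiency cost, minus a
    privacy penalty per sensitive field exposed in the outgoing query."""
    return float(exact_match) - w_cost * n_llm_calls - w_priv * leaked_fields
```

Usage: one clean escalation scores `help_seek_reward(True, 1, 0) == 0.95`, while the same correct answer that leaked two sensitive fields scores negative, so the policy learns that leaking can outweigh being right.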

4) TPS-CalcBench: … Analytical Calculation Competence in Hypersonic TPS Engineering

  • 420-item curated benchmark with dual-track scoring: numeric correctness + 8-dimension process rubric (formula selection, units, plausibility, assumptions, etc.).
  • Finds a meaningful “outcome-high/process-low” quadrant (~11–14%) and identifies formula-selection as a dominant failure mode (~18% of tagged errors).
  • Demonstrates diagnose→intervene: RAG-EQ, DFA-TPS fine-tuning, and PA-CoT improve KPI and reduce hallucination-class errors.
  • Skepticism: LLM-judge rubric has ±3–5 KPI uncertainty; benchmark is currently 420 items and domain-specific.
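The dual-track quadrant analysis is easy to sketch. The thresholds, score pairs, and function are illustrative assumptions, not TPS-CalcBench's actual scoring.

```python
def quadrant(outcome, process, t_out=0.5, t_proc=0.5):
    """Place a response in a 2x2 grid: outcome score (numeric
    correctness) vs process score (rubric), each above/below a
    threshold. outcome-high/process-low flags right-for-wrong-reasons."""
    o = "high" if outcome >= t_out else "low"
    p = "high" if process >= t_proc else "low"
    return f"outcome-{o}/process-{p}"

# Hypothetical (outcome, process) score pairs for four responses.
scores = [(1.0, 0.9), (1.0, 0.2), (0.0, 0.8), (0.0, 0.1)]
risky = [quadrant(o, p) for o, p in scores].count("outcome-high/process-low")
```

Counting the risky quadrant over a batch is what turns "plausible but wrong process" from an anecdote into a tracked rate.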

5) SPENCE: A Syntactic Probe for Detecting Contamination in NL2SQL Benchmarks

  • Controlled paraphrase probe: hold schema+SQL fixed, generate paraphrases, rank by dependency-tree edit distance, measure accuracy drop vs syntactic divergence.
  • Strong negative Kendall τ on older benchmarks (Spider/SParC/CoSQL), near-zero on BIRD—actionable “benchmark trustworthiness” signal.
  • Robustness checks: paraphraser choice, temperature, length, and lexical overlap controls.
  • Skepticism: temporal gradient is correlational; paraphrase generation/filtering can introduce bias; execution accuracy conflates error sources.
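The probe's core statistic is just a rank correlation between syntactic divergence and accuracy. Below is a plain Kendall tau on hypothetical readings; the data and the O(n^2) implementation are illustrative, not SPENCE's pipeline.

```python
def kendall_tau(xs, ys):
    """Plain O(n^2) Kendall rank correlation (no tie correction)."""
    n = len(xs)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical probe readings: accuracy falls as paraphrases drift
# further from the benchmark's original dependency trees.
divergence = [1, 2, 3, 4, 5]
accuracy = [0.90, 0.85, 0.70, 0.55, 0.40]
tau = kendall_tau(divergence, accuracy)
```

A strongly negative tau (here exactly -1.0 on the toy data) is the contamination signal: the model tracks surface syntax, not the underlying task.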

5) Practical next steps

  • Adopt “diagnostic-first” eval: add at least one sensitivity curve (e.g., SPENCE-style syntactic divergence, retrieval-noise stress) and one process metric (TPS rubric-like or VB-Score components) alongside aggregate accuracy.
  • Instrument agents with explicit uncertainty gates: prototype task-level mode switching (explicit vs implicit planning) and action-level uncertainty-aware search (MCTS reward modulation) and measure cascading-error reduction.
  • Move orchestration to structured artifacts: try compiling tool/skill plans into a typed DAG with verifiers + local repair; compare step count and recovery rate vs flat ReAct+skills.
  • For safety-critical reasoning domains, add verifiers into RL: implement a VRM-like fusion of deterministic checks + semantic scoring, or a critic-gated process reward, and run ablations removing each verifier component.
  • Hybrid edge/cloud deployments: train or simulate a “help-seeking” policy that optimizes (quality, turns, privacy) and track privacy leakage rate under different penalty weights.
  • Security engineering: if you maintain code assistants, evaluate on SecLogging-style patterns (redaction/masking, injection) and add functional patch tests (beyond similarity) for remediation.
  • Memory governance: introduce a forgetting policy with auditable importance scoring and measure retrieval latency + “dangerous content retention” before/after pruning; treat forgetting as a security control, not just cost control.
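The memory-governance bullet can be prototyped as a two-stage pruning policy. This is a sketch under assumptions: the dict schema, `prune_memory`, and the 70% budget interpretation are introduced here, not taken from FSFM.

```python
def prune_memory(memories, capacity_frac=0.7, flagged=()):
    """Forgetting as a security control: first drop flagged
    (malicious/sensitive) entries unconditionally, then keep only the
    highest-importance entries up to capacity_frac of the original
    store size. Each memory is a dict with 'id' and 'importance'."""
    kept = [m for m in memories if m["id"] not in flagged]
    kept.sort(key=lambda m: m["importance"], reverse=True)
    budget = max(1, int(capacity_frac * len(memories)))
    return kept[:budget]

# Toy store: ten memories, one flagged as dangerous.
memories = [{"id": i, "importance": float(i)} for i in range(10)]
pruned = prune_memory(memories, capacity_frac=0.7, flagged={3})
```

Ordering matters: flagged content is removed before the importance ranking, so a "dangerous but important" memory can never survive on its score, which is the before/after retention metric the bullet asks you to measure.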

Generated from per-paper analyses; no external browsing.