Daily AI Paper Report (2026-04-26)
Published:
Chinese version: [中文]
Run stats
- Candidates: 4233
- Selected: 30
- Deepread completed: 30
- Window (UTC): 2026-04-24T00:00:00Z → 2026-04-25T00:00:00Z (weekend_backlog_unknown, expanded=0)
Selected papers
| arXiv ID | Title / Links | Categories | Score | Why | Tags |
|---|---|---|---|---|---|
2604.19457 | Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents | cs.AI | 92 | Clear eval framework for long-horizon enterprise agents: factuality, reasoning, compliance, abstention axes | agents, alignment, evaluation, compliance, abstention, enterprise |
2604.18292 | Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence | cs.AI, cs.CL | 92 | Self-evolving arena to synthesize verifiable real-world tool envs for training lifelong agents. | agents, environment-generation, tool-use, lifelong-learning, evaluation, MCP |
2604.18401 | StepPO: Step-Aligned Policy Optimization for Agentic Reinforcement Learning | cs.CL | 92 | Agentic RL post-training with step-aligned optimization for tool-using LLM agents. | agentic-RL, LLM-agents, post-training, tool-use, credit-assignment, long-horizon |
2604.17870 | GraSP: Graph-Structured Skill Compositions for LLM Agents | cs.CL | 92 | Executable skill graphs w/ verification+repair for LLM agents; tackles skill overload/orchestration. | llm-agents, tool-use, planning, skill-composition, verification, reliability |
2604.17771 | SPENCE: A Syntactic Probe for Detecting Contamination in NL2SQL Benchmarks | cs.CL, cs.AI, cs.DB | 90 | Practical framework to detect/quantify NL2SQL benchmark contamination; important for trustworthy evals | data-contamination, evaluation, NL2SQL, synthetic-variants, benchmarking, LLM-reliability |
2604.19548 | Taming Actor-Observer Asymmetry in Agents via Dialectical Alignment | cs.CL, cs.AI, cs.CY | 90 | Identifies bias in multi-agent reflection/auditing; introduces Ambiguous Failure Benchmark + mitigation. | agents, multi-agent, reliability, evaluation, cognitive-bias, self-critique |
2604.17950 | CADMAS-CTX: Contextual Capability Calibration for Multi-Agent Delegation | cs.AI | 90 | Risk-aware contextual capability calibration for delegation; reduces misdelegation via uncertainty. | multi-agent, delegation, calibration, uncertainty, risk-aware, agent-safety |
2604.20300 | FSFM: A Biologically-Inspired Framework for Selective Forgetting of Agent Memory | cs.AI | 90 | Selective forgetting for LLM agents targets efficiency + security (forget malicious/sensitive memory). | agents, memory, security, privacy, robustness |
2604.17821 | WebUncertainty: Dual-Level Uncertainty Driven Planning and Reasoning For Autonomous Web Agent | cs.AI | 90 | Web agent planning+reasoning with uncertainty-aware adaptive planning and MCTS to curb hallucinations | web-agents, planning, uncertainty, MCTS, hallucinations, agent-reliability |
2604.17894 | Automatic Slide Updating with User-Defined Dynamic Templates and Natural Language Instructions | cs.CL | 89 | DynaSlide benchmark + SlideAgent for NL-driven slide edits; strong real-world agentic eval. | benchmarks, agents, tool-use, multimodal, document-ai, evaluation |
2604.20211 | Towards Secure Logging: Characterizing and Benchmarking Logging Code Security Issues with LLMs | cs.SE, cs.AI, cs.CR | 88 | Taxonomy + real-world benchmark (101 cases) for insecure logging; enables LLM detection/repair evaluation | security, benchmark, software-engineering, LLM, privacy-leakage, log-injection |
2604.19300 | HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language Models | cs.SD, cs.AI | 88 | Large benchmark for hallucination detection in audio-language models; fills key multimodal safety gap | hallucinations, multimodal, audio-language-models, benchmark, evaluation, reliability |
2604.18576 | Agentic Forecasting using Sequential Bayesian Updating of Linguistic Beliefs | cs.AI | 88 | Agentic forecasting with Bayesian linguistic belief updates + calibration; SOTA on ForecastBench. | agents, calibration, uncertainty, tool-use, forecasting, evaluation |
2604.19502 | Beyond Rating: A Comprehensive Evaluation and Benchmark for AI Reviews | cs.CL | 88 | Benchmark/framework for AI peer reviews beyond ratings; multi-dimension eval + dataset. | evaluation, benchmarks, LLM-reviews, metrics, faithfulness, argumentation |
2604.17842 | QuickScope: Certifying Hard Questions in Dynamic LLM Benchmarks | cs.CL | 88 | Method/tool to efficiently find hard cases in dynamic LLM benchmarks; useful for red-teaming. | evaluation, benchmarks, adversarial-testing, bayesian-optimization, robustness |
2604.20658 | Cooperative Profiles Predict Multi-Agent LLM Team Performance in AI for Science Workflows | cs.CL | 88 | Cooperation-game profiles predict multi-agent LLM team performance; useful for agent evaluation/safety. | multi-agent, evaluation, cooperation, AI-for-science, benchmarks |
2604.17827 | Learning to Seek Help: Dynamic Collaboration Between Small and Large Language Models | cs.CL | 88 | SLM learns when/how to query LLM under privacy/cost constraints; useful for safe, efficient agent designs | small-models, LLM-collaboration, privacy, cost-control, multi-step-reasoning, agent-architectures |
2604.19015 | FedProxy: Federated Fine-Tuning of LLMs via Proxy SLMs and Heterogeneity-Aware Fusion | cs.LG, cs.AI | 88 | Federated LLM fine-tuning tackling IP+privacy+heterogeneity via proxy SLM and fusion; practical deployment value | federated-learning, llm-finetuning, privacy, ip-protection, heterogeneous-data, compression, edge |
2604.18038 | First, Do No Harm (With LLMs): Mitigating Racial Bias via Agentic Workflows | cs.CY, cs.AI | 86 | Measures and mitigates racial bias in clinical LLM workflows; governance lens + multi-model eval. | safety, bias, healthcare, evaluation, agentic-workflows, governance |
2604.19005 | Debating the Unspoken: Role-Anchored Multi-Agent Reasoning for Half-Truth Detection | cs.CL | 86 | Multi-agent debate to detect misleading half-truths under noisy retrieval; strong eval claim. | misinformation, fact-checking, multi-agent, debate, RAG, robustness |
2604.20443 | DialToM: A Theory of Mind Benchmark for Forecasting State-Driven Dialogue Trajectories | cs.CL, cs.AI, cs.LG | 86 | DialToM benchmark separates ToM recognition vs using states to forecast dialogue; exposes reasoning gaps. | benchmark, theory-of-mind, dialogue, evaluation, reasoning |
2604.19012 | Security Is Relative: Training-Free Vulnerability Detection via Multi-Agent Behavioral Contract Synthesis | cs.CR, cs.SE | 85 | Training-free multi-agent vuln detection via behavioral contract synthesis; addresses dedup collapse issue | cybersecurity, vulnerability-detection, agents, specification-synthesis, robust-evaluation, software-security |
2604.18176 | QuantumQA: Enhancing Scientific Reasoning via Physics-Consistent Dataset and Verification-Aware Reinforcement Learning | cs.AI, quant-ph | 85 | Physics-consistent QuantumQA dataset + verification-aware RLVR reward modeling for rigor. | RLVR, verifiable-rewards, scientific-reasoning, dataset, reward-modeling, reliability |
2604.20495 | Towards Certified Malware Detection: Provable Guarantees Against Evasion Attacks | cs.CR, cs.LG | 84 | Certified robustness for malware detection via randomized smoothing; formal guarantees vs evasion attacks | cybersecurity, malware, certified-robustness, adversarial, randomized-smoothing |
2604.18071 | Architectural Design Decisions in AI Agent Harnesses | cs.AI | 84 | Empirical taxonomy of agent harness architecture decisions across 70 projects; reusable patterns. | agents, systems, orchestration, tooling, safety-controls, survey |
2604.20755 | V-tableR1: Process-Supervised Multimodal Table Reasoning with Critic-Guided Policy Optimization | cs.AI, cs.LG | 84 | Process-supervised RL for multimodal table reasoning with step-level critic feedback. | multimodal, process-supervision, RL, verifiable-reasoning, tables, critics |
2604.07712 | CausalVAE as a Plug-in for World Models: Towards Reliable Counterfactual Dynamics | cs.LG | 84 | CausalVAE plug-in boosts counterfactual world models; better intervention robustness & interpretability. | causal-representation-learning, world-models, counterfactuals, robustness, distribution-shift, interpretability |
2604.17966 | TPS-CalcBench: A Benchmark and Diagnostic Evaluation Framework for LLM Analytical Calculation Competence in Hypersonic Thermal Protection System Engineering | cs.AI | 84 | Safety-critical engineering calc benchmark; penalizes plausible-but-physically-wrong LLM answers. | evaluation, safety-critical, reasoning, numerical-robustness, aerospace |
2604.19281 | Beyond Semantic Similarity: A Component-Wise Evaluation Framework for Medical Question Answering Systems with Health Equity Implications | cs.HC, cs.AI, cs.CL, cs.LG | 84 | VB-Score evaluates medical QA via verification components; highlights health equity and safety risks. | evaluation, factuality, medical, health-equity, reliability |
2603.17820 | Federated Distributional Reinforcement Learning with Distributional Critic Regularization | cs.LG | 84 | Fed distributional RL preserves tail risks via Wasserstein barycenter trust region; safety-critical fit | federated-learning, distributional-rl, risk-sensitive, trust-region, safety-critical |
AI Paper Insight Brief
2026-04-26
0) Executive takeaways (read this first)
- “Evaluation is the bottleneck” is becoming concrete: multiple papers propose diagnostic benchmarks/metrics that expose hidden failure modes (contamination via syntactic brittleness in NL2SQL; “right answer, wrong reasoning” in TPS engineering; entity-level failures in medical QA; functional vs literal ToM in dialogue).
- Uncertainty- and risk-aware control is moving from theory to agent practice: web agents use dual-level uncertainty to switch planning modes and drive MCTS; federated RL uses CVaR-weighted distributional critics + trust regions to reduce tail risk and drift under heterogeneity.
- Structured intermediate representations are repeatedly the lever for robustness: typed skill DAGs with local repair (GraSP), Gherkin behavioral contracts for vuln detection (Phoenix), semi-structured belief states for forecasting (BLF), and verifiable step traces for multimodal table reasoning (V-tableR1).
- Verification + RL is converging on “process supervision”: quantum reasoning uses deterministic verifiers fused into a verification-aware reward model (VRM) for RLVR; table reasoning uses a critic VLM to score step fidelity and gate rewards (PGPO).
- Hybrid deployment patterns (edge+cloud, multi-agent teams) are getting principled: SLMs learn when/how to ask LLMs for help under privacy/efficiency constraints; delegation is calibrated with context-conditioned Beta posteriors to reduce misrouting on GAIA/SWE-bench.
2) Key themes (clusters)
Theme: Uncertainty & risk as first-class control signals (agents + RL)
- Why it matters: Long-horizon agents fail via cascading errors; federated RL can hide tail risk under averaging. Explicit uncertainty/risk signals let systems choose safer planning modes and preserve downside-relevant structure.
- Representative papers:
- WebUncertainty: Dual-Level Uncertainty Driven Planning and Reasoning For Autonomous Web Agent
- CADMAS-CTX: Contextual Capability Calibration for Multi-Agent Delegation
- Federated Distributional Reinforcement Learning with Distributional Critic Regularization
- Common approach:
- Quantify uncertainty/risk (task-level vs action-level; CVaR tail focus; posterior variance penalties).
- Use thresholds/margins to gate decisions (explicit vs implicit planning; delegate vs self-execute; trust-region constraints).
- Optimize for robustness proxies (accident/catastrophe rates; misdelegation reduction; drift control).
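The gating pattern above can be sketched as a minimal threshold switch plus a CVaR tail summary. This is illustrative only: the threshold `tau`, the tail fraction `alpha`, and the use of mean self-reported confidence are assumptions, not settings from any of the papers.

```python
import statistics


def cvar(losses, alpha=0.1):
    """Conditional value-at-risk: mean of the worst alpha-fraction of losses.

    Averaging hides tail risk; CVaR keeps the downside-relevant structure.
    """
    worst = sorted(losses, reverse=True)
    k = max(1, int(len(worst) * alpha))
    return sum(worst[:k]) / k


def choose_mode(confidences, tau=0.7):
    """Gate the planning mode on task-level uncertainty.

    confidences: per-step self-reported confidences in [0, 1] (assumed source).
    tau: switching threshold (hypothetical value; the papers tune this).
    """
    task_conf = statistics.mean(confidences)
    return "implicit_fast_path" if task_conf >= tau else "explicit_deliberate_planning"


print(choose_mode([0.9, 0.85, 0.8]))  # high confidence -> fast path
print(choose_mode([0.4, 0.6, 0.5]))   # uncertain -> deliberate planning
print(cvar([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], alpha=0.2))  # mean of the two worst losses
```

The same shape covers the "delegate vs self-execute" and trust-region cases: a scalar risk summary plus a fixed margin, with the open question being how brittle the fixed threshold is under volatility.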
- Open questions / failure modes:
- Calibration brittleness: fixed thresholds (δ, τ) and confidence scores can mis-switch under volatility.
- Certificates/guarantees are partial: FedDistRL proves critic-side tracking under evaluation assumptions, not full actor-critic convergence.
- Uncertainty estimates derived from LLM confidences may be poorly calibrated across domains/models.
Theme: Structured representations + local repair beat flat prompting
- Why it matters: As skill libraries and tool ecosystems grow, flat sequences overload context and make recovery expensive; explicit structure enables bounded repair and better verification.
- Representative papers:
- GraSP: Graph-Structured Skill Compositions for LLM Agents
- Security Is Relative: Training-Free Vulnerability Detection via Multi-Agent Behavioral Contract Synthesis
- Agentic Forecasting using Sequential Bayesian Updating of Linguistic Beliefs
- Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence
- Common approach:
- Introduce an intermediate artifact: typed skill DAG; Gherkin contract; belief-state JSON; executable tools/tasks.
- Add verification hooks: node verifiers + pre/postconditions; contract compliance checks; leakage defenses; executable reward scripts.
- Prefer locality-bounded updates: h-hop graph repair; sequential Bayesian belief updates; targeted environment/task expansion.
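A toy sketch of the typed-node, verifier-hook, local-repair pattern. The node layout, the single SUBSTITUTE-style repair hook, and the dict-based state convention are hypothetical simplifications; GraSP's actual edge types and repair operators are richer.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SkillNode:
    """A skill with verification hooks attached (pre/postconditions)."""
    name: str
    run: Callable[[dict], dict]                    # state -> updated state
    precond: Callable[[dict], bool] = lambda s: True
    postcond: Callable[[dict], bool] = lambda s: True


def execute_dag(order, state, repair=None):
    """Execute skills in topological order; on a postcondition failure,
    attempt a locality-bounded repair (here: substitute one node) before
    failing the whole plan."""
    for node in order:
        if not node.precond(state):
            raise RuntimeError(f"precondition failed at {node.name}")
        new_state = node.run(dict(state))
        if not node.postcond(new_state):
            fixed = repair(node) if repair else None   # SUBSTITUTE-style hook
            if fixed is None:
                raise RuntimeError(f"postcondition failed at {node.name}")
            new_state = fixed.run(dict(state))
        state = new_state
    return state


# toy skills: a buggy sort is caught by its verifier and locally replaced
fetch = SkillNode("fetch", run=lambda s: {**s, "data": [3, 1, 2]})
bad_sort = SkillNode("sort", run=lambda s: s,  # buggy: forgets to sort
                     postcond=lambda s: s["data"] == sorted(s["data"]))
good_sort = SkillNode("sort_v2", run=lambda s: {**s, "data": sorted(s["data"])})

result = execute_dag([fetch, bad_sort], {}, repair=lambda node: good_sort)
print(result["data"])  # [1, 2, 3]
```

The point of the structure is visible even at this scale: the failure is caught at the node where it happens, and the repair touches one node rather than replanning the whole flat sequence.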
- Open questions / failure modes:
- Expressiveness limits: DAG assumption struggles with loops/iterative tasks (GraSP).
- Spec incompleteness dominates errors (Phoenix false negatives from incomplete Gherkin).
- Tool/environment synthesis relies heavily on LLM generation; bias/failure can be baked into environments (Agent-World).
Theme: Benchmarking that targets hidden failure modes (contamination, omission, process errors, equity)
- Why it matters: Aggregate accuracy/semantic similarity can look good while systems fail in ways that matter for safety, trust, and governance (memorization, omissions, wrong equations, missing entities, biased decisions).
- Representative papers:
- SPENCE: A Syntactic Probe for Detecting Contamination in NL2SQL Benchmarks
- TPS-CalcBench: A Benchmark and Diagnostic Evaluation Framework for LLM Analytical Calculation Competence in Hypersonic Thermal Protection System Engineering
- Beyond Semantic Similarity: A Component-Wise Evaluation Framework for Medical Question Answering Systems with Health Equity Implications
- Debating the Unspoken: Role-Anchored Multi-Agent Reasoning for Half-Truth Detection
- Common approach:
- Stress-test invariances: syntactic paraphrase rank vs execution accuracy (SPENCE); noisy retrieval vs full evidence (RADAR).
- Separate outcome from process: rubric-scored reasoning dimensions (TPS-CalcBench); component-wise entity/factual/structure scoring (VB-Score).
- Use statistical rigor: Kendall’s τ + bootstrap; calibrated judges; noise-sensitivity analyses.
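The SPENCE-style sensitivity check reduces to a rank correlation between syntactic divergence and accuracy. A self-contained sketch with made-up numbers (the divergence buckets and accuracies are illustrative, not the paper's data):

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall rank correlation: concordant minus discordant pairs,
    normalized by the total number of pairs (assumes no ties here)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / n_pairs


# hypothetical numbers: execution accuracy falls as paraphrases diverge
# syntactically from the (possibly memorized) benchmark phrasing
divergence = [0.0, 0.1, 0.2, 0.3, 0.4]       # dependency-tree edit distance
accuracy = [0.82, 0.80, 0.71, 0.65, 0.58]    # execution accuracy per bucket

tau = kendall_tau(divergence, accuracy)
print(tau)  # -1.0: strongly negative tau is the contamination signal
```

In practice a bootstrap over paraphrase samples would accompany the point estimate, per the statistical-rigor bullet above.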
- Open questions / failure modes:
- Correlation vs causation: “temporal gradient” in contamination is correlational (SPENCE).
- Judge dependence: rubric scoring uses LLM judges with systematic uncertainty (TPS-CalcBench; VB-Score lacks clinician adjudication).
- Retrieval-grounded debates can over-scrutinize and shift TRUE→HALF-TRUE (RADAR).
Theme: Verification-aware RL / process supervision (text + multimodal)
- Why it matters: “Plausible but wrong” is often a process failure; domains with verifiers (physics, tables) allow dense feedback that reduces hallucination-like errors.
- Representative papers:
- QuantumQA: Enhancing Scientific Reasoning via Physics-Consistent Dataset and Verification-Aware Reinforcement Learning
- V-tableR1: Process-Supervised Multimodal Table Reasoning with Critic-Guided Policy Optimization
- StepPO: Step-Aligned Policy Optimization for Agentic Reinforcement Learning
- Common approach:
- Replace sparse outcome rewards with richer signals: deterministic verifiers + semantic scoring (VRM); critic-scored step fidelity (V-tableR1).
- Align optimization granularity to agent causality: step-level MDP + step-level GAE/PPO (StepPO).
- Show ablations where removing verifiers/process reward degrades performance.
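A minimal sketch of verifier-fused rewards. The fusion weight, the 0.5 cap on unverified trajectories, and the check list are assumptions for illustration; VRM and V-tableR1 learn or tune their actual fusion.

```python
def fused_reward(verifier_checks, semantic_score, w_verify=0.7):
    """Fuse deterministic verifier signals with a semantic score (VRM-style).

    verifier_checks: booleans from deterministic checks, e.g. units
                     consistent, equation balanced, value in physical range.
    semantic_score:  [0, 1] score from a judge/critic model.
    w_verify:        fusion weight (hypothetical; papers tune/learn this).
    """
    verify_rate = sum(verifier_checks) / len(verifier_checks)
    # gate: if any hard check fails, cap the semantic contribution so
    # fluent-but-wrong trajectories cannot outscore verified ones
    if not all(verifier_checks):
        semantic_score = min(semantic_score, 0.5)
    return w_verify * verify_rate + (1 - w_verify) * semantic_score


print(fused_reward([True, True, True], 0.9))   # verified and fluent: 0.97
print(fused_reward([True, False, True], 0.9))  # plausible but fails a check
```

The gating clause is the operative idea: dense deterministic signals dominate the reward, and the semantic score only differentiates among already-verified outputs.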
- Open questions / failure modes:
- Compute and complexity: large critics (e.g., 32B critic VLM) and verifier suites increase cost.
- Generality: quantum/table domains are unusually verifiable; transfer to open-world tasks is unclear.
- Empirical breadth: StepPO evidence is currently a single HotpotQA curve without extensive reporting.
Theme: Security & privacy in real pipelines (hybrid inference, code security, certified robustness, memory forgetting)
- Why it matters: Deployment introduces new attack surfaces (cloud calls, logs, memory stores) and evaluation pitfalls; several works propose concrete artifacts (datasets, certificates, policies) rather than generic advice.
- Representative papers:
- Towards Secure Logging: Characterizing and Benchmarking Logging Code Security Issues with LLMs
- Learning to Seek Help: Dynamic Collaboration Between Small and Large Language Models
- Towards Certified Malware Detection: Provable Guarantees Against Evasion Attacks
- FSFM: A Biologically-Inspired Framework for Selective Forgetting of Agent Memory
- Common approach:
- Make trade-offs explicit and measurable: EM vs turns vs privacy leakage; detection vs repair similarity; certified radius; storage/latency/security retention.
- Use structured penalties/priors: privacy reward terms; safety-risk penalties in memory scoring; Wilson-interval certification.
- Provide curated benchmarks/taxonomies: SecLogging (101 issues, 10 patterns); malware robustness tests; PrivQA synthesis.
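Wilson-interval certification is a standard construction; a sketch of using its lower bound as a certificate (the 95/100 example is illustrative, not from any paper):

```python
import math


def wilson_lower_bound(successes, n, z=1.96):
    """Lower bound of the Wilson score interval for a binomial proportion.

    Usable as a certificate that a policy (e.g. detection or forgetting)
    retains a property with high confidence despite sampling noise.
    """
    if n == 0:
        return 0.0
    p = successes / n
    denom = 1 + z ** 2 / n
    centre = p + z ** 2 / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2))
    return (centre - margin) / denom


# 95/100 checks passed: the point estimate is 0.95, but the certified
# (95%-confidence) lower bound is noticeably smaller
print(round(wilson_lower_bound(95, 100), 3))  # 0.888
```

This is why the report frames these artifacts as certificates rather than scores: the claim is the lower bound, not the raw pass rate.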
- Open questions / failure modes:
- Reliance on LLM judges for privacy leakage labeling (SLM–LLM collaboration).
- Patch similarity may not equal functional correctness (SecLogging).
- Feature-space certificates may not map cleanly to feasible problem-space binary edits (certified malware detection).
3) Technical synthesis
- Intermediate representations are the recurring “control surface”: DAGs (GraSP), contracts (Phoenix), belief JSON (BLF), rubric dimensions (TPS-CalcBench), axis decompositions (Four-Axis Decision Alignment), and verifiable step traces (V-tableR1) all create places to attach checks, rewards, and repairs.
- Risk/uncertainty is being operationalized as gating: planning-mode switching (WebUncertainty), delegation margin δ with LCB-style scores (CADMAS-CTX), and trust-region shrink–squash constraints (FedDistRL) all implement “act only when confident enough.”
- Evaluation is shifting from averages to diagnostics with guarantees: QuickScope uses COUP-style adaptive sampling + certification; SPENCE uses Kendall τ sensitivity; TPS-CalcBench uses noise-sensitivity and quadrant analysis (outcome-high/process-low).
- Retrieval is not a universal fix: VB-Score shows RAG improves composite scores but entity F1 stays <10%; RADAR shows retrieval noise is exactly where multi-agent debate helps; WebUncertainty uses uncertainty-aware search rather than “more retrieval.”
- Process supervision via verifiers/critics is converging across modalities: VRM fuses deterministic SES signals with semantic scores; V-tableR1 uses critic-gated rewards; both report ablations where removing verifiers hurts.
- Multi-agent systems are being made more statistical: CADMAS-CTX uses Beta posteriors + variance penalties; QuickScope uses confidence bounds; BLF uses multi-trial aggregation + hierarchical calibration.
- “Right answer, wrong reasoning” is now measurable in multiple domains: TPS-CalcBench explicitly scores formula selection/assumptions; V-tableR1 uses grounding scores; medical QA shows semantic similarity can hide entity failures.
- Cost controls are increasingly built-in: RADAR early stopping; WebUncertainty reports inference-time reductions vs a baseline; QuickScope batching; BLF uses K=5 trials but calibrates/shrinks; FSFM caps memory to 70%.
- Federation and decentralization trends: FedDistRL keeps policies local and federates critics; CADMAS-CTX keeps beliefs agent-local (no global store), emphasizing scalable decentralized coordination.
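The Beta-posterior delegation pattern above can be sketched as a mean-minus-variance-penalty score with a delegation margin. The Beta(1,1) prior, `kappa`, `delta`, and the toy track records are assumptions for illustration, not CADMAS-CTX's settings.

```python
import math


def delegation_score(successes, failures, kappa=1.0):
    """Risk-aware capability score from a Beta posterior (LCB-style):
    posterior mean minus a variance penalty, so sparsely observed agents
    are not over-trusted. kappa is a hypothetical penalty weight."""
    a, b = successes + 1, failures + 1           # Beta(1, 1) prior
    mean = a / (a + b)
    var = (a * b) / ((a + b) ** 2 * (a + b + 1))
    return mean - kappa * math.sqrt(var)


def route(candidates, delta=0.05):
    """Delegate only when the best candidate beats self-execution by a
    margin delta; otherwise keep the task (abstain from delegating)."""
    self_score = delegation_score(6, 4)          # own (toy) track record
    best_name, best = max(candidates.items(), key=lambda kv: kv[1])
    return best_name if best - self_score >= delta else "self"


scores = {
    "specialist_many_obs": delegation_score(18, 2),  # 18/20: trusted
    "specialist_few_obs": delegation_score(2, 0),    # 2/2, but tiny sample
}
print(route(scores))  # specialist_many_obs: small samples are penalized
```

Note that beliefs stay local to the router: nothing here requires a global capability store, which is the decentralization point made above.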
4) Top 5 papers (with “why now”)
1) Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence
- Builds ~1,978 executable environments and ~19,822 tools, plus verifiable task synthesis (graph-based + programmatic).
- Trains agents with multi-environment GRPO and a self-evolving loop (diagnose failures → generate targeted tasks/environments).
- Shows monotonic gains across evolution rounds (e.g., τ2-Bench 60.2→63.5→65.4 for 14B).
- Skepticism: heavy reliance on LLM-driven mining/synthesis/diagnosis; diminishing returns beyond ~500 environments.
2) GraSP: Graph-Structured Skill Compositions for LLM Agents
- Compiles retrieved skills into a typed DAG (state/data/order edges) with verifiers and local repair operators (REBIND/INSERTPREREQ/SUBSTITUTE/REWIRE/BYPASS).
- Reports up to +19 reward points and up to 41% fewer environment steps; robust to over-retrieval and degraded skill quality.
- “Compilation layer” framing is actionable for anyone building skill libraries.
- Skepticism: DAG acyclicity limits loops/iterative refinement; relies on LLM compilation quality and tuned routing thresholds.
3) Learning to Seek Help: Dynamic Collaboration Between Small and Large Language Models
- RL-trained SLM decides when/how to query an LLM; reward mixes EM quality, efficiency, and privacy leakage penalties.
- Reports quality gains (+14.5%–17.4% over SLM-CoT; +2.8%–9.9% over static interaction) while reducing turns and lowering leakage (24.3%–32.4%).
- Shows transfer: policies trained with one LLM generalize to stronger unseen LLMs.
- Skepticism: weak instruction-following SLMs need supervised cold-start; privacy evaluation depends on an LLM judge.
4) TPS-CalcBench: … Analytical Calculation Competence in Hypersonic TPS Engineering
- 420-item curated benchmark with dual-track scoring: numeric correctness + 8-dimension process rubric (formula selection, units, plausibility, assumptions, etc.).
- Finds a meaningful “outcome-high/process-low” quadrant (~11–14%) and identifies formula-selection as a dominant failure mode (~18% of tagged errors).
- Demonstrates diagnose→intervene: RAG-EQ, DFA-TPS fine-tuning, and PA-CoT improve KPI and reduce hallucination-class errors.
- Skepticism: LLM-judge rubric has ±3–5 KPI uncertainty; benchmark is currently 420 items and domain-specific.
5) SPENCE: A Syntactic Probe for Detecting Contamination in NL2SQL Benchmarks
- Controlled paraphrase probe: hold schema+SQL fixed, generate paraphrases, rank by dependency-tree edit distance, measure accuracy drop vs syntactic divergence.
- Strong negative Kendall τ on older benchmarks (Spider/SParC/CoSQL), near-zero on BIRD—actionable “benchmark trustworthiness” signal.
- Robustness checks: paraphraser choice, temperature, length, and lexical overlap controls.
- Skepticism: temporal gradient is correlational; paraphrase generation/filtering can introduce bias; execution accuracy conflates error sources.
5) Practical next steps
- Adopt “diagnostic-first” eval: add at least one sensitivity curve (e.g., SPENCE-style syntactic divergence, retrieval-noise stress) and one process metric (TPS rubric-like or VB-Score components) alongside aggregate accuracy.
- Instrument agents with explicit uncertainty gates: prototype task-level mode switching (explicit vs implicit planning) and action-level uncertainty-aware search (MCTS reward modulation) and measure cascading-error reduction.
- Move orchestration to structured artifacts: try compiling tool/skill plans into a typed DAG with verifiers + local repair; compare step count and recovery rate vs flat ReAct+skills.
- For safety-critical reasoning domains, add verifiers into RL: implement a VRM-like fusion of deterministic checks + semantic scoring, or a critic-gated process reward, and run ablations removing each verifier component.
- Hybrid edge/cloud deployments: train or simulate a “help-seeking” policy that optimizes (quality, turns, privacy) and track privacy leakage rate under different penalty weights.
- Security engineering: if you maintain code assistants, evaluate on SecLogging-style patterns (redaction/masking, injection) and add functional patch tests (beyond similarity) for remediation.
- Memory governance: introduce a forgetting policy with auditable importance scoring and measure retrieval latency + “dangerous content retention” before/after pruning; treat forgetting as a security control, not just cost control.
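For the memory-governance item, a minimal FSFM-inspired sketch of auditable, importance-scored forgetting (the scoring weights, the 70% retention cap, and the security-flag handling are illustrative assumptions):

```python
def prune_memory(entries, keep_fraction=0.7):
    """Importance-scored forgetting with an audit log. Security-flagged
    entries are forced to the bottom of the ranking so they are forgotten
    first -- treating forgetting as a security control, not just a cost one."""
    def importance(e):
        score = 0.5 * e["recency"] + 0.5 * e["usefulness"]  # assumed weights
        return -1.0 if e.get("security_risk") else score

    ranked = sorted(entries, key=importance, reverse=True)
    k = int(len(ranked) * keep_fraction)                    # e.g. 70% cap
    kept, forgotten = ranked[:k], ranked[k:]
    audit_log = [(e["id"], round(importance(e), 2)) for e in forgotten]
    return kept, audit_log


mem = [
    {"id": 1, "recency": 0.9, "usefulness": 0.8},
    {"id": 2, "recency": 0.2, "usefulness": 0.3},
    {"id": 3, "recency": 0.8, "usefulness": 0.9, "security_risk": True},
]
kept, log = prune_memory(mem)
print([e["id"] for e in kept], log)  # [1, 2] [(3, -1.0)]
```

Measuring "dangerous content retention" before and after a prune like this is exactly the audit the bullet above asks for.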
Generated from per-paper analyses; no external browsing.
