Daily AI Paper Report (2026-04-18)

Run stats

  • Candidates: 3670
  • Selected: 30
  • Deepread completed: 30
  • Window (UTC): 2026-04-17T00:00:00Z → 2026-04-18T00:00:00Z (weekend_backlog_sun, expanded=0)
Selected papers

2604.12951 · The Verification Tax: Fundamental Limits of AI Auditing in the Rare-Error Regime
  Categories: cs.LG | Score: 95
  Why: Proves minimax limits for calibration auditing in rare-error regime; big implications for AI eval & governance.
  Tags: calibration, auditing, evaluation, statistical-limits, rare-errors, reliability

2604.12548 · DeepSeek Robustness Against Semantic-Character Dual-Space Mutated Prompt Injection
  Categories: cs.CR | Score: 92
  Why: Black-box prompt-injection fuzzing combining semantic + char obfuscation; timely robustness eval on DeepSeek.
  Tags: prompt-injection, jailbreaks, robustness-eval, fuzzing, black-box, LLM-security, Chinese-LLMs

2604.12666 · From Imitation to Discrimination: Progressive Curriculum Learning for Robust Web Navigation
  Categories: cs.LG, cs.CL, cs.HC | Score: 90
  Why: 590k web-agent dataset + hard negatives + curriculum to improve robust web navigation generalization.
  Tags: web-agents, robustness, dataset, hard-negatives, curriculum-learning, evaluation

2604.12461 · CIA: Inferring the Communication Topology from LLM-based Multi-Agent Systems
  Categories: cs.AI | Score: 90
  Why: Black-box attack infers LLM multi-agent communication topology; concrete new MAS privacy/security risk.
  Tags: multi-agent, security, privacy, black-box-attack, topology-inference, LLM-agents

2604.05846 · AgentGL: Towards Agentic Graph Learning with LLMs via Reinforcement Learning
  Categories: cs.CL | Score: 90
  Why: RL-driven LLM agent for graph-native tool use; relevant to agentic systems design & control.
  Tags: LLM agents, tool use, reinforcement learning, graph learning, agentic retrieval

2604.11518 · From Translation to Superset: Benchmark-Driven Evolution of a Production AI Agent from Rust to Python
  Categories: cs.SE, cs.AI | Score: 90
  Why: Real production coding-agent port; benchmark-driven method + SWE-bench/Terminal-Bench results.
  Tags: agents, coding-agents, SWE-bench, evaluation, software-engineering, LLM-assisted-development

2604.12601 · LLM-Guided Prompt Evolution for Password Guessing
  Categories: cs.CR, cs.AI | Score: 90
  Why: LLM prompt evolution boosts password cracking; important offensive-security signal for LLM misuse evals.
  Tags: cybersecurity, LLM-misuse, prompt-optimization, red-teaming, password-guessing

2604.12160 · PubSwap: Public-Data Off-Policy Coordination for Federated RLVR
  Categories: cs.LG | Score: 90
  Why: Federated RLVR with public off-policy signal sharing; practical for private-data reasoning post-training.
  Tags: RLVR, federated-learning, post-training, LoRA, reasoning, privacy

2604.12459 · Operationalising the Right to be Forgotten in LLMs: A Lightweight Sequential Unlearning Framework for Privacy-Aligned Deployment in Politically Sensitive Environments
  Categories: cs.AI | Score: 88
  Why: Practical sequential unlearning for right-to-be-forgotten; layer-restricted negative fine-tuning, evaluated on a benchmark.
  Tags: unlearning, privacy, right-to-be-forgotten, LLMs, deployment, fine-tuning

2604.06802 · Riemann-Bench: A Benchmark for Moonshot Mathematics
  Categories: cs.AI | Score: 88
  Why: Research-level math benchmark beyond olympiad; curated hard problems for frontier reasoning eval.
  Tags: evaluation, math-reasoning, benchmarks, moonshot, LLM-reasoning

2604.11661 · Towards Autonomous Mechanistic Reasoning in Virtual Cells
  Categories: cs.LG, cs.AI | Score: 88
  Why: Multi-agent verified mechanistic reasoning + new dataset for grounded scientific agents.
  Tags: agents, verification, grounding, scientific-discovery, dataset, multi-agent

2604.12913 · CoDe-R: Refining Decompiler Output with LLMs via Rationale Guidance and Adaptive Inference
  Categories: cs.SE, cs.AI, cs.CR | Score: 86
  Why: LLM decompiler refinement targeting hallucinations/semantic mismatch; practical impact on security reverse-engineering workflows.
  Tags: code-LLMs, reverse-engineering, decompilation, hallucinations, rationale-guidance, robust-inference, security

2604.11772 · Towards Automated Pentesting with Large Language Models
  Categories: cs.CR | Score: 86
  Why: LLM-assisted pentesting framework; concrete offensive code-generation results raise security/dual-use stakes.
  Tags: cybersecurity, LLMs, pentesting, code-generation, dual-use, PowerShell

2604.12867 · QuarkMedSearch: A Long-Horizon Deep Search Agent for Exploring Medical Intelligence
  Categories: cs.AI | Score: 86
  Why: Long-horizon deep-search agent for the medical domain with data + training + benchmarks; strong agentic relevance.
  Tags: agents, deep-search, tool-use, medical, benchmarks, post-training

2603.24389 · When AI Meets Early Childhood Education: Large Language Models as Assessment Teammates in Chinese Preschools
  Categories: cs.CL, cs.AI, cs.CY | Score: 86
  Why: Large real-world LLM assessment dataset for teacher-child interaction; scalable evaluation implications.
  Tags: LLM, evaluation, education, dataset, human-AI collaboration, Chinese

2604.05767 · Beyond the Beep: Scalable Collision Anticipation and Real-Time Explainability with BADAS-2.0
  Categories: cs.CV, cs.CL | Score: 86
  Why: Safety-critical collision anticipation with long-tail benchmark + scalable data-curation pipeline.
  Tags: safety, autonomous driving, long-tail evaluation, video understanding, explainability, benchmark

2604.07240 · $k$-server-bench: Automating Potential Discovery for the $k$-Server Conjecture
  Categories: cs.MS, cs.AI, cs.LG | Score: 86
  Why: Open-ended automated-discovery benchmark for the k-server conjecture; sound refutation-based eval.
  Tags: automated-discovery, math, benchmarks, agents, program-synthesis, evaluation

2604.12748 · Generating Effective CoT Traces for Mitigating Causal Hallucination
  Categories: cs.CL | Score: 86
  Why: Targets causal hallucination with generated CoT traces and proposes a new hallucination metric (CHR).
  Tags: hallucinations, reasoning, chain-of-thought, evaluation, dataset-generation

2603.23253 · On the Vulnerability of FHE Computation to Silent Data Corruption
  Categories: cs.CR, cs.AR | Score: 86
  Why: Reliability risk for FHE on real hardware; silent corruption is critical for privacy-preserving AI.
  Tags: security, privacy, FHE, reliability, faults, robust-computation

2604.12196 · Beyond Majority Voting: Efficient Best-Of-N with Radial Consensus Score
  Categories: cs.CL | Score: 86
  Why: Training-free best-of-N via embedding consensus; improves reliability beyond majority voting.
  Tags: best-of-n, self-consistency, reliability, decoding, embeddings, selection

2604.12737 · Evaluating Differential Privacy Against Membership Inference in Federated Learning: Insights from the NIST Genomics Red Team Challenge
  Categories: cs.CR, cs.LG | Score: 84
  Why: Real red-team setting: DP vs membership inference in federated learning; stacked black-box attack analysis.
  Tags: privacy, membership-inference, federated-learning, differential-privacy, red-teaming, genomics

2604.06712 · Broken Quantum: A Systematic Formal Verification Study of Security Vulnerabilities Across the Open-Source Quantum Computing Simulator Ecosystem
  Categories: cs.CR, cs.SE, quant-ph | Score: 84
  Why: Large formal security audit (547 findings) + novel QASM injection; strong, reusable security evidence.
  Tags: security, formal-verification, static-analysis, SMT, quantum, software-supply-chain

2604.12446 · Scaling Exposes the Trigger: Input-Level Backdoor Detection in Text-to-Image Diffusion Models via Cross-Attention Scaling
  Categories: cs.CR, cs.CV | Score: 84
  Why: Practical input-level backdoor detection for T2I diffusion via cross-attention scaling probes.
  Tags: backdoors, diffusion-models, text-to-image, model-security, detection, cross-attention

2604.12944 · Distorted or Fabricated? A Survey on Hallucination in Video LLMs
  Categories: cs.CV, cs.AI | Score: 84
  Why: Survey + taxonomy of hallucinations in video LLMs with eval/mitigation overview; reliability-relevant.
  Tags: hallucinations, video-llm, evaluation, mitigation, survey, reliability

2603.19169 · ARIADNE: A Perception-Reasoning Synergy Framework for Trustworthy Coronary Angiography Analysis
  Categories: cs.CV, cs.AI | Score: 84
  Why: Uses DPO + explicit rejection in a medical VLM/RL pipeline; reliability-oriented design in a high-stakes setting.
  Tags: DPO, rejection, medical AI, VLM, RL, reliability

2603.23043 · Assessing the Robustness of Climate Foundation Models under No-Analog Distribution Shifts
  Categories: cs.LG, cs.AI | Score: 84
  Why: OOD robustness eval for climate foundation models under true no-analog shifts; tackles contamination.
  Tags: distribution shift, OOD evaluation, robustness, foundation models, climate

2604.11801 · CLSGen: A Dual-Head Fine-Tuning Framework for Joint Probabilistic Classification and Verbalized Explanation
  Categories: cs.CL | Score: 84
  Why: Dual-head tuning to get calibrated probabilities without losing LLM explanation ability.
  Tags: calibration, uncertainty, probabilities, fine-tuning, explanations, reliability

2604.01538 · Countering Catastrophic Forgetting of Large Language Models for Better Instruction Following via Weight-Space Model Merging
  Categories: cs.CL, cs.AI | Score: 84
  Why: Weight-space model merging to reduce instruction-following forgetting during domain adaptation.
  Tags: model-merging, catastrophic-forgetting, instruction-following, domain-adaptation, LLMs

2604.10905 · Audio Flamingo Next: Next-Generation Open Audio-Language Models for Speech, Sound, and Music
  Categories: cs.SD, cs.AI, cs.CL, eess.AS | Score: 83
  Why: Major open audio-language model upgrade: 30-minute context + timestamped reasoning (temporal CoT).
  Tags: audio-language-models, long-context, multimodal, reasoning, temporal-grounding, datasets

2604.11129 · DeCoVec: Building Decoding Space based Task Vector for Large Language Models via In-Context Learning
  Categories: cs.CL | Score: 83
  Why: Training-free task steering via decoding-space vectors from ICL; broadly useful for control/guardrails.
  Tags: steering, task-vectors, in-context-learning, logits, LLM-control

AI Paper Insight Brief

2026-04-18

1) Executive takeaways (read this first)

  • “Active probing” is emerging as a robust security primitive: scaling cross-attention inside diffusion models exposes backdoor triggers (SET), and carefully crafted queries can elicit intermediate agent traces to infer multi-agent communication topology (CIA).
  • RL-style post-training is spreading beyond chat into domain agents and structured decision pipelines: PPO for clinical stenosis localization (ARIADNE), GRPO/RLVR for federated reasoning (PubSwap) and medical deep search (QuarkMedSearch), and GRPO for web navigation robustness (Triton curriculum).
  • Data/benchmark design is doing as much work as model scaling: long-tail mining + SSL + distillation yields real-time collision anticipation (BADAS-2.0); hard negatives + rejection samples + synthetic grounding drive web-agent generalization (Triton); private “moonshot” math benchmarks show frontier models still <10% (Riemann-Bench).
  • Reliability is increasingly framed as “selection + verification”: best-of-N selection improves via embedding consensus (RCS), decompilation improves via dual-path generation with recompilation checks (CoDe-R), and biology reasoning improves via structured DAG traces filtered by specialized verifiers (VCR-Agent/VC-Traces).
  • Evaluation is hitting fundamental limits in the rare-error regime: calibration auditing becomes statistically impossible below a verification floor without active querying, and verification costs can explode compositionally across pipelines (Verification Tax).

2) Key themes (clusters)

Theme: Active probing for security & model forensics

Theme: RL/Preference optimization as the “glue” for agents and pipelines

Theme: Long-tail robustness + edge deployment via data mining, SSL, and distillation

  • Why it matters: Safety-critical domains fail in rare regimes; scaling data coverage and compressing models for real-time inference is often more impactful than architecture tweaks.
  • Common approach:
    • Targeted data acquisition (oracle mining + geospatial harvesting; historical-only splits to avoid contamination).
    • Domain SSL to adapt representations (V-JEPA-style SSL on 2.25M unlabeled driving videos).
    • Distill large teachers into deployable students with measured latency/accuracy trade-offs.
  • Open questions / failure modes:
    • “Accuracy vs stability” under true OOD (ClimaX lowest error but larger relative degradation; precipitation fragile).
    • Remaining hard long-tail categories (BADAS animal EWR <80% even for largest model).
    • Benchmark realism: OOD axes beyond those tested (more SSPs/GCMs; spatial/resolution shifts).
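
The teacher-to-student distillation step in the common approach above can be sketched as a standard temperature-scaled KD loss. This is a minimal NumPy sketch of generic knowledge distillation under stated assumptions, not BADAS-2.0's actual recipe; the temperature value and array shapes are illustrative:

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax along the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2
    (the usual correction so gradient magnitudes match the hard-label scale)."""
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    kl = np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)), axis=-1)
    return float(np.mean(kl) * temperature ** 2)

# A student that matches the teacher incurs (near-)zero loss.
teacher = np.array([[2.0, 0.5, -1.0]])
assert distillation_loss(teacher, teacher) < 1e-9
```

Higher temperatures soften both distributions, exposing the teacher's relative preferences among wrong classes, which is where most of the transferable signal lives.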

Theme: Verification, selection, and structured outputs for reliability

Theme: Privacy & reliability risks in infrastructure (FHE, quantum simulators, FL)

3) Technical synthesis

  • Alignment techniques are being repurposed as constraint enforcers: DPO is used to prefer topologically connected vessel masks (ARIADNE), while ORPO/GRPO are used to sharpen discrimination and long-horizon consistency in web navigation (Triton).
  • “Reject/abstain” is becoming a first-class action: ARIADNE’s MDP includes Reject to reduce false positives; Triton adds explicit None/reject samples; unlearning work aims to induce refusals on sensitive prompts.
  • Active vs passive evaluation is a recurring fault line: SET and CIA succeed by active probing/elicitation; Verification Tax formalizes why passive auditing fails when errors are rare.
  • Consensus/center-of-mass ideas show up in different guises: RCS uses a Fréchet mean in embedding space for best-of-N; SET learns a benign “center” in response-shift space for one-class detection.
  • Verifier-gated training data is a common reliability lever: VC-Traces filters mechanistic actions with DTI/DE verifiers; Triton’s synthetic DOM grounding is accepted only under dual-agent consensus; QuarkMedSearch uses strict correctness-gated rewards to avoid reward hacking.
  • Distillation is paired with domain SSL to hit deployment constraints: BADAS-2.0 uses SSL on 2.25M unlabeled videos then KD to 86M/22M students with large latency gains.
  • OOD robustness is being measured as stability, not just error: climate emulation reports percent-change degradation under scenario shifts and highlights precipitation fragility.
  • System security is expanding to “meta” properties: CIA treats MAS topology as sensitive IP; Broken Quantum shows ecosystem-wide vulnerability patterns tied to 2^n scaling.
  • Compute/latency overhead is increasingly explicit: DeCoVec reports ~1.6–1.7× overhead; SET requires multi-run probing; CoDe-R adds dual-path inference; BADAS reports end-to-end latency budgets down to tens of ms.
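
The consensus/center-of-mass selection idea above (RCS-style best-of-N) can be sketched as picking the candidate whose embedding lies closest to the centroid of all candidate embeddings. This is a simplified cosine-to-mean stand-in, not the paper's exact radial consensus score; `embeddings` would come from any off-the-shelf sentence embedder:

```python
import numpy as np

def consensus_select(embeddings):
    """Return the index of the candidate whose embedding is closest
    (by cosine similarity) to the normalized mean of all candidates.

    embeddings: (N, d) array, one row per sampled answer.
    """
    X = np.asarray(embeddings, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)   # unit-normalize rows
    center = X.mean(axis=0)                            # consensus direction
    center = center / np.linalg.norm(center)
    sims = X @ center                                  # cosine similarity to center
    return int(np.argmax(sims))

# Three answers cluster semantically, one is an outlier: the consensus
# pick should come from the cluster, not the outlier at index 3.
cands = np.array([[1.0, 0.1], [0.9, 0.2], [1.0, 0.0], [-1.0, 0.0]])
assert consensus_select(cands) != 3
```

Unlike majority voting, this needs no exact-match answer extraction, so it degrades gracefully on free-form outputs where no two samples are string-identical.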

4) Top 5 papers (with “why now”)

1) The Verification Tax: Fundamental Limits of AI Auditing in the Rare-Error Regime

  • Proves passive ECE estimation rate Θ((L·ε/m)^{1/3}) and a detection phase transition near m·ε ≈ 1.
  • Shows label-free self-evaluation is worst-case uninformative; active querying improves to Θ(√(ε/m)).
  • Explains why many benchmark deltas are statistically indistinguishable and why pipeline verification can explode with depth.
  • Be skeptical about: assumptions (Lipschitz calibration, i.i.d. samples, binned ECE) and worst-case composition may overstate difficulty in structured real deployments.
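
Taking the stated rates at face value, one can back out a rough "resolution floor" for a benchmark: ECE deltas below this scale are indistinguishable from noise. A minimal sketch; the theorem only pins down the rate, not the constants, so treat these as order-of-magnitude scales:

```python
def ece_resolution_floor(m, eps, L=1.0):
    """Scale of the smallest ECE difference a passive audit can resolve
    with m samples, base error rate eps, and Lipschitz constant L,
    per the Theta((L*eps/m)**(1/3)) minimax rate. The rate fixes only
    the scaling, so this is a scale, not an exact threshold."""
    return (L * eps / m) ** (1.0 / 3.0)

def active_resolution_floor(m, eps):
    """Same idea for the active-querying rate Theta(sqrt(eps/m))."""
    return (eps / m) ** 0.5

# With 10k samples and a 0.1% error rate, passive auditing cannot
# resolve ECE deltas much below ~5e-3, while active querying reaches
# deltas around ~3e-4: roughly an order of magnitude finer.
m, eps = 10_000, 1e-3
assert active_resolution_floor(m, eps) < ece_resolution_floor(m, eps)
```

The detection phase transition near m·ε ≈ 1 has the same practical reading: below roughly 1/ε samples, a passive audit cannot even confirm that rare errors exist.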

2) Scaling Exposes the Trigger: Input-Level Backdoor Detection in T2I Diffusion via Cross-Attention Scaling (SET)

  • Introduces CSRD: backdoored prompts diverge from benign under cross-attention scaling trajectories.
  • Builds a one-class detector from response-shift features; reports average AUROC 95.1% and ACC 84.8% across attacks.
  • Particularly targets stealthy implicit triggers where surface detectors fail.
  • Be skeptical about: white-box requirement and per-input compute overhead from multi-scaling, multi-step probing; evaluation limited to SD v1.4 + MS-COCO prompts.
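
SET's one-class setup can be sketched generically: fit a benign "center" plus a radius on response-shift features from clean prompts, then flag inputs whose features fall outside that ball. The random features here are stand-ins for the paper's cross-attention scaling trajectories; the quantile threshold is an illustrative choice:

```python
import numpy as np

def fit_one_class(benign_features, quantile=0.95):
    """Fit a center + distance threshold on benign response-shift features.
    Returns (center, radius), where radius is the given quantile of
    benign distances to the center."""
    X = np.asarray(benign_features, dtype=float)
    center = X.mean(axis=0)
    dists = np.linalg.norm(X - center, axis=1)
    radius = float(np.quantile(dists, quantile))
    return center, radius

def is_suspicious(features, center, radius):
    """Flag an input whose feature vector lies outside the benign ball."""
    return float(np.linalg.norm(np.asarray(features) - center)) > radius

rng = np.random.default_rng(0)
benign = rng.normal(0.0, 1.0, size=(200, 8))   # stand-in clean-prompt features
center, radius = fit_one_class(benign)
trigger_like = np.full(8, 6.0)                 # far-outlying feature vector
assert is_suspicious(trigger_like, center, radius)
```

The appeal of the one-class framing is that it needs no backdoored examples at fit time, only a small clean reference set, which matches the supply-chain acceptance-testing scenario.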

3) Beyond the Beep: BADAS-2.0 collision anticipation + real-time explainability

  • Scales labeled data to 178.5k videos and adds a long-tail benchmark; combines domain SSL + KD to edge models.
  • Reports Kaggle mAP 0.940 (vs 0.925) and major latency reduction (~2.5s → 35ms per window), enabling on-device budgets.
  • Adds attention heatmaps and a VLM explanation module (BADAS-Reason) for actionable outputs.
  • Be skeptical about: attention heatmaps are patch-level proxies; some long-tail groups remain challenging (e.g., animal EWR <80%).

4) From Imitation to Discrimination: Progressive curriculum for robust web navigation (Triton)

  • Dataset engineering (hard negatives + counterfactual rejects + dual-agent-verified synthetic grounding) plus SFT→ORPO→GRPO.
  • Reports 58.7% Step SR on Mind2Web, exceeding GPT-4.5 (42.4%) and Claude-4.5 (41.4%) in the paper’s table.
  • Demonstrates that “what not to click” training (rejection) is pivotal for DOM-heavy pages.
  • Be skeptical about: evaluation is on static Mind2Web snapshots; text-only (no pixel cues); GRPO adds rollout cost.
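
The hard-negative and counterfactual-reject construction above can be sketched as pair-building for preference optimization. All field names and action strings below are illustrative, not Triton's actual schema:

```python
def build_preference_pairs(step):
    """Build (chosen, rejected) pairs for one annotated navigation step.

    step: dict with keys (hypothetical schema)
      - "correct":        the ground-truth action string
      - "hard_negatives": plausible-but-wrong actions from the same page
      - "needs_action":   False when the right move is to do nothing
    """
    pairs = []
    chosen = step["correct"] if step["needs_action"] else "None"
    for neg in step["hard_negatives"]:
        pairs.append((chosen, neg))
    if step["needs_action"]:
        # Counterfactual reject: abstaining when an action is required
        # is itself treated as a hard negative.
        pairs.append((chosen, "None"))
    return pairs

step = {"correct": "CLICK #checkout", "needs_action": True,
        "hard_negatives": ["CLICK #cart-icon", "TYPE #search 'checkout'"]}
pairs = build_preference_pairs(step)
assert ("CLICK #checkout", "None") in pairs and len(pairs) == 3
```

Feeding such pairs to ORPO-style training teaches discrimination ("what not to click") rather than pure imitation, which is the paper's core shift.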

5) ARIADNE: DPO-aligned topology-preserving angiography segmentation + RL stenosis reasoning

  • Applies DPO to preference pairs that favor connected vessel topology; improves topology-sensitive metrics (clDice 0.8378).
  • Downstream PPO agent with Reject action reduces false positives (FPPI 0.85 vs ~1.89–2.45 baselines) while keeping recall 0.867.
  • Shows a concrete pattern: align perception to structural constraints, then do decision-time RL with asymmetric clinical rewards.
  • Be skeptical about: single-institution training data; 2D projection ambiguity; RL assumes at most one dominant stenosis per segment; DPO adds ~2.8× training time.
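
The clDice metric reported above rewards preserving vessel connectivity rather than raw pixel overlap. A minimal sketch assuming skeletons are supplied as precomputed boolean masks (the published metric's training-time variant uses soft skeletonization; `skimage.morphology.skeletonize` would be one way to obtain hard skeletons):

```python
import numpy as np

def cl_dice(pred, true, pred_skel, true_skel):
    """Centerline Dice: harmonic mean of topology precision/sensitivity.

    pred, true:           boolean segmentation masks
    pred_skel, true_skel: boolean skeletons (centerlines) of those masks
    """
    pred, true = np.asarray(pred, bool), np.asarray(true, bool)
    pred_skel, true_skel = np.asarray(pred_skel, bool), np.asarray(true_skel, bool)
    # Fraction of the predicted skeleton lying inside the true mask, and
    # of the true skeleton lying inside the predicted mask.
    tprec = (pred_skel & true).sum() / max(pred_skel.sum(), 1)
    tsens = (true_skel & pred).sum() / max(true_skel.sum(), 1)
    if tprec + tsens == 0:
        return 0.0
    return float(2 * tprec * tsens / (tprec + tsens))

# A perfect prediction scores 1.0.
mask = np.zeros((5, 5), bool); mask[2, :] = True   # a thin horizontal vessel
skel = mask.copy()
assert cl_dice(mask, mask, skel, skel) == 1.0
```

Because both terms are computed on centerlines, a prediction that breaks a vessel into disconnected fragments is penalized even when its volumetric Dice stays high, which is exactly the failure mode the DPO preference pairs target.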

5) Practical next steps

  • If you deploy best-of-N: prototype RCS-style embedding consensus selection and measure gains vs self-consistency at higher N; track failure cases where “semantic center” is wrong.
  • For agent safety evaluation: treat the “verification floor” as a first-class metric; report confidence intervals and whether deltas exceed the (L·ε/m)^{1/3} resolution implied by your error rate and sample size.
  • For multi-agent systems: add defenses against topology leakage (e.g., prevent intermediate-trace elicitation; constrain output formats) and red-team with CIA-style induction prompts.
  • For diffusion model supply-chain security: consider SET-like active probes as part of model acceptance testing when you have white-box access and a small clean reference set.
  • For long-horizon web agents: add explicit reject/None training and hard-negative mining; evaluate not just success but wrong-action rate on dense pages.
  • For federated RLVR: test PubSwap-style public coordination if you have small public prompt pools; sweep swap frequency to quantify off-policy drift vs communication savings.
  • For privacy-preserving compute: if using CKKS/FHE in production, budget for checksum-style ABFT (~13–16% overhead reported) rather than assuming ciphertext computation is fault-transparent.

Generated from per-paper analyses; no external browsing.