Daily AI Paper Report (2026-04-19)
Published:
Chinese version: [中文]
Run stats
- Candidates: 3640
- Selected: 30
- Deepread completed: 30
- Window (UTC): 2026-04-17T00:00:00Z → 2026-04-18T00:00:00Z (weekend_backlog_sun, expanded=0)
Selected papers
| arXiv ID | Title / Links | Categories | Score | Why | Tags |
|---|---|---|---|---|---|
2604.11477 | OOM-RL: Out-of-Money Reinforcement Learning Market-Driven Alignment for LLM-Based Multi-Agent Systems | cs.AI, cs.SE, q-fin.TR | 90 | Alignment via real-market feedback; tackles evaluator hacking/sycophancy in multi-agent systems | agents, alignment, multi-agent, RL, evaluation, reward-hacking, software-engineering |
2604.12311 | Is Vibe Coding the Future? An Empirical Assessment of LLM Generated Codes for Construction Safety | cs.SE, cs.AI, cs.HC | 86 | Large empirical study of LLM-generated code silent failures in safety-critical construction workflows | LLM reliability, code generation, safety-critical, evaluation, silent failures, software engineering |
2604.12995 | PolicyLLM: Towards Excellent Comprehension of Public Policy for Large Language Models | cs.CL, cs.CY | 86 | Large-scale policy reasoning benchmark (21K, US-China); strong eval value for real deployments. | benchmark, evaluation, policy, reasoning, LLM |
2604.07039 | AEROS: A Single-Agent Operating Architecture with Embodied Capability Modules | cs.RO, cs.AI | 86 | Single-agent robot OS w/ capability modules, control authority + explicit safety constraints focus | agents, robotics, agent-architecture, runtime, safety-constraints, tool-use |
2603.29953 | Structured Intent as a Protocol-Like Communication Layer: Cross-Model Robustness, Framework Comparison, and the Weak-Model Compensation Effect | cs.AI, cs.HC | 86 | Structured intent protocol improves cross-model/language goal preservation; large controlled study + user study | prompting, structured-intent, robustness, cross-model, evaluation, human-study |
2604.11655 | RPA-Check: A Multi-Stage Automated Framework for Evaluating Dynamic LLM-based Role-Playing Agents | cs.CL, cs.AI, cs.MA | 86 | Automated, multi-stage eval for LLM role-playing agents: role adherence, consistency, long-horizon stability. | LLM-evaluation, agents, role-playing, checklists, long-horizon, reliability |
2603.17331 | Federated Computing as Code (FCaC): Sovereignty-aware Systems by Design | cs.CR | 86 | Declarative sovereignty constraints for federated systems; relevant to secure AI deployment governance. | federated-computing, security, policy, sovereignty, systems |
2604.05738 | MedLayBench-V: A Large-Scale Benchmark for Expert-Lay Semantic Alignment in Medical Vision Language Models | cs.CL | 86 | Large multimodal benchmark for expert→lay medical VLM alignment; targets hallucination risk | benchmark, medical-vlm, semantic-alignment, hallucinations, evaluation, dataset |
2604.07102 | The Impact of Steering Large Language Models with Persona Vectors in Educational Applications | cs.CL, cs.AI | 86 | Activation steering shifts grading/quality; shows risks of persona control in education | llm-steering, persona-vectors, reliability, calibration, education, evaluation |
2604.00560 | Quantum-Safe Code Auditing: LLM-Assisted Static Analysis and Quantum-Aware Risk Scoring for Post-Quantum Cryptography Migration | cs.CR, cs.SE, quant-ph | 86 | LLM-assisted static analysis + quantum-aware risk scoring for PQC migration; practical security tooling. | cybersecurity, static-analysis, LLM-tools, post-quantum-crypto, risk-scoring, software-security |
2604.12168 | Fully Homomorphic Encryption on Llama 3 model for privacy preserving LLM inference | cs.CR, cs.AI | 84 | FHE inference on Llama 3 targets privacy-preserving LLM deployment; concrete security angle | privacy, FHE, LLM-inference, cryptography, deployment, security |
2604.12258 | Coding-Free and Privacy-Preserving MCP Framework for Clinical Agentic Research Intelligence System | cs.CL, cs.AI | 84 | Agentic clinical research via MCP with explicit privacy-preserving tool/data separation; relevant to safe tool use | agents, tool use, privacy, MCP, clinical, system design |
2604.12282 | Towards Robust Real-World Spreadsheet Understanding with Multi-Agent Multi-Format Reasoning | cs.CL | 84 | Multi-agent, multi-format spreadsheet understanding for huge sheets; practical agentic workflow setting. | agents, tool-use, document-understanding, spreadsheets, long-context |
2604.01841 | Retrieval-aligned Tabular Foundation Models Enable Robust Clinical Risk Prediction in Electronic Health Records Under Real-world Constraints | cs.AI | 84 | EHR tabular ICL benchmark + task-aligned retrieval (AWARE) for shift/imbalance; practical robustness focus | tabular-foundation-models, retrieval-augmented, EHR, distribution-shift, robustness, benchmark |
2604.11188 | MathAgent: Adversarial Evolution of Constraint Graphs for Mathematical Reasoning Data Synthesis | cs.CL, cs.AI | 84 | Adversarial constraint-graph evolution to synthesize harder math reasoning data; reduces mode collapse vs prompts. | reasoning, data-synthesis, math, adversarial, constraint-graphs, LLM-training |
2603.17570 | FoMo X: Modular Explainability Signals for Outlier Detection Foundation Models | cs.LG, cs.AI | 84 | Intrinsic explainability/uncertainty signals for outlier-detection foundation models; safety-critical OD use. | foundation-models, outlier-detection, explainability, uncertainty, tabular |
2604.07274 | A Systematic Study of Retrieval Pipeline Design for Retrieval-Augmented Medical Question Answering | cs.CL, cs.AI, cs.LG | 84 | Systematic RAG pipeline ablations for medical QA; practical grounding + design guidance | RAG, medical-QA, retrieval, reranking, evaluation, grounding |
2604.08203 | MedVR: Annotation-Free Medical Visual Reasoning via Agentic Reinforcement Learning | cs.CV, cs.AI | 82 | Agentic RL for medical VLM visual grounding using uncertainty; targets visual hallucination risk | VLM, medical AI, hallucinations, uncertainty, reinforcement learning, grounding |
2604.06079 | Scientific Graphics Program Synthesis via Dual Self-Consistency Reinforcement Learning | cs.CV, cs.AI | 82 | High-quality SciTikZ-230K + closed-loop RL for image→TikZ synthesis; reusable dataset + eval gaps. | multimodal, program-synthesis, dataset, reinforcement-learning, evaluation |
2604.07035 | Gemma 4, Phi-4, and Qwen3: Accuracy-Efficiency Tradeoffs in Dense and MoE Reasoning Language Models | cs.CL | 82 | Careful dense vs MoE reasoning LMs benchmark under realistic prompting/inference constraints | llm, reasoning, moe, efficiency, benchmarking, evaluation |
2603.09940 | SignalMC-MED: A Multimodal Benchmark for Evaluating Biosignal Foundation Models on Single-Lead ECG and PPG | cs.LG | 82 | Large multimodal biosignal FM benchmark (ECG+PPG), 20 tasks, systematic eval; reusable for reliability testing | benchmark, foundation-models, biosignals, ECG, PPG, evaluation, healthcare |
2603.30025 | ContextClaim: A Context-Driven Paradigm for Verifiable Claim Detection | cs.CL | 82 | Context-aware verifiable-claim detection; better front-end for fact-checking pipelines than claim-only baselines | misinformation, fact-checking, verifiability, retrieval, NLP, evaluation |
2604.01870 | Towards Intrinsically Calibrated Uncertainty Quantification in Industrial Data-Driven Models via Diffusion Sampler | cs.LG, eess.SY | 82 | Diffusion posterior sampling for intrinsically calibrated UQ; strong reliability angle for industrial ML. | uncertainty-quantification, diffusion, calibration, posterior-sampling, reliability |
2603.29142 | REFINE: Real-world Exploration of Interactive Feedback and Student Behaviour | cs.AI, cs.HC | 82 | Locally deployable multi-agent LLM feedback system with judge-guided regeneration + tool use | llm-agents, multi-agent, llm-judge, tool-calling, education, reliability |
2604.08333 | Lost in the Hype: Revealing and Dissecting the Performance Degradation of Medical Multimodal Large Language Models in Image Classification | cs.CV, cs.AI, cs.LG | 82 | Finds medical MLLMs underperform; probes where degradation comes from across 14 models | multimodal-LLM, medical-imaging, evaluation, representation-probing, reliability |
2604.00931 | PsychAgent: An Experience-Driven Lifelong Learning Agent for Self-Evolving Psychological Counselor | cs.AI | 82 | Lifelong-learning counselor agent with memory/planning + skill evolution; high-stakes deployment implications. | agents, lifelong-learning, memory, planning, healthcare, safety-reliability |
2603.15216 | vCause: Efficient and Verifiable Causality Analysis for Cloud-based Endpoint Auditing | cs.CR | 80 | Verifiable causality analysis for cloud endpoint auditing; integrity against untrusted cloud results. | security, auditing, provenance, verifiable-computation, cloud |
2604.05930 | "I See What You Did There": Can Large Vision-Language Models Understand Multimodal Puns? | cs.CL, cs.AI | 80 | MultiPun benchmark + adversarial distractors exposes VLM weakness on multimodal wordplay; useful eval artifact | VLM, benchmark, multimodal, evaluation, robustness, dataset |
2603.15091 | Trustworthy Koopman Operator Learning: Invariance Diagnostics and Error Bounds | math.NA, cs.LG, math.DS, math.OC | 80 | Diagnostics + error bounds to certify trustworthy Koopman learning; actionable guarantees for dynamics ML. | robustness, verification, error-bounds, koopman, dynamical-systems |
2604.12719 | Monte Carlo Stochastic Depth for Uncertainty Estimation in Deep Learning | cs.LG, stat.ML | 79 | MC Stochastic Depth for scalable uncertainty; theory + detection benchmark for safety-critical UQ | uncertainty, bayesian, stochastic-depth, reliability, safety, object-detection |
AI Paper Insight Brief
2026-04-19
1) Executive takeaways (read this first)
- “Better representations” is often about alignment to the downstream interface, not scale: in biosignals, longer context beats bigger models (SignalMC-MED); in tabular EHR, task-aligned retrieval embeddings beat naive similarity (AWARE); in Koopman learning, invariance diagnostics (PAD) determine whether spectra/forecasts are trustworthy.
- Verification is moving from “trust the cloud/model” to “prove it”: vCause provides cryptographic proofs for causality queries over provenance graphs; FCaC pushes cross-org admission into proof-carrying capability tokens; FHE-on-Llama-3 shows partial encrypted inference feasibility.
- Agentic systems are converging on “tool use + gating + audits”: MedVR uses RL with tool rewards and consensus credit assignment for visual grounding; SpreadsheetAgent uses extraction→verification→solving with multi-format tools; OOM-RL adds hard external constraints (capital loss) plus strict test locks to resist reward hacking/test evasion.
- Prompt/steering interventions are powerful but have sharp edges: structured intent formats dramatically reduce cross-language variance and help weaker models (PPS/5W3H/CO-STAR/RISEN), but can hit an “encoding overhead” boundary; persona-vector steering can degrade educational answer quality and induce systematic grading bias (especially in MoE).
- Safety-critical “vibe coding” is currently unsafe by default: construction-safety code generation shows high silent failure rates among scripts that execute successfully, implying the need for deterministic wrappers, verification, and domain-grounded constraints.
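The structured-intent idea can be made concrete as a small slot-filling template. The sketch below uses the 5W3H field set named in the takeaway; the layout, function name, and "unspecified" convention are illustrative assumptions, not the schema from the paper.

```python
# Illustrative 5W3H slot set; the study compares several such formats
# (PPS, 5W3H, CO-STAR, RISEN) -- this exact layout is an assumption.
FIELDS_5W3H = ["who", "what", "when", "where", "why",
               "how", "how_much", "how_many"]

def build_structured_intent(**slots):
    """Render a request as an explicit slot list instead of free-form prose.

    Unfilled slots are emitted as 'unspecified' so the model sees which
    constraints are open rather than silently inventing them -- a plausible
    mechanism behind the reported drop in cross-language variance.
    """
    lines = ["STRUCTURED INTENT"]
    for field in FIELDS_5W3H:
        label = field.upper().replace("_", " ")
        lines.append(f"{label}: {slots.get(field, 'unspecified')}")
    return "\n".join(lines)
```

Calling `build_structured_intent(what="summarize Q3 incident reports", who="SRE on-call")` yields a deterministic block whose shape survives model swaps and translation better than prose, at the cost of the "encoding overhead" the takeaway warns about.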
2) Key themes (clusters)
Theme: Retrieval/representation alignment under real-world constraints
- Why it matters: Many failures come from mismatched representations (neighbors, embeddings, subspaces) rather than insufficient model capacity. Aligning the representation to the task/interface yields robustness gains without full retraining.
- Representative papers:
- Retrieval-aligned Tabular Foundation Models Enable Robust Clinical Risk Prediction in Electronic Health Records Under Real-world Constraints
- SignalMC-MED: A Multimodal Benchmark for Evaluating Biosignal Foundation Models on Single-Lead ECG and PPG
- Trustworthy Koopman Operator Learning: Invariance Diagnostics and Error Bounds
- A Systematic Study of Retrieval Pipeline Design for Retrieval-Augmented Medical Question Answering
- Common approach:
- Learn or choose representations that make “good neighborhoods” (AWARE uses SNNL + balanced sampling + ensemble).
- Evaluate under deployment-like constraints (chronological splits; imbalance; high dimensionality; long signals).
- Add diagnostics/certificates for when the representation is invalid (Koopman invariance via principal angles/PAD).
- Open questions / failure modes:
- How to jointly optimize retrieval and in-context inference end-to-end (AWARE notes retrieval–inference mismatch).
- Whether gains persist under fine-tuning/nonlinear heads (SignalMC-MED is frozen linear probing).
- Retrieval noise and summarization “signal clarity” limits (ContextClaim; medical RAG pipeline tradeoffs).
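The "good neighborhoods" objective can be sketched with the soft nearest-neighbor loss the AWARE bullet mentions. This is a minimal NumPy batch version with squared-Euclidean distances and a fixed temperature; it assumes nothing about AWARE's actual sampling, ensembling, or training loop.

```python
import numpy as np

def soft_nearest_neighbor_loss(x, y, temperature=1.0):
    """Soft nearest-neighbor loss over a batch of embeddings.

    Low values mean each point's same-label neighbors dominate its
    neighborhood; minimizing it pulls task-relevant neighbors together.
    x: (n, d) embeddings, y: (n,) integer labels.
    """
    d2 = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    sim = np.exp(-d2 / temperature)
    np.fill_diagonal(sim, 0.0)                      # exclude self-similarity
    same = (y[:, None] == y[None, :]).astype(float)
    np.fill_diagonal(same, 0.0)
    num = (sim * same).sum(axis=1)                  # same-label neighbor mass
    den = sim.sum(axis=1)                           # all-neighbor mass
    # Clip to avoid log(0) when a sample has no same-label neighbor.
    ratio = np.clip(num / np.clip(den, 1e-12, None), 1e-12, None)
    return -np.mean(np.log(ratio))
```

On embeddings where classes form tight clusters the loss is near zero; on the same embeddings with shuffled labels it blows up, which is exactly the neighborhood-purity signal a retrieval-augmented pipeline depends on.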
Theme: Proof-carrying integrity & sovereignty for distributed systems
- Why it matters: As more analysis and decision-making moves to untrusted clouds and cross-org federations, “auditability” increasingly requires cryptographic verifiability rather than logs or policy engines alone.
- Representative papers:
- vCause: Efficient and Verifiable Causality Analysis for Cloud-based Endpoint Auditing
- Federated Computing as Code (FCaC): Sovereignty-aware Systems by Design
- Fully Homomorphic Encryption on Llama 3 model for privacy preserving LLM inference
- Quantum-Safe Code Auditing: LLM-Assisted Static Analysis and Quantum-Aware Risk Scoring for Post-Quantum Cryptography Migration
- Common approach:
- Authenticated data structures + succinct proofs (vCause: hierarchical Merkle + recursive path digests).
- Stateless boundary verification via signed capability artifacts (FCaC: KYO→ECT→PoP/DPoP).
- Encrypt sensitive computation paths (FHE attention) or prioritize crypto migration via automated discovery + scoring.
- Open questions / failure modes:
- Freshness/revocation and lifecycle management (FCaC lacks immediate revocation in PoC).
- Proof/update trade-offs (vCause segmentation depth L; commitment interval freshness).
- Practicality limits: quantization + compile/PBS overheads (FHE), and precision sensitivity to test-path exclusions (quantum-safe auditing).
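The authenticated-data-structure core of these systems reduces to Merkle membership proofs. The sketch below is a flat SHA-256 tree with odd-level duplication; it omits versioning, recursive path digests, and everything else that makes vCause's construction efficient, so treat it as the primitive, not the system.

```python
import hashlib

def _h(data):
    return hashlib.sha256(data).digest()

def _next_level(level):
    """Hash adjacent pairs; duplicate the last node on odd-sized levels."""
    if len(level) % 2:
        level = level + [level[-1]]
    return [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]

def merkle_root(leaves):
    level = [_h(leaf) for leaf in leaves]
    while len(level) > 1:
        level = _next_level(level)
    return level[0]

def merkle_proof(leaves, index):
    """Sibling path proving leaves[index] is committed under the root."""
    level = [_h(leaf) for leaf in leaves]
    path = []
    while len(level) > 1:
        if len(level) % 2:
            level = level + [level[-1]]
        sibling = index ^ 1
        path.append((level[sibling], index % 2))  # (digest, node_is_right_child)
        level = _next_level(level)
        index //= 2
    return path

def verify(leaf, path, root):
    node = _h(leaf)
    for sibling, is_right in path:
        node = _h(sibling + node) if is_right else _h(node + sibling)
    return node == root
```

The proof is O(log n) digests per query, which is why the approach scales to provenance graphs with millions of log entries: the verifier never touches the full log, only the path.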
Theme: Agentic tool-use with verification/gating loops
- Why it matters: Tool-augmented agents can reduce hallucination and improve robustness, but only if exploration and tool outputs are credit-assigned and verified; otherwise they amplify error propagation.
- Representative papers:
- MedVR: Annotation-Free Medical Visual Reasoning via Agentic Reinforcement Learning
- Towards Robust Real-World Spreadsheet Understanding with Multi-Agent Multi-Format Reasoning
- Scientific Graphics Program Synthesis via Dual Self-Consistency Reinforcement Learning
- OOM-RL: Out-of-Money Reinforcement Learning Market-Driven Alignment for LLM-Based Multi-Agent Systems
- Common approach:
- Explicit tool calls (zoom/crop; code execution; render/compile) integrated into the reasoning loop.
- Gated rewards/verification (compile gating; visual-fidelity gates; RO-lock test immutability).
- Self-consistency/consensus signals to shape credit assignment (MedVR CCA; SciTikZ DSC round-trip).
- Open questions / failure modes:
- Compute intensity and rollout cost (MedVR RL; SciTikZ DSC overhead; SpreadsheetAgent ~97s extraction avg).
- Limited action sets/tools (MedVR only Zoom-in; spreadsheet tools depend on strong VLMs).
- Human-in-the-loop dependence and reproducibility (OOM-RL relies on manual “Epistemic Autopsy”; partial code/log release).
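The shared pattern across these papers — reward flows only after a hard verification gate — can be written generically. `compile_fn`, `score_fn`, and the 1.0 base bonus below are placeholder assumptions, not any paper's exact reward design.

```python
def gated_reward(candidate, compile_fn, score_fn, gate_value=0.0):
    """Hard-gate a shaped reward behind a verification step.

    compile_fn: candidate -> (ok, artifact), e.g. a TikZ compile or a
    sandboxed code run. score_fn: artifact -> score in [0, 1], e.g. visual
    fidelity or self-consistency. A failed gate returns gate_value no
    matter how fluent the candidate text is, so plausible-but-unverifiable
    rollouts earn nothing and cannot be reward-hacked via style alone.
    """
    ok, artifact = compile_fn(candidate)
    if not ok:
        return gate_value
    return 1.0 + score_fn(artifact)  # base reward for passing + shaped bonus
```

The design choice worth copying is the discontinuity: shaping only applies inside the verified region, so gradient pressure toward "sounds right" without "checks out" is zero.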
Theme: Evaluation & diagnostics for “behavioral reliability” (not just accuracy)
- Why it matters: Many systems fail in ways not captured by standard metrics: procedural drift, prompt sensitivity, false-positive bias, and silent failures in executable outputs.
- Representative papers:
- RPA-Check: A Multi-Stage Automated Framework for Evaluating Dynamic LLM-based Role-Playing Agents
- Structured Intent as a Protocol-Like Communication Layer: Cross-Model Robustness, Framework Comparison, and the Weak-Model Compensation Effect
- Lost in the Hype: Revealing and Dissecting the Performance Degradation of Medical Multimodal Large Language Models in Image Classification
- Is Vibe Coding the Future? An Empirical Assessment of LLM Generated Codes for Construction Safety
- Common approach:
- Checklist/Boolean decomposition of requirements (RPA-Check) and cross-model/language factorial studies (structured intent).
- Layer-wise probing to localize where information degrades (medical MLLM classification FHS).
- Sandbox execution + rubric-based judging to detect “runs but wrong” (construction safety SFR).
- Open questions / failure modes:
- LLM-as-judge ceiling/bias and limited scenario coverage (structured intent ceiling; RPA-Check only five scenarios).
- Execution success masking domain-logic failure (high silent failure rate in safety code).
- Generative-decoding mismatch for discriminative tasks (medical MLLMs underperform supervised classifiers).
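"Runs but wrong" separates cleanly from "crashes" with a metric along the lines of the construction-safety study's silent failure rate. The definition below (silent failures over the executable subset) is our reading of the idea, not the paper's formula verbatim.

```python
def silent_failure_rate(results):
    """Fraction of scripts that execute without error yet fail domain checks.

    results: iterable of (executed_ok, domain_correct) booleans per script.
    The denominator is the executable subset, so the metric isolates
    'runs but wrong' from crashes that execution alone would catch.
    """
    executed = [r for r in results if r[0]]
    if not executed:
        return 0.0
    return sum(1 for _, correct in executed if not correct) / len(executed)
```

Reporting this alongside raw execution success is the practical upshot: a pipeline with 95% execution success and 40% silent failure is far more dangerous than its headline number suggests.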
Theme: Lightweight uncertainty/explainability signals for deployment
- Why it matters: Always-on systems need cheap, intrinsic uncertainty and interpretability signals; post-hoc sampling can be too expensive, but shipping without uncertainty estimates is unsafe.
- Representative papers:
- FoMo X: Modular Explainability Signals for Outlier Detection Foundation Models
- Towards Intrinsically Calibrated Uncertainty Quantification in Industrial Data-Driven Models via Diffusion Sampler
- Monte Carlo Stochastic Depth for Uncertainty Estimation in Deep Learning
- Common approach:
- Distill expensive uncertainty into single-pass heads (FoMo-X distills MC-dropout variability).
- Use better posterior samplers for calibration without post-hoc calibration (DiffUQ via Schrödinger bridge/SOC).
- Repurpose architectural stochasticity as variational inference (MCSD) and benchmark calibration vs accuracy.
- Open questions / failure modes:
- Simulator–real mismatch for explainability heads (FoMo-X global heads fail to transfer).
- Training cost and discretization/network-size interactions (DiffUQ).
- Inference-time multi-pass cost (MCSD requires T passes; latency scales with T).
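The MCSD recipe — keep stochastic depth active at inference and treat the spread over T passes as uncertainty — fits in a few lines. The residual-block abstraction, survival probability, and defaults here are illustrative, not the paper's architecture.

```python
import numpy as np

def mc_stochastic_depth_predict(x, layers, survival_p=0.8, T=30, rng=None):
    """Monte Carlo over stochastic depth: sample T random sub-networks.

    Each pass skips residual blocks independently with prob 1 - survival_p;
    the variance of the T outputs is an epistemic-uncertainty signal,
    analogous to MC dropout but reusing stochasticity the architecture
    already has. layers: callables acting as residual blocks, h -> h + f(h).
    """
    rng = rng or np.random.default_rng(0)
    outs = []
    for _ in range(T):
        h = x
        for f in layers:
            if rng.random() < survival_p:  # block survives this pass
                h = h + f(h)
        outs.append(h)
    outs = np.stack(outs)
    return outs.mean(axis=0), outs.var(axis=0)
```

The cost caveat from the synthesis shows up directly: latency scales linearly in T, so the knob is a calibration-vs-throughput trade, not a free improvement.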
3) Technical synthesis
- Multiple papers converge on “gating” as the core anti-hallucination primitive: compile gating (SciTikZ), correctness-conditioned tool reward (MedVR), policy engine blocking (AEROS), RO-lock preventing test mutation (OOM-RL), and cryptographic verification (vCause/FCaC).
- Representation diagnostics are becoming first-class: Koopman PAD principal angles quantify invariance; medical MLLM Feature Health Scores localize where discriminative signal is lost; AWARE explicitly shapes retrieval neighborhoods with SNNL.
- Longer context beats bigger models in frozen settings: SignalMC-MED finds 10-minute signals help more consistently than scaling parameters; similarly, structured intent helps weaker models disproportionately (weak-model compensation).
- Retrieval is shifting earlier in pipelines: ContextClaim moves retrieval into claim detection; medical QA study isolates retrieval components; SpreadsheetAgent retrieves/inspects localized regions iteratively rather than serializing whole sheets.
- Tool-use RL is adopting self-supervised credit assignment: MedVR’s consensus mask (CCA) and entropy-triggered branching (EVR) mirror broader trends in using ensemble/consensus signals to shape exploration without labels.
- MoE vs dense is not a free lunch: deployment-aware benchmarking shows mid-scale MoE can be Pareto-favorable (Gemma-4-E4B), while persona steering shows MoE graders can have larger calibration shifts than dense models.
- “Runs” ≠ “correct” is now quantified: construction safety introduces Silent Failure Rate among executable scripts; medical MLLMs show generative pipelines can underperform discriminative baselines despite scale.
- Privacy/security tooling is becoming LLM-integrated: quantum-safe auditing uses LLM enrichment for context labeling; CARIS uses MCP to keep raw clinical data behind tools; FHE integrates into attention for encrypted inference.
4) Top 5 papers (with “why now”)
1) MedVR: Annotation-Free Medical Visual Reasoning via Agentic Reinforcement Learning
- Trains medical VLMs to interleave text with explicit visual actions (Zoom-in) using RL (GRPO) without intermediate annotations.
- Introduces EVR (entropy-guided branching) and CCA (consensus-based credit assignment) to make tool-use learnable and grounded.
- Shows large gains on medical VQA and grounding (e.g., big mIoU improvements; annotation-free CCA nearly matches supervised upper bound on a grounding task).
- Skepticism: compute-heavy RL with many rollouts; action space currently limited to a single Zoom-in tool and intermediate-step clinical plausibility isn’t expert-validated.
2) vCause: Efficient and Verifiable Causality Analysis for Cloud-based Endpoint Auditing
- Provides cryptographic proofs for causality queries on versioned provenance graphs (membership + backward/forward causal components).
- Scales to large logs (reported processing 25M logs; proof generation for 100k-node causality in 49 ms) with low endpoint overhead (<1%).
- Formal security reductions under standard assumptions; practical for incident response where cloud is untrusted.
- Skepticism: confidentiality is out of scope; segmentation depth and commitment interval introduce operational trade-offs.
3) SignalMC-MED: A Multimodal Benchmark for Evaluating Biosignal Foundation Models on Single-Lead ECG and PPG
- Establishes a realistic long-duration (10-min) synchronized ECG+PPG benchmark with 20 clinical tasks and chronological splits.
- Finds domain biosignal FMs > general time-series FMs, ECG+PPG fusion helps, and longer signals help more than scaling model size under frozen linear probing.
- Shows hand-crafted physiological features remain strong and complementary to FM embeddings.
- Skepticism: frozen linear probes + mean pooling may understate benefits of fine-tuning and richer temporal aggregation; single health system limits generalization.
4) Towards Robust Real-World Spreadsheet Understanding with Multi-Agent Multi-Format Reasoning
- Practical multi-agent loop: incremental extraction → verification → solving, preserving layout via image + LaTeX + code execution tools.
- Demonstrates consistent gains on SpreadsheetBench/RealHiTBench; ablations show verification and multi-format tools matter.
- Provides an operational template for “too-big-for-context” documents.
- Skepticism: depends on strong multimodal/tool-capable models; nontrivial runtime/memory (reported ~97s extraction avg; 21GB peak).
5) Structured Intent as a Protocol-Like Communication Layer: Cross-Model Robustness, Framework Comparison, and the Weak-Model Compensation Effect
- Large factorial study (3,240 outputs) shows structured intent formats raise goal alignment and cut cross-language variance ~24×.
- Identifies weak-model compensation (structured prompts help weaker baselines more) and a capacity/overhead boundary where too much structure can hurt.
- Includes a user study (N=50) where AI-expanded 5W3H reduces interaction rounds and increases satisfaction.
- Skepticism: LLM-as-judge ceiling effects and single-judge bias; user study order not counterbalanced.
5) Practical next steps
- If you deploy RAG/RA-TICL in healthcare: measure retrieval neighborhood label purity and try task-aligned embedding shaping (e.g., SNNL + balanced sampling) before scaling the backbone (AWARE).
- For multimodal medical assistants: add explicit visual tool calls and track grounding metrics (IoU/footprints); consider consensus-based pseudo-supervision to avoid annotation bottlenecks (MedVR).
- For safety-critical code generation: implement sandbox execution + domain-logic validators and report a silent failure rate (not just “runs”); require explicit actionable directives (construction safety study).
- For cross-org agent deployments: adopt proof-carrying admission (capability tokens + PoP) and plan for revocation/freshness as first-class requirements (FCaC).
- For cloud forensics/auditing: evaluate whether your provenance queries need verifiable proofs; prototype Merkle-accumulator + recursive digest approaches and benchmark proof sizes/latency (vCause).
- For model selection under constraints: run prompt-conditioned Pareto sweeps (accuracy vs latency/VRAM) rather than relying on parameter counts; watch for prompt sensitivity pathologies (Gemma/Phi/Qwen tradeoff study).
- For uncertainty in production perception/monitoring: consider single-pass distilled uncertainty heads (FoMo-X) or MCSD where residual architectures already use stochastic depth; benchmark calibration (ECE/AUARC) vs accuracy trade-offs.
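The first recommendation above — measure retrieval neighborhood label purity before scaling the backbone — is a few lines of NumPy. The brute-force O(n²) distance matrix below is a sketch for modest cohort sizes; swap in an ANN index for production-scale corpora.

```python
import numpy as np

def neighborhood_label_purity(emb, labels, k=10):
    """Mean fraction of each query's k nearest neighbors sharing its label.

    A cheap pre-deployment check for retrieval-augmented in-context
    learning: if purity is near the class prior, the embedding space is
    not task-aligned and retrieved demonstrations will mostly add noise.
    emb: (n, d) embeddings, labels: (n,) integer labels.
    """
    d2 = np.sum((emb[:, None, :] - emb[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(d2, np.inf)               # exclude self-matches
    nn = np.argsort(d2, axis=1)[:, :k]         # k nearest neighbors per row
    return float(np.mean(labels[nn] == labels[:, None]))
```

Run it on a chronologically held-out slice: purity near 1.0 supports similarity retrieval as-is, while purity near the majority-class rate is the signal to try task-aligned embedding shaping first.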
Generated from per-paper analyses; no external browsing.
