Daily AI Paper Report (2026-04-17)

Chinese version: [中文]

Run stats

  • Candidates: 3469
  • Selected: 30
  • Deepread completed: 30
  • Window (UTC): 2026-04-17T00:00:00Z → 2026-04-18T00:00:00Z (weekend_backlog_sun, expanded=0)
Selected papers

  • 2604.10866 OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language World Models [PDF] (cs.CL, score 94)
    Why: Large-scale agent benchmark (100 scenarios) via language world models; strong eval infrastructure value
    Tags: agents, benchmark, evaluation, language-world-models, tool-use, simulation
  • 2604.11546 RLSpoofer: A Lightweight Evaluator for LLM Watermark Spoofing Resilience [PDF] (cs.CR, score 93)
    Why: Practical black-box RL spoofing eval for LLM watermarks; strong security relevance + theory.
    Tags: watermarking, spoofing, black-box attack, RL, LLM security, evaluation
  • 2604.04527 ENCRUST: Encapsulated Substitution and Agentic Refinement on a Live Scaffold for Safe C-to-Rust Translation [PDF] (cs.SE, cs.AI, cs.PL, score 92)
    Why: Agentic, validated C→safe Rust translation with ABI wrappers; strong real-world safety/security relevance.
    Tags: agentic-coding, program-repair, memory-safety, rust, software-security, verification, compilers
  • 2604.11720 On the Robustness of Watermarking for Autoregressive Image Generation [PDF] (cs.CV, cs.AI, cs.CR, score 91)
    Why: Shows removal/forgery attacks break AR image watermarking; important for provenance & misuse mitigation
    Tags: watermarking, robustness, provenance, image-generation, security, adversarial-attacks
  • 2604.11563 Synthius-Mem: Brain-Inspired Hallucination-Resistant Persona Memory Achieving 94.4% Memory Accuracy and 99.6% Adversarial Robustness on LoCoMo [PDF] (cs.CL, cs.AI, cs.LG, score 90)
    Why: Structured long-term persona memory with adversarial robustness claims on LoCoMo.
    Tags: agent memory, hallucination, robustness, LoCoMo, persona, RAG
  • 2604.11141 Reducing Hallucination in Enterprise AI Workflows via Hybrid Utility Minimum Bayes Risk (HUMBR) [PDF] (cs.LG, cs.CR, score 90)
    Why: MBR-based hallucination mitigation with theory+benchmarks; strong enterprise reliability angle
    Tags: hallucination, reliability, minimum-bayes-risk, uncertainty, enterprise, evaluation
  • 2604.10968 YIELD: A Large-Scale Dataset and Evaluation Framework for Information Elicitation Agents [PDF] (cs.CL, score 90)
    Why: Large dataset+metrics for info-elicitation agents; high relevance to agent behavior, misuse, and evals
    Tags: agents, evaluation, dataset, dialogue, information-elicitation, POMDP, safety
  • 2604.11610 Self-Evolving LLM Memory Extraction Across Heterogeneous Tasks [PDF] (cs.CL, score 90)
    Why: Benchmark + method for heterogeneous LLM memory extraction; directly relevant to persistent agents.
    Tags: llm-memory, agents, benchmark, personalization, evaluation, prompt-optimization
  • 2604.11087 CausalGaze: Unveiling Hallucinations via Counterfactual Graph Intervention in Large Language Models [PDF] (cs.LG, score 90)
    Why: Causal interventions on internal graphs for hallucination detection; interpretability + reliability angle.
    Tags: hallucination, causal, interpretability, LLM-reliability, counterfactuals
  • 2604.08501 sciwrite-lint: Verification Infrastructure for the Age of Science Vibe-Writing [PDF] (cs.DL, cs.CL, cs.SE, score 90)
    Why: Local linter to verify scientific manuscripts; tackles AI vibe-writing, citations, integrity at scale
    Tags: scientific-integrity, verification, tooling, citation-checking, open-source, LLM-misuse
  • 2604.04442 Explainable Autonomous Cyber Defense using Adversarial Multi-Agent Reinforcement Learning [PDF] (cs.CR, cs.LG, cs.MA, score 89)
    Why: Structurally constrained multi-agent cyber defense aimed at adversarial ambiguity; high security impact.
    Tags: cybersecurity, autonomous-agents, multi-agent-RL, robustness, causal-models, adversarial, critical-infrastructure
  • 2604.11344 Geometry-Aware Localized Watermarking for Copyright Protection in Embedding-as-a-Service [PDF] (cs.CR, cs.CL, score 88)
    Why: Watermarking for embedding-as-a-service to deter model stealing; tackles robustness-utility-verifiability
    Tags: model-stealing, watermarking, embeddings, copyright, ml-security, verification
  • 2604.11554 Relax: An Asynchronous Reinforcement Learning Engine for Omni-Modal Post-Training at Scale [PDF] (cs.CL, score 88)
    Why: Open-source async RL post-training engine for omni-modal/agentic workflows; scalable infra impact
    Tags: RLHF, post-training, systems, agents, multimodal, scaling, open-source
  • 2604.11502 METER: Evaluating Multi-Level Contextual Causal Reasoning in Large Language Models [PDF] (cs.CL, cs.AI, score 88)
    Why: Unified causal-reasoning benchmark + mechanistic diagnosis of failure modes across the causal ladder.
    Tags: evaluation, causal-reasoning, benchmarks, mechanistic-analysis, robustness
  • 2604.10893 Beyond A Fixed Seal: Adaptive Stealing Watermark in Large Language Models [PDF] (cs.CR, cs.AI, score 88)
    Why: Adaptive watermark-stealing attack; important for LLM provenance, watermark robustness, and security evals
    Tags: watermarking, model-security, attack, provenance, adversarial, LLM-services
  • 2604.07973 How Far Are Large Multimodal Models from Human-Level Spatial Action? A Benchmark for Goal-Oriented Embodied Navigation in Urban Airspace [PDF] (cs.AI, score 88)
    Why: Strong embodied navigation benchmark; shows LMMs far from human-level spatial action
    Tags: embodied-agents, multimodal, benchmark, navigation, evaluation
  • 2604.11416 Exact Certification of Neural Networks and Partition Aggregation Ensembles against Label Poisoning [PDF] (cs.LG, score 86)
    Why: Tighter formal certificates for label-poisoning robustness using white-box ensemble info.
    Tags: data poisoning, label flipping, certification, robust ML, ensembles
  • 2604.11133 How Robust Are Large Language Models for Clinical Numeracy? An Empirical Study on Numerical Reasoning Abilities in Clinical Contexts [PDF] (cs.CL, score 86)
    Why: Clinical numeracy robustness benchmark (1,624 items) targets safety-critical failure modes
    Tags: benchmark, clinical, numerical-reasoning, robustness, evaluation, safety
  • 2604.11261 Inspectable AI for Science: A Research Object Approach to Generative AI Governance [PDF] (cs.AI, score 86)
    Why: Governance framework to log/inspect GenAI use in science; strong provenance/accountability angle.
    Tags: governance, provenance, auditability, FAIR, research-workflows, genai
  • 2603.23860 Why the Maximum Second Derivative of Activations Matters for Adversarial Robustness [PDF] (cs.LG, cs.AI, score 86)
    Why: Links activation curvature to adversarial robustness; actionable design rule (optimal max|σ''| range).
    Tags: adversarial-robustness, activation-functions, loss-curvature, generalization, theory+empirics
  • 2604.04347 RoboPhD: Evolving Diverse Complex Agents Under Tight Evaluation Budgets [PDF] (cs.AI, score 86)
    Why: Systematic comparison of agent-evolution optimizers under tight eval budgets; useful for agentic R&D.
    Tags: agents, evaluation, optimization, LLM-guided-search, AutoML, benchmarks, sample-efficiency
  • 2604.11465 Three Roles, One Model: Role Orchestration at Inference Time to Close the Performance Gap Between Small and Large Agents [PDF] (cs.AI, score 86)
    Why: Inference-time role orchestration boosts small agent performance on tool tasks without training.
    Tags: agents, inference-scaffolding, tool-use, efficiency, small-models, orchestration
  • 2604.11119 DDO-RM for LLM Preference Optimization: A Minimal Held-Out Benchmark against DPO [PDF] (stat.ML, cs.LG, score 86)
    Why: Held-out benchmark comparing DPO vs reward-guided DDO-RM; useful signal on preference optimization.
    Tags: alignment, preference-optimization, DPO, reward-models, evaluation
  • 2604.10917 HTAA: Enhancing LLM Planning via Hybrid Toolset Agentization & Adaptation [PDF] (cs.CL, score 86)
    Why: Hierarchical tool-use planning to scale to hundreds of tools; relevant to agent reliability and control
    Tags: agents, tool-use, planning, hierarchical, training, scalable-orchestration
  • 2603.28128 ORACAL: A Robust and Explainable Multimodal Framework for Smart Contract Vulnerability Detection with Causal Graph Enrichment [PDF] (cs.LG, cs.CR, score 84)
    Why: Multimodal graphs + causal enrichment for smart-contract vuln detection; aims for robustness & explainability
    Tags: smart-contracts, vulnerability-detection, explainability, robustness, graphs, security
  • 2604.05552 Context-Agent: Dynamic Discourse Trees for Non-Linear Dialogue [PDF] (cs.CL, cs.AI, score 84)
    Why: Dialogue-as-tree context management could improve long-horizon agent reliability/coherence.
    Tags: LLM agents, long context, dialogue, discourse trees, memory, reliability
  • 2604.11466 SLALOM: Simulation Lifecycle Analysis via Longitudinal Observation Metrics for Social Simulation [PDF] (cs.MA, cs.AI, score 84)
    Why: Evaluates LLM-agent social sims by process fidelity over time, not just final outcomes.
    Tags: agents, evaluation, social-simulation, validity, process-metrics, monitoring
  • 2603.11872 ELISA: An Interpretable Hybrid Generative AI Agent for Expression-Grounded Discovery in Single-Cell Genomics [PDF] (q-bio.GN, cs.AI, score 84)
    Why: Interpretable hybrid LLM agent over scRNA-seq embeddings + retrieval; concrete agentic workflow for science.
    Tags: agents, interpretability, biomedical-LLM, retrieval, tool-routing, scRNA-seq
  • 2603.22730 How Utilitarian Are OpenAI's Models Really? Replicating and Reinterpreting Pfeffer, Krügel, and Uhl (2025) [PDF] (cs.CL, cs.CY, score 84)
    Why: Shows moral-behavior results can be prompt/refusal confounds; important for safety eval validity.
    Tags: safety-evaluation, refusals, prompting, robustness, ethics, replication, measurement
  • 2604.10981 ATANT v1.1: Positioning Continuity Evaluation Against Memory, Long-Context, and Agentic-Memory Benchmarks [PDF] (cs.AI, cs.IR, score 84)
    Why: Clarifies what 'continuity' measures vs memory/agentic-memory benchmarks; helps eval taxonomy.
    Tags: evaluation, memory, long-context, agents, benchmarks

AI Paper Insight Brief

2026-04-17

0) Executive takeaways (read this first)

  • Evaluation is the bottleneck, not just modeling: multiple papers show that single-prompt or single-simulator results can be misleading (moral judgments shift with framing; agent rankings shift with simulator choice; “memory” benchmarks don’t measure “continuity”).
  • Robustness failures increasingly look like “environment + procedure” issues (implicit tool faults, prompt framing, context management, simulator drift), not only model capability—so robustness work should instrument and stress the pipeline.
  • Watermarking is under sustained pressure from stronger black-box attacks: adaptive watermark stealing and RL-based spoofing achieve high success with limited samples; AR image watermarking shows both removal and forgery vulnerabilities, undermining provenance and dataset filtering.
  • Inference-time scaffolding and budget-aware optimization can materially lift small/cheap agents: role-orchestrated inference roughly doubles AppWorld completion for an 8B model; validation-free Elo evolution beats validation-heavy paradigms under fixed evaluation budgets.
  • Causal/structured constraints are emerging as a unifying safety lever: causal graphs constrain cyber-defense action trajectories; causal interventions refine hallucination detectors; causal training disentangles spurious features in smart-contract detection.
  • Domain-grounded RAG + structured representations are winning in high-stakes settings (single-cell genomics discovery, smart contract auditing, persona memory), but quality/faithfulness and attack surfaces (RAG stochasticity, adversarial perturbations) remain central.

1) Key themes (clusters)

Theme: Benchmark realism & evaluation brittleness

Theme: Agent efficiency under tight budgets (evaluation, context, tools)

Theme: Watermarking under attack (text, embeddings, images)

Theme: Causal/structured methods for robustness, safety, and interpretability

Theme: Grounded, interpretable domain assistants (science + memory + governance)

2) Technical synthesis

  • Robustness is increasingly evaluated as sensitivity to “presentation layers”: prompt framing (moral dilemmas), context format (clinical notes), and simulator choice (LWMs) can dominate measured behavior.
  • Multiple works converge on abstention/gating as a safety primitive: HUMBR abstains on low consensus; cyber-defense uses ETS gating; disagreement scores (Blue/Red) surface uncertainty.
  • “Structured memory” is splitting into two directions: (a) discourse-structure for context selection (Context-Agent) and (b) typed fact stores for hallucination resistance (Synthius-Mem).
  • Several papers show implicit faults (missing/truncated fields) are harder than explicit errors in tool environments (OccuBench), suggesting eval suites should prioritize silent-degradation tests.
  • Watermark security is moving from static to adaptive/learned attacks: per-step seal selection (AS) and RL policy optimization (RLSpoofer) both treat spoofing as distribution shaping under semantic constraints.
  • Causal graphs appear in three roles: constraint (SCM→MDP-DAG), detector refinement (attention-edge interventions), and training disentanglement (causal vs spurious branches).
  • Mechanistic findings suggest some capabilities rely on shallow-layer evidence aggregation (METER masking drops discovery accuracy 0.827→0.579 when blocking shallow evidence→option).
  • Ensemble/consensus methods are being formalized with risk bounds and correlation modeling (HUMBR’s Beta-Binomial + effective sample size), aligning engineering knobs (temperature stratification) with guarantees; see the effective-sample-size sketch after this list.
  • Systems papers emphasize operational robustness (Relax): fault isolation, staleness control, and streaming micro-batching as first-class requirements for agentic/omni-modal RL.
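
The effective-sample-size point above is worth making concrete. A minimal sketch, assuming simple equicorrelation (the standard result Var(mean) = σ²/n · (1 + (n−1)ρ); HUMBR's actual Beta-Binomial treatment is richer than this):

```python
def effective_sample_size(n: int, rho: float) -> float:
    """Number of independent samples that n equicorrelated draws are
    worth: Var(mean) = sigma^2/n * (1 + (n-1)*rho) = sigma^2/n_eff."""
    return n / (1 + (n - 1) * rho)

# High intra-model correlation collapses an ensemble: 20 samples at
# rho = 0.5 behave like ~1.9 independent ones, which is why diversity
# knobs such as temperature stratification matter for the guarantees.
print(effective_sample_size(20, 0.5))   # ~1.90
print(effective_sample_size(20, 0.05))  # ~10.26
```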

3) Top 5 papers (with “why now”)

1) OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language World Models

  • Expands evaluation to the “untestable majority” via LWM-simulated tool environments (100 scenarios; 382 solvable instances).
  • Makes robustness concrete with E0/E1/E2/E3 fault injection and a robustness score; shows implicit faults degrade most (avg E2 53.4% vs E0 67.5%).
  • Reveals simulator dependence is huge (agents average 29.3% CR under GPT-5.2 simulator vs 67.9% under Gemini Flash).
  • Skepticism: results depend on simulator fidelity; tasks solvable under one simulator may break under another.
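
A minimal sketch of the fault-injection robustness metric described above (and echoed in the next-steps section as min(CR_fault)/CR_clean); the benchmark's exact definition may differ, and the E1/E3 numbers below are illustrative placeholders:

```python
def robustness_score(cr_by_level: dict[str, float]) -> float:
    """Worst-case completion rate under fault injection, normalized by
    the clean (E0) rate; 1.0 means no degradation at any fault level."""
    clean = cr_by_level["E0"]
    faulted = [v for level, v in cr_by_level.items() if level != "E0"]
    return min(faulted) / clean if clean > 0 else 0.0

# E0 and E2 are the averages quoted above; E1/E3 are made-up fillers.
print(robustness_score({"E0": 0.675, "E1": 0.62, "E2": 0.534, "E3": 0.60}))
# -> 0.534 / 0.675 ≈ 0.79
```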

2) Reducing Hallucination in Enterprise AI Workflows via HUMBR

  • Reference-free MBR selection with semantic+lexical utility and abstention; includes risk bounds with intra-model correlation and sample-size design inequality.
  • Strong offline gains (TruthfulQA Truth×Info 80.3 vs 69.5 greedy) and production evidence (81% win vs human drafts; reduced key-section misses to 0.8%).
  • Provides actionable engineering knobs (temperature stratification; α≈0.6–0.65).
  • Skepticism: ensembling cost is high; production tradeoff includes more uncited references (12.4%→25.2%).
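
A minimal sketch of the MBR-with-abstention recipe, assuming hypothetical `semantic_sim` and `lexical_sim` scorers; the paper's utility, thresholds, and risk bounds are more involved:

```python
def mbr_select(candidates, semantic_sim, lexical_sim,
               alpha=0.6, abstain_threshold=0.5):
    """Reference-free MBR: return the candidate with the highest mean
    hybrid utility against all other candidates, or None (abstain)
    when even the best candidate has low consensus support."""
    def utility(a, b):
        return alpha * semantic_sim(a, b) + (1 - alpha) * lexical_sim(a, b)

    scores = []
    for i, cand in enumerate(candidates):  # needs >= 2 candidates
        others = [o for j, o in enumerate(candidates) if j != i]
        scores.append(sum(utility(cand, o) for o in others) / len(others))
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best] if scores[best] >= abstain_threshold else None
```

Abstaining on low consensus is what turns centroid selection into a safety gate: the centroid score doubles as an uncertainty signal.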

3) RLSpoofer: A Lightweight Evaluator for LLM Watermark Spoofing Resilience

  • Shows sample-efficient black-box spoofing: 62% SSR on PF watermark with only 100 human–watermarked pairs (vs ~6% baselines).
  • Introduces “local capacity bottleneck” theory to motivate capacity-aware token rewards.
  • Broad evaluation across watermark families and attacker models.
  • Skepticism: optimizes a surrogate objective, not the true detector; effectiveness depends on surrogate quality and tuning.
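
Reading “distribution shaping under semantic constraints” literally suggests a reward of roughly the shape below; every name and weight here is an assumption for illustration, not the paper's objective, and its capacity-aware rewards are per-token rather than sequence-level:

```python
def spoofing_reward(text, reference, surrogate_detector, semantic_sim, lam=1.0):
    """Hedged sketch: push generations toward the region a surrogate
    watermark detector scores as 'watermarked' while penalizing
    semantic drift from the reference content."""
    detect = surrogate_detector(text)             # higher = looks watermarked
    drift = 1.0 - semantic_sim(text, reference)   # 0 = same meaning
    return detect - lam * drift
```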

4) ENCRUST: Safe C-to-Rust Translation with a Live Scaffold

  • Practical two-phase pipeline with compile+test invariant at every step; wrapper-based safe inner functions + type-directed wrapper elimination + agentic refinement.
  • Large real-world evaluation (15 programs; ~198k LoC) with 100% test correctness and substantial unsafe reductions (e.g., ~55% fewer raw pointer dereferences vs C2Rust on Coreutils).
  • Demonstrates how to make LLM code transformation project-scale and verifiable.
  • Skepticism: correctness only as good as test-vector coverage; TDWE is best-effort and Phase 2 doesn’t finish all tasks.
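
The compile+test invariant is the easiest part to sketch: apply one transformation at a time, rebuild and retest, and revert on failure so the scaffold always stays green. A minimal sketch; the `apply`/`revert` hooks are hypothetical stand-ins for ENCRUST's substitution and agentic-refinement machinery:

```python
import subprocess

def scaffold_green(project_dir: str) -> bool:
    """The live-scaffold invariant: project compiles and passes tests."""
    for cmd in (["cargo", "build"], ["cargo", "test"]):
        if subprocess.run(cmd, cwd=project_dir).returncode != 0:
            return False
    return True

def translate(project_dir: str, transformations) -> None:
    """Apply C-to-Rust transformations one by one, keeping only those
    that preserve the compile+test invariant."""
    for t in transformations:
        t.apply(project_dir)        # hypothetical hook
        if not scaffold_green(project_dir):
            t.revert(project_dir)   # hypothetical hook
```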

5) How Robust Are LLMs for Clinical Numeracy?

  • Controlled robustness benchmark (1,624 instances) across operations (retrieval/arithmetic/comparison/aggregation) and three semantically equivalent formats.
  • Finds strong retrieval but persistent failures on comparison/aggregation; note-style variants cause drops; medical fine-tuning can erode numeracy.
  • Directly relevant to safety-critical deployment where silent numeric errors are unacceptable.
  • Skepticism: template-based questions may not reflect real clinician phrasing; scope limited to vital signs.
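
Checking format robustness amounts to asking the same item under each semantically equivalent presentation and scoring agreement with one gold answer. A minimal sketch, assuming a hypothetical `ask` function and per-format renderers:

```python
def format_robustness(item, renderers, ask) -> float:
    """Fraction of semantically equivalent presentations of one item
    answered correctly; the gap between the best and worst format is
    the silent degradation this kind of benchmark targets."""
    results = [ask(render(item)) == item["gold"] for render in renderers]
    return sum(results) / len(results)

# renderers might be [as_plain_question, as_table, as_note_style], each
# turning the same vitals record into a different presentation.
```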

4) Practical next steps

  • For any “values/ethics” or safety evaluation you run, adopt multi-prompt + repeated-timepoint protocols and log serving metadata (model version + system fingerprint where available), mirroring the moral-judgment replication findings.
  • Add implicit fault injection (missing/truncated/stale tool fields) to your agent eval harness; track robustness as min(CR_fault)/CR_clean (OccuBench-style), not just clean success.
  • If you rely on watermarking for provenance, treat it as adversarially learnable: benchmark against adaptive stealing and RL spoofing with low-sample budgets; measure both spoofing and scrubbing plus quality tradeoffs.
  • For small-model agents, prototype inference-time role scaffolds (summarize → act → isolated correct) and instrument failure taxonomy shifts (mechanical vs planning) to see what you’re actually fixing; a sketch follows this list.
  • When building memory, decide explicitly between structured fact stores (high adversarial robustness, lower peripheral recall) vs discourse-tree retrieval; evaluate on adversarial false-premise queries (LoCoMo-style).
  • For high-stakes generation without ground truth, consider MBR-style centroid selection + abstention and measure intra-model correlation (diversity) since it drives effective sample size and guarantees (HUMBR).
  • If doing RAG-enriched security tooling, add robustness tests to structural perturbations and text attacks and include explanation quality metrics (e.g., MIoU-style) to ensure auditability (ORACAL-style).
  • For multimodal/agentic RL post-training, prioritize fault isolation + staleness control in your training stack (Relax-style max_staleness) to avoid long-tail failures and stale-rollout collapse.
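
A minimal version of the role scaffold flagged above (summarize → act → isolated correct), assuming a hypothetical `llm(instruction, context)` call; the paper's orchestration is more elaborate than one cycle:

```python
def role_orchestrated_step(llm, history: str, tools: list[str]) -> str:
    """One inference-time cycle with three roles played by one small
    model: a summarizer compresses context, an actor proposes the next
    tool call, and an isolated corrector reviews only that proposal."""
    summary = llm("Summarize the task state so far.", history)
    action = llm(f"Given this summary, propose the next call from tools "
                 f"{tools}.", summary)
    verdict = llm("Review ONLY this proposed tool call for errors. "
                  "Reply 'ok' or output a corrected call.", action)
    return action if verdict.strip().lower() == "ok" else verdict
```

Keeping the corrector isolated (it sees only the proposed action, not the full history) is what makes the correction step cheap and is the design choice worth A/B testing.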

Generated from per-paper analyses; no external browsing.