Daily AI Paper Report (2026-04-14)

Chinese version: [中文]

Run stats

  • Candidates: 3223
  • Selected: 30
  • Deepread completed: 30
  • Window (UTC): 2026-04-10T00:00:00Z → 2026-04-11T00:00:00Z (weekend_backlog_sun, expanded=0)
Selected papers (arXiv ID, categories; score; selection rationale; tags):

  • 2604.07720 (cs.AI; score 92): Towards Knowledgeable Deep Research: Framework and Benchmark [PDF]
    Why: Framework+benchmark for agentic deep research using structured+unstructured knowledge
    Tags: agents, deep-research, benchmark, tool-use, knowledge, evaluation
  • 2603.15221 (cs.LG, cs.AI; score 90): ADV-0: Closed-Loop Min-Max Adversarial Training for Long-Tail Robustness in Autonomous Driving [PDF]
    Why: Closed-loop min-max adversarial training for long-tail driving safety; objective-aligned attacker distribution
    Tags: adversarial-training, robustness, autonomous-driving, minimax, markov-games, safety
  • 2604.07733 (cs.AI; score 90): CivBench: Progress-Based Evaluation for LLMs' Strategic Decision-Making in Civilization V [PDF]
    Why: Long-horizon multi-agent strategy benchmark with dense progress signals (Civ V)
    Tags: agents, benchmark, evaluation, long-horizon, multi-agent, games
  • 2603.08483 (cs.CV, cs.AI, cs.LG; score 88): X-AVDT: Audio-Visual Cross-Attention for Robust Deepfake Detection [PDF]
    Why: Deepfake detector leveraging generator internal cross-attention via inversion; aims for robustness/generalization
    Tags: deepfakes, multimodal, audio-visual, forensics, robust-detection, inversion
  • 2603.28613 (cs.CV, cs.AI, cs.CR, cs.MM; score 86): TGIF2: Extended Text-Guided Inpainting Forgery Dataset & Benchmark [PDF]
    Why: Updated inpainting forgery dataset/benchmark; targets the hard case of localization in fully regenerated images
    Tags: benchmark, dataset, image-forensics, inpainting, synthetic-media, robustness
  • 2604.06805 (cs.CL; score 86): Cognitive Loop of Thought: Reversible Hierarchical Markov Chain for Efficient Mathematical Reasoning [PDF]
    Why: Targets long-CoT inefficiency with a reversible hierarchical Markov structure + dataset for backward reasoning
    Tags: LLM reasoning, chain-of-thought, efficiency, math, dataset, inference
  • 2604.08140 (cs.CR, cs.AI, cs.MM, cs.NI; score 86): Multimodal Reasoning with LLM for Encrypted Traffic Interpretation: A Benchmark [PDF]
    Why: New byte-grounded benchmark adds auditable LLM reasoning for encrypted traffic interpretation
    Tags: benchmark, cybersecurity, multimodal, network-traffic, LLM-reasoning, auditability
  • 2604.07072 (cs.LG; score 86): Epistemic Robust Offline Reinforcement Learning [PDF]
    Why: Uncertainty-set alternative to ensembles for offline RL; targets epistemic uncertainty & reliability
    Tags: offline-RL, uncertainty, epistemic, robust-RL, Q-learning
  • 2604.04749 (cs.AI; score 86): AI Trust OS -- A Continuous Governance Framework for Autonomous AI Observability and Zero-Trust Compliance in Enterprise Environments [PDF]
    Why: Continuous observability + zero-trust compliance for LLM/RAG/multi-agent enterprise deployments
    Tags: AI-governance, observability, zero-trust, agents, enterprise, compliance
  • 2604.04800 (cs.LG, cs.CR; score 86): Forgetting to Witness: Efficient Federated Unlearning and Its Visible Evaluation [PDF]
    Why: End-to-end federated unlearning + visualization eval; strong privacy/safety relevance
    Tags: federated-learning, machine-unlearning, privacy, evaluation, distillation
  • 2604.01572 (cs.CR; score 86): AI-Assisted Hardware Security Verification: A Survey and AI Accelerator Case Study [PDF]
    Why: Survey of AI/LLM-assisted hardware security verification + practical accelerator case study
    Tags: security, LLMs, verification, hardware-security, survey, formal-methods
  • 2604.04820 (cs.AI, cs.CL; score 86): ANX: Protocol-First Design for AI Agent Interaction with a Supporting 3EX Decoupled Architecture [PDF]
    Why: Agent-native protocol/framework aiming to reduce token cost and improve security for tool/MCP use
    Tags: agents, protocols, tool-use, MCP, security, systems
  • 2604.05523 (cs.AI; score 86): Market-Bench: Benchmarking Large Language Models on Economic and Trade Competition [PDF]
    Why: Multi-agent economic competition benchmark; measures resource acquisition/strategy, relevant to agentic risk evals
    Tags: agents, multi-agent, benchmark, economics, resource-acquisition, evaluation
  • 2604.07747 (cs.AI, cs.CL, cs.LG; score 86): Mitigating Distribution Sharpening in Math RLVR via Distribution-Aligned Hint Synthesis and Backward Hint Annealing [PDF]
    Why: RLVR method to reduce distribution sharpening; targets pass@k reasoning robustness
    Tags: LLM, reasoning, RLVR, math, training, robustness
  • 2604.08184 (cs.SD, cs.AI; score 84): AT-ADD: All-Type Audio Deepfake Detection Challenge Evaluation Plan [PDF]
    Why: Evaluation plan for all-type audio deepfake detection beyond speech; addresses real-world distortions
    Tags: audio-deepfakes, benchmark, evaluation, security, robustness, ALLM
  • 2604.07017 (cs.AI; score 84): A-MBER: Affective Memory Benchmark for Emotion Recognition [PDF]
    Why: New benchmark for affective state inference using long-term conversational memory across sessions
    Tags: benchmark, memory, emotion recognition, evaluation, assistants, longitudinal
  • 2604.07883 (cs.AI, cs.CL, cs.CY, cs.MA; score 84): An Agentic Evaluation Architecture for Historical Bias Detection in Educational Textbooks [PDF]
    Why: Agentic evaluation + source attribution protocol reduces false positives in bias auditing
    Tags: agent-evaluation, multi-agent, bias-detection, audit, source-attribution, education
  • 2603.09675 (cs.LG, cs.AI; score 84): GNNs for Time Series Anomaly Detection: An Open-Source Framework and a Critical Evaluation [PDF]
    Why: Open-source TS anomaly detection framework + critique of metrics; boosts reproducibility & eval rigor
    Tags: evaluation, reproducibility, anomaly-detection, GNN, framework
  • 2604.05674 (cs.CR, cs.AI; score 84): From Incomplete Architecture to Quantified Risk: Multimodal LLM-Driven Security Assessment for Cyber-Physical Systems [PDF]
    Why: Multimodal LLM tool for CPS threat modeling from incomplete architecture; outputs quantified risk
    Tags: cyber-physical-systems, security, LLM, risk-assessment, threat-modeling, multimodal
  • 2604.00550 (cs.AI; score 84): BloClaw: An Omniscient, Multi-Modal Agentic Workspace for Next-Generation Scientific Discovery [PDF]
    Why: Agent workspace protocol/sandbox reliability; relevant to safe tool use, though claims need validation
    Tags: agents, tool-use, sandboxing, protocols, AI4Science, systems
  • 2603.29386 (cs.CV, cs.AI; score 84): PromptForge-350k: A Large-Scale Dataset and Contrastive Framework for Prompt-Based AI Image Forgery Localization [PDF]
    Why: Large dataset (350k) for prompt-based image forgery localization; useful for misuse defense
    Tags: deepfakes, image-forensics, dataset, misinformation, localization
  • 2604.04634 (cs.CV, cs.AI; score 84): Preserving Forgery Artifacts: AI-Generated Video Detection at Native Scale [PDF]
    Why: Native-scale deepfake video detection + large new dataset; targets resizing/cropping artifact loss
    Tags: deepfakes, video-detection, misinformation, dataset, robustness, forensics
  • 2604.05939 (cs.AI, cs.HC; score 84): Context-Value-Action Architecture for Value-Driven Large Language Model Agents [PDF]
    Why: Value-driven agent architecture; claims prompt reasoning can polarize values; proposes a verifier with human ground truth
    Tags: agents, values, alignment, evaluation, human-ground-truth, robustness
  • 2604.07894 (cs.CL, cs.AI; score 84): TSUBASA: Improving Long-Horizon Personalization via Evolving Memory and Self-Learning with Context Distillation [PDF]
    Why: Long-horizon personalization via evolving memory + self-learning context distillation
    Tags: LLM, personalization, memory, long-context, continual-learning, RAG
  • 2603.19204 (cs.LG; score 82): Robustness, Cost, and Attack-Surface Concentration in Phishing Detection [PDF]
    Why: Cost-aware evasion analysis for phishing detectors; introduces MEC/S(B)/RCI diagnostics for robustness gaps
    Tags: security, adversarial-evasion, robustness-metrics, phishing, ml-security
  • 2604.05364 (cs.AI; score 82): TFRBench: A Reasoning Benchmark for Evaluating Forecasting Systems [PDF]
    Why: Reasoning-focused forecasting benchmark with multi-agent verification loop and causally effective traces
    Tags: benchmark, evaluation, reasoning traces, multi-agent, forecasting, verification
  • 2603.28113 (cs.LG; score 82): Lipschitz verification of neural networks through training [PDF]
    Why: Train-for-verifiability approach makes Lipschitz robustness certifiable with cheap bounds
    Tags: verification, robustness, certified-training, lipschitz, adversarial
  • 2603.22770 (cs.LG, cs.AI; score 82): From Arithmetic to Logic: The Resilience of Logic and Lookup-Based Neural Networks Under Parameter Bit-Flips [PDF]
    Why: Theory of DNN resilience to parameter bit-flips; relevant to safety-critical deployment robustness
    Tags: robustness, fault-tolerance, bit-flips, edge-AI, theory
  • 2604.05458 (cs.CR, cs.AI; score 82): MA-IDS: Multi-Agent RAG Framework for IoT Network Intrusion Detection with an Experience Library [PDF]
    Why: Multi-agent LLM+RAG intrusion detection with persistent experience library for IoT zero-days
    Tags: RAG, agents, intrusion-detection, IoT, cybersecurity, experience-library
  • 2604.08213 (cs.CV, cs.AI; score 82): EditCaption: Human-Aligned Instruction Synthesis for Image Editing via Supervised Fine-Tuning and Direct Preference Optimization [PDF]
    Why: Human-aligned instruction synthesis for image editing; SFT+DPO pipeline and large 100K dataset
    Tags: DPO, post-training, VLM, data-generation, image-editing, alignment

AI Paper Insight Brief

2026-04-14

0) Executive takeaways (read this first)

  • Robustness is increasingly “systems + evaluation,” not just model choice: multiple papers show that deployment failures (tool-call serialization, lost visual outputs, governance blind spots) and metric/threshold choices can dominate real-world reliability even when i.i.d. accuracy looks strong.
  • Attack surfaces concentrate where edits are cheap: phishing detection robustness is bounded by low-cost, presentation-layer feature edits (median minimal evasion cost = 2; edits concentrate on ~3 features), implying architecture upgrades alone won’t fix deployment fragility without changing features/costs.
  • Forensics is in a dataset/benchmark refresh cycle driven by new generators and laundering: FLUX.1 inpainting and native-resolution video processing both shift what “generalization” means; super-resolution laundering (Real-ESRGAN) is a strong attack that collapses localization performance.
  • Agent protocols are moving toward “LLM sees less”: ANX and AI Trust OS both emphasize isolating sensitive data from the LLM and using telemetry/probes + structured protocols to make agent behavior auditable and compliant.
  • Reasoning quality is being operationalized with verifiers and dense progress signals: forecasting (TFRBench), math reasoning (CLoT; DAHS+BHA), and long-horizon strategy (CivBench) all add verification loops or dense intermediate metrics to avoid misleading end metrics.

2) Key themes (clusters)

Theme: Cost- and threat-model-aware robustness (beyond i.i.d. accuracy)

Theme: Forensics under new generators + laundering attacks

  • Why it matters: New generation pipelines (e.g., FLUX.1, modern video generators) and post-processing (super-resolution) can erase or alter forensic traces, breaking detectors trained on older distributions.
  • Representative papers:
  • Common approach:
    • Build updated, large-scale datasets/benchmarks tied to current generators (TGIF2 adds FLUX.1; video dataset spans ~140K videos/15 generators).
    • Separate regimes that prior methods conflate (spliced vs fully regenerated; prompt-based edits with recovered masks).
    • Stress-test with realistic laundering (Real-ESRGAN super-resolution) and resolution-preserving pipelines (native-scale 3D patchification).
  • Open questions / failure modes:
    • Out-of-domain generalization remains limited (SID degrades on FLUX.1; IFL fine-tuning can induce semantic bias; PromptForge leave-one-out IoU ~41.5%).
    • Post-processing attacks can dominate (Real-ESRGAN sharply reduces IFL F1; native-scale video detection increases compute cost).
    • Annotation pipelines can fail on edge cases (PromptForge mask errors on color-only edits due to DINO v3 sensitivity limits).
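The "stress-test with realistic laundering" recipe above amounts to a simple CI harness: re-score a localizer on laundered copies of each input and compare against the clean baseline. A minimal sketch, assuming a generic `localize` function and a `transforms` dict of laundering ops as stand-ins (Real-ESRGAN, JPEG re-compression, and resizing would slot into `transforms` in practice; none of these names come from the papers):

```python
from statistics import mean

def mask_f1(pred, truth):
    """Pixel-level F1 between binary masks given as flat 0/1 sequences."""
    tp = sum(p and t for p, t in zip(pred, truth))
    fp = sum(p and not t for p, t in zip(pred, truth))
    fn = sum(t and not p for p, t in zip(pred, truth))
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

def laundering_report(samples, localize, transforms):
    """Score a forgery localizer on clean vs laundered inputs.

    samples    : list of (image, ground_truth_mask) pairs
    localize   : image -> predicted binary mask (stand-in for a real model)
    transforms : name -> laundering op (SR, JPEG, resize would go here)
    """
    report = {"clean": mean(mask_f1(localize(img), m) for img, m in samples)}
    for name, launder in transforms.items():
        report[name] = mean(mask_f1(localize(launder(img)), m)
                            for img, m in samples)
    return report
```

Tracking the per-transform F1 drop in CI is what surfaces a laundering attack like super-resolution before deployment does.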

Theme: Agent infrastructure & governance: protocols, telemetry, and “LLM sees less”

Theme: Verification- and progress-based evaluation for reasoning/agents

3) Technical synthesis

  • “Evaluation mismatch” is a recurring failure mode: TSAD shows VUS can be competitive while thresholded detection yields zero correct predictions; phishing shows AUC ~0.98–0.995 yet median MEC=2 under feasible edits.
  • Robustness often reduces to controlling the interface between components: BloClaw (routing + sandbox interception), ANX (protocol + UI-to-Core isolation), AI Trust OS (telemetry probes + evidence ledger) all treat the LLM as one module in a controlled system.
  • Data refresh is now part of the method: TGIF2 (FLUX.1 + random masks), native-scale video detection (new 140K/15-generator dataset + Magic Videos benchmark), AT-ADD (40+ speech generators; 70+ all-type generators) all treat generator churn as a first-class benchmark requirement.
  • “Preserve signal” vs “normalize input”: video detection argues fixed resizing destroys high-frequency traces; native 3D patchification + variable resolution improves robustness but increases compute.
  • Structured uncertainty representations are replacing brute-force ensembles: ERSAC models per-state uncertainty sets (box/convex hull/ellipsoid) and uses Epinets for efficient ellipsoids, recovering SAC-N as a special case.
  • Verification is being made token-efficient: CLoT’s hierarchical pruning reduces token use (e.g., 325k → 136k in an ablation) while improving accuracy; DAHS+BHA targets large-k coverage without relying on pass@1 alone.
  • Benchmarks increasingly include “stress layers”: A-MBER adds pseudo-relevant-history and insufficient-evidence tags; TGIF2 adds random masks and SR laundering; AT-ADD adds real-world perturbations and unseen generators.
  • Interpretability is shifting from post-hoc to “generate structured evidence”: mmTraffic generates forensic JSON reports from bytes; MA-IDS stores human-readable rules in an Experience Library; TFRBench evaluates logic-to-number consistency.
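The large-k coverage point above (pass@k rather than pass@1) can be made concrete with the standard unbiased estimator: draw n samples per problem, count c correct, and estimate pass@k as 1 - C(n-c, k)/C(n, k). This is generic evaluation methodology, not code from the RLVR paper:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k from n sampled solutions with c correct.

    Probability that at least one of k samples drawn without
    replacement from the n generated solutions is correct.
    """
    if n - c < k:  # fewer than k incorrect samples: success is certain
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# A model can look weak at pass@1 yet retain broad coverage at pass@16,
# which is exactly the signal that distribution sharpening erodes:
low = pass_at_k(n=64, c=4, k=1)    # 0.0625
high = pass_at_k(n=64, c=4, k=16)  # much larger
```

Reporting both ends of the k range makes sharpening visible: a method that raises pass@1 while collapsing pass@64 has traded coverage for mode-seeking.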

4) Top 5 papers (with “why now”)

1) AI Trust OS – A Continuous Governance Framework for Autonomous AI Observability and Zero-Trust Compliance in Enterprise Environments

  • Telemetry-first governance with Shadow AI discovery scanning LangSmith/Datadog to auto-register undocumented AI systems.
  • Zero-trust probe boundary: ephemeral read-only probes; excludes code/prompts/payload PII; evidence ledger with watermarks.
  • Demonstrated a concrete evidence run (multi-provider) including discovery of an undeclared fine-tuned model and PII patterns in traces.
  • Skepticism: evaluation is largely a single-workspace run; broader observability coverage and longitudinal validation are pending.
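The evidence-ledger idea can be illustrated with a minimal append-only, hash-chained log: each record commits to the previous record's hash, so tampering with earlier audit evidence invalidates everything after it. The class and method names below are invented for illustration; the paper's actual watermarking scheme is not reproduced here.

```python
import hashlib
import json

class EvidenceLedger:
    """Append-only, hash-chained evidence log (illustrative sketch)."""

    def __init__(self):
        self._records = []

    def append(self, event: dict) -> str:
        prev = self._records[-1]["hash"] if self._records else "genesis"
        body = json.dumps({"event": event, "prev": prev}, sort_keys=True)
        digest = hashlib.sha256(body.encode()).hexdigest()
        self._records.append({"event": event, "prev": prev, "hash": digest})
        return digest

    def verify(self) -> bool:
        """Recompute the chain; any edit to a past record breaks it."""
        prev = "genesis"
        for rec in self._records:
            body = json.dumps({"event": rec["event"], "prev": prev},
                              sort_keys=True)
            if rec["prev"] != prev or \
               hashlib.sha256(body.encode()).hexdigest() != rec["hash"]:
                return False
            prev = rec["hash"]
        return True

ledger = EvidenceLedger()
ledger.append({"probe": "read-only", "finding": "undeclared fine-tuned model"})
ledger.append({"probe": "read-only", "finding": "PII pattern in traces"})
assert ledger.verify()
ledger._records[0]["event"]["finding"] = "nothing to see"  # tamper
assert not ledger.verify()
```

Deterministic serialization (`sort_keys=True`) matters here: without it, re-verification of an honest ledger could fail on key-order differences alone.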

2) Robustness, Cost, and Attack-Surface Concentration in Phishing Detection

  • Exact cost-aware evasion via shortest-path search over discrete monotone edits; introduces MEC/FRI/RCI diagnostics.
  • Finds median MEC=2 and strong concentration (RCI3 > 0.78), implying robustness is dominated by a few cheap-to-edit features.
  • Provides an architecture-invariant bound: if many instances are evadable via minimal-cost transitions, no classifier can raise that MEC quantile without changing features/costs.
  • Skepticism: uses a dated UCI dataset and a monotone-only threat model; modern feature sets and richer actions could change conclusions.
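The cost-aware evasion search can be sketched as Dijkstra over single-feature edits. Everything concrete below is an assumption for illustration, not the paper's setup: binary features, monotone 1→0 edits with per-feature costs, and a black-box `is_flagged` predicate standing in for the detector.

```python
import heapq

def minimal_evasion_cost(x, costs, is_flagged):
    """Dijkstra over monotone single-feature edits.

    x          : tuple of binary features for a flagged instance
    costs      : per-feature edit cost (only 1 -> 0 removals allowed here)
    is_flagged : black-box classifier predicate; evasion succeeds when False
    Returns (cost, edited_features) of a cheapest evading variant, or None.
    """
    start = tuple(x)
    dist = {start: 0}
    heap = [(0, start)]
    while heap:
        d, state = heapq.heappop(heap)
        if d > dist.get(state, float("inf")):
            continue  # stale heap entry
        if not is_flagged(state):
            return d, state
        for i, v in enumerate(state):
            if v == 1:  # monotone: attacker may only remove suspicious cues
                nxt = state[:i] + (0,) + state[i + 1:]
                nd = d + costs[i]
                if nd < dist.get(nxt, float("inf")):
                    dist[nxt] = nd
                    heapq.heappush(heap, (nd, nxt))
    return None  # no feasible evasion under this edit set

# Hypothetical detector: flags an instance when >= 2 suspicious cues remain.
flagged = lambda s: sum(s) >= 2
cost, variant = minimal_evasion_cost((1, 1, 1), costs=[3, 1, 1],
                                     is_flagged=flagged)
# MEC = 2: dropping the two cheap cues evades; the expensive cue never matters.
```

The usage line also shows why concentration diagnostics bite: the cheapest evading path touches only the low-cost features, so raising the cost of those few features moves the MEC distribution more than swapping the classifier.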

3) TGIF2: Extended Text-Guided Inpainting Forgery Dataset & Benchmark

  • Large updated dataset (271,788 manipulated images) adding FLUX.1 inpainting and random non-semantic masks.
  • Shows IFL methods fail on fully regenerated images; fine-tuning helps but can induce semantic bias and still generalizes poorly out-of-domain.
  • Demonstrates Real-ESRGAN as a strong laundering attack that sharply reduces localization performance.
  • Skepticism: empirical benchmark; does not itself provide a generator-robust localization method.

4) Preserving Forgery Artifacts: AI-Generated Video Detection at Native Scale

  • Argues fixed 224×224 preprocessing destroys forensic cues; proposes native-scale 3D patchification with Qwen2.5-ViT.
  • Couples method with an up-to-date dataset (~140K videos, 15 generators) and Magic Videos benchmark (6 recent generators).
  • Reports strong cross-dataset performance (e.g., DVF-Test AUC 97.6%) and robustness under compression/downscaling relative to baselines.
  • Skepticism: native-resolution processing increases compute/memory; ongoing generator churn requires continuous dataset updates.

5) ANX: Protocol-First Design for AI Agent Interaction with a Supporting 3EX Decoupled Architecture

  • Protocol-first agent interaction (markup/config/CLI) + 3EX decoupling (Expression/Exchange/Execution) with dynamic discovery (ANXHub).
  • Security primitives: sensitive fields bypass the LLM (UI-to-Core), and confirmations are human-only with no programmatic exit.
  • Demonstrates large token/time reductions vs GUI automation and MCP-based skills on a form-filling benchmark.
  • Skepticism: evaluation is narrow (form-filling); security claims need adversarial validation and real deployment studies.
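The "sensitive fields bypass the LLM" primitive can be sketched as opaque-handle substitution: the UI layer replaces secret values with tokens before any LLM context is built, and only the execution layer resolves them back. The names (`SecretVault`, `redact`, `resolve`) are illustrative, not ANX's API.

```python
import secrets

class SecretVault:
    """Keeps sensitive values out of the LLM context (illustrative sketch).

    The LLM only ever sees opaque handles like <<secret:ab12cd34>>; the
    execution layer swaps real values back in just before the tool call
    runs, so prompts, logs, and model outputs stay clean.
    """

    def __init__(self):
        self._store = {}

    def redact(self, value: str) -> str:
        handle = f"<<secret:{secrets.token_hex(4)}>>"
        self._store[handle] = value
        return handle

    def resolve(self, text: str) -> str:
        for handle, value in self._store.items():
            text = text.replace(handle, value)
        return text

vault = SecretVault()
card = vault.redact("4111-1111-1111-1111")
llm_context = f"Fill the payment form; card number is {card}"
assert "4111-1111-1111-1111" not in llm_context  # model never sees the value
tool_call = vault.resolve(f"form.set('card', '{card}')")
assert "4111-1111-1111-1111" in tool_call        # restored only at execution
```

The design point this illustrates is that secrecy becomes a property of the dataflow rather than of model behavior: no prompt-injection attack can exfiltrate a value the model was never given.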

5) Practical next steps

  • Adopt cost-aware robustness audits for security classifiers: compute minimal evasion cost and concentration (MEC/RCI-style) to identify “cheap feature bottlenecks,” then redesign features/costs rather than only swapping models.
  • For anomaly detection pipelines, report at least one threshold-agnostic metric (e.g., VUS) and analyze score distributions/threshold sensitivity to avoid “good metric, zero detections” failures.
  • For agent toolchains, harden the interface: replace fragile JSON tool-calls with structured protocols + maximal extraction; add sandbox interception to persist all artifacts (plots/HTML) by default.
  • For enterprise LLM deployments, implement telemetry-driven Shadow AI discovery and evidence ledgers; ensure probes are read-only and exclude prompts/payload PII, then generate deterministic exports for audits.
  • For forgery detection/localization, add laundering attacks (super-resolution, compression, resizing) to CI evaluation; track generator families separately (e.g., SDXL vs FLUX.1) and measure out-of-domain generalization explicitly.
  • For long-horizon agent evaluation, prefer dense progress estimators (turn-level victory probability / intermediate standing) over terminal win rates to detect regressions and agentic-setup effects.
  • For reasoning training, measure broad coverage (large-k pass@k) and add verification/annealing mechanisms (backward checks, hint annealing) to avoid distribution sharpening and error propagation.
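The threshold-sensitivity audit recommended above can be sketched generically: contrast F1 at the deployed threshold with the best achievable F1 over a threshold sweep, and measure how wide the band of workable thresholds is. This is a generic diagnostic, not code from any of the papers.

```python
from statistics import mean

def f1_at(scores, labels, thr):
    """F1 of thresholded anomaly scores against 0/1 labels."""
    preds = [s >= thr for s in scores]
    tp = sum(p and y for p, y in zip(preds, labels))
    fp = sum(p and not y for p, y in zip(preds, labels))
    fn = sum(y and not p for p, y in zip(preds, labels))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def threshold_sensitivity(scores, labels, default_thr):
    """Contrast F1 at a deployed threshold with the best achievable F1.

    A large f1_best - f1_default gap, or a narrow usable_band, signals
    the 'good threshold-agnostic metric, zero detections' failure mode.
    """
    grid = sorted(set(scores))
    f1s = [f1_at(scores, labels, t) for t in grid]
    best = max(f1s)
    return {"f1_default": f1_at(scores, labels, default_thr),
            "f1_best": best,
            "usable_band": mean(f >= 0.5 * best for f in f1s)}
```

Run on held-out scores, a report like `{"f1_default": 0.0, "f1_best": 1.0, ...}` is exactly the mismatch the TSAD critique describes: the detector separates classes, but the deployed threshold never fires.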

Generated from per-paper analyses; no external browsing.