Daily AI Paper Report (2026-03-30)

Chinese version: [中文]

Run stats

  • Candidates: 1714
  • Selected: 30
  • Deepread completed: 30
  • Window (UTC): 2026-03-27T00:00:00Z → 2026-03-28T00:00:00Z (weekend_backlog_sat, expanded=0)
Selected papers

  • 2603.24570 Anti-I2V: Safeguarding your photos from malicious image-to-video generation (cs.CV, cs.AI; score 90)
    Why: Targets misuse: adversarial protection vs image-to-video diffusion incl. DiT; timely safety angle
    Tags: misuse-prevention, adversarial-perturbations, diffusion, video-generation, deepfakes, DiT
  • 2603.21698 A Blueprint for Self-Evolving Coding Agents in Vehicle Aerodynamic Drag Prediction (cs.AI; score 88)
    Why: Contract-centric self-evolving coding agents; strong for agentic reliability, leakage control, reproducibility.
    Tags: agents, coding-agents, autonomous-optimization, evaluation-contracts, reproducibility, leakage-prevention
  • 2603.22179 MARCUS: An agentic, multimodal vision-language model for cardiac diagnosis and management (cs.AI; score 88)
    Why: Agentic multimodal VLM for cardiac diagnosis across ECG/echo/CMR; large-scale training, real deployment relevance
    Tags: agentic-systems, multimodal, vision-language, medical-ai, orchestration, clinical-decision-support
  • 2603.17838 Event-Centric Human Value Understanding in News-Domain Texts: An Actor-Conditioned, Multi-Granularity Benchmark (cs.CL; score 86)
    Why: NEVU benchmark for actor-conditioned, direction-aware human values in factual news; useful for alignment evals.
    Tags: alignment, values, benchmark, evaluation, news, grounding
  • 2603.23966 Policy-Guided Threat Hunting: An LLM enabled Framework with Splunk SOC Triage (cs.CR, cs.AI; score 86)
    Why: Agentic LLM framework for SOC triage/threat hunting; high real-world security relevance.
    Tags: agentic-ai, cybersecurity, SOC, SIEM, threat-hunting, LLM-tools
  • 2603.21613 AgenticRec: End-to-End Tool-Integrated Policy Optimization for Ranking-Oriented Recommender Agents (cs.IR, cs.AI; score 86)
    Why: End-to-end policy optimization for tool-using LLM recommender agents; trajectory-level feedback linkage.
    Tags: agents, tool-use, policy-optimization, ReAct, RL, recommenders, evaluation
  • 2603.18813 Can LLM generate interesting mathematical research problems? (cs.AI; score 86)
    Why: Agent+benchmark for LLM mathematical creativity; 665 novel research problems w/ expert verification
    Tags: LLM, agents, evaluation, creativity, math, benchmark
  • 2603.23146 Why AI-Generated Text Detection Fails: Evidence from Explainable AI Beyond Benchmark Accuracy (cs.CL, cs.AI; score 86)
    Why: Shows AI-text detectors fail via artifacts; adds explainability beyond benchmark accuracy
    Tags: AI-generated-text, detection, dataset-artifacts, robustness, explainable-AI, evaluation
  • 2603.23160 UniDial-EvalKit: A Unified Toolkit for Evaluating Multi-Faceted Conversational Abilities (cs.CL; score 86)
    Why: Unified toolkit for multi-turn dialogue eval; improves comparability and scalable interactive testing
    Tags: evaluation, dialogue, toolkit, benchmarks, interactive-systems, metrics
  • 2603.24051 FinToolSyn: A Forward Synthesis Framework for Financial Tool-Use Dialogue Data with Dynamic Tool Retrieval (cs.CL; score 86)
    Why: Forward-synth tool-use dialogues w/ dynamic tool retrieval; useful for agent tool-use training/eval
    Tags: LLM agents, tool use, synthetic data, dialogue generation, retrieval, finance
  • 2603.23447 3DCity-LLM: Empowering Multi-modality Large Language Models for 3D City-scale Perception and Understanding (cs.CV, cs.AI; score 86)
    Why: City-scale multimodal LLM framework + 1.2M dataset for 3D perception/planning; strong reuse potential
    Tags: multimodal-llm, 3d, city-scale, dataset, planning, vision-language
  • 2603.22985 Beyond Hate: Differentiating Uncivil and Intolerant Speech in Multimodal Content Moderation (cs.CL, cs.CY; score 86)
    Why: Fine-grained multimodal toxicity labels (incivility vs intolerance); improves moderation modeling & eval.
    Tags: content-moderation, multimodal, toxicity, dataset, evaluation, vision-language
  • 2603.18779 SoK: Practical Aspects of Releasing Differentially Private Graphs (cs.CR, cs.SI; score 86)
    Why: SoK on practical DP graph release; clarifies guarantees, pitfalls, and utility tradeoffs.
    Tags: privacy, differential-privacy, graphs, systematization-of-knowledge, deployment
  • 2603.08014 FedMomentum: Preserving LoRA Training Momentum in Federated Fine-Tuning (cs.LG, cs.AI; score 86)
    Why: Federated LoRA aggregation fix; better convergence + privacy-preserving LLM finetuning practicality
    Tags: federated-learning, LoRA, LLM-finetuning, privacy, optimization
  • 2603.23178 SAiW: Source-Attributable Invisible Watermarking for Proactive Deepfake Defense (cs.AI; score 84)
    Why: Source-attributable invisible watermarking for proactive deepfake defense and provenance verification.
    Tags: security, deepfakes, watermarking, provenance, media-integrity
  • 2603.24213 Uncovering Memorization in Timeseries Imputation models: LBRM Membership Inference and its link to attribute Leakage (cs.LG, cs.AI; score 84)
    Why: Black-box membership+attribute inference for time-series imputation; concrete privacy leakage.
    Tags: privacy, membership-inference, attribute-inference, timeseries, security, leakage
  • 2603.22987 A Critical Review on the Effectiveness and Privacy Threats of Membership Inference Attacks (cs.CR, cs.LG; score 84)
    Why: Clarifies when membership inference is a real privacy threat; warns against overusing MIA as metric.
    Tags: privacy, membership-inference, security, evaluation, threat-models, ml-security
  • 2603.17328 A Progressive Visual-Logic-Aligned Framework for Ride-Hailing Adjudication (cs.AI, cs.LG; score 84)
    Why: Targets multimodal hallucination/logic looseness via evidentiary protocol + synthetic grounding engine
    Tags: multimodal LLM, reliability, hallucinations, structured reasoning, evaluation, decision support
  • 2603.22977 DariMis: Harm-Aware Modeling for Dari Misinformation Detection on YouTube (cs.CL, cs.AI, cs.LG; score 84)
    Why: First large Dari YouTube misinformation+harm dataset; useful for safety triage in low-resource settings.
    Tags: misinformation, dataset, low-resource, harm-assessment, content-moderation, YouTube
  • 2603.21515 When the Abyss Looks Back: Unveiling Evolving Dark Patterns in Cookie Consent Banners (cs.CR; score 83)
    Why: Detects newly evolved cookie-consent dark patterns (DP11–DP19); practical privacy/security measurement.
    Tags: privacy, security, dark-patterns, measurement, compliance, web
  • 2603.23279 Emergence of Fragility in LLM-based Social Networks: the Case of Moltbook (cs.SI, cs.AI; score 82)
    Why: Large-scale empirical study of LLM-agent social network fragility; relevant to multi-agent risk dynamics.
    Tags: multi-agent, LLM-agents, emergent-behavior, network-science, systemic-risk
  • 2603.19152 VEPO: Variable Entropy Policy Optimization for Low-Resource Language Foundation Models (cs.CL, cs.AI; score 82)
    Why: RL w/ verifiable rewards + variable entropy to enforce constraints; targets low-resource LM reliability
    Tags: alignment, RLVR, reliability, low-resource, constraints, training
  • 2603.22015 Retrieving Climate Change Disinformation by Narrative (cs.CL; score 82)
    Why: Narrative retrieval for climate disinfo without fixed labels; supports emerging narrative tracking
    Tags: misinformation, retrieval, narratives, climate, evaluation, synthetic-data
  • 2603.22988 Robustness Quantification and Uncertainty Quantification: Comparing Two Methods for Assessing the Reliability of Classifier Predictions (cs.LG; score 82)
    Why: Compares robustness vs uncertainty for per-prediction reliability, incl. distribution shift; practical for safety
    Tags: reliability, uncertainty, robustness, distribution-shift, calibration, evaluation
  • 2603.21619 Efficient Zero-Shot AI-Generated Image Detection (cs.CV, cs.AI; score 82)
    Why: Training-free, fast AI-generated image detection via frequency-perturbation sensitivity; good generalization angle
    Tags: ai-generated-content, detection, robustness, forensics, security, frequency-domain
  • 2603.22982 How Far Should We Need to Go: Evaluate Provenance-based Intrusion Detection Systems in Industrial Scenarios (cs.CR; score 82)
    Why: Systematic industrial eval of provenance-based IDS; highlights real-world gaps vs DARPA-style benchmarks.
    Tags: security, intrusion-detection, provenance, evaluation, datasets, robustness
  • 2603.18647 Beyond TVLA: Anderson-Darling Leakage Assessment for Neural Network Side-Channel Leakage Detection (cs.CR, cs.AI; score 82)
    Why: Better side-channel leakage test for NN implementations; full-distribution vs mean-shift TVLA.
    Tags: security, side-channels, leakage-detection, neural-networks, evaluation
  • 2603.08459 Data-Driven Priors for Uncertainty-Aware Deterioration Risk Prediction with Multimodal Data (cs.LG; score 82)
    Why: Uncertainty-aware multimodal prediction with data-driven priors; reliability angle transferable beyond health
    Tags: uncertainty, calibration, multimodal, reliability, bayesian
  • 2603.22846 CoMaTrack: Competitive Multi-Agent Game-Theoretic Tracking with Vision-Language-Action Models (cs.AI; score 80)
    Why: Competitive multi-agent training + new benchmark for embodied tracking; useful for adversarial agent evals.
    Tags: agents, multi-agent-RL, benchmark, embodied-ai, adversarial-training, VLA
  • 2603.19101 FedTrident: Resilient Road Condition Classification Against Poisoning Attacks in Federated Learning (cs.CR, cs.AI, cs.DC, cs.LG; score 80)
    Why: Federated learning defense vs targeted label-flipping in ITS; safety-critical robustness.
    Tags: federated-learning, data-poisoning, robustness, transportation, label-flipping

AI Paper Insight Brief

2026-03-30

0) Executive takeaways (read this first)

  • “End-to-end policy optimization” is spreading beyond chat: multiple papers use GRPO/PPO-style RL to optimize full trajectories (tool calls, multimodal reasoning, embodied control), not just final answers—suggesting a near-term shift in how agentic systems are trained and evaluated.
  • Reliability is being operationalized as “selective action” (defer/contain/reject) rather than just calibration: clinical risk prediction (MedCertAIn), SOC triage (policy-guided threat hunting), and robustness-vs-uncertainty (RQ vs UQ) all emphasize decision workflows.
  • Data/benchmark work is becoming more structure-heavy and direction-aware: NEVU (actor-conditioned, direction-aware values) and “Beyond Hate” (incivility vs intolerance) show a move from coarse labels to decomposed constructs that better match governance needs.
  • Federated and privacy/security work is getting more “systems-realistic”: FedMomentum fixes a concrete LoRA aggregation failure mode (momentum loss); FedTrident adds persistent client exclusion + remediation (unlearning); industrial PIDS evaluation shows large portability drops and high FPRs in real enterprise logs.
  • Proactive and measurement-driven defenses are gaining ground: invisible source-attributable watermarking (SAiW), anti–image-to-video cloaking (Anti-I2V), and cookie-banner dark-pattern measurement (UMBRA) tie defenses to concrete post-interaction or post-processing behaviors.

2) Key themes (clusters)

Theme: RL for agentic trajectories (tools, multimodal reasoning, embodied control)

  • Why it matters: Reward is assigned to full trajectories (tool calls, reasoning steps, actions) rather than only final answers, changing how agentic systems are trained and evaluated.
  • Representative papers: AgenticRec (tool-integrated ranking), MARCUS (multimodal clinical reasoning), CoMaTrack (embodied tracking).
  • Common approach: GRPO/PPO-style policy optimization over complete trajectories, often with KL regularization to an SFT reference policy.

Theme: Selective prediction & reliability as a workflow primitive

  • Why it matters: In high-stakes settings, the key product is often “know when not to act.” These papers emphasize deferral/containment decisions and reliability ranking rather than only average accuracy.
  • Representative papers: MedCertAIn (clinical deterioration risk), the policy-guided SOC threat-hunting framework, and the robustness-vs-uncertainty (RQ vs UQ) comparison.
  • Common approach:
    • Use uncertainty/reliability scores to drive coverage vs performance trade-offs (selective AUROC/AUPRC; accuracy–rejection curves).
    • Construct “hard/uncertain” sets without labels (MedCertAIn corruptions + cross-modal mismatch) or via robustness neighborhoods (RQ via ε-contamination).
    • Couple a lightweight filter/policy with expensive downstream review (SOC: DRL action × anomaly score to decide LLM triage).
  • Open questions / failure modes:
    • Mean-field VI and heuristic corruptions may misrepresent real clinical shifts (MedCertAIn).
    • RQ results are shown for Naive Bayes / Generative Forests on discrete UCI datasets—unclear transfer to deep nets.
    • SOC pipeline uses binary actions and fixed 5-minute windows; short-lived attacks and richer action spaces are not handled.
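The coverage-vs-performance trade-off these papers evaluate can be made concrete with an accuracy-rejection curve: rank predictions by confidence, then measure accuracy on the retained fraction at each coverage level. A minimal numpy sketch (function and variable names are my own, not from any of the papers):

```python
import numpy as np

def accuracy_rejection_curve(confidence, correct, coverages=(1.0, 0.8, 0.6, 0.4, 0.2)):
    """Accuracy on the most-confident fraction of predictions at each coverage level.

    A useful uncertainty score should make accuracy rise as coverage drops:
    the system 'knows when not to act' and defers on its least-confident cases.
    """
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    order = np.argsort(-confidence)              # most confident first
    curve = {}
    for c in coverages:
        k = max(1, int(round(c * len(order))))   # number of predictions retained
        curve[c] = correct[order[:k]].mean()
    return curve

# Toy example where confidence is informative: higher confidence means a
# higher chance of being correct, so rejecting more should raise accuracy.
rng = np.random.default_rng(0)
conf = rng.uniform(size=1000)
correct = rng.uniform(size=1000) < conf
curve = accuracy_rejection_curve(conf, correct)
```

Selective AUROC/AUPRC mentioned above are computed the same way, restricted to the retained subset at each coverage level.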

Theme: Data-centric alignment benchmarks with decomposed constructs

  • Why it matters: Coarse labels hide what models actually learn and create governance blind spots. These datasets explicitly separate who holds a value, direction (aligned vs contradictory), and what kind of harm is present.
  • Representative papers: NEVU (actor-conditioned directed values), Beyond Hate (incivility vs intolerance), DariMis (misinformation and harm), and climate narrative retrieval.
  • Common approach:
    • Replace single labels with structured targets (actor-conditioned directed values; incivility vs intolerance; misinformation × harm).
    • Evaluate failure modes that matter operationally (direction reversal rate in NEVU; FNR−FPR moderation bias in Beyond Hate; narrative variance as retrieval difficulty).
    • Show lightweight adaptation can matter a lot (NEVU: LoRA-finetuned open models outperform prompting-only baselines).
  • Open questions / failure modes:
    • Annotation noise and long-tail labels (NEVU) and limited subset size (Beyond Hate: 2,030 memes) may cap conclusions.
    • Domain dependence: climate narrative retrieval relies on NodeRAG summaries and hypothetical generation; runtime and proprietary dependencies are noted.
    • DariMis is text-only metadata; video/audio signals and joint harm prediction are future work.
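The FNR−FPR error asymmetry used as an operational failure metric above is simple to compute, and worth instrumenting directly rather than reading off accuracy. A minimal sketch (the function name is my own; positive gap means harmful content slips through more often than benign content is wrongly flagged):

```python
import numpy as np

def moderation_error_asymmetry(y_true, y_pred):
    """FNR - FPR for a binary moderation label (1 = should be flagged).

    Zero means errors are symmetric; a positive value means under-detection,
    the bias Beyond Hate reports for coarse hate labels.
    """
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    fnr = np.mean(~y_pred[y_true])    # misses among items that should be flagged
    fpr = np.mean(y_pred[~y_true])    # false flags among benign items
    return fnr - fpr

# Toy detector: misses half the positives but rarely over-flags,
# so FNR = 0.5, FPR = 0.25 and the asymmetry is 0.25.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
gap = moderation_error_asymmetry(y_true, y_pred)
```

Tracking this gap per subgroup or per construct (incivility vs intolerance) surfaces the governance-relevant asymmetries that a single F1 score hides.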

Theme: Federated learning robustness: from optimization pathologies to poisoning + remediation

  • Why it matters: FL is moving from “average updates” to maintaining training dynamics and handling persistent adversaries—both critical for real deployments.
  • Representative papers: FedMomentum (LoRA aggregation) and FedTrident (poisoning defense with remediation).
  • Common approach:
    • Fix structural mismatches in parameterization/aggregation (FedMomentum: aggregate ΔW=ΣBiAi then truncated SVD to reconstruct rank‑r LoRA; merge residuals into backbone).
    • Go beyond per-round filtering: maintain history and act on persistent clients (FedTrident rating + blacklist) and remediate global state (approximate unlearning by subtracting stored contributions).
  • Open questions / failure modes:
    • FedMomentum adds server compute (randomized SVD ~0.60s/round reported) and downlink depends on residual rank/threshold.
    • FedTrident assumes TLFA footprints are visible via output-layer neuron analysis; deeper-layer or more evasive attacks aren’t evaluated.
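The structural point behind FedMomentum's fix is that averaging LoRA factors A and B separately is not the same as averaging the updates they represent. A numpy sketch of the server-side step described above (aggregate the full ΔW, truncated SVD back to rank r with a balanced split of singular values, remainder to the backbone); dimensions, the use of a plain mean, and all names are illustrative assumptions, not the paper's exact recipe:

```python
import numpy as np

def aggregate_lora(client_BAs, rank):
    """Server-side LoRA aggregation in the spirit of FedMomentum.

    Aggregate full updates dW = mean_i(B_i @ A_i) instead of averaging A and B
    separately (structurally incorrect, since mean(B) @ mean(A) != mean(B @ A)),
    then reconstruct a rank-r LoRA pair via truncated SVD with the singular
    values split evenly between the two factors. The rank-r remainder
    ('residual') is what would be merged into the backbone weights.
    """
    dW = np.mean([B @ A for B, A in client_BAs], axis=0)
    U, S, Vt = np.linalg.svd(dW, full_matrices=False)
    sqrt_S = np.sqrt(S[:rank])
    B_new = U[:, :rank] * sqrt_S             # shape (d_out, r)
    A_new = sqrt_S[:, None] * Vt[:rank]      # shape (r, d_in)
    residual = dW - B_new @ A_new            # energy outside rank r
    return B_new, A_new, residual

# Toy check: two clients with rank-2 LoRA factors for an 8x6 weight update.
rng = np.random.default_rng(1)
clients = [(rng.normal(size=(8, 2)), rng.normal(size=(2, 6))) for _ in range(2)]
B_new, A_new, residual = aggregate_lora(clients, rank=2)
```

The paper reportedly uses randomized SVD to keep this step cheap (~0.60s/round); `np.linalg.svd` here is the exact stand-in.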

Theme: Security & privacy evaluation gets more distribution-sensitive and deployment-grounded

  • Why it matters: Benchmark-only results overstate real-world robustness; these papers test under realistic hosts, data distributions, and attacker costs.
  • Representative papers: the industrial PIDS evaluation, ADLA side-channel leakage assessment, the critical review of membership inference attacks, and UMBRA's cookie-banner dark-pattern measurement.
  • Common approach: stress cross-host/platform transfer (large AUC drops), prefer full-distribution statistics over mean-shift tests, and report reliability under realistic priors and attacker cost.

3) Technical synthesis

  • GRPO is emerging as a common RL primitive across domains: recommender ranking (list-wise GRPO), multimodal clinical MCQs (MARCUS uses GRPO), and embodied tracking (CoMaTrack uses GRPO with KL to SFT).
  • Curriculum via adversaries vs via priors: CoMaTrack uses competitive opponents to self-escalate difficulty; MedCertAIn uses label-free “high-uncertainty” context sets (corruptions + cross-modal mismatch) to shape Bayesian priors.
  • “Trajectory = policy output” is the unifying agent training abstraction: AgenticRec explicitly includes Think/Act/Obs and Rank tokens; CoMaTrack jointly emits language + waypoints; SOC triage uses a policy layer to gate expensive LLM analysis.
  • Structured labels reduce governance-relevant error asymmetries: Beyond Hate shows coarse hate labels induce large under-detection (FNR−FPR) that improves with intolerance supervision; NEVU reduces direction reversal after LoRA.
  • Aggregation correctness matters for PEFT in FL: FedMomentum’s key move is aggregating BA products (ΔW) then reconstructing low-rank structure via truncated SVD, rather than averaging A/B separately.
  • “Remediation” is becoming first-class in defenses: FedTrident subtracts historical contributions (approx unlearning) after blacklisting; watermarking (SAiW) and cloaking (Anti‑I2V) aim to prevent misuse at creation time rather than detect after the fact.
  • Distribution-sensitive testing is a recurring motif: ADLA targets non-mean leakage; AI-text detection paper uses SHAP to show feature reliance shifts across corpora; PIDS evaluation shows cross-host/platform AUC drops.
  • Retrieval is being used to escape fixed taxonomies: climate narrative retrieval uses HyDE-style speculative docs + NodeRAG community summaries; this parallels broader moves away from fixed-label classification in evolving domains.
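The GRPO primitive recurring across these papers reduces, at its core, to a critic-free advantage estimate: standardize each rollout's reward against its own sampling group. A minimal sketch (normalization details such as the epsilon and whether the std is clipped vary across papers; this shows the common core only):

```python
import numpy as np

def grpo_advantages(group_rewards, eps=1e-8):
    """Group-relative advantages as used in GRPO-style training.

    For a group of rollouts sampled from the same prompt, each trajectory's
    advantage is its reward standardized against the group mean and std,
    so no learned value function (critic) is needed.
    """
    r = np.asarray(group_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Four rollouts of one prompt, two rewarded: the rewarded pair gets
# advantage ~+1 and the rest ~-1.
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

The same advantages are then plugged into a clipped PPO-style objective, often with a KL penalty to the SFT reference (as in CoMaTrack's setup above).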

4) Top 5 papers (with “why now”)

1) MARCUS: An agentic, multimodal vision-language model for cardiac diagnosis and management

  • Trains modality experts (ECG/echo/CMR) plus an orchestrator that decomposes queries and aggregates outputs.
  • Reports strong multimodal integration accuracy (70.0%) vs GPT‑5 Thinking (22.5%) and Gemini 2.5 Pro (27.5%).
  • Uses counterfactual probing (including image-absent probes) to mitigate “mirage reasoning,” reporting 0% system-level mirage rate.
  • Skepticism: training data development is single-center; evaluation is retrospective and benchmark-based (MCQ/VQA), not prospective clinical impact.

2) FedMomentum: Preserving LoRA Training Momentum in Federated Fine-Tuning

  • Identifies “loss of training momentum” from structurally incorrect LoRA aggregation in FL.
  • Aggregates correct ΔW=ΣBiAi then reconstructs rank‑r LoRA via truncated randomized SVD with balanced factorization; merges residual energy into backbone.
  • Shows consistent gains across math, commonsense, and code; randomized SVD makes aggregation feasible (0.60s/round vs exact >1000s).
  • Skepticism: added server compute and downlink cost depends on residual threshold/rank; experiments are limited to specific settings (e.g., LLaMA2‑7B, 10 clients).

3) CoMaTrack: Competitive Multi-Agent Game-Theoretic Tracking with Vision-Language-Action Models

  • Introduces competitive multi-agent RL for embodied visual tracking, using an opponent as an automatic curriculum.
  • Shows multi-agent RL improves over SFT and single-agent RL (e.g., STT SR 88.2 → 89.5 → 92.1).
  • Releases CoMaTrack-Bench for adversarial EVT evaluation; strong zero-shot gains vs a baseline on the new benchmark.
  • Skepticism: opponent realism and multi-agent non-stationarity/compute cost are acknowledged limitations.

4) When the Abyss Looks Back: Unveiling Evolving Dark Patterns in Cookie Consent Banners

  • UMBRA detects 19 dark patterns (including 9 “evolved” interaction-dependent ones) and links them to cookie-setting behavior.
  • Large-scale measurement on 14K sites; reports high detection accuracy (up to 99% for DP11–DP19) and concrete post-rejection cookie persistence (fake opt-outs).
  • Connects UI manipulation to security-relevant cookie attributes (e.g., XSS exposure), moving beyond “compliance UX” framing.
  • Skepticism: heuristic/lexicon rules may need continual updates; DOM obfuscation and device rendering differences can evade measurement.

5) How Far Should We Need to Go: Evaluate Provenance-based IDS in Industrial Scenarios

  • Tests five anomaly-based PIDSes on real enterprise provenance logs; shows large portability drops (avg AUC −26.77% across hosts, −38.03% across platforms).
  • Finds high false positives on ever-changing hosts (FPR >23% for three systems even without attacks).
  • Proposes unsupervised FP reduction (TF‑IDF + Louvain) reducing Nodlink FPR ~25%→~10% and grouping FPs to cut manual effort.
  • Skepticism: single-organization data and non-shareability limit reproducibility and external validity.
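The unsupervised false-positive grouping in item 5 (TF-IDF over alert content, then community detection) can be sketched end to end with stdlib-plus-numpy. This is my own illustration, not the paper's code: the TF-IDF is a bare-bones implementation, and connected components over a cosine-similarity graph stand in for the Louvain community detection the paper reportedly uses; the intent is the same, letting an analyst review one representative per group instead of every alert.

```python
import numpy as np
from collections import defaultdict

def tfidf(docs):
    """Minimal TF-IDF over whitespace-tokenized alert descriptions."""
    vocab = sorted({tok for d in docs for tok in d.split()})
    idx = {t: i for i, t in enumerate(vocab)}
    tf = np.zeros((len(docs), len(vocab)))
    for r, d in enumerate(docs):
        for tok in d.split():
            tf[r, idx[tok]] += 1
    df = (tf > 0).sum(axis=0)                 # document frequency per token
    return tf * (np.log(len(docs) / df) + 1.0)

def group_alerts(docs, threshold=0.5):
    """Group similar alerts via a cosine-similarity graph.

    Edges connect alert pairs above the similarity threshold; connected
    components (via union-find) are the resulting groups.
    """
    X = tfidf(docs)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    sim = X @ X.T
    n = len(docs)
    parent = list(range(n))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]     # path halving
            a = parent[a]
        return a
    for i in range(n):
        for j in range(i + 1, n):
            if sim[i, j] >= threshold:
                parent[find(i)] = find(j)
    groups = defaultdict(list)
    for i in range(n):
        groups[find(i)].append(i)
    return list(groups.values())

# Two near-duplicate benign alerts collapse into one group; the odd one stays alone.
alerts = [
    "backup cron job spawned tar",
    "backup cron job spawned gzip",
    "browser opened unusual socket",
]
groups = group_alerts(alerts)
```

The reported ~25%→~10% FPR reduction comes from discarding (or down-ranking) whole groups that an analyst has triaged once, rather than from any change to the detector itself.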

5) Practical next steps

  • If you train tool-using agents: treat tool calls as policy tokens and optimize trajectory-level rewards (AgenticRec-style), then add a second-stage hard-negative refinement to improve top‑K discrimination.
  • If you do multimodal safety/clinical ML: evaluate selective metrics (coverage curves) and build label-free “uncertainty context sets” (corruptions + modality mismatch) to stress-test deferral behavior (MedCertAIn).
  • If you deploy federated PEFT: avoid separate A/B averaging for LoRA; consider server-side ΔW aggregation + low-rank reconstruction (FedMomentum) and measure convergence speed vs aggregation compute/downlink.
  • If you defend FL against poisoning: add persistence-aware client scoring + exclusion and plan for remediation/unlearning of past contributions (FedTrident), then test against dynamic source/target flips.
  • If you rely on AI-text detectors: require cross-dataset and cross-generator evaluation plus explanation audits (SHAP-style) before deployment; benchmark accuracy alone is not a validity guarantee.
  • If you evaluate leakage or privacy risk: prefer distribution-sensitive tests (ADLA) where mean-shift tests can fail; for MIAs, report reliability under realistic priors (weighted precision) and attacker cost (C0–C4 framework).
  • If you build moderation datasets: decompose constructs (tone vs intolerance; actor + direction for values) and track moderation-relevant error asymmetries (e.g., FNR−FPR), not just accuracy/F1.
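The "reliability under realistic priors" point for membership inference above is just Bayes' rule, but it is worth running the numbers: a TPR/FPR pair that looks strong on a balanced benchmark collapses when only a small fraction of candidates are actually members. A short sketch (the function name is mine; the C0–C4 attacker-cost framework is from the paper and is not modeled here):

```python
def mia_precision(tpr, fpr, prior):
    """Attack precision when a fraction `prior` of candidates are members.

    Benchmark MIA evaluations often assume a balanced 50/50 member split;
    at a realistic low prior the same TPR/FPR yields far lower precision.
    """
    tp = tpr * prior                # expected true positives per candidate
    fp = fpr * (1.0 - prior)        # expected false positives per candidate
    return tp / (tp + fp)

# Same attack, two priors: ~0.86 precision on a balanced benchmark,
# but under 6% when only 1% of candidates are members.
balanced = mia_precision(tpr=0.6, fpr=0.1, prior=0.5)
realistic = mia_precision(tpr=0.6, fpr=0.1, prior=0.01)
```

Reporting precision at the deployment-relevant prior, as the critical review recommends, makes this gap visible before an MIA score is used as a privacy metric.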

Generated from per-paper analyses; no external browsing.