Daily AI Paper Report (2026-03-30)
Published:
Chinese version: [中文]
Run stats
- Candidates: 1714
- Selected: 30
- Deepread completed: 30
- Window (UTC): 2026-03-27T00:00:00Z → 2026-03-28T00:00:00Z (weekend_backlog_sat, expanded=0)
Selected papers
| arXiv ID | Title / Links | Categories | Score | Why | Tags |
|---|---|---|---|---|---|
| 2603.24570 | Anti-I2V: Safeguarding your photos from malicious image-to-video generation | cs.CV, cs.AI | 90 | Targets misuse: adversarial protection vs image-to-video diffusion incl. DiT; timely safety angle | misuse-prevention, adversarial-perturbations, diffusion, video-generation, deepfakes, DiT |
| 2603.21698 | A Blueprint for Self-Evolving Coding Agents in Vehicle Aerodynamic Drag Prediction | cs.AI | 88 | Contract-centric self-evolving coding agents; strong for agentic reliability, leakage control, reproducibility. | agents, coding-agents, autonomous-optimization, evaluation-contracts, reproducibility, leakage-prevention |
| 2603.22179 | MARCUS: An agentic, multimodal vision-language model for cardiac diagnosis and management | cs.AI | 88 | Agentic multimodal VLM for cardiac diagnosis across ECG/echo/CMR; large-scale training, real deployment relevance | agentic-systems, multimodal, vision-language, medical-ai, orchestration, clinical-decision-support |
| 2603.17838 | Event-Centric Human Value Understanding in News-Domain Texts: An Actor-Conditioned, Multi-Granularity Benchmark | cs.CL | 86 | NEVU benchmark for actor-conditioned, direction-aware human values in factual news; useful for alignment evals. | alignment, values, benchmark, evaluation, news, grounding |
| 2603.23966 | Policy-Guided Threat Hunting: An LLM enabled Framework with Splunk SOC Triage | cs.CR, cs.AI | 86 | Agentic LLM framework for SOC triage/threat hunting; high real-world security relevance. | agentic-ai, cybersecurity, SOC, SIEM, threat-hunting, LLM-tools |
| 2603.21613 | AgenticRec: End-to-End Tool-Integrated Policy Optimization for Ranking-Oriented Recommender Agents | cs.IR, cs.AI | 86 | End-to-end policy optimization for tool-using LLM recommender agents; trajectory-level feedback linkage. | agents, tool-use, policy-optimization, ReAct, RL, recommenders, evaluation |
| 2603.18813 | Can LLM generate interesting mathematical research problems? | cs.AI | 86 | Agent+benchmark for LLM mathematical creativity; 665 novel research problems w/ expert verification | LLM, agents, evaluation, creativity, math, benchmark |
| 2603.23146 | Why AI-Generated Text Detection Fails: Evidence from Explainable AI Beyond Benchmark Accuracy | cs.CL, cs.AI | 86 | Shows AI-text detectors fail via artifacts; adds explainability beyond benchmark accuracy | AI-generated-text, detection, dataset-artifacts, robustness, explainable-AI, evaluation |
| 2603.23160 | UniDial-EvalKit: A Unified Toolkit for Evaluating Multi-Faceted Conversational Abilities | cs.CL | 86 | Unified toolkit for multi-turn dialogue eval; improves comparability and scalable interactive testing | evaluation, dialogue, toolkit, benchmarks, interactive-systems, metrics |
| 2603.24051 | FinToolSyn: A forward synthesis Framework for Financial Tool-Use Dialogue Data with Dynamic Tool Retrieval | cs.CL | 86 | Forward-synth tool-use dialogues w/ dynamic tool retrieval; useful for agent tool-use training/eval | LLM agents, tool use, synthetic data, dialogue generation, retrieval, finance |
| 2603.23447 | 3DCity-LLM: Empowering Multi-modality Large Language Models for 3D City-scale Perception and Understanding | cs.CV, cs.AI | 86 | City-scale multimodal LLM framework + 1.2M dataset for 3D perception/planning; strong reuse potential | multimodal-llm, 3d, city-scale, dataset, planning, vision-language |
| 2603.22985 | Beyond Hate: Differentiating Uncivil and Intolerant Speech in Multimodal Content Moderation | cs.CL, cs.CY | 86 | Fine-grained multimodal toxicity labels (incivility vs intolerance); improves moderation modeling & eval. | content-moderation, multimodal, toxicity, dataset, evaluation, vision-language |
| 2603.18779 | SoK: Practical Aspects of Releasing Differentially Private Graphs | cs.CR, cs.SI | 86 | SoK on practical DP graph release; clarifies guarantees, pitfalls, and utility tradeoffs. | privacy, differential-privacy, graphs, systematization-of-knowledge, deployment |
| 2603.08014 | FedMomentum: Preserving LoRA Training Momentum in Federated Fine-Tuning | cs.LG, cs.AI | 86 | Federated LoRA aggregation fix; better convergence + privacy-preserving LLM finetuning practicality | federated-learning, LoRA, LLM-finetuning, privacy, optimization |
| 2603.23178 | SAiW: Source-Attributable Invisible Watermarking for Proactive Deepfake Defense | cs.AI | 84 | Source-attributable invisible watermarking for proactive deepfake defense and provenance verification. | security, deepfakes, watermarking, provenance, media-integrity |
| 2603.24213 | Uncovering Memorization in Timeseries Imputation models: LBRM Membership Inference and its link to attribute Leakage | cs.LG, cs.AI | 84 | Black-box membership+attribute inference for time-series imputation; concrete privacy leakage. | privacy, membership-inference, attribute-inference, timeseries, security, leakage |
| 2603.22987 | A Critical Review on the Effectiveness and Privacy Threats of Membership Inference Attacks | cs.CR, cs.LG | 84 | Clarifies when membership inference is a real privacy threat; warns against overusing MIA as metric. | privacy, membership-inference, security, evaluation, threat-models, ml-security |
| 2603.17328 | A Progressive Visual-Logic-Aligned Framework for Ride-Hailing Adjudication | cs.AI, cs.LG | 84 | Targets multimodal hallucination/logic looseness via evidentiary protocol + synthetic grounding engine | multimodal LLM, reliability, hallucinations, structured reasoning, evaluation, decision support |
| 2603.22977 | DariMis: Harm-Aware Modeling for Dari Misinformation Detection on YouTube | cs.CL, cs.AI, cs.LG | 84 | First large Dari YouTube misinformation+harm dataset; useful for safety triage in low-resource settings. | misinformation, dataset, low-resource, harm-assessment, content-moderation, YouTube |
| 2603.21515 | When the Abyss Looks Back: Unveiling Evolving Dark Patterns in Cookie Consent Banners | cs.CR | 83 | Detects newly evolved cookie-consent dark patterns (DP11–DP19); practical privacy/security measurement. | privacy, security, dark-patterns, measurement, compliance, web |
| 2603.23279 | Emergence of Fragility in LLM-based Social Networks: the Case of Moltbook | cs.SI, cs.AI | 82 | Large-scale empirical study of LLM-agent social network fragility; relevant to multi-agent risk dynamics. | multi-agent, LLM-agents, emergent-behavior, network-science, systemic-risk |
| 2603.19152 | VEPO: Variable Entropy Policy Optimization for Low-Resource Language Foundation Models | cs.CL, cs.AI | 82 | RL w/ verifiable rewards + variable entropy to enforce constraints; targets low-resource LM reliability | alignment, RLVR, reliability, low-resource, constraints, training |
| 2603.22015 | Retrieving Climate Change Disinformation by Narrative | cs.CL | 82 | Narrative retrieval for climate disinfo without fixed labels; supports emerging narrative tracking | misinformation, retrieval, narratives, climate, evaluation, synthetic-data |
| 2603.22988 | Robustness Quantification and Uncertainty Quantification: Comparing Two Methods for Assessing the Reliability of Classifier Predictions | cs.LG | 82 | Compares robustness vs uncertainty for per-prediction reliability, incl. distribution shift; practical for safety | reliability, uncertainty, robustness, distribution-shift, calibration, evaluation |
| 2603.21619 | Efficient Zero-Shot AI-Generated Image Detection | cs.CV, cs.AI | 82 | Training-free, fast AI-generated image detection via frequency-perturbation sensitivity; good generalization angle | ai-generated-content, detection, robustness, forensics, security, frequency-domain |
| 2603.22982 | How Far Should We Need to Go : Evaluate Provenance-based Intrusion Detection Systems in Industrial Scenarios | cs.CR | 82 | Systematic industrial eval of provenance-based IDS; highlights real-world gaps vs DARPA-style benchmarks. | security, intrusion-detection, provenance, evaluation, datasets, robustness |
| 2603.18647 | Beyond TVLA: Anderson-Darling Leakage Assessment for Neural Network Side-Channel Leakage Detection | cs.CR, cs.AI | 82 | Better side-channel leakage test for NN implementations; full-distribution vs mean-shift TVLA. | security, side-channels, leakage-detection, neural-networks, evaluation |
| 2603.08459 | Data-Driven Priors for Uncertainty-Aware Deterioration Risk Prediction with Multimodal Data | cs.LG | 82 | Uncertainty-aware multimodal prediction with data-driven priors; reliability angle transferable beyond health | uncertainty, calibration, multimodal, reliability, bayesian |
| 2603.22846 | CoMaTrack: Competitive Multi-Agent Game-Theoretic Tracking with Vision-Language-Action Models | cs.AI | 80 | Competitive multi-agent training + new benchmark for embodied tracking; useful for adversarial agent evals. | agents, multi-agent-RL, benchmark, embodied-ai, adversarial-training, VLA |
| 2603.19101 | FedTrident: Resilient Road Condition Classification Against Poisoning Attacks in Federated Learning | cs.CR, cs.AI, cs.DC, cs.LG | 80 | Federated learning defense vs targeted label-flipping in ITS; safety-critical robustness. | federated-learning, data-poisoning, robustness, transportation, label-flipping |
AI Paper Insight Brief
2026-03-30
0) Executive takeaways (read this first)
- “End-to-end policy optimization” is spreading beyond chat: multiple papers use GRPO/PPO-style RL to optimize full trajectories (tool calls, multimodal reasoning, embodied control), not just final answers—suggesting a near-term shift in how agentic systems are trained and evaluated.
- Reliability is being operationalized as “selective action” (defer/contain/reject) rather than just calibration: clinical risk prediction (MedCertAIn), SOC triage (policy-guided threat hunting), and robustness-vs-uncertainty (RQ vs UQ) all emphasize decision workflows.
- Data/benchmark work is becoming more structure-heavy and direction-aware: NEVU (actor-conditioned, direction-aware values) and “Beyond Hate” (incivility vs intolerance) show a move from coarse labels to decomposed constructs that better match governance needs.
- Federated and privacy/security work is getting more “systems-realistic”: FedMomentum fixes a concrete LoRA aggregation failure mode (momentum loss); FedTrident adds persistent client exclusion + remediation (unlearning); industrial PIDS evaluation shows large portability drops and high FPRs in real enterprise logs.
- Proactive and measurement-driven defenses are gaining ground: invisible source-attributable watermarking (SAiW), anti–image-to-video cloaking (Anti-I2V), and cookie-banner dark-pattern measurement (UMBRA) tie defenses to concrete post-interaction or post-processing behaviors.
2) Key themes (clusters)
Theme: RL for agentic trajectories (tools, multimodal reasoning, embodied control)
- Why it matters: Training agents on final-task metrics without credit assignment to intermediate steps leads to brittle tool use and reasoning. These works treat the whole trajectory as the policy output and optimize it directly.
- Representative papers:
- AgenticRec: End-to-End Tool-Integrated Policy Optimization for Ranking-Oriented Recommender Agents
- MARCUS: An agentic, multimodal vision-language model for cardiac diagnosis and management
- CoMaTrack: Competitive Multi-Agent Game-Theoretic Tracking with Vision-Language-Action Models
- VEPO: Variable Entropy Policy Optimization for Low-Resource Language Foundation Models
- Common approach:
- Optimize trajectory-level objectives (e.g., list-wise NDCG rewards; MCQ accuracy; tracking rewards) with GRPO-style updates and KL constraints to stabilize.
- Add structure to prevent collapse: variable/position-aware entropy control (VEPO), hard-negative refinement (AgenticRec PPR), competitive opponents as curriculum (CoMaTrack).
- Use lightweight adaptation during RL (e.g., LoRA during RL in CoMaTrack; staged training in MARCUS).
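The shared GRPO-style update can be sketched in a few lines: sample a group of trajectories for the same prompt, normalize rewards within the group to get advantages, and apply a clipped policy-gradient loss with a KL penalty toward a reference (e.g., SFT) policy. A minimal NumPy sketch; the clip range and KL coefficient are illustrative defaults, not values reported by any of these papers:

```python
import numpy as np

def grpo_loss(logp_new, logp_old, logp_ref, rewards, clip_eps=0.2, kl_coef=0.04):
    """GRPO-style loss for one group of G trajectories from the same prompt.

    logp_new/logp_old/logp_ref: per-trajectory summed token log-probs under
    the current, sampling, and reference policies (shape [G]).
    rewards: scalar trajectory-level rewards (shape [G]).
    """
    # Group-relative advantage: normalize rewards within the group,
    # so no learned critic is needed.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # PPO-style clipped importance ratio against the sampling policy.
    ratio = np.exp(logp_new - logp_old)
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps)
    pg = -np.minimum(ratio * adv, clipped * adv).mean()
    # KL penalty toward the reference policy stabilizes training.
    kl = (logp_new - logp_ref).mean()
    return pg + kl_coef * kl
```

Note that when all rewards in a group are equal, the advantage (and hence the policy-gradient term) vanishes and only the KL penalty remains, which is one reason reward design matters so much for these methods.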
- Open questions / failure modes:
- How robust are these policies under distribution shift when tools change, observations are missing, or opponents become non-stationary?
- Reward hacking / “policy collapse” remains a central risk (explicitly targeted by VEPO; implicitly relevant to tool overuse and trajectory formatting constraints).
- Offline-to-online gap: AgenticRec is offline with fixed candidate pools; real-world feedback loops and exploration costs are untested.
Theme: Selective prediction & reliability as a workflow primitive
- Why it matters: In high-stakes settings, the key product is often “know when not to act.” These papers emphasize deferral/containment decisions and reliability ranking rather than only average accuracy.
- Representative papers:
- MedCertAIn (label-free uncertainty context sets for clinical risk prediction)
- Policy-Guided Threat Hunting: An LLM enabled Framework with Splunk SOC Triage
- Robustness Quantification and Uncertainty Quantification: Comparing Two Methods for Assessing the Reliability of Classifier Predictions
- Common approach:
- Use uncertainty/reliability scores to drive coverage vs performance trade-offs (selective AUROC/AUPRC; accuracy–rejection curves).
- Construct “hard/uncertain” sets without labels (MedCertAIn corruptions + cross-modal mismatch) or via robustness neighborhoods (RQ via ε-contamination).
- Couple a lightweight filter/policy with expensive downstream review (SOC: DRL action × anomaly score to decide LLM triage).
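The coverage-vs-performance trade-off behind these workflows can be made concrete with an accuracy-rejection curve: reject the most-uncertain predictions first and track accuracy on what remains. A minimal sketch (the coverage grid is an arbitrary choice):

```python
import numpy as np

def accuracy_rejection_curve(correct, uncertainty, coverages=(1.0, 0.8, 0.6, 0.4)):
    """Accuracy at each coverage level when the most-uncertain
    predictions are rejected first.

    correct: 0/1 array, 1 if the prediction was right.
    uncertainty: higher = less reliable; drives the rejection order.
    """
    order = np.argsort(uncertainty)          # most confident first
    correct = np.asarray(correct, float)[order]
    n = len(correct)
    curve = {}
    for c in coverages:
        k = max(1, int(round(c * n)))        # keep the k most confident
        curve[c] = correct[:k].mean()
    return curve
```

If the uncertainty scores are informative, accuracy should rise monotonically as coverage drops; a flat curve means the scores do not rank reliability and deferral buys nothing.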
- Open questions / failure modes:
- Mean-field VI and heuristic corruptions may misrepresent real clinical shifts (MedCertAIn).
- RQ results are shown for Naive Bayes / Generative Forests on discrete UCI datasets—unclear transfer to deep nets.
- SOC pipeline uses binary actions and fixed 5-minute windows; short-lived attacks and richer action spaces are not handled.
Theme: Data-centric alignment benchmarks with decomposed constructs
- Why it matters: Coarse labels hide what models actually learn and create governance blind spots. These datasets explicitly separate who holds a value, direction (aligned vs contradictory), and what kind of harm is present.
- Representative papers:
- Event-Centric Human Value Understanding in News-Domain Texts (NEVU)
- Beyond Hate: Differentiating Uncivil and Intolerant Speech in Multimodal Content Moderation
- DariMis: Harm-Aware Modeling for Dari Misinformation Detection on YouTube
- Retrieving Climate Change Disinformation by Narrative
- Common approach:
- Replace single labels with structured targets (actor-conditioned directed values; incivility vs intolerance; misinformation × harm).
- Evaluate failure modes that matter operationally (direction reversal rate in NEVU; FNR−FPR moderation bias in Beyond Hate; narrative variance as retrieval difficulty).
- Show lightweight adaptation can matter a lot (NEVU: LoRA-finetuned open models outperform prompting-only baselines).
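The FNR−FPR asymmetry used as a moderation-bias signal is simple to compute; a minimal sketch assuming a binary "harmful" label:

```python
def fnr_fpr_gap(y_true, y_pred):
    """Under-detection gap for a binary moderation label.

    FNR - FPR > 0 means the model misses harmful content more often
    than it flags benign content, the asymmetry highlighted in Beyond Hate.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return fnr - fpr
```

Reporting this gap per construct (incivility vs intolerance) rather than one accuracy number is what exposes the under-detection these datasets are designed to surface.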
- Open questions / failure modes:
- Annotation noise and long-tail labels (NEVU) and limited subset size (Beyond Hate: 2,030 memes) may cap conclusions.
- Domain dependence: climate narrative retrieval relies on NodeRAG summaries and hypothetical generation; runtime and proprietary dependencies are noted.
- DariMis is text-only metadata; video/audio signals and joint harm prediction are future work.
Theme: Federated learning robustness: from optimization pathologies to poisoning + remediation
- Why it matters: FL is moving from “average updates” to maintaining training dynamics and handling persistent adversaries—both critical for real deployments.
- Representative papers:
- FedMomentum: Preserving LoRA Training Momentum in Federated Fine-Tuning
- FedTrident: Resilient Road Condition Classification Against Poisoning Attacks in Federated Learning
- Common approach:
- Fix structural mismatches in parameterization/aggregation (FedMomentum: aggregate ΔW=ΣBiAi then truncated SVD to reconstruct rank‑r LoRA; merge residuals into backbone).
- Go beyond per-round filtering: maintain history and act on persistent clients (FedTrident rating + blacklist) and remediate global state (approximate unlearning by subtracting stored contributions).
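The FedMomentum aggregation idea can be sketched directly: average the full ΔW products (not the A/B factors separately), recover a balanced rank-r factorization by truncated SVD, and keep the residual for merging into the backbone. A sketch with plain (non-randomized) SVD for clarity; shapes and rank are illustrative, and the paper's randomized SVD and residual-threshold logic are not reproduced here:

```python
import numpy as np

def aggregate_lora(deltas, rank):
    """FedMomentum-style server aggregation sketch.

    deltas: list of per-client updates dW_i = B_i @ A_i (full d_out x d_in).
    Returns balanced rank-r factors (B, A) plus the residual that would be
    merged into the backbone weights.
    """
    delta_w = sum(deltas) / len(deltas)        # average dW, not A/B separately
    u, s, vt = np.linalg.svd(delta_w, full_matrices=False)
    u_r, s_r, vt_r = u[:, :rank], s[:rank], vt[:rank]
    sqrt_s = np.sqrt(s_r)
    b = u_r * sqrt_s                           # balanced factorization:
    a = sqrt_s[:, None] * vt_r                 # split singular values evenly
    residual = delta_w - b @ a                 # energy outside rank r
    return b, a, residual
```

Averaging B and A separately is the failure mode being fixed: mean(B_i) @ mean(A_i) is generally not mean(B_i @ A_i), so the averaged factors misrepresent the aggregate update.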
- Open questions / failure modes:
- FedMomentum adds server compute (randomized SVD ~0.60s/round reported) and downlink depends on residual rank/threshold.
- FedTrident assumes TLFA footprints are visible via output-layer neuron analysis; deeper-layer or more evasive attacks aren’t evaluated.
Theme: Security & privacy evaluation gets more distribution-sensitive and deployment-grounded
- Why it matters: “Standard tests” often miss real leakage or real-world failure modes. Several papers propose stronger tests or show that lab benchmarks overstate readiness.
- Representative papers:
- Beyond TVLA: Anderson-Darling Leakage Assessment for Neural Network Side-Channel Leakage Detection
- How Far Should We Need to Go: Evaluate Provenance-based Intrusion Detection Systems in Industrial Scenarios
- SoK: Practical Aspects of Releasing Differentially Private Graphs
- A Critical Review on the Effectiveness and Privacy Threats of Membership Inference Attacks
- Common approach:
- Use tests sensitive to full distributions (ADLA vs Welch t-test) to detect leakage under countermeasures.
- Evaluate on industrial or attack-based settings rather than only benchmark claims (PIDS portability drops; DP graph release attacked via link prediction/reconstruction).
- Formalize “realistic threat” conditions (MIA C0–C4; weighted precision under realistic priors).
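The motif is easy to demonstrate: a mean-shift test sees nothing when two trace distributions differ only in variance, while a full-distribution test does. The paper's statistic is Anderson-Darling; the sketch below uses a self-contained two-sample Kolmogorov-Smirnov statistic as a stand-in distribution-sensitive test, on synthetic samples with matched means:

```python
import numpy as np

def welch_t(x, y):
    """Welch's t statistic: sensitive only to a mean shift."""
    return (x.mean() - y.mean()) / np.sqrt(
        x.var(ddof=1) / len(x) + y.var(ddof=1) / len(y))

def ks_stat(x, y):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    allv = np.sort(np.concatenate([x, y]))
    cdf_x = np.searchsorted(np.sort(x), allv, side="right") / len(x)
    cdf_y = np.searchsorted(np.sort(y), allv, side="right") / len(y)
    return np.abs(cdf_x - cdf_y).max()

rng = np.random.default_rng(0)
n = 2000
# Same mean, different variance: centering makes the means match exactly,
# so any mean-shift test is blind to the difference by construction.
x = rng.normal(0, 1, n); x -= x.mean()
y = rng.normal(0, 3, n); y -= y.mean()
# Asymptotic two-sample KS critical value at alpha = 0.05.
crit = 1.358 * np.sqrt((n + n) / (n * n))
```

Here welch_t(x, y) is numerically zero while ks_stat(x, y) far exceeds the critical value; the same logic is why ADLA can flag leakage that TVLA-style mean comparisons miss.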
- Open questions / failure modes:
- ADLA is shown on a constrained MLP implementation (only first hidden neuron on-device) and one protection setup.
- Industrial PIDS data can’t be shared; generalization across orgs remains uncertain.
- DP graph release results highlight persistent attack success; dependency-aware privacy mechanisms remain open.
3) Technical synthesis
- GRPO is emerging as a common RL primitive across domains: recommender ranking (list-wise GRPO), multimodal clinical MCQs (MARCUS uses GRPO), and embodied tracking (CoMaTrack uses GRPO with KL to SFT).
- Curriculum via adversaries vs via priors: CoMaTrack uses competitive opponents to self-escalate difficulty; MedCertAIn uses label-free “high-uncertainty” context sets (corruptions + cross-modal mismatch) to shape Bayesian priors.
- “Trajectory = policy output” is the unifying agent training abstraction: AgenticRec explicitly includes Think/Act/Obs and Rank tokens; CoMaTrack jointly emits language + waypoints; SOC triage uses a policy layer to gate expensive LLM analysis.
- Structured labels reduce governance-relevant error asymmetries: Beyond Hate shows coarse hate labels induce large under-detection (FNR−FPR) that improves with intolerance supervision; NEVU reduces direction reversal after LoRA.
- Aggregation correctness matters for PEFT in FL: FedMomentum’s key move is aggregating BA products (ΔW) then reconstructing low-rank structure via truncated SVD, rather than averaging A/B separately.
- “Remediation” is becoming first-class in defenses: FedTrident subtracts historical contributions (approx unlearning) after blacklisting; watermarking (SAiW) and cloaking (Anti‑I2V) aim to prevent misuse at creation time rather than detect after the fact.
- Distribution-sensitive testing is a recurring motif: ADLA targets non-mean leakage; AI-text detection paper uses SHAP to show feature reliance shifts across corpora; PIDS evaluation shows cross-host/platform AUC drops.
- Retrieval is being used to escape fixed taxonomies: climate narrative retrieval uses HyDE-style speculative docs + NodeRAG community summaries; this parallels broader moves away from fixed-label classification in evolving domains.
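The HyDE-style flow mentioned above, stripped to its skeleton: write a hypothetical answer document for the query, embed it, and rank the corpus against that embedding instead of the raw query. A sketch with bag-of-words cosine standing in for learned embeddings and a stubbed generator standing in for the LLM call (both are illustrative substitutes, not the paper's components):

```python
from collections import Counter
import math

def bow_vector(text):
    """Bag-of-words term counts (stand-in for a learned embedding)."""
    return Counter(text.lower().split())

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u if t in v)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def hyde_retrieve(query, corpus, generate_hypothetical, top_k=2):
    """HyDE-style retrieval: embed a *hypothetical* document written for
    the query, then rank the corpus by similarity to it."""
    hypo = generate_hypothetical(query)   # in practice, an LLM call
    hv = bow_vector(hypo)
    ranked = sorted(corpus, key=lambda d: cosine(hv, bow_vector(d)), reverse=True)
    return ranked[:top_k]
```

The design point is that the hypothetical document lives in the same space as the corpus, so short or oblique queries ("models can't be trusted") can still match documents that share the narrative's vocabulary rather than the query's.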
4) Top 5 papers (with “why now”)
1) MARCUS: An agentic, multimodal vision-language model for cardiac diagnosis and management
- Trains modality experts (ECG/echo/CMR) plus an orchestrator that decomposes queries and aggregates outputs.
- Reports strong multimodal integration accuracy (70.0%) vs GPT‑5 Thinking (22.5%) and Gemini 2.5 Pro (27.5%).
- Uses counterfactual probing (including image-absent probes) to mitigate “mirage reasoning,” reporting 0% system-level mirage rate.
- Skepticism: training data development is single-center; evaluation is retrospective and benchmark-based (MCQ/VQA), not prospective clinical impact.
2) FedMomentum: Preserving LoRA Training Momentum in Federated Fine-Tuning
- Identifies “loss of training momentum” from structurally incorrect LoRA aggregation in FL.
- Aggregates correct ΔW=ΣBiAi then reconstructs rank‑r LoRA via truncated randomized SVD with balanced factorization; merges residual energy into backbone.
- Shows consistent gains across math, commonsense, and code; randomized SVD makes aggregation feasible (0.60s/round vs exact >1000s).
- Skepticism: added server compute and downlink cost depends on residual threshold/rank; experiments are limited to specific settings (e.g., LLaMA2‑7B, 10 clients).
3) CoMaTrack: Competitive Multi-Agent Game-Theoretic Tracking with Vision-Language-Action Models
- Introduces competitive multi-agent RL for embodied visual tracking, using an opponent as an automatic curriculum.
- Shows multi-agent RL improves over SFT and single-agent RL (e.g., STT SR 88.2 → 89.5 → 92.1).
- Releases CoMaTrack-Bench for adversarial EVT evaluation; strong zero-shot gains vs a baseline on the new benchmark.
- Skepticism: opponent realism and multi-agent non-stationarity/compute cost are acknowledged limitations.
4) When the Abyss Looks Back: Unveiling Evolving Dark Patterns in Cookie Consent Banners
- UMBRA detects 19 dark patterns (including 9 “evolved” interaction-dependent ones) and links them to cookie-setting behavior.
- Large-scale measurement on 14K sites; reports high detection accuracy (up to 99% for DP11–DP19) and concrete post-rejection cookie persistence (fake opt-outs).
- Connects UI manipulation to security-relevant cookie attributes (e.g., XSS exposure), moving beyond “compliance UX” framing.
- Skepticism: heuristic/lexicon rules may need continual updates; DOM obfuscation and device rendering differences can evade measurement.
5) How Far Should We Need to Go: Evaluate Provenance-based IDS in Industrial Scenarios
- Tests five anomaly-based PIDSes on real enterprise provenance logs; shows large portability drops (avg AUC −26.77% across hosts, −38.03% across platforms).
- Finds high false positives on ever-changing hosts (FPR >23% for three systems even without attacks).
- Proposes unsupervised FP reduction (TF‑IDF + Louvain) reducing Nodlink FPR ~25%→~10% and grouping FPs to cut manual effort.
- Skepticism: single-organization data and non-shareability limit reproducibility and external validity.
5) Practical next steps
- If you train tool-using agents: treat tool calls as policy tokens and optimize trajectory-level rewards (AgenticRec-style), then add a second-stage hard-negative refinement to improve top‑K discrimination.
- If you do multimodal safety/clinical ML: evaluate selective metrics (coverage curves) and build label-free “uncertainty context sets” (corruptions + modality mismatch) to stress-test deferral behavior (MedCertAIn).
- If you deploy federated PEFT: avoid separate A/B averaging for LoRA; consider server-side ΔW aggregation + low-rank reconstruction (FedMomentum) and measure convergence speed vs aggregation compute/downlink.
- If you defend FL against poisoning: add persistence-aware client scoring + exclusion and plan for remediation/unlearning of past contributions (FedTrident), then test against dynamic source/target flips.
- If you rely on AI-text detectors: require cross-dataset and cross-generator evaluation plus explanation audits (SHAP-style) before deployment; benchmark accuracy alone is not a validity guarantee.
- If you evaluate leakage or privacy risk: prefer distribution-sensitive tests (ADLA) where mean-shift tests can fail; for MIAs, report reliability under realistic priors (weighted precision) and attacker cost (C0–C4 framework).
- If you build moderation datasets: decompose constructs (tone vs intolerance; actor + direction for values) and track moderation-relevant error asymmetries (e.g., FNR−FPR), not just accuracy/F1.
Generated from per-paper analyses; no external browsing.
