Daily AI Paper Report (2026-03-25)
Run stats
- Candidates: 223
- Selected: 30
- Deepread completed: 30
- Window (UTC): 2026-03-23T00:00:00Z → 2026-03-24T00:00:00Z (arxiv_announce, expanded=0)
Selected papers
| arXiv ID | Title / Links | Categories | Score | Why | Tags |
|---|---|---|---|---|---|
| 2603.21697 | Structured Visual Narratives Undermine Safety Alignment in Multimodal Large Language Models | cs.CR, cs.AI, cs.MM | 95 | Comic-based multimodal jailbreak benchmark; very high attack success across 15 MLLMs. | multimodal-safety, jailbreaks, benchmark, red-teaming, MLLM, adversarial-prompts |
| 2603.21687 | Mirage: The Illusion of Visual Understanding | cs.AI | 95 | Shows multimodal benchmarks can be gamed with no image; exposes "mirage reasoning" reliability failure. | multimodal, evaluation, hallucination, reliability, benchmarking, medical-ai |
| 2603.21642 | Are AI-assisted Development Tools Immune to Prompt Injection? | cs.CR, cs.SE | 93 | First empirical prompt-injection/tool-poisoning study across 7 real MCP dev clients. | prompt-injection, tool-poisoning, MCP, agent-security, empirical-study, secure-tool-use |
| 2603.21972 | Demystifying Reinforcement Learning for Long-Horizon Tool-Using Agents: A Comprehensive Recipe | cs.LG, cs.CL | 92 | Empirical recipe for scaling RL in long-horizon tool agents; actionable axes + takeaways on TravelPlanner. | tool-using agents, long-horizon RL, RLHF/RLVR, agent evaluation, reward design, planning |
| 2603.22117 | On the Direction of RLVR Updates for LLM Reasoning: Identification and Exploitation | cs.LG, cs.AI | 92 | Token-level signed Δlog p reveals reasoning-critical RLVR updates; actionable analysis + interventions. | LLM, RLVR, reasoning, post-training, mechanistic-analysis, token-level |
| 2603.21641 | Auditing MCP Servers for Over-Privileged Tool Capabilities | cs.CR, cs.SE | 90 | Practical auditing toolkit for over-privileged MCP servers with static+dynamic fuzzing. | MCP, tool-permissions, sandboxing, security-audit, fuzzing, eBPF, agent-infra |
| 2603.21461 | DSPA: Dynamic SAE Steering for Data-Efficient Preference Alignment | cs.LG, cs.AI, cs.CL | 90 | Inference-time preference alignment via prompt-conditional SAE steering; compute-light with strong benchmarks. | alignment, preference optimization, SAE, steering, mechanistic interpretability, inference-time control |
| 2603.21558 | Stabilizing Iterative Self-Training with Verified Reasoning via Symbolic Recursive Self-Alignment | cs.AI | 90 | Stabilizes recursive self-training by step-level symbolic verification; targets drift/mode collapse risk. | self-training, recursive-self-improvement, verification, neuro-symbolic, reasoning, safety |
| 2603.21469 | Hardening Confidential Federated Compute against Side-channel Attacks | cs.CR, cs.DS | 90 | Finds side-channels that can bypass DP in confidential federated compute; proposes mitigations. | privacy, differential-privacy, federated-learning, side-channels, security, confidential-compute |
| 2603.21975 | SecureBreak -- A dataset towards safe and secure models | cs.CR, cs.AI, cs.CL, cs.LG | 88 | Security-focused dataset for robustness evaluation/training against jailbreaks/injection. | dataset, security-alignment, jailbreaks, prompt-injection, robustness-eval, guardrails |
| 2603.22214 | Evaluating the Reliability and Fidelity of Automated Judgment Systems of Large Language Models | cs.CR, cs.AI, cs.LG | 88 | Systematic study of LLM-as-judge reliability vs humans; important for scalable eval and security assessment. | evaluation, LLM-as-judge, reliability, human agreement, model auditing, safety eval |
| 2603.21693 | Deterministic Hallucination Detection in Medical VQA via Confidence-Evidence Bayesian Gain | cs.AI | 88 | Single-pass logprob-based medical MLLM hallucination detection; avoids costly multi-sample entropy methods. | hallucination-detection, MLLM, medical, VQA, uncertainty, logprobs, reliability |
| 2603.21654 | Towards Secure Retrieval-Augmented Generation: A Comprehensive Review of Threats, Defenses and Benchmarks | cs.CR, cs.AI | 86 | Comprehensive RAG security review: threats (poisoning/inference) + defenses + benchmarks. | RAG, security, data-poisoning, membership-inference, defenses, survey, benchmarking |
| 2603.21523 | SafePilot: A Framework for Assuring LLM-enabled Cyber-Physical Systems | cs.RO, cs.AI | 86 | Safety assurance framework for LLM-enabled cyber-physical systems; targets hallucination-driven unsafe acts. | CPS safety, robotics, neuro-symbolic, assurance, runtime safety, hallucinations |
| 2603.21577 | Mind over Space: Can Multimodal Large Language Models Mentally Navigate? | cs.AI | 86 | New benchmark for long-horizon spatial planning from egocentric video; targets agentic MLLM limits. | agents, benchmark, embodied-ai, multimodal, planning, long-context, evaluation |
| 2603.21607 | INTRYGUE: Induction-Aware Entropy Gating for Reliable RAG Uncertainty Estimation | cs.AI | 85 | Mechanistic RAG UQ fix: induction heads inflate entropy; proposes gating for reliability. | RAG, uncertainty, hallucinations, mechanistic-interpretability, calibration, reliability |
| 2603.21489 | Effective Strategies for Asynchronous Software Engineering Agents | cs.CL, cs.AI | 84 | Practical strategies for asynchronous multi-agent SWE; tackles interference, dependencies, and integration. | agents, software engineering, multi-agent coordination, asynchrony, long-horizon tasks, workflow |
| 2603.21925 | Guideline-grounded retrieval-augmented generation for ophthalmic clinical decision support | cs.AI | 84 | Guideline-page image RAG with routing/filtering + traceable citations; strong clinical decision support eval. | RAG, grounding, citations, multimodal, healthcare, evaluation, retrieval |
| 2603.21454 | Cross-Context Verification: Hierarchical Detection of Benchmark Contamination through Session-Isolated Analysis | cs.CL | 83 | Black-box method to detect benchmark contamination via multi-session solution diversity. | evaluation, benchmark-contamination, SWE-bench, leakage, multi-agent, audit-methods |
| 2603.21692 | Reasoning Provenance for Autonomous AI Agents: Structured Behavioral Analytics Beyond State Checkpoints and Execution Traces | cs.AI, cs.DC, cs.SE | 82 | Proposes structured reasoning provenance for agents: queryable 'why' records at scale. | agents, observability, auditing, reasoning-provenance, governance, monitoring |
| 2603.21705 | Data-Free Layer-Adaptive Merging via Fisher Information for Long-to-Short Reasoning LLMs | cs.LG | 82 | Fisher/Hessian-motivated layer-adaptive model merging for long-to-short reasoning; practical compression lever. | model-merging, reasoning, compression, Fisher-information, alignment, LLM |
| 2603.21522 | Efficient Failure Management for Multi-Agent Systems with Reasoning Trace Representation | cs.SE, cs.AI | 82 | Failure management for LLM multi-agent systems using historical patterns + trace representations. | multi-agent, reliability, monitoring, debugging, reasoning-traces, software-engineering |
| 2603.21563 | Counterfactual Credit Policy Optimization for Multi-Agent Collaboration | cs.AI | 81 | Counterfactual credit assignment for collaborative agents; reduces variance/free-riding in multi-agent RL. | multi-agent RL, credit assignment, counterfactual baselines, collaboration, agent training |
| 2603.21606 | mSFT: Addressing Dataset Mixtures Overfitting Heterogeneously in Multi-task SFT | cs.LG, cs.AI | 80 | Multi-task SFT mixture method that avoids per-dataset overfitting; broad benchmark gains. | SFT, data-mixtures, post-training, overfitting, training-recipes, LLM |
| 2603.21877 | P^2O: Joint Policy and Prompt Optimization | cs.LG, cs.AI | 80 | Combines prompt optimization with RLVR to tackle hard samples and sparse rewards; exploration boost. | RLVR, reasoning, prompt optimization, genetic search, training stability, verifiable rewards |
| 2603.21872 | Manifold-Aware Exploration for Reinforcement Learning in Video Generation | cs.CV, cs.AI | 80 | Constrains GRPO exploration to stay near video manifold; improves stability of reward-based post-training. | RL, GRPO, video-generation, alignment, stability, exploration, diffusion |
| 2603.21663 | TAMTRL: Teacher-Aligned Reward Reshaping for Multi-Turn Reinforcement Learning in Long-Context Compression | cs.CL | 80 | Multi-turn RL for long-context compression; tackles credit assignment without heavy judge overhead. | long-context, reinforcement-learning, reward-shaping, memory, training, alignment-methods |
| 2603.21840 | Select, Label, Evaluate: Active Testing in NLP | cs.CL, cs.AI | 78 | Active Testing benchmark across many NLP datasets; reduces labeling cost while estimating performance well. | evaluation, active testing, data efficiency, benchmarking, test set design, annotation |
| 2603.22184 | Revisiting Quantum Code Generation: Where Should Domain Knowledge Live? | cs.LG, quant-ph | 78 | Compares finetune vs RAG vs agent+exec feedback for domain codegen; useful evidence on specialization tradeoffs. | code-generation, agents, RAG, execution-feedback, evaluation, domain-adaptation |
| 2603.22276 | Scaling DoRA: High-Rank Adaptation via Factored Norms and Fused Kernels | cs.LG, stat.ML | 78 | Makes high-rank DoRA practical via factored norms + fused kernels; useful for efficient adaptation. | efficiency, fine-tuning, LoRA, DoRA, systems, kernels, scaling |
AI Paper Insight Brief
2026-03-25
0) Executive takeaways (read this first)
- Evaluation integrity is under active attack—from code benchmarks to multimodal “vision” tests. Cross-session behavioral diversity (CCV) can flag SWE-bench contamination, while “Mirage” shows many multimodal benchmarks remain highly answerable without images (often retaining ~70–80% of accuracy).
- Inference-time, reversible alignment is getting more practical. DSPA uses sparse autoencoder (SAE) features for prompt-conditional, token-conditional steering, improving MT-Bench with modest multiple-choice regression and strong robustness under tiny preference datasets (≈100–250 triples).
- Agent reliability is shifting from “smarter prompts” to “software engineering + ops primitives.” CAID (git worktrees + dependency-aware delegation + test-gated merges) improves long-horizon SWE benchmarks; EAGER and AER propose trace representations for faster failure detection and population-level behavioral analytics.
- Security focus is moving to the tool boundary (MCP) and the RAG pipeline. Empirical MCP client testing finds no client blocks all tool-poisoning attacks; protocol-aware auditing (static + dynamic eBPF fuzzing) catches over-privileged servers; a large RAG-security survey consolidates threats/defenses/benchmarks.
- RL/RLVR for reasoning is being debugged at the token and credit-assignment level. Directional token shifts (Δlog p) explain sparse RLVR changes and enable test-time extrapolation + training reweighting; CCPO and TAMTRL reshape credit assignment for multi-agent collaboration and multi-turn memory RL; P²O uses prompt evolution + context distillation to break “hard-sample zero-reward” dead zones.
- Formal verification and DP are re-entering the loop as practical mitigations. SafePilot uses Z3/Spot to verify LLM-generated CPS plans; confidential federated compute work shows DP can be undermined by side-channels unless message padding and DP-resize mechanisms are added.
2) Key themes (clusters)
Theme: Benchmark trust & contamination (code + multimodal)
- Why it matters: If benchmarks can be solved via leakage or modality shortcuts, reported “reasoning” and “visual understanding” gains are inflated, and downstream decisions (model selection, safety claims, curation) become unreliable.
- Representative papers:
  - Cross-Context Verification (CCV, 2603.21454)
  - Mirage: The Illusion of Visual Understanding (2603.21687)
  - Select, Label, Evaluate: Active Testing in NLP (2603.21840)
- Common approach:
- Replace artifact-only checks with behavioral or counterfactual controls (session-isolated repeated solves; image-absent “mirage-mode”).
- Quantify susceptibility with simple ratios/metrics (contamination score; mirage-score = acc(no image)/acc(with image)).
- Reduce evaluation cost while preserving statistical validity (active testing with Horvitz–Thompson estimators + adaptive stopping).
- Open questions / failure modes:
- How well do contamination/mirage diagnostics generalize across model families, decoding settings, and domains?
- Model-set dependence: cleaning procedures like B-Clean depend on which models are used to filter.
- Can models learn to “fake diversity” or “fake uncertainty” to evade behavioral contamination checks?
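The two ratio diagnostics in this theme are simple enough to sketch. The function names and the 0.7/0.5 thresholds below are illustrative choices, not the papers' implementations:

```python
def mirage_score(acc_no_image: float, acc_with_image: float) -> float:
    """Fraction of benchmark accuracy retained when images are withheld.
    Values near 1.0 suggest the benchmark is answerable without vision."""
    return acc_no_image / acc_with_image if acc_with_image > 0 else 0.0

def solution_diversity(patches: list[str]) -> float:
    """Distinct-solution ratio across session-isolated solves of one task.
    Low diversity (near-identical patches every session) is a leakage signal."""
    return len(set(patches)) / len(patches) if patches else 0.0

# A benchmark retaining most of its accuracy with no image, and a task
# solved with the same patch in 4 of 5 isolated sessions, both warrant scrutiny.
suspicious = (mirage_score(0.62, 0.80) > 0.7
              and solution_diversity(["p1", "p1", "p1", "p1", "p2"]) < 0.5)
```

Both metrics are black-box: they need only repeated API calls, not model internals, which is what makes them deployable as routine eval-harness controls.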
Theme: Inference-time alignment & mechanistic uncertainty signals
- Why it matters: Deployment often needs cheap, reversible alignment and reliable uncertainty without retraining or heavy sampling—especially for RAG and open-ended generation.
- Representative papers:
  - DSPA: Dynamic SAE Steering for Data-Efficient Preference Alignment (2603.21461)
  - INTRYGUE: Induction-Aware Entropy Gating for Reliable RAG Uncertainty Estimation (2603.21607)
  - Deterministic Hallucination Detection in Medical VQA via Confidence-Evidence Bayesian Gain (CEBaG, 2603.21693)
- Common approach:
- Use internal representations (SAE features; induction-head SinkRate; token log-prob variance + evidence gain) to drive interventions/scores.
- Prefer training-free or low-data methods (DSPA robust down to ~100–250 preference triples; INTRYGUE training-free; CEBaG deterministic with 3 forward passes).
- Emphasize auditability (sparse feature edits; mechanistic probes/ablations; hyperparameter-free scoring).
- Open questions / failure modes:
- White-box dependence: INTRYGUE needs attention access; CEBaG needs logprobs.
- Faithfulness vs truth: INTRYGUE measures grounding faithfulness—wrong retrieved docs can still look “certain.”
- Steering misuse risk: inference-time steering could be applied adversarially.
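A minimal sketch of DSPA-style sparse steering, with randomly initialized toy weights standing in for a trained SAE; the shapes and the "edit only token-active latents" rule follow the description above, but the exact steering formula is an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 8, 32
W_enc = rng.normal(size=(d_sae, d_model))   # toy encoder weights (stand-in for a trained SAE)
W_dec = rng.normal(size=(d_model, d_sae))   # toy decoder weights

def steer(h, direction, alpha=0.5):
    """Edit only the SAE latents active for this token; adding back just
    the decoded delta keeps the edit sparse and trivially reversible."""
    z = np.maximum(W_enc @ h, 0.0)                    # ReLU SAE encode
    delta = alpha * np.where(z > 0, direction, 0.0)   # token-active latents only
    return h + W_dec @ delta

h = rng.normal(size=d_model)                           # one token's residual activation
steered = steer(h, direction=np.ones(d_sae))           # push active latents toward a preference direction
```

Because only the decoded delta is added, setting alpha to 0 (or subtracting the same delta) recovers the original activation exactly, which is the reversibility property the theme emphasizes.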
Theme: Agent engineering for long-horizon reliability (coordination, debugging, provenance)
- Why it matters: As agents become asynchronous and autonomous, the dominant failures are integration bugs, recurring trace-level failure patterns, and lack of population-level observability.
- Representative papers:
  - Effective Strategies for Asynchronous Software Engineering Agents (CAID, 2603.21489)
  - Efficient Failure Management for Multi-Agent Systems with Reasoning Trace Representation (EAGER, 2603.21522)
  - Reasoning Provenance for Autonomous AI Agents (AER, 2603.21692)
- Common approach:
- Import SWE/ops primitives: dependency graphs, isolated workspaces, test-gated merges (CAID).
- Learn trace representations for retrieval-based diagnosis and step-wise mitigation (EAGER dual encoders + contrastive objectives).
- Standardize structured provenance schemas with replay modes for regression (AER: intent/observation/inference + mock replay).
- Open questions / failure modes:
- Overhead: more agents, more isolation, more logging can increase cost and latency.
- Faithfulness: provenance fields can be self-reported and rationalized; representation models are preliminary.
- Generalization beyond code/tool domains where tests and version control exist.
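A sketch of what a structured provenance record plus mock replay could look like. The fields mirror the intent/observation/inference triple mentioned above, but the schema and helper names are otherwise hypothetical:

```python
import time
from dataclasses import dataclass, field

@dataclass
class ProvenanceRecord:
    step: int
    intent: str        # what the agent was trying to achieve
    observation: str   # evidence acted on (tool output, retrieved snippet)
    inference: str     # the stated link from observation to action
    action: str
    ts: float = field(default_factory=time.time)

def mock_replay(records, decide):
    """Re-run a decision function against recorded observations, so prompt
    or model changes can be regression-tested on a pinned incident corpus
    without touching live tools."""
    return [(r.step, decide(r.intent, r.observation), r.action) for r in records]

def regressions(records, decide):
    """Steps where the new policy diverges from the recorded action."""
    return [step for step, new, old in mock_replay(records, decide) if new != old]
```

The faithfulness caveat above applies directly: `inference` is self-reported by the agent, so replay can detect behavioral drift but not verify that the recorded rationale was the true cause of the action.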
Theme: Tool/RAG security & privacy leakage in “secure” compute
- Why it matters: Tool integration and retrieval pipelines expand the attack surface; DP and TEEs don’t automatically prevent leakage if metadata and side-channels remain observable.
- Representative papers:
  - Are AI-assisted Development Tools Immune to Prompt Injection? (2603.21642)
  - Auditing MCP Servers for Over-Privileged Tool Capabilities (2603.21641)
  - Towards Secure Retrieval-Augmented Generation: A Comprehensive Review of Threats, Defenses and Benchmarks (2603.21654)
  - Hardening Confidential Federated Compute against Side-channel Attacks (2603.21469)
- Common approach:
- Protocol-aware security evaluation (MCP tool metadata as an injection vector; capability-family auditing).
- Combine static + dynamic evidence (Docker sandbox + eBPF telemetry; fuzzing).
- Treat side-channels as first-class DP threats; add DP padding and DP-timed resizing with proofs.
- Open questions / failure modes:
- Static coverage gaps (e.g., MCP audit static misses JS/TS; dynamic requires eBPF/Linux).
- Defense trade-offs: padding overhead; DP-resize complexity; incomplete channel coverage.
- Ecosystem drift: client versions/configurations change rapidly; security posture can regress.
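Treating tool metadata as untrusted input can start with static rules like the following; the regex patterns and capability names are illustrative, not the actual ruleset from the auditing paper:

```python
import re

# Heuristic signatures of instruction-like payloads hidden in tool descriptions.
INJECTION_PATTERNS = [
    r"(?i)ignore (all |any )?previous instructions",
    r"(?i)do not (tell|inform|alert) the user",
    r"(?i)\b(send|forward|upload)\b.*\b(file|key|token|credential)s?\b",
]
# Capabilities that deserve scrutiny regardless of description (hypothetical names).
BROAD_CAPABILITIES = {"fs.write", "net.outbound", "shell.exec"}

def audit_tool(description: str, capabilities: set[str]) -> list[str]:
    """Flag instruction-like payloads in a tool description, plus any
    declared capability broader than least privilege would suggest."""
    findings = [f"injection-pattern: {p}" for p in INJECTION_PATTERNS
                if re.search(p, description)]
    findings += sorted(f"over-privilege: {c}"
                       for c in capabilities & BROAD_CAPABILITIES)
    return findings
```

A dynamic pass (sandboxed execution with syscall telemetry, as in the eBPF-based auditing above) would then confirm which declared capabilities are actually exercised; static rules alone share the coverage gaps noted in the open questions.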
Theme: RL/RLVR stabilization via better credit assignment and exploration control
- Why it matters: RL for reasoning/agents/video is limited by sparse rewards, hard-sample dead zones, free-riding in multi-agent setups, and unstable exploration.
- Representative papers:
  - On the Direction of RLVR Updates for LLM Reasoning (2603.22117)
  - Counterfactual Credit Policy Optimization for Multi-Agent Collaboration (CCPO, 2603.21563)
  - TAMTRL: Teacher-Aligned Reward Reshaping for Multi-Turn Reinforcement Learning in Long-Context Compression (2603.21663)
  - P^2O: Joint Policy and Prompt Optimization (2603.21877)
  - Manifold-Aware Exploration for Reinforcement Learning in Video Generation (2603.21872)
- Common approach:
- Replace shared/terminal rewards with counterfactual or shaped per-component signals (CCPO marginal contributions; TAMTRL turn rewards; Δlog p token diagnostics).
- Use trust regions / normalization to stabilize updates (dual KL anchors; EMA normalization; gradient equalizers).
- Add exploration that stays “on-manifold” (video GRPO variance correction) or “prompt-assisted” (P²O prompt evolution).
- Open questions / failure modes:
- Extra compute/runtime (counterfactuals; prompt evolution; video RL).
- Hyperparameter sensitivity (extrapolation γ/τ; trust-region weights; prompt-search budgets).
- Transfer beyond math/video/specific topologies remains under-tested.
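The counterfactual-baseline and bounded-shaping ideas above fit in a few lines; the ablation-based marginal and the EMA/tanh shaping below are simplified stand-ins for CCPO's actual estimator, not its published algorithm:

```python
import math

def marginal_credits(team_reward, actions: dict) -> dict:
    """Per-agent counterfactual credit: team reward with the agent's action
    included, minus team reward with that action ablated. Free-riders whose
    removal changes nothing receive zero credit."""
    full = team_reward(actions)
    return {a: full - team_reward({k: v for k, v in actions.items() if k != a})
            for a in actions}

class ShapedSignal:
    """EMA z-scoring plus tanh keeps the shaped credit bounded in (-1, 1),
    damping reward-scale swings that destabilize updates."""
    def __init__(self, beta: float = 0.99):
        self.beta, self.mean, self.var = beta, 0.0, 1.0
    def __call__(self, x: float) -> float:
        self.mean = self.beta * self.mean + (1 - self.beta) * x
        self.var = self.beta * self.var + (1 - self.beta) * (x - self.mean) ** 2
        return math.tanh((x - self.mean) / (math.sqrt(self.var) + 1e-8))
```

The cost caveat above is visible here: each credit requires one extra team-reward evaluation per agent per step, which is the compute overhead counterfactual methods trade for lower-variance signals.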
3) Technical synthesis
- Behavioral counterfactuals are becoming the common diagnostic tool: CCV uses session-isolated repeated solves; Mirage uses image-absent controls; CCPO uses counterfactual rollouts; CEBaG uses text-only vs multimodal scoring passes.
- “White-box signals” are increasingly used to fix evaluation and safety gaps: induction-head SinkRate (INTRYGUE), SAE latents (DSPA), token logprob variance/evidence gain (CEBaG), signed Δlog p (RLVR direction).
- Credit assignment is converging on normalization + bounded shaping: CCPO’s EMA z-scoring/tanh shaping; TAMTRL’s min–max normalization (and collapse without it); SAGE-GRPO’s timestep equalizer; RLVR reweighting upweights low-prob tokens.
- Agent reliability work is splitting into two layers: (a) coordination primitives (CAID’s worktrees/merges/tests) and (b) observability primitives (EAGER embeddings for failure retrieval; AER schema + mock replay).
- Security is shifting from “model jailbreaks” to “system boundary jailbreaks”: MCP tool metadata poisoning and over-privileged servers; RAG pipeline threats; DP-in-TEE side-channels.
- Formal methods are being used as practical guardrails rather than end-to-end verification: SafePilot verifies plans with Z3/Spot and iteratively re-prompts; DP side-channel mitigations come with theorems but target specific channels.
- Data efficiency is a recurring theme across alignment and evaluation: DSPA works under severe preference-data restriction; Active Testing cuts labeling up to 95%; mSFT reduces wasted compute by excluding early-overfitting sub-datasets.
- “Training-free” or “no weight updates” is not just convenience—it’s becoming a safety/ops feature: DSPA steering is reversible; FIM-based merging is data-free; INTRYGUE is training-free; CEBaG is deterministic and sampling-free.
4) Top 5 papers (with “why now”)
1) Mirage: The Illusion of Visual Understanding
- Shows frontier multimodal models often confidently describe non-existent images and still score highly when images are omitted (mirage-scores ~70–80% average).
- Demonstrates benchmark fragility: B-Clean removes ~74–77% of questions in some benchmarks and can drastically change accuracies/rankings.
- “Why now”: multimodal models are being deployed in high-stakes domains (medicine); this provides a concrete, scalable evaluation control (image-absent) and a cleaning protocol.
- Be skeptical about: B-Clean is model-set dependent; mechanistic causes of mirage are not fully identified.
2) Cross-Context Verification: Hierarchical Detection of Benchmark Contamination through Session-Isolated Analysis
- Introduces a black-box, API-only contamination detector using session-isolated repeated trials and patch-diversity metrics.
- Reports perfect separation between contaminated vs genuine reasoning on 9 SWE-bench problems (small but striking), plus a bias-resistant analysis workflow (HCCA).
- “Why now”: coding benchmarks are central to frontier claims; this is a practical method to audit them without model internals.
- Be skeptical about: evaluated on 9 problems / one model; reasoning classifier is heuristic and evaluated on the same data.
3) DSPA: Dynamic SAE Steering for Data-Efficient Preference Alignment
- Inference-time, prompt-conditional sparse steering in SAE space; edits only token-active latents.
- Improves MT-Bench across multiple models and stays robust with very small preference datasets (down to ~100–250 triples), with large compute savings vs a two-stage baseline (modeled 4.47× FLOPs; observed 11.5× wall-clock).
- “Why now”: demand for cheap, reversible alignment and mechanistic auditability is rising.
- Be skeptical about: depends on availability/quality of SAEs; open-ended eval relies on LLM judges; no formal safety guarantees.
4) Are AI-assisted Development Tools Immune to Prompt Injection?
- Empirically tests tool-poisoning prompt injection across 7 MCP clients with 4 concrete attacks; finds no client blocks all attacks.
- Highlights large variance: Cursor unsafe across all tested attacks; Claude Desktop and Cline strongest in tested configs; many clients lack static validation/sandboxing/audit logging.
- “Why now”: MCP-style tool ecosystems are rapidly becoming default in IDE/CLI workflows; this is direct operational risk.
- Be skeptical about: limited to specific versions/configurations and local testbed; sandboxing assessment partly documentation-based.
5) On the Direction of RLVR Updates for LLM Reasoning: Identification and Exploitation
- Argues RLVR changes are best understood via signed token probability shifts (Δlog p), not magnitude-only metrics.
- Shows Δlog p-selected token replacement recovers RLVR performance with ~10% token swaps; proposes test-time extrapolation and training-time advantage reweighting with reported gains (e.g., Avg@32 improvements on AIME and other math sets).
- “Why now”: RLVR is widely used for reasoning; this offers both interpretability and practical knobs to improve it.
- Be skeptical about: extrapolation needs both base + RL models at test time and introduces tunable hyperparameters (τ, γ).
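The two quantities in this paper can be sketched directly. The single-γ form of the extrapolation below is an assumption about one plausible parameterization (τ would enter separately as a sampling temperature), not the paper's exact formula:

```python
import numpy as np

def delta_logp(logp_rl: np.ndarray, logp_base: np.ndarray) -> np.ndarray:
    """Signed per-token shift between RL-tuned and base models; large
    positive entries mark tokens the RL stage promoted, i.e. candidate
    reasoning-critical updates."""
    return logp_rl - logp_base

def extrapolated_logits(l_rl: np.ndarray, l_base: np.ndarray,
                        gamma: float = 0.5) -> np.ndarray:
    """Test-time extrapolation: step past the RL model along the update
    direction (gamma=0 recovers the RL model exactly). Note this requires
    keeping both the base and RL models available at inference."""
    return l_rl + gamma * (l_rl - l_base)
```

The sign of `delta_logp` is what distinguishes this view from magnitude-only metrics: a token can move a lot in probability while still being demoted, and only the signed shift separates the two cases.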
5) Practical next steps
- Add “counterfactual controls” to your eval harness: for multimodal, run image-absent mirage-mode; for coding, run session-isolated repeated solves and measure diversity (CCV-style).
- Treat tool metadata as untrusted input: adopt MCP server auditing (static rules + optional dynamic sandbox/eBPF) and require capability inventories + least-privilege hardening before deployment.
- Instrument agents with structured provenance (intent/observation/inference + evidence chains) and enable mock replay to regression-test prompt/model changes on a pinned incident corpus.
- For multi-agent SWE, enforce physical isolation (git worktrees/branches), dependency-aware delegation, and test-gated merges; measure integration failure rate vs engineer count to find the parallelism “knee.”
- If you do RAG, evaluate uncertainty methods that incorporate how context was used (e.g., induction-head activity) and separately track retrieval quality to avoid “faithful-but-wrong” confidence.
- For RLVR / agent RL, prioritize credit assignment: try counterfactual marginal rewards (CCPO) for collaboration, and consider probability-aware reweighting to avoid ignoring low-probability but crucial tokens.
- For safety-critical planning (CPS/robotics), integrate formal verification loops (Z3/Spot) and log verification failures as first-class training/eval artifacts.
- For DP-in-TEE deployments, audit for metadata side-channels (message length, allocation/page faults) and consider DP padding + DP-timed resizing mechanisms where applicable.
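Length side-channel hardening from the last bullet can start with deterministic bucket padding; note that a full DP padding mechanism adds calibrated noise to the size decision, which this sketch (with arbitrary bucket sizes) deliberately omits:

```python
def pad_to_bucket(payload: bytes, buckets=(256, 1024, 4096, 16384)) -> bytes:
    """Pad a message up to a fixed bucket size so that observed length
    reveals at most which bucket the payload fell into, not its exact
    size. The true length must be carried inside the payload framing so
    the receiver can strip the padding."""
    for cap in buckets:
        if len(payload) <= cap:
            return payload + b"\x00" * (cap - len(payload))
    raise ValueError("payload exceeds largest bucket")
```

Bucketing bounds the leakage per message but does not eliminate it; the side-channel paper's point is that such mitigations need DP accounting (and coverage of allocation/timing channels) rather than ad-hoc padding alone.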
Generated from per-paper analyses; no external browsing.
