Daily AI Paper Report (2026-03-25)

Run stats

  • Candidates: 223
  • Selected: 30
  • Deepread completed: 30
  • Window (UTC): 2026-03-23T00:00:00Z → 2026-03-24T00:00:00Z (arxiv_announce, expanded=0)
Selected papers

  • 2603.21697 | Structured Visual Narratives Undermine Safety Alignment in Multimodal Large Language Models
    cs.CR, cs.AI, cs.MM | score 95 | Why: Comic-based multimodal jailbreak benchmark; very high attack success across 15 MLLMs.
    Tags: multimodal-safety, jailbreaks, benchmark, red-teaming, MLLM, adversarial-prompts
  • 2603.21687 | Mirage: The Illusion of Visual Understanding
    cs.AI | score 95 | Why: Shows multimodal benchmarks can be gamed with no image; exposes "mirage reasoning" reliability failures.
    Tags: multimodal, evaluation, hallucination, reliability, benchmarking, medical-ai
  • 2603.21642 | Are AI-assisted Development Tools Immune to Prompt Injection?
    cs.CR, cs.SE | score 93 | Why: First empirical prompt-injection/tool-poisoning study across 7 real MCP dev clients.
    Tags: prompt-injection, tool-poisoning, MCP, agent-security, empirical-study, secure-tool-use
  • 2603.21972 | Demystifying Reinforcement Learning for Long-Horizon Tool-Using Agents: A Comprehensive Recipe
    cs.LG, cs.CL | score 92 | Why: Empirical recipe for scaling RL in long-horizon tool agents; actionable axes and takeaways on TravelPlanner.
    Tags: tool-using agents, long-horizon RL, RLHF/RLVR, agent evaluation, reward design, planning
  • 2603.22117 | On the Direction of RLVR Updates for LLM Reasoning: Identification and Exploitation
    cs.LG, cs.AI | score 92 | Why: Token-level signed Δlog p reveals reasoning-critical RLVR updates; actionable analysis and interventions.
    Tags: LLM, RLVR, reasoning, post-training, mechanistic-analysis, token-level
  • 2603.21641 | Auditing MCP Servers for Over-Privileged Tool Capabilities
    cs.CR, cs.SE | score 90 | Why: Practical auditing toolkit for over-privileged MCP servers with static and dynamic fuzzing.
    Tags: MCP, tool-permissions, sandboxing, security-audit, fuzzing, eBPF, agent-infra
  • 2603.21461 | DSPA: Dynamic SAE Steering for Data-Efficient Preference Alignment
    cs.LG, cs.AI, cs.CL | score 90 | Why: Inference-time preference alignment via prompt-conditional SAE steering; compute-light with strong benchmarks.
    Tags: alignment, preference optimization, SAE, steering, mechanistic interpretability, inference-time control
  • 2603.21558 | Stabilizing Iterative Self-Training with Verified Reasoning via Symbolic Recursive Self-Alignment
    cs.AI | score 90 | Why: Stabilizes recursive self-training by step-level symbolic verification; targets drift/mode-collapse risk.
    Tags: self-training, recursive-self-improvement, verification, neuro-symbolic, reasoning, safety
  • 2603.21469 | Hardening Confidential Federated Compute against Side-channel Attacks
    cs.CR, cs.DS | score 90 | Why: Finds side channels that can bypass DP in confidential federated compute; proposes mitigations.
    Tags: privacy, differential-privacy, federated-learning, side-channels, security, confidential-compute
  • 2603.21975 | SecureBreak -- A dataset towards safe and secure models
    cs.CR, cs.AI, cs.CL, cs.LG | score 88 | Why: Security-focused dataset for robustness evaluation/training against jailbreaks and injection.
    Tags: dataset, security-alignment, jailbreaks, prompt-injection, robustness-eval, guardrails
  • 2603.22214 | Evaluating the Reliability and Fidelity of Automated Judgment Systems of Large Language Models
    cs.CR, cs.AI, cs.LG | score 88 | Why: Systematic study of LLM-as-judge reliability vs. humans; important for scalable eval and security assessment.
    Tags: evaluation, LLM-as-judge, reliability, human agreement, model auditing, safety eval
  • 2603.21693 | Deterministic Hallucination Detection in Medical VQA via Confidence-Evidence Bayesian Gain
    cs.AI | score 88 | Why: Single-pass logprob-based medical MLLM hallucination detection; avoids costly multi-sample entropy methods.
    Tags: hallucination-detection, MLLM, medical, VQA, uncertainty, logprobs, reliability
  • 2603.21654 | Towards Secure Retrieval-Augmented Generation: A Comprehensive Review of Threats, Defenses and Benchmarks
    cs.CR, cs.AI | score 86 | Why: Comprehensive RAG security review: threats (poisoning/inference), defenses, and benchmarks.
    Tags: RAG, security, data-poisoning, membership-inference, defenses, survey, benchmarking
  • 2603.21523 | SafePilot: A Framework for Assuring LLM-enabled Cyber-Physical Systems
    cs.RO, cs.AI | score 86 | Why: Safety assurance framework for LLM-enabled cyber-physical systems; targets hallucination-driven unsafe acts.
    Tags: CPS safety, robotics, neuro-symbolic, assurance, runtime safety, hallucinations
  • 2603.21577 | Mind over Space: Can Multimodal Large Language Models Mentally Navigate?
    cs.AI | score 86 | Why: New benchmark for long-horizon spatial planning from egocentric video; targets agentic MLLM limits.
    Tags: agents, benchmark, embodied-ai, multimodal, planning, long-context, evaluation
  • 2603.21607 | INTRYGUE: Induction-Aware Entropy Gating for Reliable RAG Uncertainty Estimation
    cs.AI | score 85 | Why: Mechanistic RAG UQ fix: induction heads inflate entropy; proposes gating for reliability.
    Tags: RAG, uncertainty, hallucinations, mechanistic-interpretability, calibration, reliability
  • 2603.21489 | Effective Strategies for Asynchronous Software Engineering Agents
    cs.CL, cs.AI | score 84 | Why: Practical strategies for asynchronous multi-agent SWE; tackles interference, dependencies, and integration.
    Tags: agents, software engineering, multi-agent coordination, asynchrony, long-horizon tasks, workflow
  • 2603.21925 | Guideline-grounded retrieval-augmented generation for ophthalmic clinical decision support
    cs.AI | score 84 | Why: Guideline-page image RAG with routing/filtering and traceable citations; strong clinical decision support eval.
    Tags: RAG, grounding, citations, multimodal, healthcare, evaluation, retrieval
  • 2603.21454 | Cross-Context Verification: Hierarchical Detection of Benchmark Contamination through Session-Isolated Analysis
    cs.CL | score 83 | Why: Black-box method to detect benchmark contamination via multi-session solution diversity.
    Tags: evaluation, benchmark-contamination, SWE-bench, leakage, multi-agent, audit-methods
  • 2603.21692 | Reasoning Provenance for Autonomous AI Agents: Structured Behavioral Analytics Beyond State Checkpoints and Execution Traces
    cs.AI, cs.DC, cs.SE | score 82 | Why: Proposes structured reasoning provenance for agents: queryable "why" records at scale.
    Tags: agents, observability, auditing, reasoning-provenance, governance, monitoring
  • 2603.21705 | Data-Free Layer-Adaptive Merging via Fisher Information for Long-to-Short Reasoning LLMs
    cs.LG | score 82 | Why: Fisher/Hessian-motivated layer-adaptive model merging for long-to-short reasoning; practical compression lever.
    Tags: model-merging, reasoning, compression, Fisher-information, alignment, LLM
  • 2603.21522 | Efficient Failure Management for Multi-Agent Systems with Reasoning Trace Representation
    cs.SE, cs.AI | score 82 | Why: Failure management for LLM multi-agent systems using historical patterns and trace representations.
    Tags: multi-agent, reliability, monitoring, debugging, reasoning-traces, software-engineering
  • 2603.21563 | Counterfactual Credit Policy Optimization for Multi-Agent Collaboration
    cs.AI | score 81 | Why: Counterfactual credit assignment for collaborative agents; reduces variance/free-riding in multi-agent RL.
    Tags: multi-agent RL, credit assignment, counterfactual baselines, collaboration, agent training
  • 2603.21606 | mSFT: Addressing Dataset Mixtures Overfitting Heterogeneously in Multi-task SFT
    cs.LG, cs.AI | score 80 | Why: Multi-task SFT mixture method that avoids per-dataset overfitting; broad benchmark gains.
    Tags: SFT, data-mixtures, post-training, overfitting, training-recipes, LLM
  • 2603.21877 | P^2O: Joint Policy and Prompt Optimization
    cs.LG, cs.AI | score 80 | Why: Combines prompt optimization with RLVR to tackle hard samples and sparse rewards; exploration boost.
    Tags: RLVR, reasoning, prompt optimization, genetic search, training stability, verifiable rewards
  • 2603.21872 | Manifold-Aware Exploration for Reinforcement Learning in Video Generation
    cs.CV, cs.AI | score 80 | Why: Constrains GRPO exploration to stay near the video manifold; improves stability of reward-based post-training.
    Tags: RL, GRPO, video-generation, alignment, stability, exploration, diffusion
  • 2603.21663 | TAMTRL: Teacher-Aligned Reward Reshaping for Multi-Turn Reinforcement Learning in Long-Context Compression
    cs.CL | score 80 | Why: Multi-turn RL for long-context compression; tackles credit assignment without heavy judge overhead.
    Tags: long-context, reinforcement-learning, reward-shaping, memory, training, alignment-methods
  • 2603.21840 | Select, Label, Evaluate: Active Testing in NLP
    cs.CL, cs.AI | score 78 | Why: Active Testing benchmark across many NLP datasets; reduces labeling cost while estimating performance well.
    Tags: evaluation, active testing, data efficiency, benchmarking, test set design, annotation
  • 2603.22184 | Revisiting Quantum Code Generation: Where Should Domain Knowledge Live?
    cs.LG, quant-ph | score 78 | Why: Compares finetuning vs. RAG vs. agent+execution feedback for domain codegen; useful evidence on specialization tradeoffs.
    Tags: code-generation, agents, RAG, execution-feedback, evaluation, domain-adaptation
  • 2603.22276 | Scaling DoRA: High-Rank Adaptation via Factored Norms and Fused Kernels
    cs.LG, stat.ML | score 78 | Why: Makes high-rank DoRA practical via factored norms and fused kernels; useful for efficient adaptation.
    Tags: efficiency, fine-tuning, LoRA, DoRA, systems, kernels, scaling

AI Paper Insight Brief

2026-03-25

0) Executive takeaways (read this first)

  • Evaluation integrity is under active attack—from code benchmarks to multimodal “vision” tests. Cross-session behavioral diversity (CCV) can flag SWE-bench contamination, while “Mirage” shows many multimodal benchmarks remain highly answerable without images (often retaining ~70–80% of accuracy).
  • Inference-time, reversible alignment is getting more practical. DSPA uses sparse autoencoder (SAE) features for prompt-conditional, token-conditional steering, improving MT-Bench with modest multiple-choice regression and strong robustness under tiny preference datasets (≈100–250 triples).
  • Agent reliability is shifting from “smarter prompts” to “software engineering + ops primitives.” CAID (git worktrees + dependency-aware delegation + test-gated merges) improves long-horizon SWE benchmarks; EAGER and AER propose trace representations for faster failure detection and population-level behavioral analytics.
  • Security focus is moving to the tool boundary (MCP) and the RAG pipeline. Empirical MCP client testing finds no client blocks all tool-poisoning attacks; protocol-aware auditing (static + dynamic eBPF fuzzing) catches over-privileged servers; a large RAG-security survey consolidates threats/defenses/benchmarks.
  • RL/RLVR for reasoning is being debugged at the token and credit-assignment level. Directional token shifts (Δlog p) explain sparse RLVR changes and enable test-time extrapolation + training reweighting; CCPO and TAMTRL reshape credit assignment for multi-agent collaboration and multi-turn memory RL; P²O uses prompt evolution + context distillation to break “hard-sample zero-reward” dead zones.
  • Formal verification and DP are re-entering the loop as practical mitigations. SafePilot uses Z3/Spot to verify LLM-generated CPS plans; confidential federated compute work shows DP can be undermined by side-channels unless message padding and DP-resize mechanisms are added.

2) Key themes (clusters)

Theme: Benchmark trust & contamination (code + multimodal)

  • Why it matters: If benchmarks can be solved via leakage or modality shortcuts, reported “reasoning” and “visual understanding” gains are inflated, and downstream decisions (model selection, safety claims, curation) become unreliable.
  • Representative papers: Cross-Context Verification (2603.21454); Mirage (2603.21687); Select, Label, Evaluate: Active Testing in NLP (2603.21840).
  • Common approach:
    • Replace artifact-only checks with behavioral or counterfactual controls (session-isolated repeated solves; image-absent “mirage-mode”).
    • Quantify susceptibility with simple ratios/metrics (contamination score; mirage-score = acc(no image)/acc(with image)).
    • Reduce evaluation cost while preserving statistical validity (active testing with Horvitz–Thompson estimators + adaptive stopping).
  • Open questions / failure modes:
    • How well do contamination/mirage diagnostics generalize across model families, decoding settings, and domains?
    • Model-set dependence: cleaning procedures like B-Clean depend on which models are used to filter.
    • Can models learn to “fake diversity” or “fake uncertainty” to evade behavioral contamination checks?
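The mirage-score ratio defined above drops straight into an eval harness. A minimal sketch with made-up per-question answers (a real harness would produce the two prediction lists by running the benchmark twice, with and without images):

```python
def mirage_score(acc_no_image: float, acc_with_image: float) -> float:
    """acc(no image) / acc(with image). Values near 1.0 mean the benchmark
    is largely answerable without the visual modality (a 'mirage')."""
    if acc_with_image == 0:
        raise ValueError("accuracy with images must be non-zero")
    return acc_no_image / acc_with_image

def accuracy(preds, golds):
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

# Hypothetical answers from two passes over the same 5 questions.
golds         = ["A", "C", "B", "D", "B"]
preds_with    = ["A", "C", "B", "D", "A"]   # images attached -> acc 0.8
preds_without = ["A", "C", "B", "A", "A"]   # images withheld -> acc 0.6
score = mirage_score(accuracy(preds_without, golds),
                     accuracy(preds_with, golds))   # -> 0.75
```

A score this close to 1.0 would flag the question set as answerable largely without vision.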

Other themes:

  • Inference-time alignment & mechanistic uncertainty signals
  • Agent engineering for long-horizon reliability (coordination, debugging, provenance)
  • Tool/RAG security & privacy leakage in “secure” compute
  • RL/RLVR stabilization via better credit assignment and exploration control

3) Technical synthesis

  • Behavioral counterfactuals are becoming the common diagnostic tool: CCV uses session-isolated repeated solves; Mirage uses image-absent controls; CCPO uses counterfactual rollouts; CEBaG uses text-only vs multimodal scoring passes.
  • “White-box signals” are increasingly used to fix evaluation and safety gaps: induction-head SinkRate (INTRYGUE), SAE latents (DSPA), token logprob variance/evidence gain (CEBaG), signed Δlog p (RLVR direction).
  • Credit assignment is converging on normalization + bounded shaping: CCPO’s EMA z-scoring/tanh shaping; TAMTRL’s min–max normalization (and collapse without it); SAGE-GRPO’s timestep equalizer; RLVR reweighting upweights low-prob tokens.
  • Agent reliability work is splitting into two layers: (a) coordination primitives (CAID’s worktrees/merges/tests) and (b) observability primitives (EAGER embeddings for failure retrieval; AER schema + mock replay).
  • Security is shifting from “model jailbreaks” to “system boundary jailbreaks”: MCP tool metadata poisoning and over-privileged servers; RAG pipeline threats; DP-in-TEE side-channels.
  • Formal methods are being used as practical guardrails rather than end-to-end verification: SafePilot verifies plans with Z3/Spot and iteratively re-prompts; DP side-channel mitigations come with theorems but target specific channels.
  • Data efficiency is a recurring theme across alignment and evaluation: DSPA works under severe preference-data restriction; Active Testing cuts labeling cost by up to 95%; mSFT reduces wasted compute by excluding early-overfitting sub-datasets.
  • “Training-free” or “no weight updates” is not just convenience—it’s becoming a safety/ops feature: DSPA steering is reversible; FIM-based merging is data-free; INTRYGUE is training-free; CEBaG is deterministic and sampling-free.
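Several of the credit-assignment fixes above share a "normalize, then bound" shape. A generic sketch of EMA z-scoring followed by tanh shaping; the decay constant and update rule are illustrative, not CCPO's or TAMTRL's published formulation:

```python
import math

class EMARewardNormalizer:
    """Exponential-moving-average reward normalization with a tanh bound.
    A generic sketch of the 'normalization + bounded shaping' pattern;
    constants are illustrative, not any paper's exact rule."""

    def __init__(self, beta: float = 0.99, eps: float = 1e-8):
        self.beta = beta      # EMA decay for the running statistics
        self.eps = eps        # numerical floor for the variance
        self.mean = 0.0
        self.var = 1.0
        self.initialized = False

    def update(self, reward: float) -> float:
        if not self.initialized:
            self.mean, self.initialized = reward, True
        else:
            delta = reward - self.mean
            self.mean += (1.0 - self.beta) * delta
            self.var = self.beta * self.var + (1.0 - self.beta) * delta * delta
        z = (reward - self.mean) / math.sqrt(self.var + self.eps)
        return math.tanh(z)   # bounded shaped reward in (-1, 1)

norm = EMARewardNormalizer()
shaped = [norm.update(r) for r in [1.0, 2.0, 100.0, 3.0]]  # outlier squashed
```

The tanh bound keeps a single outlier reward (the 100.0 here) from dominating gradient updates, which is the stability motivation these papers share.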

4) Top 5 papers (with “why now”)

1) Mirage: The Illusion of Visual Understanding

  • Shows frontier multimodal models often confidently describe non-existent images and still score highly when images are omitted (mirage-scores ~70–80% average).
  • Demonstrates benchmark fragility: B-Clean removes ~74–77% of questions in some benchmarks and can drastically change accuracies/rankings.
  • “Why now”: multimodal models are being deployed in high-stakes domains (medicine); this provides a concrete, scalable evaluation control (image-absent) and a cleaning protocol.
  • Be skeptical about: B-Clean is model-set dependent; mechanistic causes of mirage are not fully identified.

2) Cross-Context Verification: Hierarchical Detection of Benchmark Contamination through Session-Isolated Analysis

  • Introduces a black-box, API-only contamination detector using session-isolated repeated trials and patch-diversity metrics.
  • Reports perfect separation between contaminated vs genuine reasoning on 9 SWE-bench problems (small but striking), plus a bias-resistant analysis workflow (HCCA).
  • “Why now”: coding benchmarks are central to frontier claims; this is a practical method to audit them without model internals.
  • Be skeptical about: evaluated on 9 problems / one model; reasoning classifier is heuristic and evaluated on the same data.
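CCV's patch-diversity idea can be approximated with a crude pairwise-similarity proxy over the patches from isolated sessions. This is an illustrative stand-in, not the paper's published metric:

```python
from difflib import SequenceMatcher
from itertools import combinations

def solution_diversity(patches: list[str]) -> float:
    """Mean pairwise dissimilarity of patches produced in session-isolated
    repeated solves. Near-zero diversity (near-identical patches every
    session) is a contamination red flag; genuine problem solving tends
    to vary more. Illustrative proxy only, not CCV's published score."""
    if len(patches) < 2:
        return 0.0
    sims = [SequenceMatcher(None, a, b).ratio()
            for a, b in combinations(patches, 2)]
    return 1.0 - sum(sims) / len(sims)

# Three isolated sessions returning byte-identical patches: suspicious.
memorized = ["def fix():\n    return 1\n"] * 3
varied = ["def fix():\n    return a + b\n", "def fix(x):\n    return x or 1\n"]
```

In practice you would also control decoding temperature, since greedy decoding collapses diversity for uncontaminated problems too.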

3) DSPA: Dynamic SAE Steering for Data-Efficient Preference Alignment

  • Inference-time, prompt-conditional sparse steering in SAE space; edits only token-active latents.
  • Improves MT-Bench across multiple models and stays robust with very small preference datasets (down to ~100–250 triples), with large compute savings vs a two-stage baseline (modeled 4.47× FLOPs; observed 11.5× wall-clock).
  • “Why now”: demand for cheap, reversible alignment and mechanistic auditability is rising.
  • Be skeptical about: depends on availability/quality of SAEs; open-ended eval relies on LLM judges; no formal safety guarantees.
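The "edits only token-active latents" idea can be sketched with a toy ReLU SAE; every name, shape, and threshold below is an assumption for illustration, not DSPA's actual interface:

```python
import numpy as np

def steer_active_latents(h, W_enc, b_enc, W_dec, steer_vec,
                         alpha=1.0, thresh=0.0):
    """Nudge only the SAE latents the current token already activates,
    then decode back to the residual stream. A toy ReLU-SAE sketch in
    the spirit of DSPA, not its implementation.

    h: (d,) activation; W_enc: (m, d); b_enc: (m,); W_dec: (d, m);
    steer_vec: (m,) per-latent steering strengths.
    """
    z = np.maximum(W_enc @ h + b_enc, 0.0)   # SAE encode (ReLU)
    active = z > thresh                      # token-active latents only
    z = z + alpha * steer_vec * active       # inactive latents untouched
    return W_dec @ z                         # decode back to the stream

# Tiny deterministic example: d=4 residual dims, m=6 latents.
h = np.array([1.0, -1.0, 0.5, 2.0])
W_enc, b_enc = np.full((6, 4), 0.1), np.zeros(6)
W_dec = np.full((4, 6), 0.2)
steered = steer_active_latents(h, W_enc, b_enc, W_dec, np.ones(6))
```

Because the edit lives entirely at inference time, setting `alpha=0` recovers the unsteered model exactly, which is the reversibility property the brief highlights.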

4) Are AI-assisted Development Tools Immune to Prompt Injection?

  • Empirically tests tool-poisoning prompt injection across 7 MCP clients with 4 concrete attacks; finds no client blocks all attacks.
  • Highlights large variance: Cursor unsafe across all tested attacks; Claude Desktop and Cline strongest in tested configs; many clients lack static validation/sandboxing/audit logging.
  • “Why now”: MCP-style tool ecosystems are rapidly becoming default in IDE/CLI workflows; this is direct operational risk.
  • Be skeptical about: limited to specific versions/configurations and local testbed; sandboxing assessment partly documentation-based.
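Treating tool metadata as untrusted input can start with a static scan of tool descriptions before a client loads them. The patterns below are illustrative red flags only, not the paper's ruleset or any client's actual validator:

```python
import re

# Illustrative red-flag patterns for injected instructions hiding in tool
# metadata; a real auditor would use a far richer ruleset than this.
SUSPICIOUS_PATTERNS = [
    r"ignore (all|previous) instructions",
    r"do not (tell|inform|alert) the user",
    r"\.ssh|\.aws|\.env|id_rsa",
    r"exfiltrat|upload .* to http",
]

def flag_tool_description(description: str) -> list[str]:
    """Return every pattern the description matches; any hit means the
    tool metadata should be treated as untrusted and human-reviewed
    before the client registers the tool."""
    return [p for p in SUSPICIOUS_PATTERNS
            if re.search(p, description, re.IGNORECASE)]
```

Keyword scans are trivially evadable, so this belongs in front of (not instead of) sandboxing and audit logging.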

5) On the Direction of RLVR Updates for LLM Reasoning: Identification and Exploitation

  • Argues RLVR changes are best understood via signed token probability shifts (Δlog p), not magnitude-only metrics.
  • Shows Δlog p-selected token replacement recovers RLVR performance with ~10% token swaps; proposes test-time extrapolation and training-time advantage reweighting with reported gains (e.g., Avg@32 improvements on AIME and other math sets).
  • “Why now”: RLVR is widely used for reasoning; this offers both interpretability and practical knobs to improve it.
  • Be skeptical about: extrapolation needs both base + RL models at test time and introduces tunable hyperparameters (τ, γ).
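The signed Δlog p selection idea reduces to a per-token subtraction between the two models' log-probabilities of the sampled tokens. The inputs below are made-up stand-ins for real model outputs, and the function is a toy sketch of the selection step, not the paper's code:

```python
import numpy as np

def top_delta_logp_tokens(logp_rl, logp_base, k=3):
    """Signed per-token shift: Δlog p = log p_RL(y_t) - log p_base(y_t).
    Returns the indices of the k most-upweighted tokens plus the full
    shift vector."""
    delta = np.asarray(logp_rl) - np.asarray(logp_base)
    return list(np.argsort(-delta)[:k]), delta

# Hypothetical log-probabilities of the same 4 sampled tokens under the
# base model and the RLVR-tuned model.
logp_base = [-2.0, -0.5, -3.0, -1.0]
logp_rl   = [-0.4, -0.6, -1.0, -1.1]
top_idx, delta = top_delta_logp_tokens(logp_rl, logp_base, k=2)
# top_idx picks the two tokens RLVR pushed hardest toward.
```

Replacing only those top-shift tokens in base-model generations is the mechanism behind the "~10% token swaps" result the brief cites.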

5) Practical next steps

  • Add “counterfactual controls” to your eval harness: for multimodal, run image-absent mirage-mode; for coding, run session-isolated repeated solves and measure diversity (CCV-style).
  • Treat tool metadata as untrusted input: adopt MCP server auditing (static rules + optional dynamic sandbox/eBPF) and require capability inventories + least-privilege hardening before deployment.
  • Instrument agents with structured provenance (intent/observation/inference + evidence chains) and enable mock replay to regression-test prompt/model changes on a pinned incident corpus.
  • For multi-agent SWE, enforce physical isolation (git worktrees/branches), dependency-aware delegation, and test-gated merges; measure integration failure rate vs engineer count to find the parallelism “knee.”
  • If you do RAG, evaluate uncertainty methods that incorporate how context was used (e.g., induction-head activity) and separately track retrieval quality to avoid “faithful-but-wrong” confidence.
  • For RLVR / agent RL, prioritize credit assignment: try counterfactual marginal rewards (CCPO) for collaboration, and consider probability-aware reweighting to avoid ignoring low-probability but crucial tokens.
  • For safety-critical planning (CPS/robotics), integrate formal verification loops (Z3/Spot) and log verification failures as first-class training/eval artifacts.
  • For DP-in-TEE deployments, audit for metadata side-channels (message length, allocation/page faults) and consider DP padding + DP-timed resizing mechanisms where applicable.
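The message-length side channel in the last bullet is typically closed by padding to fixed buckets. A minimal bucket-padding sketch; the bucket sizes are arbitrary choices, and real DP-resize mechanisms are more involved than this:

```python
def pad_to_bucket(payload: bytes,
                  buckets=(1 << 10, 1 << 12, 1 << 14)) -> bytes:
    """Pad a message up to the smallest fixed bucket so that observed
    ciphertext length reveals only the bucket, not the exact payload
    size. Bucket sizes and zero-padding are illustrative; production
    systems need structure-aware framing and noised resize decisions."""
    for size in buckets:
        if len(payload) <= size:
            return payload + b"\x00" * (size - len(payload))
    raise ValueError("payload exceeds the largest bucket")
```

Padding trades bandwidth for leakage: fewer, larger buckets leak less but cost more per message.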

Generated from per-paper analyses; no external browsing.