Daily AI Paper Report (2026-03-25)
Run stats
- Candidates: 223
- Selected: 30
- Deepread completed: 30
- Window (UTC): 2026-03-23T00:00:00Z → 2026-03-24T00:00:00Z (arxiv_announce, expanded=0)
Selected papers
| arXiv ID | Title / Links | Categories | Score | Why | Tags |
|---|---|---|---|---|---|
| 2603.21697 | Structured Visual Narratives Undermine Safety Alignment in Multimodal Large Language Models | cs.CR, cs.AI, cs.MM | 95 | Comic-based multimodal jailbreak benchmark; very high attack success across 15 MLLMs. | multimodal-safety, jailbreaks, benchmark, red-teaming, MLLM, adversarial-prompts |
| 2603.21687 | Mirage: The Illusion of Visual Understanding | cs.AI | 95 | Shows multimodal benchmarks can be gamed with no image; exposes "mirage reasoning" reliability failure. | multimodal, evaluation, hallucination, reliability, benchmarking, medical-ai |
| 2603.21642 | Are AI-assisted Development Tools Immune to Prompt Injection? | cs.CR, cs.SE | 93 | First empirical prompt-injection/tool-poisoning study across 7 real MCP dev clients. | prompt-injection, tool-poisoning, MCP, agent-security, empirical-study, secure-tool-use |
| 2603.21972 | Demystifying Reinforcement Learning for Long-Horizon Tool-Using Agents: A Comprehensive Recipe | cs.LG, cs.CL | 92 | Empirical recipe for scaling RL in long-horizon tool agents; actionable axes + takeaways on TravelPlanner. | tool-using agents, long-horizon RL, RLHF/RLVR, agent evaluation, reward design, planning |
| 2603.22117 | On the Direction of RLVR Updates for LLM Reasoning: Identification and Exploitation | cs.LG, cs.AI | 92 | Token-level signed Δlog p reveals reasoning-critical RLVR updates; actionable analysis + interventions. | LLM, RLVR, reasoning, post-training, mechanistic-analysis, token-level |
| 2603.21641 | Auditing MCP Servers for Over-Privileged Tool Capabilities | cs.CR, cs.SE | 90 | Practical auditing toolkit for over-privileged MCP servers with static+dynamic fuzzing. | MCP, tool-permissions, sandboxing, security-audit, fuzzing, eBPF, agent-infra |
| 2603.21461 | DSPA: Dynamic SAE Steering for Data-Efficient Preference Alignment | cs.LG, cs.AI, cs.CL | 90 | Inference-time preference alignment via prompt-conditional SAE steering; compute-light with strong benchmarks. | alignment, preference optimization, SAE, steering, mechanistic interpretability, inference-time control |
| 2603.21558 | Stabilizing Iterative Self-Training with Verified Reasoning via Symbolic Recursive Self-Alignment | cs.AI | 90 | Stabilizes recursive self-training by step-level symbolic verification; targets drift/mode collapse risk. | self-training, recursive-self-improvement, verification, neuro-symbolic, reasoning, safety |
| 2603.21469 | Hardening Confidential Federated Compute against Side-channel Attacks | cs.CR, cs.DS | 90 | Finds side-channels that can bypass DP in confidential federated compute; proposes mitigations. | privacy, differential-privacy, federated-learning, side-channels, security, confidential-compute |
| 2603.21975 | SecureBreak -- A dataset towards safe and secure models | cs.CR, cs.AI, cs.CL, cs.LG | 88 | Security-focused dataset for robustness evaluation/training against jailbreaks/injection. | dataset, security-alignment, jailbreaks, prompt-injection, robustness-eval, guardrails |
| 2603.22214 | Evaluating the Reliability and Fidelity of Automated Judgment Systems of Large Language Models | cs.CR, cs.AI, cs.LG | 88 | Systematic study of LLM-as-judge reliability vs humans; important for scalable eval and security assessment. | evaluation, LLM-as-judge, reliability, human agreement, model auditing, safety eval |
| 2603.21693 | Deterministic Hallucination Detection in Medical VQA via Confidence-Evidence Bayesian Gain | cs.AI | 88 | Single-pass logprob-based medical MLLM hallucination detection; avoids costly multi-sample entropy methods. | hallucination-detection, MLLM, medical, VQA, uncertainty, logprobs, reliability |
| 2603.21654 | Towards Secure Retrieval-Augmented Generation: A Comprehensive Review of Threats, Defenses and Benchmarks | cs.CR, cs.AI | 86 | Comprehensive RAG security review: threats (poisoning/inference) + defenses + benchmarks. | RAG, security, data-poisoning, membership-inference, defenses, survey, benchmarking |
| 2603.21523 | SafePilot: A Framework for Assuring LLM-enabled Cyber-Physical Systems | cs.RO, cs.AI | 86 | Safety assurance framework for LLM-enabled cyber-physical systems; targets hallucination-driven unsafe acts. | CPS safety, robotics, neuro-symbolic, assurance, runtime safety, hallucinations |
| 2603.21577 | Mind over Space: Can Multimodal Large Language Models Mentally Navigate? | cs.AI | 86 | New benchmark for long-horizon spatial planning from egocentric video; targets agentic MLLM limits. | agents, benchmark, embodied-ai, multimodal, planning, long-context, evaluation |
| 2603.21607 | INTRYGUE: Induction-Aware Entropy Gating for Reliable RAG Uncertainty Estimation | cs.AI | 85 | Mechanistic RAG UQ fix: induction heads inflate entropy; proposes gating for reliability. | RAG, uncertainty, hallucinations, mechanistic-interpretability, calibration, reliability |
| 2603.21489 | Effective Strategies for Asynchronous Software Engineering Agents | cs.CL, cs.AI | 84 | Practical strategies for asynchronous multi-agent SWE; tackles interference, dependencies, and integration. | agents, software engineering, multi-agent coordination, asynchrony, long-horizon tasks, workflow |
| 2603.21925 | Guideline-grounded retrieval-augmented generation for ophthalmic clinical decision support | cs.AI | 84 | Guideline-page image RAG with routing/filtering + traceable citations; strong clinical decision support eval. | RAG, grounding, citations, multimodal, healthcare, evaluation, retrieval |
| 2603.21454 | Cross-Context Verification: Hierarchical Detection of Benchmark Contamination through Session-Isolated Analysis | cs.CL | 83 | Black-box method to detect benchmark contamination via multi-session solution diversity. | evaluation, benchmark-contamination, SWE-bench, leakage, multi-agent, audit-methods |
| 2603.21692 | Reasoning Provenance for Autonomous AI Agents: Structured Behavioral Analytics Beyond State Checkpoints and Execution Traces | cs.AI, cs.DC, cs.SE | 82 | Proposes structured reasoning provenance for agents: queryable 'why' records at scale. | agents, observability, auditing, reasoning-provenance, governance, monitoring |
| 2603.21705 | Data-Free Layer-Adaptive Merging via Fisher Information for Long-to-Short Reasoning LLMs | cs.LG | 82 | Fisher/Hessian-motivated layer-adaptive model merging for long-to-short reasoning; practical compression lever. | model-merging, reasoning, compression, Fisher-information, alignment, LLM |
| 2603.21522 | Efficient Failure Management for Multi-Agent Systems with Reasoning Trace Representation | cs.SE, cs.AI | 82 | Failure management for LLM multi-agent systems using historical patterns + trace representations. | multi-agent, reliability, monitoring, debugging, reasoning-traces, software-engineering |
| 2603.21563 | Counterfactual Credit Policy Optimization for Multi-Agent Collaboration | cs.AI | 81 | Counterfactual credit assignment for collaborative agents; reduces variance/free-riding in multi-agent RL. | multi-agent RL, credit assignment, counterfactual baselines, collaboration, agent training |
| 2603.21606 | mSFT: Addressing Dataset Mixtures Overfitting Heterogeneously in Multi-task SFT | cs.LG, cs.AI | 80 | Multi-task SFT mixture method that avoids per-dataset overfitting; broad benchmark gains. | SFT, data-mixtures, post-training, overfitting, training-recipes, LLM |
| 2603.21877 | P^2O: Joint Policy and Prompt Optimization | cs.LG, cs.AI | 80 | Combines prompt optimization with RLVR to tackle hard samples and sparse rewards; exploration boost. | RLVR, reasoning, prompt optimization, genetic search, training stability, verifiable rewards |
| 2603.21872 | Manifold-Aware Exploration for Reinforcement Learning in Video Generation | cs.CV, cs.AI | 80 | Constrains GRPO exploration to stay near video manifold; improves stability of reward-based post-training. | RL, GRPO, video-generation, alignment, stability, exploration, diffusion |
| 2603.21663 | TAMTRL: Teacher-Aligned Reward Reshaping for Multi-Turn Reinforcement Learning in Long-Context Compression | cs.CL | 80 | Multi-turn RL for long-context compression; tackles credit assignment without heavy judge overhead. | long-context, reinforcement-learning, reward-shaping, memory, training, alignment-methods |
| 2603.21840 | Select, Label, Evaluate: Active Testing in NLP | cs.CL, cs.AI | 78 | Active Testing benchmark across many NLP datasets; reduces labeling cost while estimating performance well. | evaluation, active testing, data efficiency, benchmarking, test set design, annotation |
| 2603.22184 | Revisiting Quantum Code Generation: Where Should Domain Knowledge Live? | cs.LG, quant-ph | 78 | Compares finetune vs RAG vs agent+exec feedback for domain codegen; useful evidence on specialization tradeoffs. | code-generation, agents, RAG, execution-feedback, evaluation, domain-adaptation |
| 2603.22276 | Scaling DoRA: High-Rank Adaptation via Factored Norms and Fused Kernels | cs.LG, stat.ML | 78 | Makes high-rank DoRA practical via factored norms + fused kernels; useful for efficient adaptation. | efficiency, fine-tuning, LoRA, DoRA, systems, kernels, scaling |
AI Paper Insight Brief
2026-03-25
0) Executive takeaways (read this first)
- Evaluation integrity is under active attack—from code benchmarks to multimodal “vision” tests. Cross-session behavioral diversity (CCV) can flag SWE-bench contamination, while “Mirage” shows many multimodal benchmarks remain highly answerable without images (often retaining ~70–80% of accuracy).
- Inference-time, reversible alignment is getting more practical. DSPA uses sparse autoencoder (SAE) features for prompt-conditional, token-conditional steering, improving MT-Bench with modest multiple-choice regression and strong robustness under tiny preference datasets (≈100–250 triples).
- Agent reliability is shifting from “smarter prompts” to “software engineering + ops primitives.” CAID (git worktrees + dependency-aware delegation + test-gated merges) improves long-horizon SWE benchmarks; EAGER and AER propose trace representations for faster failure detection and population-level behavioral analytics.
- Security focus is moving to the tool boundary (MCP) and the RAG pipeline. Empirical MCP client testing finds no client blocks all tool-poisoning attacks; protocol-aware auditing (static + dynamic eBPF fuzzing) catches over-privileged servers; a large RAG-security survey consolidates threats/defenses/benchmarks.
- RL/RLVR for reasoning is being debugged at the token and credit-assignment level. Directional token shifts (Δlog p) explain sparse RLVR changes and enable test-time extrapolation + training reweighting; CCPO and TAMTRL reshape credit assignment for multi-agent collaboration and multi-turn memory RL; P²O uses prompt evolution + context distillation to break “hard-sample zero-reward” dead zones.
- Formal verification and DP are re-entering the loop as practical mitigations. SafePilot uses Z3/Spot to verify LLM-generated CPS plans; confidential federated compute work shows DP can be undermined by side-channels unless message padding and DP-resize mechanisms are added.
2) Key themes (clusters)
Theme: Benchmark trust & contamination (code + multimodal)
- Why it matters: If benchmarks can be solved via leakage or modality shortcuts, reported “reasoning” and “visual understanding” gains are inflated, and downstream decisions (model selection, safety claims, curation) become unreliable.
- Representative papers:
  - Cross-Context Verification (CCV, 2603.21454)
  - Mirage: The Illusion of Visual Understanding (2603.21687)
  - Select, Label, Evaluate: Active Testing in NLP (2603.21840)
- Common approach:
- Replace artifact-only checks with behavioral or counterfactual controls (session-isolated repeated solves; image-absent “mirage-mode”).
- Quantify susceptibility with simple ratios/metrics (contamination score; mirage-score = acc(no image)/acc(with image)).
- Reduce evaluation cost while preserving statistical validity (active testing with Horvitz–Thompson estimators + adaptive stopping).
- Open questions / failure modes:
- How well do contamination/mirage diagnostics generalize across model families, decoding settings, and domains?
- Model-set dependence: cleaning procedures like B-Clean depend on which models are used to filter.
- Can models learn to “fake diversity” or “fake uncertainty” to evade behavioral contamination checks?
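The two ratio diagnostics in this theme are simple enough to sketch. The function names and the 0.7/0.5 thresholds below are illustrative choices, not the papers' implementations:

```python
def mirage_score(acc_no_image: float, acc_with_image: float) -> float:
    """Fraction of benchmark accuracy retained when images are withheld.
    Values near 1.0 suggest the benchmark is answerable without vision."""
    return acc_no_image / acc_with_image if acc_with_image > 0 else 0.0

def solution_diversity(patches: list[str]) -> float:
    """Distinct-solution ratio across session-isolated solves of one task.
    Low diversity (near-identical patches every session) is a leakage signal."""
    return len(set(patches)) / len(patches) if patches else 0.0

# A benchmark retaining most of its accuracy with no image, and a task
# solved with the same patch in 4 of 5 isolated sessions, both warrant scrutiny.
suspicious = (mirage_score(0.62, 0.80) > 0.7
              and solution_diversity(["p1", "p1", "p1", "p1", "p2"]) < 0.5)
```

Both metrics are black-box: they need only repeated API calls, not model internals, which is what makes them deployable as routine eval-harness controls.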
Theme: Inference-time alignment & mechanistic uncertainty signals
- Why it matters: Deployment often needs cheap, reversible alignment and reliable uncertainty without retraining or heavy sampling—especially for RAG and open-ended generation.
- Representative papers:
  - DSPA: Dynamic SAE Steering for Data-Efficient Preference Alignment (2603.21461)
  - INTRYGUE: Induction-Aware Entropy Gating for Reliable RAG Uncertainty Estimation (2603.21607)
  - Deterministic Hallucination Detection in Medical VQA via Confidence-Evidence Bayesian Gain (CEBaG, 2603.21693)
- Common approach:
- Use internal representations (SAE features; induction-head SinkRate; token log-prob variance + evidence gain) to drive interventions/scores.
- Prefer training-free or low-data methods (DSPA robust down to ~100–250 preference triples; INTRYGUE training-free; CEBaG deterministic with 3 forward passes).
- Emphasize auditability (sparse feature edits; mechanistic probes/ablations; hyperparameter-free scoring).
- Open questions / failure modes:
- White-box dependence: INTRYGUE needs attention access; CEBaG needs logprobs.
- Faithfulness vs truth: INTRYGUE measures grounding faithfulness—wrong retrieved docs can still look “certain.”
- Steering misuse risk: inference-time steering could be applied adversarially.
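A minimal sketch of DSPA-style sparse steering, with randomly initialized toy weights standing in for a trained SAE; the shapes and the "edit only token-active latents" rule follow the description above, but the exact steering formula is an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 8, 32
W_enc = rng.normal(size=(d_sae, d_model))   # toy encoder weights (stand-in for a trained SAE)
W_dec = rng.normal(size=(d_model, d_sae))   # toy decoder weights

def steer(h, direction, alpha=0.5):
    """Edit only the SAE latents active for this token; adding back just
    the decoded delta keeps the edit sparse and trivially reversible."""
    z = np.maximum(W_enc @ h, 0.0)                    # ReLU SAE encode
    delta = alpha * np.where(z > 0, direction, 0.0)   # token-active latents only
    return h + W_dec @ delta

h = rng.normal(size=d_model)                           # one token's residual activation
steered = steer(h, direction=np.ones(d_sae))           # push active latents toward a preference direction
```

Because only the decoded delta is added, setting alpha to 0 (or subtracting the same delta) recovers the original activation exactly, which is the reversibility property the theme emphasizes.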
Theme: Agent engineering for long-horizon reliability (coordination, debugging, provenance)
- Why it matters: As agents become asynchronous and autonomous, the dominant failures are integration bugs, recurring trace-level failure patterns, and lack of population-level observability.
- Representative papers:
  - Effective Strategies for Asynchronous Software Engineering Agents (CAID, 2603.21489)
  - Efficient Failure Management for Multi-Agent Systems with Reasoning Trace Representation (EAGER, 2603.21522)
  - Reasoning Provenance for Autonomous AI Agents (AER, 2603.21692)
- Common approach:
- Import SWE/ops primitives: dependency graphs, isolated workspaces, test-gated merges (CAID).
- Learn trace representations for retrieval-based diagnosis and step-wise mitigation (EAGER dual encoders + contrastive objectives).
- Standardize structured provenance schemas with replay modes for regression (AER: intent/observation/inference + mock replay).
- Open questions / failure modes:
- Overhead: more agents, more isolation, more logging can increase cost and latency.
- Faithfulness: provenance fields can be self-reported and rationalized; representation models are preliminary.
- Generalization beyond code/tool domains where tests and version control exist.
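A sketch of what a structured provenance record plus mock replay could look like. The fields mirror the intent/observation/inference triple mentioned above, but the schema and helper names are otherwise hypothetical:

```python
import time
from dataclasses import dataclass, field

@dataclass
class ProvenanceRecord:
    step: int
    intent: str        # what the agent was trying to achieve
    observation: str   # evidence acted on (tool output, retrieved snippet)
    inference: str     # the stated link from observation to action
    action: str
    ts: float = field(default_factory=time.time)

def mock_replay(records, decide):
    """Re-run a decision function against recorded observations, so prompt
    or model changes can be regression-tested on a pinned incident corpus
    without touching live tools."""
    return [(r.step, decide(r.intent, r.observation), r.action) for r in records]

def regressions(records, decide):
    """Steps where the new policy diverges from the recorded action."""
    return [step for step, new, old in mock_replay(records, decide) if new != old]
```

The faithfulness caveat above applies directly: `inference` is self-reported by the agent, so replay can detect behavioral drift but not verify that the recorded rationale was the true cause of the action.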
Theme: Tool/RAG security & privacy leakage in “secure” compute
- Why it matters: Tool integration and retrieval pipelines expand the attack surface; DP and TEEs don’t automatically prevent leakage if metadata and side-channels remain observable.
- Representative papers:
  - Are AI-assisted Development Tools Immune to Prompt Injection? (2603.21642)
  - Auditing MCP Servers for Over-Privileged Tool Capabilities (2603.21641)
  - Towards Secure Retrieval-Augmented Generation: A Comprehensive Review of Threats, Defenses and Benchmarks (2603.21654)
  - Hardening Confidential Federated Compute against Side-channel Attacks (2603.21469)
- Common approach:
- Protocol-aware security evaluation (MCP tool metadata as an injection vector; capability-family auditing).
- Combine static + dynamic evidence (Docker sandbox + eBPF telemetry; fuzzing).
- Treat side-channels as first-class DP threats; add DP padding and DP-timed resizing with proofs.
- Open questions / failure modes:
- Static coverage gaps (e.g., MCP audit static misses JS/TS; dynamic requires eBPF/Linux).
- Defense trade-offs: padding overhead; DP-resize complexity; incomplete channel coverage.
- Ecosystem drift: client versions/configurations change rapidly; security posture can regress.
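Treating tool metadata as untrusted input can start with static rules like the following; the regex patterns and capability names are illustrative, not the actual ruleset from the auditing paper:

```python
import re

# Heuristic signatures of instruction-like payloads hidden in tool descriptions.
INJECTION_PATTERNS = [
    r"(?i)ignore (all |any )?previous instructions",
    r"(?i)do not (tell|inform|alert) the user",
    r"(?i)\b(send|forward|upload)\b.*\b(file|key|token|credential)s?\b",
]
# Capabilities that deserve scrutiny regardless of description (hypothetical names).
BROAD_CAPABILITIES = {"fs.write", "net.outbound", "shell.exec"}

def audit_tool(description: str, capabilities: set[str]) -> list[str]:
    """Flag instruction-like payloads in a tool description, plus any
    declared capability broader than least privilege would suggest."""
    findings = [f"injection-pattern: {p}" for p in INJECTION_PATTERNS
                if re.search(p, description)]
    findings += sorted(f"over-privilege: {c}"
                       for c in capabilities & BROAD_CAPABILITIES)
    return findings
```

A dynamic pass (sandboxed execution with syscall telemetry, as in the eBPF-based auditing above) would then confirm which declared capabilities are actually exercised; static rules alone share the coverage gaps noted in the open questions.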
Theme: RL/RLVR stabilization via better credit assignment and exploration control
- Why it matters: RL for reasoning/agents/video is limited by sparse rewards, hard-sample dead zones, free-riding in multi-agent setups, and unstable exploration.
- Representative papers:
  - On the Direction of RLVR Updates for LLM Reasoning (2603.22117)
  - Counterfactual Credit Policy Optimization for Multi-Agent Collaboration (CCPO, 2603.21563)
  - TAMTRL: Teacher-Aligned Reward Reshaping for Multi-Turn Reinforcement Learning in Long-Context Compression (2603.21663)
  - P^2O: Joint Policy and Prompt Optimization (2603.21877)
  - Manifold-Aware Exploration for Reinforcement Learning in Video Generation (2603.21872)
- Common approach:
- Replace shared/terminal rewards with counterfactual or shaped per-component signals (CCPO marginal contributions; TAMTRL turn rewards; Δlog p token diagnostics).
- Use trust regions / normalization to stabilize updates (dual KL anchors; EMA normalization; gradient equalizers).
- Add exploration that stays “on-manifold” (video GRPO variance correction) or “prompt-assisted” (P²O prompt evolution).
- Open questions / failure modes:
- Extra compute/runtime (counterfactuals; prompt evolution; video RL).
- Hyperparameter sensitivity (extrapolation γ/τ; trust-region weights; prompt-search budgets).
- Transfer beyond math/video/specific topologies remains under-tested.
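The counterfactual-baseline and bounded-shaping ideas above fit in a few lines; the ablation-based marginal and the EMA/tanh shaping below are simplified stand-ins for CCPO's actual estimator, not its published algorithm:

```python
import math

def marginal_credits(team_reward, actions: dict) -> dict:
    """Per-agent counterfactual credit: team reward with the agent's action
    included, minus team reward with that action ablated. Free-riders whose
    removal changes nothing receive zero credit."""
    full = team_reward(actions)
    return {a: full - team_reward({k: v for k, v in actions.items() if k != a})
            for a in actions}

class ShapedSignal:
    """EMA z-scoring plus tanh keeps the shaped credit bounded in (-1, 1),
    damping reward-scale swings that destabilize updates."""
    def __init__(self, beta: float = 0.99):
        self.beta, self.mean, self.var = beta, 0.0, 1.0
    def __call__(self, x: float) -> float:
        self.mean = self.beta * self.mean + (1 - self.beta) * x
        self.var = self.beta * self.var + (1 - self.beta) * (x - self.mean) ** 2
        return math.tanh((x - self.mean) / (math.sqrt(self.var) + 1e-8))
```

The cost caveat above is visible here: each credit requires one extra team-reward evaluation per agent per step, which is the compute overhead counterfactual methods trade for lower-variance signals.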
3) Technical synthesis
- Behavioral counterfactuals are becoming the common diagnostic tool: CCV uses session-isolated repeated solves; Mirage uses image-absent controls; CCPO uses counterfactual rollouts; CEBaG uses text-only vs multimodal scoring passes.
- “White-box signals” are increasingly used to fix evaluation and safety gaps: induction-head SinkRate (INTRYGUE), SAE latents (DSPA), token logprob variance/evidence gain (CEBaG), signed Δlog p (RLVR direction).
- Credit assignment is converging on normalization + bounded shaping: CCPO’s EMA z-scoring/tanh shaping; TAMTRL’s min–max normalization (and collapse without it); SAGE-GRPO’s timestep equalizer; RLVR reweighting upweights low-prob tokens.
- Agent reliability work is splitting into two layers: (a) coordination primitives (CAID’s worktrees/merges/tests) and (b) observability primitives (EAGER embeddings for failure retrieval; AER schema + mock replay).
- Security is shifting from “model jailbreaks” to “system boundary jailbreaks”: MCP tool metadata poisoning and over-privileged servers; RAG pipeline threats; DP-in-TEE side-channels.
- Formal methods are being used as practical guardrails rather than end-to-end verification: SafePilot verifies plans with Z3/Spot and iteratively re-prompts; DP side-channel mitigations come with theorems but target specific channels.
- Data efficiency is a recurring theme across alignment and evaluation: DSPA works under severe preference-data restriction; Active Testing cuts labeling up to 95%; mSFT reduces wasted compute by excluding early-overfitting sub-datasets.
- “Training-free” or “no weight updates” is not just convenience—it’s becoming a safety/ops feature: DSPA steering is reversible; FIM-based merging is data-free; INTRYGUE is training-free; CEBaG is deterministic and sampling-free.
4) Top 5 papers (with “why now”)
1) Mirage: The Illusion of Visual Understanding
- Shows frontier multimodal models often confidently describe non-existent images and still score highly when images are omitted (mirage-scores ~70–80% average).
- Demonstrates benchmark fragility: B-Clean removes ~74–77% of questions in some benchmarks and can drastically change accuracies/rankings.
- “Why now”: multimodal models are being deployed in high-stakes domains (medicine); this provides a concrete, scalable evaluation control (image-absent) and a cleaning protocol.
- Be skeptical about: B-Clean is model-set dependent; mechanistic causes of mirage are not fully identified.
2) Cross-Context Verification: Hierarchical Detection of Benchmark Contamination through Session-Isolated Analysis
- Introduces a black-box, API-only contamination detector using session-isolated repeated trials and patch-diversity metrics.
- Reports perfect separation between contaminated vs genuine reasoning on 9 SWE-bench problems (small but striking), plus a bias-resistant analysis workflow (HCCA).
- “Why now”: coding benchmarks are central to frontier claims; this is a practical method to audit them without model internals.
- Be skeptical about: evaluated on 9 problems / one model; reasoning classifier is heuristic and evaluated on the same data.
3) DSPA: Dynamic SAE Steering for Data-Efficient Preference Alignment
- Inference-time, prompt-conditional sparse steering in SAE space; edits only token-active latents.
- Improves MT-Bench across multiple models and stays robust with very small preference datasets (down to ~100–250 triples), with large compute savings vs a two-stage baseline (modeled 4.47× FLOPs; observed 11.5× wall-clock).
- “Why now”: demand for cheap, reversible alignment and mechanistic auditability is rising.
- Be skeptical about: depends on availability/quality of SAEs; open-ended eval relies on LLM judges; no formal safety guarantees.
4) Are AI-assisted Development Tools Immune to Prompt Injection?
- Empirically tests tool-poisoning prompt injection across 7 MCP clients with 4 concrete attacks; finds no client blocks all attacks.
- Highlights large variance: Cursor unsafe across all tested attacks; Claude Desktop and Cline strongest in tested configs; many clients lack static validation/sandboxing/audit logging.
- “Why now”: MCP-style tool ecosystems are rapidly becoming default in IDE/CLI workflows; this is direct operational risk.
- Be skeptical about: limited to specific versions/configurations and local testbed; sandboxing assessment partly documentation-based.
5) On the Direction of RLVR Updates for LLM Reasoning: Identification and Exploitation
- Argues RLVR changes are best understood via signed token probability shifts (Δlog p), not magnitude-only metrics.
- Shows Δlog p-selected token replacement recovers RLVR performance with ~10% token swaps; proposes test-time extrapolation and training-time advantage reweighting with reported gains (e.g., Avg@32 improvements on AIME and other math sets).
- “Why now”: RLVR is widely used for reasoning; this offers both interpretability and practical knobs to improve it.
- Be skeptical about: extrapolation needs both base + RL models at test time and introduces tunable hyperparameters (τ, γ).
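The two quantities in this paper can be sketched directly. The single-γ form of the extrapolation below is an assumption about one plausible parameterization (τ would enter separately as a sampling temperature), not the paper's exact formula:

```python
import numpy as np

def delta_logp(logp_rl: np.ndarray, logp_base: np.ndarray) -> np.ndarray:
    """Signed per-token shift between RL-tuned and base models; large
    positive entries mark tokens the RL stage promoted, i.e. candidate
    reasoning-critical updates."""
    return logp_rl - logp_base

def extrapolated_logits(l_rl: np.ndarray, l_base: np.ndarray,
                        gamma: float = 0.5) -> np.ndarray:
    """Test-time extrapolation: step past the RL model along the update
    direction (gamma=0 recovers the RL model exactly). Note this requires
    keeping both the base and RL models available at inference."""
    return l_rl + gamma * (l_rl - l_base)
```

The sign of `delta_logp` is what distinguishes this view from magnitude-only metrics: a token can move a lot in probability while still being demoted, and only the signed shift separates the two cases.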
5) Practical next steps
- Add “counterfactual controls” to your eval harness: for multimodal, run image-absent mirage-mode; for coding, run session-isolated repeated solves and measure diversity (CCV-style).
- Treat tool metadata as untrusted input: adopt MCP server auditing (static rules + optional dynamic sandbox/eBPF) and require capability inventories + least-privilege hardening before deployment.
- Instrument agents with structured provenance (intent/observation/inference + evidence chains) and enable mock replay to regression-test prompt/model changes on a pinned incident corpus.
- For multi-agent SWE, enforce physical isolation (git worktrees/branches), dependency-aware delegation, and test-gated merges; measure integration failure rate vs engineer count to find the parallelism “knee.”
- If you do RAG, evaluate uncertainty methods that incorporate how context was used (e.g., induction-head activity) and separately track retrieval quality to avoid “faithful-but-wrong” confidence.
- For RLVR / agent RL, prioritize credit assignment: try counterfactual marginal rewards (CCPO) for collaboration, and consider probability-aware reweighting to avoid ignoring low-probability but crucial tokens.
- For safety-critical planning (CPS/robotics), integrate formal verification loops (Z3/Spot) and log verification failures as first-class training/eval artifacts.
- For DP-in-TEE deployments, audit for metadata side-channels (message length, allocation/page faults) and consider DP padding + DP-timed resizing mechanisms where applicable.
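Length side-channel hardening from the last bullet can start with deterministic bucket padding; note that a full DP padding mechanism adds calibrated noise to the size decision, which this sketch (with arbitrary bucket sizes) deliberately omits:

```python
def pad_to_bucket(payload: bytes, buckets=(256, 1024, 4096, 16384)) -> bytes:
    """Pad a message up to a fixed bucket size so that observed length
    reveals at most which bucket the payload fell into, not its exact
    size. The true length must be carried inside the payload framing so
    the receiver can strip the padding."""
    for cap in buckets:
        if len(payload) <= cap:
            return payload + b"\x00" * (cap - len(payload))
    raise ValueError("payload exceeds largest bucket")
```

Bucketing bounds the leakage per message but does not eliminate it; the side-channel paper's point is that such mitigations need DP accounting (and coverage of allocation/timing channels) rather than ad-hoc padding alone.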
Generated from per-paper analyses; no external browsing.
