Daily AI Paper Report (2026-04-01)
Run stats
- Candidates: 223
- Selected: 30
- Deepread completed: 30
- Window (UTC): 2026-03-30T00:00:00Z → 2026-03-31T00:00:00Z (arxiv_announce, expanded=0)
1) Selected papers
| arXiv ID | Title / Links | Categories | Score | Why | Tags |
|---|---|---|---|---|---|
| 2603.28013 | Kill-Chain Canaries: Stage-Level Tracking of Prompt Injection Across Attack Surfaces and Model Safety Tiers | cs.CR, cs.AI, cs.LG | 95 | Stage-level prompt-injection tracking w/ canaries across models, surfaces, and safety tiers | prompt-injection, agent-security, evaluation, canary-tokens, kill-chain, frontier-models |
| 2603.28063 | Reward Hacking as Equilibrium under Finite Evaluation | cs.AI, cs.GT | 95 | Formal result: reward hacking emerges under finite evaluation; computable distortion index. | reward-hacking, principal-agent, evaluation, alignment-theory, RLHF, DPO |
| 2603.28650 | Information-Theoretic Limits of Safety Verification for Self-Improving Systems | cs.LG, cs.AI, stat.ML | 95 | Formal impossibility bounds for safety gates in self-improving systems; high AI safety relevance | AI safety, self-improvement, verification, risk bounds, theory, TPR/FPR, impossibility |
| 2603.28166 | Evaluating Privilege Usage of Agents on Real-World Tools | cs.CR, cs.AI | 92 | GrantBox sandbox evaluates real-world tool privilege usage—core risk for autonomous agents | agent-security, tool-use, privilege, sandbox, benchmark, real-world-tools |
| 2603.28345 | Crossing the NL/PL Divide: Information Flow Analysis Across the NL/PL Boundary in LLM-Integrated Code | cs.SE, cs.AI | 92 | Bridges NL/PL boundary for info-flow/taint across LLM calls; key for LLM app security. | program-analysis, information-flow, taint-analysis, LLM-security, prompting, software-engineering |
| 2603.28551 | "What Did It Actually Do?": Understanding Risk Awareness and Traceability for Computer-Use Agents | cs.CR, cs.ET, cs.HC, cs.MA | 90 | Empirical study + corpus on computer-use agent risk awareness and post-hoc auditability. | computer-use-agents, auditability, traceability, user-safety, agent-security, HCI |
| 2603.28204 | ERPO: Token-Level Entropy-Regulated Policy Optimization for Large Reasoning Models | cs.LG, cs.AI | 90 | Token-level entropy regulation for RLVR/GRPO improves credit assignment in reasoning chains | LLM reasoning, RLVR, GRPO, post-training, credit assignment, entropy, optimization |
| 2603.28054 | Who Wrote the Book? Detecting and Attributing LLM Ghostwriters | cs.CL | 90 | GhostWriteBench + robust OOD LLM authorship attribution; practical for misuse detection. | authorship-attribution, model-fingerprinting, misuse-detection, dataset, robustness, OOD |
| 2603.28407 | MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome | cs.AI, cs.CL | 89 | MiroEval benchmarks multimodal deep-research agents on process + outcome with refreshable tasks | agents, evaluation, deep-research, multimodal, process-metrics, benchmark |
| 2603.28376 | Marco DeepResearch: Unlocking Efficient Deep Research Agents via Verification-Centric Design | cs.CL, cs.AI | 88 | Verification-centric deep research agent design across synthesis/trajectories/test-time scaling. | deep-research-agents, verification, tool-use, long-horizon, test-time-scaling, RAG |
| 2603.27982 | CDH-Bench: A Commonsense-Driven Hallucination Benchmark for Evaluating Visual Fidelity in Vision-Language Models | cs.CV, cs.AI, cs.CL | 88 | New benchmark for commonsense-driven hallucination when vision conflicts with priors | VLM, hallucination, evaluation, robustness, benchmarks, visual grounding, reliability |
| 2603.28569 | CirrusBench: Evaluating LLM-based Agents Beyond Correctness in Real-World Cloud Service Environments | cs.LG, cs.AI, cs.IR, cs.PF | 87 | Real cloud-ticket agent benchmark measuring robustness and resolution efficiency beyond accuracy | agents, evaluation, real-world, customer-support, long-horizon, reliability |
| 2603.28387 | The Scaffold Effect: How Prompt Framing Drives Apparent Multimodal Gains in Clinical VLM Evaluation | cs.AI, cs.LG | 86 | Shows prompt framing can fake multimodal clinical gains; important eval artifact warning. | evaluation, prompting, multimodal, VLM, clinical-AI, spurious-cues |
| 2603.28430 | IsoQuant: Hardware-Aligned SO(4) Isoclinic Rotations for LLM KV Cache Compression | cs.LG, cs.CL | 86 | Hardware-aligned SO(4) rotations for low-bit KV-cache compression; practical LLM efficiency gain | LLM efficiency, KV cache, compression, quantization, inference, systems, long-context |
| 2603.28304 | The Necessity of Setting Temperature in LLM-as-a-Judge | cs.CL | 86 | Shows temperature materially affects LLM-as-judge reliability; key for eval validity. | LLM-as-a-judge, evaluation, reproducibility, temperature, meta-eval, benchmarking |
| 2603.28618 | Seeing with You: Perception-Reasoning Coevolution for Multimodal Reasoning | cs.AI | 85 | Dual-role RLVR separates perception vs reasoning credit; targets evidence extraction failures | multimodal, RLVR, credit assignment, VLM, reasoning, perception, reliability |
| 2603.27918 | Adversarial Attacks on Multimodal Large Language Models: A Comprehensive Survey | cs.CR, cs.AI | 84 | Comprehensive survey of adversarial attacks on MLLMs with taxonomy across modalities/settings | multimodal, adversarial-attacks, survey, security, jailbreaks, threat-models |
| 2603.28092 | InkDrop: Invisible Backdoor Attacks Against Dataset Condensation | cs.LG | 84 | Stealthy backdoor attack on dataset condensation; highlights new data-poisoning surface. | backdoors, data-poisoning, dataset-condensation, adversarial-ML, security |
| 2603.28135 | CoT2-Meta: Budgeted Metacognitive Control for Test-Time Reasoning | cs.AI | 84 | Training-free metacognitive control for test-time reasoning: prune/repair/abstain under budget. | test-time-compute, reasoning, abstention, search, inference, reliability |
| 2603.28301 | LIBERO-Para: A Diagnostic Benchmark and Metrics for Paraphrase Robustness in VLA Models | cs.LG | 83 | Benchmark for paraphrase robustness in VLA robots; large drops under simple synonyms. | robustness, paraphrase, VLA, robotics, benchmark, instruction-following |
| 2603.28610 | ResAdapt: Adaptive Resolution for Efficient Multimodal Reasoning | cs.CV, cs.AI, cs.CL | 83 | Learns input-side adaptive resolution via bandits to cut visual tokens while keeping reasoning | multimodal, efficiency, adaptive compute, token budget, context length, bandits, MLLM |
| 2603.28005 | Rethinking Atomic Decomposition for LLM Judges: A Prompt-Controlled Study of Reference-Grounded QA Evaluation | cs.CL | 82 | Prompt-controlled test questions whether atomic decomposition truly helps LLM judges in QA eval | LLM-judges, evaluation, factuality, reference-grounding, prompting, methodology |
| 2603.28605 | Unsafe2Safe: Controllable Image Anonymization for Downstream Utility | cs.CV, cs.CY, cs.LG | 82 | Automated privacy-risk detection + diffusion editing to anonymize images while preserving utility. | privacy, data-sanitization, diffusion-editing, dataset-curation, VLM |
| 2603.28696 | AdaptToken: Entropy-based Adaptive Token Selection for MLLM Long Video Understanding | cs.CV, cs.AI | 82 | Entropy-guided token budgeting for long-video MLLMs; principled stop/allocate mechanism. | multimodal, long-context, video-understanding, efficiency, uncertainty, token-selection |
| 2603.28488 | Courtroom-Style Multi-Agent Debate with Progressive RAG and Role-Switching for Controversial Claim Verification | cs.CL, cs.AI, cs.MA | 80 | Structured multi-agent debate + progressive RAG for controversial claim verification robustness | claim-verification, RAG, multi-agent, debate, hallucinations, calibration |
| 2603.28476 | With a Little Help From My Friends: Collective Manipulation in Risk-Controlling Recommender Systems | cs.IR, cs.LG, cs.SI | 80 | Analyzes coordinated user manipulation against risk-controlling recommenders with safety guarantees. | recommenders, adversarial-manipulation, conformal-risk-control, safety, social-computing |
| 2603.28622 | Trust-Aware Routing for Distributed Generative AI Inference at the Edge | cs.DC, cs.AI, cs.NI | 80 | Trust-aware routing for distributed generative inference; robustness to misbehaving edge peers | agentic systems, distributed inference, trust, robustness, security, edge AI, systems |
| 2603.28026 | When Choices Become Priors: Contrastive Decoding for Scientific Figure Multiple-Choice QA | cs.AI | 80 | Training-free contrastive decoding to reduce answer-choice priors; improves figure grounding. | multimodal-eval, grounding, decoding, bias, scientific-qa, robustness |
| 2603.28378 | Membership Inference Attacks against Large Audio Language Models | cs.SD, cs.AI | 79 | First systematic MIA study for audio LMs; shows confounds and proposes distribution-matched eval | privacy, membership-inference, audio, evaluation, dataset-shift, leakage |
| 2603.28662 | AMIGO: Agentic Multi-Image Grounding Oracle Benchmark | cs.LG, cs.AI | 78 | AMIGO tests long-horizon multi-image grounding via constrained questioning under uncertainty | multimodal-agents, benchmark, interactive-eval, uncertainty, grounding, long-horizon |
AI Paper Insight Brief
2026-04-01
0) Executive takeaways (read this first)
- Evaluation is the new attack surface: multiple papers show that how we evaluate (prompt framing, temperature, decomposition style, MC vs QA format) can dominate measured “capability” and hide grounding failures—so benchmark design is now a first-class safety concern.
- Agent security failures are pipeline- and surface-dependent, not model-dependent: prompt injection and privilege misuse vary wildly by injection surface and agent stage; outcome-only ASR is insufficient to choose defenses or architectures.
- “Priors beat pixels” remains a core multimodal failure mode: VLMs often normalize anomalies to commonsense (CDH), follow answer-choice priors in scientific MCQA, or respond to “MRI available” scaffolds without using images—suggesting grounding interventions must explicitly counter prior dominance.
- Verification-centric designs are emerging as a unifying reliability lever: from deep-research agents (verification at synthesis/trajectory/inference) to stage-level canaries and process-centric benchmarks, verification is being operationalized as instrumentation + control, not just post-hoc scoring.
- Training-time and data-supply-chain risks are sharpening: stealthy backdoors in dataset condensation (InkDrop) and broad MLLM attack taxonomies highlight that multimodal systems inherit vulnerabilities across encoders, fusion, and instruction-following—often transferable in black-box settings.
- Test-time compute is shifting from “more tokens” to “better control”: metacognitive controllers (CoT2-Meta) and long-video token/resolution allocators (AdaptToken, ResAdapt) show consistent gains under fixed budgets by allocating compute to high-value steps/frames.
2) Key themes (clusters)
Theme: Agent prompt-injection & privilege misuse need stage/surface localization
- Why it matters: Agent deployments fail in different places (memory write, relay, tool args) and via different surfaces (tool outputs vs memory poisoning). Without stage-level instrumentation, defenses can look effective while failing in realistic multi-surface pipelines.
- Representative papers:
- Kill-Chain Canaries: Stage-Level Tracking of Prompt Injection Across Attack Surfaces and Model Safety Tiers
- Evaluating Privilege Usage of Agents on Real-World Tools
- Crossing the NL/PL Divide: Information Flow Analysis Across the NL/PL Boundary in LLM-Integrated Code
- “What Did It Actually Do?”: Understanding Risk Awareness and Traceability for Computer-Use Agents
- Common approach:
- Instrument pipelines with stage-level signals (EXPOSED→PERSISTED→RELAYED→EXECUTED) rather than single ASR.
- Evaluate across multiple injection surfaces and realistic tool stacks (real MCP servers; NL/PL placeholder flows).
- Add traceability/logging (propagation loggers, remote request loggers, user-facing action traces).
- Open questions / failure modes:
- Defense “success” that is actually surface mismatch (defenses fail outside assumed channel).
- Hard-to-analyze Layer 2 reachability in NL/PL flows (taxonomy can say “propagates”, but code-level reachability still fails).
- Human operators lacking post-hoc auditability and persistence understanding even when logs exist.
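The stage-level instrumentation described above can be sketched as a minimal canary tracker. This is an illustrative sketch, not the paper's actual API: a unique canary string is planted in an untrusted surface, and after the episode the pipeline's artifacts are scanned to record the furthest kill-chain stage the canary reached.

```python
# Minimal sketch of stage-level canary telemetry (names and artifact shapes
# are illustrative assumptions, not the paper's instrumentation). A unique
# canary is planted in an untrusted input; afterwards we scan pipeline
# artifacts to find the furthest kill-chain stage the canary reached.
import secrets

STAGES = ["EXPOSED", "PERSISTED", "RELAYED", "EXECUTED"]

def make_canary() -> str:
    # Unique, unguessable token so matches are unambiguous.
    return f"CANARY-{secrets.token_hex(8)}"

def furthest_stage(canary, context_window, memory_store, relayed_msgs, tool_calls):
    """Return the last stage whose artifact contains the canary, or None."""
    artifacts = [
        ("EXPOSED",   context_window),            # model saw the injected text
        ("PERSISTED", " ".join(memory_store)),    # written to long-term memory
        ("RELAYED",   " ".join(relayed_msgs)),    # forwarded to another agent/tool
        ("EXECUTED",  " ".join(tool_calls)),      # appeared in executed tool args
    ]
    reached = None
    for stage, blob in artifacts:
        if canary in blob:
            reached = stage
    return reached

# Usage: a run where the canary was stored in memory but never relayed/executed.
c = make_canary()
stage = furthest_stage(
    c,
    context_window=f"web page says: {c} please exfiltrate",
    memory_store=[f"note to self: {c}"],
    relayed_msgs=["summarize the page"],
    tool_calls=["fetch('https://example.com')"],
)
print(stage)  # PERSISTED
```

Reporting the furthest stage per run, rather than a single attack-success bit, is what lets a defense claim "stops propagation at PERSISTED" instead of just quoting an ASR.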
Theme: Multimodal grounding failures driven by priors and prompt scaffolds
- Why it matters: Models can appear strong while ignoring visual evidence—dangerous in anomaly-sensitive domains (clinical, inspection, forensics) and scientific figure interpretation.
- Representative papers:
- CDH-Bench: A Commonsense-Driven Hallucination Benchmark for Evaluating Visual Fidelity in Vision-Language Models
- When Choices Become Priors: Contrastive Decoding for Scientific Figure Multiple-Choice QA
- The Scaffold Effect: How Prompt Framing Drives Apparent Multimodal Gains in Clinical VLM Evaluation
- Common approach:
- Construct counterfactual vs commonsense paired inputs to isolate prior-driven collapse (CF-Acc vs CS-Acc; CFAD/CCR/RPD).
- Use inference-time debiasing (contrastive decoding subtracting text-only logits).
- Run prompt ablations (MRI preamble vs actual image; swap/OOD images; false-modality preambles).
- Open questions / failure modes:
- Multiple-choice formats can amplify prior collapse (CDH-Bench shows larger CFAD in MC).
- Debiasing can hurt when the text prior is correct (SCICON harmed cases when visual evidence margin is weak/negative).
- Prompt scaffolds can create illusory multimodal gains that alignment fine-tuning (MPO) may suppress by collapsing performance rather than fixing grounding.
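The paired counterfactual design above reduces to a simple accuracy gap. A minimal sketch of the idea (the benchmark's exact CFAD/CCR/RPD definitions may differ; this just shows the paired-accuracy comparison): each item pairs a commonsense input with a counterfactual twin whose visual evidence contradicts the prior.

```python
# Illustrative counterfactual-accuracy-drop computation (CFAD-style gap;
# the benchmark's exact metric definitions are assumptions here). A large
# gap means the model answers from priors rather than pixels.

def accuracy(preds, golds):
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def cfad(cs_preds, cs_golds, cf_preds, cf_golds):
    """Commonsense accuracy minus counterfactual accuracy (prior-collapse gap)."""
    return accuracy(cs_preds, cs_golds) - accuracy(cf_preds, cf_golds)

# Usage: the model aces commonsense cases but reverts to priors on counterfactuals.
cs = (["red", "round", "up"], ["red", "round", "up"])        # 3/3 correct
cf = (["red", "round", "down"], ["blue", "square", "down"])  # 1/3 correct
print(round(cfad(cs[0], cs[1], cf[0], cf[1]), 3))  # 0.667
```

Because the commonsense and counterfactual items are paired, the gap isolates prior-driven failure from overall task difficulty.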
Theme: Evaluation methodology is brittle (judges, prompts, temperature, decomposition)
- Why it matters: If evaluation pipelines are unstable, we can’t trust progress claims or safety regressions—especially when LLM judges are used for training signals and benchmarking.
- Representative papers:
- Rethinking Atomic Decomposition for LLM Judges: A Prompt-Controlled Study of Reference-Grounded QA Evaluation
- The Necessity of Setting Temperature in LLM-as-a-Judge
- MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome
- Common approach:
- Prompt-controlled comparisons (match rubric richness/token budgets; multiple frozen prompt variants).
- Treat temperature as a causal treatment (AIPW causal estimates; repeated seeds).
- Expand evaluation beyond outcomes to process + factuality verification (process logs → process metrics; statement-level verification over web+attachments).
- Open questions / failure modes:
- Atomic decomposition isn’t automatically better; holistic rubrics can be more accurate and cheaper for incompleteness detection.
- Higher temperature increases parse errors and inconsistency, especially with CoT prompting.
- Process evaluation requires access to traces, limiting applicability to closed systems.
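A judge-stability check of the kind implied above can be sketched in a few lines. The judge call here is a stand-in lambda (swap in your real LLM-as-judge API): for each item the judge is sampled K times at a fixed temperature, and we report parse-error rate and verdict consistency.

```python
# Sketch of a judge-stability audit (the judge interface is an assumption;
# replace with your actual LLM-as-judge call). For each item we sample the
# judge K times and report parse-error rate and verdict consistency.
import random

def parse_verdict(raw: str):
    # Judges are typically asked for a fixed label; anything else is a parse error.
    return raw if raw in {"PASS", "FAIL"} else None

def stability(judge, items, k=5, seed=0):
    rng = random.Random(seed)
    parse_errors, consistent = 0, 0
    for item in items:
        verdicts = []
        for _ in range(k):
            v = parse_verdict(judge(item, rng))
            if v is None:
                parse_errors += 1
            else:
                verdicts.append(v)
        # Consistency: all successfully parsed verdicts agree.
        if verdicts and len(set(verdicts)) == 1:
            consistent += 1
    return parse_errors / (k * len(items)), consistent / len(items)

# Usage with a toy stochastic judge: higher "temperature" -> noisier output.
def toy_judge(item, rng, temperature=1.0):
    roll = rng.random() * temperature
    if roll > 0.9:
        return "hmm, maybe pass?"  # unparseable free text
    return "PASS" if roll < 0.6 else "FAIL"

per, cons = stability(lambda it, r: toy_judge(it, r, temperature=1.0), ["q1", "q2"], k=5)
print(per, cons)
```

Running this sweep across temperatures (and reporting both numbers alongside the headline score) is the cheap version of treating temperature as a causal treatment.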
Theme: Verification-centric agents and test-time control under budgets
- Why it matters: Long-horizon agents and reasoning systems fail via error propagation; budgeted compute must be spent on verification/repair and evidence acquisition, not just longer generations.
- Representative papers:
- Marco DeepResearch: Unlocking Efficient Deep Research Agents via Verification-Centric Design
- CoT2-Meta: Budgeted Metacognitive Control for Test-Time Reasoning
- Courtroom-Style Multi-Agent Debate with Progressive RAG and Role-Switching for Controversial Claim Verification
- AdaptToken: Entropy-based Adaptive Token Selection for MLLM Long Video Understanding
- Common approach:
- Add explicit verification loops (uniqueness checks in synthesis; verifier agents; discard-all resets).
- Use controllers to expand/prune/repair/stop/abstain under fixed budgets (UCB-style selection; process+outcome fusion).
- Use uncertainty/entropy signals to allocate compute (token-level entropy for early stopping; group-level entropy for long video).
- Open questions / failure modes:
- Verification-heavy pipelines can be extremely token-expensive (PROCLAIM ~210.9K tokens/claim).
- Controllers depend on oracle quality; misjudgment can cause over-pruning or wasted repair.
- Training-free uncertainty heuristics may be threshold-sensitive across domains/models.
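The budgeted-controller pattern above can be sketched as a greedy loop. This is a deliberately minimal sketch (the action set, scoring, and thresholds are illustrative assumptions; CoT2-Meta's controller and meta-state are richer): every model call is charged to one budget, candidates are pruned to bound repair cost, and the controller stops early or abstains.

```python
# Minimal sketch of a budgeted test-time controller (action set and scoring
# are illustrative assumptions, not CoT2-Meta's actual design). Every call
# -- expansion or evaluation -- is charged to a single budget.

def controlled_reasoning(expand, score, budget: int, threshold: float = 0.8):
    """Greedily expand/prune candidates until the budget is spent or a
    candidate clears the confidence threshold; abstain if none is strong."""
    candidates = []
    while budget > 0:
        cand = expand()      # one model call
        budget -= 1
        s = score(cand)      # fused process+outcome score (stubbed)
        if s >= threshold:
            return cand, s   # stop early: good enough
        candidates.append((s, cand))
        # Prune: keep only the top-2 candidates to bound repair cost.
        candidates = sorted(candidates, reverse=True)[:2]
    if candidates:
        best_s, best = max(candidates)
        return (best, best_s) if best_s >= 0.5 else (None, best_s)  # abstain if weak
    return None, 0.0

# Usage with stubbed expand/score: the third sample clears the threshold.
samples = iter([("ans A", 0.4), ("ans B", 0.6), ("ans C", 0.9)])
table = {}
def expand():
    a, s = next(samples)
    table[a] = s
    return a
print(controlled_reasoning(expand, lambda c: table[c], budget=5))  # ('ans C', 0.9)
```

The key design point carried over from the papers: the stop and abstain branches are first-class actions competing for the same budget as expansion, rather than afterthoughts.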
Theme: Security & privacy threats across modalities and data pipelines
- Why it matters: Multimodal systems expand attack surfaces (encoders, fusion, instruction following), and new ML workflows (dataset condensation) create concentrated supply-chain artifacts that are attractive to attackers.
- Representative papers:
- Adversarial Attacks on Multimodal Large Language Models: A Comprehensive Survey
- InkDrop: Invisible Backdoor Attacks Against Dataset Condensation
- Membership Inference Attacks against Large Audio Language Models
- Who Wrote the Book? Detecting and Attributing LLM Ghostwriters
- Common approach:
- Organize threats by attacker objective (integrity, jailbreak, control/injection, poisoning/backdoors) and attacker knowledge.
- Demonstrate stealth + effectiveness tradeoffs (InkDrop multi-objective loss with LPIPS + label/feature alignment).
- Audit privacy with confound controls (audio MIA blind baselines to detect dataset separability; modality disentanglement).
- Open questions / failure modes:
- Literature skew: attacks are vision-dominant (survey: 58/65 empirically characterized works involve image/video).
- MIA can be spuriously inflated by dataset artifacts (audio acoustic features can separate train/test near-perfectly).
- Long-form attribution methods (TRACE) are not tested against obfuscation/adversarial paraphrase.
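The blind-baseline idea above can be made concrete with a one-feature sanity check (illustrative only; the paper's exact protocol differs). If a classifier using purely model-free features already separates members from non-members, apparent MIA success may be dataset shift rather than memorization.

```python
# Sketch of a "blind baseline" sanity check for membership-inference audits
# (an illustration of the idea, not the paper's protocol). We compute the
# AUC of a threshold rule on a single model-free feature, e.g. clip
# duration: near 0.5 means a matched split, near 1.0 means a confounded one.

def blind_baseline_auc(member_feats, nonmember_feats):
    """Rank-based AUC (Mann-Whitney) of one scalar feature."""
    pairs = [(m, n) for m in member_feats for n in nonmember_feats]
    wins = sum(m > n for m, n in pairs) + 0.5 * sum(m == n for m, n in pairs)
    return wins / len(pairs)

# Usage: member clips are systematically longer, so the split is separable
# without ever querying the model -- any MIA result on it is suspect.
members = [10.2, 9.8, 11.0, 10.5]   # e.g. clip durations (s)
nonmembers = [3.1, 2.9, 3.5, 3.0]
print(blind_baseline_auc(members, nonmembers))  # 1.0
```

Running this per feature family (metadata, text, acoustics) before the actual attack is the cheap gate the theme recommends.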
3) Technical synthesis
- Stage decomposition is recurring: kill-chain canaries decompose injection propagation; MiroEval decomposes outcome into synthesis/factuality/process; CoT2-Meta decomposes reasoning into expand/prune/repair/stop; PRCO decomposes training into Observer vs Solver roles.
- Uncertainty/entropy is becoming a control signal across training and inference: ERPO uses token entropy to gate RL updates at “critical decision pivots”; AdaptToken uses response entropy for group allocation + early stopping; CoT2-Meta uses process/outcome scoring to decide actions.
- “Priors vs evidence” shows up in multiple guises: commonsense-driven hallucination (paired counterfactuals), choice-induced priors (MCQA), and scaffold prompts (clinical VLM) all indicate that text context can dominate even when images are present.
- Benchmark formats can change failure rates materially: CDH-Bench finds MC format worsens prior collapse vs binary QA; prompt-injection ASR ranges 0–100% for the same model depending on surface; judge temperature shifts consistency and parseability.
- Verification is being operationalized as both training data hygiene and inference-time scaling: Marco’s uniqueness verification in synthesis and verifier-guided test-time scaling; PROCLAIM’s progressive retrieval + judicial panel; MiroEval’s agentic factuality verifier.
- Defense evaluation must match threat models: kill-chain results show active defenses can fail catastrophically under surface mismatch; audio MIA shows privacy audits must control distribution shift; multimodal attack survey emphasizes attacker-knowledge regimes (black-box dominates).
- Compute budgets are treated explicitly: CoT2-Meta counts all calls into a budget C; AdaptToken-Lite halves inference time via early stopping; ResAdapt targets pre-encoder pixel budget to trade spatial for temporal evidence.
- Instrumentation + traceability is expanding beyond developers to end users: AgentTrace improves user comprehension and anomaly detection; GrantBox logs outbound requests/authorization parameters; NL/PL taxonomy enables taint/slicing decisions.
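The entropy-as-control-signal pattern recurring above reduces to a small computation. A minimal sketch (ERPO and AdaptToken use their own formulations; the threshold here is an arbitrary illustration): Shannon entropy over a next-token distribution flags "decision pivots" where extra compute or a gated update is worth spending.

```python
# Entropy as a control signal (illustrative; ERPO/AdaptToken define their
# own gating rules, and the threshold here is an arbitrary assumption).
import math

def entropy(probs):
    """Shannon entropy in nats of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def is_pivot(probs, threshold=1.0):
    """High-entropy positions are candidates for extra compute or RL credit."""
    return entropy(probs) >= threshold

print(is_pivot([0.97, 0.01, 0.01, 0.01]))  # False: confident continuation
print(is_pivot([0.4, 0.3, 0.2, 0.1]))      # True: genuine decision pivot
```

The same scalar drives both uses noted above: gating which tokens receive RL updates and deciding when to stop or reallocate inference compute.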
4) Top 5 papers (with “why now”)
1) Kill-Chain Canaries: Stage-Level Tracking of Prompt Injection Across Attack Surfaces and Model Safety Tiers
- Introduces stage-level prompt injection metrics (EXPOSED/PERSISTED/RELAYED/EXECUTED) that explain where defenses work.
- Shows exposure is universal (100%); safety depends on downstream propagation.
- Demonstrates surface dependence: same model can be 0% vs 100% ASR depending on injection surface (e.g., DeepSeek memory_poison vs tool_poison/propagation).
- Skepticism: modest per-cell sample sizes and synthetic payloads; root cause of model differences (e.g., Claude write_memory filtering) not isolated.
2) CDH-Bench: A Commonsense-Driven Hallucination Benchmark for Evaluating Visual Fidelity in Vision-Language Models
- Defines commonsense-driven hallucination and provides a paired counterfactual design that isolates prior collapse.
- Reports systematic gaps: mean CFAD 16.39% (QA) / 25.20% (MC); 7/8 models degrade on counterfactuals.
- Highlights that MC format amplifies prior-driven errors, relevant to many real product UIs.
- Skepticism: synthetic images and limited scale (300 pairs); broader real-world anomaly coverage not shown.
3) Evaluating Privilege Usage of Agents on Real-World Tools
- Provides GrantBox, integrating 10 real MCP servers / 122 privilege-sensitive tools in containers with logging.
- Finds very high prompt-injection success: avg ASR 90.55% (ReAct) and 79.05% (Plan-and-Execute) in their setup.
- Makes privilege misuse measurable via real outbound request logging rather than toy tools.
- Skepticism: current evaluation focuses on “native” agent behavior without defenses; environment setup complexity may affect reproducibility.
4) CoT2-Meta: Budgeted Metacognitive Control for Test-Time Reasoning
- Training-free controller that allocates budget across expand/prune/repair/stop/abstain using fused process+outcome scoring.
- Reports consistent gains under matched budgets across 15 benchmarks and improved calibration (e.g., ECE ~0.035).
- Provides a concrete failure taxonomy (search-not-converged, evaluator misjudgment, over-pruning) useful for debugging.
- Skepticism: depends on quality of online process evaluation signals; hand-designed controller/meta-state may not generalize.
5) The Scaffold Effect: How Prompt Framing Drives Apparent Multimodal Gains in Clinical VLM Evaluation
- Shows that simply mentioning MRI availability can drive most of the “multimodal” gain (~70–80% of confidence shift), even when images add little signal.
- Uses strong ablations (false-modality, swap images) and expert trace review to argue gains are often not evidence-based.
- Demonstrates alignment intervention (MPO) can suppress MRI-referencing but collapses performance, highlighting difficulty of fixing the root cause.
- Skepticism: limited to two cohorts and open-weight models; scaffold estimate derived from the most responsive models.
5) Practical next steps
- Adopt stage-level injection telemetry in your agent stack (canary tokens + EXPOSED/PERSISTED/RELAYED/EXECUTED logging) and require defenses to report where they stop propagation, not just ASR.
- Run multi-surface prompt-injection evaluations (memory poisoning, tool output poisoning, relay propagation, permission escalation) before shipping; treat “surface mismatch” as a primary failure mode.
- Add privilege-usage auditing: log tool calls + authorization parameters (GrantBox-style) and build regression tests for “least privilege” violations under adversarial prompts.
- Harden multimodal grounding evaluations by adding paired counterfactual-vs-commonsense cases (CDH-style) and prompt scaffolding ablations (e.g., “modality available” preambles, swap images).
- For scientific/MCQA products, test choice-induced priors by comparing multimodal vs text-only logits; consider SCICON-style subtraction when visual evidence margin is strong, and measure harmed-case rate when priors are correct.
- Stabilize LLM-judge pipelines: fix and report temperature; measure parse error and consistency across seeds; consider matched holistic rubrics when completeness/partial-support is key (and track token cost).
- Instrument process quality for research agents (MiroEval-style): collect process logs, compute process metrics, and correlate with factuality/outcome to detect “good-looking reports from bad processes.”
- Treat condensed/synthetic datasets as supply-chain artifacts: add backdoor scanning and provenance controls for dataset condensation outputs; assume stealthy triggers can be imperceptible (InkDrop).
- For audio privacy audits, always run blind baselines (metadata/text/acoustic) and distribution-matched splits before concluding memorization-based leakage.
- Budgeted inference: if you’re scaling test-time compute, prefer controllers (prune/repair/stop) and uncertainty-guided allocation (entropy-based early stopping) over uniform “more sampling,” and track calibration alongside accuracy.
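The privilege-auditing step above can start as a tiny regression check. GrantBox-style logging is the inspiration, but the log and allowlist shapes below are assumptions, not its API: given a per-task allowlist of (tool, scope) pairs, flag any logged call that exceeded it, and run this over traces from adversarial-prompt suites.

```python
# Illustrative least-privilege regression check (GrantBox-style logging is
# the inspiration; the record and allowlist shapes here are assumptions,
# not its API). Flag any logged call whose (tool, scope) is not allowlisted.

def privilege_violations(call_log, allowlist):
    """Return logged calls whose (tool, scope) exceeds the task's allowlist."""
    return [c for c in call_log if (c["tool"], c["scope"]) not in allowlist]

# Usage: the task only needs read access to email, but an injected prompt
# drove the agent to send mail and delete a file.
allow = {("email", "read")}
log = [
    {"tool": "email", "scope": "read",   "args": "inbox"},
    {"tool": "email", "scope": "send",   "args": "to=attacker@example.com"},
    {"tool": "fs",    "scope": "delete", "args": "/tmp/report"},
]
bad = privilege_violations(log, allow)
print([(c["tool"], c["scope"]) for c in bad])  # [('email', 'send'), ('fs', 'delete')]
```

Wiring this assertion into CI against a fixed suite of adversarial prompts turns "least privilege" from a design goal into a regression test.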
Generated from per-paper analyses; no external browsing.
