Daily AI Paper Report (2026-04-01)
Run stats
- Candidates: 223
- Selected: 30
- Deepread completed: 30
- Window (UTC): 2026-03-30T00:00:00Z → 2026-03-31T00:00:00Z (arxiv_announce, expanded=0)
1) Selected papers
| arXiv ID | Title / Links | Categories | Score | Why | Tags |
|---|---|---|---|---|---|
| 2603.28013 | Kill-Chain Canaries: Stage-Level Tracking of Prompt Injection Across Attack Surfaces and Model Safety Tiers | cs.CR, cs.AI, cs.LG | 95 | Stage-level prompt-injection tracking w/ canaries across models, surfaces, and safety tiers | prompt-injection, agent-security, evaluation, canary-tokens, kill-chain, frontier-models |
| 2603.28063 | Reward Hacking as Equilibrium under Finite Evaluation | cs.AI, cs.GT | 95 | Formal result: reward hacking emerges under finite evaluation; computable distortion index. | reward-hacking, principal-agent, evaluation, alignment-theory, RLHF, DPO |
| 2603.28650 | Information-Theoretic Limits of Safety Verification for Self-Improving Systems | cs.LG, cs.AI, stat.ML | 95 | Formal impossibility bounds for safety gates in self-improving systems; high AI safety relevance | AI safety, self-improvement, verification, risk bounds, theory, TPR/FPR, impossibility |
| 2603.28166 | Evaluating Privilege Usage of Agents on Real-World Tools | cs.CR, cs.AI | 92 | GrantBox sandbox evaluates real-world tool privilege usage—core risk for autonomous agents | agent-security, tool-use, privilege, sandbox, benchmark, real-world-tools |
| 2603.28345 | Crossing the NL/PL Divide: Information Flow Analysis Across the NL/PL Boundary in LLM-Integrated Code | cs.SE, cs.AI | 92 | Bridges NL/PL boundary for info-flow/taint across LLM calls; key for LLM app security. | program-analysis, information-flow, taint-analysis, LLM-security, prompting, software-engineering |
| 2603.28551 | "What Did It Actually Do?": Understanding Risk Awareness and Traceability for Computer-Use Agents | cs.CR, cs.ET, cs.HC, cs.MA | 90 | Empirical study + corpus on computer-use agent risk awareness and post-hoc auditability. | computer-use-agents, auditability, traceability, user-safety, agent-security, HCI |
| 2603.28204 | ERPO: Token-Level Entropy-Regulated Policy Optimization for Large Reasoning Models | cs.LG, cs.AI | 90 | Token-level entropy regulation for RLVR/GRPO improves credit assignment in reasoning chains | LLM reasoning, RLVR, GRPO, post-training, credit assignment, entropy, optimization |
| 2603.28054 | Who Wrote the Book? Detecting and Attributing LLM Ghostwriters | cs.CL | 90 | GhostWriteBench + robust OOD LLM authorship attribution; practical for misuse detection. | authorship-attribution, model-fingerprinting, misuse-detection, dataset, robustness, OOD |
| 2603.28407 | MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome | cs.AI, cs.CL | 89 | MiroEval benchmarks multimodal deep-research agents on process + outcome with refreshable tasks | agents, evaluation, deep-research, multimodal, process-metrics, benchmark |
| 2603.28376 | Marco DeepResearch: Unlocking Efficient Deep Research Agents via Verification-Centric Design | cs.CL, cs.AI | 88 | Verification-centric deep research agent design across synthesis/trajectories/test-time scaling. | deep-research-agents, verification, tool-use, long-horizon, test-time-scaling, RAG |
| 2603.27982 | CDH-Bench: A Commonsense-Driven Hallucination Benchmark for Evaluating Visual Fidelity in Vision-Language Models | cs.CV, cs.AI, cs.CL | 88 | New benchmark for commonsense-driven hallucination when vision conflicts with priors | VLM, hallucination, evaluation, robustness, benchmarks, visual grounding, reliability |
| 2603.28569 | CirrusBench: Evaluating LLM-based Agents Beyond Correctness in Real-World Cloud Service Environments | cs.LG, cs.AI, cs.IR, cs.PF | 87 | Real cloud-ticket agent benchmark measuring robustness and resolution efficiency beyond accuracy | agents, evaluation, real-world, customer-support, long-horizon, reliability |
| 2603.28387 | The Scaffold Effect: How Prompt Framing Drives Apparent Multimodal Gains in Clinical VLM Evaluation | cs.AI, cs.LG | 86 | Shows prompt framing can fake multimodal clinical gains; important eval artifact warning. | evaluation, prompting, multimodal, VLM, clinical-AI, spurious-cues |
| 2603.28430 | IsoQuant: Hardware-Aligned SO(4) Isoclinic Rotations for LLM KV Cache Compression | cs.LG, cs.CL | 86 | Hardware-aligned SO(4) rotations for low-bit KV-cache compression; practical LLM efficiency gain | LLM efficiency, KV cache, compression, quantization, inference, systems, long-context |
| 2603.28304 | The Necessity of Setting Temperature in LLM-as-a-Judge | cs.CL | 86 | Shows temperature materially affects LLM-as-judge reliability; key for eval validity. | LLM-as-a-judge, evaluation, reproducibility, temperature, meta-eval, benchmarking |
| 2603.28618 | Seeing with You: Perception-Reasoning Coevolution for Multimodal Reasoning | cs.AI | 85 | Dual-role RLVR separates perception vs reasoning credit; targets evidence extraction failures | multimodal, RLVR, credit assignment, VLM, reasoning, perception, reliability |
| 2603.27918 | Adversarial Attacks on Multimodal Large Language Models: A Comprehensive Survey | cs.CR, cs.AI | 84 | Comprehensive survey of adversarial attacks on MLLMs with taxonomy across modalities/settings | multimodal, adversarial-attacks, survey, security, jailbreaks, threat-models |
| 2603.28092 | InkDrop: Invisible Backdoor Attacks Against Dataset Condensation | cs.LG | 84 | Stealthy backdoor attack on dataset condensation; highlights new data-poisoning surface. | backdoors, data-poisoning, dataset-condensation, adversarial-ML, security |
| 2603.28135 | CoT2-Meta: Budgeted Metacognitive Control for Test-Time Reasoning | cs.AI | 84 | Training-free metacognitive control for test-time reasoning: prune/repair/abstain under budget. | test-time-compute, reasoning, abstention, search, inference, reliability |
| 2603.28301 | LIBERO-Para: A Diagnostic Benchmark and Metrics for Paraphrase Robustness in VLA Models | cs.LG | 83 | Benchmark for paraphrase robustness in VLA robots; large drops under simple synonyms. | robustness, paraphrase, VLA, robotics, benchmark, instruction-following |
| 2603.28610 | ResAdapt: Adaptive Resolution for Efficient Multimodal Reasoning | cs.CV, cs.AI, cs.CL | 83 | Learns input-side adaptive resolution via bandits to cut visual tokens while keeping reasoning | multimodal, efficiency, adaptive compute, token budget, context length, bandits, MLLM |
| 2603.28005 | Rethinking Atomic Decomposition for LLM Judges: A Prompt-Controlled Study of Reference-Grounded QA Evaluation | cs.CL | 82 | Prompt-controlled test questions whether atomic decomposition truly helps LLM judges in QA eval | LLM-judges, evaluation, factuality, reference-grounding, prompting, methodology |
| 2603.28605 | Unsafe2Safe: Controllable Image Anonymization for Downstream Utility | cs.CV, cs.CY, cs.LG | 82 | Automated privacy-risk detection + diffusion editing to anonymize images while preserving utility. | privacy, data-sanitization, diffusion-editing, dataset-curation, VLM |
| 2603.28696 | AdaptToken: Entropy-based Adaptive Token Selection for MLLM Long Video Understanding | cs.CV, cs.AI | 82 | Entropy-guided token budgeting for long-video MLLMs; principled stop/allocate mechanism. | multimodal, long-context, video-understanding, efficiency, uncertainty, token-selection |
| 2603.28488 | Courtroom-Style Multi-Agent Debate with Progressive RAG and Role-Switching for Controversial Claim Verification | cs.CL, cs.AI, cs.MA | 80 | Structured multi-agent debate + progressive RAG for controversial claim verification robustness | claim-verification, RAG, multi-agent, debate, hallucinations, calibration |
| 2603.28476 | With a Little Help From My Friends: Collective Manipulation in Risk-Controlling Recommender Systems | cs.IR, cs.LG, cs.SI | 80 | Analyzes coordinated user manipulation against risk-controlling recommenders with safety guarantees. | recommenders, adversarial-manipulation, conformal-risk-control, safety, social-computing |
| 2603.28622 | Trust-Aware Routing for Distributed Generative AI Inference at the Edge | cs.DC, cs.AI, cs.NI | 80 | Trust-aware routing for distributed generative inference; robustness to misbehaving edge peers | agentic systems, distributed inference, trust, robustness, security, edge AI, systems |
| 2603.28026 | When Choices Become Priors: Contrastive Decoding for Scientific Figure Multiple-Choice QA | cs.AI | 80 | Training-free contrastive decoding to reduce answer-choice priors; improves figure grounding. | multimodal-eval, grounding, decoding, bias, scientific-qa, robustness |
| 2603.28378 | Membership Inference Attacks against Large Audio Language Models | cs.SD, cs.AI | 79 | First systematic MIA study for audio LMs; shows confounds and proposes distribution-matched eval | privacy, membership-inference, audio, evaluation, dataset-shift, leakage |
| 2603.28662 | AMIGO: Agentic Multi-Image Grounding Oracle Benchmark | cs.LG, cs.AI | 78 | AMIGO tests long-horizon multi-image grounding via constrained questioning under uncertainty | multimodal-agents, benchmark, interactive-eval, uncertainty, grounding, long-horizon |
AI Paper Insight Brief
2026-04-01
0) Executive takeaways (read this first)
- Evaluation is the new attack surface: multiple papers show that how we evaluate (prompt framing, temperature, decomposition style, MC vs QA format) can dominate measured “capability” and hide grounding failures—so benchmark design is now a first-class safety concern.
- Agent security failures are pipeline- and surface-dependent, not model-dependent: prompt injection and privilege misuse vary wildly by injection surface and agent stage; outcome-only ASR is insufficient to choose defenses or architectures.
- “Priors beat pixels” remains a core multimodal failure mode: VLMs often normalize anomalies to commonsense (CDH), follow answer-choice priors in scientific MCQA, or respond to “MRI available” scaffolds without using images—suggesting grounding interventions must explicitly counter prior dominance.
- Verification-centric designs are emerging as a unifying reliability lever: from deep-research agents (verification at synthesis/trajectory/inference) to stage-level canaries and process-centric benchmarks, verification is being operationalized as instrumentation + control, not just post-hoc scoring.
- Training-time and data-supply-chain risks are sharpening: stealthy backdoors in dataset condensation (InkDrop) and broad MLLM attack taxonomies highlight that multimodal systems inherit vulnerabilities across encoders, fusion, and instruction-following—often transferable in black-box settings.
- Test-time compute is shifting from “more tokens” to “better control”: metacognitive controllers (CoT2-Meta) and long-video token/resolution allocators (AdaptToken, ResAdapt) show consistent gains under fixed budgets by allocating compute to high-value steps/frames.
2) Key themes (clusters)
Theme: Agent prompt-injection & privilege misuse need stage/surface localization
- Why it matters: Agent deployments fail in different places (memory write, relay, tool args) and via different surfaces (tool outputs vs memory poisoning). Without stage-level instrumentation, defenses can look effective while failing in realistic multi-surface pipelines.
- Representative papers:
- Kill-Chain Canaries: Stage-Level Tracking of Prompt Injection Across Attack Surfaces and Model Safety Tiers
- Evaluating Privilege Usage of Agents on Real-World Tools
- Crossing the NL/PL Divide: Information Flow Analysis Across the NL/PL Boundary in LLM-Integrated Code
- “What Did It Actually Do?”: Understanding Risk Awareness and Traceability for Computer-Use Agents
- Common approach:
- Instrument pipelines with stage-level signals (EXPOSED→PERSISTED→RELAYED→EXECUTED) rather than single ASR.
- Evaluate across multiple injection surfaces and realistic tool stacks (real MCP servers; NL/PL placeholder flows).
- Add traceability/logging (propagation loggers, remote request loggers, user-facing action traces).
- Open questions / failure modes:
- Defense “success” that is actually surface mismatch (defenses fail outside assumed channel).
- Hard-to-analyze Layer 2 reachability in NL/PL flows (taxonomy can say “propagates”, but code-level reachability still fails).
- Human operators lacking post-hoc auditability and persistence understanding even when logs exist.
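The stage-level instrumentation described above can be sketched as a minimal canary tracker. This is an illustrative sketch, not the paper's actual API: a unique canary string is planted in an untrusted surface, and after the episode the pipeline's artifacts are scanned to record the furthest kill-chain stage the canary reached.

```python
# Minimal sketch of stage-level canary telemetry (names and artifact shapes
# are illustrative assumptions, not the paper's instrumentation). A unique
# canary is planted in an untrusted input; afterwards we scan pipeline
# artifacts to find the furthest kill-chain stage the canary reached.
import secrets

STAGES = ["EXPOSED", "PERSISTED", "RELAYED", "EXECUTED"]

def make_canary() -> str:
    # Unique, unguessable token so matches are unambiguous.
    return f"CANARY-{secrets.token_hex(8)}"

def furthest_stage(canary, context_window, memory_store, relayed_msgs, tool_calls):
    """Return the last stage whose artifact contains the canary, or None."""
    artifacts = [
        ("EXPOSED",   context_window),            # model saw the injected text
        ("PERSISTED", " ".join(memory_store)),    # written to long-term memory
        ("RELAYED",   " ".join(relayed_msgs)),    # forwarded to another agent/tool
        ("EXECUTED",  " ".join(tool_calls)),      # appeared in executed tool args
    ]
    reached = None
    for stage, blob in artifacts:
        if canary in blob:
            reached = stage
    return reached

# Usage: a run where the canary was stored in memory but never relayed/executed.
c = make_canary()
stage = furthest_stage(
    c,
    context_window=f"web page says: {c} please exfiltrate",
    memory_store=[f"note to self: {c}"],
    relayed_msgs=["summarize the page"],
    tool_calls=["fetch('https://example.com')"],
)
print(stage)  # PERSISTED
```

Reporting the furthest stage per run, rather than a single attack-success bit, is what lets a defense claim "stops propagation at PERSISTED" instead of just quoting an ASR.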
Theme: Multimodal grounding failures driven by priors and prompt scaffolds
- Why it matters: Models can appear strong while ignoring visual evidence—dangerous in anomaly-sensitive domains (clinical, inspection, forensics) and scientific figure interpretation.
- Representative papers:
- CDH-Bench: A Commonsense-Driven Hallucination Benchmark for Evaluating Visual Fidelity in Vision-Language Models
- When Choices Become Priors: Contrastive Decoding for Scientific Figure Multiple-Choice QA
- The Scaffold Effect: How Prompt Framing Drives Apparent Multimodal Gains in Clinical VLM Evaluation
- Common approach:
- Construct counterfactual vs commonsense paired inputs to isolate prior-driven collapse (CF-Acc vs CS-Acc; CFAD/CCR/RPD).
- Use inference-time debiasing (contrastive decoding subtracting text-only logits).
- Run prompt ablations (MRI preamble vs actual image; swap/OOD images; false-modality preambles).
- Open questions / failure modes:
- Multiple-choice formats can amplify prior collapse (CDH-Bench shows larger CFAD in MC).
- Debiasing can hurt when the text prior is correct (SCICON harmed cases when visual evidence margin is weak/negative).
- Prompt scaffolds can create illusory multimodal gains that alignment fine-tuning (MPO) may suppress by collapsing performance rather than fixing grounding.
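The paired counterfactual design above reduces to a simple accuracy gap. A minimal sketch of the idea (the benchmark's exact CFAD/CCR/RPD definitions may differ; this just shows the paired-accuracy comparison): each item pairs a commonsense input with a counterfactual twin whose visual evidence contradicts the prior.

```python
# Illustrative counterfactual-accuracy-drop computation (CFAD-style gap;
# the benchmark's exact metric definitions are assumptions here). A large
# gap means the model answers from priors rather than pixels.

def accuracy(preds, golds):
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def cfad(cs_preds, cs_golds, cf_preds, cf_golds):
    """Commonsense accuracy minus counterfactual accuracy (prior-collapse gap)."""
    return accuracy(cs_preds, cs_golds) - accuracy(cf_preds, cf_golds)

# Usage: the model aces commonsense cases but reverts to priors on counterfactuals.
cs = (["red", "round", "up"], ["red", "round", "up"])        # 3/3 correct
cf = (["red", "round", "down"], ["blue", "square", "down"])  # 1/3 correct
print(round(cfad(cs[0], cs[1], cf[0], cf[1]), 3))  # 0.667
```

Because the commonsense and counterfactual items are paired, the gap isolates prior-driven failure from overall task difficulty.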
Theme: Evaluation methodology is brittle (judges, prompts, temperature, decomposition)
- Why it matters: If evaluation pipelines are unstable, we can’t trust progress claims or safety regressions—especially when LLM judges are used for training signals and benchmarking.
- Representative papers:
- Rethinking Atomic Decomposition for LLM Judges: A Prompt-Controlled Study of Reference-Grounded QA Evaluation
- The Necessity of Setting Temperature in LLM-as-a-Judge
- MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome
- Common approach:
- Prompt-controlled comparisons (match rubric richness/token budgets; multiple frozen prompt variants).
- Treat temperature as a causal treatment (AIPW causal estimates; repeated seeds).
- Expand evaluation beyond outcomes to process + factuality verification (process logs → process metrics; statement-level verification over web+attachments).
- Open questions / failure modes:
- Atomic decomposition isn’t automatically better; holistic rubrics can be more accurate and cheaper for incompleteness detection.
- Higher temperature increases parse errors and inconsistency, especially with CoT prompting.
- Process evaluation requires access to traces, limiting applicability to closed systems.
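A judge-stability check of the kind implied above can be sketched in a few lines. The judge call here is a stand-in lambda (swap in your real LLM-as-judge API): for each item the judge is sampled K times at a fixed temperature, and we report parse-error rate and verdict consistency.

```python
# Sketch of a judge-stability audit (the judge interface is an assumption;
# replace with your actual LLM-as-judge call). For each item we sample the
# judge K times and report parse-error rate and verdict consistency.
import random

def parse_verdict(raw: str):
    # Judges are typically asked for a fixed label; anything else is a parse error.
    return raw if raw in {"PASS", "FAIL"} else None

def stability(judge, items, k=5, seed=0):
    rng = random.Random(seed)
    parse_errors, consistent = 0, 0
    for item in items:
        verdicts = []
        for _ in range(k):
            v = parse_verdict(judge(item, rng))
            if v is None:
                parse_errors += 1
            else:
                verdicts.append(v)
        # Consistency: all successfully parsed verdicts agree.
        if verdicts and len(set(verdicts)) == 1:
            consistent += 1
    return parse_errors / (k * len(items)), consistent / len(items)

# Usage with a toy stochastic judge: higher "temperature" -> noisier output.
def toy_judge(item, rng, temperature=1.0):
    roll = rng.random() * temperature
    if roll > 0.9:
        return "hmm, maybe pass?"  # unparseable free text
    return "PASS" if roll < 0.6 else "FAIL"

per, cons = stability(lambda it, r: toy_judge(it, r, temperature=1.0), ["q1", "q2"], k=5)
print(per, cons)
```

Running this sweep across temperatures (and reporting both numbers alongside the headline score) is the cheap version of treating temperature as a causal treatment.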
Theme: Verification-centric agents and test-time control under budgets
- Why it matters: Long-horizon agents and reasoning systems fail via error propagation; budgeted compute must be spent on verification/repair and evidence acquisition, not just longer generations.
- Representative papers:
- Marco DeepResearch: Unlocking Efficient Deep Research Agents via Verification-Centric Design
- CoT2-Meta: Budgeted Metacognitive Control for Test-Time Reasoning
- Courtroom-Style Multi-Agent Debate with Progressive RAG and Role-Switching for Controversial Claim Verification
- AdaptToken: Entropy-based Adaptive Token Selection for MLLM Long Video Understanding
- Common approach:
- Add explicit verification loops (uniqueness checks in synthesis; verifier agents; discard-all resets).
- Use controllers to expand/prune/repair/stop/abstain under fixed budgets (UCB-style selection; process+outcome fusion).
- Use uncertainty/entropy signals to allocate compute (token-level entropy for early stopping; group-level entropy for long video).
- Open questions / failure modes:
- Verification-heavy pipelines can be extremely token-expensive (PROCLAIM ~210.9K tokens/claim).
- Controllers depend on oracle quality; misjudgment can cause over-pruning or wasted repair.
- Training-free uncertainty heuristics may be threshold-sensitive across domains/models.
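The budgeted-controller pattern above can be sketched as a greedy loop. This is a deliberately minimal sketch (the action set, scoring, and thresholds are illustrative assumptions; CoT2-Meta's controller and meta-state are richer): every model call is charged to one budget, candidates are pruned to bound repair cost, and the controller stops early or abstains.

```python
# Minimal sketch of a budgeted test-time controller (action set and scoring
# are illustrative assumptions, not CoT2-Meta's actual design). Every call
# -- expansion or evaluation -- is charged to a single budget.

def controlled_reasoning(expand, score, budget: int, threshold: float = 0.8):
    """Greedily expand/prune candidates until the budget is spent or a
    candidate clears the confidence threshold; abstain if none is strong."""
    candidates = []
    while budget > 0:
        cand = expand()      # one model call
        budget -= 1
        s = score(cand)      # fused process+outcome score (stubbed)
        if s >= threshold:
            return cand, s   # stop early: good enough
        candidates.append((s, cand))
        # Prune: keep only the top-2 candidates to bound repair cost.
        candidates = sorted(candidates, reverse=True)[:2]
    if candidates:
        best_s, best = max(candidates)
        return (best, best_s) if best_s >= 0.5 else (None, best_s)  # abstain if weak
    return None, 0.0

# Usage with stubbed expand/score: the third sample clears the threshold.
samples = iter([("ans A", 0.4), ("ans B", 0.6), ("ans C", 0.9)])
table = {}
def expand():
    a, s = next(samples)
    table[a] = s
    return a
print(controlled_reasoning(expand, lambda c: table[c], budget=5))  # ('ans C', 0.9)
```

The key design point carried over from the papers: the stop and abstain branches are first-class actions competing for the same budget as expansion, rather than afterthoughts.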
Theme: Security & privacy threats across modalities and data pipelines
- Why it matters: Multimodal systems expand attack surfaces (encoders, fusion, instruction following), and new ML workflows (dataset condensation) create concentrated supply-chain artifacts that are attractive to attackers.
- Representative papers:
- Adversarial Attacks on Multimodal Large Language Models: A Comprehensive Survey
- InkDrop: Invisible Backdoor Attacks Against Dataset Condensation
- Membership Inference Attacks against Large Audio Language Models
- Who Wrote the Book? Detecting and Attributing LLM Ghostwriters
- Common approach:
- Organize threats by attacker objective (integrity, jailbreak, control/injection, poisoning/backdoors) and attacker knowledge.
- Demonstrate stealth + effectiveness tradeoffs (InkDrop multi-objective loss with LPIPS + label/feature alignment).
- Audit privacy with confound controls (audio MIA blind baselines to detect dataset separability; modality disentanglement).
- Open questions / failure modes:
- Literature skew: attacks are vision-dominant (survey: 58/65 empirically characterized works involve image/video).
- MIA can be spuriously inflated by dataset artifacts (audio acoustic features can separate train/test near-perfectly).
- Long-form attribution methods (TRACE) are not tested against obfuscation/adversarial paraphrase.
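The blind-baseline idea above can be made concrete with a one-feature sanity check (illustrative only; the paper's exact protocol differs). If a classifier using purely model-free features already separates members from non-members, apparent MIA success may be dataset shift rather than memorization.

```python
# Sketch of a "blind baseline" sanity check for membership-inference audits
# (an illustration of the idea, not the paper's protocol). We compute the
# AUC of a threshold rule on a single model-free feature, e.g. clip
# duration: near 0.5 means a matched split, near 1.0 means a confounded one.

def blind_baseline_auc(member_feats, nonmember_feats):
    """Rank-based AUC (Mann-Whitney) of one scalar feature."""
    pairs = [(m, n) for m in member_feats for n in nonmember_feats]
    wins = sum(m > n for m, n in pairs) + 0.5 * sum(m == n for m, n in pairs)
    return wins / len(pairs)

# Usage: member clips are systematically longer, so the split is separable
# without ever querying the model -- any MIA result on it is suspect.
members = [10.2, 9.8, 11.0, 10.5]   # e.g. clip durations (s)
nonmembers = [3.1, 2.9, 3.5, 3.0]
print(blind_baseline_auc(members, nonmembers))  # 1.0
```

Running this per feature family (metadata, text, acoustics) before the actual attack is the cheap gate the theme recommends.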
3) Technical synthesis
- Stage decomposition is recurring: kill-chain canaries decompose injection propagation; MiroEval decomposes outcome into synthesis/factuality/process; CoT2-Meta decomposes reasoning into expand/prune/repair/stop; PRCO decomposes training into Observer vs Solver roles.
- Uncertainty/entropy is becoming a control signal across training and inference: ERPO uses token entropy to gate RL updates at “critical decision pivots”; AdaptToken uses response entropy for group allocation + early stopping; CoT2-Meta uses process/outcome scoring to decide actions.
- “Priors vs evidence” shows up in multiple guises: commonsense-driven hallucination (paired counterfactuals), choice-induced priors (MCQA), and scaffold prompts (clinical VLM) all indicate that text context can dominate even when images are present.
- Benchmark formats can change failure rates materially: CDH-Bench finds MC format worsens prior collapse vs binary QA; prompt-injection ASR ranges 0–100% for the same model depending on surface; judge temperature shifts consistency and parseability.
- Verification is being operationalized as both training data hygiene and inference-time scaling: Marco’s uniqueness verification in synthesis and verifier-guided test-time scaling; PROCLAIM’s progressive retrieval + judicial panel; MiroEval’s agentic factuality verifier.
- Defense evaluation must match threat models: kill-chain results show active defenses can fail catastrophically under surface mismatch; audio MIA shows privacy audits must control distribution shift; multimodal attack survey emphasizes attacker-knowledge regimes (black-box dominates).
- Compute budgets are treated explicitly: CoT2-Meta counts all calls into a budget C; AdaptToken-Lite halves inference time via early stopping; ResAdapt targets pre-encoder pixel budget to trade spatial for temporal evidence.
- Instrumentation + traceability is expanding beyond developers to end users: AgentTrace improves user comprehension and anomaly detection; GrantBox logs outbound requests/authorization parameters; NL/PL taxonomy enables taint/slicing decisions.
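The entropy-as-control-signal pattern recurring above reduces to a small computation. A minimal sketch (ERPO and AdaptToken use their own formulations; the threshold here is an arbitrary illustration): Shannon entropy over a next-token distribution flags "decision pivots" where extra compute or a gated update is worth spending.

```python
# Entropy as a control signal (illustrative; ERPO/AdaptToken define their
# own gating rules, and the threshold here is an arbitrary assumption).
import math

def entropy(probs):
    """Shannon entropy in nats of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def is_pivot(probs, threshold=1.0):
    """High-entropy positions are candidates for extra compute or RL credit."""
    return entropy(probs) >= threshold

print(is_pivot([0.97, 0.01, 0.01, 0.01]))  # False: confident continuation
print(is_pivot([0.4, 0.3, 0.2, 0.1]))      # True: genuine decision pivot
```

The same scalar drives both uses noted above: gating which tokens receive RL updates and deciding when to stop or reallocate inference compute.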
4) Top 5 papers (with “why now”)
1) Kill-Chain Canaries: Stage-Level Tracking of Prompt Injection Across Attack Surfaces and Model Safety Tiers
- Introduces stage-level prompt injection metrics (EXPOSED/PERSISTED/RELAYED/EXECUTED) that explain where defenses work.
- Shows exposure is universal (100%); safety depends on downstream propagation.
- Demonstrates surface dependence: same model can be 0% vs 100% ASR depending on injection surface (e.g., DeepSeek memory_poison vs tool_poison/propagation).
- Skepticism: modest per-cell sample sizes and synthetic payloads; root cause of model differences (e.g., Claude write_memory filtering) not isolated.
2) CDH-Bench: A Commonsense-Driven Hallucination Benchmark for Evaluating Visual Fidelity in Vision-Language Models
- Defines commonsense-driven hallucination and provides a paired counterfactual design that isolates prior collapse.
- Reports systematic gaps: mean CFAD 16.39% (QA) / 25.20% (MC); 7/8 models degrade on counterfactuals.
- Highlights that MC format amplifies prior-driven errors, relevant to many real product UIs.
- Skepticism: synthetic images and limited scale (300 pairs); broader real-world anomaly coverage not shown.
3) Evaluating Privilege Usage of Agents on Real-World Tools
- Provides GrantBox, integrating 10 real MCP servers / 122 privilege-sensitive tools in containers with logging.
- Finds very high prompt-injection success: avg ASR 90.55% (ReAct) and 79.05% (Plan-and-Execute) in their setup.
- Makes privilege misuse measurable via real outbound request logging rather than toy tools.
- Skepticism: current evaluation focuses on “native” agent behavior without defenses; environment setup complexity may affect reproducibility.
4) CoT2-Meta: Budgeted Metacognitive Control for Test-Time Reasoning
- Training-free controller that allocates budget across expand/prune/repair/stop/abstain using fused process+outcome scoring.
- Reports consistent gains under matched budgets across 15 benchmarks and improved calibration (e.g., ECE ~0.035).
- Provides a concrete failure taxonomy (search-not-converged, evaluator misjudgment, over-pruning) useful for debugging.
- Skepticism: depends on quality of online process evaluation signals; hand-designed controller/meta-state may not generalize.
5) The Scaffold Effect: How Prompt Framing Drives Apparent Multimodal Gains in Clinical VLM Evaluation
- Shows that simply mentioning MRI availability can drive most of the “multimodal” gain (~70–80% of confidence shift), even when images add little signal.
- Uses strong ablations (false-modality, swap images) and expert trace review to argue gains are often not evidence-based.
- Demonstrates alignment intervention (MPO) can suppress MRI-referencing but collapses performance, highlighting difficulty of fixing the root cause.
- Skepticism: limited to two cohorts and open-weight models; scaffold estimate derived from the most responsive models.
5) Practical next steps
- Adopt stage-level injection telemetry in your agent stack (canary tokens + EXPOSED/PERSISTED/RELAYED/EXECUTED logging) and require defenses to report where they stop propagation, not just ASR.
- Run multi-surface prompt-injection evaluations (memory poisoning, tool output poisoning, relay propagation, permission escalation) before shipping; treat “surface mismatch” as a primary failure mode.
- Add privilege-usage auditing: log tool calls + authorization parameters (GrantBox-style) and build regression tests for “least privilege” violations under adversarial prompts.
- Harden multimodal grounding evaluations by adding paired counterfactual-vs-commonsense cases (CDH-style) and prompt scaffolding ablations (e.g., “modality available” preambles, swap images).
- For scientific/MCQA products, test choice-induced priors by comparing multimodal vs text-only logits; consider SCICON-style subtraction when visual evidence margin is strong, and measure harmed-case rate when priors are correct.
- Stabilize LLM-judge pipelines: fix and report temperature; measure parse error and consistency across seeds; consider matched holistic rubrics when completeness/partial-support is key (and track token cost).
- Instrument process quality for research agents (MiroEval-style): collect process logs, compute process metrics, and correlate with factuality/outcome to detect “good-looking reports from bad processes.”
- Treat condensed/synthetic datasets as supply-chain artifacts: add backdoor scanning and provenance controls for dataset condensation outputs; assume stealthy triggers can be imperceptible (InkDrop).
- For audio privacy audits, always run blind baselines (metadata/text/acoustic) and distribution-matched splits before concluding memorization-based leakage.
- Budgeted inference: if you’re scaling test-time compute, prefer controllers (prune/repair/stop) and uncertainty-guided allocation (entropy-based early stopping) over uniform “more sampling,” and track calibration alongside accuracy.
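The privilege-auditing step above can start as a tiny regression check. GrantBox-style logging is the inspiration, but the log and allowlist shapes below are assumptions, not its API: given a per-task allowlist of (tool, scope) pairs, flag any logged call that exceeded it, and run this over traces from adversarial-prompt suites.

```python
# Illustrative least-privilege regression check (GrantBox-style logging is
# the inspiration; the record and allowlist shapes here are assumptions,
# not its API). Flag any logged call whose (tool, scope) is not allowlisted.

def privilege_violations(call_log, allowlist):
    """Return logged calls whose (tool, scope) exceeds the task's allowlist."""
    return [c for c in call_log if (c["tool"], c["scope"]) not in allowlist]

# Usage: the task only needs read access to email, but an injected prompt
# drove the agent to send mail and delete a file.
allow = {("email", "read")}
log = [
    {"tool": "email", "scope": "read",   "args": "inbox"},
    {"tool": "email", "scope": "send",   "args": "to=attacker@example.com"},
    {"tool": "fs",    "scope": "delete", "args": "/tmp/report"},
]
bad = privilege_violations(log, allow)
print([(c["tool"], c["scope"]) for c in bad])  # [('email', 'send'), ('fs', 'delete')]
```

Wiring this assertion into CI against a fixed suite of adversarial prompts turns "least privilege" from a design goal into a regression test.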
Generated from per-paper analyses; no external browsing.
