Daily AI Paper Report (2026-04-01)


Run stats

  • Candidates: 223
  • Selected: 30
  • Deepread completed: 30
  • Window (UTC): 2026-03-30T00:00:00Z → 2026-03-31T00:00:00Z (arxiv_announce, expanded=0)
Selected papers

  • 2603.28013 · Kill-Chain Canaries: Stage-Level Tracking of Prompt Injection Across Attack Surfaces and Model Safety Tiers [PDF]
    Categories: cs.CR, cs.AI, cs.LG · Score: 95
    Why: Stage-level prompt-injection tracking with canaries across models, surfaces, and safety tiers
    Tags: prompt-injection, agent-security, evaluation, canary-tokens, kill-chain, frontier-models
  • 2603.28063 · Reward Hacking as Equilibrium under Finite Evaluation [PDF]
    Categories: cs.AI, cs.GT · Score: 95
    Why: Formal result: reward hacking emerges under finite evaluation; computable distortion index.
    Tags: reward-hacking, principal-agent, evaluation, alignment-theory, RLHF, DPO
  • 2603.28650 · Information-Theoretic Limits of Safety Verification for Self-Improving Systems [PDF]
    Categories: cs.LG, cs.AI, stat.ML · Score: 95
    Why: Formal impossibility bounds for safety gates in self-improving systems; high AI safety relevance
    Tags: AI safety, self-improvement, verification, risk bounds, theory, TPR/FPR, impossibility
  • 2603.28166 · Evaluating Privilege Usage of Agents on Real-World Tools [PDF]
    Categories: cs.CR, cs.AI · Score: 92
    Why: GrantBox sandbox evaluates real-world tool privilege usage, a core risk for autonomous agents
    Tags: agent-security, tool-use, privilege, sandbox, benchmark, real-world-tools
  • 2603.28345 · Crossing the NL/PL Divide: Information Flow Analysis Across the NL/PL Boundary in LLM-Integrated Code [PDF]
    Categories: cs.SE, cs.AI · Score: 92
    Why: Bridges NL/PL boundary for info-flow/taint across LLM calls; key for LLM app security.
    Tags: program-analysis, information-flow, taint-analysis, LLM-security, prompting, software-engineering
  • 2603.28551 · "What Did It Actually Do?": Understanding Risk Awareness and Traceability for Computer-Use Agents [PDF]
    Categories: cs.CR, cs.ET, cs.HC, cs.MA · Score: 90
    Why: Empirical study + corpus on computer-use agent risk awareness and post-hoc auditability.
    Tags: computer-use-agents, auditability, traceability, user-safety, agent-security, HCI
  • 2603.28204 · ERPO: Token-Level Entropy-Regulated Policy Optimization for Large Reasoning Models [PDF]
    Categories: cs.LG, cs.AI · Score: 90
    Why: Token-level entropy regulation for RLVR/GRPO improves credit assignment in reasoning chains
    Tags: LLM reasoning, RLVR, GRPO, post-training, credit assignment, entropy, optimization
  • 2603.28054 · Who Wrote the Book? Detecting and Attributing LLM Ghostwriters [PDF]
    Categories: cs.CL · Score: 90
    Why: GhostWriteBench + robust OOD LLM authorship attribution; practical for misuse detection.
    Tags: authorship-attribution, model-fingerprinting, misuse-detection, dataset, robustness, OOD
  • 2603.28407 · MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome [PDF]
    Categories: cs.AI, cs.CL · Score: 89
    Why: MiroEval benchmarks multimodal deep-research agents on process + outcome with refreshable tasks
    Tags: agents, evaluation, deep-research, multimodal, process-metrics, benchmark
  • 2603.28376 · Marco DeepResearch: Unlocking Efficient Deep Research Agents via Verification-Centric Design [PDF]
    Categories: cs.CL, cs.AI · Score: 88
    Why: Verification-centric deep research agent design across synthesis/trajectories/test-time scaling.
    Tags: deep-research-agents, verification, tool-use, long-horizon, test-time-scaling, RAG
  • 2603.27982 · CDH-Bench: A Commonsense-Driven Hallucination Benchmark for Evaluating Visual Fidelity in Vision-Language Models [PDF]
    Categories: cs.CV, cs.AI, cs.CL · Score: 88
    Why: New benchmark for commonsense-driven hallucination when vision conflicts with priors
    Tags: VLM, hallucination, evaluation, robustness, benchmarks, visual grounding, reliability
  • 2603.28569 · CirrusBench: Evaluating LLM-based Agents Beyond Correctness in Real-World Cloud Service Environments [PDF]
    Categories: cs.LG, cs.AI, cs.IR, cs.PF · Score: 87
    Why: Real cloud-ticket agent benchmark measuring robustness and resolution efficiency beyond accuracy
    Tags: agents, evaluation, real-world, customer-support, long-horizon, reliability
  • 2603.28387 · The Scaffold Effect: How Prompt Framing Drives Apparent Multimodal Gains in Clinical VLM Evaluation [PDF]
    Categories: cs.AI, cs.LG · Score: 86
    Why: Shows prompt framing can fake multimodal clinical gains; important eval artifact warning.
    Tags: evaluation, prompting, multimodal, VLM, clinical-AI, spurious-cues
  • 2603.28430 · IsoQuant: Hardware-Aligned SO(4) Isoclinic Rotations for LLM KV Cache Compression [PDF]
    Categories: cs.LG, cs.CL · Score: 86
    Why: Hardware-aligned SO(4) rotations for low-bit KV-cache compression; practical LLM efficiency gain
    Tags: LLM efficiency, KV cache, compression, quantization, inference, systems, long-context
  • 2603.28304 · The Necessity of Setting Temperature in LLM-as-a-Judge [PDF]
    Categories: cs.CL · Score: 86
    Why: Shows temperature materially affects LLM-as-judge reliability; key for eval validity.
    Tags: LLM-as-a-judge, evaluation, reproducibility, temperature, meta-eval, benchmarking
  • 2603.28618 · Seeing with You: Perception-Reasoning Coevolution for Multimodal Reasoning [PDF]
    Categories: cs.AI · Score: 85
    Why: Dual-role RLVR separates perception vs reasoning credit; targets evidence extraction failures
    Tags: multimodal, RLVR, credit assignment, VLM, reasoning, perception, reliability
  • 2603.27918 · Adversarial Attacks on Multimodal Large Language Models: A Comprehensive Survey [PDF]
    Categories: cs.CR, cs.AI · Score: 84
    Why: Comprehensive survey of adversarial attacks on MLLMs with taxonomy across modalities/settings
    Tags: multimodal, adversarial-attacks, survey, security, jailbreaks, threat-models
  • 2603.28092 · InkDrop: Invisible Backdoor Attacks Against Dataset Condensation [PDF]
    Categories: cs.LG · Score: 84
    Why: Stealthy backdoor attack on dataset condensation; highlights new data-poisoning surface.
    Tags: backdoors, data-poisoning, dataset-condensation, adversarial-ML, security
  • 2603.28135 · CoT2-Meta: Budgeted Metacognitive Control for Test-Time Reasoning [PDF]
    Categories: cs.AI · Score: 84
    Why: Training-free metacognitive control for test-time reasoning: prune/repair/abstain under budget.
    Tags: test-time-compute, reasoning, abstention, search, inference, reliability
  • 2603.28301 · LIBERO-Para: A Diagnostic Benchmark and Metrics for Paraphrase Robustness in VLA Models [PDF]
    Categories: cs.LG · Score: 83
    Why: Benchmark for paraphrase robustness in VLA robots; large drops under simple synonyms.
    Tags: robustness, paraphrase, VLA, robotics, benchmark, instruction-following
  • 2603.28610 · ResAdapt: Adaptive Resolution for Efficient Multimodal Reasoning [PDF]
    Categories: cs.CV, cs.AI, cs.CL · Score: 83
    Why: Learns input-side adaptive resolution via bandits to cut visual tokens while keeping reasoning
    Tags: multimodal, efficiency, adaptive compute, token budget, context length, bandits, MLLM
  • 2603.28005 · Rethinking Atomic Decomposition for LLM Judges: A Prompt-Controlled Study of Reference-Grounded QA Evaluation [PDF]
    Categories: cs.CL · Score: 82
    Why: Prompt-controlled test questions whether atomic decomposition truly helps LLM judges in QA eval
    Tags: LLM-judges, evaluation, factuality, reference-grounding, prompting, methodology
  • 2603.28605 · Unsafe2Safe: Controllable Image Anonymization for Downstream Utility [PDF]
    Categories: cs.CV, cs.CY, cs.LG · Score: 82
    Why: Automated privacy-risk detection + diffusion editing to anonymize images while preserving utility.
    Tags: privacy, data-sanitization, diffusion-editing, dataset-curation, VLM
  • 2603.28696 · AdaptToken: Entropy-based Adaptive Token Selection for MLLM Long Video Understanding [PDF]
    Categories: cs.CV, cs.AI · Score: 82
    Why: Entropy-guided token budgeting for long-video MLLMs; principled stop/allocate mechanism.
    Tags: multimodal, long-context, video-understanding, efficiency, uncertainty, token-selection
  • 2603.28488 · Courtroom-Style Multi-Agent Debate with Progressive RAG and Role-Switching for Controversial Claim Verification [PDF]
    Categories: cs.CL, cs.AI, cs.MA · Score: 80
    Why: Structured multi-agent debate + progressive RAG for controversial claim verification robustness
    Tags: claim-verification, RAG, multi-agent, debate, hallucinations, calibration
  • 2603.28476 · With a Little Help From My Friends: Collective Manipulation in Risk-Controlling Recommender Systems [PDF]
    Categories: cs.IR, cs.LG, cs.SI · Score: 80
    Why: Analyzes coordinated user manipulation against risk-controlling recommenders with safety guarantees.
    Tags: recommenders, adversarial-manipulation, conformal-risk-control, safety, social-computing
  • 2603.28622 · Trust-Aware Routing for Distributed Generative AI Inference at the Edge [PDF]
    Categories: cs.DC, cs.AI, cs.NI · Score: 80
    Why: Trust-aware routing for distributed generative inference; robustness to misbehaving edge peers
    Tags: agentic systems, distributed inference, trust, robustness, security, edge AI, systems
  • 2603.28026 · When Choices Become Priors: Contrastive Decoding for Scientific Figure Multiple-Choice QA [PDF]
    Categories: cs.AI · Score: 80
    Why: Training-free contrastive decoding to reduce answer-choice priors; improves figure grounding.
    Tags: multimodal-eval, grounding, decoding, bias, scientific-qa, robustness
  • 2603.28378 · Membership Inference Attacks against Large Audio Language Models [PDF]
    Categories: cs.SD, cs.AI · Score: 79
    Why: First systematic MIA study for audio LMs; shows confounds and proposes distribution-matched eval
    Tags: privacy, membership-inference, audio, evaluation, dataset-shift, leakage
  • 2603.28662 · AMIGO: Agentic Multi-Image Grounding Oracle Benchmark [PDF]
    Categories: cs.LG, cs.AI · Score: 78
    Why: AMIGO tests long-horizon multi-image grounding via constrained questioning under uncertainty
    Tags: multimodal-agents, benchmark, interactive-eval, uncertainty, grounding, long-horizon

AI Paper Insight Brief

2026-04-01

0) Executive takeaways (read this first)

  • Evaluation is the new attack surface: multiple papers show that how we evaluate (prompt framing, temperature, decomposition style, MC vs QA format) can dominate measured “capability” and hide grounding failures—so benchmark design is now a first-class safety concern.
  • Agent security failures are pipeline- and surface-dependent, not model-dependent: prompt injection and privilege misuse vary wildly by injection surface and agent stage; outcome-only ASR is insufficient to choose defenses or architectures.
  • “Priors beat pixels” remains a core multimodal failure mode: VLMs often normalize anomalies to commonsense (CDH), follow answer-choice priors in scientific MCQA, or respond to “MRI available” scaffolds without using images—suggesting grounding interventions must explicitly counter prior dominance.
  • Verification-centric designs are emerging as a unifying reliability lever: from deep-research agents (verification at synthesis/trajectory/inference) to stage-level canaries and process-centric benchmarks, verification is being operationalized as instrumentation + control, not just post-hoc scoring.
  • Training-time and data-supply-chain risks are sharpening: stealthy backdoors in dataset condensation (InkDrop) and broad MLLM attack taxonomies highlight that multimodal systems inherit vulnerabilities across encoders, fusion, and instruction-following—often transferable in black-box settings.
  • Test-time compute is shifting from “more tokens” to “better control”: metacognitive controllers (CoT2-Meta) and long-video token/resolution allocators (AdaptToken, ResAdapt) show consistent gains under fixed budgets by allocating compute to high-value steps/frames.

2) Key themes (clusters)

  • Agent prompt-injection & privilege misuse need stage/surface localization
  • Multimodal grounding failures driven by priors and prompt scaffolds
  • Evaluation methodology is brittle (judges, prompts, temperature, decomposition)
  • Verification-centric agents and test-time control under budgets
  • Security & privacy threats across modalities and data pipelines

3) Technical synthesis

  • Stage decomposition is recurring: kill-chain canaries decompose injection propagation; MiroEval decomposes outcome into synthesis/factuality/process; CoT2-Meta decomposes reasoning into expand/prune/repair/stop; PRCO decomposes training into Observer vs Solver roles.
  • Uncertainty/entropy is becoming a control signal across training and inference: ERPO uses token entropy to gate RL updates at “critical decision pivots”; AdaptToken uses response entropy for group allocation + early stopping; CoT2-Meta uses process/outcome scoring to decide actions.
  • “Priors vs evidence” shows up in multiple guises: commonsense-driven hallucination (paired counterfactuals), choice-induced priors (MCQA), and scaffold prompts (clinical VLM) all indicate that text context can dominate even when images are present.
  • Benchmark formats can change failure rates materially: CDH-Bench finds MC format worsens prior collapse vs binary QA; prompt-injection ASR ranges 0–100% for the same model depending on surface; judge temperature shifts consistency and parseability.
  • Verification is being operationalized as both training data hygiene and inference-time scaling: Marco’s uniqueness verification in synthesis and verifier-guided test-time scaling; PROCLAIM’s progressive retrieval + judicial panel; MiroEval’s agentic factuality verifier.
  • Defense evaluation must match threat models: kill-chain results show active defenses can fail catastrophically under surface mismatch; audio MIA shows privacy audits must control distribution shift; multimodal attack survey emphasizes attacker-knowledge regimes (black-box dominates).
  • Compute budgets are treated explicitly: CoT2-Meta counts all calls into a budget C; AdaptToken-Lite halves inference time via early stopping; ResAdapt targets pre-encoder pixel budget to trade spatial for temporal evidence.
  • Instrumentation + traceability is expanding beyond developers to end users: AgentTrace improves user comprehension and anomaly detection; GrantBox logs outbound requests/authorization parameters; NL/PL taxonomy enables taint/slicing decisions.
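Several of the entropy-guided mechanisms above share one primitive: compute the Shannon entropy of a discrete distribution and feed it into a budget-aware decision rule. A minimal sketch of that primitive (function names and thresholds are illustrative, not taken from ERPO, AdaptToken, or CoT2-Meta):

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def allocate_by_entropy(group_probs, total_budget):
    """Split a token/frame budget across groups in proportion to their
    entropy, so uncertain groups receive more compute."""
    ents = [entropy(p) for p in group_probs]
    z = sum(ents) or 1.0
    return [round(total_budget * e / z) for e in ents]

def should_stop_early(probs, threshold=0.5):
    """Stop spending compute once the answer distribution is confident
    (entropy below threshold)."""
    return entropy(probs) < threshold
```

The same entropy value can gate RL updates at decision pivots (ERPO-style), split a frame budget (AdaptToken-style), or trigger early stopping; only the decision rule around it changes.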

4) Top 5 papers (with “why now”)

1) Kill-Chain Canaries: Stage-Level Tracking of Prompt Injection Across Attack Surfaces and Model Safety Tiers

  • Introduces stage-level prompt injection metrics (EXPOSED/PERSISTED/RELAYED/EXECUTED) that explain where defenses work.
  • Shows exposure is universal (100%); safety depends on downstream propagation.
  • Demonstrates surface dependence: same model can be 0% vs 100% ASR depending on injection surface (e.g., DeepSeek memory_poison vs tool_poison/propagation).
  • Skepticism: modest per-cell sample sizes and synthetic payloads; root cause of model differences (e.g., Claude write_memory filtering) not isolated.
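The stage metrics lend themselves to lightweight instrumentation: plant a unique canary string in one untrusted surface, then scan each stage's transcript for it. The stage names below follow the paper; the harness itself is an illustrative sketch, not the authors' code:

```python
import uuid

# Stage names from the paper's kill-chain decomposition.
STAGES = ["EXPOSED", "PERSISTED", "RELAYED", "EXECUTED"]

def make_canary():
    """Unique marker to plant in an untrusted surface (tool output, memory, ...)."""
    return f"CANARY-{uuid.uuid4().hex[:12]}"

def track_kill_chain(canary, transcripts):
    """Check each stage's transcript (model input, memory store, inter-agent
    messages, tool-call arguments) for the canary."""
    return {stage: canary in transcripts.get(stage, "") for stage in STAGES}

def deepest_stage(progress):
    """Deepest stage the canary reached, or None if it never appeared."""
    reached = [s for s in STAGES if progress.get(s)]
    return reached[-1] if reached else None
```

Aggregating `deepest_stage` over many trials yields per-stage propagation rates, the kind of report the paper argues should replace outcome-only ASR.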

2) CDH-Bench: A Commonsense-Driven Hallucination Benchmark for Evaluating Visual Fidelity in Vision-Language Models

  • Defines commonsense-driven hallucination and provides a paired counterfactual design that isolates prior collapse.
  • Reports systematic gaps: mean CFAD 16.39% (QA) / 25.20% (MC); 7/8 models degrade on counterfactuals.
  • Highlights that MC format amplifies prior-driven errors, relevant to many real product UIs.
  • Skepticism: synthetic images and limited scale (300 pairs); broader real-world anomaly coverage not shown.
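The paired counterfactual design reduces to an accuracy gap between matched factual and counterfactual cases. A hedged sketch of that computation (CDH-Bench's exact CFAD definition may differ in weighting or per-pair handling):

```python
def accuracy(model, items):
    """Fraction of (question, gold) items the model answers correctly."""
    return sum(model(q) == gold for q, gold in items) / len(items)

def counterfactual_gap(model, factual_items, counterfactual_items):
    """Accuracy drop from factual images to their paired counterfactual
    versions. A large positive gap means answers come from commonsense
    priors rather than from the pixels."""
    return accuracy(model, factual_items) - accuracy(model, counterfactual_items)
```

A model that answers purely from priors scores perfectly on the factual half and fails the counterfactual half, so the gap isolates prior collapse from plain incompetence.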

3) Evaluating Privilege Usage of Agents on Real-World Tools

  • Provides GrantBox, integrating 10 real MCP servers / 122 privilege-sensitive tools in containers with logging.
  • Finds very high prompt-injection success: avg ASR 90.55% (ReAct) and 79.05% (Plan-and-Execute) in their setup.
  • Makes privilege misuse measurable via real outbound request logging rather than toy tools.
  • Skepticism: current evaluation focuses on “native” agent behavior without defenses; environment setup complexity may affect reproducibility.
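A least-privilege regression test over such logs needs only two artifacts: a record of (tool, action) calls and a per-tool allowlist. A minimal checker under that assumed log shape (GrantBox's actual schema may differ):

```python
def audit_privileges(call_log, granted):
    """Return tool calls that fall outside the granted allowlist,
    i.e. least-privilege violations. `call_log` is a list of
    (tool, action) pairs; `granted` maps tool -> set of allowed actions."""
    return [(tool, action) for tool, action in call_log
            if action not in granted.get(tool, set())]
```

Running this over logs captured during adversarial-prompt replays turns privilege misuse from an anecdote into a regression-testable metric.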

4) CoT2-Meta: Budgeted Metacognitive Control for Test-Time Reasoning

  • Training-free controller that allocates budget across expand/prune/repair/stop/abstain using fused process+outcome scoring.
  • Reports consistent gains under matched budgets across 15 benchmarks and improved calibration (e.g., ECE ~0.035).
  • Provides a concrete failure taxonomy (search-not-converged, evaluator misjudgment, over-pruning) useful for debugging.
  • Skepticism: depends on quality of online process evaluation signals; hand-designed controller/meta-state may not generalize.
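The budget accounting is the load-bearing design choice: every call, including evaluator calls, is charged against the budget. A toy loop under that rule, using a reduced action set (expand/stop/abstain) of the paper's expand/prune/repair/stop/abstain; the expansion and scoring callables are placeholders you would supply:

```python
def controlled_reasoning(expand_fn, score_fn, budget, threshold=0.9):
    """Toy budgeted controller: keep expanding a candidate until one
    scores above threshold or the budget runs out; abstain (return None)
    if no candidate is good enough. Every expansion AND every scoring
    call costs one unit of budget."""
    best, best_score = None, float("-inf")
    candidate = ""
    while budget > 0:
        candidate = expand_fn(candidate)   # one reasoning step...
        budget -= 1                        # ...charged to the budget
        if budget <= 0:
            break
        score = score_fn(candidate)        # process/outcome evaluation...
        budget -= 1                        # ...also charged
        if score > best_score:
            best, best_score = candidate, score
        if best_score >= threshold:
            return best                    # confident enough: stop early
    return best if best_score >= threshold else None   # abstain
```

Charging evaluator calls too is what makes "matched budget" comparisons honest: a controller that scores lavishly gets fewer expansions, not free verification.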

5) The Scaffold Effect: How Prompt Framing Drives Apparent Multimodal Gains in Clinical VLM Evaluation

  • Shows that simply mentioning MRI availability can drive most of the “multimodal” gain (~70–80% of confidence shift), even when images add little signal.
  • Uses strong ablations (false-modality, swap images) and expert trace review to argue gains are often not evidence-based.
  • Demonstrates that an alignment intervention (MPO) can suppress MRI-referencing but collapses performance, underscoring how hard the root cause is to fix.
  • Skepticism: limited to two cohorts and open-weight models; scaffold estimate derived from the most responsive models.
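The headline decomposition can be expressed as a share-of-gain ratio. This helper is an illustrative reading of the claim, not the paper's estimator:

```python
def scaffold_share(base, scaffold_only, full_multimodal):
    """Fraction of the apparent multimodal gain attributable to the text
    scaffold alone: (scaffold_only - base) / (full_multimodal - base).
    `base` is performance with neither scaffold nor image; values near
    1.0 mean the 'multimodal' gain is mostly prompt framing."""
    gain = full_multimodal - base
    return (scaffold_only - base) / gain if gain else float("nan")
```

Three runs (no scaffold/no image, scaffold-only, scaffold+image) are enough to compute it, which makes the ablation cheap to add to any multimodal eval.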

5) Practical next steps

  • Adopt stage-level injection telemetry in your agent stack (canary tokens + EXPOSED/PERSISTED/RELAYED/EXECUTED logging) and require defenses to report where they stop propagation, not just ASR.
  • Run multi-surface prompt-injection evaluations (memory poisoning, tool output poisoning, relay propagation, permission escalation) before shipping; treat “surface mismatch” as a primary failure mode.
  • Add privilege-usage auditing: log tool calls + authorization parameters (GrantBox-style) and build regression tests for “least privilege” violations under adversarial prompts.
  • Harden multimodal grounding evaluations by adding paired counterfactual-vs-commonsense cases (CDH-style) and prompt scaffolding ablations (e.g., “modality available” preambles, swap images).
  • For scientific/MCQA products, test choice-induced priors by comparing multimodal vs text-only logits; consider SCICON-style subtraction when visual evidence margin is strong, and measure harmed-case rate when priors are correct.
  • Stabilize LLM-judge pipelines: fix and report temperature; measure parse error and consistency across seeds; consider matched holistic rubrics when completeness/partial-support is key (and track token cost).
  • Instrument process quality for research agents (MiroEval-style): collect process logs, compute process metrics, and correlate with factuality/outcome to detect “good-looking reports from bad processes.”
  • Treat condensed/synthetic datasets as supply-chain artifacts: add backdoor scanning and provenance controls for dataset condensation outputs; assume stealthy triggers can be imperceptible (InkDrop).
  • For audio privacy audits, always run blind baselines (metadata/text/acoustic) and distribution-matched splits before concluding memorization-based leakage.
  • Budgeted inference: if you’re scaling test-time compute, prefer controllers (prune/repair/stop) and uncertainty-guided allocation (entropy-based early stopping) over uniform “more sampling,” and track calibration alongside accuracy.
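The judge-stabilization step above can be enforced with a small harness: pin temperature inside whatever judge you wrap, then measure parse-error rate and verdict agreement across repeated runs. The PASS/FAIL verdict format is an assumption for illustration; `judge` is any callable you supply that returns the judge's raw string:

```python
from collections import Counter

def judge_consistency(judge, item, n=5):
    """Run an LLM judge n times on the same item and report the
    parse-error rate and the share of parseable runs agreeing with the
    modal verdict. Outputs outside {PASS, FAIL} count as parse errors."""
    verdicts, parse_errors = [], 0
    for _ in range(n):
        raw = judge(item).strip().upper()
        if raw in {"PASS", "FAIL"}:
            verdicts.append(raw)
        else:
            parse_errors += 1
    agreement = Counter(verdicts).most_common(1)[0][1] / len(verdicts) if verdicts else 0.0
    return {"parse_error_rate": parse_errors / n, "agreement": agreement}
```

Tracking both numbers per judge configuration (model, temperature, rubric) makes temperature-induced drift visible before it contaminates benchmark deltas.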

Generated from per-paper analyses; no external browsing.