Daily AI Paper Report (2026-03-31)


Run stats

  • Candidates: 223
  • Selected: 30
  • Deepread completed: 30
  • Window (UTC): 2026-03-30T00:00:00Z → 2026-03-31T00:00:00Z (arxiv_announce, expanded=0)
Selected papers

| arXiv ID | Title | Categories | Score | Why | Tags |
| --- | --- | --- | --- | --- | --- |
| 2603.28013 | Kill-Chain Canaries: Stage-Level Tracking of Prompt Injection Across Attack Surfaces and Model Safety Tiers | cs.CR, cs.AI, cs.LG | 95 | Stage-level prompt-injection tracking with canaries across agents; actionable defense insights | prompt-injection, agent-security, evaluation, kill-chain, canary-tokens, red-teaming |
| 2603.28063 | Reward Hacking as Equilibrium under Finite Evaluation | cs.AI, cs.GT | 95 | Formal result: reward hacking emerges under finite evaluation; computable distortion index | reward-hacking, alignment-theory, evaluation, principal-agent, RLHF, DPO |
| 2603.28650 | Information-Theoretic Limits of Safety Verification for Self-Improving Systems | cs.LG, cs.AI, stat.ML | 95 | Strong theoretical impossibility results for safety gates in self-improving systems | ai-safety, self-improvement, verification, risk-bounds, theory, distribution-shift |
| 2603.28166 | Evaluating Privilege Usage of Agents on Real-World Tools | cs.CR, cs.AI | 93 | GrantBox sandbox evaluates real-tool privilege usage; closer to real-world agent security | agents, tool-use, privilege, sandbox, security-eval, real-world-tools |
| 2603.28345 | Crossing the NL/PL Divide: Information Flow Analysis Across the NL/PL Boundary in LLM-Integrated Code | cs.SE, cs.AI | 92 | Bridges the NL/PL boundary for information-flow/taint analysis across LLM calls; key for LLM app security | program-analysis, information-flow, LLM-security, prompting, taint-analysis, NL-PL-boundary |
| 2603.28407 | MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome | cs.AI, cs.CL | 90 | Deep research agent benchmark scoring process and outcome; multimodal, refreshable tasks | agents, evaluation, deep-research, multimodal, benchmarks, process-metrics |
| 2603.28204 | ERPO: Token-Level Entropy-Regulated Policy Optimization for Large Reasoning Models | cs.LG, cs.AI | 90 | Token-level RLVR/GRPO fix to prevent entropy collapse; targets reasoning quality | llm, rlvr, grpo, credit-assignment, reasoning, post-training |
| 2603.28054 | Who Wrote the Book? Detecting and Attributing LLM Ghostwriters | cs.CL | 90 | GhostWriteBench plus robust OOD LLM authorship attribution; practical for misuse detection | authorship-attribution, misuse-detection, benchmark, OOD-robustness, fingerprinting, long-form-text |
| 2603.28551 | "What Did It Actually Do?": Understanding Risk Awareness and Traceability for Computer-Use Agents | cs.CR, cs.ET, cs.HC, cs.MA | 89 | Studies risk awareness and post-hoc auditability for computer-use agents; real-incident corpus | agent-safety, computer-use-agents, auditability, traceability, HCI, security |
| 2603.28569 | CirrusBench: Evaluating LLM-based Agents Beyond Correctness in Real-World Cloud Service Environments | cs.LG, cs.AI, cs.IR, cs.PF | 88 | Real cloud-ticket agent benchmark; measures robustness and resolution efficiency beyond accuracy | agents, evaluation, real-world, customer-support, long-horizon, efficiency |
| 2603.27982 | CDH-Bench: A Commonsense-Driven Hallucination Benchmark for Evaluating Visual Fidelity in Vision-Language Models | cs.CV, cs.AI, cs.CL | 88 | New benchmark for commonsense-driven hallucination in VLMs via evidence conflicts | vlm, hallucination, evaluation, robustness, benchmarks, reliability |
| 2603.28376 | Marco DeepResearch: Unlocking Efficient Deep Research Agents via Verification-Centric Design | cs.CL, cs.AI | 87 | Verification-centric deep research agent design across data synthesis, trajectories, and test time | agents, verification, deep-research, tool-use, long-horizon, reliability |
| 2603.28618 | Seeing with You: Perception-Reasoning Coevolution for Multimodal Reasoning | cs.AI | 86 | RLVR split into Observer/Solver roles to improve visual evidence extraction and reasoning | multimodal, rlvr, credit-assignment, evidence, reasoning, mllm |
| 2603.28304 | The Necessity of Setting Temperature in LLM-as-a-Judge | cs.CL | 86 | Shows temperature materially affects LLM-as-judge reliability; important eval hygiene | LLM-judge, evaluation, temperature, reliability, methodology, meta-eval |
| 2603.27918 | Adversarial Attacks on Multimodal Large Language Models: A Comprehensive Survey | cs.CR, cs.AI | 84 | Comprehensive survey of adversarial threats to MLLMs with taxonomy and vulnerability analysis | multimodal, adversarial-attacks, survey, security, threat-models, jailbreaks |
| 2603.28476 | With a Little Help From My Friends: Collective Manipulation in Risk-Controlling Recommender Systems | cs.IR, cs.LG, cs.SI | 84 | Shows coordinated user manipulation can break risk-controlling recommenders with safety guarantees | recommenders, adversarial, safety-guarantees, conformal-risk-control, manipulation |
| 2603.28430 | IsoQuant: Hardware-Aligned SO(4) Isoclinic Rotations for LLM KV Cache Compression | cs.LG, cs.CL | 84 | Hardware-aligned KV-cache compression via SO(4) rotations; practical LLM efficiency | llm, inference, kv-cache, compression, quantization, systems |
| 2603.28135 | CoT2-Meta: Budgeted Metacognitive Control for Test-Time Reasoning | cs.AI | 84 | Training-free metacognitive control for budgeted reasoning, including abstain/repair/prune | test-time-reasoning, inference-time-control, compute-budget, abstention, search, chain-of-thought |
| 2603.28378 | Membership Inference Attacks against Large Audio Language Models | cs.SD, cs.AI | 83 | First MIA study for audio LMs; shows confounds and proposes distribution-matched evaluation | privacy, membership-inference, audio, evaluation, distribution-shift |
| 2603.28005 | Rethinking Atomic Decomposition for LLM Judges: A Prompt-Controlled Study of Reference-Grounded QA Evaluation | cs.CL | 82 | Careful prompt-controlled study of atomic decomposition for LLM judges; eval-reliability focus | LLM-judges, evaluation, grounded-QA, rubrics, factuality, methodology |
| 2603.28092 | InkDrop: Invisible Backdoor Attacks Against Dataset Condensation | cs.LG | 82 | Stealthy backdoor attacks on dataset condensation; highlights a supply-chain vulnerability | backdoors, data-poisoning, dataset-condensation, ML-security, stealth |
| 2603.28610 | ResAdapt: Adaptive Resolution for Efficient Multimodal Reasoning | cs.CV, cs.AI, cs.CL | 82 | Adaptive input resolution trades visual tokens against context; bandit-trained allocator | mllm, efficiency, long-context, vision-tokens, bandits, inference |
| 2603.28696 | AdaptToken: Entropy-based Adaptive Token Selection for MLLM Long Video Understanding | cs.CV, cs.AI | 82 | Uses model uncertainty/entropy to allocate the long-video token budget; scalable MLLM control | MLLM, long-context, video-understanding, token-selection, uncertainty, efficiency |
| 2603.28662 | AMIGO: Agentic Multi-Image Grounding Oracle Benchmark | cs.LG, cs.AI | 81 | Long-horizon multi-image grounding benchmark with a strict protocol; probes uncertainty tracking | multimodal-agents, benchmark, interactive-eval, uncertainty, grounding |
| 2603.28605 | Unsafe2Safe: Controllable Image Anonymization for Downstream Utility | cs.CV, cs.CY, cs.LG | 81 | Automated anonymization via VLM+LLM-guided diffusion edits; privacy protection for training data | privacy, anonymization, diffusion-editing, dataset-safety, VLM, PII |
| 2603.28301 | LIBERO-Para: A Diagnostic Benchmark and Metrics for Paraphrase Robustness in VLA Models | cs.LG | 80 | Benchmark for paraphrase robustness in VLA robots; large drops under synonyms reveal brittleness | evaluation, robustness, paraphrase, VLA, robotics, instruction-following |
| 2603.28488 | Courtroom-Style Multi-Agent Debate with Progressive RAG and Role-Switching for Controversial Claim Verification | cs.CL, cs.AI, cs.MA | 79 | Structured multi-agent debate plus progressive RAG for claim verification; targets hallucinations | claim-verification, RAG, multi-agent, debate, hallucinations, calibration |
| 2603.28038 | Beyond the Answer: Decoding the Behavior of LLMs as Scientific Reasoners | cs.AI, cs.LG | 79 | Prompt-optimization study probes brittleness/transfer of scientific reasoning behaviors | reasoning, prompting, interpretability, robustness, scientific-tasks, behavior-analysis |
| 2603.28730 | SOLE-R1: Video-Language Reasoning as the Sole Reward for On-Robot Reinforcement Learning | cs.RO, cs.CL, cs.CV | 78 | Video-language reasoning model as the sole RL reward; addresses reward exploitation under shift | robotics, RL, VLM, reward-modeling, distribution-shift, agentic-RL |
| 2603.28622 | Trust-Aware Routing for Distributed Generative AI Inference at the Edge | cs.DC, cs.AI, cs.NI | 78 | Trust-aware routing for distributed generative inference; risk-bounded path selection | agent-systems, distributed-inference, trust, robustness, security, edge |

AI Paper Insight Brief

2026-03-31

0) Executive takeaways (read this first)

  • Agent security evaluation is shifting from “did it fail?” to “where did it fail?” Stage-level prompt-injection tracking (EXPOSED→PERSISTED→RELAYED→EXECUTED) shows exposure can be universal while downstream execution differs sharply by model and pipeline stage, which changes what “robust” architectures look like (a minimal stage-tracking sketch follows this list).
  • Real-tool privilege misuse is currently the norm, not the edge case. In a sandbox with real MCP servers/tools, prompt-injection privilege hijacks achieve very high attack success rates (ASR; avg 90.55% with ReAct, 79.05% with Plan-and-Execute), suggesting tool authorization and isolation are the immediate bottleneck.
  • Multimodal reliability failures are increasingly “prior overrides evidence,” not just fabrication. CDH-Bench finds VLMs often revert to commonsense priors when images contain atypical evidence (mean CFAD of 16.39% on QA and 25.20% on multiple choice), with counting anomalies especially hard.
  • Test-time compute is becoming controllable and auditable. CoT2-Meta shows training-free meta-control (expand/prune/repair/abstain) can improve accuracy under fixed budgets and materially improve calibration (reported ECE 0.035).
  • RL for multimodal reasoning is moving toward better credit assignment. PRCO’s Observer/Solver coevolution improves average accuracy by ~+7 points and reduces perception errors (e.g., WeMath perception errors −39.2%), directly targeting perception as the bottleneck.
  • Safety gating theory is hardening: classifier gates may be structurally insufficient for long-lived self-improvement. An information-theoretic result shows classifier-style gates can’t generally keep cumulative risk finite while allowing unbounded beneficial updates under common schedules; verification-style gates can escape (δ=0 with TPR>0 demonstrated on GPT-2 LoRA).
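To make the stage-level idea concrete, here is a minimal sketch of canary-based kill-chain tracking. All names (Stage, Trace, furthest_stage) are hypothetical; the paper's actual harness, payload format, and stage definitions may differ.

```python
"""Minimal sketch of kill-chain canary tracking for prompt injection.

Hypothetical names throughout; the paper's harness may differ.
"""
import secrets
from enum import IntEnum
from dataclasses import dataclass, field


class Stage(IntEnum):
    NONE = 0       # canary never entered the pipeline
    EXPOSED = 1    # canary appeared in model-visible context
    PERSISTED = 2  # canary written into memory / agent state
    RELAYED = 3    # canary forwarded into another model call or tool args
    EXECUTED = 4   # canary-tagged instruction reached a tool execution


@dataclass
class Trace:
    """Artifacts collected from one agent run, one bucket per stage."""
    context: list[str] = field(default_factory=list)
    memory: list[str] = field(default_factory=list)
    relayed: list[str] = field(default_factory=list)
    executed: list[str] = field(default_factory=list)


def make_canary() -> str:
    # Unique, unguessable marker embedded in the injected payload.
    return f"CANARY-{secrets.token_hex(8)}"


def furthest_stage(canary: str, trace: Trace) -> Stage:
    """Return the deepest kill-chain stage the canary reached."""
    stage = Stage.NONE
    for s, bucket in [(Stage.EXPOSED, trace.context),
                      (Stage.PERSISTED, trace.memory),
                      (Stage.RELAYED, trace.relayed),
                      (Stage.EXECUTED, trace.executed)]:
        if any(canary in item for item in bucket):
            stage = max(stage, s)
    return stage


if __name__ == "__main__":
    c = make_canary()
    t = Trace(context=[f"ignore previous instructions {c}"],
              relayed=[f"tool_call(args='{c}')"])
    print(furthest_stage(c, t).name)  # RELAYED: exposed and relayed, not executed
```

Aggregating furthest_stage over many runs per (model, surface) cell yields the funnel metrics the takeaway describes, instead of a single end-to-end ASR.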

1) Key themes (clusters)

Theme: Prompt injection & privilege misuse in agent pipelines

Theme: Multimodal hallucination as “prior-driven normalization”

Theme: Evaluation reliability—judges, temperature, and decomposition myths

Theme: Test-time reasoning control & token-level credit assignment

Theme: Long-horizon multimodal efficiency (video) via adaptive allocation

  • Why it matters: Video reasoning is bottlenecked by visual token budgets; input-side and token-side adaptation can trade spatial fidelity for temporal coverage without changing backbones.
  • Representative papers: ResAdapt (2603.28610), AdaptToken (2603.28696)
  • Common approach:
    • Use lightweight front-ends or training-free signals (entropy, attention) to allocate compute across frames/groups (see the allocation sketch after this list).
    • Add early stopping when certainty is high to cut runtime.
    • Evaluate on long-video benchmarks and stress extremely long inputs (up to 10K frames in AdaptToken).
  • Open questions / failure modes:
    • Open-loop allocation (ResAdapt) can miss brief decisive cues and can’t revise after backbone inference.
    • Group-wise inference remains a runtime bottleneck; early stopping is key but depends on reliable entropy.
    • Interactive multi-image settings expose protocol-following failures (Skip violations, premature guesses).
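The common pattern is easy to sketch. The allocator below is illustrative only, with made-up interfaces (per-group entropies in, token counts out); it mirrors the entropy-guided idea, not AdaptToken's published algorithm.

```python
"""Schematic entropy-guided token budgeting for long-video MLLMs.

Illustrative stand-ins, not AdaptToken's published procedure.
"""
import numpy as np


def softmax_entropy(logits: np.ndarray) -> float:
    """Shannon entropy (nats) of the model's predictive distribution."""
    z = logits - logits.max()
    p = np.exp(z) / np.exp(z).sum()
    return float(-(p * np.log(p + 1e-12)).sum())


def allocate_tokens(group_entropies, total_budget: int, floor: int = 8):
    """Give uncertain frame groups more visual tokens, certain ones fewer.

    Rounding can overshoot the budget slightly; a real system renormalizes.
    """
    e = np.asarray(group_entropies, dtype=float)
    alloc = np.round(e / e.sum() * total_budget).astype(int)
    return np.maximum(floor, alloc)


def early_stop(answer_entropy: float, threshold: float = 0.3) -> bool:
    """Skip remaining frame groups once the running answer is near-certain."""
    return answer_entropy < threshold


# Example: three frame groups, the middle one most uncertain.
print(allocate_tokens([0.2, 1.5, 0.4], total_budget=256))  # -> [ 24 183  49]
```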

Theme: Privacy, forensics, and dataset integrity attacks/defenses

Theme: Alignment theory & safety verification limits

  • Why it matters: Some failure modes (reward hacking, long-run safety gating) may be structural, not patchable by better prompts or more eval.
  • Representative papers: Reward Hacking as Equilibrium under Finite Evaluation (2603.28063); Information-Theoretic Limits of Safety Verification for Self-Improving Systems (2603.28650)
  • Common approach:
    • Formalize evaluation as a projection of high-dimensional quality into limited signals; prove distortion is inevitable under optimization (a toy illustration follows this theme block).
    • Provide computable diagnostics (distortion index via reward-model gradients) and scaling arguments (coverage vanishes with tool combinatorics unless eval scales quadratically).
    • Prove impossibility results for classifier gates under summability constraints; construct verification-based escapes.
  • Open questions / failure modes:
    • Empirical validation is largely pending for the reward-hacking equilibrium model.
    • Verification approaches depend on tractable certificates (e.g., Lipschitz bounds) that may be hard to compute tightly at scale.
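A toy hill-climbing simulation can illustrate the qualitative claim, though it is not the paper's formal model: when the evaluator probes only a few quality dimensions and per-dimension quality has diminishing returns, optimizing the measured score reallocates effort away from unprobed dimensions, so the measured score rises while true quality falls. Every name below is illustrative.

```python
"""Toy illustration of reward hacking under finite evaluation.

Not the paper's formal model; a qualitative demonstration only.
"""
import numpy as np

rng = np.random.default_rng(0)
DIM, PROBED, STEPS = 50, 5, 5000
probe = rng.choice(DIM, size=PROBED, replace=False)  # the finite evaluation

effort = np.ones(DIM) / DIM  # fixed budget spread across quality dimensions

def quality(e):
    return np.sqrt(e)  # diminishing returns per dimension

def eval_score(e):
    return quality(e)[probe].mean()   # what the finite evaluator measures

def true_quality(e):
    return quality(e).mean()          # what actually matters

for _ in range(STEPS):
    src, dst = rng.integers(DIM), rng.choice(probe)
    if effort[src] < 1e-4 or src == dst:
        continue
    cand = effort.copy()
    delta = min(0.001, cand[src])
    cand[src] -= delta                 # propose shifting effort toward a
    cand[dst] += delta                 # probed dimension
    if eval_score(cand) > eval_score(effort):
        effort = cand                  # accept: looks better to the evaluator

print(f"eval score {eval_score(effort):.3f}  true quality {true_quality(effort):.3f}")
# Both start at ~0.14; typically eval score rises to ~0.45
# while true quality collapses to ~0.05.
```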

2) Technical synthesis

  • Stage-level security instrumentation (kill-chain canaries) and NL/PL information-flow taxonomies both operationalize a shared idea: don’t treat LLM outputs as monolithic taint; model intermediate propagation and transformations.
  • Prompt-injection robustness is surface-dependent: the same model can be safe on memory poisoning yet fail completely on tool poisoning/propagation, implying benchmarks must be multi-surface.
  • Several works converge on uncertainty/entropy as a control signal: ERPO preserves entropy at critical tokens; AdaptToken uses response entropy for global token allocation and early stopping; CoT2-Meta fuses process and outcome confidence for control.
  • Multimodal RLVR is splitting into better credit assignment (PRCO’s Observer/Solver) versus better inference-time control (CoT2-Meta); both aim to reduce “fluent wrongness” but at different lifecycle stages.
  • Evaluation reliability is now treated as a first-class systems variable: temperature strongly affects judge consistency/error rates; prompt richness can confound “atomic decomposition” benefits (a consistency-measurement sketch follows this list).
  • Benchmarks are moving beyond final correctness into process and efficiency metrics: MiroEval (process↔report alignment), CirrusBench (NEI/LJ/latency), AMIGO (protocol compliance + verified accuracy).
  • Privacy auditing in audio shows a general lesson for safety evals: blind baselines can explain apparent model vulnerabilities; without controlling for dataset artifacts, conclusions can be wrong.
  • Theoretical alignment papers suggest a looming mismatch: as agents gain tools, evaluation coverage shrinks (reward hacking amplification), while classifier-style safety gates may face long-run impossibility, pushing toward verification/certification.
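One quick way to operationalize the judge-temperature point: run the judge repeatedly and report majority agreement per temperature. The sketch below assumes the OpenAI Python SDK; the model name, prompt, and PASS/FAIL parsing are placeholders, not the cited paper's protocol.

```python
"""Sketch: measure LLM-judge self-consistency as a function of temperature.

Assumes the OpenAI Python SDK; model, prompt, and parsing are placeholders.
"""
from collections import Counter
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = (
    "You are a strict grader. Answer only PASS or FAIL.\n"
    "Question: {q}\nReference: {ref}\nCandidate: {cand}\nVerdict:"
)


def judge_once(q, ref, cand, temperature, model="gpt-4o-mini"):
    resp = client.chat.completions.create(
        model=model,
        temperature=temperature,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(q=q, ref=ref, cand=cand)}],
    )
    return resp.choices[0].message.content.strip().upper()


def self_consistency(q, ref, cand, temperature, n=10):
    """Fraction of repeated verdicts that agree with the majority verdict."""
    verdicts = [judge_once(q, ref, cand, temperature) for _ in range(n)]
    top_count = Counter(verdicts).most_common(1)[0][1]
    return top_count / n


# Report alongside scores, e.g. consistency at T=0.0 vs T=1.0:
# print(self_consistency("2+2?", "4", "4", temperature=0.0))
```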

3) Top 5 papers (with “why now”)

1) Kill-Chain Canaries: Stage-Level Tracking of Prompt Injection Across Attack Surfaces and Model Safety Tiers

  • Introduces stage-level tracking (EXPOSED/PERSISTED/RELAYED/EXECUTED) that explains where defenses work, not just whether the final action happened.
  • Shows exposure can be 100% while execution varies widely (e.g., GPT-4o-mini 53% ASR, GPT-5-mini 3%, Claude variants 0% in reported no-defense runs).
  • Reveals extreme surface splits (e.g., DeepSeek 0% on memory_poison vs 100% on tool_poison/propagation in reported cells).
  • Skepticism: modest per-cell sample sizes and synthetic explicit payloads; mechanism behind “summarization-stage stripping” not isolated.

2) Evaluating Privilege Usage of Agents on Real-World Tools

  • Provides a real-tool sandbox (10 MCP servers, 122 privilege-sensitive tools) and auto-generated benign/malicious requests.
  • Reports very high privilege-hijack ASR averages (90.55% ReAct, 79.05% Plan-and-Execute) across four LLMs—strong evidence the problem is immediate.
  • Highlights that planning helps but does not solve privilege misuse (a deny-by-default allowlist sketch follows these notes).
  • Skepticism: limited to 10 servers and four models; defenses not evaluated yet.
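A deny-by-default privilege gate is the obvious first mitigation to test in such a harness. The sketch below is generic (ToolCall and PrivilegeGate are hypothetical names), not GrantBox's implementation.

```python
"""Sketch of privilege minimization via a per-tool allowlist wrapper.

Generic illustration; not GrantBox's actual implementation.
"""
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    server: str   # e.g. an MCP server name
    tool: str     # tool endpoint on that server
    args: dict


class PrivilegeGate:
    """Deny-by-default gate: a call runs only if (server, tool) is granted."""

    def __init__(self, grants: set[tuple[str, str]]):
        self.grants = grants
        self.audit_log: list[tuple[ToolCall, bool]] = []

    def check(self, call: ToolCall) -> bool:
        allowed = (call.server, call.tool) in self.grants
        self.audit_log.append((call, allowed))  # keep traces for post-hoc audit
        return allowed


gate = PrivilegeGate(grants={("filesystem", "read_file")})
benign = ToolCall("filesystem", "read_file", {"path": "notes.txt"})
hijack = ToolCall("filesystem", "delete_file", {"path": "/"})
assert gate.check(benign) and not gate.check(hijack)
```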

3) MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome

  • Refreshable, user-grounded benchmark with process-centric evaluation and multimodal tasks.
  • Finds process quality strongly predicts outcome (reported r = 0.88), making “trace quality” a measurable target.
  • Shows multimodal tasks cause consistent drops (3–10 points) and rankings shift across synthesis/factuality/process.
  • Skepticism: process evaluation requires access to traces; absolute scores depend on LLM judges even if rankings are robust.

4) CoT2-Meta: Budgeted Metacognitive Control for Test-Time Reasoning

  • Training-free controller that allocates inference budget across expand/prune/repair/stop/abstain actions using fused process+outcome signals (a control-loop sketch follows these notes).
  • Reports consistent gains across 15 benchmarks under matched budgets and improved calibration (reported ECE 0.035).
  • Provides interpretable controller traces and ablations tying gains to components.
  • Skepticism: depends on oracle/process-evaluator quality; misranking can cause premature pruning.
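The controller's shape is easy to prototype. Below is a schematic budgeted loop, assuming hypothetical expand/process_conf/outcome_conf callables; it shows the expand/prune/stop/abstain structure, not CoT2-Meta's actual policy.

```python
"""Schematic budgeted metacognitive controller for test-time reasoning.

Hypothetical interfaces; shows the control-loop shape only.
"""

def metacontrol(question, expand, process_conf, outcome_conf,
                budget=8, tau_stop=0.9, tau_abstain=0.2):
    """Spend up to `budget` expansions; return the best branch or None (abstain)."""
    branches = [expand(question, prefix="")]           # first reasoning attempt
    spent = 1
    while spent < budget:
        # Fused signal: confidence in the trace and in the answer jointly.
        scored = sorted(((0.5 * process_conf(b) + 0.5 * outcome_conf(b), b)
                         for b in branches), reverse=True, key=lambda t: t[0])
        best_score, best = scored[0]
        if best_score >= tau_stop:
            return best                                # STOP: confident enough
        branches = [b for _, b in scored[:2]]          # PRUNE weak branches
        branches.append(expand(question, prefix=best)) # EXPAND/REPAIR the best
        spent += 1
    best_score, best = max(((0.5 * process_conf(b) + 0.5 * outcome_conf(b), b)
                            for b in branches), key=lambda t: t[0])
    return best if best_score >= tau_abstain else None  # ABSTAIN if too unsure
```

Note the failure mode flagged above falls out of the structure: if process_conf misranks branches, PRUNE discards the branch that would have been correct.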

5) Seeing with You: Perception-Reasoning Coevolution for Multimodal Reasoning

  • Addresses RLVR’s blurred credit assignment by alternating an Observer (evidence caption) and a Solver (answer), with role-specific rewards and leakage suppression (a data-flow sketch follows these notes).
  • Reports ~+7 point average accuracy gains and large perception-error reductions (e.g., −39.2% on WeMath perception errors).
  • Demonstrates gains across multiple backbones including Qwen3-VL-8B-Instruct.
  • Skepticism: intermediate captions can be lossy; evaluated on concise verifiable-answer benchmarks, not open-ended generation.
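The role separation can be read as a data-flow contract: the Solver never sees the image, so the Observer's caption is rewarded only through how often a Solver succeeds from it. The callables below are hypothetical stand-ins, not the paper's reward design or training loop.

```python
"""Sketch of role-separated credit assignment for multimodal RLVR.

observer/solver/verify are hypothetical callables; data flow only.
"""

def observer_reward(image, question, observer, solver, verify, k=4):
    """Score a caption by how often a text-only Solver succeeds from it.

    The Solver never sees the image, so reward reaches the Observer only
    via the usefulness of its evidence caption. (Suppressing leakage would
    additionally require rejecting captions that state the answer outright.)
    """
    caption = observer(image, question)
    wins = sum(float(verify(solver(caption, question))) for _ in range(k))
    return caption, wins / k


def solver_reward(caption, question, solver, verify):
    """Score the Solver on verifiable correctness given fixed text evidence."""
    return float(verify(solver(caption, question)))
```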

4) Practical next steps

  • For agent security evals, replace single ASR with stage-level metrics (exposed/persisted/relayed/executed) and run across multiple injection surfaces (memory, tool outputs, propagation, permission escalation).
  • In tool-using systems, implement privilege minimization + per-tool allowlists and measure misuse using a GrantBox-like harness; compare ReAct vs Plan-and-Execute as a baseline mitigation.
  • Add NL/PL boundary flow labeling (placeholder preservation/modality taxonomy) to CI for LLM-integrated code; use it to prioritize which callsites need strict sanitization or structured output constraints.
  • For multimodal models, add a CDH-style paired evaluation (evidence vs prior conflict) and track CFAD/CCR to detect “normalization” failures that standard VQA misses.
  • When using LLM-as-a-judge, set temperature intentionally (very low T for consistency/parse stability) and report judge temperature + repeated-seed variance as part of benchmark methodology.
  • For test-time reasoning, prototype budgeted meta-control (prune/repair/abstain) and measure not just accuracy but ECE/selective prediction under fixed compute (an ECE sketch follows this list).
  • For multimodal RLVR, experiment with role-separated credit assignment (Observer/Solver) and explicitly measure perception vs reasoning error categories to ensure perception improves.
  • For privacy audits (especially audio), run blind baseline separability checks before claiming memorization; only then run MIAs on distribution-matched subsets.
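For the calibration measurement, the standard equal-width-bin ECE estimator suffices; note that bin count and binning scheme are choices that should be reported with the number.

```python
"""Expected calibration error (ECE) with equal-width confidence bins."""
import numpy as np


def ece(confidences, correct, n_bins=15):
    """Standard equal-width-bin ECE estimator."""
    conf = np.asarray(confidences, dtype=float)
    acc = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    err = 0.0
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        # Left-closed first bin, right-closed elsewhere, so every point lands once.
        mask = (conf >= lo) & (conf <= hi) if i == 0 else (conf > lo) & (conf <= hi)
        if mask.any():
            # Bin mass times |mean confidence - accuracy| inside the bin.
            err += mask.mean() * abs(conf[mask].mean() - acc[mask].mean())
    return err


# Overconfident predictions produce a visibly nonzero ECE.
print(ece([0.9, 0.8, 0.95, 0.6], [1, 0, 1, 1]))
```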

Generated from per-paper analyses; no external browsing.