Daily AI Paper Report (2026-03-31)
Published:
Chinese version: [中文]
Run stats
- Candidates: 223
- Selected: 30
- Deepread completed: 30
- Window (UTC): 2026-03-30T00:00:00Z → 2026-03-31T00:00:00Z (arxiv_announce, expanded=0)
Selected papers
| arXiv ID | Title / Links | Categories | Score | Why | Tags |
|---|---|---|---|---|---|
2603.28013 | Kill-Chain Canaries: Stage-Level Tracking of Prompt Injection Across Attack Surfaces and Model Safety Tiers | cs.CR, cs.AI, cs.LG | 95 | Stage-level prompt-injection tracking w/ canaries across agents; actionable defense insights | prompt-injection, agent-security, evaluation, kill-chain, canary-tokens, red-teaming |
2603.28063 | Reward Hacking as Equilibrium under Finite Evaluation | cs.AI, cs.GT | 95 | Formal result: reward hacking emerges under finite evaluation; computable distortion index. | reward-hacking, alignment-theory, evaluation, principal-agent, RLHF, DPO |
2603.28650 | Information-Theoretic Limits of Safety Verification for Self-Improving Systems | cs.LG, cs.AI, stat.ML | 95 | Strong theoretical impossibility results for safety gates in self-improving systems | ai-safety, self-improvement, verification, risk-bounds, theory, distribution-shift |
2603.28166 | Evaluating Privilege Usage of Agents on Real-World Tools | cs.CR, cs.AI | 93 | GrantBox sandbox evaluates real-tool privilege usage; closer to real-world agent security | agents, tool-use, privilege, sandbox, security-eval, real-world-tools |
2603.28345 | Crossing the NL/PL Divide: Information Flow Analysis Across the NL/PL Boundary in LLM-Integrated Code | cs.SE, cs.AI | 92 | Bridges NL/PL boundary for info-flow/taint across LLM calls; key for LLM app security. | program-analysis, information-flow, LLM-security, prompting, taint-analysis, NL-PL-boundary |
2603.28407 | MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome | cs.AI, cs.CL | 90 | Deep research agent benchmark scoring process+outcome; multimodal, refreshable tasks | agents, evaluation, deep-research, multimodal, benchmarks, process-metrics |
2603.28204 | ERPO: Token-Level Entropy-Regulated Policy Optimization for Large Reasoning Models | cs.LG, cs.AI | 90 | Token-level RLVR/GRPO fix to prevent entropy collapse; targets reasoning quality | llm, rlvr, grpo, credit-assignment, reasoning, post-training |
2603.28054 | Who Wrote the Book? Detecting and Attributing LLM Ghostwriters | cs.CL | 90 | GhostWriteBench + robust OOD LLM authorship attribution; practical for misuse detection | authorship-attribution, misuse-detection, benchmark, OOD-robustness, fingerprinting, long-form-text |
2603.28551 | "What Did It Actually Do?": Understanding Risk Awareness and Traceability for Computer-Use Agents | cs.CR, cs.ET, cs.HC, cs.MA | 89 | Studies risk awareness + post-hoc auditability for computer-use agents; real incidents corpus. | agent-safety, computer-use-agents, auditability, traceability, HCI, security |
2603.28569 | CirrusBench: Evaluating LLM-based Agents Beyond Correctness in Real-World Cloud Service Environments | cs.LG, cs.AI, cs.IR, cs.PF | 88 | Real cloud-ticket agent benchmark; measures robustness and resolution efficiency beyond accuracy | agents, evaluation, real-world, customer-support, long-horizon, efficiency |
2603.27982 | CDH-Bench: A Commonsense-Driven Hallucination Benchmark for Evaluating Visual Fidelity in Vision-Language Models | cs.CV, cs.AI, cs.CL | 88 | New benchmark for commonsense-driven hallucination in VLMs via evidence conflicts | vlm, hallucination, evaluation, robustness, benchmarks, reliability |
2603.28376 | Marco DeepResearch: Unlocking Efficient Deep Research Agents via Verification-Centric Design | cs.CL, cs.AI | 87 | Verification-centric deep research agent design across data synthesis, trajectories, test-time. | agents, verification, deep-research, tool-use, long-horizon, reliability |
2603.28618 | Seeing with You: Perception-Reasoning Coevolution for Multimodal Reasoning | cs.AI | 86 | RLVR splits observer/solver to improve visual evidence extraction + reasoning | multimodal, rlvr, credit-assignment, evidence, reasoning, mllm |
2603.28304 | The Necessity of Setting Temperature in LLM-as-a-Judge | cs.CL | 86 | Shows temperature materially affects LLM-as-judge reliability; important eval hygiene | LLM-judge, evaluation, temperature, reliability, methodology, meta-eval |
2603.27918 | Adversarial Attacks on Multimodal Large Language Models: A Comprehensive Survey | cs.CR, cs.AI | 84 | Comprehensive survey of adversarial threats to MLLMs with taxonomy and vulnerability analysis | multimodal, adversarial-attacks, survey, security, threat-models, jailbreaks |
2603.28476 | With a Little Help From My Friends: Collective Manipulation in Risk-Controlling Recommender Systems | cs.IR, cs.LG, cs.SI | 84 | Shows coordinated user manipulation can break risk-controlling recommenders with safety guarantees. | recommenders, adversarial, safety-guarantees, conformal-risk-control, manipulation |
2603.28430 | IsoQuant: Hardware-Aligned SO(4) Isoclinic Rotations for LLM KV Cache Compression | cs.LG, cs.CL | 84 | Hardware-aligned KV-cache compression via SO(4) rotations; practical LLM efficiency | llm, inference, kv-cache, compression, quantization, systems |
2603.28135 | CoT2-Meta: Budgeted Metacognitive Control for Test-Time Reasoning | cs.AI | 84 | Training-free metacognitive control for budgeted reasoning incl. abstain/repair/prune | test-time-reasoning, inference-time-control, compute-budget, abstention, search, chain-of-thought |
2603.28378 | Membership Inference Attacks against Large Audio Language Models | cs.SD, cs.AI | 83 | First MIA study for audio LMs; shows confounds and proposes distribution-matched evaluation | privacy, membership-inference, audio, evaluation, distribution-shift |
2603.28005 | Rethinking Atomic Decomposition for LLM Judges: A Prompt-Controlled Study of Reference-Grounded QA Evaluation | cs.CL | 82 | Careful prompt-controlled study of atomic decomposition for LLM judges; eval reliability focus | LLM-judges, evaluation, grounded-QA, rubrics, factuality, methodology |
2603.28092 | InkDrop: Invisible Backdoor Attacks Against Dataset Condensation | cs.LG | 82 | Stealthy backdoor attacks on dataset condensation; highlights a supply-chain vulnerability. | backdoors, data-poisoning, dataset-condensation, ML-security, stealth |
2603.28610 | ResAdapt: Adaptive Resolution for Efficient Multimodal Reasoning | cs.CV, cs.AI, cs.CL | 82 | Adaptive input resolution to trade visual tokens vs context; bandit-trained allocator | mllm, efficiency, long-context, vision-tokens, bandits, inference |
2603.28696 | AdaptToken: Entropy-based Adaptive Token Selection for MLLM Long Video Understanding | cs.CV, cs.AI | 82 | Uses model uncertainty/entropy to allocate long-video token budget; scalable MLLM control | MLLM, long-context, video-understanding, token-selection, uncertainty, efficiency |
2603.28662 | AMIGO: Agentic Multi-Image Grounding Oracle Benchmark | cs.LG, cs.AI | 81 | Long-horizon multi-image grounding benchmark with strict protocol; probes uncertainty tracking | multimodal-agents, benchmark, interactive-eval, uncertainty, grounding |
2603.28605 | Unsafe2Safe: Controllable Image Anonymization for Downstream Utility | cs.CV, cs.CY, cs.LG | 81 | Automated anonymization via VLM+LLM-guided diffusion edits; privacy protection for training data. | privacy, anonymization, diffusion-editing, dataset-safety, VLM, PII |
2603.28301 | LIBERO-Para: A Diagnostic Benchmark and Metrics for Paraphrase Robustness in VLA Models | cs.LG | 80 | Benchmark for paraphrase robustness in VLA robots; large drops under synonyms reveal brittleness. | evaluation, robustness, paraphrase, VLA, robotics, instruction-following |
2603.28488 | Courtroom-Style Multi-Agent Debate with Progressive RAG and Role-Switching for Controversial Claim Verification | cs.CL, cs.AI, cs.MA | 79 | Structured multi-agent debate + progressive RAG for claim verification; targets hallucinations | claim-verification, RAG, multi-agent, debate, hallucinations, calibration |
2603.28038 | Beyond the Answer: Decoding the Behavior of LLMs as Scientific Reasoners | cs.AI, cs.LG | 79 | Prompt-optimization study probes brittleness/transfer of scientific reasoning behaviors | reasoning, prompting, interpretability, robustness, scientific-tasks, behavior-analysis |
2603.28730 | SOLE-R1: Video-Language Reasoning as the Sole Reward for On-Robot Reinforcement Learning | cs.RO, cs.CL, cs.CV | 78 | Video-language reasoning model as sole RL reward; addresses reward exploitation under shift. | robotics, RL, VLM, reward-modeling, distribution-shift, agentic-RL |
2603.28622 | Trust-Aware Routing for Distributed Generative AI Inference at the Edge | cs.DC, cs.AI, cs.NI | 78 | Trust-aware routing for distributed generative inference; risk-bounded path selection | agent-systems, distributed-inference, trust, robustness, security, edge |
AI Paper Insight Brief
2026-03-31
1) Executive takeaways (read this first)
- Agent security evaluation is shifting from “did it fail?” to “where did it fail?” Stage-level prompt-injection tracking (EXPOSED→PERSISTED→RELAYED→EXECUTED) shows exposure can be universal while downstream execution differs sharply by model and pipeline stage—changing what “robust” architectures look like.
- Real-tool privilege misuse is currently the norm, not the edge case. In a sandbox with real MCP servers/tools, prompt-injection privilege hijacks achieve very high ASR (avg 90.55% ReAct, 79.05% Plan-and-Execute), suggesting tool authorization and isolation are the immediate bottleneck.
- Multimodal reliability failures are increasingly “prior overrides evidence,” not just fabrication. CDH-Bench finds VLMs often revert to commonsense priors when images contain atypical evidence (mean CFAD 16.39% QA, 25.20% MC), with counting anomalies especially hard.
- Test-time compute is becoming controllable and auditable. CoT2-Meta shows training-free meta-control (expand/prune/repair/abstain) can improve accuracy under fixed budgets and materially improve calibration (reported ECE 0.035).
- RL for multimodal reasoning is moving toward better credit assignment. PRCO’s Observer/Solver coevolution improves average accuracy by ~+7 points and reduces perception errors (e.g., WeMath perception errors −39.2%), directly targeting perception as the bottleneck.
- Safety gating theory is hardening: classifier gates may be structurally insufficient for long-lived self-improvement. An information-theoretic result shows classifier-style gates can’t generally keep cumulative risk finite while allowing unbounded beneficial updates under common schedules; verification-style gates can escape (δ=0 with TPR>0 demonstrated on GPT-2 LoRA).
2) Key themes (clusters)
Theme: Prompt injection & privilege misuse in agent pipelines
- Why it matters: Tool-using agents fail in multi-step ways; outcome-only ASR hides whether defenses sanitize memory/relay or merely refuse at execution. Real-world privilege misuse suggests current “agent safety” is not deployment-ready.
- Representative papers:
  - Kill-Chain Canaries: Stage-Level Tracking of Prompt Injection Across Attack Surfaces and Model Safety Tiers
  - Evaluating Privilege Usage of Agents on Real-World Tools
  - Crossing the NL/PL Divide: Information Flow Analysis Across the NL/PL Boundary in LLM-Integrated Code
  - Adversarial Attacks on Multimodal Large Language Models: A Comprehensive Survey
- Common approach:
- Instrument pipelines to localize propagation (stage-level canaries; taint-style placeholder flow taxonomies).
- Evaluate across multiple attack surfaces (memory/tool/propagation/permission escalation) rather than a single benchmark surface.
- Treat “sanitizing relays” and “execution refusals” as distinct defense loci with different composability.
- Open questions / failure modes:
- Surface mismatch: defenses can look effective on one surface yet fail catastrophically on another (including reported 100% ASR cells under some “defense” settings).
- How to standardize logging/process access for closed systems where process traces are unavailable or incomplete.
- Whether stage-level decontamination (e.g., memory write filtering) is robust to more obfuscated payloads than explicit canaries.
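The stage-level idea can be sketched as a small tracking harness. This is a hedged illustration, not the paper's implementation: the `trace` keys, the canary format, and substring matching are all assumptions.

```python
from enum import IntEnum

class Stage(IntEnum):
    NONE = 0
    EXPOSED = 1    # canary appeared in content the agent read
    PERSISTED = 2  # canary was written into agent memory
    RELAYED = 3    # canary was forwarded into another model/tool call
    EXECUTED = 4   # canary reached an executed tool's arguments

def deepest_stage(canary: str, trace: dict) -> Stage:
    """trace maps a stage key to the list of text artifacts captured there;
    return the deepest kill-chain stage the canary reached."""
    order = [("executed_args", Stage.EXECUTED),
             ("relayed_msgs", Stage.RELAYED),
             ("memory_writes", Stage.PERSISTED),
             ("inputs_read", Stage.EXPOSED)]
    for key, stage in order:
        if any(canary in text for text in trace.get(key, [])):
            return stage
    return Stage.NONE
```

Outcome-only ASR corresponds to checking `EXECUTED` alone; the per-stage view is what separates "sanitizing relay" defenses from "refuse at execution" defenses.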
Theme: Multimodal hallucination as “prior-driven normalization”
- Why it matters: In high-stakes perception (medical/inspection/forensics), the dangerous failure is confidently reporting the typical case instead of the observed anomaly.
- Representative papers:
  - CDH-Bench: A Commonsense-Driven Hallucination Benchmark for Evaluating Visual Fidelity in Vision-Language Models
  - Seeing with You: Perception-Reasoning Coevolution for Multimodal Reasoning
- Common approach:
- Construct controlled conflicts between evidence and priors (paired counterfactual vs commonsense images).
- Use metrics that isolate “prior collapse” (CFAD, CCR, RPD) rather than generic accuracy.
- Improve grounding via training signals that explicitly reward better evidence extraction (Observer/Solver separation).
- Open questions / failure modes:
- Synthetic-image benchmarks may not fully represent real anomaly distributions.
- Multiple-choice formats can amplify prior collapse (reported CFAD increase and CF-Acc drop).
- Caption-based intermediate evidence (PRCO) may be lossy for fine-grained spatial structure.
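One way to separate "prior collapse" from plain inaccuracy is to count the cases where the model outputs the commonsense-prior answer instead of the image-grounded one. This is a sketch of the metric family only; the exact CFAD/CCR definitions in the paper may differ.

```python
def prior_collapse_rate(preds, evidence_answers, prior_answers):
    """Fraction of counterfactual items where the model's prediction is
    wrong AND matches the commonsense-prior answer (i.e., the model
    'normalized' the anomaly instead of reading the evidence)."""
    collapsed = sum(
        1 for p, e, a in zip(preds, evidence_answers, prior_answers)
        if p != e and p == a
    )
    return collapsed / len(preds)
```

A generic accuracy metric would lump these errors together with random mistakes; isolating them is what makes the "prior overrides evidence" failure measurable.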
Theme: Evaluation reliability—judges, temperature, and decomposition myths
- Why it matters: If evaluation is unstable, progress claims (and safety claims) become non-reproducible; cost and prompt length can masquerade as “better judging.”
- Representative papers:
  - The Necessity of Setting Temperature in LLM-as-a-Judge
  - Rethinking Atomic Decomposition for LLM Judges: A Prompt-Controlled Study of Reference-Grounded QA Evaluation
  - MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome
- Common approach:
- Control prompt richness/structure and aggregate across prompt variants to avoid “prompt artifact” conclusions.
- Measure stability (consistency, parse errors) in addition to agreement.
- Add process-centric evaluation (not just final report) for long-horizon agents.
- Open questions / failure modes:
- High temperature can sharply degrade judge consistency and parsing (model-dependent), especially with CoT prompting.
- Atomic decomposition benefits may depend on decomposition source (self-decomposed vs external vs multi-stage), not tested broadly.
- Process evaluation may be blocked when systems don’t expose traces.
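A minimal consistency probe for a judge at a fixed temperature might look like the following sketch. The judge call itself is abstracted away: the input is precomputed repeated verdicts per item, and "consistency" here is modal agreement, one of several reasonable choices.

```python
from collections import Counter

def judge_consistency(verdicts_per_item):
    """verdicts_per_item: list of lists, each holding repeated judge
    verdicts for one item (same prompt, same temperature, fresh samples).
    Returns mean modal agreement: the average fraction of repeats that
    match each item's most common verdict."""
    total = 0.0
    for verdicts in verdicts_per_item:
        mode_count = Counter(verdicts).most_common(1)[0][1]
        total += mode_count / len(verdicts)
    return total / len(verdicts_per_item)
```

Reporting this number alongside the judge's temperature (per the methodology point above) makes "agreement with humans" claims reproducible.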
Theme: Test-time reasoning control & token-level credit assignment
- Why it matters: Frontier gains increasingly come from how compute is allocated (search/control/entropy preservation), not just more tokens—affecting cost, calibration, and robustness.
- Representative papers:
  - CoT2-Meta: Budgeted Metacognitive Control for Test-Time Reasoning
  - ERPO: Token-Level Entropy-Regulated Policy Optimization for Large Reasoning Models
  - Beyond the Answer: Decoding the Behavior of LLMs as Scientific Reasoners
- Common approach:
- Use process signals (step health, entropy, verifier confidence) to guide expansion/pruning/repair or to weight RL updates.
- Prevent entropy collapse by emphasizing high-uncertainty “decision pivots.”
- Analyze transfer: prompt-optimized heuristics can be model-specific and brittle across architectures.
- Open questions / failure modes:
- Controller/oracle misranking can cause premature pruning or wasted compute.
- Token-level RL improvements shown mainly on math; generalization beyond verifiable domains remains open.
- Prompt evolution can overfit to a model’s “local logic,” limiting interoperability.
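The shared entropy-as-signal idea can be illustrated by flagging the highest-entropy token positions as "decision pivots." This is a sketch only: ERPO's actual regularization operates inside the RL update, and the `top_frac` cutoff is an assumption.

```python
import math

def token_entropy(probs):
    """Shannon entropy (nats) of one token's next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def pivot_mask(prob_rows, top_frac=0.2):
    """Given per-position probability rows, flag the top_frac
    highest-entropy positions as decision pivots."""
    ents = [token_entropy(row) for row in prob_rows]
    k = max(1, int(len(ents) * top_frac))
    cutoff = sorted(ents, reverse=True)[k - 1]
    return [e >= cutoff for e in ents]
```

Upweighting (or preserving exploration at) exactly these positions, rather than uniformly across the sequence, is the credit-assignment move the theme describes.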
Theme: Long-horizon multimodal efficiency (video) via adaptive allocation
- Why it matters: Video reasoning is bottlenecked by visual token budgets; input-side and token-side adaptation can trade spatial fidelity for temporal coverage without changing backbones.
- Representative papers:
  - ResAdapt: Adaptive Resolution for Efficient Multimodal Reasoning
  - AdaptToken: Entropy-based Adaptive Token Selection for MLLM Long Video Understanding
  - AMIGO: Agentic Multi-Image Grounding Oracle Benchmark
- Common approach:
- Use lightweight front-ends or training-free signals (entropy, attention) to allocate compute across frames/groups.
- Add early stopping when certainty is high to cut runtime.
- Evaluate on long-video benchmarks and stress extremely long inputs (up to 10K frames in AdaptToken).
- Open questions / failure modes:
- Open-loop allocation (ResAdapt) can miss brief decisive cues and can’t revise after backbone inference.
- Group-wise inference remains a runtime bottleneck; early stopping is key but depends on reliable entropy.
- Interactive multi-image settings expose protocol-following failures (Skip violations, premature guesses).
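Uncertainty-proportional budget allocation across frame groups might look like the sketch below. This is not AdaptToken's algorithm; the floor allocation and proportional split are assumptions for illustration.

```python
def allocate_tokens(uncertainties, budget, min_per_group=1):
    """Split a visual-token budget across frame groups in proportion to
    per-group uncertainty scores; every group keeps a floor allocation."""
    n = len(uncertainties)
    base = min_per_group * n
    assert budget >= base, "budget must cover the per-group floor"
    total = sum(uncertainties) or 1.0
    alloc = [min_per_group + int((budget - base) * u / total)
             for u in uncertainties]
    # hand tokens lost to flooring back to the most uncertain group
    alloc[max(range(n), key=lambda i: uncertainties[i])] += budget - sum(alloc)
    return alloc
```

The open-loop caveat above applies directly: once the allocation is fixed, a brief decisive cue in a low-uncertainty group cannot be revisited.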
Theme: Privacy, forensics, and dataset integrity attacks/defenses
- Why it matters: As datasets and condensed artifacts are shared, privacy leakage and stealthy poisoning/backdoors become high-leverage; auditing must control for confounders.
- Representative papers:
  - InkDrop: Invisible Backdoor Attacks Against Dataset Condensation
  - Membership Inference Attacks against Large Audio Language Models
  - Who Wrote the Book? Detecting and Attributing LLM Ghostwriters
  - Unsafe2Safe: Controllable Image Anonymization for Downstream Utility
- Common approach:
- Emphasize stealth (perceptual constraints like LPIPS) and transferability for attacks (InkDrop).
- Use blind baselines to detect dataset artifacts that confound privacy audits (audio MIA).
- Build interpretable fingerprints (token transition rank/entropy) for long-form attribution (TRACE).
- Automate anonymization via VLM inspection + LLM instruction + diffusion editing; evaluate privacy–utility tradeoffs.
- Open questions / failure modes:
- Audio MIA can be dominated by dataset shift; “high AUC” may not imply memorization.
- Condensed datasets are small and inspectable—stealthy attacks raise the bar for defenses.
- Anonymization provides empirical privacy metrics but no formal guarantees; policy choices remain external.
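A blind-baseline separability check before running an MIA can be sketched as follows: if a model-free feature (e.g., clip duration; the feature choice is an assumption) already separates member from non-member sets, the audit is confounded by dataset shift.

```python
def auc(pos_scores, neg_scores):
    """Rank-based AUC: probability a positive outranks a negative."""
    wins = ties = 0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1
            elif p == n:
                ties += 1
    return (wins + 0.5 * ties) / (len(pos_scores) * len(neg_scores))

def blind_baseline_flag(member_feats, nonmember_feats, threshold=0.6):
    """Return (confounded?, blind AUC). High blind AUC means the
    member/non-member split is separable without querying the model,
    so MIA 'success' may reflect dataset shift, not memorization."""
    a = auc(member_feats, nonmember_feats)
    return max(a, 1 - a) > threshold, a
```

Only after this check passes (blind AUC near 0.5) does a high MIA AUC on the same split support a memorization claim.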
Theme: Alignment theory & safety verification limits
- Why it matters: Some failure modes (reward hacking, long-run safety gating) may be structural, not patchable by better prompts or more eval.
- Representative papers:
  - Reward Hacking as Equilibrium under Finite Evaluation
  - Information-Theoretic Limits of Safety Verification for Self-Improving Systems
- Common approach:
- Formalize evaluation as a projection of high-dimensional quality into limited signals; prove distortion is inevitable under optimization.
- Provide computable diagnostics (distortion index via reward-model gradients) and scaling arguments (coverage vanishes with tool combinatorics unless eval scales quadratically).
- Prove impossibility results for classifier gates under summability constraints; construct verification-based escapes.
- Open questions / failure modes:
- Empirical validation is largely pending for the reward-hacking equilibrium model.
- Verification approaches depend on tractable certificates (e.g., Lipschitz bounds) that may be hard to compute tightly at scale.
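The "evaluation as projection" framing in the first bullet can be written schematically. The symbols P, q, w and this particular distortion form are assumptions for illustration, not the paper's notation:

```latex
% True quality is a vector q(\pi) \in \mathbb{R}^d; the evaluator observes only
% the low-dimensional signal s(\pi) = P\,q(\pi), with P \in \mathbb{R}^{k \times d},\ k \ll d.
\pi^{\star} = \arg\max_{\pi}\ \langle w,\, P\,q(\pi) \rangle,
\qquad\text{leaving every component of } q(\pi) \text{ in } \ker(P) \text{ unconstrained.}
% One natural distortion measure is the unobserved component's magnitude:
D(\pi) = \bigl\lVert (I - P^{+}P)\, q(\pi) \bigr\rVert .
```

Under optimization pressure, probability mass concentrates on policies that inflate the observed projection while D grows, which is the "distortion is inevitable" claim in miniature.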
3) Technical synthesis
- Stage-level security instrumentation (kill-chain canaries) and NL/PL information-flow taxonomies both operationalize a shared idea: don’t treat LLM outputs as monolithic taint; model intermediate propagation and transformations.
- Prompt-injection robustness is surface-dependent: the same model can be safe on memory poisoning yet fail completely on tool poisoning/propagation, implying benchmarks must be multi-surface.
- Several works converge on uncertainty/entropy as a control signal: ERPO preserves entropy at critical tokens; AdaptToken uses response entropy for global token allocation and early stopping; CoT2-Meta fuses process and outcome confidence for control.
- Multimodal RLVR is splitting into better credit assignment (PRCO’s Observer/Solver) versus better inference-time control (CoT2-Meta); both aim to reduce “fluent wrongness” but at different lifecycle stages.
- Evaluation reliability is now treated as a first-class systems variable: temperature strongly affects judge consistency/error rates; prompt richness can confound “atomic decomposition” benefits.
- Benchmarks are moving beyond final correctness into process and efficiency metrics: MiroEval (process↔report alignment), CirrusBench (NEI/LJ/latency), AMIGO (protocol compliance + verified accuracy).
- Privacy auditing in audio shows a general lesson for safety evals: blind baselines can explain apparent model vulnerabilities; without controlling for dataset artifacts, conclusions can be wrong.
- Theoretical alignment papers suggest a looming mismatch: as agents gain tools, evaluation coverage shrinks (reward hacking amplification), while classifier-style safety gates may face long-run impossibility, pushing toward verification/certification.
4) Top 5 papers (with “why now”)
1) Kill-Chain Canaries: Stage-Level Tracking of Prompt Injection Across Attack Surfaces and Model Safety Tiers
- Introduces stage-level tracking (EXPOSED/PERSISTED/RELAYED/EXECUTED) that explains where defenses work, not just whether the final action happened.
- Shows exposure can be 100% while execution varies widely (e.g., GPT-4o-mini 53% ASR, GPT-5-mini 3%, Claude variants 0% in reported no-defense runs).
- Reveals extreme surface splits (e.g., DeepSeek 0% on memory_poison vs 100% on tool_poison/propagation in reported cells).
- Skepticism: modest per-cell sample sizes and synthetic explicit payloads; mechanism behind “summarization-stage stripping” not isolated.
2) Evaluating Privilege Usage of Agents on Real-World Tools
- Provides a real-tool sandbox (10 MCP servers, 122 privilege-sensitive tools) and auto-generated benign/malicious requests.
- Reports very high privilege-hijack ASR averages (90.55% ReAct, 79.05% Plan-and-Execute) across four LLMs—strong evidence the problem is immediate.
- Highlights that planning helps but doesn’t solve privilege misuse.
- Skepticism: limited to 10 servers and four models; defenses not evaluated yet.
3) MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome
- Refreshable, user-grounded benchmark with process-centric evaluation and multimodal tasks.
- Finds process quality strongly predicts outcome (reported r = 0.88), making “trace quality” a measurable target.
- Shows multimodal tasks cause consistent drops (3–10 points) and rankings shift across synthesis/factuality/process.
- Skepticism: process evaluation requires access to traces; absolute scores depend on LLM judges even if rankings are robust.
4) CoT2-Meta: Budgeted Metacognitive Control for Test-Time Reasoning
- Training-free controller that allocates inference budget across expand/prune/repair/stop/abstain using fused process+outcome signals.
- Reports consistent gains across 15 benchmarks under matched budgets and improved calibration (reported ECE 0.035).
- Provides interpretable controller traces and ablations tying gains to components.
- Skepticism: depends on oracle/process-evaluator quality; misranking can cause premature pruning.
5) Seeing with You: Perception-Reasoning Coevolution for Multimodal Reasoning
- Addresses RLVR’s blurred credit assignment by alternating Observer (evidence caption) and Solver (answer), with role-specific rewards and leakage suppression.
- Reports ~+7 point average accuracy gains and large perception-error reductions (e.g., −39.2% on WeMath perception errors).
- Demonstrates gains across multiple backbones including Qwen3-VL-8B-Instruct.
- Skepticism: intermediate captions can be lossy; evaluated on concise verifiable-answer benchmarks, not open-ended generation.
5) Practical next steps
- For agent security evals, replace single ASR with stage-level metrics (exposed/persisted/relayed/executed) and run across multiple injection surfaces (memory, tool outputs, propagation, permission escalation).
- In tool-using systems, implement privilege minimization + per-tool allowlists and measure misuse using a GrantBox-like harness; compare ReAct vs Plan-and-Execute as a baseline mitigation.
- Add NL/PL boundary flow labeling (placeholder preservation/modality taxonomy) to CI for LLM-integrated code; use it to prioritize which callsites need strict sanitization or structured output constraints.
- For multimodal models, add a CDH-style paired evaluation (evidence vs prior conflict) and track CFAD/CCR to detect “normalization” failures that standard VQA misses.
- When using LLM-as-a-judge, set temperature intentionally (very low T for consistency/parse stability) and report judge temperature + repeated-seed variance as part of benchmark methodology.
- For test-time reasoning, prototype budgeted meta-control (prune/repair/abstain) and measure not just accuracy but ECE/selective prediction under fixed compute.
- For multimodal RLVR, experiment with role-separated credit assignment (Observer/Solver) and explicitly measure perception vs reasoning error categories to ensure perception improves.
- For privacy audits (especially audio), run blind baseline separability checks before claiming memorization; only then run MIAs on distribution-matched subsets.
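For the first bullet, per-stage metrics over a batch of injection trials can be aggregated like this (a sketch; the stage labels are assumed to come from whatever instrumentation your harness produces):

```python
def stage_rates(stage_labels):
    """Per-stage reach rates for a batch of injection trials.
    stage_labels: list of strings drawn from
    {'none', 'exposed', 'persisted', 'relayed', 'executed'}.
    A trial that reached a stage also reached every earlier one."""
    order = ["exposed", "persisted", "relayed", "executed"]
    depth = {s: i + 1 for i, s in enumerate(order)}
    depths = [depth.get(s, 0) for s in stage_labels]
    n = len(depths)
    return {s: sum(d >= i + 1 for d in depths) / n
            for i, s in enumerate(order)}
```

Reporting all four rates (instead of the single executed-stage ASR) is what surfaces the "exposure universal, execution model-dependent" pattern discussed above.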
Generated from per-paper analyses; no external browsing.
