Daily AI Paper Report (2026-03-31)
Published:
Chinese version: [中文]
Run stats
- Candidates: 223
- Selected: 30
- Deepread completed: 30
- Window (UTC): 2026-03-30T00:00:00Z → 2026-03-31T00:00:00Z (arxiv_announce, expanded=0)
Selected papers
| arXiv ID | Title / Links | Categories | Score | Why | Tags |
|---|---|---|---|---|---|
2603.28013 | Kill-Chain Canaries: Stage-Level Tracking of Prompt Injection Across Attack Surfaces and Model Safety Tiers | cs.CR, cs.AI, cs.LG | 95 | Stage-level prompt-injection tracking w/ canaries across agents; actionable defense insights | prompt-injection, agent-security, evaluation, kill-chain, canary-tokens, red-teaming |
2603.28063 | Reward Hacking as Equilibrium under Finite Evaluation | cs.AI, cs.GT | 95 | Formal result: reward hacking emerges under finite evaluation; computable distortion index. | reward-hacking, alignment-theory, evaluation, principal-agent, RLHF, DPO |
2603.28650 | Information-Theoretic Limits of Safety Verification for Self-Improving Systems | cs.LG, cs.AI, stat.ML | 95 | Strong theoretical impossibility results for safety gates in self-improving systems | ai-safety, self-improvement, verification, risk-bounds, theory, distribution-shift |
2603.28166 | Evaluating Privilege Usage of Agents on Real-World Tools | cs.CR, cs.AI | 93 | GrantBox sandbox evaluates real-tool privilege usage; closer to real-world agent security | agents, tool-use, privilege, sandbox, security-eval, real-world-tools |
2603.28345 | Crossing the NL/PL Divide: Information Flow Analysis Across the NL/PL Boundary in LLM-Integrated Code | cs.SE, cs.AI | 92 | Bridges NL/PL boundary for info-flow/taint across LLM calls; key for LLM app security. | program-analysis, information-flow, LLM-security, prompting, taint-analysis, NL-PL-boundary |
2603.28407 | MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome | cs.AI, cs.CL | 90 | Deep research agent benchmark scoring process+outcome; multimodal, refreshable tasks | agents, evaluation, deep-research, multimodal, benchmarks, process-metrics |
2603.28204 | ERPO: Token-Level Entropy-Regulated Policy Optimization for Large Reasoning Models | cs.LG, cs.AI | 90 | Token-level RLVR/GRPO fix to prevent entropy collapse; targets reasoning quality | llm, rlvr, grpo, credit-assignment, reasoning, post-training |
2603.28054 | Who Wrote the Book? Detecting and Attributing LLM Ghostwriters | cs.CL | 90 | GhostWriteBench + robust OOD LLM authorship attribution; practical for misuse detection | authorship-attribution, misuse-detection, benchmark, OOD-robustness, fingerprinting, long-form-text |
2603.28551 | "What Did It Actually Do?": Understanding Risk Awareness and Traceability for Computer-Use Agents | cs.CR, cs.ET, cs.HC, cs.MA | 89 | Studies risk awareness + post-hoc auditability for computer-use agents; real incidents corpus. | agent-safety, computer-use-agents, auditability, traceability, HCI, security |
2603.28569 | CirrusBench: Evaluating LLM-based Agents Beyond Correctness in Real-World Cloud Service Environments | cs.LG, cs.AI, cs.IR, cs.PF | 88 | Real cloud-ticket agent benchmark; measures robustness and resolution efficiency beyond accuracy | agents, evaluation, real-world, customer-support, long-horizon, efficiency |
2603.27982 | CDH-Bench: A Commonsense-Driven Hallucination Benchmark for Evaluating Visual Fidelity in Vision-Language Models | cs.CV, cs.AI, cs.CL | 88 | New benchmark for commonsense-driven hallucination in VLMs via evidence conflicts | vlm, hallucination, evaluation, robustness, benchmarks, reliability |
2603.28376 | Marco DeepResearch: Unlocking Efficient Deep Research Agents via Verification-Centric Design | cs.CL, cs.AI | 87 | Verification-centric deep research agent design across data synthesis, trajectories, test-time. | agents, verification, deep-research, tool-use, long-horizon, reliability |
2603.28618 | Seeing with You: Perception-Reasoning Coevolution for Multimodal Reasoning | cs.AI | 86 | RLVR splits observer/solver to improve visual evidence extraction + reasoning | multimodal, rlvr, credit-assignment, evidence, reasoning, mllm |
2603.28304 | The Necessity of Setting Temperature in LLM-as-a-Judge | cs.CL | 86 | Shows temperature materially affects LLM-as-judge reliability; important eval hygiene | LLM-judge, evaluation, temperature, reliability, methodology, meta-eval |
2603.27918 | Adversarial Attacks on Multimodal Large Language Models: A Comprehensive Survey | cs.CR, cs.AI | 84 | Comprehensive survey of adversarial threats to MLLMs with taxonomy and vulnerability analysis | multimodal, adversarial-attacks, survey, security, threat-models, jailbreaks |
2603.28476 | With a Little Help From My Friends: Collective Manipulation in Risk-Controlling Recommender Systems | cs.IR, cs.LG, cs.SI | 84 | Shows coordinated user manipulation can break risk-controlling recommenders with safety guarantees. | recommenders, adversarial, safety-guarantees, conformal-risk-control, manipulation |
2603.28430 | IsoQuant: Hardware-Aligned SO(4) Isoclinic Rotations for LLM KV Cache Compression | cs.LG, cs.CL | 84 | Hardware-aligned KV-cache compression via SO(4) rotations; practical LLM efficiency | llm, inference, kv-cache, compression, quantization, systems |
2603.28135 | CoT2-Meta: Budgeted Metacognitive Control for Test-Time Reasoning | cs.AI | 84 | Training-free metacognitive control for budgeted reasoning incl. abstain/repair/prune | test-time-reasoning, inference-time-control, compute-budget, abstention, search, chain-of-thought |
2603.28378 | Membership Inference Attacks against Large Audio Language Models | cs.SD, cs.AI | 83 | First MIA study for audio LMs; shows confounds and proposes distribution-matched evaluation | privacy, membership-inference, audio, evaluation, distribution-shift |
2603.28005 | Rethinking Atomic Decomposition for LLM Judges: A Prompt-Controlled Study of Reference-Grounded QA Evaluation | cs.CL | 82 | Careful prompt-controlled study of atomic decomposition for LLM judges; eval reliability focus | LLM-judges, evaluation, grounded-QA, rubrics, factuality, methodology |
2603.28092 | InkDrop: Invisible Backdoor Attacks Against Dataset Condensation | cs.LG | 82 | Stealthy backdoor attacks on dataset condensation; highlights a supply-chain vulnerability. | backdoors, data-poisoning, dataset-condensation, ML-security, stealth |
2603.28610 | ResAdapt: Adaptive Resolution for Efficient Multimodal Reasoning | cs.CV, cs.AI, cs.CL | 82 | Adaptive input resolution to trade visual tokens vs context; bandit-trained allocator | mllm, efficiency, long-context, vision-tokens, bandits, inference |
2603.28696 | AdaptToken: Entropy-based Adaptive Token Selection for MLLM Long Video Understanding | cs.CV, cs.AI | 82 | Uses model uncertainty/entropy to allocate long-video token budget; scalable MLLM control | MLLM, long-context, video-understanding, token-selection, uncertainty, efficiency |
2603.28662 | AMIGO: Agentic Multi-Image Grounding Oracle Benchmark | cs.LG, cs.AI | 81 | Long-horizon multi-image grounding benchmark with strict protocol; probes uncertainty tracking | multimodal-agents, benchmark, interactive-eval, uncertainty, grounding |
2603.28605 | Unsafe2Safe: Controllable Image Anonymization for Downstream Utility | cs.CV, cs.CY, cs.LG | 81 | Automated anonymization via VLM+LLM-guided diffusion edits; privacy protection for training data. | privacy, anonymization, diffusion-editing, dataset-safety, VLM, PII |
2603.28301 | LIBERO-Para: A Diagnostic Benchmark and Metrics for Paraphrase Robustness in VLA Models | cs.LG | 80 | Benchmark for paraphrase robustness in VLA robots; large drops under synonyms reveal brittleness. | evaluation, robustness, paraphrase, VLA, robotics, instruction-following |
2603.28488 | Courtroom-Style Multi-Agent Debate with Progressive RAG and Role-Switching for Controversial Claim Verification | cs.CL, cs.AI, cs.MA | 79 | Structured multi-agent debate + progressive RAG for claim verification; targets hallucinations | claim-verification, RAG, multi-agent, debate, hallucinations, calibration |
2603.28038 | Beyond the Answer: Decoding the Behavior of LLMs as Scientific Reasoners | cs.AI, cs.LG | 79 | Prompt-optimization study probes brittleness/transfer of scientific reasoning behaviors | reasoning, prompting, interpretability, robustness, scientific-tasks, behavior-analysis |
2603.28730 | SOLE-R1: Video-Language Reasoning as the Sole Reward for On-Robot Reinforcement Learning | cs.RO, cs.CL, cs.CV | 78 | Video-language reasoning model as sole RL reward; addresses reward exploitation under shift. | robotics, RL, VLM, reward-modeling, distribution-shift, agentic-RL |
2603.28622 | Trust-Aware Routing for Distributed Generative AI Inference at the Edge | cs.DC, cs.AI, cs.NI | 78 | Trust-aware routing for distributed generative inference; risk-bounded path selection | agent-systems, distributed-inference, trust, robustness, security, edge |
AI Paper Insight Brief
2026-03-31
1) Executive takeaways (read this first)
- Agent security evaluation is shifting from “did it fail?” to “where did it fail?” Stage-level prompt-injection tracking (EXPOSED→PERSISTED→RELAYED→EXECUTED) shows exposure can be universal while downstream execution differs sharply by model and pipeline stage—changing what “robust” architectures look like.
- Real-tool privilege misuse is currently the norm, not the edge case. In a sandbox with real MCP servers/tools, prompt-injection privilege hijacks achieve very high ASR (avg 90.55% ReAct, 79.05% Plan-and-Execute), suggesting tool authorization and isolation are the immediate bottleneck.
- Multimodal reliability failures are increasingly “prior overrides evidence,” not just fabrication. CDH-Bench finds VLMs often revert to commonsense priors when images contain atypical evidence (mean CFAD 16.39% QA, 25.20% MC), with counting anomalies especially hard.
- Test-time compute is becoming controllable and auditable. CoT2-Meta shows training-free meta-control (expand/prune/repair/abstain) can improve accuracy under fixed budgets and materially improve calibration (reported ECE 0.035).
- RL for multimodal reasoning is moving toward better credit assignment. PRCO’s Observer/Solver coevolution improves average accuracy by ~+7 points and reduces perception errors (e.g., WeMath perception errors −39.2%), directly targeting perception as the bottleneck.
- Safety gating theory is hardening: classifier gates may be structurally insufficient for long-lived self-improvement. An information-theoretic result shows classifier-style gates can’t generally keep cumulative risk finite while allowing unbounded beneficial updates under common schedules; verification-style gates can escape (δ=0 with TPR>0 demonstrated on GPT-2 LoRA).
2) Key themes (clusters)
Theme: Prompt injection & privilege misuse in agent pipelines
- Why it matters: Tool-using agents fail in multi-step ways; outcome-only ASR hides whether defenses sanitize memory/relay or merely refuse at execution. Real-world privilege misuse suggests current “agent safety” is not deployment-ready.
- Representative papers:
  - Kill-Chain Canaries: Stage-Level Tracking of Prompt Injection Across Attack Surfaces and Model Safety Tiers
  - Evaluating Privilege Usage of Agents on Real-World Tools
  - Crossing the NL/PL Divide: Information Flow Analysis Across the NL/PL Boundary in LLM-Integrated Code
  - Adversarial Attacks on Multimodal Large Language Models: A Comprehensive Survey
- Common approach:
- Instrument pipelines to localize propagation (stage-level canaries; taint-style placeholder flow taxonomies).
- Evaluate across multiple attack surfaces (memory/tool/propagation/permission escalation) rather than a single benchmark surface.
- Treat “sanitizing relays” and “execution refusals” as distinct defense loci with different composability.
- Open questions / failure modes:
- Surface mismatch: defenses can look effective on one surface yet fail catastrophically on another (including reported 100% ASR cells under some “defense” settings).
- How to standardize logging/process access for closed systems where process traces are unavailable or incomplete.
- Whether stage-level decontamination (e.g., memory write filtering) is robust to more obfuscated payloads than explicit canaries.
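The stage-level idea can be sketched as a small tracking harness. This is a hedged illustration, not the paper's implementation: the `trace` keys, the canary format, and substring matching are all assumptions.

```python
from enum import IntEnum

class Stage(IntEnum):
    NONE = 0
    EXPOSED = 1    # canary appeared in content the agent read
    PERSISTED = 2  # canary was written into agent memory
    RELAYED = 3    # canary was forwarded into another model/tool call
    EXECUTED = 4   # canary reached an executed tool's arguments

def deepest_stage(canary: str, trace: dict) -> Stage:
    """trace maps a stage key to the list of text artifacts captured there;
    return the deepest kill-chain stage the canary reached."""
    order = [("executed_args", Stage.EXECUTED),
             ("relayed_msgs", Stage.RELAYED),
             ("memory_writes", Stage.PERSISTED),
             ("inputs_read", Stage.EXPOSED)]
    for key, stage in order:
        if any(canary in text for text in trace.get(key, [])):
            return stage
    return Stage.NONE
```

Outcome-only ASR corresponds to checking `EXECUTED` alone; the per-stage view is what separates "sanitizing relay" defenses from "refuse at execution" defenses.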
Theme: Multimodal hallucination as “prior-driven normalization”
- Why it matters: In high-stakes perception (medical/inspection/forensics), the dangerous failure is confidently reporting the typical case instead of the observed anomaly.
- Representative papers:
  - CDH-Bench: A Commonsense-Driven Hallucination Benchmark for Evaluating Visual Fidelity in Vision-Language Models
  - Seeing with You: Perception-Reasoning Coevolution for Multimodal Reasoning
- Common approach:
- Construct controlled conflicts between evidence and priors (paired counterfactual vs commonsense images).
- Use metrics that isolate “prior collapse” (CFAD, CCR, RPD) rather than generic accuracy.
- Improve grounding via training signals that explicitly reward better evidence extraction (Observer/Solver separation).
- Open questions / failure modes:
- Synthetic-image benchmarks may not fully represent real anomaly distributions.
- Multiple-choice formats can amplify prior collapse (reported CFAD increase and CF-Acc drop).
- Caption-based intermediate evidence (PRCO) may be lossy for fine-grained spatial structure.
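One way to separate "prior collapse" from plain inaccuracy is to count the cases where the model outputs the commonsense-prior answer instead of the image-grounded one. This is a sketch of the metric family only; the exact CFAD/CCR definitions in the paper may differ.

```python
def prior_collapse_rate(preds, evidence_answers, prior_answers):
    """Fraction of counterfactual items where the model's prediction is
    wrong AND matches the commonsense-prior answer (i.e., the model
    'normalized' the anomaly instead of reading the evidence)."""
    collapsed = sum(
        1 for p, e, a in zip(preds, evidence_answers, prior_answers)
        if p != e and p == a
    )
    return collapsed / len(preds)
```

A generic accuracy metric would lump these errors together with random mistakes; isolating them is what makes the "prior overrides evidence" failure measurable.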
Theme: Evaluation reliability—judges, temperature, and decomposition myths
- Why it matters: If evaluation is unstable, progress claims (and safety claims) become non-reproducible; cost and prompt length can masquerade as “better judging.”
- Representative papers:
  - The Necessity of Setting Temperature in LLM-as-a-Judge
  - Rethinking Atomic Decomposition for LLM Judges: A Prompt-Controlled Study of Reference-Grounded QA Evaluation
  - MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome
- Common approach:
- Control prompt richness/structure and aggregate across prompt variants to avoid “prompt artifact” conclusions.
- Measure stability (consistency, parse errors) in addition to agreement.
- Add process-centric evaluation (not just final report) for long-horizon agents.
- Open questions / failure modes:
- High temperature can sharply degrade judge consistency and parsing (model-dependent), especially with CoT prompting.
- Atomic decomposition benefits may depend on decomposition source (self-decomposed vs external vs multi-stage), not tested broadly.
- Process evaluation may be blocked when systems don’t expose traces.
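A minimal consistency probe for a judge at a fixed temperature might look like the following sketch. The judge call itself is abstracted away: the input is precomputed repeated verdicts per item, and "consistency" here is modal agreement, one of several reasonable choices.

```python
from collections import Counter

def judge_consistency(verdicts_per_item):
    """verdicts_per_item: list of lists, each holding repeated judge
    verdicts for one item (same prompt, same temperature, fresh samples).
    Returns mean modal agreement: the average fraction of repeats that
    match each item's most common verdict."""
    total = 0.0
    for verdicts in verdicts_per_item:
        mode_count = Counter(verdicts).most_common(1)[0][1]
        total += mode_count / len(verdicts)
    return total / len(verdicts_per_item)
```

Reporting this number alongside the judge's temperature (per the methodology point above) makes "agreement with humans" claims reproducible.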
Theme: Test-time reasoning control & token-level credit assignment
- Why it matters: Frontier gains increasingly come from how compute is allocated (search/control/entropy preservation), not just more tokens—affecting cost, calibration, and robustness.
- Representative papers:
  - CoT2-Meta: Budgeted Metacognitive Control for Test-Time Reasoning
  - ERPO: Token-Level Entropy-Regulated Policy Optimization for Large Reasoning Models
  - Beyond the Answer: Decoding the Behavior of LLMs as Scientific Reasoners
- Common approach:
- Use process signals (step health, entropy, verifier confidence) to guide expansion/pruning/repair or to weight RL updates.
- Prevent entropy collapse by emphasizing high-uncertainty “decision pivots.”
- Analyze transfer: prompt-optimized heuristics can be model-specific and brittle across architectures.
- Open questions / failure modes:
- Controller/oracle misranking can cause premature pruning or wasted compute.
- Token-level RL improvements shown mainly on math; generalization beyond verifiable domains remains open.
- Prompt evolution can overfit to a model’s “local logic,” limiting interoperability.
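The shared entropy-as-signal idea can be illustrated by flagging the highest-entropy token positions as "decision pivots." This is a sketch only: ERPO's actual regularization operates inside the RL update, and the `top_frac` cutoff is an assumption.

```python
import math

def token_entropy(probs):
    """Shannon entropy (nats) of one token's next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def pivot_mask(prob_rows, top_frac=0.2):
    """Given per-position probability rows, flag the top_frac
    highest-entropy positions as decision pivots."""
    ents = [token_entropy(row) for row in prob_rows]
    k = max(1, int(len(ents) * top_frac))
    cutoff = sorted(ents, reverse=True)[k - 1]
    return [e >= cutoff for e in ents]
```

Upweighting (or preserving exploration at) exactly these positions, rather than uniformly across the sequence, is the credit-assignment move the theme describes.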
Theme: Long-horizon multimodal efficiency (video) via adaptive allocation
- Why it matters: Video reasoning is bottlenecked by visual token budgets; input-side and token-side adaptation can trade spatial fidelity for temporal coverage without changing backbones.
- Representative papers:
  - ResAdapt: Adaptive Resolution for Efficient Multimodal Reasoning
  - AdaptToken: Entropy-based Adaptive Token Selection for MLLM Long Video Understanding
  - AMIGO: Agentic Multi-Image Grounding Oracle Benchmark
- Common approach:
- Use lightweight front-ends or training-free signals (entropy, attention) to allocate compute across frames/groups.
- Add early stopping when certainty is high to cut runtime.
- Evaluate on long-video benchmarks and stress extremely long inputs (up to 10K frames in AdaptToken).
- Open questions / failure modes:
- Open-loop allocation (ResAdapt) can miss brief decisive cues and can’t revise after backbone inference.
- Group-wise inference remains a runtime bottleneck; early stopping is key but depends on reliable entropy.
- Interactive multi-image settings expose protocol-following failures (Skip violations, premature guesses).
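Uncertainty-proportional budget allocation across frame groups might look like the sketch below. This is not AdaptToken's algorithm; the floor allocation and proportional split are assumptions for illustration.

```python
def allocate_tokens(uncertainties, budget, min_per_group=1):
    """Split a visual-token budget across frame groups in proportion to
    per-group uncertainty scores; every group keeps a floor allocation."""
    n = len(uncertainties)
    base = min_per_group * n
    assert budget >= base, "budget must cover the per-group floor"
    total = sum(uncertainties) or 1.0
    alloc = [min_per_group + int((budget - base) * u / total)
             for u in uncertainties]
    # hand tokens lost to flooring back to the most uncertain group
    alloc[max(range(n), key=lambda i: uncertainties[i])] += budget - sum(alloc)
    return alloc
```

The open-loop caveat above applies directly: once the allocation is fixed, a brief decisive cue in a low-uncertainty group cannot be revisited.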
Theme: Privacy, forensics, and dataset integrity attacks/defenses
- Why it matters: As datasets and condensed artifacts are shared, privacy leakage and stealthy poisoning/backdoors become high-leverage; auditing must control for confounders.
- Representative papers:
  - InkDrop: Invisible Backdoor Attacks Against Dataset Condensation
  - Membership Inference Attacks against Large Audio Language Models
  - Who Wrote the Book? Detecting and Attributing LLM Ghostwriters
  - Unsafe2Safe: Controllable Image Anonymization for Downstream Utility
- Common approach:
- Emphasize stealth (perceptual constraints like LPIPS) and transferability for attacks (InkDrop).
- Use blind baselines to detect dataset artifacts that confound privacy audits (audio MIA).
- Build interpretable fingerprints (token transition rank/entropy) for long-form attribution (TRACE).
- Automate anonymization via VLM inspection + LLM instruction + diffusion editing; evaluate privacy–utility tradeoffs.
- Open questions / failure modes:
- Audio MIA can be dominated by dataset shift; “high AUC” may not imply memorization.
- Condensed datasets are small and inspectable—stealthy attacks raise the bar for defenses.
- Anonymization provides empirical privacy metrics but no formal guarantees; policy choices remain external.
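A blind-baseline separability check before running an MIA can be sketched as follows: if a model-free feature (e.g., clip duration; the feature choice is an assumption) already separates member from non-member sets, the audit is confounded by dataset shift.

```python
def auc(pos_scores, neg_scores):
    """Rank-based AUC: probability a positive outranks a negative."""
    wins = ties = 0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1
            elif p == n:
                ties += 1
    return (wins + 0.5 * ties) / (len(pos_scores) * len(neg_scores))

def blind_baseline_flag(member_feats, nonmember_feats, threshold=0.6):
    """Return (confounded?, blind AUC). High blind AUC means the
    member/non-member split is separable without querying the model,
    so MIA 'success' may reflect dataset shift, not memorization."""
    a = auc(member_feats, nonmember_feats)
    return max(a, 1 - a) > threshold, a
```

Only after this check passes (blind AUC near 0.5) does a high MIA AUC on the same split support a memorization claim.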
Theme: Alignment theory & safety verification limits
- Why it matters: Some failure modes (reward hacking, long-run safety gating) may be structural, not patchable by better prompts or more eval.
- Representative papers:
  - Reward Hacking as Equilibrium under Finite Evaluation
  - Information-Theoretic Limits of Safety Verification for Self-Improving Systems
- Common approach:
- Formalize evaluation as a projection of high-dimensional quality into limited signals; prove distortion is inevitable under optimization.
- Provide computable diagnostics (distortion index via reward-model gradients) and scaling arguments (coverage vanishes with tool combinatorics unless eval scales quadratically).
- Prove impossibility results for classifier gates under summability constraints; construct verification-based escapes.
- Open questions / failure modes:
- Empirical validation is largely pending for the reward-hacking equilibrium model.
- Verification approaches depend on tractable certificates (e.g., Lipschitz bounds) that may be hard to compute tightly at scale.
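The "evaluation as projection" framing in the first bullet can be written schematically. The symbols P, q, w and this particular distortion form are assumptions for illustration, not the paper's notation:

```latex
% True quality is a vector q(\pi) \in \mathbb{R}^d; the evaluator observes only
% the low-dimensional signal s(\pi) = P\,q(\pi), with P \in \mathbb{R}^{k \times d},\ k \ll d.
\pi^{\star} = \arg\max_{\pi}\ \langle w,\, P\,q(\pi) \rangle,
\qquad\text{leaving every component of } q(\pi) \text{ in } \ker(P) \text{ unconstrained.}
% One natural distortion measure is the unobserved component's magnitude:
D(\pi) = \bigl\lVert (I - P^{+}P)\, q(\pi) \bigr\rVert .
```

Under optimization pressure, probability mass concentrates on policies that inflate the observed projection while D grows, which is the "distortion is inevitable" claim in miniature.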
3) Technical synthesis
- Stage-level security instrumentation (kill-chain canaries) and NL/PL information-flow taxonomies both operationalize a shared idea: don’t treat LLM outputs as monolithic taint; model intermediate propagation and transformations.
- Prompt-injection robustness is surface-dependent: the same model can be safe on memory poisoning yet fail completely on tool poisoning/propagation, implying benchmarks must be multi-surface.
- Several works converge on uncertainty/entropy as a control signal: ERPO preserves entropy at critical tokens; AdaptToken uses response entropy for global token allocation and early stopping; CoT2-Meta fuses process and outcome confidence for control.
- Multimodal RLVR is splitting into better credit assignment (PRCO’s Observer/Solver) versus better inference-time control (CoT2-Meta); both aim to reduce “fluent wrongness” but at different lifecycle stages.
- Evaluation reliability is now treated as a first-class systems variable: temperature strongly affects judge consistency/error rates; prompt richness can confound “atomic decomposition” benefits.
- Benchmarks are moving beyond final correctness into process and efficiency metrics: MiroEval (process↔report alignment), CirrusBench (NEI/LJ/latency), AMIGO (protocol compliance + verified accuracy).
- Privacy auditing in audio shows a general lesson for safety evals: blind baselines can explain apparent model vulnerabilities; without controlling for dataset artifacts, conclusions can be wrong.
- Theoretical alignment papers suggest a looming mismatch: as agents gain tools, evaluation coverage shrinks (reward hacking amplification), while classifier-style safety gates may face long-run impossibility, pushing toward verification/certification.
4) Top 5 papers (with “why now”)
1) Kill-Chain Canaries: Stage-Level Tracking of Prompt Injection Across Attack Surfaces and Model Safety Tiers
- Introduces stage-level tracking (EXPOSED/PERSISTED/RELAYED/EXECUTED) that explains where defenses work, not just whether the final action happened.
- Shows exposure can be 100% while execution varies widely (e.g., GPT-4o-mini 53% ASR, GPT-5-mini 3%, Claude variants 0% in reported no-defense runs).
- Reveals extreme surface splits (e.g., DeepSeek 0% on memory_poison vs 100% on tool_poison/propagation in reported cells).
- Skepticism: modest per-cell sample sizes and synthetic explicit payloads; mechanism behind “summarization-stage stripping” not isolated.
2) Evaluating Privilege Usage of Agents on Real-World Tools
- Provides a real-tool sandbox (10 MCP servers, 122 privilege-sensitive tools) and auto-generated benign/malicious requests.
- Reports very high privilege-hijack ASR averages (90.55% ReAct, 79.05% Plan-and-Execute) across four LLMs—strong evidence the problem is immediate.
- Highlights that planning helps but doesn’t solve privilege misuse.
- Skepticism: limited to 10 servers and four models; defenses not evaluated yet.
3) MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome
- Refreshable, user-grounded benchmark with process-centric evaluation and multimodal tasks.
- Finds process quality strongly predicts outcome (reported r = 0.88), making “trace quality” a measurable target.
- Shows multimodal tasks cause consistent drops (3–10 points) and rankings shift across synthesis/factuality/process.
- Skepticism: process evaluation requires access to traces; absolute scores depend on LLM judges even if rankings are robust.
4) CoT2-Meta: Budgeted Metacognitive Control for Test-Time Reasoning
- Training-free controller that allocates inference budget across expand/prune/repair/stop/abstain using fused process+outcome signals.
- Reports consistent gains across 15 benchmarks under matched budgets and improved calibration (reported ECE 0.035).
- Provides interpretable controller traces and ablations tying gains to components.
- Skepticism: depends on oracle/process-evaluator quality; misranking can cause premature pruning.
5) Seeing with You: Perception-Reasoning Coevolution for Multimodal Reasoning
- Addresses RLVR’s blurred credit assignment by alternating Observer (evidence caption) and Solver (answer), with role-specific rewards and leakage suppression.
- Reports ~+7 point average accuracy gains and large perception-error reductions (e.g., −39.2% on WeMath perception errors).
- Demonstrates gains across multiple backbones including Qwen3-VL-8B-Instruct.
- Skepticism: intermediate captions can be lossy; evaluated on concise verifiable-answer benchmarks, not open-ended generation.
5) Practical next steps
- For agent security evals, replace single ASR with stage-level metrics (exposed/persisted/relayed/executed) and run across multiple injection surfaces (memory, tool outputs, propagation, permission escalation).
- In tool-using systems, implement privilege minimization + per-tool allowlists and measure misuse using a GrantBox-like harness; compare ReAct vs Plan-and-Execute as a baseline mitigation.
- Add NL/PL boundary flow labeling (placeholder preservation/modality taxonomy) to CI for LLM-integrated code; use it to prioritize which callsites need strict sanitization or structured output constraints.
- For multimodal models, add a CDH-style paired evaluation (evidence vs prior conflict) and track CFAD/CCR to detect “normalization” failures that standard VQA misses.
- When using LLM-as-a-judge, set temperature intentionally (very low T for consistency/parse stability) and report judge temperature + repeated-seed variance as part of benchmark methodology.
- For test-time reasoning, prototype budgeted meta-control (prune/repair/abstain) and measure not just accuracy but ECE/selective prediction under fixed compute.
- For multimodal RLVR, experiment with role-separated credit assignment (Observer/Solver) and explicitly measure perception vs reasoning error categories to ensure perception improves.
- For privacy audits (especially audio), run blind baseline separability checks before claiming memorization; only then run MIAs on distribution-matched subsets.
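For the first bullet, per-stage metrics over a batch of injection trials can be aggregated like this (a sketch; the stage labels are assumed to come from whatever instrumentation your harness produces):

```python
def stage_rates(stage_labels):
    """Per-stage reach rates for a batch of injection trials.
    stage_labels: list of strings drawn from
    {'none', 'exposed', 'persisted', 'relayed', 'executed'}.
    A trial that reached a stage also reached every earlier one."""
    order = ["exposed", "persisted", "relayed", "executed"]
    depth = {s: i + 1 for i, s in enumerate(order)}
    depths = [depth.get(s, 0) for s in stage_labels]
    n = len(depths)
    return {s: sum(d >= i + 1 for d in depths) / n
            for i, s in enumerate(order)}
```

Reporting all four rates (instead of the single executed-stage ASR) is what surfaces the "exposure universal, execution model-dependent" pattern discussed above.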
Generated from per-paper analyses; no external browsing.
