Daily AI Paper Report (2026-03-27)
Run stats
- Candidates: 216
- Selected: 30
- Deepread completed: 30
- Window (UTC): 2026-03-25T00:00:00Z → 2026-03-26T00:00:00Z (arxiv_announce, expanded=0)
| arXiv ID | Title / Links | Categories | Score | Why | Tags |
|---|---|---|---|---|---|
2603.24511 | Claudini: Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs | cs.LG, cs.AI, cs.CR | 96 | Autonomous autoresearch finds stronger jailbreak/prompt-injection attack algorithms; big eval gains vs 30+ baselines | agentic-research, jailbreaks, prompt-injection, adversarial-attacks, red-teaming, white-box, evaluation |
2603.23801 | AgentRFC: Security Design Principles and Conformance Testing for Agent Protocols | cs.CR | 94 | Security principles + protocol stack + conformance tests for agent protocols (MCP/A2A/etc); formal invariants (TLA+) | agent-security, protocols, MCP, formal-methods, TLA+, conformance-testing, security-principles |
2603.24080 | LLMpedia: A Transparent Framework to Materialize an LLM's Encyclopedic Knowledge at Scale | cs.CL, cs.DB | 94 | Scales factuality auditing via 1M generated articles; shows big gap vs MMLU-style benchmarks | factuality, evaluation, parametric-knowledge, benchmarking, hallucinations, knowledge-auditing |
2603.24203 | Invisible Threats from Model Context Protocol: Generating Stealthy Injection Payload via Tree-based Adaptive Search | cs.CR, cs.AI | 92 | Black-box stealthy indirect prompt injection for MCP tool responses; adaptive search to bypass defenses | prompt-injection, tool-security, MCP, black-box-attacks, agent-security, adversarial-search |
2603.23806 | Willful Disobedience: Automatically Detecting Failures in Agentic Traces | cs.SE, cs.AI | 92 | Automated compliance checking of agentic traces; catches procedural/unsafe failures beyond outcomes | agents, trace-evaluation, oversight, specification, tool-use, safety-monitoring, auditing |
2603.24533 | UI-Voyager: A Self-Evolving GUI Agent Learning via Failed Experience | cs.LG, cs.AI, cs.CV | 92 | Self-evolving GUI agent w/ RFT+step-level distillation from failures; strong agentic reliability signal. | agents, GUI-agents, self-improvement, rejection-finetuning, distillation, long-horizon, evaluation |
2603.24079 | When Understanding Becomes a Risk: Authenticity and Safety Risks in the Emerging Image Generation Paradigm | cs.CV, cs.AI, cs.CR | 90 | Finds MLLM image generators produce more unsafe/fake images than diffusion; important new risk surface | multimodal, image-generation, safety, unsafe-content, misinformation, evaluation |
2603.23844 | Language Model Planners do not Scale, but do Formalizers? | cs.CL | 90 | Shows LLM formalizers scale on planning via solver programs; key for agent planning + verification. | planning, program-synthesis, formalization, LLM-reasoning, solver, scaling, BlocksWorld |
2603.24414 | ClawKeeper: Comprehensive Safety Protection for OpenClaw Agents Through Skills, Plugins, and Watchers | cs.CR, cs.AI | 89 | Holistic runtime security for tool-using agents (skills/plugins/watchers) targeting leakage/escalation risks | agent-runtime, sandboxing, permissions, tool-use, plugins, security-framework, data-leakage |
2603.24329 | GameplayQA: A Benchmarking Framework for Decision-Dense POV-Synced Multi-Video Understanding of 3D Virtual Agents | cs.CL, cs.AI, cs.CV | 88 | Decision-dense POV-synced multi-video benchmark for agent perception/reasoning in 3D multi-agent play | benchmark, multimodal, video-understanding, agents, multi-agent, evaluation, POV |
2603.23934 | Revealing Multi-View Hallucination in Large Vision-Language Models | cs.CV, cs.AI | 88 | MVH-Bench exposes multi-view VLM hallucinations; adds benchmark + training-free decoding mitigation. | VLM, hallucination, benchmark, multiview, evaluation, decoding, robustness |
2603.24543 | Analysing the Safety Pitfalls of Steering Vectors | cs.CR, cs.CL | 87 | Safety audit shows steering vectors can sharply raise/lower jailbreak ASR; highlights activation-steering risk surface | activation-steering, CAA, jailbreaks, robustness, model-editing, safety-evaluation |
2603.24124 | The Alignment Tax: Response Homogenization in Aligned LLMs and Its Implications for Uncertainty Estimation | cs.LG, cs.AI, cs.CL | 86 | Finds RLHF/DPO causes response homogenization that breaks sampling-based uncertainty; important reliability insight | RLHF, DPO, uncertainty, calibration, reliability, TruthfulQA, evaluation |
2603.23909 | DUPLEX: Agentic Dual-System Planning via LLM-Driven Information Extraction | cs.AI | 86 | Neuro-symbolic planning that confines LLM to schema-guided extraction to reduce hallucinated plans | agents, planning, neuro-symbolic, PDDL, reliability, hallucination-mitigation, robotics |
2603.23841 | PoliticsBench: Benchmarking Political Values in Large Language Models with Multi-Turn Roleplay | cs.CL, cs.AI | 86 | New multi-turn roleplay benchmark to measure political values/bias drift across major LLMs | benchmark, bias, politics, multi-turn, evaluation, roleplay |
2603.23867 | Can VLMs Reason Robustly? A Neuro-Symbolic Investigation | cs.LG, cs.AI, cs.CV | 86 | Finds VLM reasoning brittle under covariate shift; neuro-symbolic angle for robust generalization. | VLM, robustness, distribution-shift, neuro-symbolic, reasoning, evaluation |
2603.23848 | BeliefShift: Benchmarking Temporal Belief Consistency and Opinion Drift in LLM Agents | cs.CL, cs.CY | 84 | Longitudinal benchmark for belief consistency/drift and evidence-driven revision in multi-session LLM agents | benchmarks, agent-memory, longitudinal-eval, belief-dynamics, consistency, over-alignment |
2603.24582 | The Stochastic Gap: A Markovian Framework for Pre-Deployment Reliability and Oversight-Cost Auditing in Agentic Artificial Intelligence | cs.AI | 84 | Markov framework to audit reliability vs oversight cost for stochastic agent workflows pre-deployment | agentic-systems, reliability, oversight, governance, risk-metrics, workflow-auditing |
2603.23996 | Forensic Implications of Localized AI: Artifact Analysis of Ollama, LM Studio, and llama.cpp | cs.CR | 84 | Forensic artifact study of local LLM runners; relevant to auditing, incident response, and misuse. | security, forensics, local-LLMs, ollama, llama.cpp, LM-Studio, auditability |
2603.24440 | CUA-Suite: Massive Human-annotated Video Demonstrations for Computer-Use Agents | cs.LG, cs.AI, cs.CV | 83 | Large human-annotated continuous video demos for computer-use agents; likely high leverage dataset for agent training | computer-use-agents, datasets, demonstrations, video, tool-use, agent-training, evaluation |
2603.24125 | Alignment Reduces Expressed but Not Encoded Gender Bias: A Unified Framework and Study | cs.CL | 83 | Finds alignment reduces expressed but not encoded gender bias; unified intrinsic/extrinsic analysis | bias, alignment, representation, fairness, evaluation, interpretability |
2603.23990 | From Untamed Black Box to Interpretable Pedagogical Orchestration: The Ensemble of Specialized LLMs Architecture for Adaptive Tutoring | cs.CY, cs.AI | 82 | Interpretable orchestrator + specialized LLMs for tutoring; improves controllability and constraint adherence | agent-architecture, controllability, education, orchestration, reliability, governance |
2603.23878 | The Luna Bound Propagator for Formal Analysis of Neural Networks | cs.LG, cs.AI, cs.LO | 82 | C++ alpha-CROWN bound propagator for NN verification; improves deployability of formal methods. | verification, alpha-CROWN, robustness, formal-methods, tooling, C++ |
2603.24580 | Retrieval Improvements Do Not Guarantee Better Answers: A Study of RAG for AI Policy QA | cs.CL, cs.AI, cs.CY, cs.IR, cs.LG | 81 | Shows better retrieval may not improve answers in AI policy QA; domain RAG study on AGORA with preferences/DPO | RAG, evaluation, AI-policy, retrieval, grounding, preference-learning, DPO |
2603.24586 | Comparing Developer and LLM Biases in Code Evaluation | cs.SE, cs.CL | 81 | TRACE measures LLM-judge vs developer preference gaps; extracts rubric biases across coding settings | evaluation, LLM-judges, human-preferences, bias, code, rubrics, reliability |
2603.23889 | Off-Policy Safe Reinforcement Learning with Constrained Optimistic Exploration | cs.LG, cs.RO | 80 | Off-policy safe RL with constrained optimistic exploration; targets constraint violations directly. | safe-RL, constraints, off-policy, exploration, robotics, reliability |
2603.23853 | SCoOP: Semantic Consistent Opinion Pooling for Uncertainty Quantification in Multiple Vision-Language Model Systems | cs.AI, cs.MA | 79 | Training-free uncertainty pooling across multiple VLMs improves hallucination detection/abstention | uncertainty, hallucination-detection, VLM, ensembles, abstention, calibration |
2603.23840 | VehicleMemBench: An Executable Benchmark for Multi-User Long-Term Memory in In-Vehicle Agents | cs.AI, cs.CL | 78 | Executable benchmark for multi-user long-term memory + tool interaction in in-vehicle agents; tests conflicts over time | benchmarks, agent-memory, long-context, tool-use, multi-user, simulation, reliability |
2603.24282 | Software Supply Chain Smells: Lightweight Analysis for Secure Dependency Management | cs.SE, cs.CR | 78 | Lightweight tool to detect supply-chain security 'smells' in Maven/NPM; practical security signal | security, software-supply-chain, dependency-management, tooling, risk-detection |
2603.24518 | TuneShift-KD: Knowledge Distillation and Transfer for Fine-tuned Models | cs.LG | 78 | Transfers specialized knowledge across models via few-example distillation when data unavailable. | knowledge-distillation, model-transfer, fine-tuning, LoRA, data-privacy, LLMs |
AI Paper Insight Brief
2026-03-27
0) Executive takeaways (read this first)
- Agent “control planes” are becoming the new security perimeter—and composition is the brittle point. Formal conformance checking across agent protocols finds many spec/implementation violations, and composition breaks properties that hold in isolation (20/21 composition-safety invariant violations in composed models).
- Outcome-only evaluation is increasingly misleading for agents. Trace-level compliance checking finds that even “perfect outcome” traces can hide procedural violations (e.g., 83% of Claude traces with perfect τ2 reward still had at least one violation).
- Long-horizon reliability bottlenecks are shifting from “tool execution” to “state/memory correctness.” In an executable in-vehicle benchmark, errors are dominated by memory construction/retrieval (63.9% of errors), with large drops from gold-memory to autonomous-memory settings.
- For planning, “LLM as planner” still doesn’t scale; “LLM as formalizer/extractor” does. Multiple papers converge on constraining the LLM to information extraction / formalization and delegating search/verification to symbolic solvers, yielding large success-rate gains on large BlocksWorld and IPC/household planning.
- Alignment interventions can degrade safety instrumentation. Two distinct “alignment side effects” show up: (i) response homogenization can break sampling-based uncertainty (high single-cluster rates), and (ii) activation steering vectors can substantially increase jailbreak success depending on overlap with refusal directions.
- Multimodal safety is facing a paradigm shift. MLLM-based image generators produce higher unsafe-image rates than diffusion baselines and are harder for existing detectors to flag unless detectors are retrained on paradigm-inclusive data; multi-view LVLMs show a distinct hallucination mode with a decoding-time mitigation.
2) Key themes (clusters)
Theme: Agent protocol security & supply-chain style injection
- Why it matters: Agent protocols and tool-response channels carry semantic payloads interpreted at runtime; failures propagate through delegation chains and bridges. This creates a security surface that classic API/message security doesn’t cover.
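- Representative papers:
- AgentRFC: Security Design Principles and Conformance Testing for Agent Protocols
- Invisible Threats from Model Context Protocol: Generating Stealthy Injection Payload via Tree-based Adaptive Search
- ClawKeeper: Comprehensive Safety Protection for OpenClaw Agents Through Skills, Plugins, and Watchers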
- Common approach:
- Treat agent/tool interactions as a protocol with explicit invariants (formal models or policy rules).
- Evaluate end-to-end: spec → model checking / adversarial generation → executable tests or runtime enforcement.
- Emphasize composition and supply-chain threats (bridges, third-party servers, skills/plugins).
- Open questions / failure modes:
- How to standardize composition safety requirements for bridges/proxies so properties survive protocol mixing.
- Whether defenses (sanitizers, detectors) can keep up with adaptive black-box payload generation.
- Operational trade-offs: privacy and latency when using external “watcher” oversight.
Theme: Trace-level auditing beats outcome-only scoring for agents
- Why it matters: Agents fail procedurally (skipped checks, forbidden tool edges, rule violations) even when final outcomes look correct—undermining safety and auditability.
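- Representative papers:
- Willful Disobedience: Automatically Detecting Failures in Agentic Traces
- The Stochastic Gap: A Markovian Framework for Pre-Deployment Reliability and Oversight-Cost Auditing in Agentic Artificial Intelligence
- Comparing Developer and LLM Biases in Code Evaluation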
- Common approach:
- Extract checkable specifications from prompts/policies and score compliance across traces.
- Quantify uncertainty/ambiguity and route to HITL escalation (entropy/support thresholds; gated aggregations).
- Diagnose evaluator misalignment (LLM judges vs humans) with interpretable rubric axes.
- Open questions / failure modes:
- Cost and scalability of multi-LLM judging pipelines; risk of judge bias and positional bias.
- How to localize root causes (which step/spec caused failure) without double-penalizing correlated violations.
- Translating log-based “support” audits into real online guarantees (logs are observational).
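The "extract checkable specifications and score compliance across traces" idea can be sketched as a small rule checker over a tool-call trace. This is a minimal illustration, not AgentPex itself: the `Step` type, the tool names, and the two rule tables (`FORBIDDEN_EDGES`, `REQUIRED_BEFORE`) are hypothetical stand-ins for rules that would be extracted from a prompt or policy.

```python
from dataclasses import dataclass


@dataclass
class Step:
    tool: str   # name of the tool called at this step
    args: dict  # arguments passed to the tool


# Hypothetical spec for illustration only; real rules would be extracted
# from the agent's prompt, policy text, or tool schemas.
FORBIDDEN_EDGES = {("lookup_order", "issue_refund")}   # B must not directly follow A
REQUIRED_BEFORE = {"issue_refund": "verify_identity"}  # tool -> prerequisite tool


def check_trace(trace):
    """Return a list of human-readable violations for one tool-call trace."""
    violations = []
    seen = set()
    for i, step in enumerate(trace):
        if i > 0 and (trace[i - 1].tool, step.tool) in FORBIDDEN_EDGES:
            prev = trace[i - 1].tool
            violations.append(f"step {i}: forbidden edge {prev} -> {step.tool}")
        prereq = REQUIRED_BEFORE.get(step.tool)
        if prereq is not None and prereq not in seen:
            violations.append(f"step {i}: {step.tool} called before {prereq}")
        seen.add(step.tool)
    return violations


bad = [Step("lookup_order", {"id": 7}), Step("issue_refund", {"id": 7})]
good = [Step("verify_identity", {}), Step("lookup_order", {"id": 7}),
        Step("check_policy", {}), Step("issue_refund", {"id": 7})]
print(check_trace(bad))
print(check_trace(good))
```

The first trace trips both rules even though the refund itself may be the "correct" outcome, which is the kind of procedural failure an outcome-only score would miss.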
Theme: Memory & longitudinal belief dynamics as the next reliability frontier
- Why it matters: Persistent agents must manage evolving preferences/beliefs across sessions and users; failures manifest as drift, contradiction mishandling, or incorrect state updates.
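- Representative papers:
- VehicleMemBench: An Executable Benchmark for Multi-User Long-Term Memory in In-Vehicle Agents
- BeliefShift: Benchmarking Temporal Belief Consistency and Opinion Drift in LLM Agents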
- Common approach:
- Long-horizon, multi-session trajectories with objective metrics (state-based evaluation; belief-state vectors).
- Separate “gold memory” vs autonomous memory construction to isolate the bottleneck.
- Measure stability–adaptability trade-offs (evidence-driven revision vs drift resistance).
- Open questions / failure modes:
- Retrieval helps recall but may not prevent model-induced nudging (RAG improved revision/CRR but barely changed drift coherence).
- Hard cases: conditional constraints and multi-user conflict resolution in executable environments.
- Scaling contradiction-resolution evaluation beyond human scoring.
Theme: Constrain LLMs to formalization/extraction; let solvers verify/search
- Why it matters: End-to-end LLM planning degrades sharply with combinatorial complexity; formalization + solver pipelines improve scalability and verifiability.
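- Representative papers:
- Language Model Planners do not Scale, but do Formalizers?
- DUPLEX: Agentic Dual-System Planning via LLM-Driven Information Extraction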
- Common approach:
- Use LLMs for schema-guided extraction or translation to solver-friendly representations (e.g., PDDL).
- Add deterministic mapping/validation layers to avoid brittle codegen.
- Use failure-triggered reflection/repair loops driven by solver diagnostics.
- Open questions / failure modes:
- Domain dependence (results concentrated in BlocksWorld / PDDL-authored domains).
- “Unraveling” compression: NL descriptions that expand to huge formal specs; higher-order generator approaches help but need broader validation.
- Latency of iterative repair loops in real-time embodied settings.
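The division of labor described in this theme can be illustrated with a toy BlocksWorld solver. In this sketch the LLM's only job would be to emit the formal `init` and `goal` states (written by hand here); a deterministic breadth-first search does the planning. The state encoding (lists of stacks, bottom to top) and the function names are assumptions made for illustration and are not taken from any of the papers.

```python
from collections import deque


def normalize(stacks):
    """Canonical state: drop empty stacks and sort, so equal layouts compare equal."""
    return tuple(sorted(tuple(s) for s in stacks if s))


def successors(state):
    """Yield (action, next_state) for every legal single-block move."""
    stacks = [list(s) for s in state]
    for i, src in enumerate(stacks):
        block = src[-1]
        # Put the top block on the table (pointless if it is already there).
        if len(src) > 1:
            nxt = [s[:] for s in stacks]
            nxt[i].pop()
            nxt.append([block])
            yield f"move {block} to table", normalize(nxt)
        # Put the top block onto the top of another stack.
        for j, dst in enumerate(stacks):
            if i != j:
                nxt = [s[:] for s in stacks]
                nxt[i].pop()
                nxt[j].append(block)
                yield f"move {block} onto {dst[-1]}", normalize(nxt)


def solve(init, goal):
    """Breadth-first search; returns a shortest list of actions, or None."""
    start, target = normalize(init), normalize(goal)
    frontier = deque([(start, [])])
    visited = {start}
    while frontier:
        state, plan = frontier.popleft()
        if state == target:
            return plan
        for action, nxt in successors(state):
            if nxt not in visited:
                visited.add(nxt)
                frontier.append((nxt, plan + [action]))
    return None


# Stacks are listed bottom to top: A under B under C, to be reversed.
print(solve([["A", "B", "C"]], [["C", "B", "A"]]))
# ['move C to table', 'move B onto C', 'move A onto B']
```

Breadth-first search is used only because it is short and exact; it will not scale to large instances, which is where a real PDDL planner would take over from the formalized input.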
Theme: Multimodal hallucination & uncertainty—benchmarks + training-free mitigations
- Why it matters: Multimodal systems are increasingly deployed in multi-view and ensemble settings; hallucinations and mis-grounding can be systematic and architecture-dependent.
- Representative papers:
- Revealing Multi-View Hallucination in Large Vision-Language Models
- SCoOP: Semantic Consistent Opinion Pooling for Uncertainty Quantification in Multiple Vision-Language Model Systems
- When Understanding Becomes a Risk: Authenticity and Safety Risks in the Emerging Image Generation Paradigm
- GameplayQA: A Benchmarking Framework for Decision-Dense POV-Synced Multi-Video Understanding of 3D Virtual Agents
- Common approach:
- Build targeted benchmarks for specific failure modes (multi-view interference; decision-dense multi-POV temporal grounding).
- Prefer training-free mitigations when possible (contrastive decoding; entropy-weighted pooling).
- Evaluate abstention/hallucination detection explicitly (AUROC/AURAC; paired-view metrics).
- Open questions / failure modes:
- Extending beyond multiple-choice settings (SCoOP) and beyond paired-view setups (MVH-Bench).
- Detector brittleness under paradigm shifts (diffusion → MLLM generators) and prompt enrichment.
- Whether decoding-time fixes generalize across LVLM architectures.
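As a rough picture of "entropy-weighted pooling" with explicit abstention, the sketch below combines per-model answer distributions with weights that favor low-entropy (more confident) models. It is a generic entropy-weighted linear opinion pool written for illustration, not SCoOP's actual algorithm; the weighting formula, the epsilon, and the abstention threshold are all assumptions.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)


def pool(dists):
    """Entropy-weighted linear opinion pool over same-length distributions."""
    k = len(dists[0])
    max_h = math.log(k)  # entropy of the uniform distribution over k options
    # Lower entropy gives a larger weight; the epsilon keeps the total
    # positive even when every model is uniform.
    weights = [max_h - entropy(p) + 1e-9 for p in dists]
    total = sum(weights)
    return [sum(w * p[i] for w, p in zip(weights, dists)) / total
            for i in range(k)]


def should_abstain(pooled, threshold):
    """Abstain when the pooled distribution is still too uncertain."""
    return entropy(pooled) > threshold


# Toy distributions over three answer options from two hypothetical models.
confident = [0.9, 0.05, 0.05]
unsure = [1 / 3, 1 / 3, 1 / 3]
pooled = pool([confident, unsure])
print(pooled, should_abstain(pooled, threshold=0.5))
```

Here the uniform model contributes almost nothing, so the pooled distribution stays close to the confident model's and the system does not abstain at this threshold.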
Theme: Alignment side effects: uncertainty collapse, steering risks, and “encoded vs expressed” bias
- Why it matters: Alignment can change not just outputs but the reliability tooling around models (uncertainty estimation, jailbreak robustness) and can suppress bias without removing it.
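- Representative papers:
- The Alignment Tax: Response Homogenization in Aligned LLMs and Its Implications for Uncertainty Estimation
- Analysing the Safety Pitfalls of Steering Vectors
- Alignment Reduces Expressed but Not Encoded Gender Bias: A Unified Framework and Study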
- Common approach:
- Measure latent/representation-level phenomena (single-cluster rates; refusal directions; gender directions).
- Use causal-ish ablations (remove projection onto a direction) to test functional contribution.
- Propose routing/cascades or “safety-aware” constraints rather than relying on one estimator.
- Open questions / failure modes:
- One-dimensional direction approximations may be insufficient (refusal/bias likely multi-dimensional).
- Sampling-based uncertainty can become structurally uninformative under homogenization; need multi-signal designs.
- Debiasing may not generalize to open-ended tasks (e.g., story generation).
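Two of the operations mentioned above, measuring geometric overlap between directions and removing the projection onto a direction, reduce to a few lines of vector arithmetic. The sketch below uses plain Python lists and toy three-dimensional vectors; the vector values and the names `refusal_dir` and `steering_vec` are illustrative assumptions, and real analyses would operate on model activations.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity between two nonzero vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def ablate(h, d):
    """Remove the component of h that lies along direction d (d must be nonzero)."""
    scale = dot(h, d) / dot(d, d)
    return [a - scale * b for a, b in zip(h, d)]


# Toy values for illustration only.
refusal_dir = [1.0, 0.0, 0.0]
steering_vec = [1.0, 1.0, 0.0]
hidden = [3.0, 4.0, 5.0]

print(round(cosine(steering_vec, refusal_dir), 4))  # 0.7071
print(ablate(hidden, refusal_dir))                  # [0.0, 4.0, 5.0]
```

A high cosine between a steering vector and a refusal direction is the kind of overlap the theme flags as worth a regression test; the ablated vector has zero component along the chosen direction by construction.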
3) Technical synthesis
- Formal invariants + executable replay is emerging as a practical security workflow: AgentRFC ties prose specs → typed IR → TLA+ invariants → counterexample traces → SDK-level tests, bridging “paper security” to implementation failures.
- Composition is the recurring Achilles’ heel across both protocols and agents: protocol bridges break invariants; agent systems combine tools/memory/judges where local correctness doesn’t imply global safety.
- “Constrain the LLM” is a cross-domain pattern: DUPLEX confines LLMs to schema-guided IE; BlocksWorld work uses LLM-as-formalizer/higher-order generator; VLC decouples perception from exact symbolic execution.
- Evaluation is moving from single outputs to trajectories and distributions: AgentPex scores multi-evaluator compliance; BeliefShift evaluates belief-state sequences; SCoOP builds system-level distributions from sampled outputs.
- RAG helps recall but not necessarily “anti-drift”: BeliefShift shows RAG improves belief revision and contradiction handling but barely changes drift coherence; policy RAG shows retrieval metric gains don’t guarantee better answers.
- Training-free inference-time interventions are gaining traction: RSCD (attention-mask contrastive decoding) and SCoOP (entropy-weighted pooling) improve robustness without retraining, but rely on architectural assumptions and sampling cost.
- Alignment can undermine common safety heuristics: response homogenization breaks semantic-entropy/self-consistency; activation steering can increase jailbreak ASR depending on geometric overlap with refusal directions.
- Benchmarks are becoming more executable and state-based (VehicleMemBench) and more diagnostic via distractors and paired designs (GameplayQA, MVH-Bench), reducing reliance on subjective judging.
- Automated “research agents” are now a security factor: Claudini shows LLM-driven algorithm search can materially improve white-box jailbreak optimizers, raising the bar for defense evaluation.
4) Top 5 papers (with “why now”)
1) AgentRFC: Security Design Principles and Conformance Testing for Agent Protocols
- Provides a 6-layer Agent Protocol Stack to make protocol completeness explicit across MCP/A2A/ANP/ACP.
- Defines 11 agent-agnostic security principles as TLA+ invariants with a taxonomy separating spec-mandated vs hardening.
- Delivers an end-to-end spec→IR→TLA+→counterexample→SDK test pipeline; finds 33 spec-level violations and confirms implementation violations via 42 tests.
- Why now: agent protocols are rapidly deployed and composed; the paper shows composition can break properties (20/21 CS invariant violations).
- Skeptical about: bounded model checking (small bounds) and manual normative-clause extraction.
2) Willful Disobedience: Automatically Detecting Failures in Agentic Traces
- Introduces AgentPex: explicit rule extraction from prompts/tool schemas + multi-evaluator trace auditing with gated-min aggregation.
- Shows outcome-only success can hide failures: 83% of “perfect reward” Claude traces still had procedural violations.
- Surfaces model-specific procedural failure modes (e.g., simultaneous text+tool-call violations).
- Why now: production agents generate massive traces; automated procedural auditing is becoming necessary.
- Skeptical about: cost (multiple LLM calls per trace) and benchmark/domain generality (τ2-bench focus).
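The headline statistic here is a conditional rate: among traces with a perfect outcome reward, what fraction still contain at least one procedural violation. A minimal sketch of that computation follows; the record layout (`reward`, `violations`) and the toy data are assumptions for illustration and do not reproduce the paper's numbers.

```python
def violation_rate_among_successes(traces, success_reward=1.0):
    """Fraction of outcome-successful traces with at least one violation.

    Returns None when no trace reaches the success reward.
    """
    successes = [t for t in traces if t["reward"] >= success_reward]
    if not successes:
        return None
    flagged = sum(1 for t in successes if t["violations"])
    return flagged / len(successes)


# Toy audit results (hypothetical).
audited = [
    {"reward": 1.0, "violations": ["refund issued before identity check"]},
    {"reward": 1.0, "violations": []},
    {"reward": 1.0, "violations": ["text and tool call in the same turn"]},
    {"reward": 1.0, "violations": ["skipped confirmation"]},
    {"reward": 0.0, "violations": ["wrong tool"]},  # failed outcome, excluded
]
print(violation_rate_among_successes(audited))  # 0.75
```

Reporting this rate next to the plain success rate makes it visible when a high outcome score is hiding procedural problems.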
3) VehicleMemBench: An Executable Benchmark for Multi-User Long-Term Memory in In-Vehicle Agents
- Executable simulator + state-based evaluation for multi-user preference evolution with 111 APIs.
- Quantifies the gold-memory → autonomous-memory performance drop (e.g., 90.60 → 64.80 ESM for a strong model under one memory method).
- Finds memory errors dominate (63.9% of errors), not tool execution.
- Why now: memory is the bottleneck for long-horizon agents; this gives an objective harness.
- Skeptical about: scenario scope and extension to longer/more complex real driving contexts.
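To make the quoted gold-memory to autonomous-memory gap concrete, the short calculation below turns the two scores from the example above (90.60 and 64.80) into an absolute and a relative drop. The helper name `memory_gap` is hypothetical.

```python
def memory_gap(gold_score, autonomous_score):
    """Return (absolute drop in points, relative drop as a fraction of gold)."""
    drop = gold_score - autonomous_score
    return drop, drop / gold_score


drop, rel = memory_gap(90.60, 64.80)
print(f"{drop:.2f} points, {rel:.1%} relative")  # 25.80 points, 28.5% relative
```

In other words, for that model and memory method, building memory autonomously costs a bit more than a quarter of the score achieved with gold memory.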
4) Language Model Planners do not Scale, but do Formalizers?
- Controlled scaling study: LLM-as-planner degrades to ~20% by ~30 blocks, while formalizers scale much better (e.g., one model at 100% up to 100 blocks).
- Practical techniques: divide-and-conquer formalization and higher-order formalizers that output generator programs for “unraveling” inputs.
- Why now: planning is central to agents; this clarifies where LLMs should sit in the stack.
- Skeptical about: domain narrowness (BlocksWorld) and single-run evaluation / solver-crash handling.
5) The Alignment Tax: Response Homogenization in Aligned LLMs and Its Implications for Uncertainty Estimation
- Diagnoses response homogenization (high single-cluster rates) that makes sampling-based uncertainty collapse on many queries.
- Attributes much of the effect to preference optimization (DPO) via stage-wise ablations and base-vs-instruct comparisons.
- Proposes UCBD, a cheapest-first cascade that resolves many queries with cheap signals (token entropy) and escalates selectively.
- Why now: many safety stacks rely on self-consistency/semantic entropy; this shows when they fail structurally.
- Skeptical about: limited to open 3B–14B families and imperfect judge labels; cascade lacks formal guarantees.
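The two ideas in this entry, a single-cluster diagnostic and a cheapest-first cascade, can be sketched as follows. This is not the paper's UCBD implementation: exact string matching stands in for semantic clustering, and the thresholds (`low`, `high`), sample count, and decision labels are assumptions chosen for illustration.

```python
def cluster_key(answer):
    """Crude stand-in for semantic clustering: normalized exact match."""
    return answer.strip().lower()


def single_cluster_rate(samples_per_query):
    """Fraction of queries whose sampled answers all fall into one cluster."""
    single = sum(1 for samples in samples_per_query
                 if len({cluster_key(s) for s in samples}) == 1)
    return single / len(samples_per_query)


def cascade(token_entropy, sample_fn, low=0.2, high=1.0, n=5):
    """Return (decision, stage), trying the cheapest signal first."""
    # Stage 1: a single-pass signal settles the clear cases.
    if token_entropy <= low:
        return ("answer", "token_entropy")
    if token_entropy >= high:
        return ("abstain", "token_entropy")
    # Stage 2: sample only in the ambiguous band.
    clusters = {cluster_key(sample_fn()) for _ in range(n)}
    if len(clusters) == 1:
        # Agreement is weak evidence if responses are homogenized,
        # so hand the query to a heavier check instead of trusting it.
        return ("escalate", "sampling")
    return ("abstain", "sampling")


print(single_cluster_rate([["Paris", "paris "], ["Paris", "Lyon"]]))  # 0.5
print(cascade(0.1, lambda: "Paris"))  # ('answer', 'token_entropy')
print(cascade(0.5, lambda: "Paris"))  # ('escalate', 'sampling')
```

The design choice to escalate on unanimous samples reflects the entry's point that agreement among samples stops being informative once responses are homogenized.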
5) Practical next steps
- If you ship agent protocols: adopt an APS-style checklist and add composition tests (bridge/proxy scenarios) as first-class CI artifacts; treat “holds in isolation” as insufficient.
- For MCP/tool ecosystems: assume structured tool outputs are an injection channel; add provenance, consent enforcement, and audit completeness checks, and test against adaptive payload generation (TIP-like).
- For agent evaluation: add trace-level compliance scoring (explicit rule extraction + forbidden edges + argument checks) alongside outcome metrics; track “procedural violation rate among successful outcomes.”
- For long-horizon assistants: measure memory as a stateful system (gold vs autonomous memory), and explicitly report memory-error breakdowns; prioritize conditional-constraint and conflict cases.
- For planning/automation: refactor stacks so LLMs do schema-guided extraction/formalization and deterministic mapping; use solver diagnostics to trigger targeted repair loops rather than free-form replanning.
- For uncertainty/abstention: don’t rely solely on sampling-based semantic entropy; add cheap single-pass signals (e.g., token entropy) and route to heavier checks only when needed.
- For multimodal deployments: retrain or at least validate fake-image detectors on MLLM-generated images (paradigm-inclusive data), and add multi-view-specific evaluations (paired-view grounding) if you use multi-camera inputs.
- For alignment interventions (steering, adapters): include jailbreak ASR regression tests as part of release gating; check geometric overlap with refusal directions and evaluate whether steering increases ASR under simple templates.
Generated from per-paper analyses; no external browsing.
