Daily AI Paper Report (2026-03-27)

Published:

Chinese version: [中文]

Run stats

  • Candidates: 216
  • Selected: 30
  • Deepread completed: 30
  • Window (UTC): 2026-03-25T00:00:00Z → 2026-03-26T00:00:00Z (arxiv_announce, expanded=0)
Selected papers
  • 2603.24511 · Claudini: Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs [PDF]
    Categories: cs.LG, cs.AI, cs.CR | Score: 96
    Why: Autonomous autoresearch finds stronger jailbreak/prompt-injection attack algorithms; big eval gains vs 30+ baselines
    Tags: agentic-research, jailbreaks, prompt-injection, adversarial-attacks, red-teaming, white-box, evaluation
  • 2603.23801 · AgentRFC: Security Design Principles and Conformance Testing for Agent Protocols [PDF]
    Categories: cs.CR | Score: 94
    Why: Security principles + protocol stack + conformance tests for agent protocols (MCP/A2A/etc); formal invariants (TLA+)
    Tags: agent-security, protocols, MCP, formal-methods, TLA+, conformance-testing, security-principles
  • 2603.24080 · LLMpedia: A Transparent Framework to Materialize an LLM's Encyclopedic Knowledge at Scale [PDF]
    Categories: cs.CL, cs.DB | Score: 94
    Why: Scales factuality auditing via 1M generated articles; shows big gap vs MMLU-style benchmarks
    Tags: factuality, evaluation, parametric-knowledge, benchmarking, hallucinations, knowledge-auditing
  • 2603.24203 · Invisible Threats from Model Context Protocol: Generating Stealthy Injection Payload via Tree-based Adaptive Search [PDF]
    Categories: cs.CR, cs.AI | Score: 92
    Why: Black-box stealthy indirect prompt injection for MCP tool responses; adaptive search to bypass defenses
    Tags: prompt-injection, tool-security, MCP, black-box-attacks, agent-security, adversarial-search
  • 2603.23806 · Willful Disobedience: Automatically Detecting Failures in Agentic Traces [PDF]
    Categories: cs.SE, cs.AI | Score: 92
    Why: Automated compliance checking of agentic traces; catches procedural/unsafe failures beyond outcomes
    Tags: agents, trace-evaluation, oversight, specification, tool-use, safety-monitoring, auditing
  • 2603.24533 · UI-Voyager: A Self-Evolving GUI Agent Learning via Failed Experience [PDF]
    Categories: cs.LG, cs.AI, cs.CV | Score: 92
    Why: Self-evolving GUI agent w/ RFT+step-level distillation from failures; strong agentic reliability signal.
    Tags: agents, GUI-agents, self-improvement, rejection-finetuning, distillation, long-horizon, evaluation
  • 2603.24079 · When Understanding Becomes a Risk: Authenticity and Safety Risks in the Emerging Image Generation Paradigm [PDF]
    Categories: cs.CV, cs.AI, cs.CR | Score: 90
    Why: Finds MLLM image generators produce more unsafe/fake images than diffusion; important new risk surface
    Tags: multimodal, image-generation, safety, unsafe-content, misinformation, evaluation
  • 2603.23844 · Language Model Planners do not Scale, but do Formalizers? [PDF]
    Categories: cs.CL | Score: 90
    Why: Shows LLM formalizers scale on planning via solver programs; key for agent planning + verification.
    Tags: planning, program-synthesis, formalization, LLM-reasoning, solver, scaling, BlocksWorld
  • 2603.24414 · ClawKeeper: Comprehensive Safety Protection for OpenClaw Agents Through Skills, Plugins, and Watchers [PDF]
    Categories: cs.CR, cs.AI | Score: 89
    Why: Holistic runtime security for tool-using agents (skills/plugins/watchers) targeting leakage/escalation risks
    Tags: agent-runtime, sandboxing, permissions, tool-use, plugins, security-framework, data-leakage
  • 2603.24329 · GameplayQA: A Benchmarking Framework for Decision-Dense POV-Synced Multi-Video Understanding of 3D Virtual Agents [PDF]
    Categories: cs.CL, cs.AI, cs.CV | Score: 88
    Why: Decision-dense POV-synced multi-video benchmark for agent perception/reasoning in 3D multi-agent play
    Tags: benchmark, multimodal, video-understanding, agents, multi-agent, evaluation, POV
  • 2603.23934 · Revealing Multi-View Hallucination in Large Vision-Language Models [PDF]
    Categories: cs.CV, cs.AI | Score: 88
    Why: MVH-Bench exposes multi-view VLM hallucinations; adds benchmark + training-free decoding mitigation.
    Tags: VLM, hallucination, benchmark, multiview, evaluation, decoding, robustness
  • 2603.24543 · Analysing the Safety Pitfalls of Steering Vectors [PDF]
    Categories: cs.CR, cs.CL | Score: 87
    Why: Safety audit shows steering vectors can sharply raise/lower jailbreak ASR; highlights activation-steering risk surface
    Tags: activation-steering, CAA, jailbreaks, robustness, model-editing, safety-evaluation
  • 2603.24124 · The Alignment Tax: Response Homogenization in Aligned LLMs and Its Implications for Uncertainty Estimation [PDF]
    Categories: cs.LG, cs.AI, cs.CL | Score: 86
    Why: Finds RLHF/DPO causes response homogenization that breaks sampling-based uncertainty; important reliability insight
    Tags: RLHF, DPO, uncertainty, calibration, reliability, TruthfulQA, evaluation
  • 2603.23909 · DUPLEX: Agentic Dual-System Planning via LLM-Driven Information Extraction [PDF]
    Categories: cs.AI | Score: 86
    Why: Neuro-symbolic planning that confines the LLM to schema-guided extraction to reduce hallucinated plans
    Tags: agents, planning, neuro-symbolic, PDDL, reliability, hallucination-mitigation, robotics
  • 2603.23841 · PoliticsBench: Benchmarking Political Values in Large Language Models with Multi-Turn Roleplay [PDF]
    Categories: cs.CL, cs.AI | Score: 86
    Why: New multi-turn roleplay benchmark to measure political values/bias drift across major LLMs
    Tags: benchmark, bias, politics, multi-turn, evaluation, roleplay
  • 2603.23867 · Can VLMs Reason Robustly? A Neuro-Symbolic Investigation [PDF]
    Categories: cs.LG, cs.AI, cs.CV | Score: 86
    Why: Finds VLM reasoning brittle under covariate shift; neuro-symbolic angle for robust generalization.
    Tags: VLM, robustness, distribution-shift, neuro-symbolic, reasoning, evaluation
  • 2603.23848 · BeliefShift: Benchmarking Temporal Belief Consistency and Opinion Drift in LLM Agents [PDF]
    Categories: cs.CL, cs.CY | Score: 84
    Why: Longitudinal benchmark for belief consistency/drift and evidence-driven revision in multi-session LLM agents
    Tags: benchmarks, agent-memory, longitudinal-eval, belief-dynamics, consistency, over-alignment
  • 2603.24582 · The Stochastic Gap: A Markovian Framework for Pre-Deployment Reliability and Oversight-Cost Auditing in Agentic Artificial Intelligence [PDF]
    Categories: cs.AI | Score: 84
    Why: Markov framework to audit reliability vs oversight cost for stochastic agent workflows pre-deployment
    Tags: agentic-systems, reliability, oversight, governance, risk-metrics, workflow-auditing
  • 2603.23996 · Forensic Implications of Localized AI: Artifact Analysis of Ollama, LM Studio, and llama.cpp [PDF]
    Categories: cs.CR | Score: 84
    Why: Forensic artifact study of local LLM runners; relevant to auditing, incident response, and misuse.
    Tags: security, forensics, local-LLMs, ollama, llama.cpp, LM-Studio, auditability
  • 2603.24440 · CUA-Suite: Massive Human-annotated Video Demonstrations for Computer-Use Agents [PDF]
    Categories: cs.LG, cs.AI, cs.CV | Score: 83
    Why: Large human-annotated continuous video demos for computer-use agents; likely high-leverage dataset for agent training
    Tags: computer-use-agents, datasets, demonstrations, video, tool-use, agent-training, evaluation
  • 2603.24125 · Alignment Reduces Expressed but Not Encoded Gender Bias: A Unified Framework and Study [PDF]
    Categories: cs.CL | Score: 83
    Why: Finds alignment reduces expressed but not encoded gender bias; unified intrinsic/extrinsic analysis
    Tags: bias, alignment, representation, fairness, evaluation, interpretability
  • 2603.23990 · From Untamed Black Box to Interpretable Pedagogical Orchestration: The Ensemble of Specialized LLMs Architecture for Adaptive Tutoring [PDF]
    Categories: cs.CY, cs.AI | Score: 82
    Why: Interpretable orchestrator + specialized LLMs for tutoring; improves controllability and constraint adherence
    Tags: agent-architecture, controllability, education, orchestration, reliability, governance
  • 2603.23878 · The Luna Bound Propagator for Formal Analysis of Neural Networks [PDF]
    Categories: cs.LG, cs.AI, cs.LO | Score: 82
    Why: C++ alpha-CROWN bound propagator for NN verification; improves deployability of formal methods.
    Tags: verification, alpha-CROWN, robustness, formal-methods, tooling, C++
  • 2603.24580 · Retrieval Improvements Do Not Guarantee Better Answers: A Study of RAG for AI Policy QA [PDF]
    Categories: cs.CL, cs.AI, cs.CY, cs.IR, cs.LG | Score: 81
    Why: Shows better retrieval may not improve answers in AI policy QA; domain RAG study on AGORA with preferences/DPO
    Tags: RAG, evaluation, AI-policy, retrieval, grounding, preference-learning, DPO
  • 2603.24586 · Comparing Developer and LLM Biases in Code Evaluation [PDF]
    Categories: cs.SE, cs.CL | Score: 81
    Why: TRACE measures LLM-judge vs developer preference gaps; extracts rubric biases across coding settings
    Tags: evaluation, LLM-judges, human-preferences, bias, code, rubrics, reliability
  • 2603.23889 · Off-Policy Safe Reinforcement Learning with Constrained Optimistic Exploration [PDF]
    Categories: cs.LG, cs.RO | Score: 80
    Why: Off-policy safe RL with constrained optimistic exploration; targets constraint violations directly.
    Tags: safe-RL, constraints, off-policy, exploration, robotics, reliability
  • 2603.23853 · SCoOP: Semantic Consistent Opinion Pooling for Uncertainty Quantification in Multiple Vision-Language Model Systems [PDF]
    Categories: cs.AI, cs.MA | Score: 79
    Why: Training-free uncertainty pooling across multiple VLMs improves hallucination detection/abstention
    Tags: uncertainty, hallucination-detection, VLM, ensembles, abstention, calibration
  • 2603.23840 · VehicleMemBench: An Executable Benchmark for Multi-User Long-Term Memory in In-Vehicle Agents [PDF]
    Categories: cs.AI, cs.CL | Score: 78
    Why: Executable benchmark for multi-user long-term memory + tool interaction in in-vehicle agents; tests conflicts over time
    Tags: benchmarks, agent-memory, long-context, tool-use, multi-user, simulation, reliability
  • 2603.24282 · Software Supply Chain Smells: Lightweight Analysis for Secure Dependency Management [PDF]
    Categories: cs.SE, cs.CR | Score: 78
    Why: Lightweight tool to detect supply-chain security 'smells' in Maven/NPM; practical security signal
    Tags: security, software-supply-chain, dependency-management, tooling, risk-detection
  • 2603.24518 · TuneShift-KD: Knowledge Distillation and Transfer for Fine-tuned Models [PDF]
    Categories: cs.LG | Score: 78
    Why: Transfers specialized knowledge across models via few-example distillation when data unavailable.
    Tags: knowledge-distillation, model-transfer, fine-tuning, LoRA, data-privacy, LLMs

AI Paper Insight Brief

2026-03-27

0) Executive takeaways (read this first)

  • Agent “control planes” are becoming the new security perimeter—and composition is the brittle point. Formal conformance checking across agent protocols finds many spec/implementation violations, and composition breaks properties that hold in isolation (20/21 composition-safety invariant violations in composed models).
  • Outcome-only evaluation is increasingly misleading for agents. Trace-level compliance checking finds that even “perfect outcome” traces can hide procedural violations (e.g., 83% of Claude traces with perfect τ2 reward still had at least one violation).
  • Long-horizon reliability bottlenecks are shifting from “tool execution” to “state/memory correctness.” In an executable in-vehicle benchmark, errors are dominated by memory construction/retrieval (63.9% of errors), with large drops from gold-memory to autonomous-memory settings.
  • For planning, “LLM as planner” still doesn’t scale; “LLM as formalizer/extractor” does. Multiple papers converge on constraining the LLM to information extraction / formalization and delegating search/verification to symbolic solvers, yielding large success-rate gains on large BlocksWorld and IPC/household planning.
  • Alignment interventions can degrade safety instrumentation. Two distinct “alignment side effects” show up: (i) response homogenization can break sampling-based uncertainty (high single-cluster rates), and (ii) activation steering vectors can substantially increase jailbreak success depending on overlap with refusal directions.
  • Multimodal safety is facing a paradigm shift. MLLM-based image generators produce higher unsafe-image rates than diffusion baselines and are harder for existing detectors to flag unless detectors are retrained on paradigm-inclusive data; multi-view LVLMs show a distinct hallucination mode with a decoding-time mitigation.
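
The uncertainty-collapse point is mechanical enough to sketch: semantic entropy is computed over clusters of sampled answers, so when alignment homogenizes the samples into a single cluster the estimate drops to zero regardless of what the model actually knows. A minimal illustration, where normalized string equality stands in for the NLI-based clustering that real implementations use:

```python
import math
from collections import Counter

def semantic_entropy(samples, same_meaning=None):
    """Cluster sampled answers with a meaning-equivalence predicate, then
    take the entropy of the cluster distribution. Real implementations
    cluster with bidirectional NLI; string equality stands in here."""
    if same_meaning is None:
        same_meaning = lambda a, b: a.strip(".! ").lower() == b.strip(".! ").lower()
    reps, counts = [], Counter()
    for s in samples:
        for i, rep in enumerate(reps):
            if same_meaning(s, rep):
                counts[i] += 1
                break
        else:
            reps.append(s)
            counts[len(reps) - 1] += 1
    n = len(samples)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# Diverse base-model samples keep the uncertainty signal alive...
h_diverse = semantic_entropy(["Paris", "paris", "Lyon", "Marseille"])
# ...while homogenized aligned-model samples collapse to one cluster.
h_homogenized = semantic_entropy(["Paris.", "Paris.", "Paris.", "Paris."])
```

With all samples in one cluster the entropy is exactly zero, which is why sampling-based estimators stop discriminating between confident-correct and confidently-wrong answers.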

1) Key themes (clusters)

Theme: Agent protocol security & supply-chain style injection

Theme: Trace-level auditing beats outcome-only scoring for agents

Theme: Memory & longitudinal belief dynamics as the next reliability frontier

  • Why it matters: Persistent agents must manage evolving preferences/beliefs across sessions and users; failures manifest as drift, contradiction mishandling, or incorrect state updates.
  • Representative papers: BeliefShift (2603.23848); VehicleMemBench (2603.23840)
  • Common approach:
    • Long-horizon, multi-session trajectories with objective metrics (state-based evaluation; belief-state vectors).
    • Separate “gold memory” vs autonomous memory construction to isolate the bottleneck.
    • Measure stability–adaptability trade-offs (evidence-driven revision vs drift resistance).
  • Open questions / failure modes:
    • Retrieval helps recall but may not prevent model-induced nudging (RAG improved revision/CRR but barely changed drift coherence).
    • Hard cases: conditional constraints and multi-user conflict resolution in executable environments.
    • Scaling contradiction-resolution evaluation beyond human scoring.
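
The gold-vs-autonomous separation above reduces to a simple harness: score episodes by exact final-state match and report the error breakdown alongside the success rate. A toy sketch (field names and the exact-state-match analogue are illustrative, not VehicleMemBench's actual schema):

```python
def exact_state_match(gold: dict, final: dict) -> bool:
    # An episode succeeds only if the final environment state equals the
    # gold state exactly (toy analogue of state-based ESM scoring).
    return gold == final

def score_episodes(episodes):
    # episodes: [{'gold': ..., 'final': ..., 'error_type': 'memory'|'tool'|None}]
    # All field names here are invented for illustration.
    n = len(episodes)
    esm = sum(exact_state_match(e["gold"], e["final"]) for e in episodes) / n
    errs = [e["error_type"] for e in episodes if e["error_type"]]
    breakdown = {t: errs.count(t) / len(errs) for t in set(errs)} if errs else {}
    return esm, breakdown

episodes = [
    {"gold": {"temp": 21}, "final": {"temp": 21}, "error_type": None},
    {"gold": {"temp": 21}, "final": {"temp": 18}, "error_type": "memory"},
    {"gold": {"seat": "fwd"}, "final": {"seat": "back"}, "error_type": "memory"},
    {"gold": {"volume": 5}, "final": {"volume": 4}, "error_type": "tool"},
]
esm, breakdown = score_episodes(episodes)
```

Running the same harness once with gold memory injected and once with the agent's own memory isolates whether failures come from memory construction/retrieval or from tool execution.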

Theme: Constrain LLMs to formalization/extraction; let solvers verify/search

  • Why it matters: End-to-end LLM planning degrades sharply with combinatorial complexity; formalization + solver pipelines improve scalability and verifiability.
  • Representative papers: Language Model Planners do not Scale, but do Formalizers? (2603.23844); DUPLEX (2603.23909)
  • Common approach:
    • Use LLMs for schema-guided extraction or translation to solver-friendly representations (e.g., PDDL).
    • Add deterministic mapping/validation layers to avoid brittle codegen.
    • Use failure-triggered reflection/repair loops driven by solver diagnostics.
  • Open questions / failure modes:
    • Domain dependence (results concentrated in BlocksWorld / PDDL-authored domains).
    • “Unraveling” compression: NL descriptions that expand to huge formal specs; higher-order generator approaches help but need broader validation.
    • Latency of iterative repair loops in real-time embodied settings.
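
The formalize-solve-repair loop shared by these papers can be sketched with stubs: `formalize` stands in for an LLM translating the task into a solver-facing spec, and `solve` for a symbolic planner that returns diagnostics on failure. Both are toy stand-ins, not the papers' actual systems:

```python
def formalize(nl_task, feedback=None):
    # Stand-in for an LLM formalizer: extract goal objects into a
    # solver-facing spec. The feedback branch mimics failure-triggered,
    # targeted repair: drop goals the solver flagged as unknown.
    goals = [w.strip(".,") for w in nl_task.split() if w.startswith("block")]
    if feedback:
        goals = [g for g in goals if g not in feedback["unknown"]]
    return {"goals": goals}

def solve(spec, world):
    # Stand-in for a symbolic planner: fail with a diagnostic if any
    # goal names an object the world does not contain.
    unknown = [g for g in spec["goals"] if g not in world]
    if unknown:
        return None, {"unknown": unknown}
    return [f"move({g})" for g in spec["goals"]], None

def plan_with_repair(nl_task, world, max_rounds=3):
    feedback = None
    for _ in range(max_rounds):
        plan, diag = solve(formalize(nl_task, feedback), world)
        if plan is not None:
            return plan
        feedback = diag  # solver diagnostics drive the next repair round
    return None

world = {"block1", "block2"}
plan = plan_with_repair("stack block1 on block2 near blockX", world)
```

The key design point is that the search and the verdict live in the deterministic `solve`, while the LLM only proposes and repairs specs, which is what makes the pipeline scale and its output checkable.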

Theme: Multimodal hallucination & uncertainty—benchmarks + training-free mitigations

Theme: Alignment side effects: uncertainty collapse, steering risks, and “encoded vs expressed” bias

2) Technical synthesis

  • Formal invariants + executable replay is emerging as a practical security workflow: AgentRFC ties prose specs → typed IR → TLA+ invariants → counterexample traces → SDK-level tests, bridging “paper security” to implementation failures.
  • Composition is the recurring Achilles’ heel across both protocols and agents: protocol bridges break invariants; agent systems combine tools/memory/judges where local correctness doesn’t imply global safety.
  • “Constrain the LLM” is a cross-domain pattern: DUPLEX confines LLMs to schema-guided IE; BlocksWorld work uses LLM-as-formalizer/higher-order generator; VLC decouples perception from exact symbolic execution.
  • Evaluation is moving from single outputs to trajectories and distributions: AgentPex scores multi-evaluator compliance; BeliefShift evaluates belief-state sequences; SCoOP builds system-level distributions from sampled outputs.
  • RAG helps recall but not necessarily “anti-drift”: BeliefShift shows RAG improves belief revision and contradiction handling but barely changes drift coherence; policy RAG shows retrieval metric gains don’t guarantee better answers.
  • Training-free inference-time interventions are gaining traction: RSCD (attention-mask contrastive decoding) and SCoOP (entropy-weighted pooling) improve robustness without retraining, but rely on architectural assumptions and sampling cost.
  • Alignment can undermine common safety heuristics: response homogenization breaks semantic-entropy/self-consistency; activation steering can increase jailbreak ASR depending on geometric overlap with refusal directions.
  • Benchmarks are becoming more executable and state-based (VehicleMemBench) and more diagnostic via distractors and paired designs (GameplayQA, MVH-Bench), reducing reliance on subjective judging.
  • Automated “research agents” are now a security factor: Claudini shows LLM-driven algorithm search can materially improve white-box jailbreak optimizers, raising the bar for defense evaluation.
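
The distributional-pooling direction (SCoOP and kin) can be illustrated with entropy-weighted opinion pooling, where confident models dominate the pooled answer distribution. The exact weighting rule below is illustrative, not SCoOP's:

```python
import math

def entropy(dist):
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def entropy_weighted_pool(dists, eps=1e-6):
    # Weight each model's answer distribution by inverse entropy so that
    # confident models dominate the pooled "opinion". A high pooled
    # entropy can then trigger abstention.
    weights = [1.0 / (entropy(d) + eps) for d in dists]
    z = sum(weights)
    labels = set().union(*dists)
    return {l: sum(w * d.get(l, 0.0) for w, d in zip(weights, dists)) / z
            for l in labels}

pooled = entropy_weighted_pool([
    {"cat": 0.90, "dog": 0.10},   # confident model
    {"cat": 0.85, "dog": 0.15},   # confident model
    {"cat": 0.40, "dog": 0.60},   # near-uniform model gets little weight
])
```

Because the near-uniform model carries little weight, the pooled distribution follows the two confident models even though a plain average would be pulled toward "dog".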

3) Top 5 papers (with “why now”)

1) AgentRFC: Security Design Principles and Conformance Testing for Agent Protocols

  • Provides a 6-layer Agent Protocol Stack to make protocol completeness explicit across MCP/A2A/ANP/ACP.
  • Defines 11 agent-agnostic security principles as TLA+ invariants with a taxonomy separating spec-mandated vs hardening.
  • Delivers an end-to-end spec→IR→TLA+→counterexample→SDK test pipeline; finds 33 spec-level violations and confirms implementation violations via 42 tests.
  • Why now: agent protocols are rapidly deployed and composed; the paper shows composition can break properties (20/21 CS invariant violations).
  • Skeptical about: bounded model checking (small bounds) and manual normative-clause extraction.
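
AgentRFC's invariant-to-counterexample pipeline has a lightweight analogue outside TLA+: replay a trace against an executable invariant and report the first violating step. A sketch with an invented consent invariant (the event names and state are illustrative, not from the paper):

```python
def check_invariant(trace, invariant):
    # Replay an event trace, updating state, and return the index of the
    # first event where the invariant fails (the counterexample), else None.
    state = {"consented": False}
    for i, event in enumerate(trace):
        if event == "grant_consent":
            state["consented"] = True
        if not invariant(state, event):
            return i
    return None

# Invariant (invented for illustration): tools may only be called
# after the user has granted consent.
no_unconsented_call = lambda state, event: event != "tool_call" or state["consented"]

ok_trace = ["grant_consent", "tool_call", "tool_call"]
bad_trace = ["tool_call", "grant_consent", "tool_call"]
ok_result = check_invariant(ok_trace, no_unconsented_call)
bad_result = check_invariant(bad_trace, no_unconsented_call)
```

The value of the counterexample index is that it converts an abstract property violation into a concrete, replayable SDK-level test case.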

2) Willful Disobedience: Automatically Detecting Failures in Agentic Traces

  • Introduces AgentPex: explicit rule extraction from prompts/tool schemas + multi-evaluator trace auditing with gated-min aggregation.
  • Shows outcome-only success can hide failures: 83% of “perfect reward” Claude traces still had procedural violations.
  • Surfaces model-specific procedural failure modes (e.g., simultaneous text+tool-call violations).
  • Why now: production agents generate massive traces; automated procedural auditing is becoming necessary.
  • Skeptical about: cost (multiple LLM calls per trace) and benchmark/domain generality (τ2-bench focus).
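
One plausible reading of the gated-min aggregation (the paper's exact rule may differ): hard gates zero out a trace on failure, and otherwise the worst soft evaluator score wins, so a single procedural violation cannot be averaged away by an otherwise perfect trace:

```python
def gated_min(soft_scores, gates):
    # Hard 'gate' evaluators (e.g., safety checks) zero the trace outright;
    # otherwise the minimum soft score dominates, so one procedural
    # violation cannot be diluted by other high scores.
    if not all(gates):
        return 0.0
    return min(soft_scores)

# Perfect outcome score, one procedural evaluator at 0.6: trace scores 0.6.
mixed = gated_min([1.0, 0.6, 0.9], gates=[True, True])
# Any hard-gate failure dominates regardless of soft scores.
gated = gated_min([1.0, 1.0], gates=[True, False])
```

This is exactly the aggregation shape needed to surface the "perfect reward but procedural violation" traces the paper reports.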

3) VehicleMemBench: An Executable Benchmark for Multi-User Long-Term Memory in In-Vehicle Agents

  • Executable simulator + state-based evaluation for multi-user preference evolution with 111 APIs.
  • Quantifies the gold-memory → autonomous-memory performance drop (e.g., 90.60 → 64.80 ESM for a strong model under one memory method).
  • Finds memory errors dominate (63.9% of errors), not tool execution.
  • Why now: memory is the bottleneck for long-horizon agents; this gives an objective harness.
  • Skeptical about: scenario scope and extension to longer/more complex real driving contexts.

4) Language Model Planners do not Scale, but do Formalizers?

  • Controlled scaling study: LLM-as-planner degrades to ~20% by ~30 blocks, while formalizers scale much better (e.g., one model at 100% up to 100 blocks).
  • Practical techniques: divide-and-conquer formalization and higher-order formalizers that output generator programs for “unraveling” inputs.
  • Why now: planning is central to agents; this clarifies where LLMs should sit in the stack.
  • Skeptical about: domain narrowness (BlocksWorld) and single-run evaluation / solver-crash handling.

5) The Alignment Tax: Response Homogenization in Aligned LLMs and Its Implications for Uncertainty Estimation

  • Diagnoses response homogenization (high single-cluster rates) that makes sampling-based uncertainty collapse on many queries.
  • Attributes much of the effect to preference optimization (DPO) via stage-wise ablations and base-vs-instruct comparisons.
  • Proposes UCBD, a cheapest-first cascade that resolves many queries with cheap signals (token entropy) and escalates selectively.
  • Why now: many safety stacks rely on self-consistency/semantic entropy; this shows when they fail structurally.
  • Skeptical about: limited to open 3B–14B families and imperfect judge labels; cascade lacks formal guarantees.
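
The cascade idea can be sketched as a two-threshold router: cheap token entropy resolves clearly confident and clearly uncertain queries, and only the ambiguous middle band pays for an expensive check. The thresholds and escalation interface below are illustrative, not UCBD's actual values:

```python
import math

def token_entropy(token_probs):
    # Mean per-token entropy from a single decoding pass: the cheap signal.
    ents = [-sum(p * math.log(p) for p in step if p > 0) for step in token_probs]
    return sum(ents) / len(ents)

def cascade_uncertainty(token_probs, expensive_check, low=0.1, high=1.5):
    # Cheapest-first routing: clearly confident / clearly uncertain queries
    # are resolved from token entropy alone; only the ambiguous middle
    # band escalates to the costly check (thresholds are illustrative).
    h = token_entropy(token_probs)
    if h <= low:
        return "confident", h
    if h >= high:
        return "uncertain", h
    return expensive_check(), h

calls = []
expensive = lambda: calls.append(1) or "uncertain"  # e.g. a sampling-based check

confident_verdict, _ = cascade_uncertainty([[0.99, 0.01], [0.98, 0.02]], expensive)
ambiguous_verdict, _ = cascade_uncertainty([[0.7, 0.3], [0.6, 0.4]], expensive)
```

Only the ambiguous query triggers the expensive path, which is the whole economic argument for the cascade.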

4) Practical next steps

  • If you ship agent protocols: adopt an APS-style checklist and add composition tests (bridge/proxy scenarios) as first-class CI artifacts; treat “holds in isolation” as insufficient.
  • For MCP/tool ecosystems: assume structured tool outputs are an injection channel; add provenance, consent enforcement, and audit completeness checks, and test against adaptive payload generation (TIP-like).
  • For agent evaluation: add trace-level compliance scoring (explicit rule extraction + forbidden edges + argument checks) alongside outcome metrics; track “procedural violation rate among successful outcomes.”
  • For long-horizon assistants: measure memory as a stateful system (gold vs autonomous memory), and explicitly report memory-error breakdowns; prioritize conditional-constraint and conflict cases.
  • For planning/automation: refactor stacks so LLMs do schema-guided extraction/formalization and deterministic mapping; use solver diagnostics to trigger targeted repair loops rather than free-form replanning.
  • For uncertainty/abstention: don’t rely solely on sampling-based semantic entropy; add cheap single-pass signals (e.g., token entropy) and route to heavier checks only when needed.
  • For multimodal deployments: retrain or at least validate fake-image detectors on MLLM-generated images (paradigm-inclusive data), and add multi-view-specific evaluations (paired-view grounding) if you use multi-camera inputs.
  • For alignment interventions (steering, adapters): include jailbreak ASR regression tests as part of release gating; check geometric overlap with refusal directions and evaluate whether steering increases ASR under simple templates.
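
The steering-overlap check in the last bullet can be as simple as a cosine test against an estimated refusal direction before release. The threshold below is illustrative, and in practice the refusal direction would be estimated from contrastive activations rather than hand-written:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def steering_release_gate(steering_vec, refusal_dir, max_overlap=0.3):
    # Flag steering vectors whose absolute cosine overlap with the
    # refusal direction is large; these are the candidates most likely
    # to move jailbreak ASR. The threshold is illustrative.
    overlap = abs(cosine(steering_vec, refusal_dir))
    return overlap <= max_overlap, overlap

refusal = [1.0, 0.0, 0.0]  # estimated refusal direction (toy 3-d example)
benign_ok, _ = steering_release_gate([0.1, 1.0, 0.2], refusal)
risky_ok, _ = steering_release_gate([-0.9, 0.2, 0.1], refusal)
```

A geometric pre-screen like this is cheap enough to run on every candidate steering vector, with the full jailbreak ASR regression suite reserved for the flagged ones.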

Generated from per-paper analyses; no external browsing.