Daily AI Paper Report (2026-03-27)

Published:

Chinese version: [中文]

Run stats

  • Candidates: 216
  • Selected: 30
  • Deepread completed: 30
  • Window (UTC): 2026-03-25T00:00:00Z → 2026-03-26T00:00:00Z (arxiv_announce, expanded=0)
Selected papers
  • 2603.24511 · Claudini: Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs [PDF]
    Categories: cs.LG, cs.AI, cs.CR | Score: 96
    Why: Autonomous autoresearch finds stronger jailbreak/prompt-injection attack algorithms; big eval gains vs 30+ baselines
    Tags: agentic-research, jailbreaks, prompt-injection, adversarial-attacks, red-teaming, white-box, evaluation
  • 2603.23801 · AgentRFC: Security Design Principles and Conformance Testing for Agent Protocols [PDF]
    Categories: cs.CR | Score: 94
    Why: Security principles + protocol stack + conformance tests for agent protocols (MCP/A2A/etc); formal invariants (TLA+)
    Tags: agent-security, protocols, MCP, formal-methods, TLA+, conformance-testing, security-principles
  • 2603.24080 · LLMpedia: A Transparent Framework to Materialize an LLM's Encyclopedic Knowledge at Scale [PDF]
    Categories: cs.CL, cs.DB | Score: 94
    Why: Scales factuality auditing via 1M generated articles; shows big gap vs MMLU-style benchmarks
    Tags: factuality, evaluation, parametric-knowledge, benchmarking, hallucinations, knowledge-auditing
  • 2603.24203 · Invisible Threats from Model Context Protocol: Generating Stealthy Injection Payload via Tree-based Adaptive Search [PDF]
    Categories: cs.CR, cs.AI | Score: 92
    Why: Black-box stealthy indirect prompt injection for MCP tool responses; adaptive search to bypass defenses
    Tags: prompt-injection, tool-security, MCP, black-box-attacks, agent-security, adversarial-search
  • 2603.23806 · Willful Disobedience: Automatically Detecting Failures in Agentic Traces [PDF]
    Categories: cs.SE, cs.AI | Score: 92
    Why: Automated compliance checking of agentic traces; catches procedural/unsafe failures beyond outcomes
    Tags: agents, trace-evaluation, oversight, specification, tool-use, safety-monitoring, auditing
  • 2603.24533 · UI-Voyager: A Self-Evolving GUI Agent Learning via Failed Experience [PDF]
    Categories: cs.LG, cs.AI, cs.CV | Score: 92
    Why: Self-evolving GUI agent w/ RFT+step-level distillation from failures; strong agentic reliability signal.
    Tags: agents, GUI-agents, self-improvement, rejection-finetuning, distillation, long-horizon, evaluation
  • 2603.24079 · When Understanding Becomes a Risk: Authenticity and Safety Risks in the Emerging Image Generation Paradigm [PDF]
    Categories: cs.CV, cs.AI, cs.CR | Score: 90
    Why: Finds MLLM image generators produce more unsafe/fake images than diffusion; important new risk surface
    Tags: multimodal, image-generation, safety, unsafe-content, misinformation, evaluation
  • 2603.23844 · Language Model Planners do not Scale, but do Formalizers? [PDF]
    Categories: cs.CL | Score: 90
    Why: Shows LLM formalizers scale on planning via solver programs; key for agent planning + verification.
    Tags: planning, program-synthesis, formalization, LLM-reasoning, solver, scaling, BlocksWorld
  • 2603.24414 · ClawKeeper: Comprehensive Safety Protection for OpenClaw Agents Through Skills, Plugins, and Watchers [PDF]
    Categories: cs.CR, cs.AI | Score: 89
    Why: Holistic runtime security for tool-using agents (skills/plugins/watchers) targeting leakage/escalation risks
    Tags: agent-runtime, sandboxing, permissions, tool-use, plugins, security-framework, data-leakage
  • 2603.24329 · GameplayQA: A Benchmarking Framework for Decision-Dense POV-Synced Multi-Video Understanding of 3D Virtual Agents [PDF]
    Categories: cs.CL, cs.AI, cs.CV | Score: 88
    Why: Decision-dense POV-synced multi-video benchmark for agent perception/reasoning in 3D multi-agent play
    Tags: benchmark, multimodal, video-understanding, agents, multi-agent, evaluation, POV
  • 2603.23934 · Revealing Multi-View Hallucination in Large Vision-Language Models [PDF]
    Categories: cs.CV, cs.AI | Score: 88
    Why: MVH-Bench exposes multi-view VLM hallucinations; adds benchmark + training-free decoding mitigation.
    Tags: VLM, hallucination, benchmark, multiview, evaluation, decoding, robustness
  • 2603.24543 · Analysing the Safety Pitfalls of Steering Vectors [PDF]
    Categories: cs.CR, cs.CL | Score: 87
    Why: Safety audit shows steering vectors can sharply raise/lower jailbreak ASR; highlights activation-steering risk surface
    Tags: activation-steering, CAA, jailbreaks, robustness, model-editing, safety-evaluation
  • 2603.24124 · The Alignment Tax: Response Homogenization in Aligned LLMs and Its Implications for Uncertainty Estimation [PDF]
    Categories: cs.LG, cs.AI, cs.CL | Score: 86
    Why: Finds RLHF/DPO causes response homogenization that breaks sampling-based uncertainty; important reliability insight
    Tags: RLHF, DPO, uncertainty, calibration, reliability, TruthfulQA, evaluation
  • 2603.23909 · DUPLEX: Agentic Dual-System Planning via LLM-Driven Information Extraction [PDF]
    Categories: cs.AI | Score: 86
    Why: Neuro-symbolic planning that confines the LLM to schema-guided extraction to reduce hallucinated plans
    Tags: agents, planning, neuro-symbolic, PDDL, reliability, hallucination-mitigation, robotics
  • 2603.23841 · PoliticsBench: Benchmarking Political Values in Large Language Models with Multi-Turn Roleplay [PDF]
    Categories: cs.CL, cs.AI | Score: 86
    Why: New multi-turn roleplay benchmark to measure political values/bias drift across major LLMs
    Tags: benchmark, bias, politics, multi-turn, evaluation, roleplay
  • 2603.23867 · Can VLMs Reason Robustly? A Neuro-Symbolic Investigation [PDF]
    Categories: cs.LG, cs.AI, cs.CV | Score: 86
    Why: Finds VLM reasoning brittle under covariate shift; neuro-symbolic angle for robust generalization.
    Tags: VLM, robustness, distribution-shift, neuro-symbolic, reasoning, evaluation
  • 2603.23848 · BeliefShift: Benchmarking Temporal Belief Consistency and Opinion Drift in LLM Agents [PDF]
    Categories: cs.CL, cs.CY | Score: 84
    Why: Longitudinal benchmark for belief consistency/drift and evidence-driven revision in multi-session LLM agents
    Tags: benchmarks, agent-memory, longitudinal-eval, belief-dynamics, consistency, over-alignment
  • 2603.24582 · The Stochastic Gap: A Markovian Framework for Pre-Deployment Reliability and Oversight-Cost Auditing in Agentic Artificial Intelligence [PDF]
    Categories: cs.AI | Score: 84
    Why: Markov framework to audit reliability vs oversight cost for stochastic agent workflows pre-deployment
    Tags: agentic-systems, reliability, oversight, governance, risk-metrics, workflow-auditing
  • 2603.23996 · Forensic Implications of Localized AI: Artifact Analysis of Ollama, LM Studio, and llama.cpp [PDF]
    Categories: cs.CR | Score: 84
    Why: Forensic artifact study of local LLM runners; relevant to auditing, incident response, and misuse.
    Tags: security, forensics, local-LLMs, ollama, llama.cpp, LM-Studio, auditability
  • 2603.24440 · CUA-Suite: Massive Human-annotated Video Demonstrations for Computer-Use Agents [PDF]
    Categories: cs.LG, cs.AI, cs.CV | Score: 83
    Why: Large human-annotated continuous video demos for computer-use agents; likely high-leverage dataset for agent training
    Tags: computer-use-agents, datasets, demonstrations, video, tool-use, agent-training, evaluation
  • 2603.24125 · Alignment Reduces Expressed but Not Encoded Gender Bias: A Unified Framework and Study [PDF]
    Categories: cs.CL | Score: 83
    Why: Finds alignment reduces expressed but not encoded gender bias; unified intrinsic/extrinsic analysis
    Tags: bias, alignment, representation, fairness, evaluation, interpretability
  • 2603.23990 · From Untamed Black Box to Interpretable Pedagogical Orchestration: The Ensemble of Specialized LLMs Architecture for Adaptive Tutoring [PDF]
    Categories: cs.CY, cs.AI | Score: 82
    Why: Interpretable orchestrator + specialized LLMs for tutoring; improves controllability and constraint adherence
    Tags: agent-architecture, controllability, education, orchestration, reliability, governance
  • 2603.23878 · The Luna Bound Propagator for Formal Analysis of Neural Networks [PDF]
    Categories: cs.LG, cs.AI, cs.LO | Score: 82
    Why: C++ alpha-CROWN bound propagator for NN verification; improves deployability of formal methods.
    Tags: verification, alpha-CROWN, robustness, formal-methods, tooling, C++
  • 2603.24580 · Retrieval Improvements Do Not Guarantee Better Answers: A Study of RAG for AI Policy QA [PDF]
    Categories: cs.CL, cs.AI, cs.CY, cs.IR, cs.LG | Score: 81
    Why: Shows better retrieval may not improve answers in AI policy QA; domain RAG study on AGORA with preferences/DPO
    Tags: RAG, evaluation, AI-policy, retrieval, grounding, preference-learning, DPO
  • 2603.24586 · Comparing Developer and LLM Biases in Code Evaluation [PDF]
    Categories: cs.SE, cs.CL | Score: 81
    Why: TRACE measures LLM-judge vs developer preference gaps; extracts rubric biases across coding settings
    Tags: evaluation, LLM-judges, human-preferences, bias, code, rubrics, reliability
  • 2603.23889 · Off-Policy Safe Reinforcement Learning with Constrained Optimistic Exploration [PDF]
    Categories: cs.LG, cs.RO | Score: 80
    Why: Off-policy safe RL with constrained optimistic exploration; targets constraint violations directly.
    Tags: safe-RL, constraints, off-policy, exploration, robotics, reliability
  • 2603.23853 · SCoOP: Semantic Consistent Opinion Pooling for Uncertainty Quantification in Multiple Vision-Language Model Systems [PDF]
    Categories: cs.AI, cs.MA | Score: 79
    Why: Training-free uncertainty pooling across multiple VLMs improves hallucination detection/abstention
    Tags: uncertainty, hallucination-detection, VLM, ensembles, abstention, calibration
  • 2603.23840 · VehicleMemBench: An Executable Benchmark for Multi-User Long-Term Memory in In-Vehicle Agents [PDF]
    Categories: cs.AI, cs.CL | Score: 78
    Why: Executable benchmark for multi-user long-term memory + tool interaction in in-vehicle agents; tests conflicts over time
    Tags: benchmarks, agent-memory, long-context, tool-use, multi-user, simulation, reliability
  • 2603.24282 · Software Supply Chain Smells: Lightweight Analysis for Secure Dependency Management [PDF]
    Categories: cs.SE, cs.CR | Score: 78
    Why: Lightweight tool to detect supply-chain security 'smells' in Maven/NPM; practical security signal
    Tags: security, software-supply-chain, dependency-management, tooling, risk-detection
  • 2603.24518 · TuneShift-KD: Knowledge Distillation and Transfer for Fine-tuned Models [PDF]
    Categories: cs.LG | Score: 78
    Why: Transfers specialized knowledge across models via few-example distillation when data unavailable.
    Tags: knowledge-distillation, model-transfer, fine-tuning, LoRA, data-privacy, LLMs

AI Paper Insight Brief

2026-03-27

0) Executive takeaways (read this first)

  • Agent “control planes” are becoming the new security perimeter—and composition is the brittle point. Formal conformance checking across agent protocols finds many spec/implementation violations, and composition breaks properties that hold in isolation (20/21 composition-safety invariant violations in composed models).
  • Outcome-only evaluation is increasingly misleading for agents. Trace-level compliance checking finds that even “perfect outcome” traces can hide procedural violations (e.g., 83% of Claude traces with perfect τ2 reward still had at least one violation).
  • Long-horizon reliability bottlenecks are shifting from “tool execution” to “state/memory correctness.” In an executable in-vehicle benchmark, errors are dominated by memory construction/retrieval (63.9% of errors), with large drops from gold-memory to autonomous-memory settings.
  • For planning, “LLM as planner” still doesn’t scale; “LLM as formalizer/extractor” does. Multiple papers converge on constraining the LLM to information extraction / formalization and delegating search/verification to symbolic solvers, yielding large success-rate gains on large BlocksWorld and IPC/household planning.
  • Alignment interventions can degrade safety instrumentation. Two distinct “alignment side effects” show up: (i) response homogenization can break sampling-based uncertainty (high single-cluster rates), and (ii) activation steering vectors can substantially increase jailbreak success depending on overlap with refusal directions.
  • Multimodal safety is facing a paradigm shift. MLLM-based image generators produce higher unsafe-image rates than diffusion baselines and are harder for existing detectors to flag unless detectors are retrained on paradigm-inclusive data; multi-view LVLMs show a distinct hallucination mode with a decoding-time mitigation.
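
The uncertainty-collapse point is mechanical enough to sketch: semantic entropy is computed over clusters of sampled answers, so when alignment homogenizes the samples into a single cluster the estimate drops to zero regardless of what the model actually knows. A minimal illustration, where normalized string equality stands in for the NLI-based clustering that real implementations use:

```python
import math
from collections import Counter

def semantic_entropy(samples, same_meaning=None):
    """Cluster sampled answers with a meaning-equivalence predicate, then
    take the entropy of the cluster distribution. Real implementations
    cluster with bidirectional NLI; string equality stands in here."""
    if same_meaning is None:
        same_meaning = lambda a, b: a.strip(".! ").lower() == b.strip(".! ").lower()
    reps, counts = [], Counter()
    for s in samples:
        for i, rep in enumerate(reps):
            if same_meaning(s, rep):
                counts[i] += 1
                break
        else:
            reps.append(s)
            counts[len(reps) - 1] += 1
    n = len(samples)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# Diverse base-model samples keep the uncertainty signal alive...
h_diverse = semantic_entropy(["Paris", "paris", "Lyon", "Marseille"])
# ...while homogenized aligned-model samples collapse to one cluster.
h_homogenized = semantic_entropy(["Paris.", "Paris.", "Paris.", "Paris."])
```

With all samples in one cluster the entropy is exactly zero, which is why sampling-based estimators stop discriminating between confident-correct and confidently-wrong answers.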

1) Key themes (clusters)

Theme: Agent protocol security & supply-chain style injection

Theme: Trace-level auditing beats outcome-only scoring for agents

Theme: Memory & longitudinal belief dynamics as the next reliability frontier

  • Why it matters: Persistent agents must manage evolving preferences/beliefs across sessions and users; failures manifest as drift, contradiction mishandling, or incorrect state updates.
  • Representative papers: BeliefShift (2603.23848); VehicleMemBench (2603.23840)
  • Common approach:
    • Long-horizon, multi-session trajectories with objective metrics (state-based evaluation; belief-state vectors).
    • Separate “gold memory” vs autonomous memory construction to isolate the bottleneck.
    • Measure stability–adaptability trade-offs (evidence-driven revision vs drift resistance).
  • Open questions / failure modes:
    • Retrieval helps recall but may not prevent model-induced nudging (RAG improved revision/CRR but barely changed drift coherence).
    • Hard cases: conditional constraints and multi-user conflict resolution in executable environments.
    • Scaling contradiction-resolution evaluation beyond human scoring.
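
The gold-vs-autonomous separation above reduces to a simple harness: score episodes by exact final-state match and report the error breakdown alongside the success rate. A toy sketch (field names and the exact-state-match analogue are illustrative, not VehicleMemBench's actual schema):

```python
def exact_state_match(gold: dict, final: dict) -> bool:
    # An episode succeeds only if the final environment state equals the
    # gold state exactly (toy analogue of state-based ESM scoring).
    return gold == final

def score_episodes(episodes):
    # episodes: [{'gold': ..., 'final': ..., 'error_type': 'memory'|'tool'|None}]
    # All field names here are invented for illustration.
    n = len(episodes)
    esm = sum(exact_state_match(e["gold"], e["final"]) for e in episodes) / n
    errs = [e["error_type"] for e in episodes if e["error_type"]]
    breakdown = {t: errs.count(t) / len(errs) for t in set(errs)} if errs else {}
    return esm, breakdown

episodes = [
    {"gold": {"temp": 21}, "final": {"temp": 21}, "error_type": None},
    {"gold": {"temp": 21}, "final": {"temp": 18}, "error_type": "memory"},
    {"gold": {"seat": "fwd"}, "final": {"seat": "back"}, "error_type": "memory"},
    {"gold": {"volume": 5}, "final": {"volume": 4}, "error_type": "tool"},
]
esm, breakdown = score_episodes(episodes)
```

Running the same harness once with gold memory injected and once with the agent's own memory isolates whether failures come from memory construction/retrieval or from tool execution.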

Theme: Constrain LLMs to formalization/extraction; let solvers verify/search

  • Why it matters: End-to-end LLM planning degrades sharply with combinatorial complexity; formalization + solver pipelines improve scalability and verifiability.
  • Representative papers: Language Model Planners do not Scale, but do Formalizers? (2603.23844); DUPLEX (2603.23909)
  • Common approach:
    • Use LLMs for schema-guided extraction or translation to solver-friendly representations (e.g., PDDL).
    • Add deterministic mapping/validation layers to avoid brittle codegen.
    • Use failure-triggered reflection/repair loops driven by solver diagnostics.
  • Open questions / failure modes:
    • Domain dependence (results concentrated in BlocksWorld / PDDL-authored domains).
    • “Unraveling” compression: NL descriptions that expand to huge formal specs; higher-order generator approaches help but need broader validation.
    • Latency of iterative repair loops in real-time embodied settings.
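
The formalize-solve-repair loop shared by these papers can be sketched with stubs: `formalize` stands in for an LLM translating the task into a solver-facing spec, and `solve` for a symbolic planner that returns diagnostics on failure. Both are toy stand-ins, not the papers' actual systems:

```python
def formalize(nl_task, feedback=None):
    # Stand-in for an LLM formalizer: extract goal objects into a
    # solver-facing spec. The feedback branch mimics failure-triggered,
    # targeted repair: drop goals the solver flagged as unknown.
    goals = [w.strip(".,") for w in nl_task.split() if w.startswith("block")]
    if feedback:
        goals = [g for g in goals if g not in feedback["unknown"]]
    return {"goals": goals}

def solve(spec, world):
    # Stand-in for a symbolic planner: fail with a diagnostic if any
    # goal names an object the world does not contain.
    unknown = [g for g in spec["goals"] if g not in world]
    if unknown:
        return None, {"unknown": unknown}
    return [f"move({g})" for g in spec["goals"]], None

def plan_with_repair(nl_task, world, max_rounds=3):
    feedback = None
    for _ in range(max_rounds):
        plan, diag = solve(formalize(nl_task, feedback), world)
        if plan is not None:
            return plan
        feedback = diag  # solver diagnostics drive the next repair round
    return None

world = {"block1", "block2"}
plan = plan_with_repair("stack block1 on block2 near blockX", world)
```

The key design point is that the search and the verdict live in the deterministic `solve`, while the LLM only proposes and repairs specs, which is what makes the pipeline scale and its output checkable.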

Theme: Multimodal hallucination & uncertainty—benchmarks + training-free mitigations

Theme: Alignment side effects: uncertainty collapse, steering risks, and “encoded vs expressed” bias

2) Technical synthesis

  • Formal invariants + executable replay is emerging as a practical security workflow: AgentRFC ties prose specs → typed IR → TLA+ invariants → counterexample traces → SDK-level tests, bridging “paper security” to implementation failures.
  • Composition is the recurring Achilles’ heel across both protocols and agents: protocol bridges break invariants; agent systems combine tools/memory/judges where local correctness doesn’t imply global safety.
  • “Constrain the LLM” is a cross-domain pattern: DUPLEX confines LLMs to schema-guided IE; BlocksWorld work uses LLM-as-formalizer/higher-order generator; VLC decouples perception from exact symbolic execution.
  • Evaluation is moving from single outputs to trajectories and distributions: AgentPex scores multi-evaluator compliance; BeliefShift evaluates belief-state sequences; SCoOP builds system-level distributions from sampled outputs.
  • RAG helps recall but not necessarily “anti-drift”: BeliefShift shows RAG improves belief revision and contradiction handling but barely changes drift coherence; policy RAG shows retrieval metric gains don’t guarantee better answers.
  • Training-free inference-time interventions are gaining traction: RSCD (attention-mask contrastive decoding) and SCoOP (entropy-weighted pooling) improve robustness without retraining, but rely on architectural assumptions and sampling cost.
  • Alignment can undermine common safety heuristics: response homogenization breaks semantic-entropy/self-consistency; activation steering can increase jailbreak ASR depending on geometric overlap with refusal directions.
  • Benchmarks are becoming more executable and state-based (VehicleMemBench) and more diagnostic via distractors and paired designs (GameplayQA, MVH-Bench), reducing reliance on subjective judging.
  • Automated “research agents” are now a security factor: Claudini shows LLM-driven algorithm search can materially improve white-box jailbreak optimizers, raising the bar for defense evaluation.
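
The distributional-pooling direction (SCoOP and kin) can be illustrated with entropy-weighted opinion pooling, where confident models dominate the pooled answer distribution. The exact weighting rule below is illustrative, not SCoOP's:

```python
import math

def entropy(dist):
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def entropy_weighted_pool(dists, eps=1e-6):
    # Weight each model's answer distribution by inverse entropy so that
    # confident models dominate the pooled "opinion". A high pooled
    # entropy can then trigger abstention.
    weights = [1.0 / (entropy(d) + eps) for d in dists]
    z = sum(weights)
    labels = set().union(*dists)
    return {l: sum(w * d.get(l, 0.0) for w, d in zip(weights, dists)) / z
            for l in labels}

pooled = entropy_weighted_pool([
    {"cat": 0.90, "dog": 0.10},   # confident model
    {"cat": 0.85, "dog": 0.15},   # confident model
    {"cat": 0.40, "dog": 0.60},   # near-uniform model gets little weight
])
```

Because the near-uniform model carries little weight, the pooled distribution follows the two confident models even though a plain average would be pulled toward "dog".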

3) Top 5 papers (with “why now”)

1) AgentRFC: Security Design Principles and Conformance Testing for Agent Protocols

  • Provides a 6-layer Agent Protocol Stack to make protocol completeness explicit across MCP/A2A/ANP/ACP.
  • Defines 11 agent-agnostic security principles as TLA+ invariants with a taxonomy separating spec-mandated vs hardening.
  • Delivers an end-to-end spec→IR→TLA+→counterexample→SDK test pipeline; finds 33 spec-level violations and confirms implementation violations via 42 tests.
  • Why now: agent protocols are rapidly deployed and composed; the paper shows composition can break properties (20/21 CS invariant violations).
  • Skeptical about: bounded model checking (small bounds) and manual normative-clause extraction.
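
AgentRFC's invariant-to-counterexample pipeline has a lightweight analogue outside TLA+: replay a trace against an executable invariant and report the first violating step. A sketch with an invented consent invariant (the event names and state are illustrative, not from the paper):

```python
def check_invariant(trace, invariant):
    # Replay an event trace, updating state, and return the index of the
    # first event where the invariant fails (the counterexample), else None.
    state = {"consented": False}
    for i, event in enumerate(trace):
        if event == "grant_consent":
            state["consented"] = True
        if not invariant(state, event):
            return i
    return None

# Invariant (invented for illustration): tools may only be called
# after the user has granted consent.
no_unconsented_call = lambda state, event: event != "tool_call" or state["consented"]

ok_trace = ["grant_consent", "tool_call", "tool_call"]
bad_trace = ["tool_call", "grant_consent", "tool_call"]
ok_result = check_invariant(ok_trace, no_unconsented_call)
bad_result = check_invariant(bad_trace, no_unconsented_call)
```

The value of the counterexample index is that it converts an abstract property violation into a concrete, replayable SDK-level test case.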

2) Willful Disobedience: Automatically Detecting Failures in Agentic Traces

  • Introduces AgentPex: explicit rule extraction from prompts/tool schemas + multi-evaluator trace auditing with gated-min aggregation.
  • Shows outcome-only success can hide failures: 83% of “perfect reward” Claude traces still had procedural violations.
  • Surfaces model-specific procedural failure modes (e.g., simultaneous text+tool-call violations).
  • Why now: production agents generate massive traces; automated procedural auditing is becoming necessary.
  • Skeptical about: cost (multiple LLM calls per trace) and benchmark/domain generality (τ2-bench focus).
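
One plausible reading of the gated-min aggregation (the paper's exact rule may differ): hard gates zero out a trace on failure, and otherwise the worst soft evaluator score wins, so a single procedural violation cannot be averaged away by an otherwise perfect trace:

```python
def gated_min(soft_scores, gates):
    # Hard 'gate' evaluators (e.g., safety checks) zero the trace outright;
    # otherwise the minimum soft score dominates, so one procedural
    # violation cannot be diluted by other high scores.
    if not all(gates):
        return 0.0
    return min(soft_scores)

# Perfect outcome score, one procedural evaluator at 0.6: trace scores 0.6.
mixed = gated_min([1.0, 0.6, 0.9], gates=[True, True])
# Any hard-gate failure dominates regardless of soft scores.
gated = gated_min([1.0, 1.0], gates=[True, False])
```

This is exactly the aggregation shape needed to surface the "perfect reward but procedural violation" traces the paper reports.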

3) VehicleMemBench: An Executable Benchmark for Multi-User Long-Term Memory in In-Vehicle Agents

  • Executable simulator + state-based evaluation for multi-user preference evolution with 111 APIs.
  • Quantifies the gold-memory → autonomous-memory performance drop (e.g., 90.60 → 64.80 ESM for a strong model under one memory method).
  • Finds memory errors dominate (63.9% of errors), not tool execution.
  • Why now: memory is the bottleneck for long-horizon agents; this gives an objective harness.
  • Skeptical about: scenario scope and extension to longer/more complex real driving contexts.

4) Language Model Planners do not Scale, but do Formalizers?

  • Controlled scaling study: LLM-as-planner degrades to ~20% by ~30 blocks, while formalizers scale much better (e.g., one model at 100% up to 100 blocks).
  • Practical techniques: divide-and-conquer formalization and higher-order formalizers that output generator programs for “unraveling” inputs.
  • Why now: planning is central to agents; this clarifies where LLMs should sit in the stack.
  • Skeptical about: domain narrowness (BlocksWorld) and single-run evaluation / solver-crash handling.

5) The Alignment Tax: Response Homogenization in Aligned LLMs and Its Implications for Uncertainty Estimation

  • Diagnoses response homogenization (high single-cluster rates) that makes sampling-based uncertainty collapse on many queries.
  • Attributes much of the effect to preference optimization (DPO) via stage-wise ablations and base-vs-instruct comparisons.
  • Proposes UCBD, a cheapest-first cascade that resolves many queries with cheap signals (token entropy) and escalates selectively.
  • Why now: many safety stacks rely on self-consistency/semantic entropy; this shows when they fail structurally.
  • Skeptical about: limited to open 3B–14B families and imperfect judge labels; cascade lacks formal guarantees.
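
The cascade idea can be sketched as a two-threshold router: cheap token entropy resolves clearly confident and clearly uncertain queries, and only the ambiguous middle band pays for an expensive check. The thresholds and escalation interface below are illustrative, not UCBD's actual values:

```python
import math

def token_entropy(token_probs):
    # Mean per-token entropy from a single decoding pass: the cheap signal.
    ents = [-sum(p * math.log(p) for p in step if p > 0) for step in token_probs]
    return sum(ents) / len(ents)

def cascade_uncertainty(token_probs, expensive_check, low=0.1, high=1.5):
    # Cheapest-first routing: clearly confident / clearly uncertain queries
    # are resolved from token entropy alone; only the ambiguous middle
    # band escalates to the costly check (thresholds are illustrative).
    h = token_entropy(token_probs)
    if h <= low:
        return "confident", h
    if h >= high:
        return "uncertain", h
    return expensive_check(), h

calls = []
expensive = lambda: calls.append(1) or "uncertain"  # e.g. a sampling-based check

confident_verdict, _ = cascade_uncertainty([[0.99, 0.01], [0.98, 0.02]], expensive)
ambiguous_verdict, _ = cascade_uncertainty([[0.7, 0.3], [0.6, 0.4]], expensive)
```

Only the ambiguous query triggers the expensive path, which is the whole economic argument for the cascade.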

4) Practical next steps

  • If you ship agent protocols: adopt an APS-style checklist and add composition tests (bridge/proxy scenarios) as first-class CI artifacts; treat “holds in isolation” as insufficient.
  • For MCP/tool ecosystems: assume structured tool outputs are an injection channel; add provenance, consent enforcement, and audit completeness checks, and test against adaptive payload generation (TIP-like).
  • For agent evaluation: add trace-level compliance scoring (explicit rule extraction + forbidden edges + argument checks) alongside outcome metrics; track “procedural violation rate among successful outcomes.”
  • For long-horizon assistants: measure memory as a stateful system (gold vs autonomous memory), and explicitly report memory-error breakdowns; prioritize conditional-constraint and conflict cases.
  • For planning/automation: refactor stacks so LLMs do schema-guided extraction/formalization and deterministic mapping; use solver diagnostics to trigger targeted repair loops rather than free-form replanning.
  • For uncertainty/abstention: don’t rely solely on sampling-based semantic entropy; add cheap single-pass signals (e.g., token entropy) and route to heavier checks only when needed.
  • For multimodal deployments: retrain or at least validate fake-image detectors on MLLM-generated images (paradigm-inclusive data), and add multi-view-specific evaluations (paired-view grounding) if you use multi-camera inputs.
  • For alignment interventions (steering, adapters): include jailbreak ASR regression tests as part of release gating; check geometric overlap with refusal directions and evaluate whether steering increases ASR under simple templates.
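
The steering-overlap check in the last bullet can be as simple as a cosine test against an estimated refusal direction before release. The threshold below is illustrative, and in practice the refusal direction would be estimated from contrastive activations rather than hand-written:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def steering_release_gate(steering_vec, refusal_dir, max_overlap=0.3):
    # Flag steering vectors whose absolute cosine overlap with the
    # refusal direction is large; these are the candidates most likely
    # to move jailbreak ASR. The threshold is illustrative.
    overlap = abs(cosine(steering_vec, refusal_dir))
    return overlap <= max_overlap, overlap

refusal = [1.0, 0.0, 0.0]  # estimated refusal direction (toy 3-d example)
benign_ok, _ = steering_release_gate([0.1, 1.0, 0.2], refusal)
risky_ok, _ = steering_release_gate([-0.9, 0.2, 0.1], refusal)
```

A geometric pre-screen like this is cheap enough to run on every candidate steering vector, with the full jailbreak ASR regression suite reserved for the flagged ones.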

Generated from per-paper analyses; no external browsing.