Daily AI Paper Report (2026-03-12)

Published:

Chinese version: [中文]

Run stats

  • Candidates: 252
  • Selected: 30
  • Deepread completed: 30
  • Window (UTC): 2026-03-10T00:00:00Z → 2026-03-11T00:00:00Z (arxiv_announce, expanded=0)
1) Selected papers

  • 2603.09772 · Removing the Trigger, Not the Backdoor: Alternative Triggers and Latent Backdoors (PDF) · cs.CV, cs.CR · score 95
    Why: Shows backdoors persist via alternative triggers; defenses removing known triggers can fail
    Tags: backdoors, adversarial-ML, security, representation-learning, robustness
  • 2603.09706 · OOD-MMSafe: Advancing MLLM Safety from Harmful Intent to Hidden Consequences (PDF) · cs.AI · score 94
    Why: Consequence-driven MLLM safety benchmark; finds causal blindness and alignment ceiling in frontier models
    Tags: multimodal-safety, benchmark, agent-safety, causal-reasoning, evaluation, robustness
  • 2603.09884 · Benchmarking Political Persuasion Risks Across Frontier Large Language Models (PDF) · cs.CL, cs.CY · score 94
    Why: Large-N benchmark shows frontier LLMs beat ads at persuasion; high societal misuse relevance
    Tags: political-persuasion, misuse, evaluation, survey-experiments, frontier-models
  • 2603.09246 · Reasoning-Oriented Programming: Chaining Semantic Gadgets to Jailbreak Large Vision Language Models (PDF) · cs.CR · score 93
    Why: New jailbreak class for LVLMs via compositional reasoning; ROP-style chaining of benign premises into harm
    Tags: jailbreaks, multimodal, adversarial-attacks, compositionality, security, red-teaming
  • 2603.09046 · FlexServe: A Fast and Secure LLM Serving System for Mobile Devices with Flexible Resource Isolation (PDF) · cs.CR, cs.LG, cs.OS · score 93
    Why: Secure on-device LLM serving w/ TrustZone-style isolation; strong systems+security relevance.
    Tags: llm-serving, mobile, trusted-execution, isolation, confidential-inference, systems-security
  • 2603.09781 · CLIOPATRA: Extracting Private Information from LLM Insights (PDF) · cs.CR · score 92
    Why: Privacy attack on “privacy-preserving” LLM insights; shows realistic data insertion can induce leakage
    Tags: privacy, data-exfiltration, security, LLM-systems, auditing, attack
  • 2603.09203 · Evaluate-as-Action: Self-Evaluated Process Rewards for Retrieval-Augmented Agents (PDF) · cs.AI · score 91
    Why: Process rewards for retrieval-augmented agents; explicit eval action + GRPO variant for reliability.
    Tags: agents, rag, process-supervision, reinforcement-learning, reliability, evaluation
  • 2603.09957 · Think Before You Lie: How Reasoning Improves Honesty (PDF) · cs.AI, cs.CL, cs.LG · score 91
    Why: Finds reasoning increases LLM honesty; probes mechanisms via representation geometry
    Tags: honesty, deception, reasoning, mechanistic-analysis, behavior
  • 2603.09157 · Real-Time Trust Verification for Safe Agentic Actions using TrustBench (PDF) · cs.AI · score 90
    Why: TrustBench verifies agent actions pre-execution; shifts from post-hoc eval to real-time safety gating
    Tags: agents, runtime-guardrails, verification, evaluation, trustworthiness, tool-use
  • 2603.09036 · SCALAR: Learning and Composing Skills through LLM Guided Symbolic Planning and Deep RL Grounding (PDF) · cs.LG · score 90
    Why: LLM-guided symbolic planning + RL grounding with feedback loop; strong agent skill learning
    Tags: agents, LLM-planning, reinforcement-learning, skill-learning, tool-use
  • 2603.09875 · The Bureaucracy of Speed: Structural Equivalence Between Memory Consistency Models and Multi-Agent Authorization Revocation (PDF) · cs.MA, cs.CR, cs.DC · score 89
    Why: Agent auth revocation framed as coherence; bounds unauthorized ops under fast agent execution
    Tags: agent-security, access-control, capabilities, revocation, distributed-systems
  • 2603.09337 · Beyond Scaling: Assessing Strategic Reasoning and Rapid Decision-Making Capability of LLMs in Zero-sum Environments (PDF) · cs.CV, cs.AI · score 88
    Why: STAR benchmark tests LLMs as adversarial agents in zero-sum, real-time/turn-based settings.
    Tags: agent-evals, adversarial, multi-agent, benchmark, strategic-reasoning, red-teaming
  • 2603.09731 · EXPLORE-Bench: Egocentric Scene Prediction with Long-Horizon Reasoning (PDF) · cs.CV, cs.AI, cs.CL · score 88
    Why: New benchmark for long-horizon egocentric action→scene prediction; targets embodied MLLM limits
    Tags: benchmark, embodied-agents, multimodal, long-horizon-reasoning, evaluation
  • 2603.09134 · AgenticCyOps: Securing Multi-Agentic AI Integration in Enterprise Cyber Operations (PDF) · cs.CR, cs.MA, cs.SE · score 86
    Why: Enterprise multi-agent security framework; decomposes attack surfaces around tool orchestration & memory
    Tags: agent-security, enterprise, threat-modeling, memory, tooling, architecture
  • 2603.09309 · Rescaling Confidence: What Scale Design Reveals About LLM Metacognition (PDF) · cs.AI · score 86
    Why: Shows confidence scales distort metacognition; proposes better scale improving meta-d'
    Tags: uncertainty, calibration, metacognition, evaluation, confidence
  • 2603.09127 · Chaotic Dynamics in Multi-LLM Deliberation (PDF) · cs.AI, cs.MA · score 85
    Why: Shows multi-LLM committees can be chaotic even at T=0; empirical Lyapunov analysis of instability routes
    Tags: multi-agent, deliberation, stability, evaluation, reproducibility, dynamics
  • 2603.09065 · Learning Adaptive LLM Decoding (PDF) · cs.LG · score 85
    Why: Learns adaptive decoding policies via RL without model finetune; compute-aware inference gains
    Tags: decoding, inference, test-time, rl, efficiency
  • 2603.09452 · CyberThreat-Eval: Can Large Language Models Automate Real-World Threat Research? (PDF) · cs.CR, cs.CL · score 84
    Why: Realistic CTI/OSINT workflow benchmark for LLM threat research (triage→search→draft), beyond MCQ metrics
    Tags: cybersecurity, evaluation, agents, OSINT, benchmark, workflows
  • 2603.09435 · AI Act Evaluation Benchmark: An Open, Transparent, and Reproducible Evaluation Dataset for NLP and RAG Systems (PDF) · cs.AI · score 84
    Why: Open dataset for evaluating NLP/RAG systems against EU AI Act-style compliance requirements.
    Tags: rag, evaluation, governance, compliance, dataset, auditability
  • 2603.09297 · TA-Mem: Tool-Augmented Autonomous Memory Retrieval for LLM in Long-Term Conversational QA (PDF) · cs.IR, cs.CL · score 84
    Why: Tool-augmented autonomous memory retrieval for long-term conversational QA beyond top-k
    Tags: memory, agents, retrieval, long-context, conversational-QA
  • 2603.09906 · Thinking to Recall: How Reasoning Unlocks Parametric Knowledge in LLMs (PDF) · cs.CL · score 84
    Why: Controlled study: reasoning boosts single-hop factual recall; proposes buffer/mechanism story
    Tags: parametric-knowledge, reasoning, factuality, mechanisms, elicitation
  • 2603.09951 · Towards a Neural Debugger for Python (PDF) · cs.LG, cs.AI, cs.SE · score 83
    Why: Neural debugger trained on execution traces; enables interactive stepping/breakpoints for code LMs.
    Tags: code-llms, tooling, debugging, execution-traces, reliability, agents
  • 2603.09184 · Latent-DARM: Bridging Discrete Diffusion And Autoregressive Models For Reasoning (PDF) · cs.LG, cs.AI · score 83
    Why: Bridges diffusion planners with AR executors; improves reasoning via latent communication
    Tags: planning, diffusion-lm, agents, reasoning, coordination
  • 2603.09434 · Common Sense vs. Morality: The Curious Case of Narrative Focus Bias in LLMs (PDF) · cs.CL, cs.AI · score 82
    Why: CoMoral benchmark reveals narrative focus bias: models favor moral reasoning over commonsense consistency
    Tags: reliability, benchmark, bias, commonsense, moral-reasoning, evaluation
  • 2603.09544 · Compartmentalization-Aware Automated Program Repair (PDF) · cs.CR · score 82
    Why: LLM-based automated program repair targeting cross-compartment interface security vulnerabilities.
    Tags: cybersecurity, program-repair, llm-for-security, compartmentalization, vulnerability-mitigation
  • 2603.09296 · Diagnosing and Repairing Citation Failures in Generative Engine Optimization (PDF) · cs.IR, cs.CL · score 82
    Why: Taxonomy + agentic diagnosis/repair of citation failures in GEO; practical grounding/citations
    Tags: citations, RAG, evaluation, agentic-systems, information-retrieval
  • 2603.09970 · CREATE: Testing LLMs for Associative Creativity (PDF) · cs.CL · score 82
    Why: New benchmark for associative creativity with objective grading; useful for capability evals
    Tags: benchmark, creativity, evaluation, associative-reasoning, knowledge
  • 2603.09249 · Social-R1: Towards Human-like Social Reasoning in LLMs (PDF) · cs.AI · score 81
    Why: Adversarial social-reasoning benchmark + RL framework to reduce shortcuts in ToM-style tasks.
    Tags: alignment, social-reasoning, theory-of-mind, rl, benchmark, robustness
  • 2603.09652 · MiniAppBench: Evaluating the Shift from Text to Interactive HTML Responses in LLM-Powered Assistants (PDF) · cs.AI · score 80
    Why: MiniAppBench targets interactive HTML app generation; 500 tasks distilled from 10M+ real generations
    Tags: benchmark, code-generation, HCI, agents, web, evaluation
  • 2603.09206 · MM-Zero: Self-Evolving Multi-Model Vision Language Models From Zero Data (PDF) · cs.CV, cs.LG · score 80
    Why: Zero-data self-evolving VLM reasoning via multi-role RL framework; potentially high-impact method.
    Tags: vlm, self-improvement, rl, synthetic-data, reasoning, frontier

AI Paper Insight Brief

2026-03-12

0) Executive takeaways (read this first)

  • “Closed-loop” is the day’s dominant pattern: multiple papers move from one-shot prompting to execution feedback loops—LLM plans refined by trajectories (SCALAR), retrieval self-evaluation turned into an action with RL credit assignment (EVALACT), and human/ground-truth-in-the-loop verification for cyber threat research (CyberThreat-Eval/TRA).
  • Agent safety is shifting from “intent” to “consequences” and “runtime gating”: OOD-MMSafe/CASPO targets latent downstream hazards in multimodal settings, while TrustBench proposes sub-200ms pre-execution trust verification that reportedly cuts harmful actions by 87%.
  • Multi-agent systems have governance-grade instability even at T=0: Multi-LLM deliberation can show positive empirical Lyapunov exponents; role structure (esp. Chair) and memory windows materially change divergence (Chaotic Dynamics in Multi-LLM Deliberation).
  • Security work highlights “latent” failure surfaces: LVLM jailbreaks can be composed from benign “semantic gadgets” (VROP), backdoors persist after trigger removal via alternative triggers in feature space (Removing the Trigger, Not the Backdoor), and “privacy-preserving” insight pipelines can be exfiltrated via poisoning + prompt injection (CLIOPATRA).
  • Inference-time control is becoming a first-class optimization target: learned budget-aware decoding adapters improve Pass@1 on MATH by up to ~10.2 points (Learning Adaptive LLM Decoding), while confidence elicitation itself is shown to be scale-sensitive (best metacognitive efficiency at [0,20] vs [0,100]) (Rescaling Confidence).
  • Benchmarks are increasingly “interactive and end-to-end”: from real-time zero-sum games (STAR) to Playwright-tested HTML MiniApps (MiniAppBench) to long-horizon egocentric scene prediction (EXPLORE-Bench), evaluation is moving toward agent-like settings where latency and dynamics matter.

2) Key themes (clusters)

Theme: Closed-loop agents (feedback, verification, and credit assignment)

Theme: Multimodal & multi-agent safety beyond “obvious bad inputs”

Theme: Systems security & privacy for agentic/LLM deployments

Theme: Inference-time control, metacognition, and “reasoning as a knob”

Theme: Benchmarks for interactive artifacts and long-horizon dynamics

3) Technical synthesis

  • Execution feedback is being “compiled” into training signals: SCALAR refines STRIPS-like operators from successful trajectories; EVALACT turns retrieval assessment into an action and rescales GRPO advantages (PCAR).
  • Runtime gating is converging on structured, multi-signal scoring: TrustBench combines calibrated confidence mappings (isotonic regression) with runtime checks into allow/warn/deny decisions; AgenticCyOps uses verified execution + memory integrity principles to intercept attack chains early.
  • Stability/robustness is increasingly treated as a measurable system property: Lyapunov-style divergence (multi-LLM committees) parallels other “dynamics-aware” evaluations (STAR real-time vs turn-based reshuffling).
  • “Latent-space” framing recurs across domains: latent backdoor regions enabling alternative triggers; latent planner→executor communication (Latent-DARM) to avoid text fluency bottlenecks; consequence-driven safety focusing on latent hazards.
  • Inference-time policies are being learned with verifiable rewards: adaptive decoding adapters trained via REINFORCE on correctness checks; Social-R1 uses trajectory-level rewards (SIP stages) with judges/RMs; MM-Zero uses RLVR/GRPO with execution and self-consistency signals.
  • Reasoning tokens act as both capability amplifier and risk factor: reasoning improves factual recall via compute-buffer + factual priming, but hallucinated intermediate facts correlate with worse final correctness; reasoning also increases honesty and reveals metastability of deception.
  • Security evaluations emphasize attacker adaptivity and pipeline-level attacks: VROP uses evolutionary prompt optimization; CLIOPATRA chains poisoning + prompt injection through extraction→clustering→summarization→auditing; trigger removal doesn’t remove backdoor.
  • Latency and resource isolation are first-order constraints: FlexServe shows page-granular secure memory and NPU sandboxing can cut TTFT dramatically vs TrustZone strawmen; STAR shows real-time mode flips leaderboards due to inference latency.
  • Benchmarks are adding “document-centric” and “artifact-centric” generalization: AgentGEO’s MIMIQ evaluates citation rate across held-out queries per document; MiniAppBench evaluates executable HTML behavior under exploration rather than static correctness.
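The calibrated allow/warn/deny gating described in the runtime-gating bullet can be sketched compactly. Everything below — the `TrustGate` class, the thresholds, and the pool-adjacent-violators fit — is an illustrative assumption in the spirit of TrustBench's isotonic-regression confidence mapping, not the paper's implementation:

```python
# Sketch of a TrustBench-style runtime gate: calibrate raw model confidences
# against logged action outcomes with isotonic regression (pool-adjacent-
# violators), then map calibrated scores to allow/warn/deny. Thresholds and
# names are illustrative assumptions.
from bisect import bisect_right

def pava(xs, ys):
    """Isotonic (non-decreasing) least-squares fit of ys against xs.
    Returns (sorted xs, fitted values) defining a step function."""
    pairs = sorted(zip(xs, ys))
    xs = [x for x, _ in pairs]
    blocks = []  # each block: [sum, count]; merged while means decrease
    for _, y in pairs:
        blocks.append([y, 1])
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, n = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += n
    fitted = []
    for s, n in blocks:
        fitted.extend([s / n] * n)
    return xs, fitted

class TrustGate:
    """Maps a raw confidence to a calibrated success probability, then to
    an allow/warn/deny decision (hypothetical thresholds)."""
    def __init__(self, raw_conf, outcomes, warn=0.5, allow=0.9):
        self.xs, self.ys = pava(raw_conf, outcomes)
        self.warn, self.allow = warn, allow

    def calibrate(self, c):
        i = bisect_right(self.xs, c) - 1  # step-function lookup
        return self.ys[max(i, 0)]

    def decide(self, c):
        p = self.calibrate(c)
        return "allow" if p >= self.allow else ("warn" if p >= self.warn else "deny")
```

Fitting on logged (raw confidence, success) pairs yields a monotone calibration map; thresholding the calibrated score then partitions proposed actions into allow/warn/deny before execution.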

4) Top 5 papers (with “why now”)

1) SCALAR: Learning and Composing Skills through LLM Guided Symbolic Planning and Deep RL Grounding

  • Closes the LLM-planning ↔ RL-control gap with STRIPS-like operators refined from execution feedback.
  • Strong long-horizon results on Craftax: 88.2% diamond success on Craftax-Classic vs 46.9% for the best baseline; 9.1% on Gnomish Mines in full Craftax, where prior methods score 0%.
  • Frontier Checkpointing reallocates frames to deep prerequisites; trajectory analysis is critical (removal drops Mines success to 0%).
  • Skepticism: requires a predefined symbolic abstraction/vocabulary; checkpointing assumes state serialization and can trade off diversity.
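For readers unfamiliar with the STRIPS formalism referenced above, here is a minimal sketch of what "STRIPS-like operators refined from execution feedback" could look like. The domain facts, the `mine_diamond` operator, and the precondition-tightening rule are hypothetical illustrations, not SCALAR's actual algorithm:

```python
# Minimal STRIPS-like operator sketch. A state is a frozenset of ground facts;
# an operator is applicable when its preconditions are a subset of the state,
# and applying it removes delete-effects and adds add-effects. The toy
# refinement tightens preconditions toward facts seen in every successful
# start state (an illustrative stand-in for trajectory-based refinement).
from dataclasses import dataclass

@dataclass
class Operator:
    name: str
    preconditions: frozenset
    add_effects: frozenset
    delete_effects: frozenset

    def applicable(self, state: frozenset) -> bool:
        return self.preconditions <= state

    def apply(self, state: frozenset) -> frozenset:
        assert self.applicable(state)
        return (state - self.delete_effects) | self.add_effects

def refine_preconditions(op: Operator, succeeded_states):
    """Keep only facts present in every successful start state, so the
    operator's preconditions tighten toward what execution supports."""
    if not succeeded_states:
        return op
    common = frozenset.intersection(*map(frozenset, succeeded_states))
    return Operator(op.name, op.preconditions | common,
                    op.add_effects, op.delete_effects)

# Hypothetical Craftax-flavored operator:
mine_diamond = Operator(
    name="mine_diamond",
    preconditions=frozenset({"has_iron_pickaxe", "at_diamond_vein"}),
    add_effects=frozenset({"has_diamond"}),
    delete_effects=frozenset(),
)
state = frozenset({"has_iron_pickaxe", "at_diamond_vein", "has_torch"})
assert mine_diamond.applicable(state)
assert "has_diamond" in mine_diamond.apply(state)
```

An RL policy then "grounds" each operator by learning to reach its add-effects from states satisfying its preconditions; the planner composes operators symbolically on top.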

2) FlexServe: A Fast and Secure LLM Serving System for Mobile Devices with Flexible Resource Isolation

  • Practical on-device confidentiality/integrity under compromised kernel via page-granular Flex-Mem + Flex-NPU sandboxing.
  • Big latency wins: 10.05× faster TTFT than a TrustZone strawman; 568 ms for an 8 GB allocation vs 6441 ms for the CMA baseline; multi-model workflows up to 24.30× faster.
  • On-demand protection can remove virtualization overhead when idle.
  • Skepticism: single SoC prototype; no side-channel/physical/DoS protection; normal-world client I/O not protected from compromised kernel.

3) OOD-MMSafe: Advancing MLLM Safety from Harmful Intent to Hidden Consequences

  • Reframes multimodal safety around consequence projection; introduces OOD-MMSafe (455) and tripartite R/S/E scoring.
  • CASPO (token-level constitution-conditioned self-distillation + outcome rewards) reportedly reduces failure ratio R0 to 7.3% / 5.7% on two backbones.
  • Diagnoses “preference ceiling” and negative transfer for static preference alignment (e.g., reported −1.5% after DPO for one model).
  • Skepticism: benchmark scale is limited; relies on automated judges (reported 86.5% consistency vs humans) and hyperparameter sensitivity (λ extremes can collapse entropy).

4) CLIOPATRA: Extracting Private Information from LLM Insights

  • End-to-end black-box attack on “privacy-preserving” insight pipelines (facet extraction → clustering → summarization → auditing).
  • Reports 39% disease extraction with minimal prior knowledge (age/gender/one symptom) vs a 22% baseline; approaches nearly 100% with more prior knowledge or stronger models.
  • Shows LLM privacy auditors can fail completely (zero detected violations on leaked clusters).
  • Skepticism: evaluation uses synthetic medical chats mixed with WildChat; real-world operational constraints (account friction/detection) not fully modeled.

5) Chaotic Dynamics in Multi-LLM Deliberation

  • Provides a concrete audit methodology for committee stability using an empirical Lyapunov estimator λ̂, showing divergence even at T=0.
  • Identifies two separable instability routes (roles, heterogeneity) and a key amplifier (Chair); shorter memory windows reduce divergence.
  • Highlights server-side nondeterminism at T=0 (≈40–50% of calls show non-zero parsing variance).
  • Skepticism: lowering instability isn’t yet linked to decision quality; some scenario effects have wide uncertainty and artifacts omit some failure-type logs.
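An estimator of this kind can be approximated generically: embed each round's committee output as a vector, track the distance between two replicate runs started from nearly identical prompts, and fit the slope of log-distance over rounds. The embedding choice and the least-squares fit below are generic assumptions; the paper's exact estimator may differ:

```python
# Sketch of an empirical Lyapunov-style estimator for multi-agent deliberation:
# a positive slope of log-divergence between replicate runs suggests
# exponentially diverging (chaotic-looking) dynamics.
import math

def log_divergence_slope(run_a, run_b):
    """run_a, run_b: per-round embedding vectors from two replicate runs.
    Returns the least-squares slope of log Euclidean distance vs round index."""
    logs = []
    for va, vb in zip(run_a, run_b):
        d = math.sqrt(sum((x - y) ** 2 for x, y in zip(va, vb)))
        logs.append(math.log(max(d, 1e-12)))  # floor avoids log(0) on identical rounds
    t = list(range(len(logs)))
    n = len(t)
    tm, lm = sum(t) / n, sum(logs) / n
    return (sum((ti - tm) * (li - lm) for ti, li in zip(t, logs))
            / sum((ti - tm) ** 2 for ti in t))
```

Running this over many replicate pairs and scenarios gives a distribution of slopes; consistently positive values flag configurations (roles, memory windows) worth ablating.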

5) Practical next steps

  • If you build agents: add an explicit Evaluate/Verify step in tool loops (like Search→Evaluate) and log per-step self-scores; test whether advantage rescaling (PCAR-style) improves multi-hop reliability.
  • For multimodal safety: evaluate on consequence-driven cases (OOD-MMSafe-style) rather than intent-only; measure R/S/E separately to detect “safe but ineffective” collapse.
  • For multi-agent governance: run stability audits with replicate runs at T=0; ablate roles (especially “Chair”) and shrink memory windows to quantify changes in divergence (λ̂).
  • For privacy/insights products: treat clustering+summarization pipelines as adversarial surfaces; test poisoning attacks and do not rely on LLM auditors alone—consider formal privacy mechanisms (DP) and measure leakage under targeted attacks.
  • For backdoor defense: evaluate defenses against alternative triggers (feature-guided attacks), not just the discovered trigger’s ASR; add latent-space diagnostics (direction interpolation).
  • For inference-time optimization: try budget-aware decoding adapters for your domain; separately, if you elicit confidence, test [0,20] vs [0,100] scales and report meta-d’/Mratio (not only ECE).
  • For real-time agent deployment: benchmark in both unconstrained and time-constrained modes (STAR-style) to surface strategy–execution gaps; track latency as a first-class metric.
  • For on-device deployments: if TrustZone is too rigid, evaluate page-granular isolation designs and accelerator sandboxing; measure TTFT under memory pressure and multi-model scheduling.
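The "advantage rescaling (PCAR-style)" suggestion in the agents bullet can be prototyped quickly. The group-relative normalization below follows the standard GRPO recipe; the per-step rule that shrinks credit on Evaluate steps whose self-score disagreed with the trajectory outcome is a hypothetical stand-in for PCAR, not the paper's formula:

```python
# Group-relative advantages (GRPO-style) plus a hypothetical per-step
# rescaling for explicit Evaluate actions. alpha < 1 shrinks the shared
# advantage on steps whose self-evaluation contradicted the final outcome.
from statistics import mean, pstdev

def group_advantages(rewards):
    """Normalize each sampled trajectory's reward within its group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0:
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]

def rescale_eval_steps(advantage, step_self_scores, outcome, alpha=0.5):
    """Per-step credit: keep the trajectory advantage on Evaluate steps whose
    self-score (in [0, 1]) agreed with the outcome (1 = success, 0 = failure),
    and shrink it by alpha where they disagreed."""
    per_step = []
    for s in step_self_scores:
        agrees = (s >= 0.5) == bool(outcome)
        per_step.append(advantage if agrees else alpha * advantage)
    return per_step
```

Logging per-step self-scores (as suggested above) is what makes this kind of rescaling testable: compare multi-hop reliability with and without the shrunken credit on disagreeing Evaluate steps.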

Generated from per-paper analyses; no external browsing.