Daily AI Paper Report (2026-03-12)

Published:

Chinese version: [中文]

Run stats

  • Candidates: 252
  • Selected: 30
  • Deepread completed: 30
  • Window (UTC): 2026-03-10T00:00:00Z → 2026-03-11T00:00:00Z (arxiv_announce, expanded=0)
1) Selected papers

  • 2603.09772 · Removing the Trigger, Not the Backdoor: Alternative Triggers and Latent Backdoors (PDF) · cs.CV, cs.CR · score 95
    Why: Shows backdoors persist via alternative triggers; defenses removing known triggers can fail
    Tags: backdoors, adversarial-ML, security, representation-learning, robustness
  • 2603.09706 · OOD-MMSafe: Advancing MLLM Safety from Harmful Intent to Hidden Consequences (PDF) · cs.AI · score 94
    Why: Consequence-driven MLLM safety benchmark; finds causal blindness and alignment ceiling in frontier models
    Tags: multimodal-safety, benchmark, agent-safety, causal-reasoning, evaluation, robustness
  • 2603.09884 · Benchmarking Political Persuasion Risks Across Frontier Large Language Models (PDF) · cs.CL, cs.CY · score 94
    Why: Large-N benchmark shows frontier LLMs beat ads at persuasion; high societal misuse relevance
    Tags: political-persuasion, misuse, evaluation, survey-experiments, frontier-models
  • 2603.09246 · Reasoning-Oriented Programming: Chaining Semantic Gadgets to Jailbreak Large Vision Language Models (PDF) · cs.CR · score 93
    Why: New jailbreak class for LVLMs via compositional reasoning; ROP-style chaining of benign premises into harm
    Tags: jailbreaks, multimodal, adversarial-attacks, compositionality, security, red-teaming
  • 2603.09046 · FlexServe: A Fast and Secure LLM Serving System for Mobile Devices with Flexible Resource Isolation (PDF) · cs.CR, cs.LG, cs.OS · score 93
    Why: Secure on-device LLM serving w/ TrustZone-style isolation; strong systems+security relevance.
    Tags: llm-serving, mobile, trusted-execution, isolation, confidential-inference, systems-security
  • 2603.09781 · CLIOPATRA: Extracting Private Information from LLM Insights (PDF) · cs.CR · score 92
    Why: Privacy attack on “privacy-preserving” LLM insights; shows realistic data insertion can induce leakage
    Tags: privacy, data-exfiltration, security, LLM-systems, auditing, attack
  • 2603.09203 · Evaluate-as-Action: Self-Evaluated Process Rewards for Retrieval-Augmented Agents (PDF) · cs.AI · score 91
    Why: Process rewards for retrieval-augmented agents; explicit eval action + GRPO variant for reliability.
    Tags: agents, rag, process-supervision, reinforcement-learning, reliability, evaluation
  • 2603.09957 · Think Before You Lie: How Reasoning Improves Honesty (PDF) · cs.AI, cs.CL, cs.LG · score 91
    Why: Finds reasoning increases LLM honesty; probes mechanisms via representation geometry
    Tags: honesty, deception, reasoning, mechanistic-analysis, behavior
  • 2603.09157 · Real-Time Trust Verification for Safe Agentic Actions using TrustBench (PDF) · cs.AI · score 90
    Why: TrustBench verifies agent actions pre-execution; shifts from post-hoc eval to real-time safety gating
    Tags: agents, runtime-guardrails, verification, evaluation, trustworthiness, tool-use
  • 2603.09036 · SCALAR: Learning and Composing Skills through LLM Guided Symbolic Planning and Deep RL Grounding (PDF) · cs.LG · score 90
    Why: LLM-guided symbolic planning + RL grounding with feedback loop; strong agent skill learning
    Tags: agents, LLM-planning, reinforcement-learning, skill-learning, tool-use
  • 2603.09875 · The Bureaucracy of Speed: Structural Equivalence Between Memory Consistency Models and Multi-Agent Authorization Revocation (PDF) · cs.MA, cs.CR, cs.DC · score 89
    Why: Agent auth revocation framed as coherence; bounds unauthorized ops under fast agent execution
    Tags: agent-security, access-control, capabilities, revocation, distributed-systems
  • 2603.09337 · Beyond Scaling: Assessing Strategic Reasoning and Rapid Decision-Making Capability of LLMs in Zero-sum Environments (PDF) · cs.CV, cs.AI · score 88
    Why: STAR benchmark tests LLMs as adversarial agents in zero-sum, real-time/turn-based settings.
    Tags: agent-evals, adversarial, multi-agent, benchmark, strategic-reasoning, red-teaming
  • 2603.09731 · EXPLORE-Bench: Egocentric Scene Prediction with Long-Horizon Reasoning (PDF) · cs.CV, cs.AI, cs.CL · score 88
    Why: New benchmark for long-horizon egocentric action→scene prediction; targets embodied MLLM limits
    Tags: benchmark, embodied-agents, multimodal, long-horizon-reasoning, evaluation
  • 2603.09134 · AgenticCyOps: Securing Multi-Agentic AI Integration in Enterprise Cyber Operations (PDF) · cs.CR, cs.MA, cs.SE · score 86
    Why: Enterprise multi-agent security framework; decomposes attack surfaces around tool orchestration & memory
    Tags: agent-security, enterprise, threat-modeling, memory, tooling, architecture
  • 2603.09309 · Rescaling Confidence: What Scale Design Reveals About LLM Metacognition (PDF) · cs.AI · score 86
    Why: Shows confidence scales distort metacognition; proposes better scale improving meta-d'
    Tags: uncertainty, calibration, metacognition, evaluation, confidence
  • 2603.09127 · Chaotic Dynamics in Multi-LLM Deliberation (PDF) · cs.AI, cs.MA · score 85
    Why: Shows multi-LLM committees can be chaotic even at T=0; empirical Lyapunov analysis of instability routes
    Tags: multi-agent, deliberation, stability, evaluation, reproducibility, dynamics
  • 2603.09065 · Learning Adaptive LLM Decoding (PDF) · cs.LG · score 85
    Why: Learns adaptive decoding policies via RL without model finetune; compute-aware inference gains
    Tags: decoding, inference, test-time, rl, efficiency
  • 2603.09452 · CyberThreat-Eval: Can Large Language Models Automate Real-World Threat Research? (PDF) · cs.CR, cs.CL · score 84
    Why: Realistic CTI/OSINT workflow benchmark for LLM threat research (triage→search→draft), beyond MCQ metrics
    Tags: cybersecurity, evaluation, agents, OSINT, benchmark, workflows
  • 2603.09435 · AI Act Evaluation Benchmark: An Open, Transparent, and Reproducible Evaluation Dataset for NLP and RAG Systems (PDF) · cs.AI · score 84
    Why: Open dataset for evaluating NLP/RAG systems against EU AI Act-style compliance requirements.
    Tags: rag, evaluation, governance, compliance, dataset, auditability
  • 2603.09297 · TA-Mem: Tool-Augmented Autonomous Memory Retrieval for LLM in Long-Term Conversational QA (PDF) · cs.IR, cs.CL · score 84
    Why: Tool-augmented autonomous memory retrieval for long-term conversational QA beyond top-k
    Tags: memory, agents, retrieval, long-context, conversational-QA
  • 2603.09906 · Thinking to Recall: How Reasoning Unlocks Parametric Knowledge in LLMs (PDF) · cs.CL · score 84
    Why: Controlled study: reasoning boosts single-hop factual recall; proposes buffer/mechanism story
    Tags: parametric-knowledge, reasoning, factuality, mechanisms, elicitation
  • 2603.09951 · Towards a Neural Debugger for Python (PDF) · cs.LG, cs.AI, cs.SE · score 83
    Why: Neural debugger trained on execution traces; enables interactive stepping/breakpoints for code LMs.
    Tags: code-llms, tooling, debugging, execution-traces, reliability, agents
  • 2603.09184 · Latent-DARM: Bridging Discrete Diffusion And Autoregressive Models For Reasoning (PDF) · cs.LG, cs.AI · score 83
    Why: Bridges diffusion planners with AR executors; improves reasoning via latent communication
    Tags: planning, diffusion-lm, agents, reasoning, coordination
  • 2603.09434 · Common Sense vs. Morality: The Curious Case of Narrative Focus Bias in LLMs (PDF) · cs.CL, cs.AI · score 82
    Why: CoMoral benchmark reveals narrative focus bias: models favor moral reasoning over commonsense consistency
    Tags: reliability, benchmark, bias, commonsense, moral-reasoning, evaluation
  • 2603.09544 · Compartmentalization-Aware Automated Program Repair (PDF) · cs.CR · score 82
    Why: LLM-based automated program repair targeting cross-compartment interface security vulnerabilities.
    Tags: cybersecurity, program-repair, llm-for-security, compartmentalization, vulnerability-mitigation
  • 2603.09296 · Diagnosing and Repairing Citation Failures in Generative Engine Optimization (PDF) · cs.IR, cs.CL · score 82
    Why: Taxonomy + agentic diagnosis/repair of citation failures in GEO; practical grounding/citations
    Tags: citations, RAG, evaluation, agentic-systems, information-retrieval
  • 2603.09970 · CREATE: Testing LLMs for Associative Creativity (PDF) · cs.CL · score 82
    Why: New benchmark for associative creativity with objective grading; useful for capability evals
    Tags: benchmark, creativity, evaluation, associative-reasoning, knowledge
  • 2603.09249 · Social-R1: Towards Human-like Social Reasoning in LLMs (PDF) · cs.AI · score 81
    Why: Adversarial social-reasoning benchmark + RL framework to reduce shortcuts in ToM-style tasks.
    Tags: alignment, social-reasoning, theory-of-mind, rl, benchmark, robustness
  • 2603.09652 · MiniAppBench: Evaluating the Shift from Text to Interactive HTML Responses in LLM-Powered Assistants (PDF) · cs.AI · score 80
    Why: MiniAppBench targets interactive HTML app generation; 500 tasks distilled from 10M+ real generations
    Tags: benchmark, code-generation, HCI, agents, web, evaluation
  • 2603.09206 · MM-Zero: Self-Evolving Multi-Model Vision Language Models From Zero Data (PDF) · cs.CV, cs.LG · score 80
    Why: Zero-data self-evolving VLM reasoning via multi-role RL framework; potentially high-impact method.
    Tags: vlm, self-improvement, rl, synthetic-data, reasoning, frontier

AI Paper Insight Brief

2026-03-12

0) Executive takeaways (read this first)

  • “Closed-loop” is the day’s dominant pattern: multiple papers move from one-shot prompting to execution feedback loops—LLM plans refined by trajectories (SCALAR), retrieval self-evaluation turned into an action with RL credit assignment (EVALACT), and human/ground-truth-in-the-loop verification for cyber threat research (CyberThreat-Eval/TRA).
  • Agent safety is shifting from “intent” to “consequences” and “runtime gating”: OOD-MMSafe/CASPO targets latent downstream hazards in multimodal settings, while TrustBench proposes sub-200ms pre-execution trust verification that reportedly cuts harmful actions by 87%.
  • Multi-agent systems have governance-grade instability even at T=0: Multi-LLM deliberation can show positive empirical Lyapunov exponents; role structure (esp. Chair) and memory windows materially change divergence (Chaotic Dynamics in Multi-LLM Deliberation).
  • Security work highlights “latent” failure surfaces: LVLM jailbreaks can be composed from benign “semantic gadgets” (VROP), backdoors persist after trigger removal via alternative triggers in feature space (Removing the Trigger, Not the Backdoor), and “privacy-preserving” insight pipelines can be exfiltrated via poisoning + prompt injection (CLIOPATRA).
  • Inference-time control is becoming a first-class optimization target: learned budget-aware decoding adapters improve Pass@1 on MATH by up to ~10.2 points (Learning Adaptive LLM Decoding), while confidence elicitation itself is shown to be scale-sensitive (best metacognitive efficiency at [0,20] vs [0,100]) (Rescaling Confidence).
  • Benchmarks are increasingly “interactive and end-to-end”: from real-time zero-sum games (STAR) to Playwright-tested HTML MiniApps (MiniAppBench) to long-horizon egocentric scene prediction (EXPLORE-Bench), evaluation is moving toward agent-like settings where latency and dynamics matter.

2) Key themes (clusters)

Theme: Closed-loop agents (feedback, verification, and credit assignment)

Theme: Multimodal & multi-agent safety beyond “obvious bad inputs”

Theme: Systems security & privacy for agentic/LLM deployments

Theme: Inference-time control, metacognition, and “reasoning as a knob”

Theme: Benchmarks for interactive artifacts and long-horizon dynamics

3) Technical synthesis

  • Execution feedback is being “compiled” into training signals: SCALAR refines STRIPS-like operators from successful trajectories; EVALACT turns retrieval assessment into an action and rescales GRPO advantages (PCAR).
  • Runtime gating is converging on structured, multi-signal scoring: TrustBench combines calibrated confidence mappings (isotonic regression) with runtime checks into allow/warn/deny decisions; AgenticCyOps uses verified execution + memory integrity principles to intercept attack chains early.
  • Stability/robustness is increasingly treated as a measurable system property: Lyapunov-style divergence (multi-LLM committees) parallels other “dynamics-aware” evaluations (STAR real-time vs turn-based reshuffling).
  • “Latent-space” framing recurs across domains: latent backdoor regions enabling alternative triggers; latent planner→executor communication (Latent-DARM) to avoid text fluency bottlenecks; consequence-driven safety focusing on latent hazards.
  • Inference-time policies are being learned with verifiable rewards: adaptive decoding adapters trained via REINFORCE on correctness checks; Social-R1 uses trajectory-level rewards (SIP stages) with judges/RMs; MM-Zero uses RLVR/GRPO with execution and self-consistency signals.
  • Reasoning tokens act as both capability amplifier and risk factor: reasoning improves factual recall via compute-buffer + factual priming, but hallucinated intermediate facts correlate with worse final correctness; reasoning also increases honesty and reveals metastability of deception.
  • Security evaluations emphasize attacker adaptivity and pipeline-level attacks: VROP uses evolutionary prompt optimization; CLIOPATRA chains poisoning + prompt injection through extraction→clustering→summarization→auditing; trigger removal doesn’t remove backdoor.
  • Latency and resource isolation are first-order constraints: FlexServe shows page-granular secure memory and NPU sandboxing can cut TTFT dramatically vs TrustZone strawmen; STAR shows real-time mode flips leaderboards due to inference latency.
  • Benchmarks are adding “document-centric” and “artifact-centric” generalization: AgentGEO’s MIMIQ evaluates citation rate across held-out queries per document; MiniAppBench evaluates executable HTML behavior under exploration rather than static correctness.
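The calibrated allow/warn/deny gating described in the runtime-gating bullet can be sketched compactly. Everything below — the `TrustGate` class, the thresholds, and the pool-adjacent-violators fit — is an illustrative assumption in the spirit of TrustBench's isotonic-regression confidence mapping, not the paper's implementation:

```python
# Sketch of a TrustBench-style runtime gate: calibrate raw model confidences
# against logged action outcomes with isotonic regression (pool-adjacent-
# violators), then map calibrated scores to allow/warn/deny. Thresholds and
# names are illustrative assumptions.
from bisect import bisect_right

def pava(xs, ys):
    """Isotonic (non-decreasing) least-squares fit of ys against xs.
    Returns (sorted xs, fitted values) defining a step function."""
    pairs = sorted(zip(xs, ys))
    xs = [x for x, _ in pairs]
    blocks = []  # each block: [sum, count]; merged while means decrease
    for _, y in pairs:
        blocks.append([y, 1])
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, n = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += n
    fitted = []
    for s, n in blocks:
        fitted.extend([s / n] * n)
    return xs, fitted

class TrustGate:
    """Maps a raw confidence to a calibrated success probability, then to
    an allow/warn/deny decision (hypothetical thresholds)."""
    def __init__(self, raw_conf, outcomes, warn=0.5, allow=0.9):
        self.xs, self.ys = pava(raw_conf, outcomes)
        self.warn, self.allow = warn, allow

    def calibrate(self, c):
        i = bisect_right(self.xs, c) - 1  # step-function lookup
        return self.ys[max(i, 0)]

    def decide(self, c):
        p = self.calibrate(c)
        return "allow" if p >= self.allow else ("warn" if p >= self.warn else "deny")
```

Fitting on logged (raw confidence, success) pairs yields a monotone calibration map; thresholding the calibrated score then partitions proposed actions into allow/warn/deny before execution.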

4) Top 5 papers (with “why now”)

1) SCALAR: Learning and Composing Skills through LLM Guided Symbolic Planning and Deep RL Grounding

  • Closes the LLM-planning ↔ RL-control gap with STRIPS-like operators refined from execution feedback.
  • Strong long-horizon results on Craftax: 88.2% diamond success on Craftax-Classic vs 46.9% for the best baseline; 9.1% on Gnomish Mines in full Craftax, where prior methods score 0%.
  • Frontier Checkpointing reallocates frames to deep prerequisites; trajectory analysis is critical (removal drops Mines success to 0%).
  • Skepticism: requires a predefined symbolic abstraction/vocabulary; checkpointing assumes state serialization and can trade off diversity.
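For readers unfamiliar with the STRIPS formalism referenced above, here is a minimal sketch of what "STRIPS-like operators refined from execution feedback" could look like. The domain facts, the `mine_diamond` operator, and the precondition-tightening rule are hypothetical illustrations, not SCALAR's actual algorithm:

```python
# Minimal STRIPS-like operator sketch. A state is a frozenset of ground facts;
# an operator is applicable when its preconditions are a subset of the state,
# and applying it removes delete-effects and adds add-effects. The toy
# refinement tightens preconditions toward facts seen in every successful
# start state (an illustrative stand-in for trajectory-based refinement).
from dataclasses import dataclass

@dataclass
class Operator:
    name: str
    preconditions: frozenset
    add_effects: frozenset
    delete_effects: frozenset

    def applicable(self, state: frozenset) -> bool:
        return self.preconditions <= state

    def apply(self, state: frozenset) -> frozenset:
        assert self.applicable(state)
        return (state - self.delete_effects) | self.add_effects

def refine_preconditions(op: Operator, succeeded_states):
    """Keep only facts present in every successful start state, so the
    operator's preconditions tighten toward what execution supports."""
    if not succeeded_states:
        return op
    common = frozenset.intersection(*map(frozenset, succeeded_states))
    return Operator(op.name, op.preconditions | common,
                    op.add_effects, op.delete_effects)

# Hypothetical Craftax-flavored operator:
mine_diamond = Operator(
    name="mine_diamond",
    preconditions=frozenset({"has_iron_pickaxe", "at_diamond_vein"}),
    add_effects=frozenset({"has_diamond"}),
    delete_effects=frozenset(),
)
state = frozenset({"has_iron_pickaxe", "at_diamond_vein", "has_torch"})
assert mine_diamond.applicable(state)
assert "has_diamond" in mine_diamond.apply(state)
```

An RL policy then "grounds" each operator by learning to reach its add-effects from states satisfying its preconditions; the planner composes operators symbolically on top.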

2) FlexServe: A Fast and Secure LLM Serving System for Mobile Devices with Flexible Resource Isolation

  • Practical on-device confidentiality/integrity under compromised kernel via page-granular Flex-Mem + Flex-NPU sandboxing.
  • Big latency wins: 10.05× faster TTFT than a TrustZone strawman; 568 ms for an 8 GB allocation vs 6441 ms for the CMA baseline; multi-model workflows up to 24.30× faster.
  • On-demand protection can remove virtualization overhead when idle.
  • Skepticism: single SoC prototype; no side-channel/physical/DoS protection; normal-world client I/O not protected from compromised kernel.

3) OOD-MMSafe: Advancing MLLM Safety from Harmful Intent to Hidden Consequences

  • Reframes multimodal safety around consequence projection; introduces OOD-MMSafe (455) and tripartite R/S/E scoring.
  • CASPO (token-level constitution-conditioned self-distillation + outcome rewards) reportedly reduces failure ratio R0 to 7.3% / 5.7% on two backbones.
  • Diagnoses “preference ceiling” and negative transfer for static preference alignment (e.g., reported −1.5% after DPO for one model).
  • Skepticism: benchmark scale is limited; relies on automated judges (reported 86.5% consistency vs humans) and hyperparameter sensitivity (λ extremes can collapse entropy).

4) CLIOPATRA: Extracting Private Information from LLM Insights

  • End-to-end black-box attack on “privacy-preserving” insight pipelines (facet extraction → clustering → summarization → auditing).
  • Reports 39% disease extraction with minimal prior knowledge (age/gender/one symptom) vs a 22% baseline; approaches nearly 100% with more prior knowledge or stronger models.
  • Shows LLM privacy auditors can fail completely (zero detected violations on leaked clusters).
  • Skepticism: evaluation uses synthetic medical chats mixed with WildChat; real-world operational constraints (account friction/detection) not fully modeled.

5) Chaotic Dynamics in Multi-LLM Deliberation

  • Provides a concrete audit methodology for committee stability using an empirical Lyapunov estimator λ̂, showing divergence even at T=0.
  • Identifies two separable instability routes (roles, heterogeneity) and a key amplifier (Chair); shorter memory windows reduce divergence.
  • Highlights server-side nondeterminism at T=0 (≈40–50% of calls show non-zero parsing variance).
  • Skepticism: lowering instability isn’t yet linked to decision quality; some scenario effects have wide uncertainty and artifacts omit some failure-type logs.
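An estimator of this kind can be approximated generically: embed each round's committee output as a vector, track the distance between two replicate runs started from nearly identical prompts, and fit the slope of log-distance over rounds. The embedding choice and the least-squares fit below are generic assumptions; the paper's exact estimator may differ:

```python
# Sketch of an empirical Lyapunov-style estimator for multi-agent deliberation:
# a positive slope of log-divergence between replicate runs suggests
# exponentially diverging (chaotic-looking) dynamics.
import math

def log_divergence_slope(run_a, run_b):
    """run_a, run_b: per-round embedding vectors from two replicate runs.
    Returns the least-squares slope of log Euclidean distance vs round index."""
    logs = []
    for va, vb in zip(run_a, run_b):
        d = math.sqrt(sum((x - y) ** 2 for x, y in zip(va, vb)))
        logs.append(math.log(max(d, 1e-12)))  # floor avoids log(0) on identical rounds
    t = list(range(len(logs)))
    n = len(t)
    tm, lm = sum(t) / n, sum(logs) / n
    return (sum((ti - tm) * (li - lm) for ti, li in zip(t, logs))
            / sum((ti - tm) ** 2 for ti in t))
```

Running this over many replicate pairs and scenarios gives a distribution of slopes; consistently positive values flag configurations (roles, memory windows) worth ablating.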

5) Practical next steps

  • If you build agents: add an explicit Evaluate/Verify step in tool loops (like Search→Evaluate) and log per-step self-scores; test whether advantage rescaling (PCAR-style) improves multi-hop reliability.
  • For multimodal safety: evaluate on consequence-driven cases (OOD-MMSafe-style) rather than intent-only; measure R/S/E separately to detect “safe but ineffective” collapse.
  • For multi-agent governance: run stability audits with replicate runs at T=0; ablate roles (especially “Chair”) and shrink memory windows to quantify changes in divergence (λ̂).
  • For privacy/insights products: treat clustering+summarization pipelines as adversarial surfaces; test poisoning attacks and do not rely on LLM auditors alone—consider formal privacy mechanisms (DP) and measure leakage under targeted attacks.
  • For backdoor defense: evaluate defenses against alternative triggers (feature-guided attacks), not just the discovered trigger’s ASR; add latent-space diagnostics (direction interpolation).
  • For inference-time optimization: try budget-aware decoding adapters for your domain; separately, if you elicit confidence, test [0,20] vs [0,100] scales and report meta-d’/Mratio (not only ECE).
  • For real-time agent deployment: benchmark in both unconstrained and time-constrained modes (STAR-style) to surface strategy–execution gaps; track latency as a first-class metric.
  • For on-device deployments: if TrustZone is too rigid, evaluate page-granular isolation designs and accelerator sandboxing; measure TTFT under memory pressure and multi-model scheduling.
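The "advantage rescaling (PCAR-style)" suggestion in the agents bullet can be prototyped quickly. The group-relative normalization below follows the standard GRPO recipe; the per-step rule that shrinks credit on Evaluate steps whose self-score disagreed with the trajectory outcome is a hypothetical stand-in for PCAR, not the paper's formula:

```python
# Group-relative advantages (GRPO-style) plus a hypothetical per-step
# rescaling for explicit Evaluate actions. alpha < 1 shrinks the shared
# advantage on steps whose self-evaluation contradicted the final outcome.
from statistics import mean, pstdev

def group_advantages(rewards):
    """Normalize each sampled trajectory's reward within its group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0:
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]

def rescale_eval_steps(advantage, step_self_scores, outcome, alpha=0.5):
    """Per-step credit: keep the trajectory advantage on Evaluate steps whose
    self-score (in [0, 1]) agreed with the outcome (1 = success, 0 = failure),
    and shrink it by alpha where they disagreed."""
    per_step = []
    for s in step_self_scores:
        agrees = (s >= 0.5) == bool(outcome)
        per_step.append(advantage if agrees else alpha * advantage)
    return per_step
```

Logging per-step self-scores (as suggested above) is what makes this kind of rescaling testable: compare multi-hop reliability with and without the shrunken credit on disagreeing Evaluate steps.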

Generated from per-paper analyses; no external browsing.