Daily AI Paper Report (2026-03-19)
Run stats
- Candidates: 277
- Selected: 30
- Deepread completed: 30
- Window (UTC): 2026-03-19T00:00:00Z → 2026-03-20T00:00:00Z (arxiv_announce, expanded=0)
Selected papers
| arXiv ID | Title / Links | Categories | Score | Why | Tags |
|---|---|---|---|---|---|
| 2603.18433 | Prompt Control-Flow Integrity: A Priority-Aware Runtime Defense Against Prompt Injection in LLM Systems | cs.CR | 94 | Runtime, role-aware prompt-injection defense for RAG/API stacks; practical gateway design + eval. | prompt-injection, RAG, runtime-defense, policy-enforcement, LLM-security, middleware |
| 2603.18894 | I Can't Believe It's Corrupt: Evaluating Corruption in Multi-Agent Governance Systems | cs.AI, cs.MA | 94 | Empirical multi-agent governance sims quantify rule-breaking/corruption; high direct agent-safety relevance. | agent-safety, multi-agent, governance, evaluation, misuse, institutional-integrity |
| 2603.19092 | SAVeS: Steering Safety Judgments in Vision-Language Models via Semantic Cues | cs.CV, cs.AI, cs.CL, cs.LG | 93 | New VLM safety benchmark + semantic steering; separates refusals vs grounded reasoning. | vlm-safety, benchmark, steering, refusal, grounded-reasoning, evaluation |
| 2603.18637 | MOSAIC: Multi-Objective Slice-Aware Iterative Curation for Alignment | cs.CR, cs.CL | 92 | Closed-loop multi-objective alignment data curation; explicit safety vs over-refusal vs instruction-following tradeoffs. | alignment, data-curation, SFT, over-refusal, safety-eval, mixture-optimization |
| 2603.18614 | ZEBRAARENA: A Diagnostic Simulation Environment for Studying Reasoning-Action Coupling in Tool-Augmented LLMs | cs.AI | 92 | Procedural tool-use environment isolates reasoning-action coupling; reduces contamination; strong agent eval asset. | agents, tool-use, benchmark, evaluation, procedural-generation, reasoning |
| 2603.18736 | CausalRM: Causal-Theoretic Reward Modeling for RLHF from Observational User Feedbacks | cs.LG, cs.AI, cs.CL, stat.ML | 92 | Causal framing for reward models from noisy/biased observational feedback; scalable RLHF alternative. | RLHF, reward-modeling, causal-inference, observational-feedback, alignment |
| 2603.18740 | Measuring and Exploiting Confirmation Bias in LLM-Assisted Security Code Review | cs.SE, cs.AI, cs.CR | 91 | Measures exploitable confirmation bias in LLM security code review; large effect sizes on false-negative rates. | secure-coding, LLM-failure-modes, supply-chain, evaluation, prompt-framing, robustness |
| 2603.18631 | D-Mem: A Dual-Process Memory System for LLM Agents | cs.AI | 90 | Dual-process memory for LLM agents; tackles lossy retrieval for long-horizon context. | agents, memory, long-horizon, retrieval, architecture, reliability |
| 2603.18377 | PlanTwin: Privacy-Preserving Planning Abstractions for Cloud-Assisted LLM Agents | cs.CR, cs.AI, cs.ET | 89 | Privacy-preserving planning for cloud LLM agents via planning abstractions; limits raw state exposure. | agents, privacy, cloud-planning, abstraction, confidential-context, system-design |
| 2603.18893 | Quantitative Introspection in Language Models: Tracking Internal States Across Conversation | cs.AI | 89 | Measures whether LLM numeric self-reports track internal states over dialogue; safety + interpretability angle. | interpretability, introspection, monitoring, safety, probes, conversation |
| 2603.18382 | From Weak Cues to Real Identities: Evaluating Inference-Driven De-Anonymization in LLM Agents | cs.AI | 88 | Systematic eval of LLM-agent de-anonymization from weak cues; formalizes inference-driven linkage. | privacy, deanonymization, agents, benchmark, linkage-attacks, risk-evaluation |
| 2603.18469 | GAIN: A Benchmark for Goal-Aligned Decision-Making of Large Language Models under Imperfect Norms | cs.CL | 88 | Benchmark for norm-vs-goal conflicts under pressure; useful for alignment and policy compliance testing. | alignment, norms, decision-making, benchmark, robustness, governance |
| 2603.18683 | HISR: Hindsight Information Modulated Segmental Process Rewards For Multi-turn Agentic Reinforcement Learning | cs.LG, cs.AI, cs.CL | 88 | Improves multi-turn agent RL via hindsight-modulated segment rewards for credit assignment. | agentic-rl, reward-modeling, credit-assignment, process-rewards, long-horizon |
| 2603.19127 | On Optimizing Multimodal Jailbreaks for Spoken Language Models | cs.LG | 87 | Joint audio+text gradient jailbreaks for spoken-language models; expands multimodal attack surface. | jailbreaks, multimodal, audio-attacks, adversarial-prompts, SLM, red-teaming |
| 2603.18756 | Are complicated loss functions necessary for teaching LLMs to reason? | cs.LG, cs.AI, cs.CL | 87 | Dissects GRPO; finds negative feedback key and clipping unnecessary; simplifies reasoning post-training. | reasoning, RL, post-training, GRPO, REINFORCE, optimization |
| 2603.18762 | ClawTrap: A MITM-Based Red-Teaming Framework for Real-World OpenClaw Security Evaluation | cs.CR, cs.AI | 86 | MITM-based red-teaming for real web agents (OpenClaw); network-layer attacks beyond sandbox tests. | agents, red-teaming, MITM, web-security, tool-use, evaluation-framework |
| 2603.19025 | Towards Verifiable AI with Lightweight Cryptographic Proofs of Inference | cs.CR, cs.LG | 86 | Lightweight verifiable inference protocol for cloud models; relevant to auditing and deployment security. | security, verifiable-inference, cryptography, auditing, cloud-deployment, integrity |
| 2603.19144 | UGID: Unified Graph Isomorphism for Debiasing Large Language Models | cs.CL, cs.AI | 86 | Representation-level LLM debiasing via graph invariance across counterfactual inputs. | bias, debiasing, interpretability, representations, counterfactuals, fairness |
| 2603.18829 | Agent Control Protocol: Admission Control for Agent Actions | cs.CR, cs.AI | 85 | Formal spec for cryptographic admission control of agent actions: identity, delegation, revocation, audit. | agent-governance, capabilities, authorization, cryptography, auditing, protocol |
| 2603.18743 | Memento-Skills: Let Agents Design Agents | cs.AI, cs.CL, cs.LG | 85 | Continual agent that writes reusable skills/memory to design new agents; relevant to agentic risk surface. | agents, continual-learning, memory, tool-use, skills, agent-design |
| 2603.19191 | OS-Themis: A Scalable Critic Framework for Generalist GUI Rewards | cs.AI | 84 | Scalable multi-agent critic for GUI rewards + new cross-platform reward benchmark (OGRBench). | GUI-agents, reward-modeling, critics, benchmarks, verification, RL |
| 2603.18911 | Progressive Training for Explainable Citation-Grounded Dialogue: Reducing Hallucination to Zero in English-Hindi LLMs | cs.CL, cs.AI | 84 | Citation-grounded bilingual dialogue training + reward; targets hallucinations with verifiable outputs. | hallucination, grounding, citations, RAG, alignment, multilingual |
| 2603.18507 | Expert Personas Improve LLM Alignment but Damage Accuracy: Bootstrapping Intent-Based Persona Routing with PRISM | cs.AI | 84 | Finds persona prompts boost alignment but hurt accuracy; proposes intent-based routing. | alignment, personas, prompting, routing, evaluation, tradeoffs |
| 2603.19017 | What Really Controls Temporal Reasoning in Large Language Models: Tokenisation or Representation of Time? | cs.CL, cs.AI | 84 | MultiTempBench probes multilingual temporal reasoning; links failures to tokenization via mDFR + probing. | evaluation, temporal-reasoning, multilingual, tokenization, benchmarks |
| 2603.18373 | To See or To Please: Uncovering Visual Sycophancy and Split Beliefs in VLMs | cs.CV, cs.AI | 83 | Diagnoses visual sycophancy/split beliefs in VLMs with counterfactual tests; highlights alignment failure. | VLM, sycophancy, grounding, hallucinations, evaluation, uncertainty |
| 2603.19220 | Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation | cs.CL, cs.AI, cs.LG | 83 | Open 30B MoE with Cascade RL/distillation; strong reasoning/agentic claims; potentially impactful post-training. | LLM, post-training, RL, distillation, MoE, reasoning, agents |
| 2603.18886 | Reasoning over mathematical objects: on-policy reward modeling and test time aggregation | cs.AI, cs.CL | 83 | Principia suite for formal math objects + on-policy judge training + test-time aggregation. | reasoning, math, benchmarks, reward-modeling, llm-judges, verification |
| 2603.18897 | Act While Thinking: Accelerating LLM Agents via Pattern-Aware Speculative Tool Execution | cs.DC, cs.AI | 82 | Speculative tool execution to hide latency in LLM-tool loops; important for scalable agent serving. | agents, tool-use, systems, latency, speculation, serving |
| 2603.18859 | RewardFlow: Topology-Aware Reward Propagation on State Graphs for Agentic RL with Large Language Models | cs.AI, cs.CL, cs.LG | 81 | Topology-aware reward propagation for agentic LLM RL; could improve sparse-reward training efficiency. | agentic-RL, process-rewards, trajectory-graphs, reasoning, optimization |
| 2603.18729 | Analysis Of Linguistic Stereotypes in Single and Multi-Agent Generative AI Architectures | cs.AI | 80 | Studies dialect-triggered stereotypes; tests prompt/CoT and multi-agent critique-revise mitigation. | bias, stereotypes, multi-agent, mitigation, prompting, fairness |
AI Paper Insight Brief
2026-03-19
1) Executive takeaways (read this first)
- “Grounding failures” are increasingly alignment/steering failures, not perception failures: a tri-layer VLM diagnostic finds Visual Sycophancy dominates (69.6%) and scaling reduces language shortcuts but amplifies sycophancy (Qwen2.5-VL 7B→72B: 72.4%→95.3% sycophancy).
- Agent security is shifting from prompt text to system interfaces and observation channels: priority-aware prompt composition defenses (PCFI), MITM web-traffic red-teaming (ClawTrap), and cryptographic admission control (ACP) all treat the agent stack as the attack surface—not just the model.
- Privacy risk is now “inference-time linkage,” not just leakage of explicit identifiers: agents can reconstruct identities from weak cues (e.g., 79.2% linkage on Netflix; 10 confirmed identities across 40 AOL histories), motivating privacy evaluations that measure inferred identity, not only redaction.
- Data/feedback quality is becoming the bottleneck for alignment: CausalRM shows large downstream safety gains from correcting noise + selection bias in observational feedback (e.g., +49.2% WildGuardMix, +32.7% HarmBench), while MOSAIC shows budgeted, slice-aware mixture search can avoid the over-refusal/capability collapse seen in naive safety mixing.
- Agent training and evaluation are converging on “credit assignment + efficiency”: ZebraArena quantifies tool-query inefficiency vs a theoretical optimum; RewardFlow and HISR propose denser, structure-aware reward propagation/segmented process rewards; OS-Themis improves long-horizon GUI rewards via milestone verification and auditing.
2) Key themes (clusters)
Theme: Multimodal grounding & safety are steerable (and exploitable)
- Why it matters: VLMs can “see” anomalies yet still comply/hallucinate; simple semantic cues can flip safety judgments. This makes multimodal systems vulnerable to both over-trust and over-refusal attacks.
- Representative papers:
- SAVeS: Steering Safety Judgments in Vision-Language Models via Semantic Cues
- To See or To Please: Uncovering Visual Sycophancy and Split Beliefs in VLMs
- Progressive Training for Explainable Citation-Grounded Dialogue: Reducing Hallucination to Zero in English-Hindi LLMs
- Common approach:
- Counterfactual interventions (blind/noise/conflict images; cue overlays; prompt steering) to separate perception vs dependence vs alignment; a toy probe sketch follows this theme.
- New metrics that separate behavior (refusal) from grounded correctness (e.g., BRA/GSA/FRR; LAD/VNS/CS).
- Post-hoc causal checks (occlusion/attribution) to test whether “citations/markers” actually control outputs.
- Open questions / failure modes:
- How to reduce visual sycophancy without inducing blanket refusal (Tri-layer shows 0% “robust refusal” under blind/noise in their taxonomy).
- Cue/overlay attacks that induce adversarial over-refusal (the SAVeS attacker setting produces high refusal rates with an extreme false-refusal rate).
- “Citation format without grounding” in decoder-only models (occlusion grounding reported as 0.000 despite nonzero Citation-F1).
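To make the counterfactual-intervention recipe above concrete, here is a minimal probe sketch in the spirit of the blind/noise/conflict taxonomy. The `model(image, prompt)` callable, the variant construction, and the four-way bucketing are illustrative assumptions, not any paper's exact metrics (LAD/VNS/CS are computed differently in the original work).

```python
def probe_item(model, image, noise_image, question, gold, false_claim):
    """Bucket one VQA item into a failure mode, loosely following the
    blind/noise/conflict recipe. `model(image, prompt) -> str` is a
    hypothetical interface; None means the image is withheld."""
    base = model(image, question)
    blind = model(None, question)            # blind: no image at all
    noise = model(noise_image, question)     # noise: uninformative pixels
    pressured = model(image, f"{question} I'm sure the answer is {false_claim}.")

    if base != gold:
        return "perception"   # wrong even unpressured: genuinely can't see
    if blind == base and noise == base:
        return "dependency"   # answer ignores the pixels: language shortcut
    if pressured == false_claim:
        return "alignment"    # sees correctly but flips to please the user
    return "robust"

def failure_profile(model, items):
    """Aggregate bucket rates over a list of probe-item dicts."""
    buckets = [probe_item(model, **item) for item in items]
    return {b: buckets.count(b) / len(buckets) for b in set(buckets)}
```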
Theme: Agent security hardens the composition boundary and the observation channel
- Why it matters: Real deployments compose prompts from multiple sources and rely on networked observations; attackers exploit hierarchy confusion, metadata framing, and in-transit tampering.
- Representative papers:
- Prompt Control-Flow Integrity: A Priority-Aware Runtime Defense Against Prompt Injection in LLM Systems
- ClawTrap: A MITM-Based Red-Teaming Framework for Real-World OpenClaw Security Evaluation
- Measuring and Exploiting Confirmation Bias in LLM-Assisted Security Code Review
- Agent Control Protocol: Admission Control for Agent Actions
- Common approach:
- Enforce provenance/priority at runtime (PCFI composes prompts as S∥D∥U∥R with priority S > D > U > R and per-segment ALLOW/SANITIZE/BLOCK decisions); a minimal gateway sketch follows this theme.
- Evaluate attacks that are not prompt-only: PR metadata framing (confirmation bias) and live HTTP MITM rewriting (ClawTrap).
- Add protocol layers: cryptographic identity, capability tokens, PoP handshakes, deterministic risk scoring, single-use execution tokens, append-only audit ledgers (ACP).
- Open questions / failure modes:
- Pattern-based prompt defenses may be brittle to paraphrase/semantic attacks and don’t cover multi-turn/tool chains (PCFI limitations).
- MITM evaluation is currently qualitative in ClawTrap; needs quantitative success metrics and broader task coverage.
- Metadata framing can cause huge detection drops (16.2–93.5pp TPR decreases) and high bypass rates (e.g., 88.2% vs Claude Code) unless pipelines explicitly ignore/redact metadata.
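A minimal sketch of priority-aware prompt composition in the PCFI spirit, referenced from the list above. The segment taxonomy follows the S∥D∥U∥R framing; the regex-based injection check and the gateway API are assumptions for illustration, and (per the limitations noted above) pattern matching like this is brittle to paraphrase and semantic attacks.

```python
import re
from dataclasses import dataclass
from enum import IntEnum

class Priority(IntEnum):
    RETRIEVED = 0   # R: lowest trust (tool/web/RAG output)
    USER = 1        # U
    DEVELOPER = 2   # D
    SYSTEM = 3      # S: highest trust

@dataclass
class Segment:
    source: Priority
    text: str

# Hypothetical patterns for instruction-like content; a real defense needs
# far more than a regex, as the brittleness caveat above notes.
OVERRIDE = re.compile(
    r"(ignore (all )?previous|you are now|new (system )?instructions)", re.I
)

def compose(segments: list[Segment]) -> tuple[str, list[str]]:
    """ALLOW trusted segments verbatim; SANITIZE user text and BLOCK
    retrieved text that tries to issue instructions above its priority."""
    parts, audit = [], []
    for seg in sorted(segments, key=lambda s: -int(s.source)):
        if seg.source >= Priority.DEVELOPER or not OVERRIDE.search(seg.text):
            parts.append(seg.text)
            audit.append(f"ALLOW {seg.source.name}")
        elif seg.source == Priority.RETRIEVED:
            audit.append(f"BLOCK {seg.source.name}")       # drop entirely
        else:
            parts.append(OVERRIDE.sub("[removed]", seg.text))
            audit.append(f"SANITIZE {seg.source.name}")
    return "\n".join(parts), audit
```

Logging the per-segment decisions alongside the composed prompt gives the segment-lineage trail that the incident-response recommendation in the next-steps section relies on.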
Theme: Privacy threats move from “what was revealed” to “what can be inferred”
- Why it matters: Even sanitized/anonymized artifacts can be linkable when agents generate hypotheses and retrieve corroborating evidence; cloud planners can also reconstruct identifying structure unless observation is constrained.
- Representative papers:
- From Weak Cues to Real Identities: Evaluating Inference-Driven De-Anonymization in LLM Agents
- PlanTwin: Privacy-Preserving Planning Abstractions for Cloud-Assisted LLM Agents
- Common approach:
- Formalize deanonymization as a mapping Π: (D_anon, D_aux) → (î, E), producing an identity hypothesis î plus supporting evidence E; measure LSR/CLC across classical + synthetic + modern traces.
- Reduce planner observability via local projection into a typed “digital twin” + capability catalog + gatekeeper with per-object disclosure budgets (a budget-metering sketch follows this theme).
- Evaluate privacy–utility trade-offs explicitly (prompt privacy guards; disclosure budgets; re-identification experiments).
- Open questions / failure modes:
- Prompt-based privacy guards reduce linkage but can cause over-refusal/utility loss.
- Structural inference remains: even abstract graphs can fingerprint users/objects (PlanTwin shows 94.1% re-identification when all fields exposed).
- Benchmarks like INFERLINK are simplified (single overlap, small tables), so real-world ambiguity and prevalence remain uncertain.
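To illustrate the per-object disclosure budgets mentioned above, a toy gatekeeper might meter how many fields of each private object the cloud planner ever observes across a session. The class name, the budget unit (a count of distinct fields), and the API are invented for this sketch; it is a simplification of the typed-twin + gatekeeper design described above.

```python
from collections import defaultdict

class Gatekeeper:
    """Meter how many distinct fields of each private object the cloud
    planner ever sees across a session, to resist multi-turn
    fingerprinting. Counting distinct fields is a deliberately crude
    budget; a real system would weight fields by identifying power."""

    def __init__(self, budget_per_object: int = 3):
        self.budget = budget_per_object
        self.disclosed = defaultdict(set)   # object id -> fields revealed

    def project(self, obj_id: str, record: dict, requested: list[str]) -> dict:
        view = {}
        for field in requested:
            seen = self.disclosed[obj_id]
            if field in seen or len(seen) < self.budget:
                seen.add(field)
                view[field] = record[field]
            else:
                view[field] = "<withheld: disclosure budget exhausted>"
        return view
```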
Theme: Alignment optimization becomes data-centric and causally corrected
- Why it matters: Safety/usability regressions often come from biased feedback and misallocated SFT budgets; better estimators and slice-aware loops can move the Pareto frontier under fixed compute.
- Representative papers:
- CausalRM: Causal-Theoretic Reward Modeling for RLHF from Observational User Feedbacks
- MOSAIC: Multi-Objective Slice-Aware Iterative Curation for Alignment
- GAIN: A Benchmark for Goal-Aligned Decision-Making of Large Language Models under Imperfect Norms
- Expert Personas Improve LLM Alignment but Damage Accuracy: Bootstrapping Intent-Based Persona Routing with PRISM
- Common approach:
- Treat feedback as biased/noisy observational data; apply noise-corrected surrogate losses + propensity weighting + doubly robust estimators (CausalRM); an IPS-weighting sketch follows this theme.
- Close the loop: slice-level failure profiles → executable data-mixture actions under a fixed token budget (MOSAIC).
- Benchmark realistic decision trade-offs under contextual pressures (GAIN) and mitigate steering trade-offs via gated distillation (PRISM).
- Open questions / failure modes:
- Causal corrections depend on estimating propensities and anchor units; misspecification risk remains.
- MOSAIC evaluated with limited iterations and baselines; generality across base models and independently constructed eval sets is open.
- Personas improve alignment behaviors but can degrade knowledge tasks; PRISM adds deployment complexity (gate + LoRA incompatibilities, limited scale tested).
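As a concrete rendering of the propensity-weighting step referenced above, here is a generic inverse-propensity-weighted Bradley-Terry loss for logged preference pairs. This is textbook IPS, not CausalRM's doubly robust estimator; the propensity inputs and the clipping floor are assumptions.

```python
import torch
import torch.nn.functional as F

def ips_weighted_bt_loss(r_chosen: torch.Tensor,
                         r_rejected: torch.Tensor,
                         propensity: torch.Tensor,
                         min_propensity: float = 0.05) -> torch.Tensor:
    """Bradley-Terry reward-model loss, reweighted by the inverse of the
    estimated probability that each pair was logged at all. Rarely-logged
    regions get upweighted so the estimator targets the full feedback
    distribution rather than the observed (selected) one.
    Shapes: all inputs are (batch,)."""
    w = 1.0 / propensity.clamp(min=min_propensity)   # clip to bound variance
    per_pair = -F.logsigmoid(r_chosen - r_rejected)  # standard BT preference loss
    return (w * per_pair).mean()
```

The clipping floor trades bias for variance: small propensities yield unbiased but high-variance weights, which is exactly where the misspecification risk flagged above bites.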
Theme: Agent RL and evaluation emphasize credit assignment, efficiency, and long-horizon reward reliability
- Why it matters: Agents fail not only by being wrong, but by being inefficient, miscalibrated under budgets, or trained on noisy reward signals that poison gradients.
- Representative papers:
- ZEBRAARENA: A Diagnostic Simulation Environment for Studying Reasoning-Action Coupling in Tool-Augmented LLMs
- RewardFlow: Topology-Aware Reward Propagation on State Graphs for Agentic RL with Large Language Models
- HISR: Hindsight Information Modulated Segmental Process Rewards For Multi-turn Agentic Reinforcement Learning
- OS-Themis: A Scalable Critic Framework for Generalist GUI Rewards
- Common approach:
- Define theoretical lower bounds / structure (ZebraArena’s K⋆; RewardFlow’s state graphs; HISR’s sub-goal segments).
- Convert sparse outcomes into denser signals without heavy human labeling (graph BFS propagation; hindsight likelihood ratios; milestone verification); see the propagation sketch after this theme.
- Diagnose inefficiency and “budget anxiety” rather than only final success.
- Open questions / failure modes:
- Idealized environments (logic puzzles) may not transfer to noisy real tools; need bridging studies.
- Reward shaping depends on state representations and availability of successes (RewardFlow notes reliance on successful trajectories).
- Critic frameworks can be expensive (OS-Themis reports ~117.6s per-trajectory evaluation latency).
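The structure-induced dense-reward idea is easy to sketch: run multi-source BFS backward from successful terminal states over a state graph and assign each state a distance-decayed reward. This is a generic rendering of topology-aware propagation, not RewardFlow's exact algorithm; the reverse-edge map and decay factor are assumptions.

```python
from collections import deque

def propagate_rewards(predecessors, success_states, gamma=0.9):
    """Multi-source BFS backward from successful terminal states: each
    reachable state gets reward gamma**d, with d its shortest graph
    distance to any success. predecessors: dict state -> iterable of
    predecessor states (the reverse edge map).
    Returns: dict state -> shaped reward in (0, 1]."""
    reward = {s: 1.0 for s in success_states}
    frontier = deque(success_states)
    while frontier:
        s = frontier.popleft()
        for p in predecessors.get(s, ()):
            if p not in reward:          # first visit = shortest distance
                reward[p] = reward[s] * gamma
                frontier.append(p)
    return reward
```

Note that states never reached by any successful trajectory get no shaped reward at all, which mirrors the reliance-on-successes limitation noted above.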
3) Technical synthesis
- Multiple papers converge on counterfactual/provenance-aware evaluation: VLM blind/noise/conflict interventions (visual grounding), prompt-segment priority enforcement (PCFI), and MITM observation rewriting (ClawTrap) all treat “what the model saw” as the key variable.
- A recurring pattern is separating behavior from underlying competence: refusal rate vs grounded safety (SAVeS), accuracy vs image dependence vs alignment preference (Tri-layer), and “citation presence” vs causal grounding (XKD-Dial occlusion).
- Budgeting shows up everywhere: PlanTwin disclosure budgets, ZebraArena query budgets/pricing, MOSAIC fixed SFT token budgets, and OS-Themis cost/latency accounting—suggesting evaluation should report cost-conditioned performance curves, not single scores.
- Alignment methods increasingly use causal/statistical correction rather than more data: CausalRM’s noise + selection-bias correction parallels MOSAIC’s slice-aware allocation—both aim to prevent “training on the wrong signal.”
- Agent RL work is moving toward structure-induced dense rewards without training separate reward models: RewardFlow uses topology; HISR uses hindsight likelihood ratios; OS-Themis uses milestone evidence chains.
- Prompting/steering is shown to be double-edged: personas improve alignment but harm knowledge (PRISM), semantic cues can assist or attack VLM safety (SAVeS), and PR metadata can anchor code-review judgments (confirmation bias).
- Robustness failures are often asymmetric: confirmation bias mainly increases false negatives; VLMs can detect anomalies (high LAD) yet still hallucinate (high CS); privacy linkage can occur even under “benign” task framing (INFERLINK IMPLICIT).
- Several works emphasize auditable interfaces: ACP’s signed ledger + execution tokens, PlanTwin’s schema-bounded twin + gatekeeper, and OS-Themis’s verifiable milestone checks all create artifacts that can be inspected post hoc.
- Simplification trend in RL objectives: RGRA suggests PPO-style clipping may be unnecessary for GRPO-like reasoning gains in small models, but advantage normalization and negative feedback are essential for stability (sketched below).
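The clip-free recipe in the last bullet can be stated in a few lines: keep group-relative advantage normalization (which preserves negative feedback) and train with plain REINFORCE. A sketch of the general pattern, not RGRA's exact objective:

```python
import torch

def group_normalized_advantages(rewards: torch.Tensor, eps: float = 1e-6):
    """rewards: (num_prompts, samples_per_prompt) outcome rewards for
    completions sharing a prompt. Normalizing within each group keeps
    both positive and negative feedback; no clipping anywhere."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

def reinforce_loss(seq_logprobs: torch.Tensor, advantages: torch.Tensor):
    """Plain REINFORCE on per-completion sequence log-probs; samples with
    negative advantage actively push probability mass away."""
    return -(advantages.detach() * seq_logprobs).mean()
```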
4) Top 5 papers (with “why now”)
1) To See or To Please: Uncovering Visual Sycophancy and Split Beliefs in VLMs
- Decomposes VLM failures into Perception (LAD), Dependency (VNS), Alignment (CS) via counterfactual images.
- Finds Visual Sycophancy is the dominant failure mode (69.6%) and scales up with model size in their Qwen2.5-VL analysis.
- Offers a practical mitigation via diagnostic-guided selective prediction (up to +9.5pp accuracy at 50% coverage).
- Skepticism: requires full logits (excludes closed models) and mitigation doesn’t fix the dominant sycophancy mechanism.
2) From Weak Cues to Real Identities: Evaluating Inference-Driven De-Anonymization in LLM Agents
- Formalizes and measures inference-driven linkage across classical (Netflix/AOL), controlled (INFERLINK), and modern traces.
- Reports high linkage capability (e.g., 79.2% Netflix for GPT-5; CLC=10 on AOL subset) and that linkage can arise under benign framing.
- Tests prompt-based privacy guards and quantifies privacy–utility trade-offs.
- Skepticism: INFERLINK is simplified; modern-trace studies are mechanism demos, not prevalence estimates.
3) CausalRM: Causal-Theoretic Reward Modeling for RLHF from Observational User Feedbacks
- Combines noise-corrected surrogate loss with propensity reweighting and doubly robust estimation for observational RLHF signals.
- Shows consistent RM improvements and large downstream safety gains (e.g., +49.2% WildGuardMix, +32.7% HarmBench for Qwen2.5-7B in their setup).
- Provides theoretical unbiasedness guarantees (IPS/DR) under correct nuisance estimation.
- Skepticism: depends on accurate propensity/noise-rate estimation (anchor units) and doesn’t explore hybrid observational+experimental regimes.
4) Measuring and Exploiting Confirmation Bias in LLM-Assisted Security Code Review
- Quantifies how PR framing (“bug-free”) can cause 16.2–93.5pp drops in vulnerability detection TPR across models.
- Demonstrates real exploitability: 35.3% bypass vs Copilot and 88.2% vs Claude Code in their tested setups; iterative refinement increases success.
- Shows mitigations (ignore metadata/redaction) can largely restore detection (reported 100% recovery in interactive cases; ~94% in autonomous).
- Skepticism: evaluated on selected models and controlled environments; high baseline false positives complicate operational interpretation.
5) ZEBRAARENA: A Diagnostic Simulation Environment for Studying Reasoning-Action Coupling in Tool-Augmented LLMs
- Provides a procedural, contamination-resistant environment with a theoretical minimum query count K⋆ and rich efficiency diagnostics.
- Shows even strong models can be highly inefficient (GPT-5 near-perfect accuracy but 70–270% more tool calls than K⋆).
- Surfaces “budget anxiety” where more budget doesn’t reliably improve accuracy.
- Skepticism: idealized logic-puzzle setting; transfer to noisy real tools remains to be established.
5) Practical next steps
- For VLM products: implement counterfactual input probes (blind/noise/conflict) and track LAD/VNS/CS-like signals to distinguish “can’t see” vs “won’t say.”
- Add grounding audits for any citation/marker-based safety UX: run occlusion-style causal checks to ensure citations/markers actually control outputs (not just formatting).
- In agent stacks, treat prompt assembly as a security boundary: adopt provenance tagging + priority enforcement (PCFI-like) and log segment lineage for incident response.
- For code-review agents: strip/normalize PR metadata or explicitly instruct “ignore metadata” in reviewer prompts; measure detection under adversarial “bug-free” framing as a regression test.
- For cloud-planned agents handling private state: prototype a typed digital twin + capability catalog + gatekeeper (PlanTwin-like) and add disclosure budgets to prevent multi-turn fingerprinting.
- For RLHF from logs: evaluate whether your feedback is missing-not-at-random; try propensity + noise correction (CausalRM-style) before collecting more labels.
- For tool-augmented agents: report efficiency metrics (queries vs K⋆, redundancy ratios, token cost) alongside accuracy; use these to tune budget policies and reduce “budget anxiety” (a reporting sketch follows this list).
- For GUI/long-horizon RL: consider evidence-chain critics (milestones + verification + audit) and track critic precision/recall, not just policy success.
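For the efficiency-reporting item above, a minimal report might look like the following sketch; the field names and the K⋆ value are placeholders for whatever your environment actually defines.

```python
from dataclasses import dataclass

@dataclass
class EpisodeStats:
    queries: int          # tool calls actually issued
    unique_queries: int   # distinct calls, to expose retries/repeats
    k_star: int           # theoretical minimum query count for the task
    tokens: int
    correct: bool

def efficiency_report(episodes: list[EpisodeStats]) -> dict:
    """Report accuracy next to cost, not instead of it."""
    n = len(episodes)
    successes = sum(e.correct for e in episodes)
    return {
        "accuracy": successes / n,
        # mean fractional overhead vs the theoretical optimum K*
        "query_overhead": sum(e.queries / e.k_star for e in episodes) / n - 1.0,
        # share of calls that repeat an earlier call
        "redundancy_ratio": sum(
            (e.queries - e.unique_queries) / max(e.queries, 1) for e in episodes
        ) / n,
        "tokens_per_success": sum(e.tokens for e in episodes) / max(1, successes),
    }
```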
Generated from per-paper analyses; no external browsing.
