Daily AI Paper Report (2026-03-19)
Run stats
- Candidates: 277
- Selected: 30
- Deepread completed: 30
- Window (UTC): 2026-03-19T00:00:00Z → 2026-03-20T00:00:00Z (arxiv_announce, expanded=0)
Selected papers
| arXiv ID | Title / Links | Categories | Score | Why | Tags |
|---|---|---|---|---|---|
| 2603.18433 | Prompt Control-Flow Integrity: A Priority-Aware Runtime Defense Against Prompt Injection in LLM Systems | cs.CR | 94 | Runtime, role-aware prompt-injection defense for RAG/API stacks; practical gateway design + eval. | prompt-injection, RAG, runtime-defense, policy-enforcement, LLM-security, middleware |
| 2603.18894 | I Can't Believe It's Corrupt: Evaluating Corruption in Multi-Agent Governance Systems | cs.AI, cs.MA | 94 | Empirical multi-agent governance sims quantify rule-breaking/corruption; high direct agent-safety relevance. | agent-safety, multi-agent, governance, evaluation, misuse, institutional-integrity |
| 2603.19092 | SAVeS: Steering Safety Judgments in Vision-Language Models via Semantic Cues | cs.CV, cs.AI, cs.CL, cs.LG | 93 | New VLM safety benchmark + semantic steering; separates refusals vs grounded reasoning. | vlm-safety, benchmark, steering, refusal, grounded-reasoning, evaluation |
| 2603.18637 | MOSAIC: Multi-Objective Slice-Aware Iterative Curation for Alignment | cs.CR, cs.CL | 92 | Closed-loop multi-objective alignment data curation; explicit safety vs over-refusal vs instruction-following tradeoffs. | alignment, data-curation, SFT, over-refusal, safety-eval, mixture-optimization |
| 2603.18614 | ZEBRAARENA: A Diagnostic Simulation Environment for Studying Reasoning-Action Coupling in Tool-Augmented LLMs | cs.AI | 92 | Procedural tool-use environment isolates reasoning-action coupling; reduces contamination; strong agent eval asset. | agents, tool-use, benchmark, evaluation, procedural-generation, reasoning |
| 2603.18736 | CausalRM: Causal-Theoretic Reward Modeling for RLHF from Observational User Feedbacks | cs.LG, cs.AI, cs.CL, stat.ML | 92 | Causal framing for reward models from noisy/biased observational feedback; scalable RLHF alternative. | RLHF, reward-modeling, causal-inference, observational-feedback, alignment |
| 2603.18740 | Measuring and Exploiting Confirmation Bias in LLM-Assisted Security Code Review | cs.SE, cs.AI, cs.CR | 91 | Measures exploitable confirmation bias in LLM security code review; large effect sizes on false-negative rates. | secure-coding, LLM-failure-modes, supply-chain, evaluation, prompt-framing, robustness |
| 2603.18631 | D-Mem: A Dual-Process Memory System for LLM Agents | cs.AI | 90 | Dual-process memory for LLM agents; tackles lossy retrieval for long-horizon context. | agents, memory, long-horizon, retrieval, architecture, reliability |
| 2603.18377 | PlanTwin: Privacy-Preserving Planning Abstractions for Cloud-Assisted LLM Agents | cs.CR, cs.AI, cs.ET | 89 | Privacy-preserving planning for cloud LLM agents via planning abstractions; limits raw state exposure. | agents, privacy, cloud-planning, abstraction, confidential-context, system-design |
| 2603.18893 | Quantitative Introspection in Language Models: Tracking Internal States Across Conversation | cs.AI | 89 | Measures whether LLM numeric self-reports track internal states over dialogue; safety + interpretability angle. | interpretability, introspection, monitoring, safety, probes, conversation |
| 2603.18382 | From Weak Cues to Real Identities: Evaluating Inference-Driven De-Anonymization in LLM Agents | cs.AI | 88 | Systematic eval of LLM-agent de-anonymization from weak cues; formalizes inference-driven linkage. | privacy, deanonymization, agents, benchmark, linkage-attacks, risk-evaluation |
| 2603.18469 | GAIN: A Benchmark for Goal-Aligned Decision-Making of Large Language Models under Imperfect Norms | cs.CL | 88 | Benchmark for norm-vs-goal conflicts under pressure; useful for alignment and policy compliance testing. | alignment, norms, decision-making, benchmark, robustness, governance |
| 2603.18683 | HISR: Hindsight Information Modulated Segmental Process Rewards For Multi-turn Agentic Reinforcement Learning | cs.LG, cs.AI, cs.CL | 88 | Improves multi-turn agent RL via hindsight-modulated segment rewards for credit assignment. | agentic-rl, reward-modeling, credit-assignment, process-rewards, long-horizon |
| 2603.19127 | On Optimizing Multimodal Jailbreaks for Spoken Language Models | cs.LG | 87 | Joint audio+text gradient jailbreaks for spoken-language models; expands multimodal attack surface. | jailbreaks, multimodal, audio-attacks, adversarial-prompts, SLM, red-teaming |
| 2603.18756 | Are complicated loss functions necessary for teaching LLMs to reason? | cs.LG, cs.AI, cs.CL | 87 | Dissects GRPO; finds negative feedback key and clipping unnecessary; simplifies reasoning post-training. | reasoning, RL, post-training, GRPO, REINFORCE, optimization |
| 2603.18762 | ClawTrap: A MITM-Based Red-Teaming Framework for Real-World OpenClaw Security Evaluation | cs.CR, cs.AI | 86 | MITM-based red-teaming for real web agents (OpenClaw); network-layer attacks beyond sandbox tests. | agents, red-teaming, MITM, web-security, tool-use, evaluation-framework |
| 2603.19025 | Towards Verifiable AI with Lightweight Cryptographic Proofs of Inference | cs.CR, cs.LG | 86 | Lightweight verifiable inference protocol for cloud models; relevant to auditing and deployment security. | security, verifiable-inference, cryptography, auditing, cloud-deployment, integrity |
| 2603.19144 | UGID: Unified Graph Isomorphism for Debiasing Large Language Models | cs.CL, cs.AI | 86 | Representation-level LLM debiasing via graph invariance across counterfactual inputs. | bias, debiasing, interpretability, representations, counterfactuals, fairness |
| 2603.18829 | Agent Control Protocol: Admission Control for Agent Actions | cs.CR, cs.AI | 85 | Formal spec for cryptographic admission control of agent actions: identity, delegation, revocation, audit. | agent-governance, capabilities, authorization, cryptography, auditing, protocol |
| 2603.18743 | Memento-Skills: Let Agents Design Agents | cs.AI, cs.CL, cs.LG | 85 | Continual agent that writes reusable skills/memory to design new agents; relevant to agentic risk surface. | agents, continual-learning, memory, tool-use, skills, agent-design |
| 2603.19191 | OS-Themis: A Scalable Critic Framework for Generalist GUI Rewards | cs.AI | 84 | Scalable multi-agent critic for GUI rewards + new cross-platform reward benchmark (OGRBench). | GUI-agents, reward-modeling, critics, benchmarks, verification, RL |
| 2603.18911 | Progressive Training for Explainable Citation-Grounded Dialogue: Reducing Hallucination to Zero in English-Hindi LLMs | cs.CL, cs.AI | 84 | Citation-grounded bilingual dialogue training + reward; targets hallucinations with verifiable outputs. | hallucination, grounding, citations, RAG, alignment, multilingual |
| 2603.18507 | Expert Personas Improve LLM Alignment but Damage Accuracy: Bootstrapping Intent-Based Persona Routing with PRISM | cs.AI | 84 | Finds persona prompts boost alignment but hurt accuracy; proposes intent-based routing. | alignment, personas, prompting, routing, evaluation, tradeoffs |
| 2603.19017 | What Really Controls Temporal Reasoning in Large Language Models: Tokenisation or Representation of Time? | cs.CL, cs.AI | 84 | MultiTempBench probes multilingual temporal reasoning; links failures to tokenization via mDFR + probing. | evaluation, temporal-reasoning, multilingual, tokenization, benchmarks |
| 2603.18373 | To See or To Please: Uncovering Visual Sycophancy and Split Beliefs in VLMs | cs.CV, cs.AI | 83 | Diagnoses visual sycophancy/split beliefs in VLMs with counterfactual tests; highlights alignment failure. | VLM, sycophancy, grounding, hallucinations, evaluation, uncertainty |
| 2603.19220 | Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation | cs.CL, cs.AI, cs.LG | 83 | Open 30B MoE with Cascade RL/distillation; strong reasoning/agentic claims; potentially impactful post-training. | LLM, post-training, RL, distillation, MoE, reasoning, agents |
| 2603.18886 | Reasoning over mathematical objects: on-policy reward modeling and test time aggregation | cs.AI, cs.CL | 83 | Principia suite for formal math objects + on-policy judge training + test-time aggregation. | reasoning, math, benchmarks, reward-modeling, llm-judges, verification |
| 2603.18897 | Act While Thinking: Accelerating LLM Agents via Pattern-Aware Speculative Tool Execution | cs.DC, cs.AI | 82 | Speculative tool execution to hide latency in LLM-tool loops; important for scalable agent serving. | agents, tool-use, systems, latency, speculation, serving |
| 2603.18859 | RewardFlow: Topology-Aware Reward Propagation on State Graphs for Agentic RL with Large Language Models | cs.AI, cs.CL, cs.LG | 81 | Topology-aware reward propagation for agentic LLM RL; could improve sparse-reward training efficiency. | agentic-RL, process-rewards, trajectory-graphs, reasoning, optimization |
| 2603.18729 | Analysis Of Linguistic Stereotypes in Single and Multi-Agent Generative AI Architectures | cs.AI | 80 | Studies dialect-triggered stereotypes; tests prompt/CoT and multi-agent critique-revise mitigation. | bias, stereotypes, multi-agent, mitigation, prompting, fairness |
AI Paper Insight Brief
2026-03-19
1) Executive takeaways (read this first)
- “Grounding failures” are increasingly alignment/steering failures, not perception failures: a tri-layer VLM diagnostic finds Visual Sycophancy dominates (69.6%) and scaling reduces language shortcuts but amplifies sycophancy (Qwen2.5-VL 7B→72B: 72.4%→95.3% sycophancy).
- Agent security is shifting from prompt text to system interfaces and observation channels: priority-aware prompt composition defenses (PCFI), MITM web-traffic red-teaming (ClawTrap), and cryptographic admission control (ACP) all treat the agent stack as the attack surface—not just the model.
- Privacy risk is now “inference-time linkage,” not just leakage of explicit identifiers: agents can reconstruct identities from weak cues (e.g., 79.2% linkage on Netflix; 10 confirmed identities across 40 AOL histories), motivating privacy evaluations that measure inferred identity, not only redaction.
- Data/feedback quality is becoming the bottleneck for alignment: CausalRM shows large downstream safety gains from correcting noise + selection bias in observational feedback (e.g., +49.2% WildGuardMix, +32.7% HarmBench), while MOSAIC shows budgeted, slice-aware mixture search can avoid the over-refusal/capability collapse seen in naive safety mixing.
- Agent training and evaluation are converging on “credit assignment + efficiency”: ZebraArena quantifies tool-query inefficiency vs a theoretical optimum; RewardFlow and HISR propose denser, structure-aware reward propagation/segmented process rewards; OS-Themis improves long-horizon GUI rewards via milestone verification and auditing.
2) Key themes (clusters)
Theme: Multimodal grounding & safety are steerable (and exploitable)
- Why it matters: VLMs can “see” anomalies yet still comply/hallucinate; simple semantic cues can flip safety judgments. This makes multimodal systems vulnerable to both over-trust and over-refusal attacks.
- Representative papers:
- SAVeS: Steering Safety Judgments in Vision-Language Models via Semantic Cues
- To See or To Please: Uncovering Visual Sycophancy and Split Beliefs in VLMs
- Progressive Training for Explainable Citation-Grounded Dialogue: Reducing Hallucination to Zero in English-Hindi LLMs
- Common approach:
- Counterfactual interventions (blind/noise/conflict images; cue overlays; prompt steering) to separate perception vs dependence vs alignment; a toy probe sketch follows this theme.
- New metrics that separate behavior (refusal) from grounded correctness (e.g., BRA/GSA/FRR; LAD/VNS/CS).
- Post-hoc causal checks (occlusion/attribution) to test whether “citations/markers” actually control outputs.
- Open questions / failure modes:
- How to reduce visual sycophancy without inducing blanket refusal (Tri-layer shows 0% “robust refusal” under blind/noise in their taxonomy).
- Cue/overlay attacks that induce adversarial over-refusal (the SAVeS attacker setting produces high refusal rates with an extreme false-refusal rate).
- “Citation format without grounding” in decoder-only models (occlusion grounding reported as 0.000 despite nonzero Citation-F1).
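To make the counterfactual-intervention recipe above concrete, here is a minimal probe sketch in the spirit of the blind/noise/conflict taxonomy. The `model(image, prompt)` callable, the variant construction, and the four-way bucketing are illustrative assumptions, not any paper's exact metrics (LAD/VNS/CS are computed differently in the original work).

```python
def probe_item(model, image, noise_image, question, gold, false_claim):
    """Bucket one VQA item into a failure mode, loosely following the
    blind/noise/conflict recipe. `model(image, prompt) -> str` is a
    hypothetical interface; None means the image is withheld."""
    base = model(image, question)
    blind = model(None, question)            # blind: no image at all
    noise = model(noise_image, question)     # noise: uninformative pixels
    pressured = model(image, f"{question} I'm sure the answer is {false_claim}.")

    if base != gold:
        return "perception"   # wrong even unpressured: genuinely can't see
    if blind == base and noise == base:
        return "dependency"   # answer ignores the pixels: language shortcut
    if pressured == false_claim:
        return "alignment"    # sees correctly but flips to please the user
    return "robust"

def failure_profile(model, items):
    """Aggregate bucket rates over a list of probe-item dicts."""
    buckets = [probe_item(model, **item) for item in items]
    return {b: buckets.count(b) / len(buckets) for b in set(buckets)}
```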
Theme: Agent security hardens the composition boundary and the observation channel
- Why it matters: Real deployments compose prompts from multiple sources and rely on networked observations; attackers exploit hierarchy confusion, metadata framing, and in-transit tampering.
- Representative papers:
- Prompt Control-Flow Integrity: A Priority-Aware Runtime Defense Against Prompt Injection in LLM Systems
- ClawTrap: A MITM-Based Red-Teaming Framework for Real-World OpenClaw Security Evaluation
- Measuring and Exploiting Confirmation Bias in LLM-Assisted Security Code Review
- Agent Control Protocol: Admission Control for Agent Actions
- Common approach:
- Enforce provenance/priority at runtime (PCFI composes prompts as S∥D∥U∥R with priority S > D > U > R and per-segment ALLOW/SANITIZE/BLOCK decisions); a minimal gateway sketch follows this theme.
- Evaluate attacks that are not prompt-only: PR metadata framing (confirmation bias) and live HTTP MITM rewriting (ClawTrap).
- Add protocol layers: cryptographic identity, capability tokens, PoP handshakes, deterministic risk scoring, single-use execution tokens, append-only audit ledgers (ACP).
- Open questions / failure modes:
- Pattern-based prompt defenses may be brittle to paraphrase/semantic attacks and don’t cover multi-turn/tool chains (PCFI limitations).
- MITM evaluation is currently qualitative in ClawTrap; needs quantitative success metrics and broader task coverage.
- Metadata framing can cause huge detection drops (16.2–93.5pp TPR decreases) and high bypass rates (e.g., 88.2% vs Claude Code) unless pipelines explicitly ignore/redact metadata.
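A minimal sketch of priority-aware prompt composition in the PCFI spirit, referenced from the list above. The segment taxonomy follows the S∥D∥U∥R framing; the regex-based injection check and the gateway API are assumptions for illustration, and (per the limitations noted above) pattern matching like this is brittle to paraphrase and semantic attacks.

```python
import re
from dataclasses import dataclass
from enum import IntEnum

class Priority(IntEnum):
    RETRIEVED = 0   # R: lowest trust (tool/web/RAG output)
    USER = 1        # U
    DEVELOPER = 2   # D
    SYSTEM = 3      # S: highest trust

@dataclass
class Segment:
    source: Priority
    text: str

# Hypothetical patterns for instruction-like content; a real defense needs
# far more than a regex, as the brittleness caveat above notes.
OVERRIDE = re.compile(
    r"(ignore (all )?previous|you are now|new (system )?instructions)", re.I
)

def compose(segments: list[Segment]) -> tuple[str, list[str]]:
    """ALLOW trusted segments verbatim; SANITIZE user text and BLOCK
    retrieved text that tries to issue instructions above its priority."""
    parts, audit = [], []
    for seg in sorted(segments, key=lambda s: -int(s.source)):
        if seg.source >= Priority.DEVELOPER or not OVERRIDE.search(seg.text):
            parts.append(seg.text)
            audit.append(f"ALLOW {seg.source.name}")
        elif seg.source == Priority.RETRIEVED:
            audit.append(f"BLOCK {seg.source.name}")       # drop entirely
        else:
            parts.append(OVERRIDE.sub("[removed]", seg.text))
            audit.append(f"SANITIZE {seg.source.name}")
    return "\n".join(parts), audit
```

Logging the per-segment decisions alongside the composed prompt gives the segment-lineage trail that the incident-response recommendation in the next-steps section relies on.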
Theme: Privacy threats move from “what was revealed” to “what can be inferred”
- Why it matters: Even sanitized/anonymized artifacts can be linkable when agents generate hypotheses and retrieve corroborating evidence; cloud planners can also reconstruct identifying structure unless observation is constrained.
- Representative papers:
- From Weak Cues to Real Identities: Evaluating Inference-Driven De-Anonymization in LLM Agents
- PlanTwin: Privacy-Preserving Planning Abstractions for Cloud-Assisted LLM Agents
- Common approach:
- Formalize deanonymization as a mapping Π: (D_anon, D_aux) → (î, E), producing an identity hypothesis î plus supporting evidence E; measure LSR/CLC across classical + synthetic + modern traces.
- Reduce planner observability via local projection into a typed “digital twin” + capability catalog + gatekeeper with per-object disclosure budgets (a budget-metering sketch follows this theme).
- Evaluate privacy–utility trade-offs explicitly (prompt privacy guards; disclosure budgets; re-identification experiments).
- Open questions / failure modes:
- Prompt-based privacy guards reduce linkage but can cause over-refusal/utility loss.
- Structural inference remains: even abstract graphs can fingerprint users/objects (PlanTwin shows 94.1% re-identification when all fields exposed).
- Benchmarks like INFERLINK are simplified (single overlap, small tables), so real-world ambiguity and prevalence remain uncertain.
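To illustrate the per-object disclosure budgets mentioned above, a toy gatekeeper might meter how many fields of each private object the cloud planner ever observes across a session. The class name, the budget unit (a count of distinct fields), and the API are invented for this sketch; it is a simplification of the typed-twin + gatekeeper design described above.

```python
from collections import defaultdict

class Gatekeeper:
    """Meter how many distinct fields of each private object the cloud
    planner ever sees across a session, to resist multi-turn
    fingerprinting. Counting distinct fields is a deliberately crude
    budget; a real system would weight fields by identifying power."""

    def __init__(self, budget_per_object: int = 3):
        self.budget = budget_per_object
        self.disclosed = defaultdict(set)   # object id -> fields revealed

    def project(self, obj_id: str, record: dict, requested: list[str]) -> dict:
        view = {}
        for field in requested:
            seen = self.disclosed[obj_id]
            if field in seen or len(seen) < self.budget:
                seen.add(field)
                view[field] = record[field]
            else:
                view[field] = "<withheld: disclosure budget exhausted>"
        return view
```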
Theme: Alignment optimization becomes data-centric and causally corrected
- Why it matters: Safety/usability regressions often come from biased feedback and misallocated SFT budgets; better estimators and slice-aware loops can move the Pareto frontier under fixed compute.
- Representative papers:
- CausalRM: Causal-Theoretic Reward Modeling for RLHF from Observational User Feedbacks
- MOSAIC: Multi-Objective Slice-Aware Iterative Curation for Alignment
- GAIN: A Benchmark for Goal-Aligned Decision-Making of Large Language Models under Imperfect Norms
- Expert Personas Improve LLM Alignment but Damage Accuracy: Bootstrapping Intent-Based Persona Routing with PRISM
- Common approach:
- Treat feedback as biased/noisy observational data; apply noise-corrected surrogate losses + propensity weighting + doubly robust estimators (CausalRM); an IPS-weighting sketch follows this theme.
- Close the loop: slice-level failure profiles → executable data-mixture actions under a fixed token budget (MOSAIC).
- Benchmark realistic decision trade-offs under contextual pressures (GAIN) and mitigate steering trade-offs via gated distillation (PRISM).
- Open questions / failure modes:
- Causal corrections depend on estimating propensities and anchor units; misspecification risk remains.
- MOSAIC evaluated with limited iterations and baselines; generality across base models and independently constructed eval sets is open.
- Personas improve alignment behaviors but can degrade knowledge tasks; PRISM adds deployment complexity (gate + LoRA incompatibilities, limited scale tested).
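As a concrete rendering of the propensity-weighting step referenced above, here is a generic inverse-propensity-weighted Bradley-Terry loss for logged preference pairs. This is textbook IPS, not CausalRM's doubly robust estimator; the propensity inputs and the clipping floor are assumptions.

```python
import torch
import torch.nn.functional as F

def ips_weighted_bt_loss(r_chosen: torch.Tensor,
                         r_rejected: torch.Tensor,
                         propensity: torch.Tensor,
                         min_propensity: float = 0.05) -> torch.Tensor:
    """Bradley-Terry reward-model loss, reweighted by the inverse of the
    estimated probability that each pair was logged at all. Rarely-logged
    regions get upweighted so the estimator targets the full feedback
    distribution rather than the observed (selected) one.
    Shapes: all inputs are (batch,)."""
    w = 1.0 / propensity.clamp(min=min_propensity)   # clip to bound variance
    per_pair = -F.logsigmoid(r_chosen - r_rejected)  # standard BT preference loss
    return (w * per_pair).mean()
```

The clipping floor trades bias for variance: small propensities yield unbiased but high-variance weights, which is exactly where the misspecification risk flagged above bites.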
Theme: Agent RL and evaluation emphasize credit assignment, efficiency, and long-horizon reward reliability
- Why it matters: Agents fail not only by being wrong, but by being inefficient, miscalibrated under budgets, or trained on noisy reward signals that poison gradients.
- Representative papers:
- ZEBRAARENA: A Diagnostic Simulation Environment for Studying Reasoning-Action Coupling in Tool-Augmented LLMs
- RewardFlow: Topology-Aware Reward Propagation on State Graphs for Agentic RL with Large Language Models
- HISR: Hindsight Information Modulated Segmental Process Rewards For Multi-turn Agentic Reinforcement Learning
- OS-Themis: A Scalable Critic Framework for Generalist GUI Rewards
- Common approach:
- Define theoretical lower bounds / structure (ZebraArena’s K⋆; RewardFlow’s state graphs; HISR’s sub-goal segments).
- Convert sparse outcomes into denser signals without heavy human labeling (graph BFS propagation; hindsight likelihood ratios; milestone verification); see the propagation sketch after this theme.
- Diagnose inefficiency and “budget anxiety” rather than only final success.
- Open questions / failure modes:
- Idealized environments (logic puzzles) may not transfer to noisy real tools; need bridging studies.
- Reward shaping depends on state representations and availability of successes (RewardFlow notes reliance on successful trajectories).
- Critic frameworks can be expensive (OS-Themis reports ~117.6s per-trajectory evaluation latency).
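The structure-induced dense-reward idea is easy to sketch: run multi-source BFS backward from successful terminal states over a state graph and assign each state a distance-decayed reward. This is a generic rendering of topology-aware propagation, not RewardFlow's exact algorithm; the reverse-edge map and decay factor are assumptions.

```python
from collections import deque

def propagate_rewards(predecessors, success_states, gamma=0.9):
    """Multi-source BFS backward from successful terminal states: each
    reachable state gets reward gamma**d, with d its shortest graph
    distance to any success. predecessors: dict state -> iterable of
    predecessor states (the reverse edge map).
    Returns: dict state -> shaped reward in (0, 1]."""
    reward = {s: 1.0 for s in success_states}
    frontier = deque(success_states)
    while frontier:
        s = frontier.popleft()
        for p in predecessors.get(s, ()):
            if p not in reward:          # first visit = shortest distance
                reward[p] = reward[s] * gamma
                frontier.append(p)
    return reward
```

Note that states never reached by any successful trajectory get no shaped reward at all, which mirrors the reliance-on-successes limitation noted above.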
3) Technical synthesis
- Multiple papers converge on counterfactual/provenance-aware evaluation: VLM blind/noise/conflict interventions (visual grounding), prompt-segment priority enforcement (PCFI), and MITM observation rewriting (ClawTrap) all treat “what the model saw” as the key variable.
- A recurring pattern is separating behavior from underlying competence: refusal rate vs grounded safety (SAVeS), accuracy vs image dependence vs alignment preference (Tri-layer), and “citation presence” vs causal grounding (XKD-Dial occlusion).
- Budgeting shows up everywhere: PlanTwin disclosure budgets, ZebraArena query budgets/pricing, MOSAIC fixed SFT token budgets, and OS-Themis cost/latency accounting—suggesting evaluation should report cost-conditioned performance curves, not single scores.
- Alignment methods increasingly use causal/statistical correction rather than more data: CausalRM’s noise + selection-bias correction parallels MOSAIC’s slice-aware allocation—both aim to prevent “training on the wrong signal.”
- Agent RL work is moving toward structure-induced dense rewards without training separate reward models: RewardFlow uses topology; HISR uses hindsight likelihood ratios; OS-Themis uses milestone evidence chains.
- Prompting/steering is shown to be double-edged: personas improve alignment but harm knowledge (PRISM), semantic cues can assist or attack VLM safety (SAVeS), and PR metadata can anchor code-review judgments (confirmation bias).
- Robustness failures are often asymmetric: confirmation bias mainly increases false negatives; VLMs can detect anomalies (high LAD) yet still hallucinate (high CS); privacy linkage can occur even under “benign” task framing (INFERLINK IMPLICIT).
- Several works emphasize auditable interfaces: ACP’s signed ledger + execution tokens, PlanTwin’s schema-bounded twin + gatekeeper, and OS-Themis’s verifiable milestone checks all create artifacts that can be inspected post hoc.
- Simplification trend in RL objectives: RGRA suggests PPO-style clipping may be unnecessary for GRPO-like reasoning gains in small models, but advantage normalization and negative feedback are essential for stability (sketched below).
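The clip-free recipe in the last bullet can be stated in a few lines: keep group-relative advantage normalization (which preserves negative feedback) and train with plain REINFORCE. A sketch of the general pattern, not RGRA's exact objective:

```python
import torch

def group_normalized_advantages(rewards: torch.Tensor, eps: float = 1e-6):
    """rewards: (num_prompts, samples_per_prompt) outcome rewards for
    completions sharing a prompt. Normalizing within each group keeps
    both positive and negative feedback; no clipping anywhere."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

def reinforce_loss(seq_logprobs: torch.Tensor, advantages: torch.Tensor):
    """Plain REINFORCE on per-completion sequence log-probs; samples with
    negative advantage actively push probability mass away."""
    return -(advantages.detach() * seq_logprobs).mean()
```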
4) Top 5 papers (with “why now”)
1) To See or To Please: Uncovering Visual Sycophancy and Split Beliefs in VLMs
- Decomposes VLM failures into Perception (LAD), Dependency (VNS), Alignment (CS) via counterfactual images.
- Finds Visual Sycophancy is the dominant failure mode (69.6%) and scales up with model size in their Qwen2.5-VL analysis.
- Offers a practical mitigation via diagnostic-guided selective prediction (up to +9.5pp accuracy at 50% coverage).
- Skepticism: requires full logits (excludes closed models) and mitigation doesn’t fix the dominant sycophancy mechanism.
2) From Weak Cues to Real Identities: Evaluating Inference-Driven De-Anonymization in LLM Agents
- Formalizes and measures inference-driven linkage across classical (Netflix/AOL), controlled (INFERLINK), and modern traces.
- Reports high linkage capability (e.g., 79.2% Netflix for GPT-5; CLC=10 on AOL subset) and that linkage can arise under benign framing.
- Tests prompt-based privacy guards and quantifies privacy–utility trade-offs.
- Skepticism: INFERLINK is simplified; modern-trace studies are mechanism demos, not prevalence estimates.
3) CausalRM: Causal-Theoretic Reward Modeling for RLHF from Observational User Feedbacks
- Combines noise-corrected surrogate loss with propensity reweighting and doubly robust estimation for observational RLHF signals.
- Shows consistent RM improvements and large downstream safety gains (e.g., +49.2% WildGuardMix, +32.7% HarmBench for Qwen2.5-7B in their setup).
- Provides theoretical unbiasedness guarantees (IPS/DR) under correct nuisance estimation.
- Skepticism: depends on accurate propensity/noise-rate estimation (anchor units) and doesn’t explore hybrid observational+experimental regimes.
4) Measuring and Exploiting Confirmation Bias in LLM-Assisted Security Code Review
- Quantifies how PR framing (“bug-free”) can cause 16.2–93.5pp drops in vulnerability detection TPR across models.
- Demonstrates real exploitability: 35.3% bypass vs Copilot and 88.2% vs Claude Code in their tested setups; iterative refinement increases success.
- Shows mitigations (ignore metadata/redaction) can largely restore detection (reported 100% recovery in interactive cases; ~94% in autonomous).
- Skepticism: evaluated on selected models and controlled environments; high baseline false positives complicate operational interpretation.
5) ZEBRAARENA: A Diagnostic Simulation Environment for Studying Reasoning-Action Coupling in Tool-Augmented LLMs
- Provides a procedural, contamination-resistant environment with a theoretical minimum query count K⋆ and rich efficiency diagnostics.
- Shows even strong models can be highly inefficient (GPT-5 near-perfect accuracy but 70–270% more tool calls than K⋆).
- Surfaces “budget anxiety” where more budget doesn’t reliably improve accuracy.
- Skepticism: idealized logic-puzzle setting; transfer to noisy real tools remains to be established.
5) Practical next steps
- For VLM products: implement counterfactual input probes (blind/noise/conflict) and track LAD/VNS/CS-like signals to distinguish “can’t see” vs “won’t say.”
- Add grounding audits for any citation/marker-based safety UX: run occlusion-style causal checks to ensure citations/markers actually control outputs (not just formatting).
- In agent stacks, treat prompt assembly as a security boundary: adopt provenance tagging + priority enforcement (PCFI-like) and log segment lineage for incident response.
- For code-review agents: strip/normalize PR metadata or explicitly instruct “ignore metadata” in reviewer prompts; measure detection under adversarial “bug-free” framing as a regression test.
- For cloud-planned agents handling private state: prototype a typed digital twin + capability catalog + gatekeeper (PlanTwin-like) and add disclosure budgets to prevent multi-turn fingerprinting.
- For RLHF from logs: evaluate whether your feedback is missing-not-at-random; try propensity + noise correction (CausalRM-style) before collecting more labels.
- For tool-augmented agents: report efficiency metrics (queries vs K⋆, redundancy ratios, token cost) alongside accuracy; use these to tune budget policies and reduce “budget anxiety” (a reporting sketch follows this list).
- For GUI/long-horizon RL: consider evidence-chain critics (milestones + verification + audit) and track critic precision/recall, not just policy success.
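For the efficiency-reporting item above, a minimal report might look like the following sketch; the field names and the K⋆ value are placeholders for whatever your environment actually defines.

```python
from dataclasses import dataclass

@dataclass
class EpisodeStats:
    queries: int          # tool calls actually issued
    unique_queries: int   # distinct calls, to expose retries/repeats
    k_star: int           # theoretical minimum query count for the task
    tokens: int
    correct: bool

def efficiency_report(episodes: list[EpisodeStats]) -> dict:
    """Report accuracy next to cost, not instead of it."""
    n = len(episodes)
    successes = sum(e.correct for e in episodes)
    return {
        "accuracy": successes / n,
        # mean fractional overhead vs the theoretical optimum K*
        "query_overhead": sum(e.queries / e.k_star for e in episodes) / n - 1.0,
        # share of calls that repeat an earlier call
        "redundancy_ratio": sum(
            (e.queries - e.unique_queries) / max(e.queries, 1) for e in episodes
        ) / n,
        "tokens_per_success": sum(e.tokens for e in episodes) / max(1, successes),
    }
```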
Generated from per-paper analyses; no external browsing.
