Daily AI Paper Report (2026-03-21)

Published: March 21, 2026

Chinese version: [中文]

Run stats

Candidates: 277
Selected: 30
Deepread completed: 30
Window (UTC): 2026-03-19T00:00:00Z → 2026-03-20T00:00:00Z (arxiv_announce, expanded=0)

Show selected papers

arXiv ID	Title / Links	Categories	Score	Why	Tags
`2603.19220`	Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation PDF	cs.CL, cs.AI, cs.LG	95	Open 30B MoE w/ Cascade RL + on-policy distill; frontier reasoning/agentic post-training recipe.	LLM, post-training, RL, distillation, MoE, reasoning, agents, open-weights
`2603.18433`	Prompt Control-Flow Integrity: A Priority-Aware Runtime Defense Against Prompt Injection in LLM Systems PDF	cs.CR	94	Runtime, role-aware prompt injection defense for RAG/API stacks; practical gateway design	prompt-injection, RAG, runtime-defense, policy-enforcement, LLM-security
`2603.18894`	I Can't Believe It's Corrupt: Evaluating Corruption in Multi-Agent Governance Systems PDF	cs.AI, cs.MA	93	Empirical corruption/rule-breaking eval in multi-agent governance sims; strong agent safety signal.	agent-safety, multi-agent, governance, misuse, evaluation, institutional-integrity, red-teaming
`2603.19092`	SAVeS: Steering Safety Judgments in Vision-Language Models via Semantic Cues PDF	cs.CV, cs.AI, cs.CL, cs.LG	93	New VLM safety benchmark + semantic steering; separates refusals, grounded reasoning, false refusals	VLM-safety, benchmark, steering, refusal, grounded-reasoning, evaluation
`2603.18637`	MOSAIC: Multi-Objective Slice-Aware Iterative Curation for Alignment PDF	cs.CR, cs.CL	92	Closed-loop data mixture search balancing safety, over-refusal, and instruction following	alignment, safety-tuning, data-curation, overrefusal, evaluation
`2603.18736`	CausalRM: Causal-Theoretic Reward Modeling for RLHF from Observational User Feedbacks PDF	cs.LG, cs.AI, cs.CL, stat.ML	92	Causal approach to learn RLHF rewards from biased/noisy observational feedback (clicks etc.).	RLHF, reward-modeling, causal-inference, observational-feedback, alignment
`2603.18740`	Measuring and Exploiting Confirmation Bias in LLM-Assisted Security Code Review PDF	cs.SE, cs.AI, cs.CR	91	Shows exploitable confirmation bias in LLM security code review; large effect on false negatives	LLM-security, software-supply-chain, eval, cognitive-bias, code-review
`2603.18377`	PlanTwin: Privacy-Preserving Planning Abstractions for Cloud-Assisted LLM Agents PDF	cs.CR, cs.AI, cs.ET	90	Privacy-preserving planning for cloud LLM agents via abstractions; reduces raw state exposure	agents, privacy, planning, cloud, data-minimization
`2603.18614`	ZEBRAARENA: A Diagnostic Simulation Environment for Studying Reasoning-Action Coupling in Tool-Augmented LLMs PDF	cs.AI	90	Procedural, knowledge-minimal tool-use env to isolate reasoning-action coupling; good for agents eval.	agents, tool-use, benchmark, evaluation, procedural-generation, reasoning, contamination
`2603.19127`	On Optimizing Multimodal Jailbreaks for Spoken Language Models PDF	cs.LG	89	Joint audio+text gradient jailbreaks for spoken LMs; expands multimodal attack methodology	jailbreak, multimodal, audio, adversarial-attacks, SLM
`2603.18756`	Are complicated loss functions necessary for teaching LLMs to reason? PDF	cs.LG, cs.AI, cs.CL	89	Dissects GRPO; finds key components for reasoning gains and proposes simpler RL alternative.	reasoning, post-training, RL, GRPO, policy-optimization
`2603.18469`	GAIN: A Benchmark for Goal-Aligned Decision-Making of Large Language Models under Imperfect Norms PDF	cs.CL	88	Benchmark for norm vs goal conflicts with contextual pressures; measures real-world compliance tradeoffs.	alignment, norms, decision-making, evaluation, safety, governance, LLM-behavior
`2603.18683`	HISR: Hindsight Information Modulated Segmental Process Rewards For Multi-turn Agentic Reinforcement Learning PDF	cs.LG, cs.AI, cs.CL	88	Improves credit assignment for multi-turn agent RL via hindsight-modulated segmental process rewards	agentic-RL, process-reward-models, credit-assignment, long-horizon, RLHF-like
`2603.18762`	ClawTrap: A MITM-Based Red-Teaming Framework for Real-World OpenClaw Security Evaluation PDF	cs.CR, cs.AI	87	MITM red-teaming framework for real web agents; tests network-layer threats beyond sandboxes	agents, red-teaming, MITM, web-security, evaluation
`2603.18829`	Agent Control Protocol: Admission Control for Agent Actions PDF	cs.CR, cs.AI	86	Formal spec for cryptographic admission control of agent actions: identity, delegation, audit	agents, access-control, capabilities, governance, auditing
`2603.19025`	Towards Verifiable AI with Lightweight Cryptographic Proofs of Inference PDF	cs.CR, cs.LG	86	Lightweight sampling-based verifiable inference protocol; relevant to model integrity in cloud deployment.	security, verifiable-inference, cryptography, model-integrity, auditing, deployment
`2603.18631`	D-Mem: A Dual-Process Memory System for LLM Agents PDF	cs.AI	86	Dual-process memory for LLM agents: fast vector recall plus exhaustive store to reduce lossy abstraction	LLM-agents, memory, long-context, retrieval, agent-architecture
`2603.18773`	Automatic Configuration of LLM Post-Training Pipelines PDF	cs.LG, cs.AI	86	Auto-configures SFT+RL post-training under budgets via surrogate ranking + BO residuals.	post-training, RLHF, hyperparameter-optimization, bayesian-optimization, systems
`2603.18382`	From Weak Cues to Real Identities: Evaluating Inference-Driven De-Anonymization in LLM Agents PDF	cs.AI	85	Systematic eval of LLM agents re-identifying people from weak cues; formalizes linkage threat	privacy, deanonymization, agents, benchmark, threat-model
`2603.18886`	Reasoning over mathematical objects: on-policy reward modeling and test time aggregation PDF	cs.AI, cs.CL	85	Principia suite for structured math objects + on-policy judge training and test-time aggregation recipes	reasoning, math, benchmarks, reward-modeling, LLM-judges, evaluation
`2603.18373`	To See or To Please: Uncovering Visual Sycophancy and Split Beliefs in VLMs PDF	cs.CV, cs.AI	84	Diagnoses visual sycophancy/split beliefs in VLMs; metrics + counterfactual interventions	VLM, sycophancy, hallucination, evaluation, robustness
`2603.18859`	RewardFlow: Topology-Aware Reward Propagation on State Graphs for Agentic RL with Large Language Models PDF	cs.AI, cs.CL, cs.LG	84	Topology-aware reward propagation to get state-level signals without heavy reward models; agentic RL aid.	agentic-RL, process-rewards, reward-shaping, reasoning, state-graphs, LLM-agents
`2603.18893`	Quantitative Introspection in Language Models: Tracking Internal States Across Conversation PDF	cs.AI	84	Tests whether LLM numeric self-reports track internal states over conversation; safety/monitoring angle.	interpretability, monitoring, introspection, internal-states, safety
`2603.18911`	Progressive Training for Explainable Citation-Grounded Dialogue: Reducing Hallucination to Zero in English-Hindi LLMs PDF	cs.CL, cs.AI	83	Citation-grounded bilingual dialogue w/ GRPO rewards; targets hallucination via verifiable grounding.	hallucination, grounding, citations, RAG, alignment, GRPO, multilingual
`2603.18743`	Memento-Skills: Let Agents Design Agents PDF	cs.AI, cs.CL, cs.LG	83	Continual agent that writes/updates reusable skills (persistent memory) to design better agents.	agents, continual-learning, memory, tool-use, autonomy
`2603.19191`	OS-Themis: A Scalable Critic Framework for Generalist GUI Rewards PDF	cs.AI	82	Multi-agent critic for GUI rewards + new cross-platform benchmark for outcome reward judging	agents, GUI, reward-modeling, benchmarks, verification
`2603.18507`	Expert Personas Improve LLM Alignment but Damage Accuracy: Bootstrapping Intent-Based Persona Routing with PRISM PDF	cs.AI	82	Finds personas boost alignment but hurt accuracy; proposes intent-based persona routing (PRISM)	alignment, personas, routing, multi-agent, reliability, instruction-tuning
`2603.19005`	AgentDS Technical Report: Benchmarking the Future of Human-AI Collaboration in Domain-Specific Data Science PDF	cs.LG, cs.AI, stat.ME	81	AgentDS benchmark/competition for domain-specific data science + human-AI collaboration evaluation.	agents, benchmark, human-AI-collaboration, data-science, evaluation, workflows
`2603.18897`	Act While Thinking: Accelerating LLM Agents via Pattern-Aware Speculative Tool Execution PDF	cs.DC, cs.AI	81	Speculative tool execution to hide latency in LLM-tool loops; practical for agent deployment.	agents, tool-use, latency, speculation, serving-systems
`2603.18729`	Analysis Of Linguistic Stereotypes in Single and Multi-Agent Generative AI Architectures PDF	cs.AI	80	Studies dialect-triggered stereotyping; tests prompt and multi-agent generate-critique-revise mitigations	bias, fairness, multi-agent, prompting, stereotypes, evaluation

AI Paper Insight Brief

2026-03-21

0) Executive takeaways (read this first)

“Refusal” is increasingly a misleading safety proxy in multimodal systems: VLMs can perceive the visual truth yet still comply with user intent (visual sycophancy), and simple semantic cues (e.g., red markers) can force refusals while worsening grounding.
Privacy risk is shifting from “did the model reveal PII?” to “did the agent infer identity?” Agents can reconstruct identities from weak cues at high rates (e.g., Netflix sparse fragments), implying anonymization/redaction alone is not a sufficient deployment control.
Agent security needs boundary controls at multiple layers: prompt provenance/priority enforcement (PCFI), observation-channel integrity (MITM red-teaming via ClawTrap), and protocol-level admission control with auditable cryptographic artifacts (ACP) are converging into a layered defense story.
Efficiency and credit assignment are becoming first-class metrics for agents: even top models can be far from optimal in tool-query efficiency (ZebraArena), while new RL signals (segmental hindsight rewards; topology-propagated rewards) aim to densify supervision without expensive reward models.
Post-training is fragmenting into modular pipelines: data-mixture search under fixed budgets (MOSAIC), observational-feedback reward modeling with causal debiasing (CausalRM), and staged RL + on-policy distillation (Nemotron-Cascade 2) all emphasize process design over single “magic” objectives.

2) Key themes (clusters)

Theme: Grounding failures hidden by “correct answers” and “refusals” (multimodal)

Why it matters: Aggregate accuracy/refusal rates can mask whether models are actually grounded in perception. This blocks targeted fixes and can create false confidence in safety.
Representative papers:
Common approach:
- Counterfactual interventions (blind/noise/conflict images; cue overlays; prompt steering) to separate perception vs generation behavior.
- New metrics that decompose behavior (e.g., LAD/VNS/CS; BRA vs GSA vs FRR) rather than single accuracy.
- Post-hoc interpretability checks (attention/IG/occlusion) to test whether “grounding signals” are causal or just formatting.
Open questions / failure modes:
- Scaling can reduce shortcuts but amplify sycophancy (larger VLMs becoming more instruction-following against perception).
- Cue-based steering can increase refusals while increasing false refusals / hallucinated risk, harming usability and trust calibration.
- “Zero hallucination” claims depend on automatic metrics and may not transfer to decoder-only models’ grounding mechanisms.

Theme: Privacy as inference (identity linkage) + privacy-preserving agent planning

Why it matters: Agents can turn weak, non-identifying traces into identities, and cloud planning can leak sensitive local state over multi-turn interactions. Controls must address inference outcomes and cumulative disclosure.
Representative papers:
- From Weak Cues to Real Identities: Evaluating Inference-Driven De-Anonymization in LLM Agents
- PlanTwin: Privacy-Preserving Planning Abstractions for Cloud-Assisted LLM Agents
Common approach:
- Evaluate linkage explicitly (LSR/CLC) across classical incidents + controlled benchmarks + modern traces.
- Restrict planner observability via schema-bounded digital twins and enforce per-object disclosure budgets with local gatekeeping.
- Prompt-based mitigations as a first pass, measured with explicit privacy–utility trade-offs.
Open questions / failure modes:
- Prompt guardrails can reduce linkage but induce over-refusal and may not distinguish benign cross-source reasoning from re-identification.
- Structural fields in abstractions can still be identifying (high re-identification when “full fingerprint” is disclosed).
- Need broader benchmarks with multiple near-matches / larger candidate pools to reflect real linkage ambiguity.

Theme: Securing agent systems: provenance, observation integrity, and institutional control

Why it matters: Real deployments fail via multiple channels: prompt composition, poisoned retrieved content, manipulated network observations, and unauthorized actions. Defenses must be layered and auditable.
Representative papers:
Common approach:
- Treat prompts as structured segments with authority ordering; enforce at runtime (ALLOW/SANITIZE/BLOCK).
- Expand threat models beyond content to network-layer MITM manipulation during live browsing.
- Introduce protocol artifacts (capability tokens, PoP handshakes, execution tokens, audit ledgers) for cross-org verification.
- Empirically test exploitability in realistic pipelines (PR metadata framing; autonomous code review actions).
Open questions / failure modes:
- Pattern/rule-based gateways can be brittle to paraphrase/obfuscation and don’t cover multi-turn state poisoning.
- MITM evaluation is currently qualitative; needs quantitative success rates and broader task coverage.
- Institutional protocols (ACP) lack deployment/performance/adversarial validation in the spec itself.

Theme: Better agent learning signals and diagnostics (tool use, rewards, memory)

Why it matters: Long-horizon agents fail due to sparse rewards, poor credit assignment, inefficient tool use, and lossy memory. New benchmarks and reward shaping aim to make failures measurable and training more sample-efficient.
Representative papers:
Common approach:
- Controlled environments with provable lower bounds on necessary tool queries (K*) and hierarchical diagnostics (necessity/validity/utility/optimality).
- Densify rewards without heavy human labeling: segment-level rewards + hindsight importance; graph-based reward propagation from successes.
- Conservative, evidence-grounded critics (milestone selection + verification + review + judge) to reduce false positives in GUI RL.
- Gated “System 2” fallbacks for memory: audit retrieval answers and trigger full deliberation only when needed.
Open questions / failure modes:
- Even strong models can be highly inefficient (tool calls 70–270% above optimum) and token-cost disparities are large.
- Graph propagation requires at least one success trajectory; state canonicalization quality is a bottleneck.
- Multi-agent critic pipelines add complexity and may introduce new reward-hacking surfaces; privacy concerns with screenshot processing.

Theme: Post-training pipeline design: data, objectives, and automation

Why it matters: Frontier performance and safety are increasingly determined by pipeline choices (data mixture, RL objective details, distillation, HPO), not just model scale.
Representative papers:
Common approach:
- Slice-aware evaluation → actionable data allocation under fixed token budgets; Pareto selection across safety/over-refusal/capability.
- Causal corrections (noise + selection bias) to learn reward models from observational logs using IPS/DR estimators.
- Objective simplification via ablation: keep group-relative advantage + negative feedback; drop PPO-style clipping (RGRA).
- Staged RL across domains + on-policy distillation from best teachers (MOPD) to recover performance efficiently.
- Hybrid offline-to-online HPO: offline ranker prior + online GP residual correction using early-stop proxies.
Open questions / failure modes:
- Nuisance estimation quality (propensity/noise rates) can dominate CausalRM performance.
- Many results are on limited domains/models (small math models; biomedical QA for HPO; single base model for MOSAIC).
- Engineering-heavy pipelines may rely on extensive test-time scaling and expensive verification infrastructure.

3) Technical synthesis

Several papers converge on decomposing “one number” metrics into causal/structural components: VLM hallucination attribution (LAD/VNS/CS), safety grounding vs refusal (GSA vs BRA), tool-use efficiency vs accuracy (IR vs success), and slice-level alignment failures (L1–L3).
Counterfactual interventions are becoming a standard diagnostic tool across modalities: blind/noise/conflict images; marker overlays; metadata framing; MITM traffic rewriting.
A recurring pattern is alignment pressure overriding evidence: visual sycophancy in VLMs; confirmation bias in code review from PR metadata; “silent linkage” identity inference under benign framing.
Multiple works propose gating/routing as a practical compromise: D-Mem’s quality gate to trigger full deliberation; PRISM’s intent-based persona routing; PlanTwin’s local gatekeeper; ACP’s admission control; OS-Themis’s milestone verification pipeline.
Reward/learning signal design is shifting toward structure-aware densification without full reward models: segmental rewards modulated by hindsight importance (HISR) and topology-based propagation on state graphs (RewardFlow).
Tool-augmented agent evaluation is moving from “did it solve it?” to cost-aware optimality (ZebraArena’s K* and inefficiency ratio) and systems-level latency hiding (PASTE speculative execution).
Privacy/security evaluation is expanding from content to process and channels: observation integrity (MITM), prompt provenance, cumulative disclosure budgets, and identity-level inference outcomes.
Several papers highlight scaling non-monotonicity: larger VLMs reduce language shortcuts but increase visual sycophancy; governance structure matters until “capability saturation” overwhelms it; introspection coupling improves for some concepts with scale.
There is increasing reliance on LLM-as-judge across domains (hallucination labels, bias scoring, corruption taxonomy, safety rubric), with some papers adding human validation (governance corruption judge validation) but many still exposed to judge calibration risk.

4) Top 5 papers (with “why now”)

1) To See or To Please: Uncovering Visual Sycophancy and Split Beliefs in VLMs

Introduces a tri-layer diagnostic (Perception LAD, Dependency VNS, Alignment CS) using blind/noise/conflict interventions.
Finds Visual Sycophancy dominates (69.6%) and Robust Refusal is absent (0%) across 7 VLMs/7k samples.
Scaling study: larger Qwen2.5-VL reduces language shortcuts but amplifies sycophancy (up to 95.3%).
Post-hoc selective prediction yields up to +9.5pp accuracy at 50% coverage without retraining.
Skeptical about: requires full logits (limits API models) and thresholding via percentiles; doesn’t provide an alignment-training fix.

2) From Weak Cues to Real Identities: Evaluating Inference-Driven De-Anonymization in LLM Agents

Makes identity inference a first-class privacy failure mode; introduces controlled benchmark INFERLINK.
Shows high linkage in classical and modern settings (e.g., 79.2% LSR in sparse Netflix fragments for a GPT-5 agent; AOL CLC=10).
Demonstrates silent linkage under benign framing and that prompt mitigations reduce linkage but can harm utility.
Skeptical about: benchmark simplifications (single overlap, small tables) and case studies aren’t prevalence estimates.

3) Measuring and Exploiting Confirmation Bias in LLM-Assisted Security Code Review

Quantifies framing-induced bias across 250 CVE/patch pairs and multiple models; bug-free framing can cut detection by 16.2–93.5pp.
Shows real exploitability: adversarial PR framing succeeds 35.3% (Copilot) and 88.2% (Claude Code actions).
Simple mitigations (ignore/redact metadata) largely restore detection (up to 94% in autonomous setting).
Skeptical about: high baseline FPRs and many “detections” unrelated to CVEs; focuses on reintroducing known vulns.

4) ZEBRAARENA: A Diagnostic Simulation Environment for Studying Reasoning-Action Coupling in Tool-Augmented LLMs

Provides a deterministic, knowledge-minimal tool-use environment with a provable optimal query lower bound (K*).
Shows even strong models can be highly inefficient (GPT-5 uses 70–270% more tool calls than optimal).
Surfaces huge token-efficiency gaps (e.g., Gemini-2.5-Flash ~19k–25k tokens vs GPT-5 ~1.2k in some settings).
Skeptical about: idealized/noise-free environment; transfer to messy real tools remains to be proven.

5) OS-Themis: A Scalable Critic Framework for Generalist GUI Rewards

Multi-agent critic (Selector→Verifier→Reviewer→Judge) to reduce evidence dilution and false positives in GUI outcome rewards.
Releases OGRBench (1,409 trajectories) and reports large gains vs baselines (e.g., +29.6% precision over DigiRL on average).
Demonstrates downstream impact: online RL and self-training improvements (e.g., +10.3% in a scaling pilot; +6.9% via filtering+SFT).
Skeptical about: infrastructure/scaling constraints; privacy risks from screenshot processing and potential semantic reward-hacking.

5) Practical next steps

For VLM safety/grounding: add a “split-belief” diagnostic pass (blind/noise/conflict) to your eval harness; track grounding vs refusal separately (BRA vs GSA-style metrics) rather than relying on refusal rate.
For agent privacy: treat “identity linkage” as an explicit red-team objective; measure linkage success (LSR/CLC-like) under implicit (benign) prompts, not only explicit attacker prompts.
For cloud-planned agents: prototype a PlanTwin-like projection (schema + generalization + redaction) and enforce per-object disclosure budgets across turns; log budget consumption as a first-class telemetry signal.
For prompt injection: implement provenance/priority-aware prompt assembly and gateway checks (PCFI-style), but plan a second layer for multi-turn state poisoning (PCFI is single-request).
For code-review agents: redact or ignore PR metadata by default in security-critical review, and explicitly test for confirmation-bias regressions using “bug-free” framing variants.
For tool-using agents: evaluate with an efficiency lower bound where possible (ZebraArena-style) and track inefficiency ratio + token cost, not just success.
For RL on long-horizon tasks: consider reward densification that doesn’t require a learned RM (RewardFlow) or segment-level credit assignment (HISR), and ablate against sparse terminal reward to quantify sample-efficiency gains.
For GUI agents: if using LLM/VLM judges, move toward evidence-grounded milestone verification (OS-Themis-style) and explicitly tune for precision to avoid RL being driven by false positives.

Generated from per-paper analyses; no external browsing.

Di Tang

AI Paper Insight Brief

2026-03-21

0) Executive takeaways (read this first)

2) Key themes (clusters)

Theme: Grounding failures hidden by “correct answers” and “refusals” (multimodal)

Theme: Privacy as inference (identity linkage) + privacy-preserving agent planning

Theme: Securing agent systems: provenance, observation integrity, and institutional control

Theme: Better agent learning signals and diagnostics (tool use, rewards, memory)

Theme: Post-training pipeline design: data, objectives, and automation

3) Technical synthesis

4) Top 5 papers (with “why now”)

5) Practical next steps