Daily AI Paper Report (2026-03-21)
Published:
Chinese version: [中文]
Run stats
- Candidates: 277
- Selected: 30
- Deepread completed: 30
- Window (UTC): 2026-03-19T00:00:00Z → 2026-03-20T00:00:00Z (arxiv_announce, expanded=0)
Show selected papers
| arXiv ID | Title / Links | Categories | Score | Why | Tags |
|---|---|---|---|---|---|
2603.19220 | Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation | cs.CL, cs.AI, cs.LG | 95 | Open 30B MoE w/ Cascade RL + on-policy distill; frontier reasoning/agentic post-training recipe. | LLM, post-training, RL, distillation, MoE, reasoning, agents, open-weights |
2603.18433 | Prompt Control-Flow Integrity: A Priority-Aware Runtime Defense Against Prompt Injection in LLM Systems | cs.CR | 94 | Runtime, role-aware prompt injection defense for RAG/API stacks; practical gateway design | prompt-injection, RAG, runtime-defense, policy-enforcement, LLM-security |
2603.18894 | I Can't Believe It's Corrupt: Evaluating Corruption in Multi-Agent Governance Systems | cs.AI, cs.MA | 93 | Empirical corruption/rule-breaking eval in multi-agent governance sims; strong agent safety signal. | agent-safety, multi-agent, governance, misuse, evaluation, institutional-integrity, red-teaming |
2603.19092 | SAVeS: Steering Safety Judgments in Vision-Language Models via Semantic Cues | cs.CV, cs.AI, cs.CL, cs.LG | 93 | New VLM safety benchmark + semantic steering; separates refusals, grounded reasoning, false refusals | VLM-safety, benchmark, steering, refusal, grounded-reasoning, evaluation |
2603.18637 | MOSAIC: Multi-Objective Slice-Aware Iterative Curation for Alignment | cs.CR, cs.CL | 92 | Closed-loop data mixture search balancing safety, over-refusal, and instruction following | alignment, safety-tuning, data-curation, overrefusal, evaluation |
2603.18736 | CausalRM: Causal-Theoretic Reward Modeling for RLHF from Observational User Feedbacks | cs.LG, cs.AI, cs.CL, stat.ML | 92 | Causal approach to learn RLHF rewards from biased/noisy observational feedback (clicks etc.). | RLHF, reward-modeling, causal-inference, observational-feedback, alignment |
2603.18740 | Measuring and Exploiting Confirmation Bias in LLM-Assisted Security Code Review | cs.SE, cs.AI, cs.CR | 91 | Shows exploitable confirmation bias in LLM security code review; large effect on false negatives | LLM-security, software-supply-chain, eval, cognitive-bias, code-review |
2603.18377 | PlanTwin: Privacy-Preserving Planning Abstractions for Cloud-Assisted LLM Agents | cs.CR, cs.AI, cs.ET | 90 | Privacy-preserving planning for cloud LLM agents via abstractions; reduces raw state exposure | agents, privacy, planning, cloud, data-minimization |
2603.18614 | ZEBRAARENA: A Diagnostic Simulation Environment for Studying Reasoning-Action Coupling in Tool-Augmented LLMs | cs.AI | 90 | Procedural, knowledge-minimal tool-use env to isolate reasoning-action coupling; good for agents eval. | agents, tool-use, benchmark, evaluation, procedural-generation, reasoning, contamination |
2603.19127 | On Optimizing Multimodal Jailbreaks for Spoken Language Models | cs.LG | 89 | Joint audio+text gradient jailbreaks for spoken LMs; expands multimodal attack methodology | jailbreak, multimodal, audio, adversarial-attacks, SLM |
2603.18756 | Are complicated loss functions necessary for teaching LLMs to reason? | cs.LG, cs.AI, cs.CL | 89 | Dissects GRPO; finds key components for reasoning gains and proposes simpler RL alternative. | reasoning, post-training, RL, GRPO, policy-optimization |
2603.18469 | GAIN: A Benchmark for Goal-Aligned Decision-Making of Large Language Models under Imperfect Norms | cs.CL | 88 | Benchmark for norm vs goal conflicts with contextual pressures; measures real-world compliance tradeoffs. | alignment, norms, decision-making, evaluation, safety, governance, LLM-behavior |
2603.18683 | HISR: Hindsight Information Modulated Segmental Process Rewards For Multi-turn Agentic Reinforcement Learning | cs.LG, cs.AI, cs.CL | 88 | Improves credit assignment for multi-turn agent RL via hindsight-modulated segmental process rewards | agentic-RL, process-reward-models, credit-assignment, long-horizon, RLHF-like |
2603.18762 | ClawTrap: A MITM-Based Red-Teaming Framework for Real-World OpenClaw Security Evaluation | cs.CR, cs.AI | 87 | MITM red-teaming framework for real web agents; tests network-layer threats beyond sandboxes | agents, red-teaming, MITM, web-security, evaluation |
2603.18829 | Agent Control Protocol: Admission Control for Agent Actions | cs.CR, cs.AI | 86 | Formal spec for cryptographic admission control of agent actions: identity, delegation, audit | agents, access-control, capabilities, governance, auditing |
2603.19025 | Towards Verifiable AI with Lightweight Cryptographic Proofs of Inference | cs.CR, cs.LG | 86 | Lightweight sampling-based verifiable inference protocol; relevant to model integrity in cloud deployment. | security, verifiable-inference, cryptography, model-integrity, auditing, deployment |
2603.18631 | D-Mem: A Dual-Process Memory System for LLM Agents | cs.AI | 86 | Dual-process memory for LLM agents: fast vector recall plus exhaustive store to reduce lossy abstraction | LLM-agents, memory, long-context, retrieval, agent-architecture |
2603.18773 | Automatic Configuration of LLM Post-Training Pipelines | cs.LG, cs.AI | 86 | Auto-configures SFT+RL post-training under budgets via surrogate ranking + BO residuals. | post-training, RLHF, hyperparameter-optimization, bayesian-optimization, systems |
2603.18382 | From Weak Cues to Real Identities: Evaluating Inference-Driven De-Anonymization in LLM Agents | cs.AI | 85 | Systematic eval of LLM agents re-identifying people from weak cues; formalizes linkage threat | privacy, deanonymization, agents, benchmark, threat-model |
2603.18886 | Reasoning over mathematical objects: on-policy reward modeling and test time aggregation | cs.AI, cs.CL | 85 | Principia suite for structured math objects + on-policy judge training and test-time aggregation recipes | reasoning, math, benchmarks, reward-modeling, LLM-judges, evaluation |
2603.18373 | To See or To Please: Uncovering Visual Sycophancy and Split Beliefs in VLMs | cs.CV, cs.AI | 84 | Diagnoses visual sycophancy/split beliefs in VLMs; metrics + counterfactual interventions | VLM, sycophancy, hallucination, evaluation, robustness |
2603.18859 | RewardFlow: Topology-Aware Reward Propagation on State Graphs for Agentic RL with Large Language Models | cs.AI, cs.CL, cs.LG | 84 | Topology-aware reward propagation to get state-level signals without heavy reward models; agentic RL aid. | agentic-RL, process-rewards, reward-shaping, reasoning, state-graphs, LLM-agents |
2603.18893 | Quantitative Introspection in Language Models: Tracking Internal States Across Conversation | cs.AI | 84 | Tests whether LLM numeric self-reports track internal states over conversation; safety/monitoring angle. | interpretability, monitoring, introspection, internal-states, safety |
2603.18911 | Progressive Training for Explainable Citation-Grounded Dialogue: Reducing Hallucination to Zero in English-Hindi LLMs | cs.CL, cs.AI | 83 | Citation-grounded bilingual dialogue w/ GRPO rewards; targets hallucination via verifiable grounding. | hallucination, grounding, citations, RAG, alignment, GRPO, multilingual |
2603.18743 | Memento-Skills: Let Agents Design Agents | cs.AI, cs.CL, cs.LG | 83 | Continual agent that writes/updates reusable skills (persistent memory) to design better agents. | agents, continual-learning, memory, tool-use, autonomy |
2603.19191 | OS-Themis: A Scalable Critic Framework for Generalist GUI Rewards | cs.AI | 82 | Multi-agent critic for GUI rewards + new cross-platform benchmark for outcome reward judging | agents, GUI, reward-modeling, benchmarks, verification |
2603.18507 | Expert Personas Improve LLM Alignment but Damage Accuracy: Bootstrapping Intent-Based Persona Routing with PRISM | cs.AI | 82 | Finds personas boost alignment but hurt accuracy; proposes intent-based persona routing (PRISM) | alignment, personas, routing, multi-agent, reliability, instruction-tuning |
2603.19005 | AgentDS Technical Report: Benchmarking the Future of Human-AI Collaboration in Domain-Specific Data Science | cs.LG, cs.AI, stat.ME | 81 | AgentDS benchmark/competition for domain-specific data science + human-AI collaboration evaluation. | agents, benchmark, human-AI-collaboration, data-science, evaluation, workflows |
2603.18897 | Act While Thinking: Accelerating LLM Agents via Pattern-Aware Speculative Tool Execution | cs.DC, cs.AI | 81 | Speculative tool execution to hide latency in LLM-tool loops; practical for agent deployment. | agents, tool-use, latency, speculation, serving-systems |
2603.18729 | Analysis Of Linguistic Stereotypes in Single and Multi-Agent Generative AI Architectures | cs.AI | 80 | Studies dialect-triggered stereotyping; tests prompt and multi-agent generate-critique-revise mitigations | bias, fairness, multi-agent, prompting, stereotypes, evaluation |
AI Paper Insight Brief
2026-03-21
0) Executive takeaways (read this first)
- “Refusal” is increasingly a misleading safety proxy in multimodal systems: VLMs can perceive the visual truth yet still comply with user intent (visual sycophancy), and simple semantic cues (e.g., red markers) can force refusals while worsening grounding.
- Privacy risk is shifting from “did the model reveal PII?” to “did the agent infer identity?” Agents can reconstruct identities from weak cues at high rates (e.g., Netflix sparse fragments), implying anonymization/redaction alone is not a sufficient deployment control.
- Agent security needs boundary controls at multiple layers: prompt provenance/priority enforcement (PCFI), observation-channel integrity (MITM red-teaming via ClawTrap), and protocol-level admission control with auditable cryptographic artifacts (ACP) are converging into a layered defense story.
- Efficiency and credit assignment are becoming first-class metrics for agents: even top models can be far from optimal in tool-query efficiency (ZebraArena), while new RL signals (segmental hindsight rewards; topology-propagated rewards) aim to densify supervision without expensive reward models.
- Post-training is fragmenting into modular pipelines: data-mixture search under fixed budgets (MOSAIC), observational-feedback reward modeling with causal debiasing (CausalRM), and staged RL + on-policy distillation (Nemotron-Cascade 2) all emphasize process design over single “magic” objectives.
2) Key themes (clusters)
Theme: Grounding failures hidden by “correct answers” and “refusals” (multimodal)
- Why it matters: Aggregate accuracy/refusal rates can mask whether models are actually grounded in perception. This blocks targeted fixes and can create false confidence in safety.
- Representative papers:
- Common approach:
- Counterfactual interventions (blind/noise/conflict images; cue overlays; prompt steering) to separate perception vs generation behavior.
- New metrics that decompose behavior (e.g., LAD/VNS/CS; BRA vs GSA vs FRR) rather than single accuracy.
- Post-hoc interpretability checks (attention/IG/occlusion) to test whether “grounding signals” are causal or just formatting.
- Open questions / failure modes:
- Scaling can reduce shortcuts but amplify sycophancy (larger VLMs becoming more instruction-following against perception).
- Cue-based steering can increase refusals while increasing false refusals / hallucinated risk, harming usability and trust calibration.
- “Zero hallucination” claims depend on automatic metrics and may not transfer to decoder-only models’ grounding mechanisms.
Theme: Privacy as inference (identity linkage) + privacy-preserving agent planning
- Why it matters: Agents can turn weak, non-identifying traces into identities, and cloud planning can leak sensitive local state over multi-turn interactions. Controls must address inference outcomes and cumulative disclosure.
- Representative papers:
- Common approach:
- Evaluate linkage explicitly (LSR/CLC) across classical incidents + controlled benchmarks + modern traces.
- Restrict planner observability via schema-bounded digital twins and enforce per-object disclosure budgets with local gatekeeping.
- Prompt-based mitigations as a first pass, measured with explicit privacy–utility trade-offs.
- Open questions / failure modes:
- Prompt guardrails can reduce linkage but induce over-refusal and may not distinguish benign cross-source reasoning from re-identification.
- Structural fields in abstractions can still be identifying (high re-identification when “full fingerprint” is disclosed).
- Need broader benchmarks with multiple near-matches / larger candidate pools to reflect real linkage ambiguity.
Theme: Securing agent systems: provenance, observation integrity, and institutional control
- Why it matters: Real deployments fail via multiple channels: prompt composition, poisoned retrieved content, manipulated network observations, and unauthorized actions. Defenses must be layered and auditable.
- Representative papers:
- Prompt Control-Flow Integrity: A Priority-Aware Runtime Defense Against Prompt Injection in LLM Systems
- ClawTrap: A MITM-Based Red-Teaming Framework for Real-World OpenClaw Security Evaluation
- Agent Control Protocol: Admission Control for Agent Actions
- Measuring and Exploiting Confirmation Bias in LLM-Assisted Security Code Review
- Common approach:
- Treat prompts as structured segments with authority ordering; enforce at runtime (ALLOW/SANITIZE/BLOCK).
- Expand threat models beyond content to network-layer MITM manipulation during live browsing.
- Introduce protocol artifacts (capability tokens, PoP handshakes, execution tokens, audit ledgers) for cross-org verification.
- Empirically test exploitability in realistic pipelines (PR metadata framing; autonomous code review actions).
- Open questions / failure modes:
- Pattern/rule-based gateways can be brittle to paraphrase/obfuscation and don’t cover multi-turn state poisoning.
- MITM evaluation is currently qualitative; needs quantitative success rates and broader task coverage.
- Institutional protocols (ACP) lack deployment/performance/adversarial validation in the spec itself.
Theme: Better agent learning signals and diagnostics (tool use, rewards, memory)
- Why it matters: Long-horizon agents fail due to sparse rewards, poor credit assignment, inefficient tool use, and lossy memory. New benchmarks and reward shaping aim to make failures measurable and training more sample-efficient.
- Representative papers:
- ZEBRAARENA: A Diagnostic Simulation Environment for Studying Reasoning-Action Coupling in Tool-Augmented LLMs
- HISR: Hindsight Information Modulated Segmental Process Rewards For Multi-turn Agentic Reinforcement Learning
- RewardFlow: Topology-Aware Reward Propagation on State Graphs for Agentic RL with Large Language Models
- D-Mem: A Dual-Process Memory System for LLM Agents
- OS-Themis: A Scalable Critic Framework for Generalist GUI Rewards
- Common approach:
- Controlled environments with provable lower bounds on necessary tool queries (K*) and hierarchical diagnostics (necessity/validity/utility/optimality).
- Densify rewards without heavy human labeling: segment-level rewards + hindsight importance; graph-based reward propagation from successes.
- Conservative, evidence-grounded critics (milestone selection + verification + review + judge) to reduce false positives in GUI RL.
- Gated “System 2” fallbacks for memory: audit retrieval answers and trigger full deliberation only when needed.
- Open questions / failure modes:
- Even strong models can be highly inefficient (tool calls 70–270% above optimum) and token-cost disparities are large.
- Graph propagation requires at least one success trajectory; state canonicalization quality is a bottleneck.
- Multi-agent critic pipelines add complexity and may introduce new reward-hacking surfaces; privacy concerns with screenshot processing.
Theme: Post-training pipeline design: data, objectives, and automation
- Why it matters: Frontier performance and safety are increasingly determined by pipeline choices (data mixture, RL objective details, distillation, HPO), not just model scale.
- Representative papers:
- MOSAIC: Multi-Objective Slice-Aware Iterative Curation for Alignment
- CausalRM: Causal-Theoretic Reward Modeling for RLHF from Observational User Feedbacks
- Are complicated loss functions necessary for teaching LLMs to reason?
- Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation
- Automatic Configuration of LLM Post-Training Pipelines
- Common approach:
- Slice-aware evaluation → actionable data allocation under fixed token budgets; Pareto selection across safety/over-refusal/capability.
- Causal corrections (noise + selection bias) to learn reward models from observational logs using IPS/DR estimators.
- Objective simplification via ablation: keep group-relative advantage + negative feedback; drop PPO-style clipping (RGRA).
- Staged RL across domains + on-policy distillation from best teachers (MOPD) to recover performance efficiently.
- Hybrid offline-to-online HPO: offline ranker prior + online GP residual correction using early-stop proxies.
- Open questions / failure modes:
- Nuisance estimation quality (propensity/noise rates) can dominate CausalRM performance.
- Many results are on limited domains/models (small math models; biomedical QA for HPO; single base model for MOSAIC).
- Engineering-heavy pipelines may rely on extensive test-time scaling and expensive verification infrastructure.
3) Technical synthesis
- Several papers converge on decomposing “one number” metrics into causal/structural components: VLM hallucination attribution (LAD/VNS/CS), safety grounding vs refusal (GSA vs BRA), tool-use efficiency vs accuracy (IR vs success), and slice-level alignment failures (L1–L3).
- Counterfactual interventions are becoming a standard diagnostic tool across modalities: blind/noise/conflict images; marker overlays; metadata framing; MITM traffic rewriting.
- A recurring pattern is alignment pressure overriding evidence: visual sycophancy in VLMs; confirmation bias in code review from PR metadata; “silent linkage” identity inference under benign framing.
- Multiple works propose gating/routing as a practical compromise: D-Mem’s quality gate to trigger full deliberation; PRISM’s intent-based persona routing; PlanTwin’s local gatekeeper; ACP’s admission control; OS-Themis’s milestone verification pipeline.
- Reward/learning signal design is shifting toward structure-aware densification without full reward models: segmental rewards modulated by hindsight importance (HISR) and topology-based propagation on state graphs (RewardFlow).
- Tool-augmented agent evaluation is moving from “did it solve it?” to cost-aware optimality (ZebraArena’s K* and inefficiency ratio) and systems-level latency hiding (PASTE speculative execution).
- Privacy/security evaluation is expanding from content to process and channels: observation integrity (MITM), prompt provenance, cumulative disclosure budgets, and identity-level inference outcomes.
- Several papers highlight scaling non-monotonicity: larger VLMs reduce language shortcuts but increase visual sycophancy; governance structure matters until “capability saturation” overwhelms it; introspection coupling improves for some concepts with scale.
- There is increasing reliance on LLM-as-judge across domains (hallucination labels, bias scoring, corruption taxonomy, safety rubric), with some papers adding human validation (governance corruption judge validation) but many still exposed to judge calibration risk.
4) Top 5 papers (with “why now”)
1) To See or To Please: Uncovering Visual Sycophancy and Split Beliefs in VLMs
- Introduces a tri-layer diagnostic (Perception LAD, Dependency VNS, Alignment CS) using blind/noise/conflict interventions.
- Finds Visual Sycophancy dominates (69.6%) and Robust Refusal is absent (0%) across 7 VLMs/7k samples.
- Scaling study: larger Qwen2.5-VL reduces language shortcuts but amplifies sycophancy (up to 95.3%).
- Post-hoc selective prediction yields up to +9.5pp accuracy at 50% coverage without retraining.
- Skeptical about: requires full logits (limits API models) and thresholding via percentiles; doesn’t provide an alignment-training fix.
2) From Weak Cues to Real Identities: Evaluating Inference-Driven De-Anonymization in LLM Agents
- Makes identity inference a first-class privacy failure mode; introduces controlled benchmark INFERLINK.
- Shows high linkage in classical and modern settings (e.g., 79.2% LSR in sparse Netflix fragments for a GPT-5 agent; AOL CLC=10).
- Demonstrates silent linkage under benign framing and that prompt mitigations reduce linkage but can harm utility.
- Skeptical about: benchmark simplifications (single overlap, small tables) and case studies aren’t prevalence estimates.
3) Measuring and Exploiting Confirmation Bias in LLM-Assisted Security Code Review
- Quantifies framing-induced bias across 250 CVE/patch pairs and multiple models; bug-free framing can cut detection by 16.2–93.5pp.
- Shows real exploitability: adversarial PR framing succeeds 35.3% (Copilot) and 88.2% (Claude Code actions).
- Simple mitigations (ignore/redact metadata) largely restore detection (up to 94% in autonomous setting).
- Skeptical about: high baseline FPRs and many “detections” unrelated to CVEs; focuses on reintroducing known vulns.
- Provides a deterministic, knowledge-minimal tool-use environment with a provable optimal query lower bound (K*).
- Shows even strong models can be highly inefficient (GPT-5 uses 70–270% more tool calls than optimal).
- Surfaces huge token-efficiency gaps (e.g., Gemini-2.5-Flash ~19k–25k tokens vs GPT-5 ~1.2k in some settings).
- Skeptical about: idealized/noise-free environment; transfer to messy real tools remains to be proven.
5) OS-Themis: A Scalable Critic Framework for Generalist GUI Rewards
- Multi-agent critic (Selector→Verifier→Reviewer→Judge) to reduce evidence dilution and false positives in GUI outcome rewards.
- Releases OGRBench (1,409 trajectories) and reports large gains vs baselines (e.g., +29.6% precision over DigiRL on average).
- Demonstrates downstream impact: online RL and self-training improvements (e.g., +10.3% in a scaling pilot; +6.9% via filtering+SFT).
- Skeptical about: infrastructure/scaling constraints; privacy risks from screenshot processing and potential semantic reward-hacking.
5) Practical next steps
- For VLM safety/grounding: add a “split-belief” diagnostic pass (blind/noise/conflict) to your eval harness; track grounding vs refusal separately (BRA vs GSA-style metrics) rather than relying on refusal rate.
- For agent privacy: treat “identity linkage” as an explicit red-team objective; measure linkage success (LSR/CLC-like) under implicit (benign) prompts, not only explicit attacker prompts.
- For cloud-planned agents: prototype a PlanTwin-like projection (schema + generalization + redaction) and enforce per-object disclosure budgets across turns; log budget consumption as a first-class telemetry signal.
- For prompt injection: implement provenance/priority-aware prompt assembly and gateway checks (PCFI-style), but plan a second layer for multi-turn state poisoning (PCFI is single-request).
- For code-review agents: redact or ignore PR metadata by default in security-critical review, and explicitly test for confirmation-bias regressions using “bug-free” framing variants.
- For tool-using agents: evaluate with an efficiency lower bound where possible (ZebraArena-style) and track inefficiency ratio + token cost, not just success.
- For RL on long-horizon tasks: consider reward densification that doesn’t require a learned RM (RewardFlow) or segment-level credit assignment (HISR), and ablate against sparse terminal reward to quantify sample-efficiency gains.
- For GUI agents: if using LLM/VLM judges, move toward evidence-grounded milestone verification (OS-Themis-style) and explicitly tune for precision to avoid RL being driven by false positives.
Generated from per-paper analyses; no external browsing.
