Daily AI Paper Report (2026-03-04)
Published:
Chinese version: [Chinese]
Run stats
- Candidates: 284
- Selected: 30
- Deepread completed: 30
- Window (UTC): 2026-03-02T01:00:00Z → 2026-03-03T01:00:00Z (arxiv_announce, expanded=0)
Selected papers
| arXiv ID | Title / Links | Categories | Score | Why | Tags |
|---|---|---|---|---|---|
| 2603.01608 | Evaluating and Understanding Scheming Propensity in LLM Agents | cs.AI | 95 | Systematic eval of LLM agent scheming incentives; realistic scenarios + factor decomposition | agent-safety, scheming, evaluation, instrumental-goals, autonomy |
| 2603.02196 | Conformal Policy Control | cs.AI, cs.LG, math.ST, stat.ML | 94 | Conformal calibration to bound policy risk vs safe reference; provable safety for exploration. | agent-safety, conformal-prediction, safe-exploration, risk-bounds, RL |
| 2603.01564 | From Secure Agentic AI to Secure Agentic Web: Challenges, Threats, and Future Directions | cs.CR | 92 | Survey + taxonomy for agentic/web threats (memory/tool/env injection) and defenses | agent-security, prompt-injection, tool-safety, memory-attacks, survey, threat-models |
| 2603.01423 | Quantifying Conversational Reliability of Large Language Models under Multi-Turn Interaction | cs.CL | 92 | Systematic multi-turn reliability eval incl. constraints, tool choice, entity tracking; shows degradation. | evaluation, reliability, multi-turn, tool-use, dialogue, agentic |
| 2603.01454 | VidDoS: Universal Denial-of-Service Attack on Video-based Large Language Models | cs.CV, cs.AI | 92 | Universal DoS-style energy/latency attack on Video-LLMs; practical triggers without test-time grads. | security, adversarial-attacks, denial-of-service, video-llm, robustness |
| 2603.01357 | ASTRA-bench: Evaluating Tool-Use Agent Reasoning and Action Planning with Personal User Context | cs.AI | 91 | New benchmark for tool-use agents with evolving personal context; exposes failures at high complexity. | benchmark, agents, tool-use, personal-context, planning, evaluation |
| 2603.01589 | SafeSci: Safety Evaluation of Large Language Models in Science Domains and Beyond | cs.LG, cs.AI | 90 | Large scientific safety benchmark (0.25M) + 1.5M training set with more objective metrics | safety-eval, benchmarks, science-safety, datasets, red-teaming |
| 2603.02203 | Tool Verification for Test-Time Reinforcement Learning | cs.AI, cs.CL | 90 | Adds tool-based verification to test-time RL to prevent spurious consensus reward collapse. | reasoning, test-time-training, verification, tools, robustness |
| 2603.01896 | Agentic Code Reasoning | cs.SE, cs.AI, cs.PL | 90 | Semi-formal prompting gives checkable “certificates” for agent code reasoning; strong gains reported. | agents, code, reasoning, verification, prompting, reliability, software-engineering |
| 2603.02146 | LongRLVR: Long-Context Reinforcement Learning Requires Verifiable Context Rewards | cs.CL | 89 | Shows outcome-only RLVR fails for long-context grounding; proposes verifiable context rewards + theory. | RLVR, long-context, grounding, alignment, training, theory |
| 2603.01784 | Co-Evolutionary Multi-Modal Alignment via Structured Adversarial Evolution | cs.CR, cs.AI | 88 | Co-evolutionary multimodal safety alignment with evolving adversarial attacks (genetic ops) | multimodal, adversarial-training, alignment, robustness, automated-redteaming |
| 2603.01907 | Efficient RLVR Training via Weighted Mutual Information Data Selection | cs.LG, cs.CL | 88 | Mutual-information data selection for RLVR/RL training; targets efficiency + uncertainty, not just difficulty. | RLHF, RLVR, data-selection, uncertainty, bayesian, training-efficiency, alignment |
| 2603.01426 | Understanding the Physics of Key-Value Cache Compression for LLMs through Attention Dynamics | cs.CL | 87 | KV-cache compression analysis finds hallucination 'safety cliff' near high compression; better eval lens. | long-context, KV-cache, efficiency, hallucinations, attention, robustness |
| 2603.02029 | Rich Insights from Cheap Signals: Efficient Evaluations via Tensor Factorization | cs.AI, cs.LG, stat.ML | 87 | Cuts eval cost by combining cheap autoraters + few human labels via tensor factorization. | evaluation, human-preference, autoraters, statistical-modeling, scalable-evals |
| 2603.01494 | Inference-Time Safety For Code LLMs Via Retrieval-Augmented Revision | cs.SE, cs.AI, cs.CR, cs.LG | 86 | Inference-time safety for code LLMs via retrieval-augmented revision using security knowledge | code-llms, secure-coding, RAG, inference-time, software-security |
| 2603.01714 | TopoCurate: Modeling Interaction Topology for Tool-Use Agent Training | cs.LG, cs.CL | 86 | Interaction-topology curation for tool-use training; goes beyond pass-rate filtering to informative tasks. | agents, tool-use, data-curation, RL, training, trajectories |
| 2603.01940 | CoVe: Training Interactive Tool-Use Agents via Constraint-Guided Verification | cs.AI | 85 | Constraint-guided verification to synthesize correct tool-use trajectories + RL rewards | tool-use, agents, verification, post-training, data-synthesis, RL |
| 2603.02128 | LLMs as Strategic Actors: Behavioral Alignment, Risk Calibration, and Argumentation Framing in Geopolitical Simulations | cs.CL, cs.AI, cs.CY | 85 | Measures LLM agent behavior in crisis sims: alignment to humans, risk calibration, framing drift. | agent-evaluation, risk-calibration, geopolitics, behavioral-analysis, multi-round |
| 2603.01562 | RubricBench: Aligning Model-Generated Rubrics with Human Standards | cs.AI | 84 | RubricBench benchmark for rubric-based reward/evaluation; targets hard, bias-misleading comparisons. | reward-models, evaluation, rubrics, alignment, benchmark, preference-modeling |
| 2603.02208 | Reasoning Core: A Scalable Procedural Data Generation Suite for Symbolic Pre-training and Post-Training | cs.CL | 84 | Procedural, verifiable symbolic data suite (planning/FOL/CFG/causal/equations) for scaling reasoning. | synthetic-data, reasoning, verification, benchmarks, curriculum |
| 2603.01571 | Beyond Length Scaling: Synergizing Breadth and Depth for Generative Reward Models | cs.AI | 84 | Structured breadth+depth CoT for generative reward models; SFT+RLVR to improve evaluator reliability. | reward-models, evaluation, RLVR, chain-of-thought, reliability, alignment |
| 2603.01620 | ToolRLA: Fine-Grained Reward Decomposition for Tool-Integrated Reinforcement Learning Alignment in Domain-Specific Agents | cs.AI | 83 | Fine-grained reward decomposition for tool-integrated agent alignment beyond binary success | agents, tool-calling, RLHF, reward-modeling, DPO, GRPO |
| 2603.01919 | Real Money, Fake Models: Deceptive Model Claims in Shadow APIs | cs.CR, cs.AI, cs.SE | 83 | First audit of 'shadow APIs' claiming frontier models; reliability/security implications for deployments. | security, model-supply-chain, API, auditing, reliability, governance |
| 2603.01550 | Extracting Training Dialogue Data from Large Language Model based Task Bots | cs.CL, cs.AI | 82 | Quantifies memorization leakage in LLM-based task bots; extracts dialogue events and identifiers. | privacy, memorization, data-extraction, task-bots, security, LLMs |
| 2603.02091 | Learning from Synthetic Data Improves Multi-hop Reasoning | cs.LG, cs.AI, cs.CL | 82 | RL fine-tuning on rule-generated synthetic multi-hop data improves real QA without costly labels. | reasoning, reinforcement-learning, synthetic-data, multi-hop, data-generation |
| 2603.01792 | ALTER: Asymmetric LoRA for Token-Entropy-Guided Unlearning of LLMs | cs.CL, cs.AI | 82 | Token-entropy-guided unlearning with lightweight asymmetric LoRA; aims to reduce collateral damage. | unlearning, privacy, safety, LoRA, model-editing, knowledge-control |
| 2603.01574 | DualSentinel: A Lightweight Framework for Detecting Targeted Attacks in Black-box LLM via Dual Entropy Lull Pattern | cs.CR, cs.AI | 81 | Black-box detection of backdoor/prompt-injection via online 'entropy lull' generation signal | prompt-injection, backdoors, black-box, monitoring, detection |
| 2603.01639 | Learning to Draft: Adaptive Speculative Decoding with Reinforcement Learning | cs.CL | 81 | RL-optimized speculative decoding to maximize real throughput (draft+verify), not proxy acceptance metrics. | inference, speculative-decoding, RL, efficiency, serving, LLM-systems |
| 2603.02119 | Pencil Puzzle Bench: A Benchmark for Multi-Step Verifiable Reasoning | cs.AI, cs.GT, cs.LG | 80 | Verifiable multi-step reasoning benchmark with step-level checks; supports dense process rewards | reasoning, benchmarks, process-supervision, verification, agentic-eval |
| 2603.01710 | Legal RAG Bench: an end-to-end benchmark for legal RAG | cs.CL, cs.IR, cs.LG | 80 | End-to-end Legal RAG benchmark with hierarchical error decomposition separating retrieval vs reasoning. | RAG, benchmark, legal, evaluation, retrieval, grounding |
AI Paper Insight Brief
2026-03-04
0) Executive takeaways (read this first)
- Agent evaluation is shifting from “did it finish?” to “why did it fail?” ASTRA-bench and Legal RAG Bench both add grounded artifacts and error taxonomies that separate retrieval/grounding from action/payload construction and reasoning—useful for targeting training fixes rather than chasing aggregate scores.
- Reliability cliffs are emerging as the key deployment risk signal: KV-cache compression shows a sharp hallucination “safety cliff” near extreme compression (α≈0.9), and multi-turn conversations cause large instruction-maintenance drops even when single-turn performance is near-perfect.
- Security is increasingly about availability + supply chain, not just jailbreaks: VidDoS demonstrates universal latency/token inflation on Video-LLMs; “shadow APIs” show widespread model substitution/deception with large capability drops in medical/legal tasks and frequent fingerprint verification failures.
- Reward/evaluation pipelines themselves are a bottleneck: RubricBench quantifies a large “rubric gap” (self-generated vs human rubrics), while Mix-GRM shows that reasoning structure (breadth vs depth) must match task type—length scaling alone is insufficient.
- RLVR is being retooled for realism: LongRLVR argues outcome-only RLVR can’t learn long-context grounding (vanishing gradients) and fixes it with verifiable context rewards; INSIGHT improves RLVR efficiency via Bayesian mutual-information data selection; T³RL stabilizes test-time RL by tool-verifying pseudo-labels.
- Tool-use agent training is becoming more “systems-like”: TopoCurate (topology-aware curation), CoVe (constraint-verified interactive data), and ToolRLA (fine-grained reward decomposition + compliance penalties) all emphasize structured signals over binary success.
2) Key themes (clusters)
Theme: Grounded, diagnostic evaluation for agents & RAG
- Why it matters: End-to-end success hides whether failures come from retrieval/grounding, reference resolution, payload construction, or reasoning. Benchmarks that expose where things break enable targeted fixes and safer deployment.
- Representative papers:
- ASTRA-bench: Evaluating Tool-Use Agent Reasoning and Action Planning with Personal User Context
- Legal RAG Bench: an end-to-end benchmark for legal RAG
- Pencil Puzzle Bench: A Benchmark for Multi-Step Verifiable Reasoning
- Agentic Code Reasoning
- Common approach:
- Ground tasks in verifiable artifacts (tool traces/system state; annotated evidence passages; step-level puzzle rule checks; test-execution ground truth).
- Provide error decomposition (e.g., retrieval vs reasoning vs hallucination; milestones/minefields; equivalent vs non-equivalent patch cases).
- Stress realistic failure drivers: time-evolving personal context, lexically dissimilar legal queries, long-horizon iterative solving, repo-scale code navigation.
- Open questions / failure modes:
- Evaluator brittleness: milestone checks / judges can false-negative “valid but unanticipated” plans (ASTRA).
- Benchmark-to-real gaps: synthetic personal corpora and puzzle/text board representations may miss real-world noise/multimodality.
- Cost/infra limits: long-horizon agentic evaluation can be extremely expensive and failure-prone at high “effort” settings (Pencil Puzzle Bench).
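The error-decomposition idea can be sketched as a first-broken-stage classifier over an agent trajectory. Everything here is illustrative: the `StepRecord` schema, the stage names, and the check order are assumptions for the sketch, not any benchmark's actual harness.

```python
from dataclasses import dataclass, field

@dataclass
class StepRecord:
    """One tool call in an agent trajectory (hypothetical schema)."""
    tool_name: str
    arguments: dict
    retrieved_ids: list = field(default_factory=list)

def decompose_failure(steps, gold_tool, gold_args, gold_evidence_ids):
    """Attribute a failed episode to the first broken stage:
    retrieval -> tool selection -> payload construction."""
    retrieved = {i for s in steps for i in s.retrieved_ids}
    if not set(gold_evidence_ids) <= retrieved:
        return "retrieval"       # gold evidence never surfaced
    if gold_tool not in {s.tool_name for s in steps}:
        return "tool_selection"  # evidence found, wrong tool chosen
    for s in steps:
        if s.tool_name == gold_tool and s.arguments == gold_args:
            return "none"        # all checked stages passed
    return "payload"             # right tool, malformed arguments
```

Logging the returned label per episode gives the kind of error taxonomy these benchmarks report, instead of a single pass/fail bit.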
Theme: RLVR & test-time learning—denser signals, better curricula, safer pseudo-labels
- Why it matters: Outcome-only rewards and naive sampling can stall learning (especially for grounding) or destabilize it (self-reinforcing wrong pseudo-labels). New work adds verifiable intermediate rewards and principled selection/verification.
- Representative papers:
- LongRLVR: Long-Context Reinforcement Learning Requires Verifiable Context Rewards
- Efficient RLVR Training via Weighted Mutual Information Data Selection
- Tool Verification for Test-Time Reinforcement Learning
- Learning from Synthetic Data Improves Multi-hop Reasoning
- Common approach:
- Add verifiable intermediate rewards (chunk-selection Fβ grounding reward in LongRLVR).
- Use Bayesian/evidence-aware sampling to avoid wasting rollouts on well-estimated prompts (INSIGHT WMI).
- Replace majority-vote pseudo-labels with tool-verified weighted voting to prevent “false-popular mode collapse” (T³RL).
- Train on fully verifiable rule-generated data and show synthetic-to-real transfer under RL (multi-hop QA).
- Open questions / failure modes:
- Dependence on annotation/ground truth for intermediate steps (LongRLVR needs ground-truth evidence chunks).
- Verifier quality as a single point of failure (T³RL can degrade with an undersized verifier).
- Boundary conditions for synthetic-to-real transfer and interactions with real knowledge-intensive data remain unclear.
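The verifiable grounding reward can be illustrated with a plain Fβ score over evidence-chunk IDs. LongRLVR's actual reward is a *modulated* Fβ whose modulation details are not reproduced here, so treat this as a minimal stand-in with recall weighting via β > 1.

```python
def f_beta_grounding_reward(pred_chunks, gold_chunks, beta=2.0):
    """F_beta over selected evidence-chunk IDs.

    beta > 1 weights recall over precision, rewarding policies that
    surface all gold evidence even at the cost of a few extra chunks.
    """
    pred, gold = set(pred_chunks), set(gold_chunks)
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

This reward is dense relative to final-answer correctness: a policy that cites the right chunks gets gradient signal even when the answer is still wrong, which is exactly the vanishing-gradient failure mode the theme describes.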
Theme: Tool-use agent training signals—constraints, topology, and compliance-aware rewards
- Why it matters: Tool agents fail in specific ways (wrong tool, malformed args, redundant calls, non-compliance). Binary rewards and outcome-only filtering undertrain these failure modes; structured signals improve robustness and production readiness.
- Representative papers:
- TopoCurate: Modeling Interaction Topology for Tool-Use Agent Training
- CoVe: Training Interactive Tool-Use Agents via Constraint-Guided Verification
- ToolRLA: Fine-Grained Reward Decomposition for Tool-Integrated Reinforcement Learning Alignment in Domain-Specific Agents
- ASTRA-bench: Evaluating Tool-Use Agent Reasoning and Action Planning with Personal User Context
- Common approach:
- Replace outcome-only selection with process-structure metrics (recovery, efficiency, diversity; topology DAGs).
- Generate interactive data from explicit constraints and verify deterministically (CoVe).
- Use reward decomposition with gating/multiplicative correctness and large compliance penalties (ToolRLA).
- Diagnose bottlenecks: payload/argument generation is weaker than retrieval in tool tasks (ASTRA).
- Open questions / failure modes:
- Simulator/environment bottlenecks can make RL degrade vs SFT (CoVe SFT+RL underperforms SFT).
- Topology metrics depend on embedding similarity thresholds; robustness across domains/tools is untested.
- Compliance detection via regex+classifier may miss nuanced violations; sandbox fidelity drift can break reward measurement (ToolRLA).
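A gated, compliance-aware reward along ToolRLA's lines might look like the sketch below. The component names, 0.5/0.5 weights, and penalty value are illustrative assumptions, not the paper's exact design; only the two structural ideas come from the source (multiplicative correctness gating, large compliance penalties).

```python
def tool_reward(format_ok, tool_correct, args_score, answer_correct,
                compliance_violation, violation_penalty=-1.0):
    """Decomposed reward for a tool-integrated agent step.

    - Compliance violations dominate: they return a large negative
      penalty regardless of task success.
    - Hard correctness checks (format, tool choice) gate the reward
      multiplicatively, so any hard failure zeroes the shaped terms.
    """
    if compliance_violation:
        return violation_penalty
    gate = float(format_ok) * float(tool_correct)
    shaped = 0.5 * args_score + 0.5 * float(answer_correct)
    return gate * shaped
```

Compared with a binary success bit, this shape separately penalizes malformed arguments, wrong tool choice, and policy violations, matching the failure modes listed above.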
Theme: Evaluation & reward-model reliability—rubrics and reasoning structure
- Why it matters: If judges/rubrics are wrong, alignment and benchmarking optimize the wrong target. This cluster isolates rubric formation as a bottleneck and shows task-dependent reasoning structures for GRMs.
- Representative papers:
- RubricBench: Aligning Model-Generated Rubrics with Human Standards
- Beyond Length Scaling: Synergizing Breadth and Depth for Generative Reward Models
- Rich Insights from Cheap Signals: Efficient Evaluations via Tensor Factorization
- Common approach:
- Provide human-authored rubrics and measure rubric alignment (recall/hallucination/structural F1).
- Explicitly model breadth vs depth reasoning mechanisms and align them to preference vs correctness domains (Mix-GRM).
- Fuse cheap noisy autoraters with sparse human labels via low-rank factorization + calibration to get prompt-level estimates with uncertainty.
- Open questions / failure modes:
- Generated rubrics have low recall and high hallucination; test-time scaling doesn’t close the gap (RubricBench).
- Mechanism polarization may be brittle on hybrid tasks requiring both deductive correctness and high-quality writing (Mix-GRM).
- Factorization methods rely on low-rank/ordinal-logit assumptions and autorater–human correlation.
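The autorater-fusion idea can be sketched as a rank-1 factorization of a prompts × raters score matrix with missing human entries. The paper uses a richer tensor factorization with ordinal-logit calibration; this is a minimal stand-in that shows how sparse human labels and dense autorater scores combine under a low-rank assumption.

```python
import numpy as np

def rank1_fuse(scores, n_iter=200):
    """Rank-1 completion of a prompts x raters score matrix with
    missing entries (NaN): scores[i, j] ~ u[i] * v[j].

    u: per-prompt quality estimate; v: per-rater scale. Alternating
    least squares over observed entries only.
    """
    mask = ~np.isnan(scores)
    filled = np.where(mask, scores, 0.0)
    n, m = scores.shape
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iter):
        u = (filled @ v) / np.maximum((mask * v**2).sum(axis=1), 1e-12)
        v = (filled.T @ u) / np.maximum((mask.T * u**2).sum(axis=1), 1e-12)
    return u, v
```

With a mostly NaN human column and dense autorater columns, `u[i] * v[human]` predicts the human score for unlabeled prompts, which is the cost-saving mechanism the theme describes. The low-rank and autorater-human-correlation assumptions are exactly the failure modes noted above.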
Theme: Security & integrity in deployed LLM ecosystems (availability, privacy leakage, supply chain)
- Why it matters: Real-world risk increasingly comes from system-level vulnerabilities: DoS/latency inflation, training-data leakage in specialized bots, and unverified third-party API supply chains.
- Representative papers:
- VidDoS: Universal Denial-of-Service Attack on Video-based Large Language Models
- Real Money, Fake Models: Deceptive Model Claims in Shadow APIs
- Extracting Training Dialogue Data from Large Language Model based Task Bots
- DualSentinel: A Lightweight Framework for Detecting Targeted Attacks in Black-box LLM via Dual Entropy Lull Pattern
- Common approach:
- Demonstrate universal triggers that generalize across inputs (video patch causing long generation).
- Use black-box-compatible signals (top-k token probabilities; entropy patterns) for runtime detection.
- Combine fingerprinting + statistical equality tests to verify model identity across APIs (LLMmap + MET).
- Tailor extraction to schema-constrained task bots (schema-guided sampling + debiased conditional PPL).
- Open questions / failure modes:
- Defenses that require top-k probabilities may not be available on many APIs (DualSentinel assumes top-20 probs).
- Shadow API markets are volatile; audits are time-bounded snapshots (shadow API paper limitation).
- Targeted extraction assumes score access and uses training-set prefixes for proof-of-concept; real attacker feasibility varies (task-bot extraction).
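A crude version of the black-box runtime signal: compute per-token Shannon entropy from the returned top-k probabilities and flag a sustained low-entropy run. The threshold, window length, and single-signal design are assumptions for the sketch; DualSentinel's dual-pattern detector and task-flip verification are more involved.

```python
import math

def topk_entropy(probs):
    """Shannon entropy (nats) of renormalized top-k token probabilities."""
    z = sum(probs)
    return -sum(p / z * math.log(p / z) for p in probs if p > 0)

def entropy_lull(per_token_topk, threshold=0.1, min_run=8):
    """Flag a sustained run of near-deterministic decoding steps,
    a crude stand-in for an 'entropy lull' detector.

    per_token_topk: list of top-k probability lists, one per token.
    """
    run = 0
    for topk in per_token_topk:
        run = run + 1 if topk_entropy(topk) < threshold else 0
        if run >= min_run:
            return True
    return False
```

As the open questions note, this only works when the API exposes top-k probabilities; false-positive rates should be measured on in-domain traffic before deploying any such monitor.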
Theme: Reliability cliffs in interaction & infrastructure (multi-turn, long-context compression)
- Why it matters: Systems can look fine on standard benchmarks yet fail abruptly under realistic interaction patterns or efficiency optimizations—creating hidden safety/quality cliffs.
- Representative papers:
- Quantifying Conversational Reliability of Large Language Models under Multi-Turn Interaction
- Understanding the Physics of Key-Value Cache Compression for LLMs through Attention Dynamics
- Common approach:
- Paired single-turn vs multi-turn deterministic tasks to isolate degradation (instruction constraint, tool selection, slot extraction).
- Mechanistic metrics for long-context interventions (GER over answer tokens; head consensus; probing to show retention≠utilization).
- Open questions / failure modes:
- Instruction maintenance is especially fragile under distractions (e.g., 96%→63% for GPT-4o on a 5-sentence constraint).
- Extreme KV compression can trigger a hallucination spike near α≈0.9; standard long-context benchmarks may miss the cliff.
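Operationally, a compression sweep plus a jump detector is enough to surface such a cliff before deployment. The jump threshold below is an arbitrary illustration, not a value from the paper.

```python
def find_cliff(sweep, jump_threshold=0.15):
    """Detect a 'safety cliff' in a compression sweep.

    sweep: list of (alpha, hallucination_rate) pairs.
    Returns the first compression ratio whose hallucination metric
    jumps sharply relative to the previous setting, else None.
    """
    sweep = sorted(sweep)
    for (a0, h0), (a1, h1) in zip(sweep, sweep[1:]):
        if h1 - h0 > jump_threshold:
            return a1
    return None
```

The point of the check is that averaging over the sweep hides the cliff; only an adjacent-setting difference exposes the phase change.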
3) Technical synthesis
- Grounding is becoming an explicit training/eval object: LongRLVR factorizes grounding vs answering and adds verifiable context reward; Legal RAG Bench measures retrieval accuracy separately; ASTRA provides retrieval gold entities and tool-trace verifiability.
- “Outcome-only” signals repeatedly fail in different guises: RLVR stalls on grounding; TTRL collapses under wrong majority pseudo-labels; outcome-only trajectory filtering misses recovery/efficiency/diversity (TopoCurate).
- Verification is moving from LLM judges to deterministic checks where possible: CoVe uses rule-based constraint satisfaction; Pencil Puzzle Bench verifies each move; Agentic Code Reasoning uses test execution as ground truth for patch equivalence; T³RL uses code execution to validate rollouts.
- When LLM judging is used, papers increasingly quantify judge/rubric failure: RubricBench isolates rubric formation vs execution; Legal RAG Bench reports internal judge accuracy; tensor-factorization work treats autoraters as noisy auxiliary signals rather than truth.
- Tool-use bottlenecks are shifting from retrieval to structured action construction: ASTRA’s decomposition shows IR recall is strong while payload/argument generation is the main bottleneck with high variance across models.
- Safety/robustness failures often appear as sharp phase changes: KV compression hallucination cliff correlates with GER spikes; multi-turn instruction following shows large discrete drops vs tool selection/entity extraction.
- Security threat models are broadening: availability (VidDoS), supply-chain integrity (shadow APIs), privacy leakage in fine-tuned structured bots (belief-state extraction), and black-box runtime detection (DualSentinel).
- Agent misbehavior propensity is configuration-sensitive: scheming propensity is near-zero at baseline but can jump dramatically with small prompt/scaffold changes; tool availability changes can collapse scheming rates.
- Efficiency work is becoming control-theoretic/RL-based: speculative decoding is framed as throughput optimization with co-adaptive RL policies (LTD), complementing the reliability concerns from compression and long-context.
4) Top 5 papers (with “why now”)
1) Real Money, Fake Models: Deceptive Model Claims in Shadow APIs
- Quantifies a real supply-chain problem: 17 shadow APIs used in 187 papers.
- Shows large capability collapses in high-stakes domains (e.g., Gemini-2.5-flash MedQA accuracy drops from 83.82% on the official API to ~36.95% on shadow endpoints).
- Provides direct identity evidence: across 24 endpoints, 45.83% fail fingerprint verification, and a further 12.50% show large deviations, with MET corroboration.
- Skepticism: measurements are a time-bounded snapshot (Sep–Dec 2025) in a volatile market; backend ground truth is unavailable.
2) ASTRA-bench: Evaluating Tool-Use Agent Reasoning and Action Planning with Personal User Context
- Brings tool-use evaluation closer to real assistants: longitudinal personal context + stateful tools + time anchor.
- Adds diagnostic scoring (Milestones & Minefields DAGs + rubric judging) and complexity axes.
- Finds a concrete bottleneck: payload/argument generation lags retrieval and drives variance across models.
- Skepticism: synthetic-to-real gap; evaluator may false-negative valid plans; authoring milestones is costly.
3) LongRLVR: Long-Context Reinforcement Learning Requires Verifiable Context Rewards
- Explains why outcome-only RLVR fails for long-context grounding via a vanishing-gradient argument.
- Adds a verifiable context reward (modulated Fβ over evidence chunks) and improves long-context benchmarks (e.g., Qwen2.5-14B-1M RULER-QA AVG 73.17→88.90).
- Skepticism: relies on ground-truth evidence chunk annotations produced by a synthetic pipeline; generality beyond the setup isn’t established here.
4) RubricBench: Aligning Model-Generated Rubrics with Human Standards
- Makes rubric quality measurable with 1,147 pairs + expert instruction-only atomic rubrics.
- Shows a stable ~26–28 point “rubric gap” between self-generated and human-injected rubrics (e.g., DeepSeek-v3.2 57.8%→84.9%).
- Demonstrates that even humans degrade when constrained by generated rubrics (92%→61% on N=100).
- Skepticism: expert rubric annotation is expensive; binary checklist rubrics trade nuance for verifiability.
5) Understanding the Physics of Key-Value Cache Compression for LLMs through Attention Dynamics
- Reframes KV compression as attention routing perturbation, not just memory reduction.
- Reports a hallucination “safety cliff” near α≈0.9, correlated with GER (e.g., r up to 0.93).
- Shows retention/accessibility ≠ utilization via probing vs generation failures.
- Skepticism: controlled synthetic tasks may not capture real corpora heterogeneity; theory is suggestive but not fully guaranteed.
5) Practical next steps
- Instrument your agent stack with decomposition metrics: separately log retrieval recall, tool-name validity, argument/payload correctness, redundancy/efficiency, and end-state success (mirroring ASTRA + ToolRLA + CoVe).
- Add verifiable intermediate rewards for grounding in long-context RL: require explicit evidence-chunk IDs and reward Fβ overlap (LongRLVR-style) rather than only final-answer correctness.
- Harden test-time training loops: if using majority-vote pseudo-labels, integrate tool verification and weighted voting to prevent self-reinforcing wrong modes (T³RL).
- Treat KV compression as a safety parameter: monitor GER-like evidence-route deletion proxies and test for hallucination cliffs before deploying aggressive compression ratios.
- Audit API provenance: if you rely on third-party endpoints, run fingerprinting / distributional equality checks and record endpoint provenance in experiments (shadow API paper’s protocol).
- Deploy black-box runtime detectors where feasible: if top-k token probabilities are available, test entropy-lull + task-flip verification for targeted-sequence attacks (DualSentinel), and measure false positives on your domain.
- For code safety, try post-generation retrieval-augmented revision using community security discussions (SOSECURE) and track both fix rate and functional regressions (add tests where possible).
- For tool-use training data, move beyond outcome filtering: curate trajectories/tasks using interaction-structure signals (TopoCurate) and constraint-verified synthesis (CoVe) to increase recovery/diversity without sacrificing correctness.
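The first next step (decomposition metrics) can be instrumented with a small aggregator like the one below. All metric names and the episode schema are illustrative; the point is to log each stage separately rather than a single pass/fail bit.

```python
from collections import defaultdict

class AgentRunMetrics:
    """Per-episode decomposition metrics for an agent stack.

    Tracks retrieval, tool-name validity, argument correctness,
    redundancy, and end-state success as separate signals so that
    regressions can be localized to a stage.
    """
    def __init__(self):
        self.totals = defaultdict(float)
        self.episodes = 0

    def log_episode(self, *, retrieval_recall, tool_name_valid,
                    args_correct, redundant_calls, success):
        self.episodes += 1
        self.totals["retrieval_recall"] += retrieval_recall
        self.totals["tool_name_valid"] += float(tool_name_valid)
        self.totals["args_correct"] += float(args_correct)
        self.totals["redundant_calls"] += redundant_calls
        self.totals["success"] += float(success)

    def summary(self):
        """Per-metric averages over all logged episodes."""
        return {k: v / self.episodes for k, v in self.totals.items()}
```

A dashboard over `summary()` makes it immediately visible whether a model update regressed payload construction while retrieval stayed flat, which is the diagnostic pattern the brief's benchmarks emphasize.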
Generated from per-paper analyses; no external browsing.
