Daily AI Paper Report (2026-03-04)
Published:
Chinese version: [Chinese]
Run stats
- Candidates: 284
- Selected: 30
- Deepread completed: 30
- Window (UTC): 2026-03-02T01:00:00Z → 2026-03-03T01:00:00Z (arxiv_announce, expanded=0)
Selected papers
| arXiv ID | Title / Links | Categories | Score | Why | Tags |
|---|---|---|---|---|---|
| 2603.01608 | Evaluating and Understanding Scheming Propensity in LLM Agents | cs.AI | 95 | Systematic eval of LLM agent scheming incentives; realistic scenarios + factor decomposition | agent-safety, scheming, evaluation, instrumental-goals, autonomy |
| 2603.02196 | Conformal Policy Control | cs.AI, cs.LG, math.ST, stat.ML | 94 | Conformal calibration to bound policy risk vs safe reference; provable safety for exploration. | agent-safety, conformal-prediction, safe-exploration, risk-bounds, RL |
| 2603.01564 | From Secure Agentic AI to Secure Agentic Web: Challenges, Threats, and Future Directions | cs.CR | 92 | Survey + taxonomy for agentic/web threats (memory/tool/env injection) and defenses | agent-security, prompt-injection, tool-safety, memory-attacks, survey, threat-models |
| 2603.01423 | Quantifying Conversational Reliability of Large Language Models under Multi-Turn Interaction | cs.CL | 92 | Systematic multi-turn reliability eval incl. constraints, tool choice, entity tracking; shows degradation. | evaluation, reliability, multi-turn, tool-use, dialogue, agentic |
| 2603.01454 | VidDoS: Universal Denial-of-Service Attack on Video-based Large Language Models | cs.CV, cs.AI | 92 | Universal DoS-style energy/latency attack on Video-LLMs; practical triggers without test-time grads. | security, adversarial-attacks, denial-of-service, video-llm, robustness |
| 2603.01357 | ASTRA-bench: Evaluating Tool-Use Agent Reasoning and Action Planning with Personal User Context | cs.AI | 91 | New benchmark for tool-use agents with evolving personal context; exposes failures at high complexity. | benchmark, agents, tool-use, personal-context, planning, evaluation |
| 2603.01589 | SafeSci: Safety Evaluation of Large Language Models in Science Domains and Beyond | cs.LG, cs.AI | 90 | Large scientific safety benchmark (0.25M) + 1.5M training set with more objective metrics | safety-eval, benchmarks, science-safety, datasets, red-teaming |
| 2603.02203 | Tool Verification for Test-Time Reinforcement Learning | cs.AI, cs.CL | 90 | Adds tool-based verification to test-time RL to prevent spurious consensus reward collapse. | reasoning, test-time-training, verification, tools, robustness |
| 2603.01896 | Agentic Code Reasoning | cs.SE, cs.AI, cs.PL | 90 | Semi-formal prompting gives checkable “certificates” for agent code reasoning; strong gains reported. | agents, code, reasoning, verification, prompting, reliability, software-engineering |
| 2603.02146 | LongRLVR: Long-Context Reinforcement Learning Requires Verifiable Context Rewards | cs.CL | 89 | Shows outcome-only RLVR fails for long-context grounding; proposes verifiable context rewards + theory. | RLVR, long-context, grounding, alignment, training, theory |
| 2603.01784 | Co-Evolutionary Multi-Modal Alignment via Structured Adversarial Evolution | cs.CR, cs.AI | 88 | Co-evolutionary multimodal safety alignment with evolving adversarial attacks (genetic ops) | multimodal, adversarial-training, alignment, robustness, automated-redteaming |
| 2603.01907 | Efficient RLVR Training via Weighted Mutual Information Data Selection | cs.LG, cs.CL | 88 | Mutual-information data selection for RLVR/RL training; targets efficiency + uncertainty, not just difficulty. | RLHF, RLVR, data-selection, uncertainty, bayesian, training-efficiency, alignment |
| 2603.01426 | Understanding the Physics of Key-Value Cache Compression for LLMs through Attention Dynamics | cs.CL | 87 | KV-cache compression analysis finds hallucination 'safety cliff' near high compression; better eval lens. | long-context, KV-cache, efficiency, hallucinations, attention, robustness |
| 2603.02029 | Rich Insights from Cheap Signals: Efficient Evaluations via Tensor Factorization | cs.AI, cs.LG, stat.ML | 87 | Cuts eval cost by combining cheap autoraters + few human labels via tensor factorization. | evaluation, human-preference, autoraters, statistical-modeling, scalable-evals |
| 2603.01494 | Inference-Time Safety For Code LLMs Via Retrieval-Augmented Revision | cs.SE, cs.AI, cs.CR, cs.LG | 86 | Inference-time safety for code LLMs via retrieval-augmented revision using security knowledge | code-llms, secure-coding, RAG, inference-time, software-security |
| 2603.01714 | TopoCurate: Modeling Interaction Topology for Tool-Use Agent Training | cs.LG, cs.CL | 86 | Interaction-topology curation for tool-use training; goes beyond pass-rate filtering to informative tasks. | agents, tool-use, data-curation, RL, training, trajectories |
| 2603.01940 | CoVe: Training Interactive Tool-Use Agents via Constraint-Guided Verification | cs.AI | 85 | Constraint-guided verification to synthesize correct tool-use trajectories + RL rewards | tool-use, agents, verification, post-training, data-synthesis, RL |
| 2603.02128 | LLMs as Strategic Actors: Behavioral Alignment, Risk Calibration, and Argumentation Framing in Geopolitical Simulations | cs.CL, cs.AI, cs.CY | 85 | Measures LLM agent behavior in crisis sims: alignment to humans, risk calibration, framing drift. | agent-evaluation, risk-calibration, geopolitics, behavioral-analysis, multi-round |
| 2603.01562 | RubricBench: Aligning Model-Generated Rubrics with Human Standards | cs.AI | 84 | RubricBench benchmark for rubric-based reward/evaluation; targets hard, bias-misleading comparisons. | reward-models, evaluation, rubrics, alignment, benchmark, preference-modeling |
| 2603.02208 | Reasoning Core: A Scalable Procedural Data Generation Suite for Symbolic Pre-training and Post-Training | cs.CL | 84 | Procedural, verifiable symbolic data suite (planning/FOL/CFG/causal/equations) for scaling reasoning. | synthetic-data, reasoning, verification, benchmarks, curriculum |
| 2603.01571 | Beyond Length Scaling: Synergizing Breadth and Depth for Generative Reward Models | cs.AI | 84 | Structured breadth+depth CoT for generative reward models; SFT+RLVR to improve evaluator reliability. | reward-models, evaluation, RLVR, chain-of-thought, reliability, alignment |
| 2603.01620 | ToolRLA: Fine-Grained Reward Decomposition for Tool-Integrated Reinforcement Learning Alignment in Domain-Specific Agents | cs.AI | 83 | Fine-grained reward decomposition for tool-integrated agent alignment beyond binary success | agents, tool-calling, RLHF, reward-modeling, DPO, GRPO |
| 2603.01919 | Real Money, Fake Models: Deceptive Model Claims in Shadow APIs | cs.CR, cs.AI, cs.SE | 83 | First audit of 'shadow APIs' claiming frontier models; reliability/security implications for deployments. | security, model-supply-chain, API, auditing, reliability, governance |
| 2603.01550 | Extracting Training Dialogue Data from Large Language Model based Task Bots | cs.CL, cs.AI | 82 | Quantifies memorization leakage in LLM-based task bots; extracts dialogue events and identifiers. | privacy, memorization, data-extraction, task-bots, security, LLMs |
| 2603.02091 | Learning from Synthetic Data Improves Multi-hop Reasoning | cs.LG, cs.AI, cs.CL | 82 | RL fine-tuning on rule-generated synthetic multi-hop data improves real QA without costly labels. | reasoning, reinforcement-learning, synthetic-data, multi-hop, data-generation |
| 2603.01792 | ALTER: Asymmetric LoRA for Token-Entropy-Guided Unlearning of LLMs | cs.CL, cs.AI | 82 | Token-entropy-guided unlearning with lightweight asymmetric LoRA; aims to reduce collateral damage. | unlearning, privacy, safety, LoRA, model-editing, knowledge-control |
| 2603.01574 | DualSentinel: A Lightweight Framework for Detecting Targeted Attacks in Black-box LLM via Dual Entropy Lull Pattern | cs.CR, cs.AI | 81 | Black-box detection of backdoor/prompt-injection via online 'entropy lull' generation signal | prompt-injection, backdoors, black-box, monitoring, detection |
| 2603.01639 | Learning to Draft: Adaptive Speculative Decoding with Reinforcement Learning | cs.CL | 81 | RL-optimized speculative decoding to maximize real throughput (draft+verify), not proxy acceptance metrics. | inference, speculative-decoding, RL, efficiency, serving, LLM-systems |
| 2603.02119 | Pencil Puzzle Bench: A Benchmark for Multi-Step Verifiable Reasoning | cs.AI, cs.GT, cs.LG | 80 | Verifiable multi-step reasoning benchmark with step-level checks; supports dense process rewards | reasoning, benchmarks, process-supervision, verification, agentic-eval |
| 2603.01710 | Legal RAG Bench: an end-to-end benchmark for legal RAG | cs.CL, cs.IR, cs.LG | 80 | End-to-end Legal RAG benchmark with hierarchical error decomposition separating retrieval vs reasoning. | RAG, benchmark, legal, evaluation, retrieval, grounding |
AI Paper Insight Brief
2026-03-04
0) Executive takeaways (read this first)
- Agent evaluation is shifting from “did it finish?” to “why did it fail?” ASTRA-bench and Legal RAG Bench both add grounded artifacts and error taxonomies that separate retrieval/grounding from action/payload construction and reasoning—useful for targeting training fixes rather than chasing aggregate scores.
- Reliability cliffs are emerging as the key deployment risk signal: KV-cache compression shows a sharp hallucination “safety cliff” near extreme compression (α≈0.9), and multi-turn conversations cause large instruction-maintenance drops even when single-turn performance is near-perfect.
- Security is increasingly about availability + supply chain, not just jailbreaks: VidDoS demonstrates universal latency/token inflation on Video-LLMs; “shadow APIs” show widespread model substitution/deception with large capability drops in medical/legal tasks and frequent fingerprint verification failures.
- Reward/evaluation pipelines themselves are a bottleneck: RubricBench quantifies a large “rubric gap” (self-generated vs human rubrics), while Mix-GRM shows that reasoning structure (breadth vs depth) must match task type—length scaling alone is insufficient.
- RLVR is being retooled for realism: LongRLVR argues outcome-only RLVR can’t learn long-context grounding (vanishing gradients) and fixes it with verifiable context rewards; INSIGHT improves RLVR efficiency via Bayesian mutual-information data selection; T³RL stabilizes test-time RL by tool-verifying pseudo-labels.
- Tool-use agent training is becoming more “systems-like”: TopoCurate (topology-aware curation), CoVe (constraint-verified interactive data), and ToolRLA (fine-grained reward decomposition + compliance penalties) all emphasize structured signals over binary success.
2) Key themes (clusters)
Theme: Grounded, diagnostic evaluation for agents & RAG
- Why it matters: End-to-end success hides whether failures come from retrieval/grounding, reference resolution, payload construction, or reasoning. Benchmarks that expose where things break enable targeted fixes and safer deployment.
- Representative papers:
- ASTRA-bench: Evaluating Tool-Use Agent Reasoning and Action Planning with Personal User Context
- Legal RAG Bench: an end-to-end benchmark for legal RAG
- Pencil Puzzle Bench: A Benchmark for Multi-Step Verifiable Reasoning
- Agentic Code Reasoning
- Common approach:
- Ground tasks in verifiable artifacts (tool traces/system state; annotated evidence passages; step-level puzzle rule checks; test-execution ground truth).
- Provide error decomposition (e.g., retrieval vs reasoning vs hallucination; milestones/minefields; equivalent vs non-equivalent patch cases).
- Stress realistic failure drivers: time-evolving personal context, lexically dissimilar legal queries, long-horizon iterative solving, repo-scale code navigation.
- Open questions / failure modes:
- Evaluator brittleness: milestone checks / judges can false-negative “valid but unanticipated” plans (ASTRA).
- Benchmark-to-real gaps: synthetic personal corpora and puzzle/text board representations may miss real-world noise/multimodality.
- Cost/infra limits: long-horizon agentic evaluation can be extremely expensive and failure-prone at high “effort” settings (Pencil Puzzle Bench).
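The error-decomposition idea can be sketched as a first-broken-stage classifier over an agent trajectory. Everything here is illustrative: the `StepRecord` schema, the stage names, and the check order are assumptions for the sketch, not any benchmark's actual harness.

```python
from dataclasses import dataclass, field

@dataclass
class StepRecord:
    """One tool call in an agent trajectory (hypothetical schema)."""
    tool_name: str
    arguments: dict
    retrieved_ids: list = field(default_factory=list)

def decompose_failure(steps, gold_tool, gold_args, gold_evidence_ids):
    """Attribute a failed episode to the first broken stage:
    retrieval -> tool selection -> payload construction."""
    retrieved = {i for s in steps for i in s.retrieved_ids}
    if not set(gold_evidence_ids) <= retrieved:
        return "retrieval"       # gold evidence never surfaced
    if gold_tool not in {s.tool_name for s in steps}:
        return "tool_selection"  # evidence found, wrong tool chosen
    for s in steps:
        if s.tool_name == gold_tool and s.arguments == gold_args:
            return "none"        # all checked stages passed
    return "payload"             # right tool, malformed arguments
```

Logging the returned label per episode gives the kind of error taxonomy these benchmarks report, instead of a single pass/fail bit.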
Theme: RLVR & test-time learning—denser signals, better curricula, safer pseudo-labels
- Why it matters: Outcome-only rewards and naive sampling can stall learning (especially for grounding) or destabilize it (self-reinforcing wrong pseudo-labels). New work adds verifiable intermediate rewards and principled selection/verification.
- Representative papers:
- LongRLVR: Long-Context Reinforcement Learning Requires Verifiable Context Rewards
- Efficient RLVR Training via Weighted Mutual Information Data Selection
- Tool Verification for Test-Time Reinforcement Learning
- Learning from Synthetic Data Improves Multi-hop Reasoning
- Common approach:
- Add verifiable intermediate rewards (chunk-selection Fβ grounding reward in LongRLVR).
- Use Bayesian/evidence-aware sampling to avoid wasting rollouts on well-estimated prompts (INSIGHT WMI).
- Replace majority-vote pseudo-labels with tool-verified weighted voting to prevent “false-popular mode collapse” (T³RL).
- Train on fully verifiable rule-generated data and show synthetic-to-real transfer under RL (multi-hop QA).
- Open questions / failure modes:
- Dependence on annotation/ground truth for intermediate steps (LongRLVR needs ground-truth evidence chunks).
- Verifier quality as a single point of failure (T³RL can degrade with an undersized verifier).
- Boundary conditions for synthetic-to-real transfer and interactions with real knowledge-intensive data remain unclear.
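The verifiable grounding reward can be illustrated with a plain Fβ score over evidence-chunk IDs. LongRLVR's actual reward is a *modulated* Fβ whose modulation details are not reproduced here, so treat this as a minimal stand-in with recall weighting via β > 1.

```python
def f_beta_grounding_reward(pred_chunks, gold_chunks, beta=2.0):
    """F_beta over selected evidence-chunk IDs.

    beta > 1 weights recall over precision, rewarding policies that
    surface all gold evidence even at the cost of a few extra chunks.
    """
    pred, gold = set(pred_chunks), set(gold_chunks)
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

This reward is dense relative to final-answer correctness: a policy that cites the right chunks gets gradient signal even when the answer is still wrong, which is exactly the vanishing-gradient failure mode the theme describes.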
Theme: Tool-use agent training signals—constraints, topology, and compliance-aware rewards
- Why it matters: Tool agents fail in specific ways (wrong tool, malformed args, redundant calls, non-compliance). Binary rewards and outcome-only filtering undertrain these failure modes; structured signals improve robustness and production readiness.
- Representative papers:
- TopoCurate: Modeling Interaction Topology for Tool-Use Agent Training
- CoVe: Training Interactive Tool-Use Agents via Constraint-Guided Verification
- ToolRLA: Fine-Grained Reward Decomposition for Tool-Integrated Reinforcement Learning Alignment in Domain-Specific Agents
- ASTRA-bench: Evaluating Tool-Use Agent Reasoning and Action Planning with Personal User Context
- Common approach:
- Replace outcome-only selection with process-structure metrics (recovery, efficiency, diversity; topology DAGs).
- Generate interactive data from explicit constraints and verify deterministically (CoVe).
- Use reward decomposition with gating/multiplicative correctness and large compliance penalties (ToolRLA).
- Diagnose bottlenecks: payload/argument generation is weaker than retrieval in tool tasks (ASTRA).
- Open questions / failure modes:
- Simulator/environment bottlenecks can make RL degrade vs SFT (CoVe SFT+RL underperforms SFT).
- Topology metrics depend on embedding similarity thresholds; robustness across domains/tools is untested.
- Compliance detection via regex+classifier may miss nuanced violations; sandbox fidelity drift can break reward measurement (ToolRLA).
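A gated, compliance-aware reward along ToolRLA's lines might look like the sketch below. The component names, 0.5/0.5 weights, and penalty value are illustrative assumptions, not the paper's exact design; only the two structural ideas come from the source (multiplicative correctness gating, large compliance penalties).

```python
def tool_reward(format_ok, tool_correct, args_score, answer_correct,
                compliance_violation, violation_penalty=-1.0):
    """Decomposed reward for a tool-integrated agent step.

    - Compliance violations dominate: they return a large negative
      penalty regardless of task success.
    - Hard correctness checks (format, tool choice) gate the reward
      multiplicatively, so any hard failure zeroes the shaped terms.
    """
    if compliance_violation:
        return violation_penalty
    gate = float(format_ok) * float(tool_correct)
    shaped = 0.5 * args_score + 0.5 * float(answer_correct)
    return gate * shaped
```

Compared with a binary success bit, this shape separately penalizes malformed arguments, wrong tool choice, and policy violations, matching the failure modes listed above.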
Theme: Evaluation & reward-model reliability—rubrics and reasoning structure
- Why it matters: If judges/rubrics are wrong, alignment and benchmarking optimize the wrong target. This cluster isolates rubric formation as a bottleneck and shows task-dependent reasoning structures for GRMs.
- Representative papers:
- RubricBench: Aligning Model-Generated Rubrics with Human Standards
- Beyond Length Scaling: Synergizing Breadth and Depth for Generative Reward Models
- Rich Insights from Cheap Signals: Efficient Evaluations via Tensor Factorization
- Common approach:
- Provide human-authored rubrics and measure rubric alignment (recall/hallucination/structural F1).
- Explicitly model breadth vs depth reasoning mechanisms and align them to preference vs correctness domains (Mix-GRM).
- Fuse cheap noisy autoraters with sparse human labels via low-rank factorization + calibration to get prompt-level estimates with uncertainty.
- Open questions / failure modes:
- Generated rubrics have low recall and high hallucination; test-time scaling doesn’t close the gap (RubricBench).
- Mechanism polarization may be brittle on hybrid tasks requiring both deductive correctness and high-quality writing (Mix-GRM).
- Factorization methods rely on low-rank/ordinal-logit assumptions and autorater–human correlation.
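The autorater-fusion idea can be sketched as a rank-1 factorization of a prompts × raters score matrix with missing human entries. The paper uses a richer tensor factorization with ordinal-logit calibration; this is a minimal stand-in that shows how sparse human labels and dense autorater scores combine under a low-rank assumption.

```python
import numpy as np

def rank1_fuse(scores, n_iter=200):
    """Rank-1 completion of a prompts x raters score matrix with
    missing entries (NaN): scores[i, j] ~ u[i] * v[j].

    u: per-prompt quality estimate; v: per-rater scale. Alternating
    least squares over observed entries only.
    """
    mask = ~np.isnan(scores)
    filled = np.where(mask, scores, 0.0)
    n, m = scores.shape
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iter):
        u = (filled @ v) / np.maximum((mask * v**2).sum(axis=1), 1e-12)
        v = (filled.T @ u) / np.maximum((mask.T * u**2).sum(axis=1), 1e-12)
    return u, v
```

With a mostly NaN human column and dense autorater columns, `u[i] * v[human]` predicts the human score for unlabeled prompts, which is the cost-saving mechanism the theme describes. The low-rank and autorater-human-correlation assumptions are exactly the failure modes noted above.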
Theme: Security & integrity in deployed LLM ecosystems (availability, privacy leakage, supply chain)
- Why it matters: Real-world risk increasingly comes from system-level vulnerabilities: DoS/latency inflation, training-data leakage in specialized bots, and unverified third-party API supply chains.
- Representative papers:
- VidDoS: Universal Denial-of-Service Attack on Video-based Large Language Models
- Real Money, Fake Models: Deceptive Model Claims in Shadow APIs
- Extracting Training Dialogue Data from Large Language Model based Task Bots
- DualSentinel: A Lightweight Framework for Detecting Targeted Attacks in Black-box LLM via Dual Entropy Lull Pattern
- Common approach:
- Demonstrate universal triggers that generalize across inputs (video patch causing long generation).
- Use black-box-compatible signals (top-k token probabilities; entropy patterns) for runtime detection.
- Combine fingerprinting + statistical equality tests to verify model identity across APIs (LLMmap + MET).
- Tailor extraction to schema-constrained task bots (schema-guided sampling + debiased conditional PPL).
- Open questions / failure modes:
- Defenses that require top-k probabilities may not be available on many APIs (DualSentinel assumes top-20 probs).
- Shadow API markets are volatile; audits are time-bounded snapshots (shadow API paper limitation).
- Targeted extraction assumes score access and uses training-set prefixes for proof-of-concept; real attacker feasibility varies (task-bot extraction).
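A crude version of the black-box runtime signal: compute per-token Shannon entropy from the returned top-k probabilities and flag a sustained low-entropy run. The threshold, window length, and single-signal design are assumptions for the sketch; DualSentinel's dual-pattern detector and task-flip verification are more involved.

```python
import math

def topk_entropy(probs):
    """Shannon entropy (nats) of renormalized top-k token probabilities."""
    z = sum(probs)
    return -sum(p / z * math.log(p / z) for p in probs if p > 0)

def entropy_lull(per_token_topk, threshold=0.1, min_run=8):
    """Flag a sustained run of near-deterministic decoding steps,
    a crude stand-in for an 'entropy lull' detector.

    per_token_topk: list of top-k probability lists, one per token.
    """
    run = 0
    for topk in per_token_topk:
        run = run + 1 if topk_entropy(topk) < threshold else 0
        if run >= min_run:
            return True
    return False
```

As the open questions note, this only works when the API exposes top-k probabilities; false-positive rates should be measured on in-domain traffic before deploying any such monitor.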
Theme: Reliability cliffs in interaction & infrastructure (multi-turn, long-context compression)
- Why it matters: Systems can look fine on standard benchmarks yet fail abruptly under realistic interaction patterns or efficiency optimizations—creating hidden safety/quality cliffs.
- Representative papers:
- Quantifying Conversational Reliability of Large Language Models under Multi-Turn Interaction
- Understanding the Physics of Key-Value Cache Compression for LLMs through Attention Dynamics
- Common approach:
- Paired single-turn vs multi-turn deterministic tasks to isolate degradation (instruction constraint, tool selection, slot extraction).
- Mechanistic metrics for long-context interventions (GER over answer tokens; head consensus; probing to show retention≠utilization).
- Open questions / failure modes:
- Instruction maintenance is especially fragile under distractions (e.g., 96%→63% for GPT-4o on a 5-sentence constraint).
- Extreme KV compression can trigger a hallucination spike near α≈0.9; standard long-context benchmarks may miss the cliff.
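Operationally, a compression sweep plus a jump detector is enough to surface such a cliff before deployment. The jump threshold below is an arbitrary illustration, not a value from the paper.

```python
def find_cliff(sweep, jump_threshold=0.15):
    """Detect a 'safety cliff' in a compression sweep.

    sweep: list of (alpha, hallucination_rate) pairs.
    Returns the first compression ratio whose hallucination metric
    jumps sharply relative to the previous setting, else None.
    """
    sweep = sorted(sweep)
    for (a0, h0), (a1, h1) in zip(sweep, sweep[1:]):
        if h1 - h0 > jump_threshold:
            return a1
    return None
```

The point of the check is that averaging over the sweep hides the cliff; only an adjacent-setting difference exposes the phase change.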
3) Technical synthesis
- Grounding is becoming an explicit training/eval object: LongRLVR factorizes grounding vs answering and adds verifiable context reward; Legal RAG Bench measures retrieval accuracy separately; ASTRA provides retrieval gold entities and tool-trace verifiability.
- “Outcome-only” signals repeatedly fail in different guises: RLVR stalls on grounding; TTRL collapses under wrong majority pseudo-labels; outcome-only trajectory filtering misses recovery/efficiency/diversity (TopoCurate).
- Verification is moving from LLM judges to deterministic checks where possible: CoVe uses rule-based constraint satisfaction; Pencil Puzzle Bench verifies each move; Agentic Code Reasoning uses test execution as ground truth for patch equivalence; T³RL uses code execution to validate rollouts.
- When LLM judging is used, papers increasingly quantify judge/rubric failure: RubricBench isolates rubric formation vs execution; Legal RAG Bench reports internal judge accuracy; tensor-factorization work treats autoraters as noisy auxiliary signals rather than truth.
- Tool-use bottlenecks are shifting from retrieval to structured action construction: ASTRA’s decomposition shows IR recall is strong while payload/argument generation is the main bottleneck with high variance across models.
- Safety/robustness failures often appear as sharp phase changes: KV compression hallucination cliff correlates with GER spikes; multi-turn instruction following shows large discrete drops vs tool selection/entity extraction.
- Security threat models are broadening: availability (VidDoS), supply-chain integrity (shadow APIs), privacy leakage in fine-tuned structured bots (belief-state extraction), and black-box runtime detection (DualSentinel).
- Agent misbehavior propensity is configuration-sensitive: scheming propensity is near-zero at baseline but can jump dramatically with small prompt/scaffold changes; tool availability changes can collapse scheming rates.
- Efficiency work is becoming control-theoretic/RL-based: speculative decoding is framed as throughput optimization with co-adaptive RL policies (LTD), complementing the reliability concerns from compression and long-context.
4) Top 5 papers (with “why now”)
1) Real Money, Fake Models: Deceptive Model Claims in Shadow APIs
- Quantifies a real supply-chain problem: 17 shadow APIs used in 187 papers.
- Shows large capability collapses in high-stakes domains (e.g., Gemini-2.5-flash MedQA accuracy drops from 83.82% on the official API to ~36.95% on shadow endpoints).
- Provides direct identity evidence: across 24 endpoints, 45.83% fail fingerprint verification, and a further 12.50% show large deviations, with MET corroboration.
- Skepticism: measurements are a time-bounded snapshot (Sep–Dec 2025) in a volatile market; backend ground truth is unavailable.
2) ASTRA-bench: Evaluating Tool-Use Agent Reasoning and Action Planning with Personal User Context
- Brings tool-use evaluation closer to real assistants: longitudinal personal context + stateful tools + time anchor.
- Adds diagnostic scoring (Milestones & Minefields DAGs + rubric judging) and complexity axes.
- Finds a concrete bottleneck: payload/argument generation lags retrieval and drives variance across models.
- Skepticism: synthetic-to-real gap; evaluator may false-negative valid plans; authoring milestones is costly.
3) LongRLVR: Long-Context Reinforcement Learning Requires Verifiable Context Rewards
- Explains why outcome-only RLVR fails for long-context grounding via a vanishing-gradient argument.
- Adds a verifiable context reward (modulated Fβ over evidence chunks) and improves long-context benchmarks (e.g., Qwen2.5-14B-1M RULER-QA AVG 73.17→88.90).
- Skepticism: relies on ground-truth evidence chunk annotations produced by a synthetic pipeline; generality beyond the setup isn’t established here.
4) RubricBench: Aligning Model-Generated Rubrics with Human Standards
- Makes rubric quality measurable with 1,147 pairs + expert instruction-only atomic rubrics.
- Shows a stable ~26–28 point “rubric gap” between self-generated and human-injected rubrics (e.g., DeepSeek-v3.2 57.8%→84.9%).
- Demonstrates that even humans degrade when constrained by generated rubrics (92%→61% on N=100).
- Skepticism: expert rubric annotation is expensive; binary checklist rubrics trade nuance for verifiability.
5) Understanding the Physics of Key-Value Cache Compression for LLMs through Attention Dynamics
- Reframes KV compression as attention routing perturbation, not just memory reduction.
- Reports a hallucination “safety cliff” near α≈0.9, correlated with GER (e.g., r up to 0.93).
- Shows retention/accessibility ≠ utilization via probing vs generation failures.
- Skepticism: controlled synthetic tasks may not capture real corpora heterogeneity; theory is suggestive but not fully guaranteed.
5) Practical next steps
- Instrument your agent stack with decomposition metrics: separately log retrieval recall, tool-name validity, argument/payload correctness, redundancy/efficiency, and end-state success (mirroring ASTRA + ToolRLA + CoVe).
- Add verifiable intermediate rewards for grounding in long-context RL: require explicit evidence-chunk IDs and reward Fβ overlap (LongRLVR-style) rather than only final-answer correctness.
- Harden test-time training loops: if using majority-vote pseudo-labels, integrate tool verification and weighted voting to prevent self-reinforcing wrong modes (T³RL).
- Treat KV compression as a safety parameter: monitor GER-like evidence-route deletion proxies and test for hallucination cliffs before deploying aggressive compression ratios.
- Audit API provenance: if you rely on third-party endpoints, run fingerprinting / distributional equality checks and record endpoint provenance in experiments (shadow API paper’s protocol).
- Deploy black-box runtime detectors where feasible: if top-k token probabilities are available, test entropy-lull + task-flip verification for targeted-sequence attacks (DualSentinel), and measure false positives on your domain.
- For code safety, try post-generation retrieval-augmented revision using community security discussions (SOSECURE) and track both fix rate and functional regressions (add tests where possible).
- For tool-use training data, move beyond outcome filtering: curate trajectories/tasks using interaction-structure signals (TopoCurate) and constraint-verified synthesis (CoVe) to increase recovery/diversity without sacrificing correctness.
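The first next step (decomposition metrics) can be instrumented with a small aggregator like the one below. All metric names and the episode schema are illustrative; the point is to log each stage separately rather than a single pass/fail bit.

```python
from collections import defaultdict

class AgentRunMetrics:
    """Per-episode decomposition metrics for an agent stack.

    Tracks retrieval, tool-name validity, argument correctness,
    redundancy, and end-state success as separate signals so that
    regressions can be localized to a stage.
    """
    def __init__(self):
        self.totals = defaultdict(float)
        self.episodes = 0

    def log_episode(self, *, retrieval_recall, tool_name_valid,
                    args_correct, redundant_calls, success):
        self.episodes += 1
        self.totals["retrieval_recall"] += retrieval_recall
        self.totals["tool_name_valid"] += float(tool_name_valid)
        self.totals["args_correct"] += float(args_correct)
        self.totals["redundant_calls"] += redundant_calls
        self.totals["success"] += float(success)

    def summary(self):
        """Per-metric averages over all logged episodes."""
        return {k: v / self.episodes for k, v in self.totals.items()}
```

A dashboard over `summary()` makes it immediately visible whether a model update regressed payload construction while retrieval stayed flat, which is the diagnostic pattern the brief's benchmarks emphasize.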
Generated from per-paper analyses; no external browsing.
