Daily AI Paper Report (2026-03-04)
Run stats
- Candidates: 236
- Selected: 30
- Deepread completed: 32
- Window (UTC): 2026-03-03T01:00:00Z → 2026-03-04T01:00:00Z (arxiv_announce, expanded=0)
Selected papers
| arXiv ID | Title / Links | Score | Why | Tags |
|---|---|---|---|---|
| 2603.01608 | 2603.01608 | 95 | Systematic eval of LLM agent scheming incentives; realistic scenarios + factor decomposition | agent-safety, scheming, evaluation, instrumental-goals, autonomy |
| 2603.02196 | 2603.02196 | 94 | Conformal calibration to bound policy risk vs safe reference; provable safety for exploration. | agent-safety, conformal-prediction, safe-exploration, risk-bounds, RL |
| 2603.01564 | 2603.01564 | 92 | Survey + taxonomy for agentic/web threats (memory/tool/env injection) and defenses | agent-security, prompt-injection, tool-safety, memory-attacks, survey, threat-models |
| 2603.01423 | 2603.01423 | 92 | Systematic multi-turn reliability eval incl. constraints, tool choice, entity tracking; shows degradation. | evaluation, reliability, multi-turn, tool-use, dialogue, agentic |
| 2603.01454 | 2603.01454 | 92 | Universal DoS-style energy/latency attack on Video-LLMs; practical triggers without test-time grads. | security, adversarial-attacks, denial-of-service, video-llm, robustness |
| 2603.01357 | 2603.01357 | 91 | New benchmark for tool-use agents with evolving personal context; exposes failures at high complexity. | benchmark, agents, tool-use, personal-context, planning, evaluation |
| 2603.01589 | 2603.01589 | 90 | Large scientific safety benchmark (0.25M) + 1.5M training set with more objective metrics | safety-eval, benchmarks, science-safety, datasets, red-teaming |
| 2603.02203 | 2603.02203 | 90 | Adds tool-based verification to test-time RL to prevent spurious consensus reward collapse. | reasoning, test-time-training, verification, tools, robustness |
| 2603.01896 | 2603.01896 | 90 | Semi-formal prompting gives checkable “certificates” for agent code reasoning; strong gains reported. | agents, code, reasoning, verification, prompting, reliability, software-engineering |
| 2603.02146 | 2603.02146 | 89 | Shows outcome-only RLVR fails for long-context grounding; proposes verifiable context rewards + theory. | RLVR, long-context, grounding, alignment, training, theory |
| 2603.01784 | 2603.01784 | 88 | Co-evolutionary multimodal safety alignment with evolving adversarial attacks (genetic ops) | multimodal, adversarial-training, alignment, robustness, automated-redteaming |
| 2603.01907 | 2603.01907 | 88 | Mutual-information data selection for RLVR/RL training; targets efficiency + uncertainty, not just difficulty. | RLHF, RLVR, data-selection, uncertainty, bayesian, training-efficiency, alignment |
| 2603.01426 | 2603.01426 | 87 | KV-cache compression analysis finds hallucination 'safety cliff' near high compression; better eval lens. | long-context, KV-cache, efficiency, hallucinations, attention, robustness |
| 2603.02029 | 2603.02029 | 87 | Cuts eval cost by combining cheap autoraters + few human labels via tensor factorization. | evaluation, human-preference, autoraters, statistical-modeling, scalable-evals |
| 2603.01494 | 2603.01494 | 86 | Inference-time safety for code LLMs via retrieval-augmented revision using security knowledge | code-llms, secure-coding, RAG, inference-time, software-security |
| 2603.01714 | 2603.01714 | 86 | Interaction-topology curation for tool-use training; goes beyond pass-rate filtering to informative tasks. | agents, tool-use, data-curation, RL, training, trajectories |
| 2603.01940 | 2603.01940 | 85 | Constraint-guided verification to synthesize correct tool-use trajectories + RL rewards | tool-use, agents, verification, post-training, data-synthesis, RL |
| 2603.02128 | 2603.02128 | 85 | Measures LLM agent behavior in crisis sims: alignment to humans, risk calibration, framing drift. | agent-evaluation, risk-calibration, geopolitics, behavioral-analysis, multi-round |
| 2603.01562 | 2603.01562 | 84 | RubricBench benchmark for rubric-based reward/evaluation; targets hard, bias-misleading comparisons. | reward-models, evaluation, rubrics, alignment, benchmark, preference-modeling |
| 2603.02208 | 2603.02208 | 84 | Procedural, verifiable symbolic data suite (planning/FOL/CFG/causal/equations) for scaling reasoning. | synthetic-data, reasoning, verification, benchmarks, curriculum |
| 2603.01571 | 2603.01571 | 84 | Structured breadth+depth CoT for generative reward models; SFT+RLVR to improve evaluator reliability. | reward-models, evaluation, RLVR, chain-of-thought, reliability, alignment |
| 2603.01620 | 2603.01620 | 83 | Fine-grained reward decomposition for tool-integrated agent alignment beyond binary success | agents, tool-calling, RLHF, reward-modeling, DPO, GRPO |
| 2603.01919 | 2603.01919 | 83 | First audit of 'shadow APIs' claiming frontier models; reliability/security implications for deployments. | security, model-supply-chain, API, auditing, reliability, governance |
| 2603.01550 | 2603.01550 | 82 | Quantifies memorization leakage in LLM-based task bots; extracts dialogue events and identifiers. | privacy, memorization, data-extraction, task-bots, security, LLMs |
| 2603.02091 | 2603.02091 | 82 | RL fine-tuning on rule-generated synthetic multi-hop data improves real QA without costly labels. | reasoning, reinforcement-learning, synthetic-data, multi-hop, data-generation |
| 2603.01792 | 2603.01792 | 82 | Token-entropy-guided unlearning with lightweight asymmetric LoRA; aims to reduce collateral damage. | unlearning, privacy, safety, LoRA, model-editing, knowledge-control |
| 2603.01574 | 2603.01574 | 81 | Black-box detection of backdoor/prompt-injection via online 'entropy lull' generation signal | prompt-injection, backdoors, black-box, monitoring, detection |
| 2603.01639 | 2603.01639 | 81 | RL-optimized speculative decoding to maximize real throughput (draft+verify), not proxy acceptance metrics. | inference, speculative-decoding, RL, efficiency, serving, LLM-systems |
| 2603.02119 | 2603.02119 | 80 | Verifiable multi-step reasoning benchmark with step-level checks; supports dense process rewards | reasoning, benchmarks, process-supervision, verification, agentic-eval |
| 2603.01710 | 2603.01710 | 80 | End-to-end Legal RAG benchmark with hierarchical error decomposition separating retrieval vs reasoning. | RAG, benchmark, legal, evaluation, retrieval, grounding |
AI Paper Insight Brief
2026-03-04
0) Executive takeaways (read this first)
- Agent reliability is bottlenecked less by “finding info” and more by “acting correctly”: in personal-context tool agents, information retrieval recall is high while payload/argument construction is the main failure point (ASTRA-bench).
- Multi-turn interaction is a first-class robustness hazard: instruction maintenance collapses sharply in multi-turn settings (e.g., a global “≤5 sentences” constraint), while tool selection and slot extraction degrade less and in model-size-dependent ways (Conversational Reliability).
- Optimization shortcuts can hide cliffs: KV-cache compression can look fine on standard long-context benchmarks yet hit a hallucination “safety cliff” near extreme compression (~0.9) tied to attention-route deletion (KV compression physics).
- Availability attacks are now practical for Video-LLMs: a universal, offline-trained patch can induce 200× token inflation and >15s latency overhead, creating real-time safety risks (VidDoS).
- Evaluation infrastructure is itself a bottleneck: rubric-guided judging improves only when rubrics are correct; there’s a large, stable “rubric gap” (~26–28 points) between self-generated and human rubrics (RubricBench).
- RLVR is splitting into two regimes: (i) cheap, fully verifiable synthetic data can transfer to real multi-hop QA (Synthetic→Real RLVR), but (ii) long-context grounding needs verifiable intermediate context rewards or RLVR stalls (LongRLVR).
2) Key themes (clusters)
Theme: Tool-using agents in realistic, stateful environments
- Why it matters: Real assistants must resolve references over evolving personal data and execute multi-step tool plans; failures often come from the interaction of context, time, and tool schemas rather than any single skill.
- Representative papers:
- ASTRA-bench: Evaluating Tool-Use Agent Reasoning and Action Planning with Personal User Context
- CoVe: Training Interactive Tool-Use Agents via Constraint-Guided Verification
- TopoCurate: Modeling Interaction Topology for Tool-Use Agent Training
- ToolRLA: Fine-Grained Reward Decomposition for Tool-Integrated Reinforcement Learning Alignment in Domain-Specific Agents
- Common approach:
- Build executable sandboxes and score via verifiable traces/state (ASTRA milestones/minefields; CoVe rule-based constraint verifier).
- Move beyond binary success: decompose into retrieval vs payload vs planning (ASTRA) or format/correctness/efficiency/compliance (ToolRLA).
- Improve training signal by curating trajectories/tasks using process structure (TopoCurate) or generating solvable ambiguity via constraint fuzzification (CoVe).
- Open questions / failure modes:
- Evaluators can false-negative “valid but unanticipated” plans; milestone authoring cost remains high (ASTRA).
- RL can underperform SFT if the online simulator/environment is weaker than the target interaction distribution (CoVe’s SFT+RL drop).
- How to robustly reduce payload/argument errors (ASTRA bottleneck) without overfitting to tool schemas.
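Since payload/argument construction is the reported bottleneck, it helps to make "payload correctness" a first-class check. Below is a minimal sketch of a schema-adherence validator for tool-call arguments; the schema format and the `create_event` tool are illustrative assumptions, not ASTRA-bench's or CoVe's actual evaluators.

```python
# Minimal payload-correctness check for tool calls, separate from retrieval and
# planning metrics. Schema format and tool names are illustrative only.

def check_payload(call: dict, schema: dict) -> list[str]:
    """Return a list of payload errors for one tool call (empty = valid)."""
    errors = []
    args = call.get("arguments", {})
    for name, spec in schema["parameters"].items():
        if spec.get("required") and name not in args:
            errors.append(f"missing required argument: {name}")
        elif name in args and not isinstance(args[name], spec["type"]):
            errors.append(f"wrong type for {name}: expected {spec['type'].__name__}")
    for name in args:
        if name not in schema["parameters"]:
            errors.append(f"unknown argument: {name}")
    return errors

# Hypothetical calendar tool schema.
schema = {
    "name": "create_event",
    "parameters": {
        "title": {"type": str, "required": True},
        "start": {"type": str, "required": True},
        "attendees": {"type": list, "required": False},
    },
}
call = {"name": "create_event", "arguments": {"title": "Sync", "attendees": "alice"}}
errors = check_payload(call, schema)
# Two errors here: missing required "start", wrong type for "attendees".
```

Tracking this error list per tool call, independently of whether retrieval found the right facts, is what lets a harness attribute failures to argument construction rather than context lookup.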
Theme: Verifiable, multi-step reasoning + training signals (RLVR, process feedback)
- Why it matters: As models scale inference-time compute and agentic loops, progress depends on dense, reliable feedback—but outcome-only rewards can be too sparse, especially for grounding.
- Representative papers:
- LongRLVR: Long-Context Reinforcement Learning Requires Verifiable Context Rewards
- Pencil Puzzle Bench (step-level verifiable multi-step reasoning)
- T³RL (tool-verified test-time RL)
- Synthetic multi-hop RLVR (rule-generated multi-hop QA training)
- Common approach:
- Use verifiable intermediate signals: context-chunk rewards (LongRLVR), step-level puzzle rule checks (Pencil Puzzle Bench), tool-executed verification for pseudo-labeling (T³RL).
- Leverage cheap, rule-generated synthetic worlds with exact rewards to train transferable reasoning (Synthetic multi-hop RLVR).
- Separate “answering” from “grounding/selection” explicitly (LongRLVR policy factorization).
- Open questions / failure modes:
- When does synthetic-to-real transfer fail, and for which reasoning skills beyond “knowledge composition”?
- Verifier quality is a hard dependency: weak verifiers can degrade test-time RL (T³RL failure case with small verifier).
- Infrastructure reliability becomes part of the metric (Pencil Puzzle Bench reports high failure rates at extreme reasoning effort).
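The "verifiable intermediate signal" idea is concrete enough to sketch. Below is an F-beta-style context-chunk reward in the spirit of LongRLVR: score the model's selected evidence chunks against gold chunk IDs, with beta < 1 weighting precision more heavily. The specific beta value and chunk-ID representation are assumptions, not taken from the paper.

```python
# F-beta reward over selected vs gold evidence chunks (LongRLVR-style sketch).
# beta < 1 favors precision; beta = 0.5 is an illustrative choice.

def f_beta(selected: set, gold: set, beta: float = 0.5) -> float:
    """Verifiable context reward: F-beta of selected chunk IDs vs gold IDs."""
    if not selected or not gold:
        return 0.0
    tp = len(selected & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(selected)
    recall = tp / len(gold)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Model selects chunks {3, 7, 9}; gold evidence is {3, 7}:
reward = f_beta({3, 7, 9}, {3, 7})  # precision 2/3, recall 1.0 -> 5/7
```

Because the reward fires whenever the policy selects any correct chunk, it provides gradient signal for the grounding step even when the final answer is still wrong, which is exactly the sparse-reward failure the outcome-only setup hits.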
Theme: Evaluation reliability (rubrics, autoraters, and diagnostic benchmarks)
- Why it matters: Alignment and progress claims increasingly depend on automated evaluation; if rubrics/judges are wrong, training signals and leaderboards drift.
- Representative papers:
- RubricBench: Aligning Model-Generated Rubrics with Human Standards
- Legal RAG Bench (end-to-end legal RAG with hierarchical error decomposition)
- Tensor-factorization evaluation (fusing autoraters with scarce human labels)
- Common approach:
- Create grounded artifacts to diagnose where failures come from (ASTRA retrieval gold + success conditions; Legal RAG Bench retrieval vs reasoning vs hallucination taxonomy).
- Quantify evaluator failure sources: rubric formation vs rubric execution (RubricBench’s controlled oracle rubric setting).
- Fuse noisy scalable signals with scarce human labels via statistical calibration (tensor factorization with human calibration).
- Open questions / failure modes:
- Self-generated rubrics have low recall and high hallucination; test-time scaling doesn’t close the gap (RubricBench).
- LLM-as-judge bias remains; even “objective” setups still rely on judge accuracy claims (Legal RAG Bench uses GPT-5.2 judge with internal review).
- How to extend these methods to multi-turn/agentic evaluations where outcomes depend on trajectories, not single responses (noted as a future direction in tensor-factorization eval).
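The "noisy sensors calibrated to scarce human labels" framing can be illustrated with a deliberately simplified stand-in for the tensor-factorization method: treat each autorater as a linear sensor of the human score, fit per-rater bias and scale on the small human-labeled overlap, then average calibrated scores on unlabeled items. All data below is synthetic; the real paper's model is richer than this.

```python
# Simplified stand-in for autorater calibration against scarce human labels.
# Each autorater is modeled as y_human ~ a * score + b, fit on overlap items.

def fit_linear(xs, ys):
    """Least-squares fit y ~ a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    a = cov / var if var else 0.0
    return a, my - a * mx

def calibrated_score(autorater_scores, calibrations):
    """Average the per-rater calibrated scores for one item."""
    vals = [a * s + b for s, (a, b) in zip(autorater_scores, calibrations)]
    return sum(vals) / len(vals)

# Human labels for 4 overlap items, plus two autoraters' raw scores.
human  = [1.0, 2.0, 3.0, 4.0]
rater1 = [2.0, 4.0, 6.0, 8.0]   # scale-inflated rater
rater2 = [1.5, 2.5, 3.5, 4.5]   # offset rater
cals = [fit_linear(rater1, human), fit_linear(rater2, human)]
# Unlabeled item scored (10.0, 5.5) by the two raters:
est = calibrated_score([10.0, 5.5], cals)  # both calibrate to 5.0 here
```

Even this crude version captures the core economics: a handful of human labels repairs systematic autorater bias across a much larger unlabeled pool.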
Theme: Security & privacy of deployed LLM systems (availability, extraction, supply chain)
- Why it matters: As LLMs become embedded in real-time and API ecosystems, threats shift from “bad text” to system-level failures: latency DoS, training-data leakage, and model provenance deception.
- Representative papers:
- VidDoS: Universal Denial-of-Service Attack on Video-based Large Language Models
- Extracting Training Dialogue Data from Large Language Model based Task Bots
- Real Money, Fake Models: Deceptive Model Claims in Shadow APIs
- DualSentinel: A Lightweight Framework for Detecting Targeted Attacks in Black-box LLM via Dual Entropy Lull Pattern
- Common approach:
- Demonstrate practical attacks under realistic constraints: universal offline patch for video (VidDoS), score-based black-box extraction for task bots (Task-bot extraction), black-box runtime detection using top-k probs (DualSentinel).
- Add verification/auditing layers: fingerprinting + hypothesis testing for API identity (Shadow APIs), dual-check runtime verification (DualSentinel).
- Open questions / failure modes:
- Universal triggers and long-generation attacks suggest availability needs first-class evaluation/mitigation in multimodal systems.
- Extraction risk is much higher in targeted settings with partial prefixes; real attacker feasibility beyond proof-of-concept prefixes remains open (Task bots).
- Shadow API markets are volatile; audits are snapshots and lack backend ground truth (Shadow APIs limitations).
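The "distributional equality" framing behind the shadow-API audit can be sketched very simply: send the same prompt repeatedly to a trusted reference endpoint and to the claimed endpoint, bucket the responses, and compare the empirical distributions. Total-variation distance with an ad-hoc threshold stands in here for the paper's actual statistical tests (LLMmap fingerprinting, MET); the data and the first-token bucketing are purely illustrative.

```python
# Toy "same model?" check: total-variation distance between empirical response
# distributions from a reference endpoint and a claimed endpoint.

from collections import Counter

def tv_distance(samples_a, samples_b) -> float:
    """Total-variation distance between two empirical distributions."""
    ca, cb = Counter(samples_a), Counter(samples_b)
    na, nb = len(samples_a), len(samples_b)
    keys = set(ca) | set(cb)
    return 0.5 * sum(abs(ca[k] / na - cb[k] / nb) for k in keys)

# Same prompt sent 100 times to each endpoint; responses bucketed by first
# token (illustrative choice of feature).
reference = ["A"] * 70 + ["B"] * 30
claimed   = ["A"] * 30 + ["B"] * 70
dist = tv_distance(reference, claimed)  # 0.4 here
flagged = dist > 0.2  # threshold would need calibration on known-identical pairs
```

A real audit would replace the fixed threshold with a proper hypothesis test (as the paper does), since sampling noise alone produces nonzero distance between two runs of the same model.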
Theme: Robustness cliffs in long-context + inference optimization
- Why it matters: Systems optimizations (compression, decoding speedups) can introduce non-linear failure modes that standard benchmarks miss.
- Representative papers:
- KV-cache compression “safety cliff” analysis (route-level hallucination study)
- Conversational Reliability (paired single-turn vs multi-turn evaluation)
- RL-optimized speculative decoding for real throughput
- Common approach:
- Replace proxy metrics with mechanistic or wall-clock objectives (KV routing metrics like GER; speculative decoding optimizes throughput L_A/(T_draft + T_verify)).
- Use controlled synthetic tests to expose hidden brittleness (KV compression route-sensitive datasets; paired single-turn vs multi-turn reliability tasks).
- Open questions / failure modes:
- Compression can preserve decodability while breaking utilization (representation–behavior gap); how to design compression policies that preserve routing redundancy?
- Multi-turn reliability failures (instruction drift) remain severe even when other skills are robust; how to train persistent global constraints?
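The wall-clock throughput objective for speculative decoding mentioned above is worth making concrete, because it behaves differently from raw acceptance rate. A minimal sketch (all timings illustrative):

```python
# Speculative decoding objective: accepted tokens per unit wall-clock time,
# L_A / (T_draft + T_verify), rather than a proxy like acceptance rate.

def throughput(accepted_len: float, t_draft: float, t_verify: float) -> float:
    """Expected accepted tokens per second for one draft+verify cycle."""
    return accepted_len / (t_draft + t_verify)

# A longer draft can raise acceptance length yet still lose on wall-clock
# throughput once its extra draft time is accounted for:
short_draft = throughput(accepted_len=3.0, t_draft=0.010, t_verify=0.030)  # 75.0 tok/s
long_draft  = throughput(accepted_len=5.0, t_draft=0.040, t_verify=0.030)  # ~71.4 tok/s
```

This is why optimizing acceptance metrics alone can point the wrong way: the draft configuration with the higher acceptance length here is the slower one end to end.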
3) Technical synthesis
- Grounded evaluation is converging on “trace-first” signals: ASTRA uses tool traces + milestone DAGs; CoVe uses deterministic constraint satisfaction; Pencil Puzzle Bench verifies every move; Legal RAG Bench separates retrieval vs reasoning vs hallucination.
- A recurring bottleneck is “structured action correctness”: ASTRA finds payload/argument generation weakest; ToolRLA explicitly gates correctness via tool-name/coverage/parameter accuracy; CoVe filters for zero-redundancy constraint satisfaction.
- Outcome-only RLVR is insufficient when success requires a rare prerequisite event: LongRLVR formalizes vanishing gradients for evidence selection and fixes it with verifiable context rewards.
- But outcome-only RLVR can still improve intermediate reasoning quality in synthetic multi-hop settings: training increases inclusion of correct intermediate answers in traces (Synthetic multi-hop RLVR), suggesting task structure matters.
- Test-time learning needs external grounding to avoid self-reinforcing errors: T³RL replaces majority-vote pseudo-labeling with tool-verified weighted voting to prevent “false-popular mode collapse.”
- Evaluation reliability is now a measurable object: RubricBench isolates rubric formation vs execution; tensor-factorization evaluation treats autoraters as noisy sensors calibrated to scarce human labels.
- Long-context optimizations can create safety cliffs: KV compression shows a hallucination spike near α≈0.9 correlated with global eviction of answer-relevant routes (GER).
- Multi-turn interaction is a distinct robustness axis: instruction maintenance degrades far more than tool selection or slot extraction, and smaller models degrade more (Conversational Reliability).
- Security threats are shifting to system properties: availability (VidDoS), provenance/integrity (Shadow APIs), and structured-label memorization (task bots) are all “non-text-only” failure modes.
- Agent security is being reframed ecosystem-wide: the Agentic Web survey emphasizes identity/authorization, provenance, and ecosystem response (quarantine/revocation/recovery) as primitives beyond single-agent defenses.
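The "false-popular mode collapse" fix described for T³RL is easy to illustrate: instead of plain majority voting over sampled answers, weight each candidate by whether an external tool verifies it. The verifier below is a placeholder predicate, not the paper's actual tool stack, and the weights are illustrative.

```python
# Majority voting vs tool-verified weighted voting (T³RL-style sketch).

from collections import Counter

def majority_vote(answers):
    """Plain self-consistency: most frequent answer wins."""
    return Counter(answers).most_common(1)[0][0]

def verified_weighted_vote(answers, verify, pass_weight=3.0, fail_weight=1.0):
    """Weight each sample by an external verifier before voting."""
    weights = Counter()
    for ans in answers:
        weights[ans] += pass_weight if verify(ans) else fail_weight
    return max(weights, key=weights.get)

# Toy setting: 5 samples say "12" (popular but wrong), 3 say "14" (correct).
samples = ["12"] * 5 + ["14"] * 3
checker = lambda ans: ans == "14"  # stands in for, e.g., a symbolic math check
assert majority_vote(samples) == "12"                    # false-popular mode wins
assert verified_weighted_vote(samples, checker) == "14"  # verifier rescues it
```

The same structure also shows the dependency noted above: if `verify` is weak and passes the wrong answer, the weighted vote inherits that error, which matches the reported small-verifier failure case.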
4) Top 5 papers (with “why now”)
1) Real Money, Fake Models: Deceptive Model Claims in Shadow APIs
- Quantifies a real supply-chain problem: 17 shadow APIs used in 187 papers.
- Shows large utility collapses in high-stakes domains (e.g., MedQA accuracy drops reported for shadow vs official) and safety behavior divergence.
- Provides two complementary verification methods (LLMmap fingerprinting + MET) with controlled validation.
- Skepticism: market is volatile; results are a snapshot (Sep–Dec 2025) and backend ground truth is unavailable.
2) ASTRA-bench: Evaluating Tool-Use Agent Reasoning and Action Planning with Personal User Context
- Brings tool-use evaluation closer to real assistants: longitudinal personal context + stateful tools + time anchor.
- Diagnostic decomposition identifies payload/argument generation as the main bottleneck vs retrieval.
- Stress tests quantify drops under misinformation/insufficient context.
- Skepticism: synthetic-to-real gap; milestone authoring cost and evaluator false negatives for unanticipated valid plans.
3) LongRLVR: Long-Context Reinforcement Learning Requires Verifiable Context Rewards
- Explains why outcome-only RLVR stalls in long-context grounding (vanishing gradients) and provides a verifiable fix.
- Demonstrates gains on long-context benchmarks (e.g., Qwen2.5-14B-1M improves RULER-QA AVG 73.17→88.90).
- Skepticism: relies on ground-truth evidence chunk annotations from a synthetic pipeline; generality beyond studied setups isn’t established here.
4) RubricBench: Aligning Model-Generated Rubrics with Human Standards
- Makes rubric quality measurable with expert instruction-only rubrics and strict alignment metrics.
- Finds a stable ~26–28 point “rubric gap” between self-generated and human-injected rubrics.
- Shows test-time scaling doesn’t fix rubric formation; even humans degrade when constrained by generated rubrics.
- Skepticism: expert rubric annotation limits scale; binary checklist rubrics may miss nuance in subjective tasks.
5) VidDoS: Universal Denial-of-Service Attack on Video-based Large Language Models
- Demonstrates a universal, deploy-anywhere patch that induces extreme generation/latency inflation (token ratios >200×; latency overheads >15s reported).
- Connects directly to real-time pipeline safety (cumulative latency threshold violations).
- Skepticism: limitations section not explicit in provided content; real-world defenses/mitigations and transfer to diverse deployments need further study.
5) Practical next steps
- Instrument tool agents with “payload correctness” metrics (argument validity, schema adherence, parameter accuracy) separately from retrieval and planning—ASTRA suggests this is the dominant bottleneck.
- Add deterministic verifiers wherever possible: constraint-based tool verifiers (CoVe), step-level state checkers (Pencil Puzzle Bench), and tool-executed math verification (T³RL) to reduce judge noise.
- For long-context RLVR, reward grounding explicitly: implement chunk-selection outputs + an Fβ-style context reward (LongRLVR) and track contextual recall to detect early stagnation.
- Stress-test multi-turn reliability with paired single-turn vs multi-turn tasks (global constraints, tool routing, slot extraction) to quantify “conversation tax” before deployment.
- Treat KV compression as a safety parameter: monitor route-deletion proxies (e.g., GER-like measures) and hallucination rates as compression increases; avoid operating near reported cliff regimes without guardrails.
- Add availability red-teaming for multimodal systems: include long-generation/latency inflation tests (VidDoS-style) in CI for Video-LLMs and real-time pipelines.
- Audit API provenance in research and production: adopt fingerprinting / distributional equality tests (Shadow APIs) and log endpoint provenance to prevent silent model substitution.
- If using rubric-guided judging, measure rubric quality directly: track rubric recall/hallucination vs human rubrics (RubricBench) and avoid assuming “rubric prompting” is sufficient.
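The paired single-turn vs multi-turn stress test above can be bootstrapped with very little code. The sketch below measures a "conversation tax" for one global constraint ("≤ 5 sentences per reply"); the replies, the sentence splitter, and the limit are all illustrative assumptions, not the paper's harness.

```python
# "Conversation tax" sketch: adherence to a global length constraint on paired
# single-turn vs multi-turn runs of the same tasks.

import re

def within_limit(reply: str, max_sentences: int = 5) -> bool:
    """Crude sentence count against a global '<= N sentences' constraint."""
    sentences = [s for s in re.split(r"[.!?]+", reply) if s.strip()]
    return len(sentences) <= max_sentences

def adherence(replies, max_sentences: int = 5) -> float:
    """Fraction of replies that respect the constraint."""
    return sum(within_limit(r, max_sentences) for r in replies) / len(replies)

# Illustrative model outputs for the same tasks in both settings.
single_turn = ["One. Two. Three.", "Ok."]                       # both comply
multi_turn  = ["One. Two. Three.", "A. B. C. D. E. F. Seven."]  # drift on turn 2
tax = adherence(single_turn) - adherence(multi_turn)  # 1.0 - 0.5 = 0.5
```

Reporting this gap per constraint type (length, tool routing, slot extraction) before deployment is the cheap version of the paired evaluation the multi-turn reliability work argues for.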
Generated from per-paper analyses; no external browsing.
