Daily AI Paper Report (2026-03-04)

Chinese version: [中文]

Run stats

  • Candidates: 236
  • Selected: 30
  • Deepread completed: 32
  • Window (UTC): 2026-03-03T01:00:00Z → 2026-03-04T01:00:00Z (arxiv_announce, expanded=0)
Selected papers

| arXiv ID | Score | Why | Tags |
|---|---|---|---|
| 2603.01608 | 95 | Systematic eval of LLM agent scheming incentives; realistic scenarios + factor decomposition | agent-safety, scheming, evaluation, instrumental-goals, autonomy |
| 2603.02196 | 94 | Conformal calibration to bound policy risk vs safe reference; provable safety for exploration | agent-safety, conformal-prediction, safe-exploration, risk-bounds, RL |
| 2603.01564 | 92 | Survey + taxonomy for agentic/web threats (memory/tool/env injection) and defenses | agent-security, prompt-injection, tool-safety, memory-attacks, survey, threat-models |
| 2603.01423 | 92 | Systematic multi-turn reliability eval incl. constraints, tool choice, entity tracking; shows degradation | evaluation, reliability, multi-turn, tool-use, dialogue, agentic |
| 2603.01454 | 92 | Universal DoS-style energy/latency attack on Video-LLMs; practical triggers without test-time grads | security, adversarial-attacks, denial-of-service, video-llm, robustness |
| 2603.01357 | 91 | New benchmark for tool-use agents with evolving personal context; exposes failures at high complexity | benchmark, agents, tool-use, personal-context, planning, evaluation |
| 2603.01589 | 90 | Large scientific safety benchmark (0.25M) + 1.5M training set with more objective metrics | safety-eval, benchmarks, science-safety, datasets, red-teaming |
| 2603.02203 | 90 | Adds tool-based verification to test-time RL to prevent spurious consensus reward collapse | reasoning, test-time-training, verification, tools, robustness |
| 2603.01896 | 90 | Semi-formal prompting gives checkable “certificates” for agent code reasoning; strong gains reported | agents, code, reasoning, verification, prompting, reliability, software-engineering |
| 2603.02146 | 89 | Shows outcome-only RLVR fails for long-context grounding; proposes verifiable context rewards + theory | RLVR, long-context, grounding, alignment, training, theory |
| 2603.01784 | 88 | Co-evolutionary multimodal safety alignment with evolving adversarial attacks (genetic ops) | multimodal, adversarial-training, alignment, robustness, automated-redteaming |
| 2603.01907 | 88 | Mutual-information data selection for RLVR/RL training; targets efficiency + uncertainty, not just difficulty | RLHF, RLVR, data-selection, uncertainty, bayesian, training-efficiency, alignment |
| 2603.01426 | 87 | KV-cache compression analysis finds hallucination 'safety cliff' near high compression; better eval lens | long-context, KV-cache, efficiency, hallucinations, attention, robustness |
| 2603.02029 | 87 | Cuts eval cost by combining cheap autoraters + few human labels via tensor factorization | evaluation, human-preference, autoraters, statistical-modeling, scalable-evals |
| 2603.01494 | 86 | Inference-time safety for code LLMs via retrieval-augmented revision using security knowledge | code-llms, secure-coding, RAG, inference-time, software-security |
| 2603.01714 | 86 | Interaction-topology curation for tool-use training; goes beyond pass-rate filtering to informative tasks | agents, tool-use, data-curation, RL, training, trajectories |
| 2603.01940 | 85 | Constraint-guided verification to synthesize correct tool-use trajectories + RL rewards | tool-use, agents, verification, post-training, data-synthesis, RL |
| 2603.02128 | 85 | Measures LLM agent behavior in crisis sims: alignment to humans, risk calibration, framing drift | agent-evaluation, risk-calibration, geopolitics, behavioral-analysis, multi-round |
| 2603.01562 | 84 | RubricBench benchmark for rubric-based reward/evaluation; targets hard, bias-misleading comparisons | reward-models, evaluation, rubrics, alignment, benchmark, preference-modeling |
| 2603.02208 | 84 | Procedural, verifiable symbolic data suite (planning/FOL/CFG/causal/equations) for scaling reasoning | synthetic-data, reasoning, verification, benchmarks, curriculum |
| 2603.01571 | 84 | Structured breadth+depth CoT for generative reward models; SFT+RLVR to improve evaluator reliability | reward-models, evaluation, RLVR, chain-of-thought, reliability, alignment |
| 2603.01620 | 83 | Fine-grained reward decomposition for tool-integrated agent alignment beyond binary success | agents, tool-calling, RLHF, reward-modeling, DPO, GRPO |
| 2603.01919 | 83 | First audit of 'shadow APIs' claiming frontier models; reliability/security implications for deployments | security, model-supply-chain, API, auditing, reliability, governance |
| 2603.01550 | 82 | Quantifies memorization leakage in LLM-based task bots; extracts dialogue events and identifiers | privacy, memorization, data-extraction, task-bots, security, LLMs |
| 2603.02091 | 82 | RL fine-tuning on rule-generated synthetic multi-hop data improves real QA without costly labels | reasoning, reinforcement-learning, synthetic-data, multi-hop, data-generation |
| 2603.01792 | 82 | Token-entropy-guided unlearning with lightweight asymmetric LoRA; aims to reduce collateral damage | unlearning, privacy, safety, LoRA, model-editing, knowledge-control |
| 2603.01574 | 81 | Black-box detection of backdoor/prompt-injection via online 'entropy lull' generation signal | prompt-injection, backdoors, black-box, monitoring, detection |
| 2603.01639 | 81 | RL-optimized speculative decoding to maximize real throughput (draft+verify), not proxy acceptance metrics | inference, speculative-decoding, RL, efficiency, serving, LLM-systems |
| 2603.02119 | 80 | Verifiable multi-step reasoning benchmark with step-level checks; supports dense process rewards | reasoning, benchmarks, process-supervision, verification, agentic-eval |
| 2603.01710 | 80 | End-to-end Legal RAG benchmark with hierarchical error decomposition separating retrieval vs reasoning | RAG, benchmark, legal, evaluation, retrieval, grounding |

AI Paper Insight Brief

2026-03-04

0) Executive takeaways (read this first)

  • Agent reliability is bottlenecked less by “finding info” and more by “acting correctly”: in personal-context tool agents, information retrieval recall is high while payload/argument construction is the main failure point (ASTRA-bench).
  • Multi-turn interaction is a first-class robustness hazard: instruction maintenance collapses sharply in multi-turn settings (e.g., a global “≤5 sentences” constraint), while tool selection and slot extraction degrade less and in model-size-dependent ways (Conversational Reliability).
  • Optimization shortcuts can hide cliffs: KV-cache compression can look fine on standard long-context benchmarks yet hit a hallucination “safety cliff” near extreme compression (~0.9) tied to attention-route deletion (KV compression physics).
  • Availability attacks are now practical for Video-LLMs: a universal, offline-trained patch can induce 200× token inflation and >15s latency overhead, creating real-time safety risks (VidDoS).
  • Evaluation infrastructure is itself a bottleneck: rubric-guided judging improves only when rubrics are correct; there’s a large, stable “rubric gap” (~26–28 points) between self-generated and human rubrics (RubricBench).
  • RLVR is splitting into two regimes: (i) cheap, fully verifiable synthetic data can transfer to real multi-hop QA (Synthetic→Real RLVR), but (ii) long-context grounding needs verifiable intermediate context rewards or RLVR stalls (LongRLVR).
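The multi-turn constraint-maintenance finding above is cheap to reproduce in-house: a global constraint like “≤5 sentences per reply” can be checked deterministically on every turn. A minimal sketch (the sentence-splitting heuristic and function names are illustrative, not from the paper):

```python
import re

def count_sentences(text: str) -> int:
    # Naive split on terminal punctuation; good enough for a smoke test,
    # not for production-grade constraint checking.
    return len([s for s in re.split(r"[.!?]+", text) if s.strip()])

def constraint_maintenance(replies: list[str], max_sentences: int = 5) -> float:
    # Fraction of turns that still satisfy the global constraint.
    ok = [count_sentences(r) <= max_sentences for r in replies]
    return sum(ok) / len(ok) if ok else 1.0
```

Plotting this fraction against turn index makes the reported collapse directly visible before deployment.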

2) Key themes (clusters)

Theme: Tool-using agents in realistic, stateful environments

Theme: Verifiable, multi-step reasoning + training signals (RLVR, process feedback)

Theme: Evaluation reliability (rubrics, autoraters, and diagnostic benchmarks)

Theme: Security & privacy of deployed LLM systems (availability, extraction, supply chain)

Theme: Robustness cliffs in long-context + inference optimization

3) Technical synthesis

  • Grounded evaluation is converging on “trace-first” signals: ASTRA uses tool traces + milestone DAGs; CoVe uses deterministic constraint satisfaction; Pencil Puzzle Bench verifies every move; Legal RAG Bench separates retrieval vs reasoning vs hallucination.
  • A recurring bottleneck is “structured action correctness”: ASTRA finds payload/argument generation weakest; ToolRLA explicitly gates correctness via tool-name/coverage/parameter accuracy; CoVe filters for zero-redundancy constraint satisfaction.
  • Outcome-only RLVR is insufficient when success requires a rare prerequisite event: LongRLVR formalizes vanishing gradients for evidence selection and fixes it with verifiable context rewards.
  • But outcome-only RLVR can still improve intermediate reasoning quality in synthetic multi-hop settings: training increases inclusion of correct intermediate answers in traces (Synthetic multi-hop RLVR), suggesting task structure matters.
  • Test-time learning needs external grounding to avoid self-reinforcing errors: T³RL replaces majority-vote pseudo-labeling with tool-verified weighted voting to prevent “false-popular mode collapse.”
  • Evaluation reliability is now a measurable object: RubricBench isolates rubric formation vs execution; tensor-factorization evaluation treats autoraters as noisy sensors calibrated to scarce human labels.
  • Long-context optimizations can create safety cliffs: KV compression shows a hallucination spike near α≈0.9 correlated with global eviction of answer-relevant routes (GER).
  • Multi-turn interaction is a distinct robustness axis: instruction maintenance degrades far more than tool selection or slot extraction, and smaller models degrade more (Conversational Reliability).
  • Security threats are shifting to system properties: availability (VidDoS), provenance/integrity (Shadow APIs), and structured-label memorization (task bots) are all “non-text-only” failure modes.
  • Agent security is being reframed ecosystem-wide: the Agentic Web survey emphasizes identity/authorization, provenance, and ecosystem response (quarantine/revocation/recovery) as primitives beyond single-agent defenses.
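The “false-popular mode collapse” fix above amounts to re-weighting votes with an external verifier instead of trusting raw frequency. A minimal sketch under the assumption that a deterministic checker is available; the names and weighting scheme are illustrative, not T³RL’s actual implementation:

```python
from collections import Counter

def weighted_vote(candidates, verify, pass_weight=1.0, fail_weight=0.1):
    # Plain majority vote would pick the most frequent answer even when it
    # fails verification; here each vote is scaled by the verifier's verdict.
    scores = Counter()
    for ans in candidates:
        scores[ans] += pass_weight if verify(ans) else fail_weight
    return max(scores, key=scores.get)
```

With a tool-executed checker as `verify`, a rare-but-verified answer can outvote a popular-but-wrong consensus, which is exactly the failure mode majority-vote pseudo-labeling cannot escape.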

4) Top 5 papers (with “why now”)

1) Real Money, Fake Models: Deceptive Model Claims in Shadow APIs

  • Quantifies a real supply-chain problem: 17 shadow APIs used in 187 papers.
  • Shows large utility collapses in high-stakes domains (e.g., MedQA accuracy drops reported for shadow vs official) and safety behavior divergence.
  • Provides two complementary verification methods (LLMmap fingerprinting + MET) with controlled validation.
  • Skepticism: market is volatile; results are a snapshot (Sep–Dec 2025) and backend ground truth is unavailable.

2) ASTRA-bench: Evaluating Tool-Use Agent Reasoning and Action Planning with Personal User Context

  • Brings tool-use evaluation closer to real assistants: longitudinal personal context + stateful tools + time anchor.
  • Diagnostic decomposition identifies payload/argument generation as the main bottleneck vs retrieval.
  • Stress tests quantify drops under misinformation/insufficient context.
  • Skepticism: synthetic-to-real gap; milestone authoring cost and evaluator false negatives for unanticipated valid plans.

3) LongRLVR: Long-Context Reinforcement Learning Requires Verifiable Context Rewards

  • Explains why outcome-only RLVR stalls in long-context grounding (vanishing gradients) and provides a verifiable fix.
  • Demonstrates gains on long-context benchmarks (e.g., Qwen2.5-14B-1M improves RULER-QA AVG 73.17→88.90).
  • Skepticism: relies on ground-truth evidence chunk annotations from a synthetic pipeline; generality beyond studied setups isn’t established here.
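The verifiable context reward can be pictured as an Fβ score between the evidence chunks a policy cites and the annotated gold chunks. A minimal sketch (the β default and function name are assumptions for illustration, not necessarily LongRLVR’s exact formulation):

```python
def context_reward(selected: set, gold: set, beta: float = 0.5) -> float:
    # F_beta over evidence-chunk IDs: precision rewards citing only relevant
    # chunks; recall rewards covering the full evidence chain.
    if not selected or not gold:
        return 0.0
    tp = len(selected & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(selected)
    recall = tp / len(gold)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

Because this reward is dense over chunk selection rather than conditioned on final-answer success, it supplies gradient signal even when the outcome reward is still almost always zero.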

4) RubricBench: Aligning Model-Generated Rubrics with Human Standards

  • Makes rubric quality measurable with expert instruction-only rubrics and strict alignment metrics.
  • Finds a stable ~26–28 point “rubric gap” between self-generated and human-injected rubrics.
  • Shows test-time scaling doesn’t fix rubric formation; even humans degrade when constrained by generated rubrics.
  • Skepticism: expert rubric annotation limits scale; binary checklist rubrics may miss nuance in subjective tasks.

5) VidDoS: Universal Denial-of-Service Attack on Video-based Large Language Models

  • Demonstrates a universal, deploy-anywhere patch that induces extreme generation/latency inflation (token ratios >200×; latency overheads >15s reported).
  • Connects directly to real-time pipeline safety (cumulative latency threshold violations).
  • Skepticism: limitations section not explicit in provided content; real-world defenses/mitigations and transfer to diverse deployments need further study.
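An availability red-team check of the kind VidDoS motivates can be as simple as bounding token inflation and wall-clock latency against a clean baseline. A CI-style sketch; `generate` is a stand-in for your serving call, and the default thresholds are deliberately conservative placeholders (the paper reports ratios above 200× and >15 s overheads):

```python
import time

def availability_check(generate, clean_input, patched_input,
                       max_token_ratio=10.0, max_extra_latency_s=5.0):
    # Compare generated-token count and latency on a clean vs adversarial input.
    t0 = time.perf_counter()
    clean_tokens = generate(clean_input)
    clean_dt = time.perf_counter() - t0
    t0 = time.perf_counter()
    attack_tokens = generate(patched_input)
    attack_dt = time.perf_counter() - t0
    ratio = attack_tokens / max(clean_tokens, 1)
    extra = attack_dt - clean_dt
    return {"token_ratio": ratio, "extra_latency_s": extra,
            "pass": ratio <= max_token_ratio and extra <= max_extra_latency_s}
```

Running this with known adversarial patches (and with benign hard inputs as controls) turns “availability” from a paper finding into a regression test.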

5) Practical next steps

  • Instrument tool agents with “payload correctness” metrics (argument validity, schema adherence, parameter accuracy) separately from retrieval and planning—ASTRA suggests this is the dominant bottleneck.
  • Add deterministic verifiers wherever possible: constraint-based tool verifiers (CoVe), step-level state checkers (Pencil Puzzle Bench), and tool-executed math verification (T³RL) to reduce judge noise.
  • For long-context RLVR, reward grounding explicitly: implement chunk-selection outputs + an Fβ-style context reward (LongRLVR) and track contextual recall to detect early stagnation.
  • Stress-test multi-turn reliability with paired single-turn vs multi-turn tasks (global constraints, tool routing, slot extraction) to quantify “conversation tax” before deployment.
  • Treat KV compression as a safety parameter: monitor route-deletion proxies (e.g., GER-like measures) and hallucination rates as compression increases; avoid operating near reported cliff regimes without guardrails.
  • Add availability red-teaming for multimodal systems: include long-generation/latency inflation tests (VidDoS-style) in CI for Video-LLMs and real-time pipelines.
  • Audit API provenance in research and production: adopt fingerprinting / distributional equality tests (Shadow APIs) and log endpoint provenance to prevent silent model substitution.
  • If using rubric-guided judging, measure rubric quality directly: track rubric recall/hallucination vs human rubrics (RubricBench) and avoid assuming “rubric prompting” is sufficient.

Generated from per-paper analyses; no external browsing.