Daily AI Paper Report (2026-03-05)
Published:
Chinese version: [Chinese]
Run stats
- Candidates: 236
- Selected: 30
- Deepread completed: 30
- Window (UTC): 2026-03-03T01:00:00Z → 2026-03-04T01:00:00Z (arxiv_announce, expanded=0)
Selected papers
| arXiv ID | Title / Links | Categories | Score | Why | Tags |
|---|---|---|---|---|---|
| 2603.03205 | Learning When to Act or Refuse: Guarding Agentic Reasoning Models for Safe Multi-Step Tool Use | cs.CL | 95 | Post-training framework for safe multi-step tool use with explicit act/refuse loop. | agent-safety, tool-use, refusal, post-training, sequential-decision-making, alignment |
| 2603.02601 | AgentAssay: Token-Efficient Regression Testing for Non-Deterministic AI Agent Workflows | cs.AI, cs.SE | 94 | Token-efficient regression testing with statistical guarantees for non-deterministic agent workflows. | agents, testing, regression, nondeterminism, evaluation, ci-cd, mutation-testing, metamorphic-testing |
| 2603.03116 | Beyond Task Completion: Revealing Corrupt Success in LLM Agents through Procedure-Aware Evaluation | cs.AI | 93 | Procedure-aware eval catches “corrupt success” in LLM agents; multi-axis gating on tau-bench. | agents, evaluation, reliability, process-supervision, benchmarking |
| 2603.02983 | Contextualized Privacy Defense for LLM Agents | cs.CR, cs.AI, cs.CL | 92 | Proactive, step-wise privacy guidance for agents trained via RL on failure trajectories. | privacy, agents, tool-use, reinforcement-learning, data-protection, execution-monitoring |
| 2603.03000 | Why Does RLAIF Work At All? | cs.LG, cs.AI | 92 | Rare theory for why RLAIF self-improves; latent value hypothesis + formal results. | alignment, RLAIF, theory, constitutional AI, preference-learning |
| 2603.03081 | TAO-Attack: Toward Advanced Optimization-Based Jailbreak Attacks for Large Language Models | cs.CL | 91 | Stronger optimization-based jailbreak method; improves refusal suppression and harmfulness targeting. | jailbreaks, red-teaming, adversarial-attacks, alignment, security |
| 2603.02578 | How Controllable Are Large Language Models? A Unified Evaluation across Behavioral Granularities | cs.CL, cs.AI, cs.HC, cs.LG | 91 | Hierarchical benchmark for LLM controllability; shows steering degrades at fine granularity. | alignment, controllability, steering, benchmark, evaluation, personality, sentiment |
| 2603.02675 | From Shallow to Deep: Pinning Semantic Intent via Causal GRPO | cs.LG | 90 | Targets adversarial prefixes via causal intent probing + GRPO to prevent shallow alignment. | jailbreaks, adversarial-prompts, alignment, GRPO, interpretability, intent |
| 2603.03194 | BeyondSWE: Can Current Code Agent Survive Beyond Single-Repo Bug Fixing? | cs.CL, cs.SE | 90 | BeyondSWE benchmark exposes big gaps for code agents beyond single-repo bug fixing; 500 instances. | code-agents, benchmarks, software-engineering, evaluation, search |
| 2603.03192 | MoD-DPO: Towards Mitigating Cross-modal Hallucinations in Omni LLMs using Modality Decoupled Preference Optimization | cs.CV, cs.CL, cs.LG | 89 | DPO variant to reduce cross-modal hallucinations via modality decoupling + debiasing. | multimodal, hallucinations, DPO, grounding, robustness |
| 2603.03258 | Inherited Goal Drift: Contextual Pressure Can Undermine Agentic Goals | cs.AI | 88 | Empirical study of goal drift in newer agents; shows brittle robustness via inherited drift. | agent-safety, goal-drift, long-context, robustness, evaluation, agentic-risk |
| 2603.02798 | Guideline-Grounded Evidence Accumulation for High-Stakes Agent Verification | cs.AI, cs.CL | 88 | Guideline-grounded evidence accumulation for calibrated high-stakes agent verification (Bayesian). | verification, calibration, high-stakes, agents, clinical, bayesian |
| 2603.02586 | LiveAgentBench: Comprehensive Benchmarking of Agentic Systems Across 104 Real-World Challenges | cs.AI | 88 | Real-world agent benchmark (104 scenarios) from public sources; compares models/products. | agents, benchmark, evaluation, real-world-tasks, tool-use, reliability |
| 2603.03206 | Understanding and Mitigating Dataset Corruption in LLM Steering | cs.LG, cs.AI, cs.CL | 86 | Analyzes contrastive steering robustness; shows how corrupted data can induce side effects. | steering, robustness, data-poisoning, activation-editing, safety, inference-control |
| 2603.03111 | Evaluating Performance Drift from Model Switching in Multi-Turn LLM Systems | cs.CL | 86 | Measures silent performance drift when multi-turn systems switch models mid-dialogue; switch-matrix. | deployment, evaluation, multi-turn, model-routing, reliability, drift |
| 2603.02626 | See and Remember: A Multimodal Agent for Web Traversal | cs.AI | 86 | Web agent architecture with explicit memory + visual grounding; adds dynamic benchmark. | agents, web-navigation, memory, multimodal, benchmark |
| 2603.03242 | Density-Guided Response Optimization: Community-Grounded Alignment via Implicit Acceptance Signals | cs.AI, cs.CL | 86 | Aligns to community norms using implicit acceptance signals; density structure in representation space. | alignment, preference-learning, implicit-feedback, rlhf-alternatives, social-norms, representation |
| 2603.02588 | ExpGuard: LLM Content Moderation in Specialized Domains | cs.CL | 84 | Domain-specific moderation model + 58k dataset for finance/medical/legal guardrails. | content-moderation, guardrails, datasets, domain-specific, safety-eval |
| 2603.03163 | Conditioned Activation Transport for T2I Safety Steering | cs.CV, cs.AI | 84 | Inference-time T2I safety steering via conditioned nonlinear activation transport; adds dataset. | image-safety, activation-steering, diffusion, datasets, content-moderation |
| 2603.02663 | Evaluating Cross-Modal Reasoning Ability and Problem Characteristics with Multimodal Item Response Theory | cs.CL, cs.CV | 84 | M3IRT separates image/text/cross-modal difficulty to detect shortcut items in MLLM evals. | evaluation, multimodal, benchmarks, item-response-theory, shortcut-learning |
| 2603.02635 | SaFeR-ToolKit: Structured Reasoning via Virtual Tool Calling for Multimodal Safety | cs.LG | 83 | Protocolized multimodal safety via virtual tool traces; curriculum incl. DPO and GRPO. | multimodal-safety, tool-traces, jailbreaks, DPO, GRPO, structured-reasoning |
| 2603.03047 | TrustMH-Bench: A Comprehensive Benchmark for Evaluating the Trustworthiness of Large Language Models in Mental Health | cs.CL, cs.AI | 82 | Comprehensive mental-health trustworthiness benchmark across safety, privacy, fairness, etc. | benchmarks, mental-health, trustworthiness, safety, privacy, evaluation |
| 2603.03018 | REGAL: A Registry-Driven Architecture for Deterministic Grounding of Agentic AI in Enterprise Telemetry | cs.AI, cs.SE | 82 | Enterprise agent grounding via deterministic, versioned action space over telemetry; practical safety. | agent-architecture, grounding, tool-use, enterprise, observability, governance |
| 2603.03002 | SpatialText: A Pure-Text Cognitive Benchmark for Spatial Understanding in Large Language Models | cs.AI | 82 | Pure-text benchmark targeting true spatial mental models; avoids vision confounds. | benchmark, reasoning, spatial-reasoning, evaluation, LLMs |
| 2603.02615 | Think, But Don't Overthink: Reproducing Recursive Language Models | cs.CL | 82 | Reproduces Recursive LMs; finds deeper recursion can cause 'overthinking' on long-context evals. | long-context, agents, recursion, evaluation, reasoning, reproducibility |
| 2603.03172 | Less Noise, Same Certificate: Retain Sensitivity for Unlearning | cs.LG | 81 | Certified unlearning with 'retain sensitivity' to cut DP-style noise; privacy/reliability. | machine-unlearning, privacy, certification, differential-privacy, robustness |
| 2603.03054 | PrivMedChat: End-to-End Differentially Private RLHF for Medical Dialogue Systems | cs.CL | 80 | End-to-end DP-RLHF pipeline for medical dialogue to reduce memorization/extraction risk. | differential-privacy, RLHF, medical, memorization, membership-inference, privacy |
| 2603.03078 | RAPO: Expanding Exploration for LLM Agents via Retrieval-Augmented Policy Optimization | cs.AI | 80 | Retrieval-augmented policy optimization to expand exploration for agentic RL at step-level granularity. | agentic-RL, retrieval, exploration, policy-optimization, tool-use |
| 2603.02540 | A Neuropsychologically Grounded Evaluation of LLM Cognitive Abilities | cs.AI | 80 | Neuropsychology-grounded benchmark probing core cognitive abilities beyond task completion. | evaluation, reasoning, cognitive-benchmarks, robustness, multimodal |
| 2603.02590 | Extending the Formalism and Theoretical Foundations of Cryptography to AI | cs.CR | 78 | Formal foundations + taxonomy for securing LM agents via access control/permissioning. | agent-security, formal-methods, access-control, permissioning, taxonomy, governance |
AI Paper Insight Brief
2026-03-05
1) Executive takeaways (read this first)
- Agent evaluation is shifting from “did it finish?” to “did it behave correctly along the way?” Procedure-aware evaluation on τ-bench finds 27–78% of apparent successes are procedurally corrupt, collapsing gated Pass^4 and exposing integrity failures that outcome metrics miss.
- Real-world agent readiness remains low on dynamic, tool-heavy tasks. LiveAgentBench reports LLMs ≈13.48% Pass@1 and agents still far from humans (Manus 35.29% vs human 69.25%), with tool instability and missing environment knowledge as recurring blockers.
- Steering/control is brittle at fine granularity. In SteerEval, prompting is stable across granularities, while activation-based steering (PCA/DiffMean/RePS) drops sharply from L1→L3, revealing a practical limit for token-level controllability.
- Safety is moving “inside the model” via structured traces and representation-level objectives. SaFeR-ToolKit’s virtual tool traces dramatically raise strict safety/helpfulness/rigor scores on Qwen2.5-VL, while Causal-GRPO targets “semantic representation decay” to reduce jailbreak ASR without sacrificing utility.
- Privacy defenses for agents are becoming contextual and trainable. Contextualized Defense Instructing (CDI) plus adversarial experience-driven GRPO reaches PP 94.2 / HS 80.6 / AD 86.5 on unseen simulations—substantially better than static prompting/guarding.
- Benchmarking is becoming more diagnostic and sample-efficient. NeuroCognition and SpatialText probe foundational cognitive primitives (working memory, flexibility, egocentric transforms), while multimodal IRT (M3IRT) can reconstruct rankings with ~10% of items by selecting truly cross-modal questions.
2) Key themes (clusters)
Theme: Procedure-aware and trajectory-aware agent evaluation
- Why it matters: Outcome-only metrics can overestimate safety and reliability by counting “corrupt success” as success. Trajectory-aware verification/evaluation enables deployment gating and calibrated escalation in high-stakes workflows.
- Representative papers: Beyond Task Completion (2603.03116); Guideline-Grounded Evidence Accumulation (2603.02798); Evaluating Performance Drift from Model Switching (2603.03111).
- Common approach:
- Log and score process signals (read/write/communicate integrity; stepwise guideline alignment; handoff-induced deltas) rather than only terminal success.
- Use calibration/uncertainty (Bayesian logistic regression in GLEAN; bootstrap CIs in switch matrices) to support abstention/escalation decisions.
- Introduce gating or decompositions that compress risk (PAE’s gated utility; switch drift factorization into prefix influence/suffix susceptibility).
- Open questions / failure modes:
- Reliance on LLM judges (bias, positional effects, prompt sensitivity) and how to validate them at scale.
- Extending beyond constrained setups: final-turn handoffs → earlier/multi-turn switches; guideline coverage gaps; domains without explicit policies.
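The switch-matrix recipe above (paired per-task deltas plus bootstrap CIs to flag silent drift) can be sketched in a few lines. This is an illustrative stand-in, not the paper's code; `bootstrap_ci` and the delta values are hypothetical.

```python
import random

def bootstrap_ci(deltas, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired delta."""
    rng = random.Random(seed)
    n = len(deltas)
    means = []
    for _ in range(n_boot):
        sample = [deltas[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return sum(deltas) / n, (lo, hi)

# Paired per-task deltas: score(run with mid-dialogue switch) - score(single-model run).
deltas = [-0.10, 0.02, -0.05, -0.08, 0.01, -0.12, -0.03, 0.00]
mean, (lo, hi) = bootstrap_ci(deltas)
# Flag "silent drift" when the CI excludes zero.
drift = hi < 0 or lo > 0
```

Pairing each switched run against a single-model run on the same task is what lets a small sample detect drift; the CI supports an abstain/escalate decision when it straddles zero.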
Theme: Real-world agent benchmarks + robustness bottlenecks
- Why it matters: Tool instability, environment knowledge gaps, and dynamic web/OS interactions dominate failures in practice; static benchmarks understate these issues.
- Representative papers: LiveAgentBench (2603.02586); BeyondSWE (2603.03194); See and Remember (2603.02626).
- Common approach:
- Build verifiable, tool-dependent tasks with automatic checking (string matching; Dockerized tests).
- Add explicit state + symbolic tools (URL stack backtracking; symbolic counter) to reduce hallucination and amnesia.
- Study search/tool augmentation explicitly (SearchSWE; EverWebQA’s live-web pipeline).
- Open questions / failure modes:
- Tool instability and execution failures remain recurring blockers; many reported failures trace to unstable tools rather than to the agent's reasoning.
- Search augmentation can be inconsistent or negative due to temporal misalignment/semantic drift.
- Latency/cost overhead from multimodal perception (adaptive VLM calls still add compute).
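The "explicit state + symbolic tools" pattern above can be made concrete with a minimal URL-stack memory: the agent pushes each visited page and pops to backtrack, instead of reconstructing its path from free-form context. The `URLStack` class is a hypothetical sketch of the idea, not the paper's API.

```python
class URLStack:
    """Explicit navigation memory: push visited URLs, pop to backtrack,
    so the agent never has to 'remember' its path in free-form text."""

    def __init__(self):
        self._stack = []

    def visit(self, url):
        self._stack.append(url)

    def back(self):
        # Pop the current page; return the page to revisit (None at the root).
        if self._stack:
            self._stack.pop()
        return self._stack[-1] if self._stack else None

    def current(self):
        return self._stack[-1] if self._stack else None
```

Externalizing state like this trades a little plumbing for determinism: backtracking becomes a data-structure operation the model cannot hallucinate.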
Theme: Controllability and steering under granularity + data corruption
- Why it matters: Many alignment/steering methods work at coarse behavior levels but fail at fine constraints; additionally, steering datasets are an attack surface.
- Representative papers: How Controllable Are Large Language Models? (2603.02578); Understanding and Mitigating Dataset Corruption in LLM Steering (2603.03206).
- Common approach:
- Evaluate steering across hierarchical granularities (intent → strategy → instantiation) and domains.
- Compare prompt-based vs activation-based steering; tune steering strength and measure trade-offs (concept vs instruction vs fluency).
- Use robust statistics (Lee–Valiant robust mean) to mitigate poisoned/corrupted steering datasets.
- Open questions / failure modes:
- Activation steering collapses at fine granularity (L3) and shows strength trade-offs that harm instruction/fluency.
- Coordinated behavior injection can pull steering direction toward an attacker’s behavior; robust means only partially mitigate.
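To illustrate why robust statistics help against coordinated injection, here is a toy sketch: a coordinate-wise trimmed mean (a simple stand-in for estimators like Lee–Valiant, not the paper's method) recovers the clean steering direction even when a minority of contrastive examples is poisoned. All vectors are synthetic.

```python
def trimmed_mean(vectors, trim=0.1):
    """Coordinate-wise trimmed mean: a crude stand-in for robust mean
    estimation against a minority of poisoned steering examples."""
    n, d = len(vectors), len(vectors[0])
    k = int(n * trim)  # number trimmed from each tail per coordinate
    out = []
    for j in range(d):
        col = sorted(v[j] for v in vectors)
        kept = col[k:n - k] if n - 2 * k > 0 else col
        out.append(sum(kept) / len(kept))
    return out

# Clean activation differences cluster near +1 on the behavior axis;
# a coordinated injection pushes one example far toward -10.
clean = [[1.0, 0.0]] * 9
poisoned = [[-10.0, 5.0]]
raw = [sum(v[j] for v in clean + poisoned) / 10 for j in range(2)]
robust = trimmed_mean(clean + poisoned, trim=0.1)
# raw[0] has the wrong sign; robust[0] recovers the clean direction.
```

One extreme outlier is enough to flip the sign of the raw mean direction; trimming restores it, which is the core intuition behind treating steering datasets as security-critical.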
Theme: Multimodal safety + hallucination mitigation via structured traces and modality-aware objectives
- Why it matters: Multimodal models fail via jailbreaks, over-refusal, and cross-modal hallucinations; making intermediate decisions auditable or enforcing modality sensitivity/invariance can reduce these failures.
- Representative papers: SaFeR-ToolKit (2603.02635); MoD-DPO (2603.03192); Conditioned Activation Transport (2603.03163).
- Common approach:
- Enforce structured intermediate traces (typed tool calls; constrained topologies) and train with SFT→DPO→GRPO.
- Add modality-aware regularizers (invariance to irrelevant corruption; sensitivity to relevant corruption) and debias text priors.
- Use conditional/gated steering (Mahalanobis/GDA/OOD gating) to reduce unsafe outputs while preserving utility.
- Open questions / failure modes:
- Dependence on large judge models and automated safety judges; human validation remains limited.
- Inference-time steering can be bypassed under distribution shift; mean-pooled activations may miss localized unsafe features.
- Synthetic preference data and stop-gradient approximations may limit real-world generalization.
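The gated-steering idea above can be sketched as follows. This is an illustrative simplification (diagonal-covariance Mahalanobis distance, made-up threshold and vectors), not the CAT method itself: the safety edit fires only when an activation looks out-of-distribution relative to "safe" reference statistics, preserving utility on benign inputs.

```python
def mahalanobis_diag(x, mean, var):
    """Mahalanobis distance under a diagonal covariance (illustrative)."""
    return sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var)) ** 0.5

def gated_steer(activation, safe_mean, safe_var, direction, alpha=1.0, tau=3.0):
    """Apply the safety-steering edit only when the activation is
    far from the 'safe' reference distribution."""
    d = mahalanobis_diag(activation, safe_mean, safe_var)
    if d <= tau:                      # in-distribution: leave untouched
        return activation
    return [a - alpha * s for a, s in zip(activation, direction)]
```

The gate is what keeps the intervention cheap and low-collateral; the known failure mode, noted above, is distribution shift that sneaks unsafe inputs under the threshold.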
Theme: Privacy and security foundations for agents (practical + formal)
- Why it matters: Agents handle sensitive data and tool actions; defenses need contextual decision-making, formal guarantees, and clear threat models.
- Representative papers: Contextualized Privacy Defense for LLM Agents (2603.02983); PrivMedChat (2603.03054); Extending the Formalism and Theoretical Foundations of Cryptography to AI (2603.02590); TAO-Attack (2603.03081).
- Common approach:
- Contextual interventions during agent loops (post-tool-result guidance in CDI) and adversarial experience-driven optimization (GRPO).
- End-to-end privacy guarantees via DP-SGD across SFT, reward modeling, and PPO with composed accounting.
- Formalize systems as AIOracles with completeness vs security games; map attacker capabilities via taxonomies.
- Strengthen red-teaming with improved optimization-based jailbreaks (two-stage loss + DPTO).
- Open questions / failure modes:
- Simulation-to-reality gap for contextual privacy defenses; brittleness to strategic attackers without optimization.
- DP-RLHF compute overhead and reliance on proxy preference construction.
- Strong jailbreak attacks achieving very high ASR highlight persistent deployment risk.
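The key structural point of post-tool-result guidance (as opposed to a static system prompt) can be sketched in a few lines. Everything here is hypothetical scaffolding: CDI trains the instructor via adversarial experience and GRPO, whereas this toy uses keyword matching purely to show where in the loop the guidance is injected.

```python
def privacy_guidance(tool_name, tool_result):
    """Toy stand-in for a trained step-specific instructor: inspect the
    tool result and emit contextual guidance for the next model call."""
    sensitive = any(k in tool_result.lower() for k in ("ssn", "diagnosis", "address"))
    if sensitive:
        return ("The last tool result contains sensitive fields; do not repeat "
                "them verbatim and share only what the task requires.")
    return None

def run_step(history, tool_name, tool_result):
    """Append the tool result, then the contextual guidance (if any), so the
    guidance lands *after* the result rather than in a static prompt."""
    history.append({"role": "tool", "name": tool_name, "content": tool_result})
    g = privacy_guidance(tool_name, tool_result)
    if g:
        history.append({"role": "system", "content": g})
    return history
```

Because the instruction is conditioned on what the tool actually returned, it can be specific ("don't echo the SSN") where a static prompt can only be generic, which is the gap CDI targets.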
Theme: Cognitive/psychometric evaluation beyond standard benchmarks
- Why it matters: Standard benchmarks show a dominant “general factor,” yet models fail on basic cognitive primitives; better diagnostics can guide training and predict failure modes (state loss, hallucination).
- Representative papers: A Neuropsychologically Grounded Evaluation of LLM Cognitive Abilities (2603.02540); SpatialText (2603.03002); Multimodal Item Response Theory (2603.02663).
- Common approach:
- Adapt human cognitive tests (RAPM/SWM/WCST) with process-aware metrics (perseveration, failure-to-maintain-set, structural errors).
- Isolate specific cognition (egocentric/allocentric transforms; global spatial consistency) in text-only settings.
- Use psychometrics (multidimensional IRT + adaptive testing) to identify high-signal items and reduce evaluation cost.
- Open questions / failure modes:
- Whether neuropsych constructs transfer cleanly from humans to LLMs; limited sample sizes for expensive modalities.
- Persistent failures in working memory/state tracking and egocentric transformations; reasoning modes can sometimes hurt.
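The adaptive-testing mechanism that lets IRT-style evaluation get by with a fraction of the items can be sketched with a standard 2PL model: pick the unanswered item with the highest Fisher information at the current ability estimate. This is textbook IRT, not M3IRT's multidimensional variant; item parameters below are made up.

```python
import math

def p_2pl(theta, a, b):
    """2PL item response model: P(correct | ability theta)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta: a^2 * p * (1 - p)."""
    p = p_2pl(theta, a, b)
    return a * a * p * (1.0 - p)

def pick_next_item(theta, items):
    """Adaptive testing: choose the item (a, b) most informative at theta."""
    return max(items, key=lambda ab: item_information(theta, *ab))

# A discriminating item matched to the test-taker's level beats both a
# low-discrimination item and a far-too-hard item.
bank = [(1.5, 0.0), (0.5, 0.0), (1.5, 3.0)]
next_item = pick_next_item(0.0, bank)
```

Selecting high-information items at each step is why a well-fit IRT model can reconstruct model rankings from roughly a tenth of the benchmark.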
3) Technical synthesis
- Multiple papers converge on trajectory-level supervision and scoring: PAE (integrity invariants), GLEAN (stepwise guideline evidence), MOSAIC (pairwise trajectory preferences), and AgentAssay (behavioral fingerprints + sequential tests) all treat agent behavior as a distribution over traces, not a single output.
- Gating is emerging as a unifying safety pattern: PAE gates utility on integrity; SaFeR-ToolKit constrains tool-transition topologies; CAT gates activation steering by Mahalanobis/OOD; CDI gates behavior via step-specific privacy guidance.
- LLM-as-judge is pervasive but increasingly instrumented: SteerEval uses gpt-4.1-mini scoring; MOSAIC notes positional bias; PAE reports manual validation precision; GLEAN uses token-prob YES/NO ratings plus Bayesian calibration.
- Representation-level alignment is gaining traction: Causal-GRPO targets persistence of malicious intent representations; MoD-DPO explicitly shapes modality sensitivity/invariance; steering-corruption work analyzes how dataset poisoning rotates/shrinks activation directions.
- Operational robustness is being formalized: model switching drift uses paired deltas + bootstrap CIs and factorization; AgentAssay frames regressions as hypothesis tests with SPRT and multivariate Hotelling T² fingerprints.
- Benchmarks are becoming “live” and updateable (LiveAgentBench, EverWebQA) to resist staleness/contamination, while psychometric methods (M3IRT) aim to keep evaluation compact and high-signal.
- Tooling and determinism are treated as first-class: REGAL pushes deterministic telemetry computation upstream and compiles bounded MCP tools; V-GEMS externalizes counting and state; BeyondSWE uses Dockerized reproducibility.
- Safety and privacy are increasingly trained against adaptive adversaries: CDI uses search-optimized attackers to generate failure trajectories; TAO-Attack improves optimization-based jailbreaks; EXPGUARD+ adds domain jailbreaks.
- “Reasoning” is not monotonic: NeuroCognition finds disabling reasoning can improve RAPM text MC; RLM reproduction shows deeper recursion harms accuracy and explodes latency/cost.
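The SPRT-based regression verdict mentioned above can be sketched on a Bernoulli pass rate. This is Wald's classical SPRT with hypothetical rates, not AgentAssay's multivariate fingerprint test: H0 is the healthy rate `p0`, H1 the regressed rate `p1`, and the test returns a three-valued verdict until the evidence crosses a boundary.

```python
import math

def sprt(successes, trials, p0, p1, alpha=0.05, beta=0.05):
    """Sequential probability ratio test on a Bernoulli pass rate.
    H0: pass rate p0 (healthy); H1: pass rate p1 < p0 (regressed).
    Returns 'pass', 'fail', or 'inconclusive' (keep sampling)."""
    a = math.log(beta / (1 - alpha))      # accept-H0 boundary (log LR below)
    b = math.log((1 - beta) / alpha)      # accept-H1 boundary (log LR above)
    llr = (successes * math.log(p1 / p0)
           + (trials - successes) * math.log((1 - p1) / (1 - p0)))
    if llr >= b:
        return "fail"          # evidence favors the regressed rate p1
    if llr <= a:
        return "pass"          # evidence favors the healthy rate p0
    return "inconclusive"
```

The sequential boundaries are what deliver the token savings: most runs terminate after a handful of trials instead of a fixed sample size, and "inconclusive" is an explicit first-class verdict rather than a silent flake.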
4) Top 5 papers (with “why now”)
1) Beyond Task Completion: Revealing Corrupt Success in LLM Agents through Procedure-Aware Evaluation
- Introduces Procedure-Aware Evaluation with explicit Read/Write/Communicate decomposition and consistency checks.
- Shows 27–78% of τ-bench “successes” are corrupt; gated utility can collapse (e.g., 0.68→0.16 for Mistral Retail).
- Provides model-specific integrity failure signatures and manual validation of judge precision (~93–95%).
- Skepticism: depends on explicit policies/Octx and LLM-judge semantics; binary gating may be too coarse for real risk tiers.
2) LiveAgentBench: Comprehensive Benchmarking of Agentic Systems Across 104 Real-World Challenges
- Real-user-derived, tool-dependent, multimodal tasks with closed-form verification (string matching; no judge model).
- Quantifies the gap: LLMs ≈13.48%, agents better but still far from human 69.25% (e.g., Manus 35.29%).
- Surfaces concrete blockers: tool instability and missing environment background knowledge.
- Skepticism: current scope is Chinese-language concentrated; converting queries to closed tasks can introduce unnatural artifacts.
3) Contextualized Privacy Defense for LLM Agents
- Proposes CDI: step-specific privacy guidance injected after tool results, not just static prompting or blocking.
- Uses adversarial failure trajectories + GRPO; optimized CDI reaches PP 94.2 / HS 80.6 / AD 86.5 on unseen tests.
- Demonstrates that optimizing only privacy can overprotect; staged PP→AD warmup matters.
- Skepticism: evaluation is simulation-based with synthetic configurations and LLM judges; real deployment transfer is unproven.
4) AgentAssay: Token-Efficient Regression Testing for Non-Deterministic AI Agent Workflows
- Formalizes stochastic regression testing with Pass/Fail/Inconclusive semantics and sequential testing (SPRT).
- Uses behavioral fingerprint vectors + Hotelling T² to boost power; reports ~78% fewer trials and large power gains (univariate 0% → fingerprint+SPRT ~86% in one setting).
- Practical CI/CD integration (pytest plugin; trace-first offline analysis enabling some checks at $0).
- Skepticism: assumes i.i.d. trials; evaluator stochasticity and provider drift can violate assumptions.
5) MoD-DPO: Towards Mitigating Cross-modal Hallucinations in Omni LLMs using Modality Decoupled Preference Optimization
- Adds modality-aware KL regularizers for invariance/sensitivity plus Language-Prior Debiasing to reduce text-only shortcuts.
- Reports strong gains on AVHBench (e.g., 88.19% for Qwen 2.5 Omni + MoD-DPO++) and improvements on CMM and general benchmarks.
- Provides a scalable synthetic preference dataset (18,112 samples over 10,854 videos).
- Skepticism: relies on synthetic preferences and stop-gradient approximations; extra forward passes increase cost and hyperparameter sensitivity is noted.
5) Practical next steps
- Add procedure-aware gating to your agent evals: log read/write/communicate events and disqualify “success” when integrity invariants fail (PAE-style), then track the delta vs outcome-only success.
- Stand up a switch-matrix handoff test for any multi-model routing/upgrade plan; compute paired deltas with bootstrap CIs and monitor prefix-influence/suffix-susceptibility factors.
- For stochastic agents, adopt three-valued regression verdicts + SPRT and store traces for trace-first offline checks to cut CI token cost.
- If using activation steering, treat the steering dataset as security-critical: test robustness under coordinated behavior injection and consider robust mean estimation (Lee–Valiant) rather than raw means.
- For privacy in tool-using agents, prototype a post-tool-result instructor (CDI-like) and train it on adversarially discovered failure prefixes; measure PP/HS/AD trade-offs and cold-start behavior.
- For multimodal systems, evaluate cross-modal hallucination with modality corruption tests and consider preference objectives that explicitly enforce invariance/sensitivity (MoD-DPO-style) rather than only response-level preferences.
- Use “live” agent benchmarks (or internal equivalents) that include tool instability and environment knowledge; track failure causes separately (execution failure vs reasoning vs missing info).
- Expand cognitive diagnostics beyond standard benchmarks: add at least one working-memory/state task (SWM-like) and one flexibility task (WCST-like) with process metrics to catch “trivial-for-humans” failures.
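The first next step (PAE-style integrity gating) can be sketched concretely. The invariant and event schema below are hypothetical illustrations of the pattern, not PAE's actual checks: success counts only when the outcome is correct AND every integrity invariant holds over the logged events.

```python
def gated_success(outcome_success, events, invariants):
    """A run counts as success only if the terminal outcome is correct AND
    every integrity invariant holds over the logged trajectory events."""
    return outcome_success and all(inv(events) for inv in invariants)

def no_fabricated_claims(events):
    """Hypothetical invariant: everything communicated to the user must
    have actually been read from the environment, not invented."""
    read = {e["value"] for e in events if e["kind"] == "read"}
    said = {e["value"] for e in events if e["kind"] == "communicate"}
    return said <= read

events = [
    {"kind": "read", "value": "order#123 shipped"},
    {"kind": "communicate", "value": "order#123 shipped"},
    {"kind": "communicate", "value": "refund issued"},  # never read: fabricated
]
ok = gated_success(True, events, [no_fabricated_claims])  # gated to False
```

Tracking the delta between outcome-only success and gated success over time gives a direct estimate of how much "corrupt success" your agent is accumulating.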
Generated from per-paper analyses; no external browsing.
