Daily AI Paper Report (2026-03-05)

Chinese version: [中文]

Run stats

  • Candidates: 236
  • Selected: 30
  • Deepread completed: 30
  • Window (UTC): 2026-03-03T01:00:00Z → 2026-03-04T01:00:00Z (arxiv_announce, expanded=0)
Selected papers

  • 2603.03205 — Learning When to Act or Refuse: Guarding Agentic Reasoning Models for Safe Multi-Step Tool Use (cs.CL; score 95) — Post-training framework for safe multi-step tool use with an explicit act/refuse loop. Tags: agent-safety, tool-use, refusal, post-training, sequential-decision-making, alignment
  • 2603.02601 — AgentAssay: Token-Efficient Regression Testing for Non-Deterministic AI Agent Workflows (cs.AI, cs.SE; score 94) — Token-efficient regression testing with statistical guarantees for non-deterministic agent workflows. Tags: agents, testing, regression, nondeterminism, evaluation, ci-cd, mutation-testing, metamorphic-testing
  • 2603.03116 — Beyond Task Completion: Revealing Corrupt Success in LLM Agents through Procedure-Aware Evaluation (cs.AI; score 93) — Procedure-aware evaluation catches “corrupt success” in LLM agents; multi-axis gating on τ-bench. Tags: agents, evaluation, reliability, process-supervision, benchmarking
  • 2603.02983 — Contextualized Privacy Defense for LLM Agents (cs.CR, cs.AI, cs.CL; score 92) — Proactive, step-wise privacy guidance for agents, trained via RL on failure trajectories. Tags: privacy, agents, tool-use, reinforcement-learning, data-protection, execution-monitoring
  • 2603.03000 — Why Does RLAIF Work At All? (cs.LG, cs.AI; score 92) — Rare theory for why RLAIF self-improves; latent value hypothesis plus formal results. Tags: alignment, RLAIF, theory, constitutional AI, preference-learning
  • 2603.03081 — TAO-Attack: Toward Advanced Optimization-Based Jailbreak Attacks for Large Language Models (cs.CL; score 91) — Stronger optimization-based jailbreak method; improves refusal suppression and harmfulness targeting. Tags: jailbreaks, red-teaming, adversarial-attacks, alignment, security
  • 2603.02578 — How Controllable Are Large Language Models? A Unified Evaluation across Behavioral Granularities (cs.CL, cs.AI, cs.HC, cs.LG; score 91) — Hierarchical benchmark for LLM controllability; shows steering degrades at fine granularity. Tags: alignment, controllability, steering, benchmark, evaluation, personality, sentiment
  • 2603.02675 — From Shallow to Deep: Pinning Semantic Intent via Causal GRPO (cs.LG; score 90) — Targets adversarial prefixes via causal intent probing plus GRPO to prevent shallow alignment. Tags: jailbreaks, adversarial-prompts, alignment, GRPO, interpretability, intent
  • 2603.03194 — BeyondSWE: Can Current Code Agent Survive Beyond Single-Repo Bug Fixing? (cs.CL, cs.SE; score 90) — BeyondSWE benchmark exposes large gaps for code agents beyond single-repo bug fixing; 500 instances. Tags: code-agents, benchmarks, software-engineering, evaluation, search
  • 2603.03192 — MoD-DPO: Towards Mitigating Cross-modal Hallucinations in Omni LLMs using Modality Decoupled Preference Optimization (cs.CV, cs.CL, cs.LG; score 89) — DPO variant that reduces cross-modal hallucinations via modality decoupling plus debiasing. Tags: multimodal, hallucinations, DPO, grounding, robustness
  • 2603.03258 — Inherited Goal Drift: Contextual Pressure Can Undermine Agentic Goals (cs.AI; score 88) — Empirical study of goal drift in newer agents; shows brittle robustness via inherited drift. Tags: agent-safety, goal-drift, long-context, robustness, evaluation, agentic-risk
  • 2603.02798 — Guideline-Grounded Evidence Accumulation for High-Stakes Agent Verification (cs.AI, cs.CL; score 88) — Guideline-grounded evidence accumulation for calibrated, high-stakes agent verification (Bayesian). Tags: verification, calibration, high-stakes, agents, clinical, bayesian
  • 2603.02586 — LiveAgentBench: Comprehensive Benchmarking of Agentic Systems Across 104 Real-World Challenges (cs.AI; score 88) — Real-world agent benchmark (104 scenarios) from public sources; compares models and products. Tags: agents, benchmark, evaluation, real-world-tasks, tool-use, reliability
  • 2603.03206 — Understanding and Mitigating Dataset Corruption in LLM Steering (cs.LG, cs.AI, cs.CL; score 86) — Analyzes contrastive steering robustness; shows how corrupted data can induce side effects. Tags: steering, robustness, data-poisoning, activation-editing, safety, inference-control
  • 2603.03111 — Evaluating Performance Drift from Model Switching in Multi-Turn LLM Systems (cs.CL; score 86) — Measures silent performance drift when multi-turn systems switch models mid-dialogue; switch matrix. Tags: deployment, evaluation, multi-turn, model-routing, reliability, drift
  • 2603.02626 — See and Remember: A Multimodal Agent for Web Traversal (cs.AI; score 86) — Web agent architecture with explicit memory plus visual grounding; adds a dynamic benchmark. Tags: agents, web-navigation, memory, multimodal, benchmark
  • 2603.03242 — Density-Guided Response Optimization: Community-Grounded Alignment via Implicit Acceptance Signals (cs.AI, cs.CL; score 86) — Aligns to community norms using implicit acceptance signals; density structure in representation space. Tags: alignment, preference-learning, implicit-feedback, rlhf-alternatives, social-norms, representation
  • 2603.02588 — ExpGuard: LLM Content Moderation in Specialized Domains (cs.CL; score 84) — Domain-specific moderation model plus a 58k dataset for finance/medical/legal guardrails. Tags: content-moderation, guardrails, datasets, domain-specific, safety-eval
  • 2603.03163 — Conditioned Activation Transport for T2I Safety Steering (cs.CV, cs.AI; score 84) — Inference-time T2I safety steering via conditioned nonlinear activation transport; adds a dataset. Tags: image-safety, activation-steering, diffusion, datasets, content-moderation
  • 2603.02663 — Evaluating Cross-Modal Reasoning Ability and Problem Characteristics with Multimodal Item Response Theory (cs.CL, cs.CV; score 84) — M3IRT separates image/text/cross-modal difficulty to detect shortcut items in MLLM evals. Tags: evaluation, multimodal, benchmarks, item-response-theory, shortcut-learning
  • 2603.02635 — SaFeR-ToolKit: Structured Reasoning via Virtual Tool Calling for Multimodal Safety (cs.LG; score 83) — Protocolized multimodal safety via virtual tool traces; curriculum including DPO and GRPO. Tags: multimodal-safety, tool-traces, jailbreaks, DPO, GRPO, structured-reasoning
  • 2603.03047 — TrustMH-Bench: A Comprehensive Benchmark for Evaluating the Trustworthiness of Large Language Models in Mental Health (cs.CL, cs.AI; score 82) — Comprehensive mental-health trustworthiness benchmark spanning safety, privacy, fairness, and more. Tags: benchmarks, mental-health, trustworthiness, safety, privacy, evaluation
  • 2603.03018 — REGAL: A Registry-Driven Architecture for Deterministic Grounding of Agentic AI in Enterprise Telemetry (cs.AI, cs.SE; score 82) — Enterprise agent grounding via a deterministic, versioned action space over telemetry; practical safety. Tags: agent-architecture, grounding, tool-use, enterprise, observability, governance
  • 2603.03002 — SpatialText: A Pure-Text Cognitive Benchmark for Spatial Understanding in Large Language Models (cs.AI; score 82) — Pure-text benchmark targeting true spatial mental models; avoids vision confounds. Tags: benchmark, reasoning, spatial-reasoning, evaluation, LLMs
  • 2603.02615 — Think, But Don't Overthink: Reproducing Recursive Language Models (cs.CL; score 82) — Reproduces Recursive LMs; finds deeper recursion can cause “overthinking” on long-context evals. Tags: long-context, agents, recursion, evaluation, reasoning, reproducibility
  • 2603.03172 — Less Noise, Same Certificate: Retain Sensitivity for Unlearning (cs.LG; score 81) — Certified unlearning with “retain sensitivity” to cut DP-style noise; privacy/reliability. Tags: machine-unlearning, privacy, certification, differential-privacy, robustness
  • 2603.03054 — PrivMedChat: End-to-End Differentially Private RLHF for Medical Dialogue Systems (cs.CL; score 80) — End-to-end DP-RLHF pipeline for medical dialogue to reduce memorization/extraction risk. Tags: differential-privacy, RLHF, medical, memorization, membership-inference, privacy
  • 2603.03078 — RAPO: Expanding Exploration for LLM Agents via Retrieval-Augmented Policy Optimization (cs.AI; score 80) — Retrieval-augmented policy optimization to expand exploration for agentic RL at step-level granularity. Tags: agentic-RL, retrieval, exploration, policy-optimization, tool-use
  • 2603.02540 — A Neuropsychologically Grounded Evaluation of LLM Cognitive Abilities (cs.AI; score 80) — Neuropsychology-grounded benchmark probing core cognitive abilities beyond task completion. Tags: evaluation, reasoning, cognitive-benchmarks, robustness, multimodal
  • 2603.02590 — Extending the Formalism and Theoretical Foundations of Cryptography to AI (cs.CR; score 78) — Formal foundations plus a taxonomy for securing LM agents via access control/permissioning. Tags: agent-security, formal-methods, access-control, permissioning, taxonomy, governance

AI Paper Insight Brief

2026-03-05

0) Executive takeaways (read this first)

  • Agent evaluation is shifting from “did it finish?” to “did it behave correctly along the way?” Procedure-aware evaluation on τ-bench finds 27–78% of apparent successes are procedurally corrupt, collapsing gated Pass^4 and exposing integrity failures that outcome metrics miss.
  • Real-world agent readiness remains low on dynamic, tool-heavy tasks. LiveAgentBench reports LLMs ≈13.48% Pass@1 and agents still far from humans (Manus 35.29% vs human 69.25%), with tool instability and missing environment knowledge as recurring blockers.
  • Steering/control is brittle at fine granularity. In SteerEval, prompting is stable across granularities, while activation-based steering (PCA/DiffMean/RePS) drops sharply from L1→L3, revealing a practical limit for token-level controllability.
  • Safety is moving “inside the model” via structured traces and representation-level objectives. SaFeR-ToolKit’s virtual tool traces dramatically raise strict safety/helpfulness/rigor scores on Qwen2.5-VL, while Causal-GRPO targets “semantic representation decay” to reduce jailbreak ASR without sacrificing utility.
  • Privacy defenses for agents are becoming contextual and trainable. Contextualized Defense Instructing (CDI) plus adversarial experience-driven GRPO reaches PP 94.2 / HS 80.6 / AD 86.5 on unseen simulations—substantially better than static prompting/guarding.
  • Benchmarking is becoming more diagnostic and sample-efficient. NeuroCognition and SpatialText probe foundational cognitive primitives (working memory, flexibility, egocentric transforms), while multimodal IRT (M3IRT) can reconstruct rankings with ~10% of items by selecting truly cross-modal questions.

2) Key themes (clusters)

Theme: Procedure-aware and trajectory-aware agent evaluation

  • Why it matters: Outcome-only metrics can overestimate safety and reliability by counting “corrupt success” as success. Trajectory-aware verification/evaluation enables deployment gating and calibrated escalation in high-stakes workflows.
  • Representative papers: Beyond Task Completion (PAE); Guideline-Grounded Evidence Accumulation (GLEAN); Evaluating Performance Drift from Model Switching in Multi-Turn LLM Systems.
  • Common approach:
    • Log and score process signals (read/write/communicate integrity; stepwise guideline alignment; handoff-induced deltas) rather than only terminal success.
    • Use calibration/uncertainty (Bayesian logistic regression in GLEAN; bootstrap CIs in switch matrices) to support abstention/escalation decisions.
    • Introduce gating or decompositions that compress risk (PAE’s gated utility; switch drift factorization into prefix influence/suffix susceptibility).
  • Open questions / failure modes:
    • Reliance on LLM judges (bias, positional effects, prompt sensitivity) and how to validate them at scale.
    • Extending beyond constrained setups: final-turn handoffs → earlier/multi-turn switches; guideline coverage gaps; domains without explicit policies.
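The gating pattern (PAE-style integrity-gated utility) can be sketched in a few lines. This is a minimal illustration, not the paper's actual schema: the field names and the choice of invariants are my own, standing in for the read/write/communicate integrity checks described above.

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    outcome_success: bool   # terminal task success (the outcome-only metric)
    read_integrity: bool    # illustrative invariant: no fabricated lookups
    write_integrity: bool   # illustrative invariant: no unauthorized state changes
    comm_integrity: bool    # illustrative invariant: no false claims to the user

def gated_success(t: Trajectory) -> bool:
    """A run only counts as success if every integrity invariant
    holds along the trajectory, not just the terminal outcome."""
    return (t.outcome_success
            and t.read_integrity
            and t.write_integrity
            and t.comm_integrity)

def gated_utility(trajs: list[Trajectory]) -> tuple[float, float]:
    """Return (outcome-only success rate, integrity-gated success rate).
    The gap between the two is the 'corrupt success' mass."""
    n = len(trajs)
    outcome = sum(t.outcome_success for t in trajs) / n
    gated = sum(gated_success(t) for t in trajs) / n
    return outcome, gated
```

Tracking both numbers side by side makes the headline finding concrete: an agent can look strong on the first rate while the second collapses.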

Theme: Real-world agent benchmarks + robustness bottlenecks

Theme: Controllability and steering under granularity + data corruption

  • Why it matters: Many alignment/steering methods work at coarse behavior levels but fail at fine constraints; additionally, steering datasets are an attack surface.
  • Representative papers: How Controllable Are Large Language Models? (SteerEval); Understanding and Mitigating Dataset Corruption in LLM Steering.
  • Common approach:
    • Evaluate steering across hierarchical granularities (intent → strategy → instantiation) and domains.
    • Compare prompt-based vs activation-based steering; tune steering strength and measure trade-offs (concept vs instruction vs fluency).
    • Use robust statistics (Lee–Valiant robust mean) to mitigate poisoned/corrupted steering datasets.
  • Open questions / failure modes:
    • Activation steering collapses at fine granularity (L3) and shows strength trade-offs that harm instruction/fluency.
    • Coordinated behavior injection can pull steering direction toward an attacker’s behavior; robust means only partially mitigate.
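To make the last two points concrete, here is a minimal contrastive-steering direction with a coordinate-wise trimmed mean standing in for the Lee–Valiant estimator (the actual estimator is more sophisticated; the `trim` fraction and the rows-are-activations layout are assumptions of this sketch):

```python
import numpy as np

def steering_direction(pos: np.ndarray, neg: np.ndarray, trim: float = 0.1) -> np.ndarray:
    """Contrastive steering direction from positive/negative activation sets
    (rows = examples, cols = hidden dims). A coordinate-wise trimmed mean
    limits how far a small fraction of poisoned examples can rotate the
    direction, compared with a raw mean."""
    def trimmed_mean(x: np.ndarray) -> np.ndarray:
        k = int(len(x) * trim)
        xs = np.sort(x, axis=0)  # sort each coordinate independently
        return xs[k:len(x) - k].mean(axis=0) if k > 0 else x.mean(axis=0)
    d = trimmed_mean(pos) - trimmed_mean(neg)
    return d / np.linalg.norm(d)
```

With a few injected outlier activations, the raw-mean direction tilts toward the attacker's behavior while the trimmed version stays close to the clean contrast; this is the sense in which robust means only partially mitigate, since trimming has a fixed breakdown point.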

Theme: Multimodal safety + hallucination mitigation via structured traces and modality-aware objectives

Theme: Privacy and security foundations for agents (practical + formal)

Theme: Cognitive/psychometric evaluation beyond standard benchmarks

3) Technical synthesis

  • Multiple papers converge on trajectory-level supervision and scoring: PAE (integrity invariants), GLEAN (stepwise guideline evidence), MOSAIC (pairwise trajectory preferences), and AgentAssay (behavioral fingerprints + sequential tests) all treat agent behavior as a distribution over traces, not a single output.
  • Gating is emerging as a unifying safety pattern: PAE gates utility on integrity; SaFeR-ToolKit constrains tool-transition topologies; CAT gates activation steering by Mahalanobis/OOD; CDI gates behavior via step-specific privacy guidance.
  • LLM-as-judge is pervasive but increasingly instrumented: SteerEval uses gpt-4.1-mini scoring; MOSAIC notes positional bias; PAE reports manual validation precision; GLEAN uses token-prob YES/NO ratings plus Bayesian calibration.
  • Representation-level alignment is gaining traction: Causal-GRPO targets persistence of malicious intent representations; MoD-DPO explicitly shapes modality sensitivity/invariance; steering-corruption work analyzes how dataset poisoning rotates/shrinks activation directions.
  • Operational robustness is being formalized: model switching drift uses paired deltas + bootstrap CIs and factorization; AgentAssay frames regressions as hypothesis tests with SPRT and multivariate Hotelling T² fingerprints.
  • Benchmarks are becoming “live” and updateable (LiveAgentBench, EverWebQA) to resist staleness/contamination, while psychometric methods (M3IRT) aim to keep evaluation compact and high-signal.
  • Tooling and determinism are treated as first-class: REGAL pushes deterministic telemetry computation upstream and compiles bounded MCP tools; V-GEMS externalizes counting and state; BeyondSWE uses Dockerized reproducibility.
  • Safety and privacy are increasingly trained against adaptive adversaries: CDI uses search-optimized attackers to generate failure trajectories; TAO-Attack improves optimization-based jailbreaks; EXPGUARD+ adds domain jailbreaks.
  • “Reasoning” is not monotonic: NeuroCognition finds that disabling reasoning can improve accuracy on RAPM-style (Raven's matrices) text multiple-choice items; the RLM reproduction shows deeper recursion harms accuracy while exploding latency and cost.
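The SPRT framing for stochastic regression testing can be illustrated with Wald's classic sequential test on a Bernoulli pass rate. The hypotheses, thresholds, and error rates below are illustrative defaults of this sketch, not AgentAssay's actual parameters:

```python
import math

def sprt_verdict(results, p0: float = 0.9, p1: float = 0.7,
                 alpha: float = 0.05, beta: float = 0.05):
    """Wald's SPRT on a stream of pass/fail trials:
    H0: pass rate >= p0 (no regression) vs H1: pass rate <= p1 (regression).
    Returns ('pass' | 'fail' | 'inconclusive', trials consumed), stopping as
    soon as the evidence crosses a boundary -- the source of token savings."""
    upper = math.log((1 - beta) / alpha)   # cross above -> accept H1 (fail)
    lower = math.log(beta / (1 - alpha))   # cross below -> accept H0 (pass)
    llr = 0.0
    for n, ok in enumerate(results, start=1):
        llr += math.log((p1 if ok else 1 - p1) / (p0 if ok else 1 - p0))
        if llr >= upper:
            return "fail", n
        if llr <= lower:
            return "pass", n
    return "inconclusive", len(results)
```

The three-valued verdict matters: runs that exhaust the budget without crossing either boundary stay "inconclusive" rather than being forced into pass/fail, which is exactly the semantics a flaky-agent CI gate needs.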

4) Top 5 papers (with “why now”)

1) Beyond Task Completion: Revealing Corrupt Success in LLM Agents through Procedure-Aware Evaluation

  • Introduces Procedure-Aware Evaluation with explicit Read/Write/Communicate decomposition and consistency checks.
  • Shows 27–78% of τ-bench “successes” are corrupt; gated utility can collapse (e.g., 0.68→0.16 for Mistral Retail).
  • Provides model-specific integrity failure signatures and manual validation of judge precision (~93–95%).
  • Skepticism: depends on explicit policies/Octx and LLM-judge semantics; binary gating may be too coarse for real risk tiers.

2) LiveAgentBench: Comprehensive Benchmarking of Agentic Systems Across 104 Real-World Challenges

  • Real-user-derived, tool-dependent, multimodal tasks with closed-form verification (string matching; no judge model).
  • Quantifies the gap: LLMs ≈13.48%, agents better but still far from human 69.25% (e.g., Manus 35.29%).
  • Surfaces concrete blockers: tool instability and missing environment background knowledge.
  • Skepticism: the current scope is concentrated on Chinese-language tasks; converting real queries into closed-form tasks can introduce unnatural artifacts.

3) Contextualized Privacy Defense for LLM Agents

  • Proposes CDI: step-specific privacy guidance injected after tool results, not just static prompting or blocking.
  • Uses adversarial failure trajectories + GRPO; optimized CDI reaches PP 94.2 / HS 80.6 / AD 86.5 on unseen tests.
  • Demonstrates that optimizing only privacy can overprotect; staged PP→AD warmup matters.
  • Skepticism: evaluation is simulation-based with synthetic configurations and LLM judges; real deployment transfer is unproven.

4) AgentAssay: Token-Efficient Regression Testing for Non-Deterministic AI Agent Workflows

  • Formalizes stochastic regression testing with Pass/Fail/Inconclusive semantics and sequential testing (SPRT).
  • Uses behavioral fingerprint vectors + Hotelling T² to boost power; reports ~78% fewer trials and large power gains (univariate 0% → fingerprint+SPRT ~86% in one setting).
  • Practical CI/CD integration (pytest plugin; trace-first offline analysis enabling some checks at $0).
  • Skepticism: assumes i.i.d. trials; evaluator stochasticity and provider drift can violate assumptions.
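The fingerprint idea can be sketched as a two-sample Hotelling T² on per-run metric vectors. This is only the statistic, assuming fingerprints are vectors of scalar run metrics (e.g. latency, tool calls, tokens); the paper's actual fingerprint contents and calibrated thresholds are not reproduced here:

```python
import numpy as np

def hotelling_t2(base: np.ndarray, cand: np.ndarray) -> float:
    """Two-sample Hotelling T^2 between baseline and candidate runs
    (rows = runs, cols = metrics). A single multivariate statistic can
    flag a behavioral shift that no individual metric shows on its own."""
    n1, n2 = len(base), len(cand)
    diff = base.mean(axis=0) - cand.mean(axis=0)
    # Pooled sample covariance across both groups
    s = ((n1 - 1) * np.cov(base, rowvar=False)
         + (n2 - 1) * np.cov(cand, rowvar=False)) / (n1 + n2 - 2)
    return float((n1 * n2) / (n1 + n2) * diff @ np.linalg.solve(s, diff))
```

In practice the statistic would be compared against an F-distribution critical value or a permutation threshold; larger values indicate a stronger multivariate difference between baseline and candidate behavior.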

5) MoD-DPO: Mitigating Cross-modal Hallucinations in Omni LLMs using Modality Decoupled Preference Optimization

  • Adds modality-aware KL regularizers for invariance/sensitivity plus Language-Prior Debiasing to reduce text-only shortcuts.
  • Reports strong gains on AVHBench (e.g., 88.19% for Qwen 2.5 Omni + MoD-DPO++) and improvements on CMM and general benchmarks.
  • Provides a scalable synthetic preference dataset (18,112 samples over 10,854 videos).
  • Skepticism: relies on synthetic preferences and stop-gradient approximations; extra forward passes increase cost and hyperparameter sensitivity is noted.

5) Practical next steps

  • Add procedure-aware gating to your agent evals: log read/write/communicate events and disqualify “success” when integrity invariants fail (PAE-style), then track the delta vs outcome-only success.
  • Stand up a switch-matrix handoff test for any multi-model routing/upgrade plan; compute paired deltas with bootstrap CIs and monitor prefix-influence/suffix-susceptibility factors.
  • For stochastic agents, adopt three-valued regression verdicts + SPRT and store traces for trace-first offline checks to cut CI token cost.
  • If using activation steering, treat the steering dataset as security-critical: test robustness under coordinated behavior injection and consider robust mean estimation (Lee–Valiant) rather than raw means.
  • For privacy in tool-using agents, prototype a post-tool-result instructor (CDI-like) and train it on adversarially discovered failure prefixes; measure PP/HS/AD trade-offs and cold-start behavior.
  • For multimodal systems, evaluate cross-modal hallucination with modality corruption tests and consider preference objectives that explicitly enforce invariance/sensitivity (MoD-DPO-style) rather than only response-level preferences.
  • Use “live” agent benchmarks (or internal equivalents) that include tool instability and environment knowledge; track failure causes separately (execution failure vs reasoning vs missing info).
  • Expand cognitive diagnostics beyond standard benchmarks: add at least one working-memory/state task (SWM-like) and one flexibility task (WCST-like) with process metrics to catch “trivial-for-humans” failures.
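For the switch-matrix step above, the paired-delta bootstrap is only a few lines. The 95% interval and resample count are conventional choices of this sketch, not values prescribed by the drift paper:

```python
import numpy as np

def paired_delta_ci(before: np.ndarray, after: np.ndarray,
                    n_boot: int = 10_000, seed: int = 0):
    """Bootstrap CI for the mean paired delta (after - before) over the same
    task set, as in a model-switch handoff test: if the interval excludes 0,
    the switch caused real drift rather than run-to-run noise."""
    rng = np.random.default_rng(seed)
    deltas = after - before
    # Resample task indices with replacement and recompute the mean delta
    idx = rng.integers(0, len(deltas), size=(n_boot, len(deltas)))
    boot_means = deltas[idx].mean(axis=1)
    lo, hi = np.percentile(boot_means, [2.5, 97.5])
    return float(deltas.mean()), float(lo), float(hi)
```

Running this per cell of the switch matrix (each before/after model pair) gives the paired deltas with bootstrap CIs described above, and the per-row/per-column patterns approximate the prefix-influence and suffix-susceptibility factors.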

Generated from per-paper analyses; no external browsing.