Daily AI Paper Report (2026-03-05)
Published:
Chinese version: [Chinese]
Run stats
- Candidates: 236
- Selected: 30
- Deepread completed: 30
- Window (UTC): 2026-03-03T01:00:00Z → 2026-03-04T01:00:00Z (arxiv_announce, expanded=0)
Selected papers
| arXiv ID | Title / Links | Categories | Score | Why | Tags |
|---|---|---|---|---|---|
| 2603.03205 | Learning When to Act or Refuse: Guarding Agentic Reasoning Models for Safe Multi-Step Tool Use | cs.CL | 95 | Post-training framework for safe multi-step tool use with explicit act/refuse loop. | agent-safety, tool-use, refusal, post-training, sequential-decision-making, alignment |
| 2603.02601 | AgentAssay: Token-Efficient Regression Testing for Non-Deterministic AI Agent Workflows | cs.AI, cs.SE | 94 | Token-efficient regression testing with statistical guarantees for non-deterministic agent workflows. | agents, testing, regression, nondeterminism, evaluation, ci-cd, mutation-testing, metamorphic-testing |
| 2603.03116 | Beyond Task Completion: Revealing Corrupt Success in LLM Agents through Procedure-Aware Evaluation | cs.AI | 93 | Procedure-aware eval catches “corrupt success” in LLM agents; multi-axis gating on tau-bench. | agents, evaluation, reliability, process-supervision, benchmarking |
| 2603.02983 | Contextualized Privacy Defense for LLM Agents | cs.CR, cs.AI, cs.CL | 92 | Proactive, step-wise privacy guidance for agents trained via RL on failure trajectories. | privacy, agents, tool-use, reinforcement-learning, data-protection, execution-monitoring |
| 2603.03000 | Why Does RLAIF Work At All? | cs.LG, cs.AI | 92 | Rare theory for why RLAIF self-improves; latent value hypothesis + formal results. | alignment, RLAIF, theory, constitutional AI, preference-learning |
| 2603.03081 | TAO-Attack: Toward Advanced Optimization-Based Jailbreak Attacks for Large Language Models | cs.CL | 91 | Stronger optimization-based jailbreak method; improves refusal suppression and harmfulness targeting. | jailbreaks, red-teaming, adversarial-attacks, alignment, security |
| 2603.02578 | How Controllable Are Large Language Models? A Unified Evaluation across Behavioral Granularities | cs.CL, cs.AI, cs.HC, cs.LG | 91 | Hierarchical benchmark for LLM controllability; shows steering degrades at fine granularity. | alignment, controllability, steering, benchmark, evaluation, personality, sentiment |
| 2603.02675 | From Shallow to Deep: Pinning Semantic Intent via Causal GRPO | cs.LG | 90 | Targets adversarial prefixes via causal intent probing + GRPO to prevent shallow alignment. | jailbreaks, adversarial-prompts, alignment, GRPO, interpretability, intent |
| 2603.03194 | BeyondSWE: Can Current Code Agent Survive Beyond Single-Repo Bug Fixing? | cs.CL, cs.SE | 90 | BeyondSWE benchmark exposes big gaps for code agents beyond single-repo bug fixing; 500 instances. | code-agents, benchmarks, software-engineering, evaluation, search |
| 2603.03192 | MoD-DPO: Towards Mitigating Cross-modal Hallucinations in Omni LLMs using Modality Decoupled Preference Optimization | cs.CV, cs.CL, cs.LG | 89 | DPO variant to reduce cross-modal hallucinations via modality decoupling + debiasing. | multimodal, hallucinations, DPO, grounding, robustness |
| 2603.03258 | Inherited Goal Drift: Contextual Pressure Can Undermine Agentic Goals | cs.AI | 88 | Empirical study of goal drift in newer agents; shows brittle robustness via inherited drift. | agent-safety, goal-drift, long-context, robustness, evaluation, agentic-risk |
| 2603.02798 | Guideline-Grounded Evidence Accumulation for High-Stakes Agent Verification | cs.AI, cs.CL | 88 | Guideline-grounded evidence accumulation for calibrated high-stakes agent verification (Bayesian). | verification, calibration, high-stakes, agents, clinical, bayesian |
| 2603.02586 | LiveAgentBench: Comprehensive Benchmarking of Agentic Systems Across 104 Real-World Challenges | cs.AI | 88 | Real-world agent benchmark (104 scenarios) from public sources; compares models/products. | agents, benchmark, evaluation, real-world-tasks, tool-use, reliability |
| 2603.03206 | Understanding and Mitigating Dataset Corruption in LLM Steering | cs.LG, cs.AI, cs.CL | 86 | Analyzes contrastive steering robustness; shows how corrupted data can induce side effects. | steering, robustness, data-poisoning, activation-editing, safety, inference-control |
| 2603.03111 | Evaluating Performance Drift from Model Switching in Multi-Turn LLM Systems | cs.CL | 86 | Measures silent performance drift when multi-turn systems switch models mid-dialogue; switch-matrix. | deployment, evaluation, multi-turn, model-routing, reliability, drift |
| 2603.02626 | See and Remember: A Multimodal Agent for Web Traversal | cs.AI | 86 | Web agent architecture with explicit memory + visual grounding; adds dynamic benchmark. | agents, web-navigation, memory, multimodal, benchmark |
| 2603.03242 | Density-Guided Response Optimization: Community-Grounded Alignment via Implicit Acceptance Signals | cs.AI, cs.CL | 86 | Aligns to community norms using implicit acceptance signals; density structure in representation space. | alignment, preference-learning, implicit-feedback, rlhf-alternatives, social-norms, representation |
| 2603.02588 | ExpGuard: LLM Content Moderation in Specialized Domains | cs.CL | 84 | Domain-specific moderation model + 58k dataset for finance/medical/legal guardrails. | content-moderation, guardrails, datasets, domain-specific, safety-eval |
| 2603.03163 | Conditioned Activation Transport for T2I Safety Steering | cs.CV, cs.AI | 84 | Inference-time T2I safety steering via conditioned nonlinear activation transport; adds dataset. | image-safety, activation-steering, diffusion, datasets, content-moderation |
| 2603.02663 | Evaluating Cross-Modal Reasoning Ability and Problem Characteristics with Multimodal Item Response Theory | cs.CL, cs.CV | 84 | M3IRT separates image/text/cross-modal difficulty to detect shortcut items in MLLM evals. | evaluation, multimodal, benchmarks, item-response-theory, shortcut-learning |
| 2603.02635 | SaFeR-ToolKit: Structured Reasoning via Virtual Tool Calling for Multimodal Safety | cs.LG | 83 | Protocolized multimodal safety via virtual tool traces; curriculum incl. DPO and GRPO. | multimodal-safety, tool-traces, jailbreaks, DPO, GRPO, structured-reasoning |
| 2603.03047 | TrustMH-Bench: A Comprehensive Benchmark for Evaluating the Trustworthiness of Large Language Models in Mental Health | cs.CL, cs.AI | 82 | Comprehensive mental-health trustworthiness benchmark across safety, privacy, fairness, etc. | benchmarks, mental-health, trustworthiness, safety, privacy, evaluation |
| 2603.03018 | REGAL: A Registry-Driven Architecture for Deterministic Grounding of Agentic AI in Enterprise Telemetry | cs.AI, cs.SE | 82 | Enterprise agent grounding via deterministic, versioned action space over telemetry; practical safety. | agent-architecture, grounding, tool-use, enterprise, observability, governance |
| 2603.03002 | SpatialText: A Pure-Text Cognitive Benchmark for Spatial Understanding in Large Language Models | cs.AI | 82 | Pure-text benchmark targeting true spatial mental models; avoids vision confounds. | benchmark, reasoning, spatial-reasoning, evaluation, LLMs |
| 2603.02615 | Think, But Don't Overthink: Reproducing Recursive Language Models | cs.CL | 82 | Reproduces Recursive LMs; finds deeper recursion can cause 'overthinking' on long-context evals. | long-context, agents, recursion, evaluation, reasoning, reproducibility |
| 2603.03172 | Less Noise, Same Certificate: Retain Sensitivity for Unlearning | cs.LG | 81 | Certified unlearning with 'retain sensitivity' to cut DP-style noise; privacy/reliability. | machine-unlearning, privacy, certification, differential-privacy, robustness |
| 2603.03054 | PrivMedChat: End-to-End Differentially Private RLHF for Medical Dialogue Systems | cs.CL | 80 | End-to-end DP-RLHF pipeline for medical dialogue to reduce memorization/extraction risk. | differential-privacy, RLHF, medical, memorization, membership-inference, privacy |
| 2603.03078 | RAPO: Expanding Exploration for LLM Agents via Retrieval-Augmented Policy Optimization | cs.AI | 80 | Retrieval-augmented policy optimization to expand exploration for agentic RL at step-level granularity. | agentic-RL, retrieval, exploration, policy-optimization, tool-use |
| 2603.02540 | A Neuropsychologically Grounded Evaluation of LLM Cognitive Abilities | cs.AI | 80 | Neuropsychology-grounded benchmark probing core cognitive abilities beyond task completion. | evaluation, reasoning, cognitive-benchmarks, robustness, multimodal |
| 2603.02590 | Extending the Formalism and Theoretical Foundations of Cryptography to AI | cs.CR | 78 | Formal foundations + taxonomy for securing LM agents via access control/permissioning. | agent-security, formal-methods, access-control, permissioning, taxonomy, governance |
AI Paper Insight Brief
2026-03-05
1) Executive takeaways (read this first)
- Agent evaluation is shifting from “did it finish?” to “did it behave correctly along the way?” Procedure-aware evaluation on τ-bench finds 27–78% of apparent successes are procedurally corrupt, collapsing gated Pass^4 and exposing integrity failures that outcome metrics miss.
- Real-world agent readiness remains low on dynamic, tool-heavy tasks. LiveAgentBench reports LLMs ≈13.48% Pass@1 and agents still far from humans (Manus 35.29% vs human 69.25%), with tool instability and missing environment knowledge as recurring blockers.
- Steering/control is brittle at fine granularity. In SteerEval, prompting is stable across granularities, while activation-based steering (PCA/DiffMean/RePS) drops sharply from L1→L3, revealing a practical limit for token-level controllability.
- Safety is moving “inside the model” via structured traces and representation-level objectives. SaFeR-ToolKit’s virtual tool traces dramatically raise strict safety/helpfulness/rigor scores on Qwen2.5-VL, while Causal-GRPO targets “semantic representation decay” to reduce jailbreak ASR without sacrificing utility.
- Privacy defenses for agents are becoming contextual and trainable. Contextualized Defense Instructing (CDI) plus adversarial experience-driven GRPO reaches PP 94.2 / HS 80.6 / AD 86.5 on unseen simulations—substantially better than static prompting/guarding.
- Benchmarking is becoming more diagnostic and sample-efficient. NeuroCognition and SpatialText probe foundational cognitive primitives (working memory, flexibility, egocentric transforms), while multimodal IRT (M3IRT) can reconstruct rankings with ~10% of items by selecting truly cross-modal questions.
2) Key themes (clusters)
Theme: Procedure-aware and trajectory-aware agent evaluation
- Why it matters: Outcome-only metrics can overestimate safety and reliability by counting “corrupt success” as success. Trajectory-aware verification/evaluation enables deployment gating and calibrated escalation in high-stakes workflows.
- Representative papers: Beyond Task Completion (2603.03116); Guideline-Grounded Evidence Accumulation (2603.02798); Evaluating Performance Drift from Model Switching (2603.03111).
- Common approach:
- Log and score process signals (read/write/communicate integrity; stepwise guideline alignment; handoff-induced deltas) rather than only terminal success.
- Use calibration/uncertainty (Bayesian logistic regression in GLEAN; bootstrap CIs in switch matrices) to support abstention/escalation decisions.
- Introduce gating or decompositions that compress risk (PAE’s gated utility; switch drift factorization into prefix influence/suffix susceptibility).
- Open questions / failure modes:
- Reliance on LLM judges (bias, positional effects, prompt sensitivity) and how to validate them at scale.
- Extending beyond constrained setups: final-turn handoffs → earlier/multi-turn switches; guideline coverage gaps; domains without explicit policies.
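The switch-matrix recipe above (paired per-task deltas plus bootstrap CIs to flag silent drift) can be sketched in a few lines. This is an illustrative stand-in, not the paper's code; `bootstrap_ci` and the delta values are hypothetical.

```python
import random

def bootstrap_ci(deltas, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired delta."""
    rng = random.Random(seed)
    n = len(deltas)
    means = []
    for _ in range(n_boot):
        sample = [deltas[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return sum(deltas) / n, (lo, hi)

# Paired per-task deltas: score(run with mid-dialogue switch) - score(single-model run).
deltas = [-0.10, 0.02, -0.05, -0.08, 0.01, -0.12, -0.03, 0.00]
mean, (lo, hi) = bootstrap_ci(deltas)
# Flag "silent drift" when the CI excludes zero.
drift = hi < 0 or lo > 0
```

Pairing each switched run against a single-model run on the same task is what lets a small sample detect drift; the CI supports an abstain/escalate decision when it straddles zero.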
Theme: Real-world agent benchmarks + robustness bottlenecks
- Why it matters: Tool instability, environment knowledge gaps, and dynamic web/OS interactions dominate failures in practice; static benchmarks understate these issues.
- Representative papers: LiveAgentBench (2603.02586); BeyondSWE (2603.03194); See and Remember (2603.02626).
- Common approach:
- Build verifiable, tool-dependent tasks with automatic checking (string matching; Dockerized tests).
- Add explicit state + symbolic tools (URL stack backtracking; symbolic counter) to reduce hallucination and amnesia.
- Study search/tool augmentation explicitly (SearchSWE; EverWebQA’s live-web pipeline).
- Open questions / failure modes:
- Tool instability and execution failures remain recurring blockers; many reported failures trace to unstable tools rather than to the agent's reasoning.
- Search augmentation can be inconsistent or negative due to temporal misalignment/semantic drift.
- Latency/cost overhead from multimodal perception (adaptive VLM calls still add compute).
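The "explicit state + symbolic tools" pattern above can be made concrete with a minimal URL-stack memory: the agent pushes each visited page and pops to backtrack, instead of reconstructing its path from free-form context. The `URLStack` class is a hypothetical sketch of the idea, not the paper's API.

```python
class URLStack:
    """Explicit navigation memory: push visited URLs, pop to backtrack,
    so the agent never has to 'remember' its path in free-form text."""

    def __init__(self):
        self._stack = []

    def visit(self, url):
        self._stack.append(url)

    def back(self):
        # Pop the current page; return the page to revisit (None at the root).
        if self._stack:
            self._stack.pop()
        return self._stack[-1] if self._stack else None

    def current(self):
        return self._stack[-1] if self._stack else None
```

Externalizing state like this trades a little plumbing for determinism: backtracking becomes a data-structure operation the model cannot hallucinate.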
Theme: Controllability and steering under granularity + data corruption
- Why it matters: Many alignment/steering methods work at coarse behavior levels but fail at fine constraints; additionally, steering datasets are an attack surface.
- Representative papers: How Controllable Are Large Language Models? (2603.02578); Understanding and Mitigating Dataset Corruption in LLM Steering (2603.03206).
- Common approach:
- Evaluate steering across hierarchical granularities (intent → strategy → instantiation) and domains.
- Compare prompt-based vs activation-based steering; tune steering strength and measure trade-offs (concept vs instruction vs fluency).
- Use robust statistics (Lee–Valiant robust mean) to mitigate poisoned/corrupted steering datasets.
- Open questions / failure modes:
- Activation steering collapses at fine granularity (L3) and shows strength trade-offs that harm instruction/fluency.
- Coordinated behavior injection can pull steering direction toward an attacker’s behavior; robust means only partially mitigate.
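To illustrate why robust statistics help against coordinated injection, here is a toy sketch: a coordinate-wise trimmed mean (a simple stand-in for estimators like Lee–Valiant, not the paper's method) recovers the clean steering direction even when a minority of contrastive examples is poisoned. All vectors are synthetic.

```python
def trimmed_mean(vectors, trim=0.1):
    """Coordinate-wise trimmed mean: a crude stand-in for robust mean
    estimation against a minority of poisoned steering examples."""
    n, d = len(vectors), len(vectors[0])
    k = int(n * trim)  # number trimmed from each tail per coordinate
    out = []
    for j in range(d):
        col = sorted(v[j] for v in vectors)
        kept = col[k:n - k] if n - 2 * k > 0 else col
        out.append(sum(kept) / len(kept))
    return out

# Clean activation differences cluster near +1 on the behavior axis;
# a coordinated injection pushes one example far toward -10.
clean = [[1.0, 0.0]] * 9
poisoned = [[-10.0, 5.0]]
raw = [sum(v[j] for v in clean + poisoned) / 10 for j in range(2)]
robust = trimmed_mean(clean + poisoned, trim=0.1)
# raw[0] has the wrong sign; robust[0] recovers the clean direction.
```

One extreme outlier is enough to flip the sign of the raw mean direction; trimming restores it, which is the core intuition behind treating steering datasets as security-critical.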
Theme: Multimodal safety + hallucination mitigation via structured traces and modality-aware objectives
- Why it matters: Multimodal models fail via jailbreaks, over-refusal, and cross-modal hallucinations; making intermediate decisions auditable or enforcing modality sensitivity/invariance can reduce these failures.
- Representative papers: SaFeR-ToolKit (2603.02635); MoD-DPO (2603.03192); Conditioned Activation Transport (2603.03163).
- Common approach:
- Enforce structured intermediate traces (typed tool calls; constrained topologies) and train with SFT→DPO→GRPO.
- Add modality-aware regularizers (invariance to irrelevant corruption; sensitivity to relevant corruption) and debias text priors.
- Use conditional/gated steering (Mahalanobis/GDA/OOD gating) to reduce unsafe outputs while preserving utility.
- Open questions / failure modes:
- Dependence on large judge models and automated safety judges; human validation remains limited.
- Inference-time steering can be bypassed under distribution shift; mean-pooled activations may miss localized unsafe features.
- Synthetic preference data and stop-gradient approximations may limit real-world generalization.
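The gated-steering idea above can be sketched as follows. This is an illustrative simplification (diagonal-covariance Mahalanobis distance, made-up threshold and vectors), not the CAT method itself: the safety edit fires only when an activation looks out-of-distribution relative to "safe" reference statistics, preserving utility on benign inputs.

```python
def mahalanobis_diag(x, mean, var):
    """Mahalanobis distance under a diagonal covariance (illustrative)."""
    return sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var)) ** 0.5

def gated_steer(activation, safe_mean, safe_var, direction, alpha=1.0, tau=3.0):
    """Apply the safety-steering edit only when the activation is
    far from the 'safe' reference distribution."""
    d = mahalanobis_diag(activation, safe_mean, safe_var)
    if d <= tau:                      # in-distribution: leave untouched
        return activation
    return [a - alpha * s for a, s in zip(activation, direction)]
```

The gate is what keeps the intervention cheap and low-collateral; the known failure mode, noted above, is distribution shift that sneaks unsafe inputs under the threshold.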
Theme: Privacy and security foundations for agents (practical + formal)
- Why it matters: Agents handle sensitive data and tool actions; defenses need contextual decision-making, formal guarantees, and clear threat models.
- Representative papers: Contextualized Privacy Defense for LLM Agents (2603.02983); PrivMedChat (2603.03054); Extending the Formalism and Theoretical Foundations of Cryptography to AI (2603.02590); TAO-Attack (2603.03081).
- Common approach:
- Contextual interventions during agent loops (post-tool-result guidance in CDI) and adversarial experience-driven optimization (GRPO).
- End-to-end privacy guarantees via DP-SGD across SFT, reward modeling, and PPO with composed accounting.
- Formalize systems as AIOracles with completeness vs security games; map attacker capabilities via taxonomies.
- Strengthen red-teaming with improved optimization-based jailbreaks (two-stage loss + DPTO).
- Open questions / failure modes:
- Simulation-to-reality gap for contextual privacy defenses; brittleness to strategic attackers without optimization.
- DP-RLHF compute overhead and reliance on proxy preference construction.
- Strong jailbreak attacks achieving very high ASR highlight persistent deployment risk.
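The key structural point of post-tool-result guidance (as opposed to a static system prompt) can be sketched in a few lines. Everything here is hypothetical scaffolding: CDI trains the instructor via adversarial experience and GRPO, whereas this toy uses keyword matching purely to show where in the loop the guidance is injected.

```python
def privacy_guidance(tool_name, tool_result):
    """Toy stand-in for a trained step-specific instructor: inspect the
    tool result and emit contextual guidance for the next model call."""
    sensitive = any(k in tool_result.lower() for k in ("ssn", "diagnosis", "address"))
    if sensitive:
        return ("The last tool result contains sensitive fields; do not repeat "
                "them verbatim and share only what the task requires.")
    return None

def run_step(history, tool_name, tool_result):
    """Append the tool result, then the contextual guidance (if any), so the
    guidance lands *after* the result rather than in a static prompt."""
    history.append({"role": "tool", "name": tool_name, "content": tool_result})
    g = privacy_guidance(tool_name, tool_result)
    if g:
        history.append({"role": "system", "content": g})
    return history
```

Because the instruction is conditioned on what the tool actually returned, it can be specific ("don't echo the SSN") where a static prompt can only be generic, which is the gap CDI targets.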
Theme: Cognitive/psychometric evaluation beyond standard benchmarks
- Why it matters: Standard benchmarks show a dominant “general factor,” yet models fail on basic cognitive primitives; better diagnostics can guide training and predict failure modes (state loss, hallucination).
- Representative papers: A Neuropsychologically Grounded Evaluation of LLM Cognitive Abilities (2603.02540); SpatialText (2603.03002); Multimodal Item Response Theory (2603.02663).
- Common approach:
- Adapt human cognitive tests (RAPM/SWM/WCST) with process-aware metrics (perseveration, failure-to-maintain-set, structural errors).
- Isolate specific cognition (egocentric/allocentric transforms; global spatial consistency) in text-only settings.
- Use psychometrics (multidimensional IRT + adaptive testing) to identify high-signal items and reduce evaluation cost.
- Open questions / failure modes:
- Whether neuropsych constructs transfer cleanly from humans to LLMs; limited sample sizes for expensive modalities.
- Persistent failures in working memory/state tracking and egocentric transformations; reasoning modes can sometimes hurt.
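The adaptive-testing mechanism that lets IRT-style evaluation get by with a fraction of the items can be sketched with a standard 2PL model: pick the unanswered item with the highest Fisher information at the current ability estimate. This is textbook IRT, not M3IRT's multidimensional variant; item parameters below are made up.

```python
import math

def p_2pl(theta, a, b):
    """2PL item response model: P(correct | ability theta)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta: a^2 * p * (1 - p)."""
    p = p_2pl(theta, a, b)
    return a * a * p * (1.0 - p)

def pick_next_item(theta, items):
    """Adaptive testing: choose the item (a, b) most informative at theta."""
    return max(items, key=lambda ab: item_information(theta, *ab))

# A discriminating item matched to the test-taker's level beats both a
# low-discrimination item and a far-too-hard item.
bank = [(1.5, 0.0), (0.5, 0.0), (1.5, 3.0)]
next_item = pick_next_item(0.0, bank)
```

Selecting high-information items at each step is why a well-fit IRT model can reconstruct model rankings from roughly a tenth of the benchmark.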
3) Technical synthesis
- Multiple papers converge on trajectory-level supervision and scoring: PAE (integrity invariants), GLEAN (stepwise guideline evidence), MOSAIC (pairwise trajectory preferences), and AgentAssay (behavioral fingerprints + sequential tests) all treat agent behavior as a distribution over traces, not a single output.
- Gating is emerging as a unifying safety pattern: PAE gates utility on integrity; SaFeR-ToolKit constrains tool-transition topologies; CAT gates activation steering by Mahalanobis/OOD; CDI gates behavior via step-specific privacy guidance.
- LLM-as-judge is pervasive but increasingly instrumented: SteerEval uses gpt-4.1-mini scoring; MOSAIC notes positional bias; PAE reports manual validation precision; GLEAN uses token-prob YES/NO ratings plus Bayesian calibration.
- Representation-level alignment is gaining traction: Causal-GRPO targets persistence of malicious intent representations; MoD-DPO explicitly shapes modality sensitivity/invariance; steering-corruption work analyzes how dataset poisoning rotates/shrinks activation directions.
- Operational robustness is being formalized: model switching drift uses paired deltas + bootstrap CIs and factorization; AgentAssay frames regressions as hypothesis tests with SPRT and multivariate Hotelling T² fingerprints.
- Benchmarks are becoming “live” and updateable (LiveAgentBench, EverWebQA) to resist staleness/contamination, while psychometric methods (M3IRT) aim to keep evaluation compact and high-signal.
- Tooling and determinism are treated as first-class: REGAL pushes deterministic telemetry computation upstream and compiles bounded MCP tools; V-GEMS externalizes counting and state; BeyondSWE uses Dockerized reproducibility.
- Safety and privacy are increasingly trained against adaptive adversaries: CDI uses search-optimized attackers to generate failure trajectories; TAO-Attack improves optimization-based jailbreaks; EXPGUARD+ adds domain jailbreaks.
- “Reasoning” is not monotonic: NeuroCognition finds disabling reasoning can improve RAPM text MC; RLM reproduction shows deeper recursion harms accuracy and explodes latency/cost.
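The SPRT-based regression verdict mentioned above can be sketched on a Bernoulli pass rate. This is Wald's classical SPRT with hypothetical rates, not AgentAssay's multivariate fingerprint test: H0 is the healthy rate `p0`, H1 the regressed rate `p1`, and the test returns a three-valued verdict until the evidence crosses a boundary.

```python
import math

def sprt(successes, trials, p0, p1, alpha=0.05, beta=0.05):
    """Sequential probability ratio test on a Bernoulli pass rate.
    H0: pass rate p0 (healthy); H1: pass rate p1 < p0 (regressed).
    Returns 'pass', 'fail', or 'inconclusive' (keep sampling)."""
    a = math.log(beta / (1 - alpha))      # accept-H0 boundary (log LR below)
    b = math.log((1 - beta) / alpha)      # accept-H1 boundary (log LR above)
    llr = (successes * math.log(p1 / p0)
           + (trials - successes) * math.log((1 - p1) / (1 - p0)))
    if llr >= b:
        return "fail"          # evidence favors the regressed rate p1
    if llr <= a:
        return "pass"          # evidence favors the healthy rate p0
    return "inconclusive"
```

The sequential boundaries are what deliver the token savings: most runs terminate after a handful of trials instead of a fixed sample size, and "inconclusive" is an explicit first-class verdict rather than a silent flake.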
4) Top 5 papers (with “why now”)
1) Beyond Task Completion: Revealing Corrupt Success in LLM Agents through Procedure-Aware Evaluation
- Introduces Procedure-Aware Evaluation with explicit Read/Write/Communicate decomposition and consistency checks.
- Shows 27–78% of τ-bench “successes” are corrupt; gated utility can collapse (e.g., 0.68→0.16 for Mistral Retail).
- Provides model-specific integrity failure signatures and manual validation of judge precision (~93–95%).
- Skepticism: depends on explicit policies/Octx and LLM-judge semantics; binary gating may be too coarse for real risk tiers.
2) LiveAgentBench: Comprehensive Benchmarking of Agentic Systems Across 104 Real-World Challenges
- Real-user-derived, tool-dependent, multimodal tasks with closed-form verification (string matching; no judge model).
- Quantifies the gap: LLMs ≈13.48%, agents better but still far from human 69.25% (e.g., Manus 35.29%).
- Surfaces concrete blockers: tool instability and missing environment background knowledge.
- Skepticism: current scope is Chinese-language concentrated; converting queries to closed tasks can introduce unnatural artifacts.
3) Contextualized Privacy Defense for LLM Agents
- Proposes CDI: step-specific privacy guidance injected after tool results, not just static prompting or blocking.
- Uses adversarial failure trajectories + GRPO; optimized CDI reaches PP 94.2 / HS 80.6 / AD 86.5 on unseen tests.
- Demonstrates that optimizing only privacy can overprotect; staged PP→AD warmup matters.
- Skepticism: evaluation is simulation-based with synthetic configurations and LLM judges; real deployment transfer is unproven.
4) AgentAssay: Token-Efficient Regression Testing for Non-Deterministic AI Agent Workflows
- Formalizes stochastic regression testing with Pass/Fail/Inconclusive semantics and sequential testing (SPRT).
- Uses behavioral fingerprint vectors + Hotelling T² to boost power; reports ~78% fewer trials and large power gains (univariate 0% → fingerprint+SPRT ~86% in one setting).
- Practical CI/CD integration (pytest plugin; trace-first offline analysis enabling some checks at $0).
- Skepticism: assumes i.i.d. trials; evaluator stochasticity and provider drift can violate assumptions.
5) MoD-DPO: Towards Mitigating Cross-modal Hallucinations in Omni LLMs using Modality Decoupled Preference Optimization
- Adds modality-aware KL regularizers for invariance/sensitivity plus Language-Prior Debiasing to reduce text-only shortcuts.
- Reports strong gains on AVHBench (e.g., 88.19% for Qwen 2.5 Omni + MoD-DPO++) and improvements on CMM and general benchmarks.
- Provides a scalable synthetic preference dataset (18,112 samples over 10,854 videos).
- Skepticism: relies on synthetic preferences and stop-gradient approximations; extra forward passes increase cost and hyperparameter sensitivity is noted.
5) Practical next steps
- Add procedure-aware gating to your agent evals: log read/write/communicate events and disqualify “success” when integrity invariants fail (PAE-style), then track the delta vs outcome-only success.
- Stand up a switch-matrix handoff test for any multi-model routing/upgrade plan; compute paired deltas with bootstrap CIs and monitor prefix-influence/suffix-susceptibility factors.
- For stochastic agents, adopt three-valued regression verdicts + SPRT and store traces for trace-first offline checks to cut CI token cost.
- If using activation steering, treat the steering dataset as security-critical: test robustness under coordinated behavior injection and consider robust mean estimation (Lee–Valiant) rather than raw means.
- For privacy in tool-using agents, prototype a post-tool-result instructor (CDI-like) and train it on adversarially discovered failure prefixes; measure PP/HS/AD trade-offs and cold-start behavior.
- For multimodal systems, evaluate cross-modal hallucination with modality corruption tests and consider preference objectives that explicitly enforce invariance/sensitivity (MoD-DPO-style) rather than only response-level preferences.
- Use “live” agent benchmarks (or internal equivalents) that include tool instability and environment knowledge; track failure causes separately (execution failure vs reasoning vs missing info).
- Expand cognitive diagnostics beyond standard benchmarks: add at least one working-memory/state task (SWM-like) and one flexibility task (WCST-like) with process metrics to catch “trivial-for-humans” failures.
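The first next step (PAE-style integrity gating) can be sketched concretely. The invariant and event schema below are hypothetical illustrations of the pattern, not PAE's actual checks: success counts only when the outcome is correct AND every integrity invariant holds over the logged events.

```python
def gated_success(outcome_success, events, invariants):
    """A run counts as success only if the terminal outcome is correct AND
    every integrity invariant holds over the logged trajectory events."""
    return outcome_success and all(inv(events) for inv in invariants)

def no_fabricated_claims(events):
    """Hypothetical invariant: everything communicated to the user must
    have actually been read from the environment, not invented."""
    read = {e["value"] for e in events if e["kind"] == "read"}
    said = {e["value"] for e in events if e["kind"] == "communicate"}
    return said <= read

events = [
    {"kind": "read", "value": "order#123 shipped"},
    {"kind": "communicate", "value": "order#123 shipped"},
    {"kind": "communicate", "value": "refund issued"},  # never read: fabricated
]
ok = gated_success(True, events, [no_fabricated_claims])  # gated to False
```

Tracking the delta between outcome-only success and gated success over time gives a direct estimate of how much "corrupt success" your agent is accumulating.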
Generated from per-paper analyses; no external browsing.
