Daily AI Paper Report (2026-03-02)

Published: March 02, 2026

Chinese version: [中文]

Run stats

Candidates: 262
Selected: 30
Deepread completed: 30
Window (UTC): 2026-02-26T01:00:00Z → 2026-02-28T01:00:00Z (arxiv_announce, expanded=1)

Show selected papers

arXiv ID	Title / Links	Categories	Score	Why	Tags
`2602.23329`	LLM Novice Uplift on Dual-Use, In Silico Biology Tasks PDF	cs.AI, cs.CL, cs.CR, cs.CY, cs.HC	96	Human uplift study on biosecurity-relevant tasks; quantifies dual-use risk from LLM access	dual-use, biosecurity, human-uplift, evaluation, risk-assessment, LLMs
`2602.22755`	AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors PDF	cs.CL	95	Benchmark of hidden misalignment behaviors + agentic auditing; strong for real-world evals	alignment auditing, benchmarks, hidden behaviors, agent tools, evaluation, deception
`2602.22724`	AgentSentry: Mitigating Indirect Prompt Injection in LLM Agents via Temporal Causal Diagnostics and Context Purification PDF	cs.CR, cs.AI	93	Directly targets indirect prompt injection in tool/RAG agents with inference-time mitigation	agent security, prompt injection, tool use, RAG safety, inference-time defense, causal diagnostics
`2602.22525`	Systems-Level Attack Surface of Edge Agent Deployments on IoT PDF	cs.CR	93	Empirical security analysis of edge LLM agents; concrete attack surfaces + measurable security metrics.	agent-security, edge-deployment, IoT, attack-surface, systems-security, provenance, MQTT
`2602.22787`	Probing for Knowledge Attribution in Large Language Models PDF	cs.CL, cs.AI	92	Probe predicts whether outputs rely on prompt vs internal knowledge; useful for hallucination mitigation	hallucinations, attribution, faithfulness, factuality, probes, interpretability
`2602.22953`	General Agent Evaluation PDF	cs.AI	91	Proposes unified protocol + framework for general-agent evaluation; high reuse for benchmarking agents.	agent-evaluation, benchmarks, protocols, framework, general-agents, tool-use
`2602.22557`	CourtGuard: A Model-Agnostic Framework for Zero-Shot Policy Adaptation in LLM Safety PDF	cs.AI, cs.LG	90	Model-agnostic zero-shot safety policy adaptation via RAG multi-agent debate over policies	policy compliance, RAG, multi-agent debate, governance, safety evaluation, zero-shot
`2602.22576`	Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training PDF	cs.CL, cs.IR, cs.LG	90	Reward shaping for agentic RAG RL; extracts signal from failures, improves stability/sample efficiency	agentic-RAG, reinforcement-learning, reward-shaping, retrieval, training, efficiency
`2602.22603`	SideQuest: Model-Driven KV Cache Management for Long-Horizon Agentic Reasoning PDF	cs.AI, cs.LG	90	LRM-driven KV-cache compression for long-horizon agents; targets real bottleneck in agentic reasoning.	agents, long-context, kv-cache, memory, efficiency, reasoning, systems
`2602.22775`	TherapyProbe: Generating Design Knowledge for Relational Safety in Mental Health Chatbots Through Adversarial Simulation PDF	cs.HC, cs.AI, cs.CL	89	Multi-turn relational safety failures in therapy chatbots via adversarial simulation method	mental health, conversational safety, multi-turn evaluation, red teaming, agent simulation
`2602.22675`	Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization PDF	cs.CL	88	Agentic search framework emphasizing parallel evidence over deep reasoning; targets cost+generalization	agents, search, efficiency, long-horizon, generalization, evidence-gathering
`2602.22897`	OmniGAIA: Towards Native Omni-Modal AI Agents PDF	cs.AI, cs.CL, cs.CV, cs.LG, cs.MM	88	Omni-modal agent benchmark requiring multi-turn tool use across video/audio/image; likely impactful eval.	multimodal, agents, benchmark, tool-use, evaluation, long-horizon, reasoning
`2602.22556`	Stable Adaptive Thinking via Advantage Shaping and Length-Aware Gradient Regulation PDF	cs.LG, cs.AI, cs.CL	88	RL method to reduce overthinking while preserving hard-query reasoning; practical accuracy/latency tradeoff.	reasoning, test-time, rl, efficiency, adaptive-compute, alignment-adjacent
`2602.22968`	Certified Circuits: Stability Guarantees for Mechanistic Circuits PDF	cs.AI, cs.CV, cs.CY	87	Provable stability guarantees for mechanistic circuit discovery; improves interpretability rigor	mechanistic interpretability, circuits, certification, robustness, auditing
`2602.22554`	Multilingual Safety Alignment Via Sparse Weight Editing PDF	cs.LG	86	Training-free sparse weight editing to reduce multilingual safety gaps; high practical value	multilingual safety, weight editing, alignment, low-resource languages, guardrails
`2602.22638`	MobilityBench: A Benchmark for Evaluating Route-Planning Agents in Real-World Mobility Scenarios PDF	cs.AI	86	Real-world route-planning agent benchmark with deterministic API-replay sandbox for reproducibility	benchmarks, agents, tool-use, evaluation, reproducibility, mobility
`2602.23271`	Evaluating Stochasticity in Deep Research Agents PDF	cs.AI	86	Formalizes and measures stochasticity/variance in research agents; targets deployment reliability.	agents, reliability, evaluation, stochasticity, variance, MDP, citations
`2602.22719`	Interpreting and Steering State-Space Models via Activation Subspace Bottlenecks PDF	cs.LG	86	Mechanistic interp for Mamba SSMs + simple test-time steering via bottleneck scaling; broad gains claimed.	interpretability, steering, state-space-models, mamba, mechanistic, test-time
`2602.23136`	Modality Collapse as Mismatched Decoding: Information-Theoretic Limits of Multimodal LLMs PDF	cs.CL, cs.AI, cs.LG	85	Info-theoretic account of modality collapse as mismatched decoding; useful theory for multimodal LLMs.	multimodal-llm, information-theory, decoding, modality-collapse, representation, GMI
`2602.22769`	AMA-Bench: Evaluating Long-Horizon Memory for Agentic Applications PDF	cs.AI, cs.LG	84	New benchmark for long-horizon agent memory on machine-generated trajectories, not chat	agent memory, benchmarks, long-horizon, evaluation, trajectories
`2602.23193`	ESAA: Event Sourcing for Autonomous Agents in LLM-Based Software Engineering PDF	cs.AI	84	Event-sourcing architecture for LLM agents: structured intentions + deterministic orchestrator/state	agent-architecture, state, reliability, software-engineering, orchestration, logging
`2602.22871`	Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching PDF	cs.CL, cs.AI	84	Step-level PRM scoring + stitching across diffusion CoTs; improves test-time scaling beyond voting.	reasoning, diffusion-lm, process-reward-model, self-consistency, test-time-scaling
`2602.23200`	InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models PDF	cs.LG, cs.CL	83	Hardware-aware KV-cache quantization reducing latency/memory for long-context decoding; practical gain.	efficiency, KV-cache, quantization, long-context, inference, hardware-aware
`2602.23258`	AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning PDF	cs.AI, cs.CL	82	Test-time rectify-or-reject pruning to stop error cascades in multi-agent systems	multi-agent systems, robustness, test-time methods, error correction, RAG
`2602.22585`	Correcting Human Labels for Rater Effects in AI Evaluation: An Item Response Theory Approach PDF	cs.AI, cs.LG	82	IRT/Rasch correction for rater effects in human evals; improves validity of AI evaluation conclusions	evaluation, human-ratings, psychometrics, item-response-theory, measurement, RLHF
`2602.22642`	Compress the Easy, Explore the Hard: Difficulty-Aware Entropy Regularization for Efficient LLM Reasoning PDF	cs.LG	82	Difficulty-aware entropy regularization to compress CoT without entropy collapse; targets efficient reasoning.	reasoning, CoT, efficiency, entropy-regularization, inference-cost, exploration
`2602.22758`	Decomposing Physician Disagreement in HealthBench PDF	cs.AI, stat.AP	82	Analyzes physician disagreement sources in HealthBench; highlights irreducible uncertainty in eval labels.	evaluation, medical-ai, uncertainty, benchmarking, human-judgment, reliability
`2602.22584`	Towards Faithful Industrial RAG: A Reinforced Co-adaptation Framework for Advertising QA PDF	cs.CL	80	Industrial RAG reliability: joint retrieval+generation optimization to reduce hallucinated URLs	RAG, hallucinations, faithfulness, reinforcement learning, industrial deployment
`2602.23079`	Assessing Deanonymization Risks with Stylometry-Assisted LLM Agent PDF	cs.CL, cs.CR, cs.LG	80	Stylometry+LLM agent for authorship inference; highlights and mitigates deanonymization/privacy risks	privacy, deanonymization, stylometry, LLM-agents, security, authorship-attribution
`2602.23262`	Decomposing Private Image Generation via Coarse-to-Fine Wavelet Modeling PDF	cs.CV, cs.CR	80	DP image generation via wavelet coarse-to-fine; aims to preserve quality while improving privacy guarantees.	privacy, differential-privacy, image-generation, wavelets, memorization, DP-SGD

AI Paper Insight Brief

2026-03-02

0) Executive takeaways (read this first)

Agent safety is shifting “down the stack”: multiple papers show that systems architecture (edge IoT deployment, event-sourced orchestration, KV-cache/memory management) can dominate risk and reliability, often bypassing prompt/model-level defenses.
Inference-time, model-agnostic safety is getting sharper: retrieval-grounded policy adjudication (CourtGuard) and counterfactual causal diagnostics for indirect prompt injection (AgentSentry) both report strong results without weight updates—at the cost of extra inference.
RL for agents is moving from sparse outcome rewards to structured process signals: path-centric reward shaping for agentic RAG (Search-P1) and difficulty-aware entropy/length control for reasoning compression (CEEH) target stability and sample efficiency failures in GRPO-style training.
Evaluation is becoming more “operational”: new benchmarks/harnesses emphasize reproducibility and decomposition (MobilityBench API replay; AuditBench for hidden behaviors; AMA-Bench for agent memory; General Agent Evaluation’s Unified Protocol), plus work quantifying evaluator noise (IRT rater effects; physician disagreement decomposition).
Compute efficiency for long-horizon agents is now a first-class research axis: semantic KV eviction (SideQuest) and hardware-aware KV quantization (InnerQ) report large throughput/latency gains with limited accuracy loss, directly enabling longer agent horizons under fixed budgets.
Dual-use risk is being measured in humans, not just models: a long-form uplift study finds LLM access makes novices substantially more accurate on biosecurity-relevant in silico tasks (OR 4.16), and most users report no difficulty with safeguards.

2) Key themes (clusters)

Theme: Systems-level agent security & governance (beyond prompts)

Why it matters: Tool-using agents expand trust boundaries; deployment choices (edge vs cloud, orchestration/auditing transport, immutable logs) can create bypasses and blind spots even if the model is “aligned.”
Representative papers:
Common approach:
- Treat agent security as architecture + protocol problems (MQTT as C2 plane; tool-return boundaries as control points; intention/effect separation).
- Add auditable structure (provenance fields, append-only event logs, deterministic replay + hashing).
- Use inference-time gating around tool use (authorization policies; purification/rewrites before high-impact actions).
Open questions / failure modes:
- MQTT-style coordination can accept spoofing/replay/direct publishes without cryptographic enforcement; provenance can be “present” but meaningless.
- Failover can create long monitoring gaps (e.g., measured 35.7s blackout) and silent sovereignty boundary crossings (DNS evidence only).
- Counterfactual defenses add overhead; out-of-scope runtime compromises (tool runtime tampering) remain.

Theme: Dynamic policy enforcement & auditing hidden behaviors

Why it matters: Safety policies change faster than models; auditing must work even when models don’t confess and when tools that look good in isolation don’t help end-to-end investigators.
Representative papers:
- CourtGuard: A Model-Agnostic Framework for Zero-Shot Policy Adaptation in LLM Safety
- AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors
Common approach:
- Decouple safety logic from weights via retrieval-grounded reasoning over policy corpora (RAG + debate + judge).
- Evaluate auditing as an agentic task with tool configurations, not just static probes.
- Stress targets with anti-confession training (KTO vs SFT) and multiple behavior instillation methods (TD vs SDF).
Open questions / failure modes:
- Inference latency/cost for debate + retrieval; formatting brittleness on smaller backbones.
- “Tool-to-agent gap”: evidence surfaced by tools may not translate into correct hypotheses for an investigator agent.
- Auditability is highly sensitive to how the hidden behavior was trained (easiest: SDF+SFT; hardest: TD+KTO).

Theme: RL stabilization for reasoning/agentic RAG (process signals over sparse outcomes)

Why it matters: GRPO/RLVR pipelines can collapse (mode collapse to no-thinking; entropy collapse under length pressure; slow learning under sparse rewards). Process-aware shaping is emerging as the fix.
Representative papers:
Common approach:
- Add trajectory/process rewards (plan execution ratios; reference step coverage; partial credit for “good path, wrong answer”).
- Stabilize RL under length heterogeneity via advantage shaping and length-aware gradient weighting.
- Make exploration/compression instance-dependent (hard questions get stronger entropy regularization; easy ones get stronger compression).
Open questions / failure modes:
- Reliance on external LLM judges/evaluators during training (cost, bias, brittleness).
- Translation-built safety/benchmark data can introduce artifacts (relevant for multilingual and some RAG settings).
- Domain transfer beyond verifiable math/QA remains constrained by availability of trustworthy rewards/verifiers.

Theme: Long-horizon agent efficiency (KV cache, memory, and search parallelism)

Why it matters: Long-horizon agents are often memory-bandwidth bound (KV reads) and context-budget bound; efficiency improvements directly expand feasible autonomy and reduce cost.
Representative papers:
Common approach:
- Replace heuristics with semantic/model-driven decisions (LLM decides which tool outputs to evict; parallel auxiliary thread).
- Hardware co-design for decode: inner-dimension grouping + 2-bit KV quantization + sink/recent high-precision windows.
- Shift scaling from sequential deliberation to parallel evidence acquisition and structured context resets.
- Benchmark memory on machine-generated, causally grounded trajectories, not just dialogue/doc QA.
Open questions / failure modes:
- Small fine-tuning sets can cause OOD degradation (SideQuest reports up to 5% on BrowseComp).
- Quantization results shown on limited tasks/models (e.g., GSM8K few-shot; specific GPUs).
- Memory construction loss and retrieval unreliability compound over horizons (needle protocol drops in AMA-Bench).

Theme: Evaluation reliability & reproducibility (humans, APIs, and protocols)

Why it matters: If evaluation is noisy or non-reproducible, optimization targets drift; agent comparisons become artifacts of raters, live APIs, or protocol mismatches.
Representative papers:
Common approach:
- Make tool environments reproducible via API replay sandboxes and schema validation.
- Standardize cross-benchmark execution via canonical task/context/action protocols and adapters.
- Model human label noise explicitly (MFRM rater severity/thresholds; mixed models/ICCs for disagreement).
Open questions / failure modes:
- Human disagreement is largely case-specific/residual (HealthBench residual 81.8% for labels), limiting achievable “ground truth.”
- Rater-model estimability requires overlap/linkage; short scales constrain IRT robustness.
- Tool-count limits and protocol constraints can dominate outcomes (e.g., GPT 5.2 tool cap vs AppWorld’s 468 tools).

3) Technical synthesis

Boundary control is converging: AgentSentry’s tool-return boundary diagnostics, ESAA’s intention/effect boundary, and edge IoT’s MQTT boundary all treat “where state crosses trust domains” as the key security lever.
GRPO is the common substrate, but papers diverge on how to fix its pathologies: CPAS/LAGR target length heterogeneity and mode collapse; Search-P1 densifies reward via plan/path scoring; CEEH targets entropy collapse via difficulty-aware entropy.
“Process supervision” is being operationalized without full supervision: Search-P1 uses offline reference planners + step coverage; diffusion stitching uses PRM step scores; industrial RAG uses multi-dimensional rewards including URL validity checks.
RAG is splitting into two concerns: (i) retrieval quality/coverage (GraphRAG + parallel channels; agentic multi-step search), and (ii) faithful use of evidence (URL validity, faithfulness rewards, and knowledge attribution probes).
Agent reliability is increasingly measured as variance, not just mean: stochasticity metrics for deep research agents (total variance over answers/findings/citations) complement success-rate leaderboards and highlight early-step randomness amplification.
Memory and KV cache are treated as first-class optimization targets: SideQuest reduces peak token utilization and KV reads; InnerQ targets decode-phase matmul layout to reduce latency, not just memory footprint.
Evaluation infrastructure is becoming a research contribution: deterministic replay (MobilityBench), unified protocol harnesses (Exgentic), and auditing benchmarks with non-confessing targets (AuditBench) aim to prevent “benchmark overfitting to quirks.”
Human factors are now part of capability/risk measurement: bio uplift shows novices improve with LLMs but can still underperform LLM-only baselines; safety governance needs to model human–LLM systems, not models alone.
Interpretability is branching: from neuron/subspace editing for multilingual safety (sparse weight edits) to formal robustness guarantees for circuits (certified circuits) to SSM-specific bottleneck steering (Mamba subspace bottlenecks).

4) Top 5 papers (with “why now”)

1) AgentSentry: Mitigating Indirect Prompt Injection in LLM Agents via Temporal Causal Diagnostics and Context Purification

Introduces a counterfactual, temporal causal framing of multi-turn indirect prompt injection at tool-return boundaries.
Reports ASR = 0% across three IPI families and three black-box LLMs on AgentDojo, with substantial utility retained (UA reported per model).
Ablations show sanitized counterfactuals and temporal diagnostics are critical (removing them raises ASR to ~22% on a suite).
Skepticism: overhead can be large in general (up to four regimes per boundary with sampling); benchmark settings use lightweight K/w.

2) CourtGuard: A Model-Agnostic Framework for Zero-Shot Policy Adaptation in LLM Safety

Makes safety policy-updatable at inference time via RAG-grounded adversarial debate and citation-based adjudication.
Strong reported benchmark performance (macro Acc 0.87 / F1 0.86) and high recall on a human-verified suite.
Demonstrates policy/domain swapping (Wikipedia vandalism) without weight updates.
Skepticism: increased inference latency/cost; performance bounded by policy corpus coverage and backbone formatting adherence.

3) AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors

Provides 56 non-confessing target models across 14 hidden behaviors with varied instillation and anti-confession training.
Finds scaffolded black-box tools best in agentic auditing and surfaces a tool-to-agent gap (static tool signal ≠ agent success).
Shows auditability depends strongly on training recipe (TD+KTO hardest).
Skepticism: targets are LoRA fine-tunes with explicit behaviors; may be easier to audit than diffuse emergent behaviors.

4) Systems-Level Attack Surface of Edge Agent Deployments on IoT

Empirically demonstrates that edge-local swarms can reduce egress (0 external IPs/bytes in a session) but introduce MQTT C2 vulnerabilities and failover blind spots.
Shows provenance metadata can be trivially bypassed (missing/spoofed sender, replay, direct safety-topic publish accepted).
Measures concrete operational metrics (actuation-to-audit latency; blackout windows; DNS-only sovereignty crossings).
Skepticism: single small testbed; no mitigation prototypes evaluated; cloud comparison not workload-matched.

5) LLM Novice Uplift on Dual-Use, In Silico Biology Tasks

Human-subject evidence that LLM access yields 4.16× higher novice accuracy vs internet-only across bio task suites (7/8 benchmarks).
Highlights a governance-relevant nuance: LLM-only often beats LLM-assisted novices, so uplift depends on user strategy/task structure.
Reports most users had no difficulty overcoming safeguards, relevant for dual-use risk assessment.
Skepticism: not double-blind; model availability changed mid-study; confined to in silico tasks (wet-lab translation unknown).

5) Practical next steps

For agent security: add cryptographic enforcement/ACLs to agent coordination planes (e.g., MQTT) and measure whether provenance becomes non-bypassable under adversarial publish/replay.
Instrument sovereignty boundaries: treat “fallback to cloud inference” as a security event; log and alert on DNS/API boundary crossings and correlate with agent-level traces.
Adopt boundary-anchored defenses: prototype AgentSentry-style tool-return checks (even simplified) and measure ASR/utility trade-offs under multi-turn IPI.
Make policy updates operational: stand up a CourtGuard-like policy RAG store for your org’s governance docs; measure latency and failure modes on smaller backbones (formatting/parsing).
Train agents with process rewards: if using GRPO/RLVR, test path-centric or difficulty-aware shaping (Search-P1/CEEH ideas) and explicitly monitor entropy/mode-collapse indicators.
Optimize long-horizon cost: evaluate SideQuest-like semantic eviction and/or InnerQ-like KV quantization on your agent workloads; track KV reads, throughput, and task completion rates.
Benchmark memory realistically: run your memory system on agent-trajectory benchmarks (AMA-Bench-style) and include needle protocols to quantify construction loss vs retrieval loss.
Harden evaluation pipelines: where humans rate outputs, consider IRT/MFRM adjustments; where tools/APIs are involved, prefer replayable sandboxes (MobilityBench pattern) to reduce variance.

Generated from per-paper analyses; no external browsing.

Di Tang

AI Paper Insight Brief

2026-03-02

0) Executive takeaways (read this first)

2) Key themes (clusters)

Theme: Systems-level agent security & governance (beyond prompts)

Theme: Dynamic policy enforcement & auditing hidden behaviors

Theme: RL stabilization for reasoning/agentic RAG (process signals over sparse outcomes)

Theme: Long-horizon agent efficiency (KV cache, memory, and search parallelism)

Theme: Evaluation reliability & reproducibility (humans, APIs, and protocols)

3) Technical synthesis

4) Top 5 papers (with “why now”)

5) Practical next steps