Daily AI Paper Report (2026-03-01)
Run stats
- Candidates: 262
- Selected: 30
- Deepread completed: 30
- Window (UTC): 2026-02-26T01:00:00Z → 2026-02-28T01:00:00Z (arxiv_announce, expanded=1)
| arXiv ID | Title / Links | Categories | Score | Why | Tags |
|---|---|---|---|---|---|
| 2602.23329 | LLM Novice Uplift on Dual-Use, In Silico Biology Tasks | cs.AI, cs.CL, cs.CR, cs.CY, cs.HC | 96 | Careful human study shows large LLM uplift on bio dual-use tasks; key for risk assessment. | dual-use, biosecurity, human-uplift, evaluation, misuse-risk |
| 2602.22755 | AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors | cs.CL | 95 | Benchmark of hidden misalignment behaviors + agentic auditing tools; strong for eval & oversight. | alignment auditing, benchmarks, hidden behaviors, agent evaluators, model honesty |
| 2602.22724 | AgentSentry: Mitigating Indirect Prompt Injection in LLM Agents via Temporal Causal Diagnostics and Context Purification | cs.CR, cs.AI | 93 | Directly targets indirect prompt injection in agents with trajectory-aware diagnostics + mitigation. | agent security, prompt injection, tool outputs, inference-time defense, causal diagnostics |
| 2602.22525 | Systems-Level Attack Surface of Edge Agent Deployments on IoT | cs.CR | 93 | Empirical security analysis of edge LLM agents; concrete attack surfaces + measurable security metrics. | agent-security, edge-deployment, IoT, attack-surface, systems-security, provenance |
| 2602.22603 | SideQuest: Model-Driven KV Cache Management for Long-Horizon Agentic Reasoning | cs.AI, cs.LG | 92 | LRM-driven KV cache compression for long-horizon agents; targets real bottleneck in agentic reasoning. | agents, long-context, memory, KV-cache, efficiency, reasoning |
| 2602.22557 | CourtGuard: A Model-Agnostic Framework for Zero-Shot Policy Adaptation in LLM Safety | cs.AI, cs.LG | 91 | Model-agnostic zero-shot safety policy adaptation via RAG multi-agent debate grounded in policies. | policy compliance, RAG, multi-agent debate, governance, safety evaluation |
| 2602.22787 | Probing for Knowledge Attribution in Large Language Models | cs.CL, cs.AI | 91 | Probe predicts whether outputs rely on prompt vs internal knowledge; useful for hallucination control. | hallucinations, attribution, faithfulness, factuality, interpretability |
| 2602.22953 | General Agent Evaluation | cs.AI | 91 | Proposes unified protocol + framework for general agent evaluation; addresses benchmark integration bias. | agent-evaluation, benchmarks, general-agents, protocols, reproducibility |
| 2602.22775 | TherapyProbe: Generating Design Knowledge for Relational Safety in Mental Health Chatbots Through Adversarial Simulation | cs.HC, cs.AI, cs.CL | 90 | Adversarial multi-agent simulation to surface long-horizon relational safety failures in therapy bots. | mental health, conversational safety, multi-turn evaluation, red teaming, agent simulation |
| 2602.22576 | Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training | cs.CL, cs.IR, cs.LG | 89 | Reward shaping for agentic RAG RL improves sample efficiency using trajectory-level signals. | agentic-RAG, reinforcement-learning, reward-shaping, retrieval, reasoning |
| 2602.22897 | OmniGAIA: Towards Native Omni-Modal AI Agents | cs.AI, cs.CL, cs.CV, cs.LG, cs.MM | 89 | Omni-modal agent benchmark (audio+video+image+tools) with event-graph construction; high reuse potential. | multimodal, agents, benchmark, tool-use, evaluation, long-horizon |
| 2602.22556 | Stable Adaptive Thinking via Advantage Shaping and Length-Aware Gradient Regulation | cs.LG, cs.AI, cs.CL | 89 | RL framework to curb overthinking while preserving correctness; practical for reliable reasoning models. | reasoning, RL, efficiency, adaptive-compute, alignment, robustness |
| 2602.22554 | Multilingual Safety Alignment Via Sparse Weight Editing | cs.LG | 88 | Training-free sparse weight editing to reduce multilingual safety gaps; practical alignment lever. | multilingual safety, weight editing, safety neurons, alignment, low-resource languages |
| 2602.22675 | Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization | cs.CL | 87 | Agentic search framework prioritizing parallel evidence over deep reasoning; targets cost and generalization. | agents, search, efficiency, long-horizon, deep-research |
| 2602.23271 | Evaluating Stochasticity in Deep Research Agents | cs.AI | 87 | Formalizes and measures stochasticity/variance in deep research agents; identifies sources via MDP framing. | agents, evaluation, stochasticity, reliability, research-agents, variance |
| 2602.22769 | AMA-Bench: Evaluating Long-Horizon Memory for Agentic Applications | cs.AI, cs.LG | 86 | AMA-Bench evaluates long-horizon agent memory on real agent trajectories beyond dialogue setups. | agent memory, benchmarks, long-horizon, evaluation, trajectories |
| 2602.23136 | Modality Collapse as Mismatched Decoding: Information-Theoretic Limits of Multimodal LLMs | cs.CL, cs.AI, cs.LG | 86 | Theory for multimodal 'modality collapse' as mismatched decoding; probes + info-theoretic limits (GMI). | multimodal-LLMs, information-theory, decoding, representation, robustness |
| 2602.22719 | Interpreting and Steering State-Space Models via Activation Subspace Bottlenecks | cs.LG | 86 | Mechanistic interpretability + test-time steering for Mamba/SSMs; notable gains via simple intervention. | interpretability, steering, SSM, Mamba, mechanistic, reliability |
| 2602.22968 | Certified Circuits: Stability Guarantees for Mechanistic Circuits | cs.AI, cs.CV, cs.CY | 85 | Provable stability guarantees for mechanistic circuit discovery; improves interpretability reliability. | mechanistic interpretability, circuits, certification, robustness, auditing |
| 2602.22638 | MobilityBench: A Benchmark for Evaluating Route-Planning Agents in Real-World Mobility Scenarios | cs.AI | 84 | Real-world route-planning benchmark with deterministic API-replay sandbox for reproducible agent eval. | agents, benchmark, tool-use, evaluation, sandbox |
| 2602.23200 | InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models | cs.LG, cs.CL | 84 | Hardware-aware KV-cache quantization reducing latency/memory for long-context decoding; practical impact. | LLM-efficiency, KV-cache, quantization, long-context, inference |
| 2602.22871 | Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching | cs.CL, cs.AI | 84 | Step-level PRM-guided stitching for diffusion LMs; improves test-time scaling beyond trace voting. | test-time-scaling, diffusion-LM, process-reward-model, reasoning, self-consistency |
| 2602.22689 | No Caption, No Problem: Caption-Free Membership Inference via Model-Fitted Embeddings | cs.CV, cs.CR | 82 | Caption-free membership inference for diffusion models; strengthens privacy auditing realism. | privacy, membership inference, diffusion models, data memorization, security |
| 2602.23193 | ESAA: Event Sourcing for Autonomous Agents in LLM-Based Software Engineering | cs.AI | 82 | Event-sourcing architecture for LLM agents: structured intentions + deterministic state/logging. | agents, software-engineering, state, reliability, orchestration |
| 2602.22642 | Compress the Easy, Explore the Hard: Difficulty-Aware Entropy Regularization for Efficient LLM Reasoning | cs.LG | 82 | Reasoning compression via difficulty-aware entropy regularization to avoid exploration collapse on hard tasks. | LLM-reasoning, CoT, efficiency, entropy-regularization, RL |
| 2602.22758 | Decomposing Physician Disagreement in HealthBench | cs.AI, stat.AP | 82 | Analyzes physician disagreement in HealthBench; highlights irreducible uncertainty in medical evals. | evaluation, medical-AI, uncertainty, human-judgment, benchmarks, reliability |
| 2602.23262 | Decomposing Private Image Generation via Coarse-to-Fine Wavelet Modeling | cs.CV, cs.CR | 81 | DP image generation via wavelet coarse-to-fine; targets privacy/utility tradeoff with spectral hypothesis. | privacy, differential-privacy, image-generation, wavelets, memorization |
| 2602.22699 | DPSQL+: A Differentially Private SQL Library with a Minimum Frequency Rule | cs.CR, cs.DB, cs.LG | 80 | DP SQL system enforcing minimum frequency rule; relevant for governance-grade privacy releases. | differential privacy, data governance, SQL, minimum frequency rule, privacy engineering |
| 2602.22585 | Correcting Human Labels for Rater Effects in AI Evaluation: An Item Response Theory Approach | cs.AI, cs.LG | 80 | Uses IRT/Rasch to correct rater effects in human eval; improves reliability of AI conclusions. | evaluation, human-raters, psychometrics, RLHF, measurement |
| 2602.22983 | Obscure but Effective: Classical Chinese Jailbreak Prompt Optimization via Bio-Inspired Search | cs.AI, cs.CR | 79 | Shows classical Chinese as jailbreak vector + automated black-box prompt search; useful for red-teaming. | jailbreaks, multilingual attacks, adversarial prompts, red teaming, prompt optimization |
AI Paper Insight Brief
2026-03-01
1) Executive takeaways (read this first)
- Agent safety is shifting from “prompt-level” to “systems-level”: edge IoT swarms show that coordination buses (MQTT), failover behavior, and silent cloud fallback can dominate real risk—even when model behavior is unchanged.
- Inference-time, policy-grounded safety is getting more updateable: CourtGuard demonstrates zero-shot policy swapping via RAG + adversarial debate with strong benchmark performance, suggesting a path to reducing “alignment lag” without retraining.
- Multi-turn agent attacks/defenses are becoming causal and temporal: AgentSentry reports 0% attack success on AgentDojo by localizing takeover at tool-return boundaries using counterfactual re-executions, then purifying only the untrusted mediator context to continue safely.
- Efficiency work is converging on “adaptive compute” with stability fixes: multiple papers tackle overthinking/long-horizon cost via GRPO stabilizers (CPAS/LAGR), difficulty-aware entropy regularization (CEEH), and step-level reuse (diffusion stitching) rather than blunt length penalties.
- Evaluation is maturing toward variance/noise-aware measurement: rater-effect correction (IRT) can change system rankings; HealthBench disagreement is mostly case-specific; deep-research agents show measurable run-to-run variance with module attribution and mitigation.
- Dual-use risk evidence is becoming more direct: a human study finds that LLM access yields 4.16× higher novice accuracy on in silico biology tasks, and most participants report little difficulty overcoming safeguards.
2) Key themes (clusters)
Theme: Tool-using agent security beyond prompts (systems + temporal defenses)
- Why it matters: As agents act through tools and physical devices, the main vulnerabilities increasingly come from coordination substrates, context persistence, and runtime fallbacks—not just prompt injection in a single turn.
- Representative papers:
- AgentSentry: Mitigating Indirect Prompt Injection in LLM Agents via Temporal Causal Diagnostics and Context Purification
- Systems-Level Attack Surface of Edge Agent Deployments on IoT
- ESAA: Event Sourcing for Autonomous Agents in LLM-Based Software Engineering
- Common approach:
- Treat safety properties as systems metrics (audit delay, provenance completeness, egress, failover windows) rather than purely model behavior.
- Insert boundary checks at tool-return / state-transition points, where untrusted content enters (see the sketch after this theme's bullets).
- Prefer auditable state kernels (append-only logs, deterministic replay, contracts) to reduce state drift and enable governance.
- Open questions / failure modes:
- How to harden coordination layers (e.g., MQTT) with cryptographic provenance/ACLs under edge constraints without breaking latency.
- Counterfactual diagnostics add overhead; unclear robustness on long-horizon delayed takeovers beyond current benchmarks.
- Event-sourcing kernels validate compliance/replay, but don’t directly measure software quality or broader security side channels.
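To ground the boundary-check idea, here is a condensed sketch of a counterfactual tool-return check in the spirit of AgentSentry's diagnostics. The paper compares four re-executions (orig/mask/mask_sanitized/orig_sanitized); this sketch keeps a single contrast, and `plan_next_action` and `purify` are hypothetical stand-ins, not the paper's interfaces:

```python
# Condensed sketch of a counterfactual tool-return boundary check.
# `plan_next_action` and `purify` are hypothetical stand-ins for an
# agent's planner and an evidence-only rewriter.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Diagnosis:
    takeover_suspected: bool
    safe_tool_output: str

def check_tool_return(plan_next_action: Callable[[str, str], str],
                      purify: Callable[[str], str],
                      context: str,
                      tool_output: str) -> Diagnosis:
    """If purifying the untrusted tool output flips the agent's next
    action, the non-evidential content was steering the agent; keep the
    purified version and continue instead of terminating the run."""
    cleaned = purify(tool_output)
    action_orig = plan_next_action(context, tool_output)
    action_clean = plan_next_action(context, cleaned)
    if action_orig != action_clean:
        return Diagnosis(takeover_suspected=True, safe_tool_output=cleaned)
    return Diagnosis(takeover_suspected=False, safe_tool_output=tool_output)
```

The design choice mirrors the paper's framing: flag a takeover only when untrusted content causally changes the next action, then purify and continue rather than abort.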
Theme: Dynamic, policy-grounded alignment and multilingual safety transfer
- Why it matters: Deployed policies change faster than retraining cycles, and safety gaps across languages remain exploitable; both push toward updateable, modular alignment.
- Representative papers:
- CourtGuard: A Model-Agnostic Framework for Zero-Shot Policy Adaptation in LLM Safety
- Multilingual Safety Alignment Via Sparse Weight Editing
- Obscure but Effective: Classical Chinese Jailbreak Prompt Optimization via Bio-Inspired Search
- Common approach:
- Ground safety decisions in retrieved policy text and produce auditable rationales (CourtGuard).
- Transfer safety via sparse, low-rank edits on identified “safety neurons,” avoiding full retraining (Sparse Weight Editing; see the sketch after this theme's bullets).
- Stress-test guardrails with distributionally shifted language and automated black-box optimization (Classical Chinese FOA search).
- Open questions / failure modes:
- RAG+debate evaluators buy accuracy at the cost of latency and compute; smaller backbones may fail to produce the required structured outputs.
- Weight edits may be brittle under adaptive jailbreaks or when safety representations differ strongly by language.
- Classical-Chinese jailbreak results suggest defenses tuned to modern-language surface forms may fail catastrophically.
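To illustrate the sparse-editing lever, the sketch below adds a targeted delta to a small set of rows of one weight matrix. Which rows count as "safety neurons" and how the delta is computed are the paper's contributions; both are placeholders here:

```python
# Illustrative sketch of training-free sparse weight editing. Row
# indices and the delta are placeholders, not the paper's method for
# locating safety neurons or computing the edit.
import torch

def sparse_safety_edit(weight: torch.Tensor,
                       safety_rows: list[int],
                       delta: torch.Tensor) -> torch.Tensor:
    """Add `delta` only on the selected rows, leaving every other
    parameter untouched; no retraining required."""
    edited = weight.clone()
    edited[safety_rows] += delta  # sparse, targeted update
    return edited

# Hypothetical usage: nudge two rows of an MLP projection matrix.
W = torch.randn(1024, 4096)
W_edited = sparse_safety_edit(W, safety_rows=[12, 407],
                              delta=0.1 * torch.randn(2, 4096))
```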
Theme: Stable efficiency scaling for reasoning and agentic RAG
- Why it matters: Serving cost is now a first-order constraint; naive length penalties can collapse exploration or accuracy. The trend is toward stability-aware efficiency mechanisms.
- Representative papers:
- Stable Adaptive Thinking via Advantage Shaping and Length-Aware Gradient Regulation
- Compress the Easy, Explore the Hard: Difficulty-Aware Entropy Regularization for Efficient LLM Reasoning
- Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training
- Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching
- Common approach:
- Modify GRPO/RLVR to avoid mode collapse under length heterogeneity (CPAS/LAGR; difficulty-aware entropy; see the sketch after this theme's bullets).
- Replace sparse/binary outcome rewards with trajectory/process signals (path-centric scoring; step-level PRM scoring).
- Reuse partial progress (stitching steps across diffusion traces) instead of selecting whole trajectories.
- Open questions / failure modes:
- Reliance on verifiable rewards (math/QA) limits domain transfer; verification for open-ended tasks remains hard.
- PRM misranking can discard crucial steps or over-trust incorrect anchors in stitching.
- Difficulty estimation and selective entropy may allocate too much budget to unsolved hard instances.
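As a sketch of difficulty-aware exploration control (in the spirit of CEEH), the snippet below applies an entropy bonus only on prompts the policy currently finds hard, proxied by a low verified pass rate across rollouts. The proxy, threshold, and coefficient are assumptions to tune, not values from the paper:

```python
# Sketch of difficulty-aware entropy regularization: reward exploration
# only where the policy struggles. Threshold and coefficient are
# assumptions, not CEEH's specification.
import torch

def difficulty_aware_loss(policy_loss: torch.Tensor,
                          mean_token_entropy: torch.Tensor,
                          pass_rate: float,
                          hard_threshold: float = 0.25,
                          beta: float = 0.01) -> torch.Tensor:
    """pass_rate = fraction of sampled rollouts that verified correct.
    On hard prompts, subtracting the entropy term (we minimize the loss)
    pushes entropy up and protects exploration; easy prompts get no
    bonus, so compression pressure can shorten their reasoning."""
    if pass_rate < hard_threshold:
        return policy_loss - beta * mean_token_entropy
    return policy_loss
```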
Theme: Long-horizon agent memory + inference infrastructure
- Why it matters: Long-horizon agents hit context/KV bottlenecks and memory retrieval failures; improvements increasingly require systems + model co-design.
- Representative papers:
- SideQuest: Model-Driven KV Cache Management for Long-Horizon Agentic Reasoning
- InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models
- AMA-Bench: Evaluating Long-Horizon Memory for Agentic Applications
- Common approach:
- Treat tool outputs as first-class cache occupants; evict them with semantic, model-driven decisions (SideQuest; see the sketch after this theme's bullets).
- Align quantization layout with decode-time kernels (inner-dimension grouping) to reduce memory traffic (InnerQ).
- Benchmark memory on agent–environment trajectories with machine-generated artifacts and causal state transitions (AMA-Bench).
- Open questions / failure modes:
- SideQuest currently evicts only tool responses (not “thought” tokens) and shows some OOD degradation.
- KV quantization results are shown on GSM8K few-shot; broader task impacts and interactions with eviction are open.
- Memory benchmarks rely on LLM-as-judge for some scoring; robustness of judging remains a concern.
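A minimal sketch of semantic tool-response eviction in the spirit of SideQuest: tool outputs are the only eviction candidates (matching the paper's current scope), and a relevance scorer, here an abstract callable, decides what survives. The segment format and scorer are hypothetical:

```python
# Sketch of semantic KV eviction restricted to tool responses; prompt
# and "thought" segments are protected. `score_relevance` and the
# segment dict format are hypothetical.
from typing import Callable

def evict_tool_segments(segments: list[dict],
                        score_relevance: Callable[[dict], float],
                        keep_budget: int) -> list[dict]:
    """Keep non-tool segments unconditionally; among tool responses,
    keep only the `keep_budget` highest-relevance ones, preserving the
    original cache order."""
    candidates = [s for s in segments if s["kind"] == "tool_response"]
    candidates.sort(key=score_relevance, reverse=True)
    survivors = {id(s) for s in candidates[:keep_budget]}
    return [s for s in segments
            if s["kind"] != "tool_response" or id(s) in survivors]
```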
Theme: Evaluation reliability, stochasticity, and disagreement as first-class signals
- Why it matters: As models converge, measurement error (rater effects, disagreement, run-to-run variance) can dominate perceived progress and mis-rank systems.
- Representative papers:
- Correcting Human Labels for Rater Effects in AI Evaluation: An Item Response Theory Approach
- Decomposing Physician Disagreement in HealthBench
- Evaluating Stochasticity in Deep Research Agents
- AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors
- Common approach:
- Model annotators explicitly (MFRM) to adjust scores and diagnose severity/centrality.
- Decompose variance into rater/rubric/residual or module/time-step contributions (a run-level dispersion sketch follows this theme's bullets).
- Evaluate tools in agentic use, exposing tool-to-agent gaps rather than static tool quality.
- Open questions / failure modes:
- HealthBench disagreement is mostly residual/case-specific; predicting it from embeddings is near chance, limiting automation.
- API non-determinism can persist even at temperature 0, complicating reproducibility.
- Auditing tools can hurt on harder targets; effectiveness depends strongly on target training (TD vs SDF; KTO vs SFT).
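For run-to-run variance reporting on research agents, a deliberately simple dispersion metric over repeated runs of the same query is sketched below. Exact-match agreement on normalized final answers is an assumption; findings and citations would need their own comparators:

```python
# Simple run-to-run dispersion report: repeat the same query N times
# and measure agreement on final answers. Exact-match comparison is
# an assumption, not a general-purpose comparator.
from collections import Counter

def answer_dispersion(answers: list[str]) -> dict:
    """modal_agreement = 1.0 means every run produced the same answer."""
    counts = Counter(a.strip().lower() for a in answers)
    modal = counts.most_common(1)[0][1]
    return {"n_runs": len(answers),
            "n_distinct": len(counts),
            "modal_agreement": modal / len(answers)}
```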
3) Technical synthesis
- Boundary-centric thinking is recurring: AgentSentry’s tool-return boundaries, ESAA’s intention/effect boundary, and edge-IoT’s MQTT coordination boundary all treat “where state changes” as the right place to measure/control risk.
- GRPO is becoming a common substrate for both reasoning efficiency (adaptive thinking; CEEH) and agentic RAG training (Search-P1), with papers focusing on stabilizing gradients/rewards under heterogeneity.
- Process signals are replacing binary outcomes: Search-P1’s path-centric scoring and diffusion step-stitching both extract learning/selection signal from partially correct trajectories.
- “Model as systems component” is expanding: SideQuest uses the LRM to manage its own KV cache; AgentSentry uses the model in controlled re-executions; CourtGuard uses multiple roles (attacker/defender/judge) to structure evaluation.
- Evaluation work is converging on variance decomposition: rater effects (IRT), physician disagreement ICCs, and DRA stochasticity all formalize “where variance comes from” rather than treating it as noise (a simplified rater-severity adjustment is sketched after this list).
- Language distribution shift remains a primary jailbreak vector: Classical Chinese optimization shows near-complete compromise across multiple closed models; Sparse Weight Editing tries to close multilingual gaps without retraining.
- Privacy auditing is broadening threat models: MOFIT removes the “ground-truth caption” assumption for diffusion MIAs; DP-Wavelet and DPSQL+ focus on deployable DP with practical constraints (post-processing, minimum frequency rules).
- Agent benchmarks are becoming more environment-faithful and reproducible: MobilityBench’s API replay sandbox and General Agent Evaluation’s Unified Protocol both target reproducibility and cross-system comparability.
- Interpretability is increasingly tied to interventions: SSM bottleneck steering (Mamba) and certified circuit stability both aim to make mechanistic artifacts actionable and reliable.
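As a concrete, simplified instance of rater-effect correction: the IRT paper fits a many-facet Rasch model, while this sketch only removes additive per-rater severity, so treat it as a first-order approximation rather than the paper's estimator:

```python
# Simplified additive rater-effect correction (first-order stand-in
# for a many-facet Rasch model).
import numpy as np

def adjust_for_rater_severity(scores: np.ndarray) -> np.ndarray:
    """scores[i, j] = rating of item i by rater j (np.nan if missing).
    Estimate each rater's severity as their mean deviation from
    per-item means, then subtract it before aggregating."""
    item_means = np.nanmean(scores, axis=1, keepdims=True)
    severity = np.nanmean(scores - item_means, axis=0)  # per-rater bias
    return scores - severity[None, :]
```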
4) Top 5 papers (with “why now”)
1) AgentSentry: Mitigating Indirect Prompt Injection in LLM Agents via Temporal Causal Diagnostics and Context Purification
- Introduces boundary-anchored counterfactual diagnostics (orig/mask/mask_sanitized/orig_sanitized) to localize mediator takeover.
- Reports ASR = 0% on AgentDojo across multiple attack families and backbones while keeping high utility under attack.
- Mitigates by rewriting only untrusted mediator content into evidence-only form, enabling continuation rather than termination.
- Be skeptical about: added inference overhead from counterfactual runs; benchmark may underrepresent long-horizon delayed takeovers.
2) Systems-Level Attack Surface of Edge Agent Deployments on IoT
- Makes agent security measurable: actuation-to-audit delay (~23 ms mean on one path), provenance completeness, egress, failover windows.
- Shows MQTT broker accepts spoofing/replay/direct safety-topic publishes without cryptographic enforcement.
- Demonstrates silent sovereignty boundary crossing via forced fallback (DNS to api.anthropic.com) with no app-layer anomaly.
- Be skeptical about: single testbed/topology; cloud egress comparison not workload-matched.
3) CourtGuard: A Model-Agnostic Framework for Zero-Shot Policy Adaptation in LLM Safety
- Retrieval-grounded adversarial debate produces interpretable verdicts with threat scores and policy citations.
- Strong reported benchmark performance (macro Acc 0.87 / F1 0.86) and high recall on a human-verified suite.
- Demonstrates zero-shot policy swapping (e.g., Wikipedia vandalism) by changing the policy corpus.
- Be skeptical about: latency/cost of RAG + multi-turn debate; dependence on backbone format adherence.
4) AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors
- Provides 56 hardened target models with 14 hidden behaviors and reduced confession rates (KTO harder than SFT).
- Agentic evaluation reveals scaffolded black-box tools outperform many white-box tools; effectiveness depends on target training.
- Surfaces a concrete “tool-to-agent gap” (underuse, noise distraction, hypothesis failures).
- Be skeptical about: targets are fine-tuned “model organisms” on one base model; may not match naturally emerging hidden behaviors.
5) LLM Novice Uplift on Dual-Use, In Silico Biology Tasks
- Human-subject evidence: LLM access yields 4.16× higher novice accuracy; Treatment beats Control on 7/8 benchmarks.
- Treatment novices sometimes exceed expert baselines, yet standalone LLMs often outperform LLM-assisted novices (an elicitation gap).
- Most Treatment participants (89.6%) reported no difficulty overcoming safeguards.
- Be skeptical about: study limitations include changing model availability, possible information leakage (some questions found online), and lack of full blinding.
5) Practical next steps
- For tool-using agents, add tool-return boundary instrumentation: log mediator content, proposed action, and a lightweight “takeover risk” proxy; measure how often high-impact actions are mediator-attributed.
- In edge/IoT deployments, treat message bus security as safety-critical: test spoof/replay/direct-topic publish in your MQTT (or equivalent) setup; measure actuation-to-audit delay and failover blackout windows (a probe sketch follows this list).
- If you need rapid policy updates, prototype a policy-RAG evaluator with explicit citations and a deterministic verdict mapping; benchmark latency vs static classifiers (a skeleton follows this list).
- For multilingual safety, evaluate language-shift jailbreaks (including stylistic shifts) and consider sparse interventions; measure utility drift on non-safety tasks.
- For reasoning efficiency, avoid blunt length penalties: try difficulty-aware exploration control (entropy only on hard instances) or advantage/gradient regulation under length heterogeneity; track mode collapse.
- For long-horizon agents, combine semantic KV eviction (tool-response garbage collection) with hardware-aligned KV quantization; measure throughput and non-completion/parsing failures.
- Upgrade evaluation pipelines: (i) model rater effects when using human labels, (ii) report disagreement-aware metrics, and (iii) for research agents, report run-to-run variance on answers/findings/citations plus module attribution.
- For dual-use governance, incorporate human+LLM uplift studies into risk assessments (not just LLM-only benchmarks), and explicitly test whether safeguards meaningfully slow task completion.
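For the MQTT probe suggested above, here is a hedged sketch using the paho-mqtt client (it assumes the 1.x API; the `wait_for_publish` timeout argument needs paho-mqtt >= 1.6). Host and topic are placeholders; run it only against brokers you own:

```python
# Probe: can an unauthenticated client publish directly to a
# safety-critical topic? Assumes paho-mqtt 1.x; host/topic are
# placeholders for your deployment.
import paho.mqtt.client as mqtt

def try_unauthorized_publish(host: str = "localhost",
                             topic: str = "agents/safety/override") -> bool:
    client = mqtt.Client()            # deliberately no credentials
    client.connect(host, 1883)
    client.loop_start()               # network loop so QoS 1 completes
    info = client.publish(topic, payload=b"spoofed-stop-release", qos=1)
    info.wait_for_publish(timeout=10) # waits for the broker's PUBACK
    accepted = info.is_published()
    client.loop_stop()
    client.disconnect()
    return accepted                   # True => broker lacks enforcement
```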
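And for the policy-RAG evaluator, a minimal skeleton with explicit citations and a deterministic verdict mapping; the retriever, judge, and thresholds are placeholders to swap for your own components:

```python
# Skeleton of a policy-RAG evaluator: retrieve relevant policy clauses,
# obtain a 0..1 threat score from a judge, map it deterministically to
# a verdict, and return citations alongside the decision.
from typing import Callable

VERDICTS = ("allow", "flag", "block")

def evaluate_against_policy(text: str,
                            retrieve_policy: Callable[[str], list[str]],
                            score_violation: Callable[[str, list[str]], float],
                            flag_at: float = 0.4,
                            block_at: float = 0.8) -> dict:
    clauses = retrieve_policy(text)
    threat = score_violation(text, clauses)
    # Deterministic mapping: the two comparisons sum to an index 0, 1, or 2.
    verdict = VERDICTS[int(threat >= flag_at) + int(threat >= block_at)]
    return {"verdict": verdict, "threat_score": threat, "citations": clauses}
```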
Generated from per-paper analyses; no external browsing.
