Daily AI Paper Report (2026-03-01)

Published:

Chinese version: [中文]

Run stats

  • Candidates: 262
  • Selected: 30
  • Deepread completed: 30
  • Window (UTC): 2026-02-26T01:00:00Z → 2026-02-28T01:00:00Z (arxiv_announce, expanded=1)
Show selected papers
arXiv IDTitle / LinksCategoriesScoreWhyTags
2602.23329LLM Novice Uplift on Dual-Use, In Silico Biology Tasks
PDF
cs.AI, cs.CL, cs.CR, cs.CY, cs.HC96Careful human study shows large LLM uplift on bio dual-use tasks; key for risk assessment.dual-use, biosecurity, human-uplift, evaluation, misuse-risk
2602.22755AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors
PDF
cs.CL95Benchmark of hidden misalignment behaviors + agentic auditing tools; strong for eval & oversight.alignment auditing, benchmarks, hidden behaviors, agent evaluators, model honesty
2602.22724AgentSentry: Mitigating Indirect Prompt Injection in LLM Agents via Temporal Causal Diagnostics and Context Purification
PDF
cs.CR, cs.AI93Directly targets indirect prompt injection in agents with trajectory-aware diagnostics + mitigation.agent security, prompt injection, tool outputs, inference-time defense, causal diagnostics
2602.22525Systems-Level Attack Surface of Edge Agent Deployments on IoT
PDF
cs.CR93Empirical security analysis of edge LLM agents; concrete attack surfaces + measurable security metrics.agent-security, edge-deployment, IoT, attack-surface, systems-security, provenance
2602.22603SideQuest: Model-Driven KV Cache Management for Long-Horizon Agentic Reasoning
PDF
cs.AI, cs.LG92LRM-driven KV cache compression for long-horizon agents; targets real bottleneck in agentic reasoning.agents, long-context, memory, KV-cache, efficiency, reasoning
2602.22557CourtGuard: A Model-Agnostic Framework for Zero-Shot Policy Adaptation in LLM Safety
PDF
cs.AI, cs.LG91Model-agnostic zero-shot safety policy adaptation via RAG multi-agent debate grounded in policies.policy compliance, RAG, multi-agent debate, governance, safety evaluation
2602.22787Probing for Knowledge Attribution in Large Language Models
PDF
cs.CL, cs.AI91Probe predicts whether outputs rely on prompt vs internal knowledge; useful for hallucination control.hallucinations, attribution, faithfulness, factuality, interpretability
2602.22953General Agent Evaluation
PDF
cs.AI91Proposes unified protocol + framework for general agent evaluation; addresses benchmark integration bias.agent-evaluation, benchmarks, general-agents, protocols, reproducibility
2602.22775TherapyProbe: Generating Design Knowledge for Relational Safety in Mental Health Chatbots Through Adversarial Simulation
PDF
cs.HC, cs.AI, cs.CL90Adversarial multi-agent simulation to surface long-horizon relational safety failures in therapy bots.mental health, conversational safety, multi-turn evaluation, red teaming, agent simulation
2602.22576Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training
PDF
cs.CL, cs.IR, cs.LG89Reward shaping for agentic RAG RL improves sample efficiency using trajectory-level signals.agentic-RAG, reinforcement-learning, reward-shaping, retrieval, reasoning
2602.22897OmniGAIA: Towards Native Omni-Modal AI Agents
PDF
cs.AI, cs.CL, cs.CV, cs.LG, cs.MM89Omni-modal agent benchmark (audio+video+image+tools) with event-graph construction; high reuse potential.multimodal, agents, benchmark, tool-use, evaluation, long-horizon
2602.22556Stable Adaptive Thinking via Advantage Shaping and Length-Aware Gradient Regulation
PDF
cs.LG, cs.AI, cs.CL89RL framework to curb overthinking while preserving correctness; practical for reliable reasoning models.reasoning, RL, efficiency, adaptive-compute, alignment, robustness
2602.22554Multilingual Safety Alignment Via Sparse Weight Editing
PDF
cs.LG88Training-free sparse weight editing to reduce multilingual safety gaps; practical alignment lever.multilingual safety, weight editing, safety neurons, alignment, low-resource languages
2602.22675Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization
PDF
cs.CL87Agentic search framework prioritizing parallel evidence over deep reasoning; targets cost and generalization.agents, search, efficiency, long-horizon, deep-research
2602.23271Evaluating Stochasticity in Deep Research Agents
PDF
cs.AI87Formalizes and measures stochasticity/variance in deep research agents; identifies sources via MDP framing.agents, evaluation, stochasticity, reliability, research-agents, variance
2602.22769AMA-Bench: Evaluating Long-Horizon Memory for Agentic Applications
PDF
cs.AI, cs.LG86AMA-Bench evaluates long-horizon agent memory on real agent trajectories beyond dialogue setups.agent memory, benchmarks, long-horizon, evaluation, trajectories
2602.23136Modality Collapse as Mismatched Decoding: Information-Theoretic Limits of Multimodal LLMs
PDF
cs.CL, cs.AI, cs.LG86Theory for multimodal 'modality collapse' as mismatched decoding; probes + info-theoretic limits (GMI).multimodal-LLMs, information-theory, decoding, representation, robustness
2602.22719Interpreting and Steering State-Space Models via Activation Subspace Bottlenecks
PDF
cs.LG86Mechanistic interpretability + test-time steering for Mamba/SSMs; notable gains via simple intervention.interpretability, steering, SSM, Mamba, mechanistic, reliability
2602.22968Certified Circuits: Stability Guarantees for Mechanistic Circuits
PDF
cs.AI, cs.CV, cs.CY85Provable stability guarantees for mechanistic circuit discovery; improves interpretability reliability.mechanistic interpretability, circuits, certification, robustness, auditing
2602.22638MobilityBench: A Benchmark for Evaluating Route-Planning Agents in Real-World Mobility Scenarios
PDF
cs.AI84Real-world route-planning benchmark with deterministic API-replay sandbox for reproducible agent eval.agents, benchmark, tool-use, evaluation, sandbox
2602.23200InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models
PDF
cs.LG, cs.CL84Hardware-aware KV-cache quantization reducing latency/memory for long-context decoding; practical impact.LLM-efficiency, KV-cache, quantization, long-context, inference
2602.22871Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching
PDF
cs.CL, cs.AI84Step-level PRM-guided stitching for diffusion LMs; improves test-time scaling beyond trace voting.test-time-scaling, diffusion-LM, process-reward-model, reasoning, self-consistency
2602.22689No Caption, No Problem: Caption-Free Membership Inference via Model-Fitted Embeddings
PDF
cs.CV, cs.CR82Caption-free membership inference for diffusion models; strengthens privacy auditing realism.privacy, membership inference, diffusion models, data memorization, security
2602.23193ESAA: Event Sourcing for Autonomous Agents in LLM-Based Software Engineering
PDF
cs.AI82Event-sourcing architecture for LLM agents: structured intentions + deterministic state/logging.agents, software-engineering, state, reliability, orchestration
2602.22642Compress the Easy, Explore the Hard: Difficulty-Aware Entropy Regularization for Efficient LLM Reasoning
PDF
cs.LG82Reasoning compression via difficulty-aware entropy regularization to avoid exploration collapse on hard tasks.LLM-reasoning, CoT, efficiency, entropy-regularization, RL
2602.22758Decomposing Physician Disagreement in HealthBench
PDF
cs.AI, stat.AP82Analyzes physician disagreement in HealthBench; highlights irreducible uncertainty in medical evals.evaluation, medical-AI, uncertainty, human-judgment, benchmarks, reliability
2602.23262Decomposing Private Image Generation via Coarse-to-Fine Wavelet Modeling
PDF
cs.CV, cs.CR81DP image generation via wavelet coarse-to-fine; targets privacy/utility tradeoff with spectral hypothesis.privacy, differential-privacy, image-generation, wavelets, memorization
2602.22699DPSQL+: A Differentially Private SQL Library with a Minimum Frequency Rule
PDF
cs.CR, cs.DB, cs.LG80DP SQL system enforcing minimum frequency rule; relevant for governance-grade privacy releases.differential privacy, data governance, SQL, minimum frequency rule, privacy engineering
2602.22585Correcting Human Labels for Rater Effects in AI Evaluation: An Item Response Theory Approach
PDF
cs.AI, cs.LG80Uses IRT/Rasch to correct rater effects in human eval; improves reliability of AI conclusions.evaluation, human-raters, psychometrics, RLHF, measurement
2602.22983Obscure but Effective: Classical Chinese Jailbreak Prompt Optimization via Bio-Inspired Search
PDF
cs.AI, cs.CR79Shows classical Chinese as jailbreak vector + automated black-box prompt search; useful for red-teaming.jailbreaks, multilingual attacks, adversarial prompts, red teaming, prompt optimization

AI Paper Insight Brief

2026-03-01

0) Executive takeaways (read this first)

  • Agent safety is shifting from “prompt-level” to “systems-level”: edge IoT swarms show that coordination buses (MQTT), failover behavior, and silent cloud fallback can dominate real risk—even when model behavior is unchanged.
  • Inference-time, policy-grounded safety is getting more updateable: CourtGuard demonstrates zero-shot policy swapping via RAG + adversarial debate with strong benchmark performance, suggesting a path to reducing “alignment lag” without retraining.
  • Multi-turn agent attacks/defenses are becoming causal and temporal: AgentSentry reports 0% attack success on AgentDojo by localizing takeover at tool-return boundaries using counterfactual re-executions, then purifying only the untrusted mediator context to continue safely.
  • Efficiency work is converging on “adaptive compute” with stability fixes: multiple papers tackle overthinking/long-horizon cost via GRPO stabilizers (CPAS/LAGR), difficulty-aware entropy regularization (CEEH), and step-level reuse (diffusion stitching) rather than blunt length penalties.
  • Evaluation is maturing toward variance/noise-aware measurement: rater-effect correction (IRT) can change system rankings; HealthBench disagreement is mostly case-specific; deep-research agents show measurable run-to-run variance with module attribution and mitigation.
  • Dual-use risk evidence is becoming more direct: a human study finds LLM access yields 4.16× higher novice accuracy on in silico biology tasks and most users report little difficulty overcoming safeguards.

2) Key themes (clusters)

Theme: Tool-using agent security beyond prompts (systems + temporal defenses)

  • Why it matters: As agents act through tools and physical devices, the main vulnerabilities increasingly come from coordination substrates, context persistence, and runtime fallbacks—not just prompt injection in a single turn.
  • Representative papers:
  • Common approach:
    • Treat safety properties as systems metrics (audit delay, provenance completeness, egress, failover windows) rather than purely model behavior.
    • Insert boundary checks at tool-return / state-transition points (where untrusted content enters).
    • Prefer auditable state kernels (append-only logs, deterministic replay, contracts) to reduce state drift and enable governance.
  • Open questions / failure modes:
    • How to harden coordination layers (e.g., MQTT) with cryptographic provenance/ACLs under edge constraints without breaking latency.
    • Counterfactual diagnostics add overhead; unclear robustness on long-horizon delayed takeovers beyond current benchmarks.
    • Event-sourcing kernels validate compliance/replay, but don’t directly measure software quality or broader security side channels.

Theme: Dynamic, policy-grounded alignment and multilingual safety transfer

Theme: Stable efficiency scaling for reasoning and agentic RAG

Theme: Long-horizon agent memory + inference infrastructure

Theme: Evaluation reliability, stochasticity, and disagreement as first-class signals

3) Technical synthesis

  • Boundary-centric thinking is recurring: AgentSentry’s tool-return boundaries, ESAA’s intention/effect boundary, and edge-IoT’s MQTT coordination boundary all treat “where state changes” as the right place to measure/control risk.
  • GRPO is becoming a common substrate for both reasoning efficiency (adaptive thinking; CEEH) and agentic RAG training (Search-P1), with papers focusing on stabilizing gradients/rewards under heterogeneity.
  • Process signals are replacing binary outcomes: Search-P1’s path-centric scoring and diffusion step-stitching both extract learning/selection signal from partially correct trajectories.
  • “Model as systems component” is expanding: SideQuest uses the LRM to manage its own KV cache; AgentSentry uses the model in controlled re-executions; CourtGuard uses multiple roles (attacker/defender/judge) to structure evaluation.
  • Evaluation work is converging on variance decomposition: rater effects (IRT), physician disagreement ICCs, and DRA stochasticity all formalize “where variance comes from” rather than treating it as noise.
  • Language distribution shift remains a primary jailbreak vector: Classical Chinese optimization shows near-complete compromise across multiple closed models; Sparse Weight Editing tries to close multilingual gaps without retraining.
  • Privacy auditing is broadening threat models: MOFIT removes the “ground-truth caption” assumption for diffusion MIAs; DP-Wavelet and DPSQL+ focus on deployable DP with practical constraints (post-processing, minimum frequency rules).
  • Agent benchmarks are becoming more environment-faithful and reproducible: MobilityBench’s API replay sandbox and General Agent Evaluation’s Unified Protocol both target reproducibility and cross-system comparability.
  • Interpretability is increasingly tied to interventions: SSM bottleneck steering (Mamba) and certified circuit stability both aim to make mechanistic artifacts actionable and reliable.

4) Top 5 papers (with “why now”)

1) AgentSentry: Mitigating Indirect Prompt Injection in LLM Agents via Temporal Causal Diagnostics and Context Purification

  • Introduces boundary-anchored counterfactual diagnostics (orig/mask/mask_sanitized/orig_sanitized) to localize mediator takeover.
  • Reports ASR = 0% on AgentDojo across multiple attack families and backbones while keeping high utility under attack.
  • Mitigates by rewriting only untrusted mediator content into evidence-only form, enabling continuation rather than termination.
  • Be skeptical about: added inference overhead from counterfactual runs; benchmark may underrepresent long-horizon delayed takeovers.

2) Systems-Level Attack Surface of Edge Agent Deployments on IoT

  • Makes agent security measurable: actuation-to-audit delay (~23 ms mean on one path), provenance completeness, egress, failover windows.
  • Shows MQTT broker accepts spoofing/replay/direct safety-topic publishes without cryptographic enforcement.
  • Demonstrates silent sovereignty boundary crossing via forced fallback (DNS to api.anthropic.com) with no app-layer anomaly.
  • Be skeptical about: single testbed/topology; cloud egress comparison not workload-matched.

3) CourtGuard: A Model-Agnostic Framework for Zero-Shot Policy Adaptation in LLM Safety

  • Retrieval-grounded adversarial debate produces interpretable verdicts with threat scores and policy citations.
  • Strong reported benchmark performance (macro Acc 0.87 / F1 0.86) and high recall on a human-verified suite.
  • Demonstrates zero-shot policy swapping (e.g., Wikipedia vandalism) by changing the policy corpus.
  • Be skeptical about: latency/cost of RAG + multi-turn debate; dependence on backbone format adherence.

4) AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors

  • Provides 56 hardened target models with 14 hidden behaviors and reduced confession rates (KTO harder than SFT).
  • Agentic evaluation reveals scaffolded black-box tools outperform many white-box tools; effectiveness depends on target training.
  • Surfaces a concrete “tool-to-agent gap” (underuse, noise distraction, hypothesis failures).
  • Be skeptical about: targets are fine-tuned “model organisms” on one base model; may not match naturally emerging hidden behaviors.

5) LLM Novice Uplift on Dual-Use, In Silico Biology Tasks

  • Human-subject evidence: LLM access yields 4.16× higher novice accuracy; Treatment beats Control on 7/8 benchmarks.
  • Treatment novices sometimes exceed expert baselines; but standalone LLMs often exceed LLM-assisted novices (elicitation gap).
  • Reports most Treatment participants indicated no difficulty overcoming safeguards (89.6%).
  • Be skeptical about: study limitations include changing model availability, possible information leakage (some questions found online), and lack of full blinding.

5) Practical next steps

  • For tool-using agents, add tool-return boundary instrumentation: log mediator content, proposed action, and a lightweight “takeover risk” proxy; measure how often high-impact actions are mediator-attributed.
  • In edge/IoT deployments, treat message bus security as safety-critical: test spoof/replay/direct-topic publish in your MQTT (or equivalent) setup; measure actuation-to-audit delay and failover blackout windows.
  • If you need rapid policy updates, prototype a policy-RAG evaluator with explicit citations and a deterministic verdict mapping; benchmark latency vs static classifiers.
  • For multilingual safety, evaluate language-shift jailbreaks (including stylistic shifts) and consider sparse interventions; measure utility drift on non-safety tasks.
  • For reasoning efficiency, avoid blunt length penalties: try difficulty-aware exploration control (entropy only on hard instances) or advantage/gradient regulation under length heterogeneity; track mode collapse.
  • For long-horizon agents, combine semantic KV eviction (tool-response garbage collection) with hardware-aligned KV quantization; measure throughput and non-completion/parsing failures.
  • Upgrade evaluation pipelines: (i) model rater effects when using human labels, (ii) report disagreement-aware metrics, and (iii) for research agents, report run-to-run variance on answers/findings/citations plus module attribution.
  • For dual-use governance, incorporate human+LLM uplift studies into risk assessments (not just LLM-only benchmarks), and explicitly test whether safeguards meaningfully slow task completion.

Generated from per-paper analyses; no external browsing.