Daily AI Paper Report (2026-02-28)

Published:

Chinese version: [Chinese]

Run stats

  • Candidates: 262
  • Selected: 30
  • Deepread completed: 30
  • Window (UTC): 2026-02-26T01:00:00Z → 2026-02-27T01:00:00Z (arxiv_announce, expanded=0)
Selected papers (arXiv ID | Title (link) | Categories | Score):

2602.22755 | AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors (PDF) | cs.CL | 96
  • Why: Audit benchmark w/ 56 models hiding 14 bad traits; evaluates auditing tools + autonomous investigator agent.
  • Tags: alignment auditing, hidden behaviors, benchmarks, red-teaming, agent evaluation, model honesty

2602.23329 | LLM Novice Uplift on Dual-Use, In Silico Biology Tasks (PDF) | cs.AI, cs.CL, cs.CR, cs.CY, cs.HC | 96
  • Why: Careful human uplift study on bio dual-use tasks; quantifies novice capability jump with LLMs.
  • Tags: dual-use, biosecurity, human-uplift, evaluation, risk-assessment, LLMs

2602.22724 | AgentSentry: Mitigating Indirect Prompt Injection in LLM Agents via Temporal Causal Diagnostics and Context Purification (PDF) | cs.CR, cs.AI | 94
  • Why: Targets indirect prompt injection in tool/RAG agents with multi-turn causal diagnostics + context purification.
  • Tags: agent security, prompt injection, tool use, RAG safety, inference-time defense, trajectory attacks

2602.22525 | Systems-Level Attack Surface of Edge Agent Deployments on IoT (PDF) | cs.CR | 94
  • Why: Empirical security analysis of edge LLM agents on IoT; identifies concrete attack surfaces + metrics.
  • Tags: agent-security, edge-deployment, IoT, attack-surface, systems-security, provenance, MQTT

2602.22557 | CourtGuard: A Model-Agnostic Framework for Zero-Shot Policy Adaptation in LLM Safety (PDF) | cs.AI, cs.LG | 92
  • Why: Model-agnostic zero-shot safety policy adaptation via retrieval-grounded multi-agent evidentiary debate.
  • Tags: policy compliance, RAG, multi-agent debate, governance, safety evaluation, zero-shot

2602.22787 | Probing for Knowledge Attribution in Large Language Models (PDF) | cs.CL, cs.AI | 92
  • Why: Probe predicts whether outputs rely on prompt vs internal knowledge; useful for hallucination mitigation.
  • Tags: hallucinations, attribution, faithfulness, factuality, probes, evaluation

2602.22953 | General Agent Evaluation (PDF) | cs.AI | 92
  • Why: Proposes unified protocol + framework for general agent evaluation; addresses benchmark integration gaps.
  • Tags: agent-evaluation, benchmarks, evaluation-protocol, agentic-systems, framework

2602.22603 | SideQuest: Model-Driven KV Cache Management for Long-Horizon Agentic Reasoning (PDF) | cs.AI, cs.LG | 92
  • Why: LRM-driven KV-cache compression for long-horizon agents; targets real bottleneck in agentic RAG.
  • Tags: agents, long-context, kv-cache, efficiency, reasoning, memory-management, RAG

2602.22554 | Multilingual Safety Alignment Via Sparse Weight Editing (PDF) | cs.LG | 90
  • Why: Training-free sparse weight editing to reduce multilingual safety gaps; claims closed-form cross-lingual mapping.
  • Tags: multilingual safety, weight editing, safety neurons, alignment, low-resource languages, robustness

2602.23271 | Evaluating Stochasticity in Deep Research Agents (PDF) | cs.AI | 90
  • Why: Formalizes and measures stochasticity/variance in deep research agents; identifies sources via MDP view.
  • Tags: agents, evaluation, reliability, stochasticity, research-agents, variance

2602.22675 | Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization (PDF) | cs.CL | 89
  • Why: Agentic search framework prioritizing parallel evidence over deep reasoning; targets cost + generalization.
  • Tags: agents, search, efficiency, long-horizon, generalization, deep-research

2602.22556 | Stable Adaptive Thinking via Advantage Shaping and Length-Aware Gradient Regulation (PDF) | cs.LG, cs.AI, cs.CL | 89
  • Why: RL method to curb overthinking while preserving hard-query reasoning; practical accuracy/latency tradeoff.
  • Tags: reasoning, test-time-compute, RL, efficiency, adaptive-computation, alignment-adjacent

2602.22775 | TherapyProbe: Generating Design Knowledge for Relational Safety in Mental Health Chatbots Through Adversarial Simulation (PDF) | cs.HC, cs.AI, cs.CL | 88
  • Why: Adversarial multi-agent simulation to surface multi-turn relational safety failures in mental health chatbots.
  • Tags: relational safety, mental health, multi-agent simulation, evaluation, conversation dynamics, harm modes

2602.22576 | Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training (PDF) | cs.CL, cs.IR, cs.LG | 88
  • Why: Reward shaping for RL-trained agentic RAG; extracts signal from failures to improve sample efficiency.
  • Tags: RAG, agents, reinforcement-learning, reward-shaping, training, retrieval

2602.22897 | OmniGAIA: Towards Native Omni-Modal AI Agents (PDF) | cs.AI, cs.CL, cs.CV, cs.LG, cs.MM | 88
  • Why: Omni-modal agent benchmark (video+audio+image) with tool use and multi-hop reasoning; likely reusable.
  • Tags: multimodal, agents, benchmark, tool-use, evaluation, video, audio

2602.23136 | Modality Collapse as Mismatched Decoding: Information-Theoretic Limits of Multimodal LLMs (PDF) | cs.CL, cs.AI, cs.LG | 87
  • Why: Info-theoretic account of modality collapse as mismatched decoding; actionable framing for multimodal LLMs.
  • Tags: multimodal-llms, information-theory, decoding, representation, modality-collapse, theory

2602.22871 | Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching (PDF) | cs.CL, cs.AI | 87
  • Why: Step-level PRM scoring + stitching across diffusion CoTs; strong test-time scaling idea for reasoning.
  • Tags: reasoning, process-reward-model, test-time-scaling, diffusion-LM, self-consistency

2602.22968 | Certified Circuits: Stability Guarantees for Mechanistic Circuits (PDF) | cs.AI, cs.CV, cs.CY | 86
  • Why: Provable stability guarantees for mechanistic circuit discovery via randomized subsampling certification.
  • Tags: mechanistic interpretability, circuits, robustness, certification, auditing, OOD stability

2602.22638 | MobilityBench: A Benchmark for Evaluating Route-Planning Agents in Real-World Mobility Scenarios (PDF) | cs.AI | 86
  • Why: Real-world route-planning agent benchmark with deterministic API-replay sandbox for reproducibility.
  • Tags: agents, benchmark, evaluation, tool-use, reproducibility, planning

2602.22769 | AMA-Bench: Evaluating Long-Horizon Memory for Agentic Applications (PDF) | cs.AI, cs.LG | 85
  • Why: AMA-Bench evaluates long-horizon agent memory on real agent-environment trajectories (not just dialogue).
  • Tags: agent memory, benchmarks, long-horizon, evaluation, trajectories, agentic applications

2602.22719 | Interpreting and Steering State-Space Models via Activation Subspace Bottlenecks (PDF) | cs.LG | 85
  • Why: Mechanistic interpretability for Mamba SSMs + simple activation steering yields broad gains.
  • Tags: interpretability, steering, state-space-models, Mamba, mechanistic-interpretability, reliability

2602.23193 | ESAA: Event Sourcing for Autonomous Agents in LLM-Based Software Engineering (PDF) | cs.AI | 84
  • Why: Event-sourcing architecture for LLM agents: structured intentions + deterministic state/logging.
  • Tags: agents, software-engineering, orchestration, state, reliability, audit-logs

2602.23200 | InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models (PDF) | cs.LG, cs.CL | 84
  • Why: Hardware-aware KV-cache quantization reducing latency/memory for long-context decoding without accuracy loss.
  • Tags: efficiency, KV-cache, quantization, long-context, inference, systems

2602.22758 | Decomposing Physician Disagreement in HealthBench (PDF) | cs.AI, stat.AP | 83
  • Why: Analyzes physician disagreement in HealthBench; highlights irreducible uncertainty in medical evals.
  • Tags: evaluation, medical-AI, uncertainty, human-judgment, benchmarking, reliability

2602.22689 | No Caption, No Problem: Caption-Free Membership Inference via Model-Fitted Embeddings (PDF) | cs.CV, cs.CR | 82
  • Why: Caption-free membership inference for diffusion models using model-fitted synthetic conditioning inputs.
  • Tags: privacy, membership inference, diffusion models, data memorization, auditing, security

2602.22585 | Correcting Human Labels for Rater Effects in AI Evaluation: An Item Response Theory Approach (PDF) | cs.AI, cs.LG | 82
  • Why: Uses IRT/Rasch to correct rater effects in human evals; improves reliability of AI conclusions.
  • Tags: evaluation, human-ratings, psychometrics, IRT, RLHF, measurement

2602.22642 | Compress the Easy, Explore the Hard: Difficulty-Aware Entropy Regularization for Efficient LLM Reasoning (PDF) | cs.LG | 82
  • Why: Difficulty-aware entropy regularization to compress CoT while avoiding entropy collapse on hard problems.
  • Tags: reasoning, CoT, efficiency, entropy-regularization, inference-cost, RL

2602.23262 | Decomposing Private Image Generation via Coarse-to-Fine Wavelet Modeling (PDF) | cs.CV, cs.CR | 81
  • Why: DP image generation via coarse-to-fine wavelet modeling to reduce quality loss; privacy-relevant technique.
  • Tags: privacy, differential-privacy, image-generation, wavelets, memorization, DP-SGD

2602.22699 | DPSQL+: A Differentially Private SQL Library with a Minimum Frequency Rule (PDF) | cs.CR, cs.DB, cs.LG | 80
  • Why: DP SQL library enforcing user-level DP plus minimum-frequency rule; practical governance-aligned privacy.
  • Tags: differential privacy, governance, SQL, data release, minimum frequency rule, privacy engineering

2602.23079 | Assessing Deanonymization Risks with Stylometry-Assisted LLM Agent (PDF) | cs.CL, cs.CR, cs.LG | 80
  • Why: Stylometry + LLM agent for authorship inference; highlights and mitigates deanonymization risks.
  • Tags: privacy, deanonymization, stylometry, LLM-agents, security, risk

AI Paper Insight Brief

2026-02-28

1) Executive takeaways (read this first)

  • Agent safety is shifting from “prompt-level” to “systems-level”: edge/hybrid deployments introduce measurable new failure windows (audit latency, failover blackouts, silent cloud fallback) and protocol-layer spoofing risks that bypass model-behavior defenses.
  • Dynamic, policy-text-grounded safety is becoming a viable alternative to weight-locked guardrails: retrieval-grounded “adjudication” (CourtGuard) shows strong benchmark performance and can swap policies zero-shot, but it pays a latency cost and depends on the backbone’s formatting adherence.
  • RL for agentic RAG and reasoning efficiency is converging on “process/path shaping”: reward shaping over trajectories (Search-P1) and stability fixes for length heterogeneity (adaptive thinking; difficulty-aware entropy) report simultaneous accuracy gains and large token reductions.
  • Evaluation is getting more realistic—and more sobering: new benchmarks target agent memory (AMA-Bench), mobility tool use (MobilityBench), omni-modal tool agents (OmniGAIA), hidden-behavior auditing (AuditBench), and DRA stochasticity—often revealing that current systems fail for structural reasons (context/memory loss, tool misuse, run-to-run variance).
  • Privacy/security work is broadening beyond classic text MIAs: caption-free diffusion membership inference (MOFIT), DP SQL with minimum-frequency governance (DPSQL+), DP text-to-image via wavelet coarse-to-fine (DP-Wavelet), and stylometry-assisted deanonymization agents show both new attack surfaces and deployable mitigations.
  • Dual-use risk is increasingly about “human uplift,” not model scores: a human-subject study finds LLM access makes novices ~4.16× more accurate on biosecurity-relevant in silico tasks and most participants report little friction from safeguards.

2) Key themes (clusters)

  • Systems-level agent security & governance (beyond prompts)
  • Policy adaptability & auditing hidden behaviors
  • Efficient reasoning & agentic RAG via process/path shaping
  • Agent evaluation realism: memory, tools, multimodality, and stochasticity
  • Privacy & dual-use: new auditing attacks, DP with governance constraints, and human uplift

3) Technical synthesis

  • Multiple papers converge on GRPO-style RL as a base, then add stability/credit-assignment fixes: CPAS+LAGR for length heterogeneity; CEEH for difficulty-gated entropy; Search-P1 for path-level dense rewards.
  • A recurring pattern is “process over outcome”: path-centric rewards (Search-P1), step-level scoring and reuse (diffusion stitching), and causal boundary diagnostics (AgentSentry) all extract signal from intermediate structure.
  • Tool boundaries are becoming the natural control point for both safety and evaluation: AgentSentry’s boundary-anchored counterfactuals, ESAA’s contract-validated intentions, and IoT MQTT topic enforcement gaps all sit at the tool/transport layer.
  • Benchmarks increasingly enforce reproducibility via determinism (MobilityBench API replay; DRA cached search) to separate model variance from environment variance.
  • Several works highlight measurement-modeling as a first-class component: IRT/MFRM for rater effects; stochasticity as total variance over canonicalized findings/citations; systems security as timing/egress metrics.
  • Memory/context management is splitting into two directions: semantic eviction/compression (SideQuest’s model-driven KV eviction of tool outputs) and structured external memory (AMA-Agent causality graphs + tool-augmented retrieval).
  • Safety alignment is diversifying beyond fine-tuning: training-free weight edits for multilingual safety (sparse low-rank edits) and policy-text swapping for moderation (CourtGuard).
  • Privacy auditing is moving toward optimization-based, model-fitted attacks (MOFIT) and governance-aware DP interfaces (DPSQL+), suggesting defenders need both ML and systems mitigations.
  • Across multimodal and agentic settings, a common failure is “information present but unusable”: modality collapse framed as mismatched decoding (GMI vs MI), and agent memory failures where construction/retrieval loses critical state.
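The determinism pattern noted above (MobilityBench's API replay, DRA's cached search) boils down to a record/replay cache around external calls. A minimal sketch, with illustrative names (`ReplayCache`, `call`) that are not taken from any of these papers:

```python
import hashlib
import json


class ReplayCache:
    """Record/replay wrapper that makes benchmark runs deterministic.

    In 'record' mode, live tool/API calls are executed once and stored under
    a canonical key; in 'replay' mode, the stored response is returned, so
    environment variance is removed and only model variance remains.
    """

    def __init__(self, mode="record"):
        assert mode in ("record", "replay")
        self.mode = mode
        self.store = {}

    @staticmethod
    def _key(tool, args):
        # Canonicalize the request so semantically identical calls share a key.
        blob = json.dumps({"tool": tool, "args": args}, sort_keys=True)
        return hashlib.sha256(blob.encode()).hexdigest()

    def call(self, tool, args, live_fn):
        k = self._key(tool, args)
        if self.mode == "replay":
            return self.store[k]  # deterministic: no live call is made
        result = live_fn(**args)  # record mode: hit the live API once
        self.store[k] = result
        return result
```

Running the benchmark once in record mode and thereafter in replay mode lets repeated evaluations attribute any remaining run-to-run differences to the agent rather than to the environment.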

4) Top 5 papers (with “why now”)

1) LLM Novice Uplift on Dual-Use, In Silico Biology Tasks

  • Quantifies human uplift: LLM access yields ~4.16× higher novice accuracy (odds ratio; adjusted accuracy ~5% → >17%).
  • Shows Treatment beats Control on 7/8 benchmarks, and can exceed expert internet-only baselines on some (e.g., HPCT, VCT).
  • Adds behavioral signals (longer, more structured responses; higher confidence) and reports 89.6% of participants had no difficulty overcoming safeguards.
  • Skepticism: study logistics changed mid-run (model availability), and some tasks had leakage (participants found questions online); not fully double-blind.

2) AgentSentry: Mitigating Indirect Prompt Injection…

  • Inference-time, black-box-compatible defense using boundary-anchored counterfactual re-executions and causal effect estimates (ACE/IE/DE).
  • Reports ASR = 0% with substantial utility across AgentDojo suites and multiple backbones; ablations show sanitized counterfactuals are critical.
  • Emphasizes safe continuation via context purification + minimal action revision, not blanket refusal.
  • Skepticism: lightweight configuration (e.g., K=1) may rely on benchmark injections being boundary-adjacent; tool/runtime compromise is out-of-scope.

3) CourtGuard: Zero-Shot Policy Adaptation in LLM Safety

  • Retrieval-grounded “Evidentiary Debate” enables policy swapping without fine-tuning; reports strong macro-average benchmark performance.
  • Demonstrates zero-shot adaptation to Wikipedia vandalism policy (90% on a balanced subset) and a legal grounding variant with expert review alignment.
  • Provides interpretable, policy-cited traces and claims dataset label-noise auditing utility.
  • Skepticism: adds inference latency; depends on backbone instruction/format adherence; bounded by policy corpus breadth.

4) AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors

  • Supplies a missing benchmark primitive: 56 models with 14 hidden behaviors designed not to confess when asked.
  • Evaluates an autonomous investigator agent across tool configurations and finds scaffolded black-box tools often outperform white-box tools.
  • Surfaces a key warning (the tool-to-agent gap): static evidence of a hidden behavior doesn’t guarantee the investigator agent succeeds.
  • Skepticism: targets are narrow fine-tunes on one base model (Llama 3.3 70B); implanted behaviors may differ from emergent real-world issues.

5) Systems-Level Attack Surface of Edge Agent Deployments on IoT

  • Makes “agent security” concrete at the architecture layer: measures actuation-to-audit delay, provenance completeness, data egress, and failover windows.
  • Finds MQTT broker accepted spoof/replay/direct safety-topic publishes by default; forced fallback can trigger silent cloud routing observable only via DNS/tcpdump.
  • Quantifies failover: end-to-end, WiFi loss to the fallback path takes 35.7 s, while MQTT reconnection alone takes milliseconds, highlighting where the real exposure window is.
  • Skepticism: single testbed/topology; cloud egress comparison not workload-matched; mitigations not implemented/evaluated.

5) Practical next steps

  • For tool-using agents, add boundary-level security instrumentation: log tool-return boundaries, cache tool outputs for replay, and measure takeover risk via controlled counterfactual re-executions (AgentSentry-style) on your own workflows.
  • If deploying edge/hybrid agents, define and monitor systems safety SLOs: actuation-to-audit delay, failover blackout windows, provenance-chain completeness, and explicit alerts on any cloud fallback/egress.
  • For moderation/governance, prototype policy-text RAG adjudication with explicit scoring rubrics (regulatory vs practical threat) and measure latency/format-failure rates across backbones.
  • For RL training of agentic RAG, replace binary-only rewards with trajectory/path rewards (self-consistency + reference-alignment) and include partial credit for near-misses; track convergence speed and redundant tool actions.
  • For reasoning efficiency, test mode-control tokens (/think vs /no_think) and stabilize RL with length-aware gradient weighting; separately, try difficulty-gated entropy to avoid entropy collapse on hard items.
  • For evaluation, incorporate stochasticity audits: run agents k times per query, compute variance over findings/citations, and localize variance to modules (query vs summarize vs update) before tuning temperature.
  • For human-labeled evals, consider rater-effect correction (MFRM/IRT) and rater diagnostics before making model selection decisions from raw means.
  • For privacy, assume stronger auditors: evaluate diffusion models under caption-free MIA settings; for analytics, enforce both DP and governance constraints (minimum frequency) with integrated accounting; for text, assess stylometry/deanonymization risk and test guided rewrites.
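The boundary-level counterfactual re-execution suggested above can be prototyped cheaply on logged workflows. The sketch below is a single-boundary approximation, not AgentSentry itself (which uses causal effect estimates over full trajectories); `agent_step` and `sanitize` are hypothetical callables you would supply:

```python
def injection_effect(agent_step, context, tool_output, sanitize):
    """Crude boundary-level takeover probe (illustrative only).

    Re-runs one agent decision twice: once with the raw tool output and once
    with a sanitized counterfactual. If the chosen action changes, the tool
    output, not the user task, is likely driving behavior at this boundary.
    """
    action_raw = agent_step(context, tool_output)
    action_cf = agent_step(context, sanitize(tool_output))
    return {
        "action_raw": action_raw,
        "action_counterfactual": action_cf,
        "flagged": action_raw != action_cf,
    }
```

In practice you would run this over cached tool-return boundaries (hence the advice to log and cache them) and aggregate flag rates per tool and per workflow.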
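The trajectory-reward advice above can be made concrete with a tiny scoring function. This is a generic sketch of the "path-centric, partial-credit" idea, not Search-P1's actual formulation; the weights `w_partial` and `redundancy_penalty` are illustrative:

```python
def path_reward(answer_correct, evidence, gold_evidence, tool_calls,
                w_partial=0.5, redundancy_penalty=0.05):
    """Trajectory-level reward for agentic RAG training (illustrative).

    Instead of a binary outcome-only reward, grant partial credit for gold
    evidence actually retrieved along the path (so near-miss failures still
    carry signal) and penalize redundant tool calls.
    """
    outcome = 1.0 if answer_correct else 0.0
    # Fraction of gold evidence the trajectory actually surfaced.
    recall = len(set(evidence) & set(gold_evidence)) / max(len(set(gold_evidence)), 1)
    # Duplicate tool invocations count as redundant actions.
    redundant = len(tool_calls) - len(set(tool_calls))
    return outcome + w_partial * recall - redundancy_penalty * redundant
```

Tracking this reward alongside convergence speed and redundant-action counts gives the diagnostics the bullet above recommends.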
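The stochasticity audit above needs a variance metric over canonicalized findings. One simple proxy (an assumption of this sketch, not the DRA paper's exact "total variance" definition) is mean pairwise Jaccard distance across the k runs:

```python
from itertools import combinations


def finding_variance(runs):
    """Run-to-run instability as mean pairwise Jaccard distance.

    `runs` is a list of finding lists (e.g. claims or cited URLs), one per
    repeated execution of the same query. 0.0 means every run produced the
    same canonicalized set; values near 1.0 mean runs barely overlap.
    """
    sets = [set(f.strip().lower() for f in run) for run in runs]
    pairs = list(combinations(sets, 2))
    if not pairs:
        return 0.0

    def jaccard_dist(a, b):
        union = a | b
        return (1.0 - len(a & b) / len(union)) if union else 0.0

    return sum(jaccard_dist(a, b) for a, b in pairs) / len(pairs)
```

Computing this per module output (query, summarize, update) localizes where the variance enters, which is the point of auditing before tuning temperature.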
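For the rater-effect correction above, a full MFRM/IRT fit needs a psychometrics package, but the additive-severity core can be approximated with two-way centering. This is a deliberately crude stand-in, assuming severity is a constant per-rater offset; real IRT also models discrimination and nonlinearity:

```python
def adjust_for_rater_severity(ratings):
    """Remove additive rater severity from a {(item, rater): score} table.

    Estimates each rater's severity as their mean deviation from per-item
    means, then subtracts it, so harsh and lenient raters become comparable
    before model-selection decisions are made from the adjusted means.
    """
    items = {i for i, _ in ratings}
    item_mean = {}
    for i in items:
        scores = [s for (ii, _), s in ratings.items() if ii == i]
        item_mean[i] = sum(scores) / len(scores)

    severity = {}
    for r in {r for _, r in ratings}:
        devs = [s - item_mean[i] for (i, rr), s in ratings.items() if rr == r]
        severity[r] = sum(devs) / len(devs)

    return {(i, r): s - severity[r] for (i, r), s in ratings.items()}
```

If a rater scores every item exactly one point below a lenient colleague, this adjustment makes their corrected scores coincide, which is the qualitative behavior one wants before comparing raw means.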

Generated from per-paper analyses; no external browsing.