Daily AI Paper Report (2026-02-28)

Published: February 28, 2026

Chinese version: [中文]

Run stats

Candidates: 262
Selected: 30
Deepread completed: 30
Window (UTC): 2026-02-26T01:00:00Z → 2026-02-27T01:00:00Z (arxiv_announce, expanded=0)

Show selected papers

arXiv ID	Title / Links	Categories	Score	Why	Tags
`2602.22755`	AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors PDF	cs.CL	96	Audit benchmark w/ 56 models hiding 14 bad traits; evaluates auditing tools + autonomous investigator agent.	alignment auditing, hidden behaviors, benchmarks, red-teaming, agent evaluation, model honesty
`2602.23329`	LLM Novice Uplift on Dual-Use, In Silico Biology Tasks PDF	cs.AI, cs.CL, cs.CR, cs.CY, cs.HC	96	Careful human uplift study on bio dual-use tasks; quantifies novice capability jump with LLMs	dual-use, biosecurity, human-uplift, evaluation, risk-assessment, LLMs
`2602.22724`	AgentSentry: Mitigating Indirect Prompt Injection in LLM Agents via Temporal Causal Diagnostics and Context Purification PDF	cs.CR, cs.AI	94	Targets indirect prompt injection in tool/RAG agents with multi-turn causal diagnostics + context purification.	agent security, prompt injection, tool use, RAG safety, inference-time defense, trajectory attacks
`2602.22525`	Systems-Level Attack Surface of Edge Agent Deployments on IoT PDF	cs.CR	94	Empirical security analysis of edge LLM agents on IoT; identifies concrete attack surfaces + metrics.	agent-security, edge-deployment, IoT, attack-surface, systems-security, provenance, MQTT
`2602.22557`	CourtGuard: A Model-Agnostic Framework for Zero-Shot Policy Adaptation in LLM Safety PDF	cs.AI, cs.LG	92	Model-agnostic zero-shot safety policy adaptation via retrieval-grounded multi-agent evidentiary debate.	policy compliance, RAG, multi-agent debate, governance, safety evaluation, zero-shot
`2602.22787`	Probing for Knowledge Attribution in Large Language Models PDF	cs.CL, cs.AI	92	Probe predicts whether outputs rely on prompt vs internal knowledge; useful for hallucination mitigation	hallucinations, attribution, faithfulness, factuality, probes, evaluation
`2602.22953`	General Agent Evaluation PDF	cs.AI	92	Proposes unified protocol + framework for general agent evaluation; addresses benchmark integration gaps.	agent-evaluation, benchmarks, evaluation-protocol, agentic-systems, framework
`2602.22603`	SideQuest: Model-Driven KV Cache Management for Long-Horizon Agentic Reasoning PDF	cs.AI, cs.LG	92	LRM-driven KV-cache compression for long-horizon agents; targets real bottleneck in agentic RAG.	agents, long-context, kv-cache, efficiency, reasoning, memory-management, RAG
`2602.22554`	Multilingual Safety Alignment Via Sparse Weight Editing PDF	cs.LG	90	Training-free sparse weight editing to reduce multilingual safety gaps; claims closed-form cross-lingual mapping.	multilingual safety, weight editing, safety neurons, alignment, low-resource languages, robustness
`2602.23271`	Evaluating Stochasticity in Deep Research Agents PDF	cs.AI	90	Formalizes and measures stochasticity/variance in deep research agents; identifies sources via MDP view.	agents, evaluation, reliability, stochasticity, research-agents, variance
`2602.22675`	Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization PDF	cs.CL	89	Agentic search framework prioritizing parallel evidence over deep reasoning; targets cost+generalization	agents, search, efficiency, long-horizon, generalization, deep-research
`2602.22556`	Stable Adaptive Thinking via Advantage Shaping and Length-Aware Gradient Regulation PDF	cs.LG, cs.AI, cs.CL	89	RL method to curb overthinking while preserving hard-query reasoning; practical accuracy/latency tradeoff.	reasoning, test-time-compute, RL, efficiency, adaptive-computation, alignment-adjacent
`2602.22775`	TherapyProbe: Generating Design Knowledge for Relational Safety in Mental Health Chatbots Through Adversarial Simulation PDF	cs.HC, cs.AI, cs.CL	88	Adversarial multi-agent simulation to surface multi-turn relational safety failures in mental health chatbots.	relational safety, mental health, multi-agent simulation, evaluation, conversation dynamics, harm modes
`2602.22576`	Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training PDF	cs.CL, cs.IR, cs.LG	88	Reward shaping for RL-trained agentic RAG; extracts signal from failures to improve sample efficiency	RAG, agents, reinforcement-learning, reward-shaping, training, retrieval
`2602.22897`	OmniGAIA: Towards Native Omni-Modal AI Agents PDF	cs.AI, cs.CL, cs.CV, cs.LG, cs.MM	88	Omni-modal agent benchmark (video+audio+image) with tool use and multi-hop reasoning; likely reusable.	multimodal, agents, benchmark, tool-use, evaluation, video, audio
`2602.23136`	Modality Collapse as Mismatched Decoding: Information-Theoretic Limits of Multimodal LLMs PDF	cs.CL, cs.AI, cs.LG	87	Info-theoretic account of modality collapse as mismatched decoding; actionable framing for multimodal LLMs.	multimodal-llms, information-theory, decoding, representation, modality-collapse, theory
`2602.22871`	Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching PDF	cs.CL, cs.AI	87	Step-level PRM scoring + stitching across diffusion CoTs; strong test-time scaling idea for reasoning.	reasoning, process-reward-model, test-time-scaling, diffusion-LM, self-consistency
`2602.22968`	Certified Circuits: Stability Guarantees for Mechanistic Circuits PDF	cs.AI, cs.CV, cs.CY	86	Provable stability guarantees for mechanistic circuit discovery via randomized subsampling certification.	mechanistic interpretability, circuits, robustness, certification, auditing, OOD stability
`2602.22638`	MobilityBench: A Benchmark for Evaluating Route-Planning Agents in Real-World Mobility Scenarios PDF	cs.AI	86	Real-world route-planning agent benchmark with deterministic API-replay sandbox for reproducibility	agents, benchmark, evaluation, tool-use, reproducibility, planning
`2602.22769`	AMA-Bench: Evaluating Long-Horizon Memory for Agentic Applications PDF	cs.AI, cs.LG	85	AMA-Bench evaluates long-horizon agent memory on real agent-environment trajectories (not just dialogue).	agent memory, benchmarks, long-horizon, evaluation, trajectories, agentic applications
`2602.22719`	Interpreting and Steering State-Space Models via Activation Subspace Bottlenecks PDF	cs.LG	85	Mechanistic interpretability for Mamba SSMs + simple activation steering yields broad gains.	interpretability, steering, state-space-models, Mamba, mechanistic-interpretability, reliability
`2602.23193`	ESAA: Event Sourcing for Autonomous Agents in LLM-Based Software Engineering PDF	cs.AI	84	Event-sourcing architecture for LLM agents: structured intentions + deterministic state/logging	agents, software-engineering, orchestration, state, reliability, audit-logs
`2602.23200`	InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models PDF	cs.LG, cs.CL	84	Hardware-aware KV-cache quantization reducing latency/memory for long-context decoding without accuracy loss.	efficiency, KV-cache, quantization, long-context, inference, systems
`2602.22758`	Decomposing Physician Disagreement in HealthBench PDF	cs.AI, stat.AP	83	Analyzes physician disagreement in HealthBench; highlights irreducible uncertainty in medical evals.	evaluation, medical-AI, uncertainty, human-judgment, benchmarking, reliability
`2602.22689`	No Caption, No Problem: Caption-Free Membership Inference via Model-Fitted Embeddings PDF	cs.CV, cs.CR	82	Caption-free membership inference for diffusion models using model-fitted synthetic conditioning inputs.	privacy, membership inference, diffusion models, data memorization, auditing, security
`2602.22585`	Correcting Human Labels for Rater Effects in AI Evaluation: An Item Response Theory Approach PDF	cs.AI, cs.LG	82	Uses IRT/Rasch to correct rater effects in human evals; improves reliability of AI conclusions	evaluation, human-ratings, psychometrics, IRT, RLHF, measurement
`2602.22642`	Compress the Easy, Explore the Hard: Difficulty-Aware Entropy Regularization for Efficient LLM Reasoning PDF	cs.LG	82	Difficulty-aware entropy regularization to compress CoT while avoiding entropy collapse on hard problems.	reasoning, CoT, efficiency, entropy-regularization, inference-cost, RL
`2602.23262`	Decomposing Private Image Generation via Coarse-to-Fine Wavelet Modeling PDF	cs.CV, cs.CR	81	DP image generation via coarse-to-fine wavelet modeling to reduce quality loss; privacy-relevant technique.	privacy, differential-privacy, image-generation, wavelets, memorization, DP-SGD
`2602.22699`	DPSQL+: A Differentially Private SQL Library with a Minimum Frequency Rule PDF	cs.CR, cs.DB, cs.LG	80	DP SQL library enforcing user-level DP plus minimum-frequency rule; practical governance-aligned privacy.	differential privacy, governance, SQL, data release, minimum frequency rule, privacy engineering
`2602.23079`	Assessing Deanonymization Risks with Stylometry-Assisted LLM Agent PDF	cs.CL, cs.CR, cs.LG	80	Stylometry+LLM agent for authorship inference; highlights and mitigates deanonymization risks	privacy, deanonymization, stylometry, LLM-agents, security, risk

AI Paper Insight Brief

2026-02-28

0) Executive takeaways (read this first)

Agent safety is shifting from “prompt-level” to “systems-level”: edge/hybrid deployments introduce measurable new failure windows (audit latency, failover blackouts, silent cloud fallback) and protocol-layer spoofing risks that bypass model-behavior defenses.
Dynamic, policy-text-grounded safety is becoming a viable alternative to weight-locked guardrails: retrieval-grounded “adjudication” (CourtGuard) shows strong benchmark performance and can swap policies zero-shot, but pays latency and depends on backbone formatting adherence.
RL for agentic RAG and reasoning efficiency is converging on “process/path shaping”: reward shaping over trajectories (Search-P1) and stability fixes for length heterogeneity (adaptive thinking; difficulty-aware entropy) report simultaneous accuracy gains and large token reductions.
Evaluation is getting more realistic—and more sobering: new benchmarks target agent memory (AMA-Bench), mobility tool use (MobilityBench), omni-modal tool agents (OmniGAIA), hidden-behavior auditing (AuditBench), and DRA stochasticity—often revealing that current systems fail for structural reasons (context/memory loss, tool misuse, run-to-run variance).
Privacy/security work is broadening beyond classic text MIAs: caption-free diffusion membership inference (MOFIT), DP SQL with minimum-frequency governance (DPSQL+), DP text-to-image via wavelet coarse-to-fine (DP-Wavelet), and stylometry-assisted deanonymization agents show both new attack surfaces and deployable mitigations.
Dual-use risk is increasingly about “human uplift,” not model scores: a human-subject study finds LLM access makes novices ~4.16× more accurate on biosecurity-relevant in silico tasks and most participants report little friction from safeguards.

2) Key themes (clusters)

Theme: Systems-level agent security & governance (beyond prompts)

Why it matters: As agents move into edge devices, tool buses, and long-horizon workflows, security hinges on architecture (messaging, failover, auditability) as much as model alignment. These are exploitable even if the model is “well-behaved.”
Representative papers:
Common approach:
- Treat agent safety as measurable systems properties (audit delay, provenance completeness, failover windows).
- Insert runtime governance layers at tool boundaries (counterfactual re-execution; contract validation; deterministic orchestrators).
- Prefer auditability + replay (event logs, cached tool outputs) to enable forensics and containment.
Open questions / failure modes:
- “Silent” boundary crossings (e.g., fallback-to-cloud) that evade user awareness and logging.
- Tool/runtime compromise and cache tampering are often out-of-scope but realistic.
- Latency/cost overhead of stronger governance layers vs. real-time actuation needs.

Theme: Policy adaptability & auditing hidden behaviors

Why it matters: Policies change faster than model releases. Separately, models can conceal problematic behaviors, so auditing needs benchmarks and agentic workflows—not just static probes.
Representative papers:
Common approach:
- Ground decisions in retrieved policy text and structured adjudication (debate + judge scoring).
- Build model organisms with known hidden behaviors and measure auditor/agent success.
- Use measurement models (IRT/MFRM) to correct systematic rater bias in human labels.
Open questions / failure modes:
- Tool-to-agent gap: evidence surfaced by tools may not translate into correct agent hypotheses.
- Policy corpus coverage bounds performance; missing/ambiguous policy text can dominate errors.
- Evaluation pipelines can be distorted by rater severity/centrality unless corrected.

Theme: Efficient reasoning & agentic RAG via process/path shaping

Why it matters: Frontier performance is increasingly limited by inference cost and RL instability (length heterogeneity, sparse rewards). Process-aware shaping aims to improve both accuracy and efficiency.
Representative papers:
Common approach:
- Modify GRPO/RLVR to stabilize training under variable-length trajectories (length-aware gradients; advantage shaping; selective entropy).
- Replace sparse outcome rewards with trajectory/path rewards (self-consistency vs reference-alignment; soft outcome scoring).
- Use difficulty gating (per-question historical accuracy) to allocate exploration budget.
Open questions / failure modes:
- Reliance on verifiable rewards (math/QA) may not transfer cleanly to open-ended domains.
- Reference plans (offline-generated) can bias learning toward a narrow strategy set.
- Entropy/exploration mechanisms can still “spend” tokens on hard questions without finding correct paths.

Theme: Agent evaluation realism: memory, tools, multimodality, and stochasticity

Why it matters: Many failures are not “model IQ” but systems issues: memory construction loss, tool misuse, non-reproducible APIs, and run-to-run variance. New benchmarks isolate these.
Representative papers:
Common approach:
- Build deterministic sandboxes (API replay) and decomposed metrics (tool validity, planning precision/recall, DR/FPR).
- Evaluate agent-centric memory over machine-generated artifacts and causal environment dynamics.
- Quantify stochasticity at multiple levels (answers/findings/citations) and attribute it to modules/steps.
Open questions / failure modes:
- Tool-call quantity is not monotonic with success (too few fails; too many doesn’t guarantee).
- Similarity-based retrieval and lossy compression can fail on dense, causally structured traces.
- Early stochasticity cascades; inference/update modules can dominate variance.

Theme: Privacy & dual-use: new auditing attacks, DP with governance constraints, and human uplift

Why it matters: Privacy risk is expanding to diffusion models and agentic deanonymization; DP deployments need governance rules (like minimum frequency). Dual-use risk depends on whether humans become more capable with LLMs.
Representative papers:
Common approach:
- Redefine threat models to match reality (image-only MIA; open-world authorship search; multi-query DP sessions).
- Use post-processing DP decompositions (private coarse structure + public detail completion).
- Measure human-in-the-loop capability change under extended interaction, not just model-only scores.
Open questions / failure modes:
- Caption-free diffusion MIAs can be slow (minutes per image) and may be mitigated by some adaptation methods (e.g., LoRA in evaluated setting).
- DP systems trade expressiveness for safety (restricted SQL subsets; session-scoped accounting).
- Safeguards may not create meaningful friction for motivated users (self-reported in uplift study).

3) Technical synthesis

Multiple papers converge on GRPO-style RL as a base, then add stability/credit-assignment fixes: CPAS+LAGR for length heterogeneity; CEEH for difficulty-gated entropy; Search-P1 for path-level dense rewards.
A recurring pattern is “process over outcome”: path-centric rewards (Search-P1), step-level scoring and reuse (diffusion stitching), and causal boundary diagnostics (AgentSentry) all extract signal from intermediate structure.
Tool boundaries are becoming the natural control point for both safety and evaluation: AgentSentry’s boundary-anchored counterfactuals, ESAA’s contract-validated intentions, and IoT MQTT topic enforcement gaps all sit at the tool/transport layer.
Benchmarks increasingly enforce reproducibility via determinism (MobilityBench API replay; DRA cached search) to separate model variance from environment variance.
Several works highlight measurement-modeling as a first-class component: IRT/MFRM for rater effects; stochasticity as total variance over canonicalized findings/citations; systems security as timing/egress metrics.
Memory/context management is splitting into two directions: semantic eviction/compression (SideQuest’s model-driven KV eviction of tool outputs) and structured external memory (AMA-Agent causality graphs + tool-augmented retrieval).
Safety alignment is diversifying beyond fine-tuning: training-free weight edits for multilingual safety (sparse low-rank edits) and policy-text swapping for moderation (CourtGuard).
Privacy auditing is moving toward optimization-based, model-fitted attacks (MOFIT) and governance-aware DP interfaces (DPSQL+), suggesting defenders need both ML and systems mitigations.
Across multimodal and agentic settings, a common failure is “information present but unusable”: modality collapse framed as mismatched decoding (GMI vs MI), and agent memory failures where construction/retrieval loses critical state.

4) Top 5 papers (with “why now”)

1) LLM Novice Uplift on Dual-Use, In Silico Biology Tasks

Quantifies human uplift: LLM access yields ~4.16× higher novice accuracy (odds ratio; adjusted accuracy ~5% → >17%).
Shows Treatment beats Control on 7/8 benchmarks, and can exceed expert internet-only baselines on some (e.g., HPCT, VCT).
Adds behavioral signals (longer, more structured responses; higher confidence) and reports 89.6% of participants had no difficulty overcoming safeguards.
Skepticism: study logistics changed mid-run (model availability), and some tasks had leakage (participants found questions online); not fully double-blind.

2) AgentSentry: Mitigating Indirect Prompt Injection…

Inference-time, black-box-compatible defense using boundary-anchored counterfactual re-executions and causal effect estimates (ACE/IE/DE).
Reports ASR = 0% with substantial utility across AgentDojo suites and multiple backbones; ablations show sanitized counterfactuals are critical.
Emphasizes safe continuation via context purification + minimal action revision, not blanket refusal.
Skepticism: lightweight configuration (e.g., K=1) may rely on benchmark injections being boundary-adjacent; tool/runtime compromise is out-of-scope.

3) CourtGuard: Zero-Shot Policy Adaptation in LLM Safety

Retrieval-grounded “Evidentiary Debate” enables policy swapping without fine-tuning; reports strong macro-average benchmark performance.
Demonstrates zero-shot adaptation to Wikipedia vandalism policy (90% on a balanced subset) and a legal grounding variant with expert review alignment.
Provides interpretable, policy-cited traces and claims dataset label-noise auditing utility.
Skepticism: adds inference latency; depends on backbone instruction/format adherence; bounded by policy corpus breadth.

4) AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors

Supplies a missing benchmark primitive: 56 models with 14 hidden behaviors designed not to confess when asked.
Evaluates an autonomous investigator agent across tool configurations and finds scaffolded black-box tools often outperform white-box tools.
Surfaces a key warning: tool-to-agent gap—static evidence doesn’t guarantee agent success.
Skepticism: targets are narrow fine-tunes on one base model (Llama 3.3 70B); implanted behaviors may differ from emergent real-world issues.

5) Systems-Level Attack Surface of Edge Agent Deployments on IoT

Makes “agent security” concrete at the architecture layer: measures actuation-to-audit delay, provenance completeness, data egress, and failover windows.
Finds MQTT broker accepted spoof/replay/direct safety-topic publishes by default; forced fallback can trigger silent cloud routing observable only via DNS/tcpdump.
Quantifies failover: end-to-end WiFi loss to fallback path 35.7s, while MQTT reconnection alone is milliseconds—highlighting where the real window is.
Skepticism: single testbed/topology; cloud egress comparison not workload-matched; mitigations not implemented/evaluated.

5) Practical next steps

For tool-using agents, add boundary-level security instrumentation: log tool-return boundaries, cache tool outputs for replay, and measure takeover risk via controlled counterfactual re-executions (AgentSentry-style) on your own workflows.
If deploying edge/hybrid agents, define and monitor systems safety SLOs: actuation-to-audit delay, failover blackout windows, provenance-chain completeness, and explicit alerts on any cloud fallback/egress.
For moderation/governance, prototype policy-text RAG adjudication with explicit scoring rubrics (regulatory vs practical threat) and measure latency/format-failure rates across backbones.
For RL training of agentic RAG, replace binary-only rewards with trajectory/path rewards (self-consistency + reference-alignment) and include partial credit for near-misses; track convergence speed and redundant tool actions.
For reasoning efficiency, test mode-control tokens (/think vs /no_think) and stabilize RL with length-aware gradient weighting; separately, try difficulty-gated entropy to avoid entropy collapse on hard items.
For evaluation, incorporate stochasticity audits: run agents k times per query, compute variance over findings/citations, and localize variance to modules (query vs summarize vs update) before tuning temperature.
For human-labeled evals, consider rater-effect correction (MFRM/IRT) and rater diagnostics before making model selection decisions from raw means.
For privacy, assume stronger auditors: evaluate diffusion models under caption-free MIA settings; for analytics, enforce both DP and governance constraints (minimum frequency) with integrated accounting; for text, assess stylometry/deanonymization risk and test guided rewrites.

Generated from per-paper analyses; no external browsing.

Di Tang

AI Paper Insight Brief

2026-02-28

0) Executive takeaways (read this first)

2) Key themes (clusters)

Theme: Systems-level agent security & governance (beyond prompts)

Theme: Policy adaptability & auditing hidden behaviors

Theme: Efficient reasoning & agentic RAG via process/path shaping

Theme: Agent evaluation realism: memory, tools, multimodality, and stochasticity

Theme: Privacy & dual-use: new auditing attacks, DP with governance constraints, and human uplift

3) Technical synthesis

4) Top 5 papers (with “why now”)

5) Practical next steps