Daily AI Paper Report (2026-02-28)
Run stats
- Candidates: 262
- Selected: 30
- Deepread completed: 30
- Window (UTC): 2026-02-26T01:00:00Z → 2026-02-27T01:00:00Z (arxiv_announce, expanded=0)
Selected papers
| arXiv ID | Title / Links | Categories | Score | Why | Tags |
|---|---|---|---|---|---|
2602.22755 | AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors | cs.CL | 96 | Audit benchmark w/ 56 models hiding 14 bad traits; evaluates auditing tools + autonomous investigator agent. | alignment auditing, hidden behaviors, benchmarks, red-teaming, agent evaluation, model honesty |
2602.23329 | LLM Novice Uplift on Dual-Use, In Silico Biology Tasks | cs.AI, cs.CL, cs.CR, cs.CY, cs.HC | 96 | Careful human uplift study on bio dual-use tasks; quantifies novice capability jump with LLMs | dual-use, biosecurity, human-uplift, evaluation, risk-assessment, LLMs |
2602.22724 | AgentSentry: Mitigating Indirect Prompt Injection in LLM Agents via Temporal Causal Diagnostics and Context Purification | cs.CR, cs.AI | 94 | Targets indirect prompt injection in tool/RAG agents with multi-turn causal diagnostics + context purification. | agent security, prompt injection, tool use, RAG safety, inference-time defense, trajectory attacks |
2602.22525 | Systems-Level Attack Surface of Edge Agent Deployments on IoT | cs.CR | 94 | Empirical security analysis of edge LLM agents on IoT; identifies concrete attack surfaces + metrics. | agent-security, edge-deployment, IoT, attack-surface, systems-security, provenance, MQTT |
2602.22557 | CourtGuard: A Model-Agnostic Framework for Zero-Shot Policy Adaptation in LLM Safety | cs.AI, cs.LG | 92 | Model-agnostic zero-shot safety policy adaptation via retrieval-grounded multi-agent evidentiary debate. | policy compliance, RAG, multi-agent debate, governance, safety evaluation, zero-shot |
2602.22787 | Probing for Knowledge Attribution in Large Language Models | cs.CL, cs.AI | 92 | Probe predicts whether outputs rely on prompt vs internal knowledge; useful for hallucination mitigation | hallucinations, attribution, faithfulness, factuality, probes, evaluation |
2602.22953 | General Agent Evaluation | cs.AI | 92 | Proposes unified protocol + framework for general agent evaluation; addresses benchmark integration gaps. | agent-evaluation, benchmarks, evaluation-protocol, agentic-systems, framework |
2602.22603 | SideQuest: Model-Driven KV Cache Management for Long-Horizon Agentic Reasoning | cs.AI, cs.LG | 92 | LRM-driven KV-cache compression for long-horizon agents; targets real bottleneck in agentic RAG. | agents, long-context, kv-cache, efficiency, reasoning, memory-management, RAG |
2602.22554 | Multilingual Safety Alignment Via Sparse Weight Editing | cs.LG | 90 | Training-free sparse weight editing to reduce multilingual safety gaps; claims closed-form cross-lingual mapping. | multilingual safety, weight editing, safety neurons, alignment, low-resource languages, robustness |
2602.23271 | Evaluating Stochasticity in Deep Research Agents | cs.AI | 90 | Formalizes and measures stochasticity/variance in deep research agents; identifies sources via MDP view. | agents, evaluation, reliability, stochasticity, research-agents, variance |
2602.22675 | Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization | cs.CL | 89 | Agentic search framework prioritizing parallel evidence over deep reasoning; targets cost+generalization | agents, search, efficiency, long-horizon, generalization, deep-research |
2602.22556 | Stable Adaptive Thinking via Advantage Shaping and Length-Aware Gradient Regulation | cs.LG, cs.AI, cs.CL | 89 | RL method to curb overthinking while preserving hard-query reasoning; practical accuracy/latency tradeoff. | reasoning, test-time-compute, RL, efficiency, adaptive-computation, alignment-adjacent |
2602.22775 | TherapyProbe: Generating Design Knowledge for Relational Safety in Mental Health Chatbots Through Adversarial Simulation | cs.HC, cs.AI, cs.CL | 88 | Adversarial multi-agent simulation to surface multi-turn relational safety failures in mental health chatbots. | relational safety, mental health, multi-agent simulation, evaluation, conversation dynamics, harm modes |
2602.22576 | Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training | cs.CL, cs.IR, cs.LG | 88 | Reward shaping for RL-trained agentic RAG; extracts signal from failures to improve sample efficiency | RAG, agents, reinforcement-learning, reward-shaping, training, retrieval |
2602.22897 | OmniGAIA: Towards Native Omni-Modal AI Agents | cs.AI, cs.CL, cs.CV, cs.LG, cs.MM | 88 | Omni-modal agent benchmark (video+audio+image) with tool use and multi-hop reasoning; likely reusable. | multimodal, agents, benchmark, tool-use, evaluation, video, audio |
2602.23136 | Modality Collapse as Mismatched Decoding: Information-Theoretic Limits of Multimodal LLMs | cs.CL, cs.AI, cs.LG | 87 | Info-theoretic account of modality collapse as mismatched decoding; actionable framing for multimodal LLMs. | multimodal-llms, information-theory, decoding, representation, modality-collapse, theory |
2602.22871 | Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching | cs.CL, cs.AI | 87 | Step-level PRM scoring + stitching across diffusion CoTs; strong test-time scaling idea for reasoning. | reasoning, process-reward-model, test-time-scaling, diffusion-LM, self-consistency |
2602.22968 | Certified Circuits: Stability Guarantees for Mechanistic Circuits | cs.AI, cs.CV, cs.CY | 86 | Provable stability guarantees for mechanistic circuit discovery via randomized subsampling certification. | mechanistic interpretability, circuits, robustness, certification, auditing, OOD stability |
2602.22638 | MobilityBench: A Benchmark for Evaluating Route-Planning Agents in Real-World Mobility Scenarios | cs.AI | 86 | Real-world route-planning agent benchmark with deterministic API-replay sandbox for reproducibility | agents, benchmark, evaluation, tool-use, reproducibility, planning |
2602.22769 | AMA-Bench: Evaluating Long-Horizon Memory for Agentic Applications | cs.AI, cs.LG | 85 | AMA-Bench evaluates long-horizon agent memory on real agent-environment trajectories (not just dialogue). | agent memory, benchmarks, long-horizon, evaluation, trajectories, agentic applications |
2602.22719 | Interpreting and Steering State-Space Models via Activation Subspace Bottlenecks | cs.LG | 85 | Mechanistic interpretability for Mamba SSMs + simple activation steering yields broad gains. | interpretability, steering, state-space-models, Mamba, mechanistic-interpretability, reliability |
2602.23193 | ESAA: Event Sourcing for Autonomous Agents in LLM-Based Software Engineering | cs.AI | 84 | Event-sourcing architecture for LLM agents: structured intentions + deterministic state/logging | agents, software-engineering, orchestration, state, reliability, audit-logs |
2602.23200 | InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models | cs.LG, cs.CL | 84 | Hardware-aware KV-cache quantization reducing latency/memory for long-context decoding without accuracy loss. | efficiency, KV-cache, quantization, long-context, inference, systems |
2602.22758 | Decomposing Physician Disagreement in HealthBench | cs.AI, stat.AP | 83 | Analyzes physician disagreement in HealthBench; highlights irreducible uncertainty in medical evals. | evaluation, medical-AI, uncertainty, human-judgment, benchmarking, reliability |
2602.22689 | No Caption, No Problem: Caption-Free Membership Inference via Model-Fitted Embeddings | cs.CV, cs.CR | 82 | Caption-free membership inference for diffusion models using model-fitted synthetic conditioning inputs. | privacy, membership inference, diffusion models, data memorization, auditing, security |
2602.22585 | Correcting Human Labels for Rater Effects in AI Evaluation: An Item Response Theory Approach | cs.AI, cs.LG | 82 | Uses IRT/Rasch to correct rater effects in human evals; improves reliability of AI conclusions | evaluation, human-ratings, psychometrics, IRT, RLHF, measurement |
2602.22642 | Compress the Easy, Explore the Hard: Difficulty-Aware Entropy Regularization for Efficient LLM Reasoning | cs.LG | 82 | Difficulty-aware entropy regularization to compress CoT while avoiding entropy collapse on hard problems. | reasoning, CoT, efficiency, entropy-regularization, inference-cost, RL |
2602.23262 | Decomposing Private Image Generation via Coarse-to-Fine Wavelet Modeling | cs.CV, cs.CR | 81 | DP image generation via coarse-to-fine wavelet modeling to reduce quality loss; privacy-relevant technique. | privacy, differential-privacy, image-generation, wavelets, memorization, DP-SGD |
2602.22699 | DPSQL+: A Differentially Private SQL Library with a Minimum Frequency Rule | cs.CR, cs.DB, cs.LG | 80 | DP SQL library enforcing user-level DP plus minimum-frequency rule; practical governance-aligned privacy. | differential privacy, governance, SQL, data release, minimum frequency rule, privacy engineering |
2602.23079 | Assessing Deanonymization Risks with Stylometry-Assisted LLM Agent | cs.CL, cs.CR, cs.LG | 80 | Stylometry+LLM agent for authorship inference; highlights and mitigates deanonymization risks | privacy, deanonymization, stylometry, LLM-agents, security, risk |
AI Paper Insight Brief
2026-02-28
0) Executive takeaways (read this first)
- Agent safety is shifting from “prompt-level” to “systems-level”: edge/hybrid deployments introduce measurable new failure windows (audit latency, failover blackouts, silent cloud fallback) and protocol-layer spoofing risks that bypass model-behavior defenses.
- Dynamic, policy-text-grounded safety is becoming a viable alternative to weight-locked guardrails: retrieval-grounded “adjudication” (CourtGuard) shows strong benchmark performance and can swap policies zero-shot, but pays latency and depends on backbone formatting adherence.
- RL for agentic RAG and reasoning efficiency is converging on “process/path shaping”: reward shaping over trajectories (Search-P1) and stability fixes for length heterogeneity (adaptive thinking; difficulty-aware entropy) report simultaneous accuracy gains and large token reductions.
- Evaluation is getting more realistic, and more sobering: new benchmarks target agent memory (AMA-Bench), mobility tool use (MobilityBench), omni-modal tool agents (OmniGAIA), hidden-behavior auditing (AuditBench), and stochasticity in deep research agents (DRAs), often revealing that current systems fail for structural reasons (context/memory loss, tool misuse, run-to-run variance).
- Privacy/security work is broadening beyond classic text MIAs: caption-free diffusion membership inference (MOFIT), DP SQL with minimum-frequency governance (DPSQL+), DP text-to-image via wavelet coarse-to-fine (DP-Wavelet), and stylometry-assisted deanonymization agents show both new attack surfaces and deployable mitigations.
- Dual-use risk is increasingly about “human uplift,” not model scores: a human-subject study finds that LLM access raises novices' odds of answering correctly by ~4.16× on biosecurity-relevant in silico tasks, and most participants report little friction from safeguards.
2) Key themes (clusters)
Theme: Systems-level agent security & governance (beyond prompts)
- Why it matters: As agents move into edge devices, tool buses, and long-horizon workflows, security hinges on architecture (messaging, failover, auditability) as much as model alignment. Architectural weaknesses are exploitable even if the model is “well-behaved.”
- Representative papers:
- Common approach:
  - Treat agent safety as measurable systems properties (audit delay, provenance completeness, failover windows).
  - Insert runtime governance layers at tool boundaries (counterfactual re-execution; contract validation; deterministic orchestrators).
  - Prefer auditability + replay (event logs, cached tool outputs) to enable forensics and containment.
- Open questions / failure modes:
  - “Silent” boundary crossings (e.g., fallback-to-cloud) that evade user awareness and logging.
  - Tool/runtime compromise and cache tampering are often out-of-scope but realistic.
  - Latency/cost overhead of stronger governance layers vs. real-time actuation needs.
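The systems properties named in this theme (audit delay, provenance completeness, failover windows) can be computed from a timestamped event log. The sketch below is illustrative only and is not the tooling of any paper in this report; the event schema (`kind`, `t`, `action_id`), the event kinds, and the sample trace are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    kind: str                        # hypothetical kinds: "actuate", "audit", "link_down", "fallback_up"
    t: float                         # seconds since the start of the trace
    action_id: Optional[str] = None  # links an audit record to its actuation


def audit_delays(events):
    """Map each actuation to the delay until its first audit record (None if never audited)."""
    first_audit = {}
    for e in events:
        if e.kind == "audit" and e.action_id not in first_audit:
            first_audit[e.action_id] = e.t
    delays = {}
    for e in events:
        if e.kind == "actuate":
            t_audit = first_audit.get(e.action_id)
            delays[e.action_id] = None if t_audit is None else t_audit - e.t
    return delays


def provenance_completeness(events):
    """Fraction of actuations that have a matching audit record."""
    delays = audit_delays(events)
    if not delays:
        return 1.0
    return sum(d is not None for d in delays.values()) / len(delays)


def failover_windows(events):
    """Durations between each link loss and the next point where a fallback path is up."""
    windows, down_at = [], None
    for e in sorted(events, key=lambda ev: ev.t):
        if e.kind == "link_down" and down_at is None:
            down_at = e.t
        elif e.kind == "fallback_up" and down_at is not None:
            windows.append(e.t - down_at)
            down_at = None
    return windows


# Made-up trace for illustration; not measurements from any paper.
trace = [
    Event("actuate", 1.0, "a1"),
    Event("audit", 1.5, "a1"),
    Event("actuate", 2.0, "a2"),   # never audited
    Event("link_down", 10.0),
    Event("fallback_up", 40.0),
]
```

Tracking these three numbers per deployment turns the "open questions" above into thresholds that can be alerted on, for example a maximum tolerated failover window.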
Theme: Policy adaptability & auditing hidden behaviors
- Why it matters: Policies change faster than model releases. Separately, models can conceal problematic behaviors, so auditing needs benchmarks and agentic workflows—not just static probes.
- Representative papers:
- Common approach:
  - Ground decisions in retrieved policy text and structured adjudication (debate + judge scoring).
  - Build model organisms with known hidden behaviors and measure auditor/agent success.
  - Use measurement models (IRT/MFRM) to correct systematic rater bias in human labels.
- Open questions / failure modes:
  - Tool-to-agent gap: evidence surfaced by tools may not translate into correct agent hypotheses.
  - Policy corpus coverage bounds performance; missing/ambiguous policy text can dominate errors.
  - Evaluation pipelines can be distorted by rater severity/centrality unless corrected.
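To see why rater severity can distort raw means, consider a deliberately simplified additive model, score = item quality + rater offset, fitted by alternating averages. This is not the IRT/MFRM procedure from the paper (those models work on a logit scale with ordinal categories); it is a stand-in that shows the same effect. The function name and the toy ratings are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def correct_rater_severity(ratings, rounds=20):
    """Fit score ~ quality[item] + offset[rater] by alternating means.

    ratings: list of (rater, item, score) tuples.
    Returns (quality, offset) as plain dicts.
    """
    quality = defaultdict(float)
    offset = defaultdict(float)
    for _ in range(rounds):
        # Re-estimate item quality after removing the current rater offsets.
        by_item = defaultdict(list)
        for rater, item, score in ratings:
            by_item[item].append(score - offset[rater])
        for item, vals in by_item.items():
            quality[item] = mean(vals)
        # Re-estimate rater offsets after removing the current item quality.
        by_rater = defaultdict(list)
        for rater, item, score in ratings:
            by_rater[rater].append(score - quality[item])
        for rater, vals in by_rater.items():
            offset[rater] = mean(vals)
        # Centre the offsets so the model is identifiable.
        shift = mean(offset.values())
        for rater in offset:
            offset[rater] -= shift
    return dict(quality), dict(offset)


# Toy data: "harsh" scores 1 point low, "lenient" 1 point high.
# Item y is seen only by the harsh rater, item z only by the lenient one.
ratings = [
    ("harsh", "x", 2), ("harsh", "y", 2),
    ("lenient", "x", 4), ("lenient", "z", 5),
]
quality, offset = correct_rater_severity(ratings)
```

Raw means would rank y at 2 and z at 5; after correction the estimates approach 3 and 4, because the shared item x lets the model separate rater severity from item quality.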
Theme: Efficient reasoning & agentic RAG via process/path shaping
- Why it matters: Frontier performance is increasingly limited by inference cost and RL instability (length heterogeneity, sparse rewards). Process-aware shaping aims to improve both accuracy and efficiency.
- Representative papers:
- Common approach:
  - Modify GRPO/RLVR to stabilize training under variable-length trajectories (length-aware gradients; advantage shaping; selective entropy).
  - Replace sparse outcome rewards with trajectory/path rewards (self-consistency vs reference-alignment; soft outcome scoring).
  - Use difficulty gating (per-question historical accuracy) to allocate exploration budget.
- Open questions / failure modes:
  - Reliance on verifiable rewards (math/QA) may not transfer cleanly to open-ended domains.
  - Reference plans (offline-generated) can bias learning toward a narrow strategy set.
  - Entropy/exploration mechanisms can still “spend” tokens on hard questions without finding correct paths.
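Two of the ingredients above can be sketched in a few lines: the group-relative advantage that GRPO-style methods start from, and a difficulty gate keyed on per-question historical accuracy. The gate shown here is a hypothetical simplification, and its `base` and `hard_threshold` values are made up; the papers' actual shaping terms are not reproduced.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardise rewards within one prompt's rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def entropy_coefficient(historical_accuracy, base=0.01, hard_threshold=0.3):
    """Hypothetical gate: keep the entropy bonus only for questions that have been hard so far."""
    return base if historical_accuracy < hard_threshold else 0.0


# Four rollouts for one prompt, with binary verifiable rewards.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

When every rollout in a group gets the same reward, all advantages are zero and the group contributes no gradient signal, which is one reason sparse outcome rewards are sample-inefficient and path-level rewards are attractive.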
Theme: Agent evaluation realism: memory, tools, multimodality, and stochasticity
- Why it matters: Many failures are not “model IQ” but systems issues: memory construction loss, tool misuse, non-reproducible APIs, and run-to-run variance. New benchmarks isolate these.
- Representative papers:
- Common approach:
  - Build deterministic sandboxes (API replay) and decomposed metrics (tool validity, planning precision/recall, DR/FPR).
  - Evaluate agent-centric memory over machine-generated artifacts and causal environment dynamics.
  - Quantify stochasticity at multiple levels (answers/findings/citations) and attribute it to modules/steps.
- Open questions / failure modes:
  - Tool-call quantity is not monotonic with success (too few calls fail; too many do not guarantee success).
  - Similarity-based retrieval and lossy compression can fail on dense, causally structured traces.
  - Early stochasticity cascades; inference/update modules can dominate variance.
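A simple way to start measuring run-to-run variance is to compare the sets of findings produced by k runs of the same query. The sketch below uses mean pairwise Jaccard distance as a proxy; it is not the variance definition used in the stochasticity paper, and it assumes findings have already been canonicalized into comparable strings. The function names and sample runs are hypothetical.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """1 - |A and B| / |A or B|; defined as 0 when both sets are empty."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def run_to_run_disagreement(runs):
    """Mean pairwise Jaccard distance across the finding sets of k runs."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        return 0.0
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)


# Three runs of the same query; findings are canonicalized identifiers.
runs = [{"f1", "f2"}, {"f1", "f2"}, {"f1", "f3"}]
score = run_to_run_disagreement(runs)
```

The same function can be applied separately to answers, findings, and citations, and to the outputs of individual modules, to see where disagreement first appears.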
Theme: Privacy & dual-use: new auditing attacks, DP with governance constraints, and human uplift
- Why it matters: Privacy risk is expanding to diffusion models and agentic deanonymization; DP deployments need governance rules (like minimum frequency). Dual-use risk depends on whether humans become more capable with LLMs.
- Representative papers:
- Common approach:
  - Redefine threat models to match reality (image-only MIA; open-world authorship search; multi-query DP sessions).
  - Use post-processing DP decompositions (private coarse structure + public detail completion).
  - Measure human-in-the-loop capability change under extended interaction, not just model-only scores.
- Open questions / failure modes:
  - Caption-free diffusion MIAs can be slow (minutes per image) and may be mitigated by some adaptation methods (e.g., LoRA in the evaluated setting).
  - DP systems trade expressiveness for safety (restricted SQL subsets; session-scoped accounting).
  - Safeguards may not create meaningful friction for motivated users (self-reported in uplift study).
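As a rough illustration of combining DP with a minimum-frequency rule, the sketch below releases per-group user counts with Laplace noise and then suppresses groups whose noisy count falls below a threshold. How DPSQL+ actually implements its rule is not described in this report, so everything here is an assumption: each user contributes to exactly one group, the list of group keys (`domain`) is public, and thresholding is applied to the noisy count so that it is pure post-processing.

```python
import random


def laplace_noise(scale, rng):
    """Laplace(0, scale) as the difference of two i.i.d. exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_group_counts(user_groups, domain, epsilon, min_frequency, rng):
    """Release noisy per-group user counts, suppressing groups below min_frequency.

    user_groups: dict user_id -> group key (one group per user, so sensitivity is 1).
    domain: public list of group keys; only these keys can appear in the output.
    """
    true_counts = {g: 0 for g in domain}
    for group in user_groups.values():
        if group in true_counts:
            true_counts[group] += 1
    scale = 1.0 / epsilon  # Laplace mechanism with L1 sensitivity 1
    released = {}
    for g in domain:
        noisy = true_counts[g] + laplace_noise(scale, rng)
        if noisy >= min_frequency:  # applied to the noisy value: post-processing
            released[g] = noisy
    return released


# Hypothetical data: 12 users in group "a", 2 users in group "b", none in "c".
users = {f"u{i}": "a" for i in range(12)}
users.update({"v1": "b", "v2": "b"})
result = dp_group_counts(users, ["a", "b", "c"], epsilon=1.0,
                         min_frequency=5, rng=random.Random(0))
```

A production system would also need per-session budget accounting across queries, which this sketch omits.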
3) Technical synthesis
- Multiple papers converge on GRPO-style RL as a base, then add stability/credit-assignment fixes: CPAS+LAGR for length heterogeneity; CEEH for difficulty-gated entropy; Search-P1 for path-level dense rewards.
- A recurring pattern is “process over outcome”: path-centric rewards (Search-P1), step-level scoring and reuse (diffusion stitching), and causal boundary diagnostics (AgentSentry) all extract signal from intermediate structure.
- Tool boundaries are becoming the natural control point for both safety and evaluation: AgentSentry’s boundary-anchored counterfactuals, ESAA’s contract-validated intentions, and IoT MQTT topic enforcement gaps all sit at the tool/transport layer.
- Benchmarks increasingly enforce reproducibility via determinism (MobilityBench API replay; DRA cached search) to separate model variance from environment variance.
- Several works highlight measurement-modeling as a first-class component: IRT/MFRM for rater effects; stochasticity as total variance over canonicalized findings/citations; systems security as timing/egress metrics.
- Memory/context management is splitting into two directions: semantic eviction/compression (SideQuest’s model-driven KV eviction of tool outputs) and structured external memory (AMA-Agent causality graphs + tool-augmented retrieval).
- Safety alignment is diversifying beyond fine-tuning: training-free weight edits for multilingual safety (sparse low-rank edits) and policy-text swapping for moderation (CourtGuard).
- Privacy auditing is moving toward optimization-based, model-fitted attacks (MOFIT) and governance-aware DP interfaces (DPSQL+), suggesting defenders need both ML and systems mitigations.
- Across multimodal and agentic settings, a common failure is “information present but unusable”: modality collapse framed as mismatched decoding (generalized mutual information, GMI, vs. mutual information, MI), and agent memory failures where construction/retrieval loses critical state.
4) Top 5 papers (with “why now”)
1) LLM Novice Uplift on Dual-Use, In Silico Biology Tasks
- Quantifies human uplift: LLM access yields ~4.16× higher odds of a correct answer for novices (an odds ratio; adjusted accuracy rises from ~5% to >17%).
- Shows Treatment beats Control on 7/8 benchmarks, and can exceed expert internet-only baselines on some (e.g., HPCT, VCT).
- Adds behavioral signals (longer, more structured responses; higher confidence) and reports 89.6% of participants had no difficulty overcoming safeguards.
- Skepticism: study logistics changed mid-run (model availability), and some tasks had leakage (participants found questions online); not fully double-blind.
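The ~4.16× figure is an odds ratio, which is not the same as a ratio of accuracies. A quick check with the rounded endpoints quoted above (5% and 17%, used here as illustrative inputs, not the study's exact adjusted estimates) shows the difference:

```python
def odds(p):
    """Convert a probability into odds."""
    return p / (1.0 - p)


def odds_ratio(p_treatment, p_control):
    """Ratio of the odds of success under treatment vs. control."""
    return odds(p_treatment) / odds(p_control)


accuracy_ratio = 0.17 / 0.05             # about 3.4
rounded_or = odds_ratio(0.17, 0.05)      # about 3.89
```

The rounded inputs give an odds ratio near 3.9. Since the treatment accuracy is reported as above 17% and the control as roughly 5%, the reported ~4.16 is plausible, while the plain accuracy ratio would be closer to 3.4.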
2) AgentSentry: Mitigating Indirect Prompt Injection…
- Inference-time, black-box-compatible defense using boundary-anchored counterfactual re-executions and causal effect estimates (ACE/IE/DE).
- Reports an attack success rate (ASR) of 0% while retaining substantial task utility across AgentDojo suites and multiple backbones; ablations show sanitized counterfactuals are critical.
- Emphasizes safe continuation via context purification + minimal action revision, not blanket refusal.
- Skepticism: lightweight configuration (e.g., K=1) may rely on benchmark injections being boundary-adjacent; tool/runtime compromise is out-of-scope.
3) CourtGuard: Zero-Shot Policy Adaptation in LLM Safety
- Retrieval-grounded “Evidentiary Debate” enables policy swapping without fine-tuning; reports strong macro-average benchmark performance.
- Demonstrates zero-shot adaptation to Wikipedia vandalism policy (90% on a balanced subset) and a legal grounding variant with expert review alignment.
- Provides interpretable, policy-cited traces and claims dataset label-noise auditing utility.
- Skepticism: adds inference latency; depends on backbone instruction/format adherence; bounded by policy corpus breadth.
4) AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors
- Supplies a missing benchmark primitive: 56 models with 14 hidden behaviors designed not to confess when asked.
- Evaluates an autonomous investigator agent across tool configurations and finds scaffolded black-box tools often outperform white-box tools.
- Surfaces a key warning: tool-to-agent gap—static evidence doesn’t guarantee agent success.
- Skepticism: targets are narrow fine-tunes on one base model (Llama 3.3 70B); implanted behaviors may differ from emergent real-world issues.
5) Systems-Level Attack Surface of Edge Agent Deployments on IoT
- Makes “agent security” concrete at the architecture layer: measures actuation-to-audit delay, provenance completeness, data egress, and failover windows.
- Finds that the MQTT broker accepted spoofed, replayed, and direct safety-topic publishes by default; forced fallback can trigger silent cloud routing observable only via DNS/tcpdump.
- Quantifies failover: the end-to-end time from WiFi loss to the fallback path is 35.7 s, while MQTT reconnection alone takes milliseconds, which shows where the real window is.
- Skepticism: single testbed/topology; cloud egress comparison not workload-matched; mitigations not implemented/evaluated.
5) Practical next steps
- For tool-using agents, add boundary-level security instrumentation: log tool-return boundaries, cache tool outputs for replay, and measure takeover risk via controlled counterfactual re-executions (AgentSentry-style) on your own workflows.
- If deploying edge/hybrid agents, define and monitor systems safety SLOs: actuation-to-audit delay, failover blackout windows, provenance-chain completeness, and explicit alerts on any cloud fallback/egress.
- For moderation/governance, prototype policy-text RAG adjudication with explicit scoring rubrics (regulatory vs practical threat) and measure latency/format-failure rates across backbones.
- For RL training of agentic RAG, replace binary-only rewards with trajectory/path rewards (self-consistency + reference-alignment) and include partial credit for near-misses; track convergence speed and redundant tool actions.
- For reasoning efficiency, test mode-control tokens (/think vs /no_think) and stabilize RL with length-aware gradient weighting; separately, try difficulty-gated entropy to avoid entropy collapse on hard items.
- For evaluation, incorporate stochasticity audits: run agents k times per query, compute variance over findings/citations, and localize variance to modules (query vs summarize vs update) before tuning temperature.
- For human-labeled evals, consider rater-effect correction (MFRM/IRT) and rater diagnostics before making model selection decisions from raw means.
- For privacy, assume stronger auditors: evaluate diffusion models under caption-free MIA settings; for analytics, enforce both DP and governance constraints (minimum frequency) with integrated accounting; for text, assess stylometry/deanonymization risk and test guided rewrites.
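Several of the steps above (logging tool-return boundaries, caching tool outputs for replay, separating model variance from environment variance) share one building block: a record-and-replay wrapper around tool calls. The sketch below is a minimal version under stated assumptions; `ReplayCache`, the `tool_fn(name, **kwargs)` calling convention, and the in-memory store are hypothetical and do not come from any of the papers.

```python
import hashlib
import json


class ReplayCache:
    """Record tool outputs on the first call; replay them deterministically afterwards."""

    def __init__(self, tool_fn, store=None, record=True):
        self.tool_fn = tool_fn                       # callable: tool_fn(name, **kwargs)
        self.store = {} if store is None else store
        self.record = record                         # False = strict replay, no live calls
        self.log = []                                # ordered boundary log of (key, source)

    @staticmethod
    def _key(name, kwargs):
        # Canonical JSON so that argument order does not change the key.
        payload = json.dumps({"tool": name, "args": kwargs}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def call(self, name, **kwargs):
        key = self._key(name, kwargs)
        if key in self.store:
            self.log.append((key, "replay"))
            return self.store[key]
        if not self.record:
            raise KeyError(f"no recorded output for tool {name!r} with args {kwargs!r}")
        result = self.tool_fn(name, **kwargs)
        self.store[key] = result
        self.log.append((key, "live"))
        return result
```

Running an agent k times against a strict-replay cache (`record=False`) holds the environment fixed, so any remaining variation across runs comes from the agent itself. Note that the cache is itself an asset to protect: as noted earlier in this brief, cache tampering is a realistic threat.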
Generated from per-paper analyses; no external browsing.
