Daily AI Paper Report (2026-02-27)

Published: February 27, 2026

Chinese version: [中文]

Run stats

Candidates: 262
Selected: 30
Deepread completed: 30
Window (UTC): 2026-02-26T01:00:00Z → 2026-02-27T01:00:00Z (arxiv_announce, expanded=0)

Show selected papers

arXiv ID	Title / Links	Categories	Score	Why	Tags
`2602.22755`	AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors PDF	cs.CL	96	Benchmark of hidden misalignment behaviors + agentic auditing; strong for eval & oversight research	alignment auditing, benchmark, hidden behaviors, model evaluation, agent tools, deception
`2602.23329`	LLM Novice Uplift on Dual-Use, In Silico Biology Tasks PDF	cs.AI, cs.CL, cs.CR, cs.CY, cs.HC	96	Careful human uplift study on bio dual-use tasks; quantifies novice capability jump with LLM access.	dual-use, biosecurity, human-uplift, evaluation, risk-assessment, LLMs
`2602.22724`	AgentSentry: Mitigating Indirect Prompt Injection in LLM Agents via Temporal Causal Diagnostics and Context Purification PDF	cs.CR, cs.AI	94	Directly targets indirect prompt injection in agents with trajectory-aware detection/mitigation	agent security, prompt injection, tool outputs, inference-time defense, causal diagnostics, context sanitization
`2602.22525`	Systems-Level Attack Surface of Edge Agent Deployments on IoT PDF	cs.CR	94	Empirical security analysis of edge LLM agents; defines measurable system security metrics + failures.	agent-security, edge-agents, IoT, attack-surface, systems-security, provenance, MQTT
`2602.22557`	CourtGuard: A Model-Agnostic Framework for Zero-Shot Policy Adaptation in LLM Safety PDF	cs.AI, cs.LG	92	Zero-shot safety policy adaptation via RAG + adversarial debate grounded in policy docs	LLM safety, policy compliance, RAG, multi-agent debate, governance, zero-shot
`2602.22787`	Probing for Knowledge Attribution in Large Language Models PDF	cs.CL, cs.AI	92	Probe predicts whether outputs rely on prompt vs internal knowledge; useful for hallucination mitigation.	hallucinations, attribution, interpretability, faithfulness, factuality, probing
`2602.22953`	General Agent Evaluation PDF	cs.AI	92	Proposes unified protocol + framework for general-agent evaluation; addresses benchmark integration bias.	agent-evaluation, benchmarks, protocols, framework, general-agents, measurement
`2602.22603`	SideQuest: Model-Driven KV Cache Management for Long-Horizon Agentic Reasoning PDF	cs.AI, cs.LG	92	LRM-driven KV cache compression for long-horizon agents; targets real bottleneck in agentic reasoning.	agents, long-context, KV-cache, memory, efficiency, reasoning
`2602.22554`	Multilingual Safety Alignment Via Sparse Weight Editing PDF	cs.LG	90	Training-free multilingual safety alignment via sparse weight editing; addresses cross-lingual guardrail gaps	multilingual safety, weight editing, safety neurons, alignment, low-resource languages, robustness
`2602.22576`	Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training PDF	cs.CL, cs.IR, cs.LG	89	Reward shaping for agentic RAG RL; extracts signal from failed trajectories to improve sample efficiency.	agentic-RAG, reinforcement-learning, reward-shaping, retrieval, training, reasoning
`2602.23271`	Evaluating Stochasticity in Deep Research Agents PDF	cs.AI	89	Formalizes and measures stochasticity/variance in research agents; identifies sources via MDP framing.	agents, evaluation, stochasticity, reliability, deep-research, variance
`2602.22556`	Stable Adaptive Thinking via Advantage Shaping and Length-Aware Gradient Regulation PDF	cs.LG, cs.AI, cs.CL	89	Adaptive thinking RL to curb overthinking while preserving hard-query reasoning; practical reliability gain.	reasoning, RL, efficiency, overthinking, post-training, LRM
`2602.22775`	TherapyProbe: Generating Design Knowledge for Relational Safety in Mental Health Chatbots Through Adversarial Simulation PDF	cs.HC, cs.AI, cs.CL	88	Multi-agent adversarial simulation to surface long-horizon relational safety failures in therapy chatbots	conversational safety, mental health, multi-turn evaluation, adversarial simulation, relational harms, red teaming
`2602.22675`	Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization PDF	cs.CL	87	Agentic search framework emphasizing parallel evidence gathering to cut cost and improve generalization.	agents, search, efficiency, long-horizon, generalization, deep-research
`2602.22897`	OmniGAIA: Towards Native Omni-Modal AI Agents PDF	cs.AI, cs.CL, cs.CV, cs.LG, cs.MM	87	Omni-modal agent benchmark (audio+video+image+tools) with multi-hop queries; useful for capability eval.	multimodal, agents, benchmark, tool-use, evaluation, audio, video
`2602.22769`	AMA-Bench: Evaluating Long-Horizon Memory for Agentic Applications PDF	cs.AI, cs.LG	86	Agent memory benchmark focused on real agent-environment trajectories, not just dialogue	agent evaluation, long-horizon memory, benchmark, trajectories, agentic systems
`2602.22719`	Interpreting and Steering State-Space Models via Activation Subspace Bottlenecks PDF	cs.LG	86	Mechanistic interpretability + test-time steering for Mamba SSMs with sizable benchmark gains.	interpretability, steering, state-space-models, Mamba, mechanistic, control
`2602.22968`	Certified Circuits: Stability Guarantees for Mechanistic Circuits PDF	cs.AI, cs.CV, cs.CY	85	Provable stability guarantees for mechanistic circuit discovery; improves interpretability reliability	mechanistic interpretability, circuits, certification, robustness, theory, OOD stability
`2602.23200`	InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models PDF	cs.LG, cs.CL	85	Hardware-aware KV-cache quantization reducing latency/memory for long-context decoding without accuracy loss.	LLM-efficiency, KV-cache, quantization, long-context, inference, systems
`2602.23193`	ESAA: Event Sourcing for Autonomous Agents in LLM-Based Software Engineering PDF	cs.AI	84	Event-sourcing architecture for LLM agents: separates intent from state mutation for reliability/auditing.	agents, software-engineering, state, orchestration, auditability, reliability
`2602.23136`	Modality Collapse as Mismatched Decoding: Information-Theoretic Limits of Multimodal LLMs PDF	cs.CL, cs.AI, cs.LG	84	Information-theoretic account of modality collapse as mismatched decoding; actionable framing for MLLMs.	multimodal-LLMs, information-theory, decoding, modality-collapse, analysis
`2602.22871`	Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching PDF	cs.CL, cs.AI	84	Step-level PRM scoring and stitching for diffusion LMs; improves test-time scaling beyond voting.	test-time-scaling, diffusion-LM, process-reward-model, reasoning, self-consistency
`2602.22584`	Towards Faithful Industrial RAG: A Reinforced Co-adaptation Framework for Advertising QA PDF	cs.CL	82	Industrial RAG reliability: jointly optimizes retrieval+generation with evidence-constrained RL	RAG, hallucination reduction, faithfulness, reinforcement learning, retrieval optimization, enterprise QA
`2602.22585`	Correcting Human Labels for Rater Effects in AI Evaluation: An Item Response Theory Approach PDF	cs.AI, cs.LG	82	Uses IRT/Rasch to correct rater effects in human evals; improves validity of AI comparisons and RLHF data.	evaluation, human-ratings, psychometrics, RLHF, measurement, bias
`2602.22642`	Compress the Easy, Explore the Hard: Difficulty-Aware Entropy Regularization for Efficient LLM Reasoning PDF	cs.LG	82	Difficulty-aware entropy regularization to compress CoT while preserving exploration on hard problems.	reasoning, CoT, efficiency, entropy-regularization, RL, inference-cost
`2602.23262`	Decomposing Private Image Generation via Coarse-to-Fine Wavelet Modeling PDF	cs.CV, cs.CR	81	DP image generation via wavelet coarse-to-fine; targets privacy-sensitive frequencies to reduce quality loss.	privacy, differential-privacy, image-generation, wavelets, memorization
`2602.22758`	Decomposing Physician Disagreement in HealthBench PDF	cs.AI, stat.AP	81	Finds most HealthBench label variance is irreducible case-level residual; important for eval design.	evaluation, medical-AI, rater-disagreement, uncertainty, benchmarks, reliability
`2602.23258`	AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning PDF	cs.AI, cs.CL	80	Test-time rectify-or-reject pruning to prevent error cascades in multi-agent systems	multi-agent systems, test-time control, error correction, RAG, robustness, information flow
`2602.23079`	Assessing Deanonymization Risks with Stylometry-Assisted LLM Agent PDF	cs.CL, cs.CR, cs.LG	80	Stylometry+LLM agent to assess/mitigate deanonymization risk; relevant to privacy leakage from text.	privacy, deanonymization, stylometry, LLM-agents, security, text
`2602.22546`	Requesting Expert Reasoning: Augmenting LLM Agents with Learned Collaborative Intervention PDF	cs.AI	79	Learned policy to query human experts as a tool; large gains in Minecraft on hard tasks (human-in-loop).	human-in-the-loop, agents, tool-use, collaboration, planning, Minecraft

AI Paper Insight Brief

2026-02-27

0) Executive takeaways (read this first)

Agent safety is shifting “down the stack”: multiple papers show that deployment architecture (edge IoT swarms, tool-return boundaries, KV/memory management) can dominate risk/robustness outcomes, often bypassing prompt/model-level defenses.
Inference-time, training-free interventions are maturing across safety and efficiency: causal counterfactual defenses for indirect prompt injection (ASR reported 0%), policy-grounded debate for zero-shot policy swaps, and sparse weight edits for multilingual safety transfer.
GRPO is becoming a default backbone for both capability and safety/faithfulness tuning (adaptive thinking, agentic RAG reward shaping, industrial RAG faithfulness, human-collaboration modules), with new work focusing on stabilizing gradients and rewards under length/path heterogeneity.
Long-horizon agents are hitting systems bottlenecks (KV cache growth, memory retrieval failures, stochasticity across runs). New benchmarks and mechanisms (AMA-Bench, stochasticity variance metrics, SideQuest) make these failure modes measurable and optimizable.
Evaluation methodology is under active repair: rater-effect modeling (MFRM/IRT) and physician-disagreement decomposition show that raw human labels can reorder system rankings and that much disagreement is case-specific—implying “better judges” may require better task design, not just better models.
Biosecurity uplift evidence is now human-subject, multi-model, long-horizon: novices with LLM access were reported 4.16× more accurate than internet-only novices, and most reported little difficulty overcoming safeguards—raising the priority of realistic uplift evaluations.

2) Key themes (clusters)

Theme: Inference-time safety layers for agents (policy, prompt injection, edge systems)

Why it matters: As agents act through tools and physical systems, the critical failures often occur at boundaries (tool returns, messaging buses, fallback paths) where classic prompt defenses don’t apply or aren’t observable.
Representative papers:
Common approach:
- Move defenses to decision boundaries (tool-return checkpoints; policy-grounded adjudication; MQTT control plane).
- Use structured protocols (retrieval-grounded debate verdicts; causal counterfactual regimes; provenance metadata envelopes).
- Emphasize measurable operational metrics (ASR/UA/FPR; latency; actuation-to-audit delay; egress/sovereignty; failover windows).
Open questions / failure modes:
- Overhead/latency: counterfactual re-executions and multi-agent debate increase inference cost.
- Backbone brittleness: formatting adherence issues can break parsing (CourtGuard); edge heterogeneity complicates enforcement (IoT).
- Trust boundary gaps: MQTT brokers accepting spoof/replay/direct publishes; silent fallback crossing sovereignty boundaries.

Theme: RL (often GRPO) for agentic RAG, faithfulness, and collaboration

Why it matters: Agentic systems need dense learning signals beyond final-answer correctness; industrial deployments also need faithfulness constraints (e.g., URL hallucination) that are operationally testable.
Representative papers:
Common approach:
- Replace sparse outcome rewards with process/path rewards (dual-track step coverage; soft outcome scoring).
- Use GRPO-style RL plus structured formats (planner-first trajectories; tagged interaction protocols).
- Add domain-specific faithfulness constraints (evidence faithfulness; URL validity checks with penalties).
Open questions / failure modes:
- Dependence on LLM evaluators/judges for scoring (reward hacking / evaluator sensitivity).
- Offline artifacts: reference planners and indicator-like resources add pipeline complexity.
- Generalization: training often anchored in specific domains (Minecraft, advertising QA, QA benchmarks).

Theme: Reasoning efficiency without accuracy collapse (adaptive thinking, entropy/length control)

Why it matters: Long CoT is expensive; naive length penalties can collapse exploration or destabilize RL due to extreme length heterogeneity.
Representative papers:
Common approach:
- Instance-adaptive control (think/no-think token; hard/easy entropy scaling; per-question shortest-correct length baselines).
- Stabilize RL with advantage shaping + gradient reweighting under length heterogeneity.
- Shift test-time scaling from “one long trace” to step-level reuse (PRM-scored stitching + AR recomputation).
Open questions / failure modes:
- Scaling beyond small/medium models not established in some work (adaptive thinking evaluated on 1.5B/7B).
- Reliance on PRMs and diversity of sampled traces; shared mistakes limit stitching recovery.
- Difficulty estimation proxies (historical accuracy EMA) may be brittle across domains.

Theme: Long-horizon agent infrastructure: memory, KV cache, stochasticity, and evaluation

Why it matters: As agents run longer, failures become systems failures: memory compression loses causal state, KV cache becomes a serving bottleneck, and stochasticity undermines reliability even at temperature 0 in API settings.
Representative papers:
Common approach:
- Make hidden bottlenecks measurable (peak token utilization, KV reads, total variance over findings/citations, needle protocols).
- Use model-driven or hardware-aligned mechanisms (aux-thread semantic eviction; inner-dimension KV grouping).
- Add structured mitigations (structured outputs; query-intersection ensembling; tool-augmented retrieval over causality graphs).
Open questions / failure modes:
- OOD degradation (SideQuest up to 5% accuracy drop on BrowseComp).
- Memory construction/retrieval losses dominate end-to-end performance (AMA-Bench needle ablations).
- Microbenchmarks vs end-to-end latency (InnerQ reports matmul speedups; broader serving impact not fully shown).

Theme: Evaluation reliability, auditing, and hidden behaviors

Why it matters: Safety and capability claims depend on measurement; rater effects and disagreement ceilings can invert rankings, while auditing tools must be tested against models that actively resist disclosure.
Representative papers:
Common approach:
- Treat evaluation as a measurement problem (MFRM severity/thresholds; variance decomposition; agentic auditing success).
- Stress-test with adversarial targets (implanted hidden behaviors + anti-confession training).
- Report diagnostics, not just scores (rater centrality; tool-to-agent gap; residual disagreement dominance).
Open questions / failure modes:
- Tool-to-agent gap: evidence surfaced by tools may not translate to investigator success.
- Identification/estimability constraints in rater models (policy facet not estimable in MFRM attempt).
- Large residual disagreement suggests limits to “judge model” improvements without better rubrics/context.

Theme: Privacy & dual-use risk in the agent era

Why it matters: Agents plus tools/memory can amplify privacy harms (deanonymization) and dual-use capability uplift; defenses must be evaluated under realistic, long-horizon human use.
Representative papers:
Common approach:
- End-to-end pipelines with search + aggregation + reflection (stylometry agent; uplift study with multi-model access).
- Formal privacy via DP + post-processing (DP only on coarse wavelet tokens; public prior for details).
- Measure not just accuracy but operational risk signals (candidate coverage; mitigation via guided recomposition; participant-reported safeguard friction).
Open questions / failure modes:
- Open-world deanonymization remains low even with DB augmentation (top-3 still modest), but targeted settings improve sharply.
- DP quality gaps persist at strict ε (e.g., ε=1 artifacts; sensitivity to public prior strength).
- Translating in-silico uplift to wet-lab risk remains unresolved.

3) Technical synthesis

GRPO shows up as a unifying optimization primitive across: adaptive thinking (CPAS/LAGR), agentic RAG (Search-P1), industrial faithfulness RL (Advertising QA), and human-collaboration tool-use (AHCE HFM).
A recurring stabilization pattern: when trajectories vary wildly in length/structure, methods add explicit normalization/weighting (LAGR length weights; CPAS advantage offsets; path-centric rewards; difficulty-aware entropy).
“Boundary-centric” agent safety is converging: AgentSentry defends at tool-return boundaries; IoT edge paper highlights MQTT as the command boundary; CourtGuard grounds judgments in retrieved policy text rather than parametric “intuition.”
Retrieval is being treated as a policy-learning problem, not a fixed module: Search-P1 shapes rewards around plan execution and reference step coverage; industrial GraphRAG co-adapts retrieval and generation with RL.
Long-horizon reliability is being operationalized with new metrics: stochasticity via normalized total variance over answers/findings/citations; memory via recall/causal/state-update/abstraction categories; systems security via actuation-to-audit delay and failover blackout windows.
Model-driven systems optimization is expanding beyond “better prompts”: SideQuest uses the model to garbage-collect KV cache; InnerQ aligns quantization grouping with decode-time vector–matrix access patterns.
Evaluation is moving toward “measurement models”: IRT/MFRM adjusts for rater severity/centrality; HealthBench disagreement decomposition shows residual dominates; AuditBench measures end-to-end investigator success rather than tool signal alone.
Safety transfer is increasingly parameter- or inference-time: sparse weight editing for multilingual safety; CourtGuard policy swapping; AgentSentry inference-only counterfactual purification—reducing dependence on large new datasets.
Benchmarks are becoming more agent-realistic: AMA-Bench uses action–observation logs with symbolic artifacts; OmniGAIA requires omni-modal tool use; General Agent Evaluation focuses on protocol-preserving cross-environment comparisons.

4) Top 5 papers (with “why now”)

1) AgentSentry: Mitigating Indirect Prompt Injection in LLM Agents via Temporal Causal Diagnostics and Context Purification

Introduces boundary-anchored counterfactual re-executions (orig/mask/mask_sanitized/orig_sanitized) to estimate causal takeover (ACE/IE/DE).
Reports ASR = 0% across three IPI families and multiple black-box models on AgentDojo, with reported FPR = 0% in tables.
“Why now”: tool-augmented agents are shipping; this is a concrete inference-time layer that aims to continue safely rather than terminate.
Skepticism: overhead scales with re-executions per boundary; evaluation notes benchmarks may under-represent long-horizon delayed takeovers.

2) AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors

Provides 56 target models / 14 hidden behaviors with anti-confession training, enabling systematic auditing evaluation.
Finds scaffolded black-box tools outperform many white-box tools overall; documents a tool-to-agent gap.
“Why now”: auditing is becoming a deployment gate; this gives repeatable targets and end-to-end agent evaluation.
Skepticism: targets are fine-tunes on one base model (Llama 3.3 70B); may be easier to audit than organically emergent behaviors.

3) LLM Novice Uplift on Dual-Use, In Silico Biology Tasks

Human-subject evidence: LLM access yields 4.16× higher novice accuracy (odds ratio) vs internet-only; Treatment improves on 7/8 benchmarks.
Treatment sometimes exceeds expert baselines (e.g., HPCT, VCT) and participants often report little safeguard friction (89.6%).
“Why now”: policy discussions need uplift data under realistic multi-model, long-duration use—not just model-only benchmarks.
Skepticism: confined to in-silico tasks; model availability changed mid-study; not double-blind.

4) SideQuest: Model-Driven KV Cache Management for Long-Horizon Agentic Reasoning

Uses a parallel auxiliary thread to decide which tool outputs are stale and delete their KV entries without polluting main context.
Reports large efficiency gains (peak token utilization −56–65%, KV reads −53–71%) and serving throughput +83.9% on H100 for FRAMES.
“Why now”: deep-research/web agents are KV-bound; this is a practical serving-side lever.
Skepticism: eviction limited to tool outputs (not “thoughts”); some OOD accuracy degradation (BrowseComp).

5) Multilingual Safety Alignment Via Sparse Weight Editing

Training-free sparse neuron editing with a closed-form low-rank update to transfer English safety behavior to other languages.
Introduces MULTI-STRONGREJECT (8 languages, 313 harmful prompts each) and shows unsafe-count reductions across models; composes with MPO.
“Why now”: multilingual jailbreak gaps are a real deployment vulnerability; weight editing is fast to iterate and deploy.
Skepticism: evaluation relies on an automated guard model; datasets are machine-translated (may miss natural LRL jailbreaks).

5) Practical next steps

Add boundary instrumentation to agents: log tool-return boundaries with provenance metadata and run periodic “shadow” counterfactual checks (AgentSentry-style) on high-risk tools/actions.
Treat messaging middleware as part of the safety perimeter in edge/IoT: enforce MQTT authentication/ACLs and replay protection; measure actuation-to-audit delay and failover blackout windows as first-class safety SLOs.
If doing agentic RAG RL, try path-centric rewards (self-consistency + reference step coverage) and soft outcome scoring; explicitly test evaluator sensitivity by swapping judge models.
Reduce long-horizon cost without breaking correctness: implement adaptive thinking control tokens and stabilize RL with length-aware gradient regulation; separately test difficulty-aware entropy regularization to prevent entropy collapse.
Make reliability measurable for research agents: compute run-to-run variance over answers/findings/citations; then apply structured outputs + early query intersection ensembling to reduce stochasticity while tracking accuracy.
For multilingual deployments, run a multilingual harmful-prompt sweep and consider sparse weight edits as a fast patch—while validating with multiple harm classifiers (not just one guard).
Upgrade human evaluation pipelines: model rater severity/centrality (MFRM) and track disagreement decomposition; prioritize collecting “reducible uncertainty” tags or missing-context annotations where disagreement is high.
For auditing programs, evaluate tools end-to-end with an investigator agent (AuditBench-style), not just tool signal; explicitly test hardest target configurations (e.g., TD+KTO) to avoid overfitting to easy-to-audit organisms.

Generated from per-paper analyses; no external browsing.

Di Tang

AI Paper Insight Brief

2026-02-27

0) Executive takeaways (read this first)

2) Key themes (clusters)

Theme: Inference-time safety layers for agents (policy, prompt injection, edge systems)

Theme: RL (often GRPO) for agentic RAG, faithfulness, and collaboration

Theme: Reasoning efficiency without accuracy collapse (adaptive thinking, entropy/length control)

Theme: Long-horizon agent infrastructure: memory, KV cache, stochasticity, and evaluation

Theme: Evaluation reliability, auditing, and hidden behaviors

Theme: Privacy & dual-use risk in the agent era

3) Technical synthesis

4) Top 5 papers (with “why now”)

5) Practical next steps