Daily AI Paper Report (2026-02-27)

Published:

Chinese version: [中文]

Run stats

  • Candidates: 262
  • Selected: 30
  • Deepread completed: 30
  • Window (UTC): 2026-02-26T01:00:00Z → 2026-02-27T01:00:00Z (arxiv_announce, expanded=0)
Show selected papers
arXiv IDTitle / LinksCategoriesScoreWhyTags
2602.22755AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors
PDF
cs.CL96Benchmark of hidden misalignment behaviors + agentic auditing; strong for eval & oversight researchalignment auditing, benchmark, hidden behaviors, model evaluation, agent tools, deception
2602.23329LLM Novice Uplift on Dual-Use, In Silico Biology Tasks
PDF
cs.AI, cs.CL, cs.CR, cs.CY, cs.HC96Careful human uplift study on bio dual-use tasks; quantifies novice capability jump with LLM access.dual-use, biosecurity, human-uplift, evaluation, risk-assessment, LLMs
2602.22724AgentSentry: Mitigating Indirect Prompt Injection in LLM Agents via Temporal Causal Diagnostics and Context Purification
PDF
cs.CR, cs.AI94Directly targets indirect prompt injection in agents with trajectory-aware detection/mitigationagent security, prompt injection, tool outputs, inference-time defense, causal diagnostics, context sanitization
2602.22525Systems-Level Attack Surface of Edge Agent Deployments on IoT
PDF
cs.CR94Empirical security analysis of edge LLM agents; defines measurable system security metrics + failures.agent-security, edge-agents, IoT, attack-surface, systems-security, provenance, MQTT
2602.22557CourtGuard: A Model-Agnostic Framework for Zero-Shot Policy Adaptation in LLM Safety
PDF
cs.AI, cs.LG92Zero-shot safety policy adaptation via RAG + adversarial debate grounded in policy docsLLM safety, policy compliance, RAG, multi-agent debate, governance, zero-shot
2602.22787Probing for Knowledge Attribution in Large Language Models
PDF
cs.CL, cs.AI92Probe predicts whether outputs rely on prompt vs internal knowledge; useful for hallucination mitigation.hallucinations, attribution, interpretability, faithfulness, factuality, probing
2602.22953General Agent Evaluation
PDF
cs.AI92Proposes unified protocol + framework for general-agent evaluation; addresses benchmark integration bias.agent-evaluation, benchmarks, protocols, framework, general-agents, measurement
2602.22603SideQuest: Model-Driven KV Cache Management for Long-Horizon Agentic Reasoning
PDF
cs.AI, cs.LG92LRM-driven KV cache compression for long-horizon agents; targets real bottleneck in agentic reasoning.agents, long-context, KV-cache, memory, efficiency, reasoning
2602.22554Multilingual Safety Alignment Via Sparse Weight Editing
PDF
cs.LG90Training-free multilingual safety alignment via sparse weight editing; addresses cross-lingual guardrail gapsmultilingual safety, weight editing, safety neurons, alignment, low-resource languages, robustness
2602.22576Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training
PDF
cs.CL, cs.IR, cs.LG89Reward shaping for agentic RAG RL; extracts signal from failed trajectories to improve sample efficiency.agentic-RAG, reinforcement-learning, reward-shaping, retrieval, training, reasoning
2602.23271Evaluating Stochasticity in Deep Research Agents
PDF
cs.AI89Formalizes and measures stochasticity/variance in research agents; identifies sources via MDP framing.agents, evaluation, stochasticity, reliability, deep-research, variance
2602.22556Stable Adaptive Thinking via Advantage Shaping and Length-Aware Gradient Regulation
PDF
cs.LG, cs.AI, cs.CL89Adaptive thinking RL to curb overthinking while preserving hard-query reasoning; practical reliability gain.reasoning, RL, efficiency, overthinking, post-training, LRM
2602.22775TherapyProbe: Generating Design Knowledge for Relational Safety in Mental Health Chatbots Through Adversarial Simulation
PDF
cs.HC, cs.AI, cs.CL88Multi-agent adversarial simulation to surface long-horizon relational safety failures in therapy chatbotsconversational safety, mental health, multi-turn evaluation, adversarial simulation, relational harms, red teaming
2602.22675Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization
PDF
cs.CL87Agentic search framework emphasizing parallel evidence gathering to cut cost and improve generalization.agents, search, efficiency, long-horizon, generalization, deep-research
2602.22897OmniGAIA: Towards Native Omni-Modal AI Agents
PDF
cs.AI, cs.CL, cs.CV, cs.LG, cs.MM87Omni-modal agent benchmark (audio+video+image+tools) with multi-hop queries; useful for capability eval.multimodal, agents, benchmark, tool-use, evaluation, audio, video
2602.22769AMA-Bench: Evaluating Long-Horizon Memory for Agentic Applications
PDF
cs.AI, cs.LG86Agent memory benchmark focused on real agent-environment trajectories, not just dialogueagent evaluation, long-horizon memory, benchmark, trajectories, agentic systems
2602.22719Interpreting and Steering State-Space Models via Activation Subspace Bottlenecks
PDF
cs.LG86Mechanistic interpretability + test-time steering for Mamba SSMs with sizable benchmark gains.interpretability, steering, state-space-models, Mamba, mechanistic, control
2602.22968Certified Circuits: Stability Guarantees for Mechanistic Circuits
PDF
cs.AI, cs.CV, cs.CY85Provable stability guarantees for mechanistic circuit discovery; improves interpretability reliabilitymechanistic interpretability, circuits, certification, robustness, theory, OOD stability
2602.23200InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models
PDF
cs.LG, cs.CL85Hardware-aware KV-cache quantization reducing latency/memory for long-context decoding without accuracy loss.LLM-efficiency, KV-cache, quantization, long-context, inference, systems
2602.23193ESAA: Event Sourcing for Autonomous Agents in LLM-Based Software Engineering
PDF
cs.AI84Event-sourcing architecture for LLM agents: separates intent from state mutation for reliability/auditing.agents, software-engineering, state, orchestration, auditability, reliability
2602.23136Modality Collapse as Mismatched Decoding: Information-Theoretic Limits of Multimodal LLMs
PDF
cs.CL, cs.AI, cs.LG84Information-theoretic account of modality collapse as mismatched decoding; actionable framing for MLLMs.multimodal-LLMs, information-theory, decoding, modality-collapse, analysis
2602.22871Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching
PDF
cs.CL, cs.AI84Step-level PRM scoring and stitching for diffusion LMs; improves test-time scaling beyond voting.test-time-scaling, diffusion-LM, process-reward-model, reasoning, self-consistency
2602.22584Towards Faithful Industrial RAG: A Reinforced Co-adaptation Framework for Advertising QA
PDF
cs.CL82Industrial RAG reliability: jointly optimizes retrieval+generation with evidence-constrained RLRAG, hallucination reduction, faithfulness, reinforcement learning, retrieval optimization, enterprise QA
2602.22585Correcting Human Labels for Rater Effects in AI Evaluation: An Item Response Theory Approach
PDF
cs.AI, cs.LG82Uses IRT/Rasch to correct rater effects in human evals; improves validity of AI comparisons and RLHF data.evaluation, human-ratings, psychometrics, RLHF, measurement, bias
2602.22642Compress the Easy, Explore the Hard: Difficulty-Aware Entropy Regularization for Efficient LLM Reasoning
PDF
cs.LG82Difficulty-aware entropy regularization to compress CoT while preserving exploration on hard problems.reasoning, CoT, efficiency, entropy-regularization, RL, inference-cost
2602.23262Decomposing Private Image Generation via Coarse-to-Fine Wavelet Modeling
PDF
cs.CV, cs.CR81DP image generation via wavelet coarse-to-fine; targets privacy-sensitive frequencies to reduce quality loss.privacy, differential-privacy, image-generation, wavelets, memorization
2602.22758Decomposing Physician Disagreement in HealthBench
PDF
cs.AI, stat.AP81Finds most HealthBench label variance is irreducible case-level residual; important for eval design.evaluation, medical-AI, rater-disagreement, uncertainty, benchmarks, reliability
2602.23258AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning
PDF
cs.AI, cs.CL80Test-time rectify-or-reject pruning to prevent error cascades in multi-agent systemsmulti-agent systems, test-time control, error correction, RAG, robustness, information flow
2602.23079Assessing Deanonymization Risks with Stylometry-Assisted LLM Agent
PDF
cs.CL, cs.CR, cs.LG80Stylometry+LLM agent to assess/mitigate deanonymization risk; relevant to privacy leakage from text.privacy, deanonymization, stylometry, LLM-agents, security, text
2602.22546Requesting Expert Reasoning: Augmenting LLM Agents with Learned Collaborative Intervention
PDF
cs.AI79Learned policy to query human experts as a tool; large gains in Minecraft on hard tasks (human-in-loop).human-in-the-loop, agents, tool-use, collaboration, planning, Minecraft

AI Paper Insight Brief

2026-02-27

0) Executive takeaways (read this first)

  • Agent safety is shifting “down the stack”: multiple papers show that deployment architecture (edge IoT swarms, tool-return boundaries, KV/memory management) can dominate risk/robustness outcomes, often bypassing prompt/model-level defenses.
  • Inference-time, training-free interventions are maturing across safety and efficiency: causal counterfactual defenses for indirect prompt injection (ASR reported 0%), policy-grounded debate for zero-shot policy swaps, and sparse weight edits for multilingual safety transfer.
  • GRPO is becoming a default backbone for both capability and safety/faithfulness tuning (adaptive thinking, agentic RAG reward shaping, industrial RAG faithfulness, human-collaboration modules), with new work focusing on stabilizing gradients and rewards under length/path heterogeneity.
  • Long-horizon agents are hitting systems bottlenecks (KV cache growth, memory retrieval failures, stochasticity across runs). New benchmarks and mechanisms (AMA-Bench, stochasticity variance metrics, SideQuest) make these failure modes measurable and optimizable.
  • Evaluation methodology is under active repair: rater-effect modeling (MFRM/IRT) and physician-disagreement decomposition show that raw human labels can reorder system rankings and that much disagreement is case-specific—implying “better judges” may require better task design, not just better models.
  • Biosecurity uplift evidence is now human-subject, multi-model, long-horizon: novices with LLM access were reported 4.16× more accurate than internet-only novices, and most reported little difficulty overcoming safeguards—raising the priority of realistic uplift evaluations.

2) Key themes (clusters)

Theme: Inference-time safety layers for agents (policy, prompt injection, edge systems)

Theme: RL (often GRPO) for agentic RAG, faithfulness, and collaboration

Theme: Reasoning efficiency without accuracy collapse (adaptive thinking, entropy/length control)

Theme: Long-horizon agent infrastructure: memory, KV cache, stochasticity, and evaluation

Theme: Evaluation reliability, auditing, and hidden behaviors

Theme: Privacy & dual-use risk in the agent era

  • Why it matters: Agents plus tools/memory can amplify privacy harms (deanonymization) and dual-use capability uplift; defenses must be evaluated under realistic, long-horizon human use.
  • Representative papers:
  • Common approach:
    • End-to-end pipelines with search + aggregation + reflection (stylometry agent; uplift study with multi-model access).
    • Formal privacy via DP + post-processing (DP only on coarse wavelet tokens; public prior for details).
    • Measure not just accuracy but operational risk signals (candidate coverage; mitigation via guided recomposition; participant-reported safeguard friction).
  • Open questions / failure modes:
    • Open-world deanonymization remains low even with DB augmentation (top-3 still modest), but targeted settings improve sharply.
    • DP quality gaps persist at strict ε (e.g., ε=1 artifacts; sensitivity to public prior strength).
    • Translating in-silico uplift to wet-lab risk remains unresolved.

3) Technical synthesis

  • GRPO shows up as a unifying optimization primitive across: adaptive thinking (CPAS/LAGR), agentic RAG (Search-P1), industrial faithfulness RL (Advertising QA), and human-collaboration tool-use (AHCE HFM).
  • A recurring stabilization pattern: when trajectories vary wildly in length/structure, methods add explicit normalization/weighting (LAGR length weights; CPAS advantage offsets; path-centric rewards; difficulty-aware entropy).
  • “Boundary-centric” agent safety is converging: AgentSentry defends at tool-return boundaries; IoT edge paper highlights MQTT as the command boundary; CourtGuard grounds judgments in retrieved policy text rather than parametric “intuition.”
  • Retrieval is being treated as a policy-learning problem, not a fixed module: Search-P1 shapes rewards around plan execution and reference step coverage; industrial GraphRAG co-adapts retrieval and generation with RL.
  • Long-horizon reliability is being operationalized with new metrics: stochasticity via normalized total variance over answers/findings/citations; memory via recall/causal/state-update/abstraction categories; systems security via actuation-to-audit delay and failover blackout windows.
  • Model-driven systems optimization is expanding beyond “better prompts”: SideQuest uses the model to garbage-collect KV cache; InnerQ aligns quantization grouping with decode-time vector–matrix access patterns.
  • Evaluation is moving toward “measurement models”: IRT/MFRM adjusts for rater severity/centrality; HealthBench disagreement decomposition shows residual dominates; AuditBench measures end-to-end investigator success rather than tool signal alone.
  • Safety transfer is increasingly parameter- or inference-time: sparse weight editing for multilingual safety; CourtGuard policy swapping; AgentSentry inference-only counterfactual purification—reducing dependence on large new datasets.
  • Benchmarks are becoming more agent-realistic: AMA-Bench uses action–observation logs with symbolic artifacts; OmniGAIA requires omni-modal tool use; General Agent Evaluation focuses on protocol-preserving cross-environment comparisons.

4) Top 5 papers (with “why now”)

1) AgentSentry: Mitigating Indirect Prompt Injection in LLM Agents via Temporal Causal Diagnostics and Context Purification

  • Introduces boundary-anchored counterfactual re-executions (orig/mask/mask_sanitized/orig_sanitized) to estimate causal takeover (ACE/IE/DE).
  • Reports ASR = 0% across three IPI families and multiple black-box models on AgentDojo, with reported FPR = 0% in tables.
  • “Why now”: tool-augmented agents are shipping; this is a concrete inference-time layer that aims to continue safely rather than terminate.
  • Skepticism: overhead scales with re-executions per boundary; evaluation notes benchmarks may under-represent long-horizon delayed takeovers.

2) AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors

  • Provides 56 target models / 14 hidden behaviors with anti-confession training, enabling systematic auditing evaluation.
  • Finds scaffolded black-box tools outperform many white-box tools overall; documents a tool-to-agent gap.
  • “Why now”: auditing is becoming a deployment gate; this gives repeatable targets and end-to-end agent evaluation.
  • Skepticism: targets are fine-tunes on one base model (Llama 3.3 70B); may be easier to audit than organically emergent behaviors.

3) LLM Novice Uplift on Dual-Use, In Silico Biology Tasks

  • Human-subject evidence: LLM access yields 4.16× higher novice accuracy (odds ratio) vs internet-only; Treatment improves on 7/8 benchmarks.
  • Treatment sometimes exceeds expert baselines (e.g., HPCT, VCT) and participants often report little safeguard friction (89.6%).
  • “Why now”: policy discussions need uplift data under realistic multi-model, long-duration use—not just model-only benchmarks.
  • Skepticism: confined to in-silico tasks; model availability changed mid-study; not double-blind.

4) SideQuest: Model-Driven KV Cache Management for Long-Horizon Agentic Reasoning

  • Uses a parallel auxiliary thread to decide which tool outputs are stale and delete their KV entries without polluting main context.
  • Reports large efficiency gains (peak token utilization −56–65%, KV reads −53–71%) and serving throughput +83.9% on H100 for FRAMES.
  • “Why now”: deep-research/web agents are KV-bound; this is a practical serving-side lever.
  • Skepticism: eviction limited to tool outputs (not “thoughts”); some OOD accuracy degradation (BrowseComp).

5) Multilingual Safety Alignment Via Sparse Weight Editing

  • Training-free sparse neuron editing with a closed-form low-rank update to transfer English safety behavior to other languages.
  • Introduces MULTI-STRONGREJECT (8 languages, 313 harmful prompts each) and shows unsafe-count reductions across models; composes with MPO.
  • “Why now”: multilingual jailbreak gaps are a real deployment vulnerability; weight editing is fast to iterate and deploy.
  • Skepticism: evaluation relies on an automated guard model; datasets are machine-translated (may miss natural LRL jailbreaks).

5) Practical next steps

  • Add boundary instrumentation to agents: log tool-return boundaries with provenance metadata and run periodic “shadow” counterfactual checks (AgentSentry-style) on high-risk tools/actions.
  • Treat messaging middleware as part of the safety perimeter in edge/IoT: enforce MQTT authentication/ACLs and replay protection; measure actuation-to-audit delay and failover blackout windows as first-class safety SLOs.
  • If doing agentic RAG RL, try path-centric rewards (self-consistency + reference step coverage) and soft outcome scoring; explicitly test evaluator sensitivity by swapping judge models.
  • Reduce long-horizon cost without breaking correctness: implement adaptive thinking control tokens and stabilize RL with length-aware gradient regulation; separately test difficulty-aware entropy regularization to prevent entropy collapse.
  • Make reliability measurable for research agents: compute run-to-run variance over answers/findings/citations; then apply structured outputs + early query intersection ensembling to reduce stochasticity while tracking accuracy.
  • For multilingual deployments, run a multilingual harmful-prompt sweep and consider sparse weight edits as a fast patch—while validating with multiple harm classifiers (not just one guard).
  • Upgrade human evaluation pipelines: model rater severity/centrality (MFRM) and track disagreement decomposition; prioritize collecting “reducible uncertainty” tags or missing-context annotations where disagreement is high.
  • For auditing programs, evaluate tools end-to-end with an investigator agent (AuditBench-style), not just tool signal; explicitly test hardest target configurations (e.g., TD+KTO) to avoid overfitting to easy-to-audit organisms.

Generated from per-paper analyses; no external browsing.