Daily AI Paper Report (2026-03-13)
Run stats
- Candidates: 249
- Selected: 30
- Deepread completed: 30
- Window (UTC): 2026-03-12T00:00:00Z → 2026-03-13T00:00:00Z (arxiv_announce, expanded=0)
1) Selected papers
| arXiv ID | Title / Links | Categories | Score | Why | Tags |
|---|---|---|---|---|---|
| 2603.11853 | OpenClaw PRISM: A Zero-Fork, Defense-in-Depth Runtime Security Layer for Tool-Augmented LLM Agents | cs.CR | 94 | Practical defense-in-depth runtime layer for tool agents; lifecycle hooks + risk accumulation. | agent-security, tool-use, prompt-injection, runtime-guardrails, policy-enforcement, monitoring |
| 2603.11862 | You Told Me to Do It: Measuring Instructional Text-induced Private Data Leakage in LLM Agents | cs.CR, cs.AI | 93 | Measures doc-instruction induced leakage in high-privilege agents; frames structural vulnerability. | agents, prompt-injection, data-exfiltration, evaluation, trusted-executor-dilemma, privacy |
| 2603.11875 | The Mirror Design Pattern: Strict Data Geometry over Model Scale for Prompt Injection Detection | cs.CR, cs.AI | 93 | Fast, auditable prompt-injection detector via strict data geometry; strong practical security angle | prompt-injection, LLM-security, dataset-curation, robust-detection, SVM, auditing |
| 2603.11481 | INFACT: A Diagnostic Benchmark for Induced Faithfulness and Factuality Hallucinations in Video-LLMs | cs.CV, cs.AI | 92 | Diagnostic benchmark for Video-LLM hallucinations incl. induced corruptions; strong reliability eval | video-llm, hallucination, factuality, faithfulness, benchmark, robustness, evaluation |
| 2603.12011 | Can RL Improve Generalization of LLM Agents? An Empirical Study | cs.AI | 92 | Systematic study of RL fine-tuning generalization across env shifts, transfer, and forgetting for agents | LLM agents, reinforcement fine-tuning, generalization, distribution shift, transfer, forgetting, evaluation |
| 2603.11619 | Taming OpenClaw: Security Analysis and Mitigation of Autonomous LLM Agent Threats | cs.CR, cs.AI | 90 | Comprehensive OpenClaw threat analysis + mitigations across lifecycle (memory/tool/supply chain). | agent-security, threat-modeling, prompt-injection, memory-poisoning, supply-chain, mitigations |
| 2603.11914 | Understanding LLM Behavior When Encountering User-Supplied Harmful Content in Harmless Tasks | cs.CR, cs.AI | 90 | Evaluates content-level refusal when harmful text appears inside benign tasks; new harmful-content dataset | safety, harmful-content, refusal, policy-compliance, evaluation, dataset |
| 2603.12109 | On Information Self-Locking in Reinforcement Learning for Active Reasoning of LLM agents | cs.AI | 90 | Identifies RL failure mode in active reasoning (info self-locking) via action selection vs belief tracking | LLM agents, active reasoning, RL, failure modes, belief tracking, question asking, agent reliability |
| 2603.11501 | KEPo: Knowledge Evolution Poison on Graph-based Retrieval-Augmented Generation | cs.LG, cs.AI, cs.CR | 89 | Poisoning attack tailored to GraphRAG; exposes new RAG attack surface beyond vanilla RAG. | RAG, GraphRAG, data-poisoning, security, knowledge-graphs, robustness |
| 2603.11394 | Stop Listening to Me! How Multi-turn Conversations Can Degrade Diagnostic Reasoning | cs.CL, cs.AI, cs.LG | 89 | Shows multi-turn chat can degrade clinical reasoning; introduces stick-or-switch for conviction/flex | healthcare, multi-turn, robustness, evaluation, reliability, safety, diagnostic-reasoning |
| 2603.11987 | LABSHIELD: A Multimodal Benchmark for Safety-Critical Reasoning and Planning in Scientific Laboratories | cs.AI | 88 | New multimodal lab-safety benchmark for hazard ID and safety-critical planning in labs. | benchmark, multimodal, agent-safety, planning, hazard-detection, evaluation |
| 2603.12152 | LifeSim: Long-Horizon User Life Simulator for Personalized Assistant Evaluation | cs.CL | 88 | Long-horizon user simulator + benchmark for personalized assistants; closer to real deployment dynamics | agents, evaluation, personalization, user-simulation, long-horizon, benchmark |
| 2603.11495 | Try, Check and Retry: A Divide-and-Conquer Framework for Boosting Long-context Tool-Calling Performance of LLMs | cs.CL | 88 | Divide-and-conquer Try/Check/Retry boosts long-context tool selection among many noisy tools | tool calling, agents, long context, self-reflection, robustness, planning, inference |
| 2603.12237 | STAMP: Selective Task-Aware Mechanism for Text Privacy | cs.LG, cs.CR, cs.IT | 87 | Token-level task-aware text privatization with selective budgets + new embedding perturbation method | privacy, differential-privacy, text, token-level, security, utility-privacy-tradeoff |
| 2603.11975 | HomeSafe-Bench: Evaluating Vision-Language Models on Unsafe Action Detection for Embodied Agents in Household Scenarios | cs.CV, cs.AI, cs.CR | 86 | Benchmark for unsafe action detection in household embodied settings; dynamic, video-based eval. | benchmark, VLM, embodied-agents, safety-eval, unsafe-actions, household-robots |
| 2603.12149 | Linking Perception, Confidence and Accuracy in MLLMs | cs.CV, cs.CL | 86 | Targets MLLM confidence miscalibration with RL + test-time scaling; reliability-critical for deployment | calibration, multimodal, uncertainty, RL, test-time-scaling, reliability |
| 2603.11513 | Can Small Language Models Use What They Retrieve? An Empirical Study of Retrieval Utilization Across Model Scale | cs.CL | 86 | Measures whether small LMs actually use retrieved evidence; separates retrieval quality vs utilization failure | RAG, retrieval utilization, small language models, factuality, evaluation, oracle retrieval, scaling |
| 2603.12246 | Examining Reasoning LLMs-as-Judges in Non-Verifiable LLM Post-Training | cs.AI, cs.CL, cs.LG | 85 | Tests reasoning LLM-judges in RL alignment for non-verifiable tasks; focuses on training impact. | alignment, RLHF, LLM-as-judge, preference-learning, evaluation, post-training |
| 2603.11388 | Deactivating Refusal Triggers: Understanding and Mitigating Overrefusal in Safety Alignment | cs.AI | 84 | Analyzes overrefusal via refusal triggers; proposes mitigation to improve safety/usability tradeoff. | safety-alignment, overrefusal, refusal, post-training, reliability |
| 2603.12133 | TopoBench: Benchmarking LLMs on Hard Topological Reasoning | cs.AI, cs.CL | 84 | Hard topological reasoning benchmark + error taxonomy from CoT traces; useful diagnostic eval | reasoning, benchmark, spatial-reasoning, topology, error-analysis, LLM-eval |
| 2603.11445 | Verified Multi-Agent Orchestration: A Plan-Execute-Verify-Replan Framework for Complex Query Resolution | cs.AI, cs.MA | 84 | Plan-execute-verify-replan multi-agent orchestration with verifier-driven replanning and stop rules | agents, orchestration, verification, planning, multi-agent, reliability, evaluation |
| 2603.11749 | Compression Favors Consistency, Not Truth: When and Why Language Models Prefer Correct Information | cs.CL, cs.AI | 84 | Compression-consistency principle explains when LMs prefer correct info despite mixed-quality training data | theory, truthfulness, hallucinations, generalization, compression, synthetic data, transformers |
| 2603.12180 | Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections | cs.CL, cs.AI | 83 | MADQA benchmark probes agent search vs strategy over PDFs; adds accuracy-effort evaluation. | benchmark, agents, document-QA, evaluation, tool-use, search |
| 2603.11442 | GPT4o-Receipt: A Dataset and Human Study for AI-Generated Document Forensics | cs.AI, cs.CV | 83 | Document forensics dataset + human study; highlights arithmetic-error signals vs visual artifacts | forensics, synthetic-media, multimodal, evaluation, security, datasets, fraud-detection |
| 2603.11949 | Delayed Backdoor Attacks: Exploring the Temporal Dimension as a New Attack Surface in Pre-Trained Models | cs.CR, cs.AI | 82 | Introduces delayed backdoors (temporal triggers), enabling common-word triggers; new threat model. | backdoors, model-security, trojans, threat-model, pretrained-models, temporal-attacks |
| 2603.12165 | QAQ: Bidirectional Semantic Coherence for Selecting High-Quality Synthetic Code Instructions | cs.CL | 82 | Synthetic code data filtering via bidirectional coherence (Q|A); addresses hallucinated instructions | synthetic-data, data-selection, code, hallucinations, training-data-quality, mutual-information |
| 2603.11564 | Where Matters More Than What: Decoding-aligned KV Cache Compression via Position-aware Pseudo Queries | cs.CL | 82 | KV-cache compression aligned to decoding via position-aware pseudo queries; targets long-context cost | llm-inference, kv-cache, compression, long-context, efficiency, decoding |
| 2603.11611 | Fractional Rotation, Full Potential? Investigating Performance and Convergence of Partial RoPE | cs.LG, cs.CL | 82 | Partial RoPE study shows up to 10x KV-cache memory savings with similar loss; relevant for long-context LMs | transformers, RoPE, long context, efficiency, KV cache, positional embeddings, scaling |
| 2603.11772 | Legal-DC: Benchmarking Retrieval-Augmented Generation for Legal Documents | cs.CL | 80 | Chinese legal RAG benchmark with clause-level refs + framework for structured legal provisions | rag, legal, benchmark, grounding, retrieval, citations, chinese |
| 2603.11991 | BTZSC: A Benchmark for Zero-Shot Text Classification Across Cross-Encoders, Embedding Models, Rerankers and LLMs | cs.CL, cs.AI, cs.LG, stat.ML | 79 | Comprehensive genuine zero-shot text classification benchmark across model families; fills eval gap | zero-shot, text-classification, benchmark, embeddings, rerankers, LLMs |
AI Paper Insight Brief
2026-03-13
0) Executive takeaways (read this first)
- “Safety” failures are increasingly about format and lifecycle, not just content: multi-turn interaction (clinical “conversation tax”), agent lifecycle attacks (OpenClaw), and documentation-induced exfiltration (ReadSecBench) show that sequential context and tool/runtime boundaries dominate risk.
- Overrefusal has a concrete mechanism and a cheap fix: “refusal triggers” (benign cues correlated with harmful training samples) explain benign refusals; trigger-matched benign supervision reduces refusal rates even with far fewer benign samples than generic corpora.
- RAG/GraphRAG is not automatically safer—structure creates new attack surfaces: GraphRAG can be steered by temporally coherent “knowledge evolution” poisoning (KEPo), while small LMs often fail to use even oracle retrieval and can have their known answers destroyed by added context.
- Robust evaluation is shifting from accuracy to reliability under perturbation: INFACT (video) uses perturbation modes + reliability metrics (Resist Rate, Temporal Sensitivity), and TopoBench uses causal error injections to identify which reasoning failures actually matter.
- Operational security is moving “left” into fast, non-promptable gates and runtime enforcement: Mirror shows a curated-data + linear SVM L1 detector can beat a semantic model on recall/latency; PRISM adds lifecycle hooks + auditability but highlights latency costs when scanners are invoked.
- Compute/memory efficiency work is getting more “mechanism-aligned”: DapQ aligns KV eviction to decoding positions via pseudo-queries; the Partial RoPE study suggests rotating only a fraction (≥10%) of dimensions can match full-RoPE convergence while cutting RoPE-cache memory substantially.
2) Key themes (clusters)
Theme: Alignment trade-offs & refusal behavior
- Why it matters: Safety tuning can degrade utility via overrefusal or by mishandling harmful content embedded inside benign tasks; both are deployment-critical because they affect everyday user workflows.
- Representative papers:
- Deactivating Refusal Triggers: Understanding and Mitigating Overrefusal in Safety Alignment
- Understanding LLM Behavior When Encountering User-Supplied Harmful Content in Harmless Tasks
- Examining Reasoning LLMs-as-Judges in Non-Verifiable LLM Post-Training
- Common approach:
- Define a specific failure mode (refusal triggers; in-content harm risk; reward hacking under judges) and build targeted metrics/datasets.
- Use controlled ablations (trigger similarity levels; task types like translation/summarization; judge effort/distillation).
- Evaluate trade-offs explicitly (ASR vs refusal; harmful-response rates vs task completion; judge agreement vs adversarial policy learning).
- Open questions / failure modes:
- Reliance on automated detectors/judges (keyword RR, moderation APIs, LLM judges) can miss nuanced borderline cases or be gamed.
- Mitigations may shift failure modes (e.g., reducing overrefusal without weakening jailbreak resistance; judge-trained policies learning adversarial “policy-quoting” patterns).
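The keyword-based refusal-rate (RR) metric these evaluations lean on can be approximated in a few lines; a minimal sketch, assuming a hypothetical `REFUSAL_MARKERS` list (real evaluations use richer rule sets, moderation APIs, or LLM judges, which is exactly the limitation flagged above):

```python
# Minimal keyword-based refusal-rate metric; the marker list is illustrative.
REFUSAL_MARKERS = (
    "i cannot", "i can't", "i'm sorry", "i am sorry",
    "as an ai", "i won't", "i will not",
)

def is_refusal(response: str) -> bool:
    """Flag a response as a refusal if it contains a known marker phrase."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def refusal_rate(responses: list[str]) -> float:
    """Fraction of benign-task responses flagged as refusals (overrefusal proxy)."""
    if not responses:
        return 0.0
    return sum(is_refusal(r) for r in responses) / len(responses)
```

Keyword RR is cheap and deterministic but, as noted, misses nuanced borderline cases; it is best treated as a first-pass signal before human or LLM-judge review.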
Theme: Agent security is lifecycle security (tools, memory, docs, runtime)
- Why it matters: Tool-augmented agents turn prompt injection into system compromise via persistent memory, plugins, filesystem/network privileges, and multi-stage execution.
- Representative papers:
- Taming OpenClaw: Security Analysis and Mitigation of Autonomous LLM Agent Threats
- OpenClaw PRISM: A Zero-Fork, Defense-in-Depth Runtime Security Layer for Tool-Augmented LLM Agents
- You Told Me to Do It: Measuring Instructional Text-induced Private Data Leakage in LLM Agents
- The Mirror Design Pattern: Strict Data Geometry over Model Scale for Prompt Injection Detection
- Common approach:
- Model the agent as a multi-stage system (init → input → inference → decision → execution; or hook-based runtime interception).
- Combine lightweight always-on screening with escalations (heuristics → scanner/LLM; L1 SVM → later layers).
- Emphasize operational properties: non-promptable gates, audit logs, hot-reloadable policy, allow/deny lists, DLP, protected paths.
- Open questions / failure modes:
- Latency/UX costs: scanner-backed defenses can push p95 into seconds (PRISM).
- Semantic attacks that look like “normal docs” remain hard: README injections reached up to 85% ASR and humans detected 0% in a small study.
- Portability/generalization: OpenClaw-specific hooks and policies may not transfer cleanly to other agent frameworks.
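The “lightweight always-on screening with escalations” pattern can be sketched as a pre-tool-call hook with risk accumulation and an audit log. Everything here (`DENY_TOOLS`, `PROTECTED_PREFIXES`, the risk threshold, the `slow_scan` escalation) is an illustrative assumption, not PRISM's actual policy:

```python
# Toy pre-tool-call gate: fast deterministic checks first, slow scan only on
# escalation, every decision logged for audit.
from dataclasses import dataclass, field

DENY_TOOLS = {"shell.exec", "net.raw_socket"}     # hard deny list (assumed)
PROTECTED_PREFIXES = ("/etc/", "~/.ssh/")          # protected paths (assumed)

@dataclass
class GateState:
    risk: float = 0.0                # risk accumulated over the session
    audit: list = field(default_factory=list)

def pre_tool_call(state: GateState, tool: str, arg: str,
                  slow_scan=lambda a: 0.0) -> bool:
    """Return True to allow the call; append every decision to the audit log."""
    if tool in DENY_TOOLS:
        state.audit.append(("deny", tool, arg))
        return False
    if any(arg.startswith(p) for p in PROTECTED_PREFIXES):
        state.risk += 1.0            # protected-path touch raises session risk
    else:
        state.risk += slow_scan(arg)  # escalate: semantic scan on the hot path
    allowed = state.risk < 2.0       # session-level risk budget (assumed)
    state.audit.append(("allow" if allowed else "block", tool, arg))
    return allowed
```

The design point matches the papers' emphasis: the deny list and path checks are non-promptable and cheap, while the `slow_scan` escalation is where the p95 latency cost appears.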
Theme: Retrieval & knowledge systems—utilization failures and poisoning
- Why it matters: Retrieval is often assumed to improve truthfulness, but small models may not use it, and structured retrieval (GraphRAG) introduces new poisoning vectors.
- Representative papers:
- Can Small Language Models Use What They Retrieve? An Empirical Study of Retrieval Utilization Across Model Scale
- KEPo: Knowledge Evolution Poison on Graph-based Retrieval-Augmented Generation
- Legal-DC: Benchmarking Retrieval-Augmented Generation for Legal Documents
- Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections
- Common approach:
- Separate retrieval quality from utilization (oracle retrieval; page/doc attribution metrics; oracle retriever upper bounds).
- Stress realistic structure: clause-level legal references; PDF collections with visual/layout evidence; KG community retrieval in GraphRAG.
- Evaluate both correctness and grounding (Page F1/Doc F1; clause references; ASR/CASR for poisoning).
- Open questions / failure modes:
- Context can harm: for sub-7B models, oracle context often fails to help UNKNOWN questions and can destroy KNOWN answers.
- Graph-structured systems can be steered by coherence tricks (temporal “evolution” narratives) and standard defenses may not reduce ASR meaningfully.
- Evaluation contamination/guessability remains a concern in public-doc benchmarks (MADQA estimates memorization signal).
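Separating retrieval quality from utilization reduces to simple bookkeeping once you have per-question closed-book and with-oracle outcomes; a sketch under an assumed record format (not the paper's actual schema):

```python
# Oracle-retrieval diagnostic: split questions into KNOWN (correct without
# context) and UNKNOWN, then measure how oracle context changes outcomes.
def utilization_report(records):
    """records: iterable of (correct_closed_book, correct_with_oracle) pairs."""
    known = [r for r in records if r[0]]
    unknown = [r for r in records if not r[0]]
    # Extraction rate: UNKNOWN questions rescued by oracle context.
    extraction = (sum(c for _, c in unknown) / len(unknown)) if unknown else 0.0
    # Knowledge destruction: KNOWN answers broken by the added context.
    destruction = (sum(not c for _, c in known) / len(known)) if known else 0.0
    return {"extraction_rate": extraction, "destruction_rate": destruction}
```

Tracking `destruction_rate` as a first-class metric is what surfaces the counterintuitive “context can harm” finding for sub-7B models.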
Theme: Robustness under perturbations (video, topology, conversation)
- Why it matters: Clean accuracy hides brittleness; real deployments include degraded inputs, misleading auxiliary text, and multi-turn social pressure.
- Representative papers:
- INFACT: A Diagnostic Benchmark for Induced Faithfulness and Factuality Hallucinations in Video-LLMs
- TopoBench: Benchmarking LLMs on Hard Topological Reasoning
- Stop Listening to Me! How Multi-turn Conversations Can Degrade Diagnostic Reasoning
- Common approach:
- Introduce perturbation modes (visual degradation, evidence corruption, temporal shuffling; multi-turn stick-or-switch).
- Add reliability metrics beyond accuracy (Resist Rate, Temporal Sensitivity; conviction survival; causal error injections).
- Diagnose why failures happen (caption injection sensitivity; temporal inertia; premature commitment/constraint forgetting).
- Open questions / failure modes:
- Temporal inertia: models may keep answers even when temporal order is destroyed (near-zero TSS in some open models).
- Multi-turn “social” degradation: abstention defense erodes sharply across turns (large end-to-end abstention drops).
- Tool augmentation helps some families/tasks but is not uniformly beneficial and can introduce new dependencies.
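The reliability metrics named above can be computed from paired clean/perturbed outcomes. These are simplified readings of the metrics (assumptions, not the benchmarks' exact definitions):

```python
def resist_rate(clean_correct, perturbed_correct):
    """Fraction of clean-correct items that remain correct after perturbation."""
    kept = [p for c, p in zip(clean_correct, perturbed_correct) if c]
    return sum(kept) / len(kept) if kept else 0.0

def temporal_sensitivity(clean_answers, shuffled_answers):
    """Fraction of items whose answer changes when frame order is shuffled.
    Near-zero values indicate temporal inertia: the model ignores order."""
    changed = [a != b for a, b in zip(clean_answers, shuffled_answers)]
    return sum(changed) / len(changed) if changed else 0.0
```

Note the inverted reading of the second metric: for order-dependent questions, a *low* temporal sensitivity is the failure signal, which is why near-zero TSS in some open models is flagged above.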
Theme: Efficiency & scaling mechanisms for long context and tool use
- Why it matters: Long-context deployment is bottlenecked by KV memory and tool-library size; mechanism-aligned compression and divide-and-conquer can preserve quality.
- Representative papers:
- Where Matters More Than What: Decoding-aligned KV Cache Compression via Position-aware Pseudo Queries
- Fractional Rotation, Full Potential? Investigating Performance and Convergence of Partial RoPE
- Try, Check and Retry: A Divide-and-Conquer Framework for Boosting Long-context Tool-Calling Performance of LLMs
- Common approach:
- Align approximations to the decoding-time reality (pseudo queries with future positional IDs; schema validators for tool calls).
- Provide ablations showing which component is essential (Retry removal collapses tool-calling accuracy; positional alignment dominates pseudo-query semantics).
- Report performance under aggressive budgets/extended settings (Needle-in-a-Haystack near-lossless at small KV; tool-calling gains under noisy libraries).
- Open questions / failure modes:
- Inference overhead vs quality: training-free multi-pass tool pipelines add latency; scanner-like components can dominate runtime.
- Hyperparameter sensitivity (pseudo-query window length; RoPE fraction; grouping K; validator strictness).
- Generalization to multi-step/nested tool use remains under-tested for some methods.
3) Technical synthesis
- Multiple papers converge on a “reliability under distribution shift” framing: INFACT’s RR/TSS, conversation tax’s conviction survival, and TopoBench’s causal injections all measure stability rather than raw accuracy.
- Auxiliary text is a recurring adversarial channel: caption injection (INFACT), README instructions (ReadSecBench), and prompt injection corpora (Mirror) all exploit the model’s tendency to treat text as authoritative control-plane input.
- Context is double-edged: retrieval context can reduce accuracy for small models (oracle still mostly wasted; KNOWN answers destroyed), while in GraphRAG, added structured context can be weaponized via KG-coherent poisoning.
- Verification loops are becoming the default pattern across domains:
- Orchestration-level verification + replanning (VMAO),
- Tool-call schema validation (Tool-DC),
- Self-reflection verification loops in domain RAG (LegRAG),
- Runtime hook-based enforcement + audit (PRISM).
- Judge-based supervision is itself an attack surface: reasoning judges can improve gold-judge scores yet induce adversarial “policy-quoting refusal” strategies; non-reasoning judges lead to classic reward hacking.
- Mechanistic alignment beats semantic matching in efficiency work: DapQ shows positional alignment dominates pseudo-query similarity; Partial RoPE shows small fractions can preserve convergence, suggesting many “full” positional mechanisms are over-provisioned.
- Calibration is being operationalized: MLLM confidence miscalibration under noise motivates RL-based calibration rewards (CDRL) and confidence-aware test-time scaling (CA-TTS).
- Security work is splitting into fast, deterministic L1 gates (Mirror’s compiled SVM) plus lifecycle enforcement (PRISM/OpenClaw), reflecting real hot-path constraints.
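The verification-loop pattern recurring across these systems reduces to one generic skeleton; `plan`, `execute`, and `verify` are caller-supplied callables (assumptions), and the bounded round count plays the role of the stop rule:

```python
# Generic plan-execute-verify-replan loop with a hard stop rule.
def plan_execute_verify(plan, execute, verify, max_rounds: int = 3):
    feedback = None
    for _ in range(max_rounds):          # stop rule: bounded replanning
        steps = plan(feedback)           # (re)plan, conditioned on feedback
        result = execute(steps)
        ok, feedback = verify(result)    # verifier drives the loop
        if ok:
            return result
    return None                          # escalate or abstain after budget
```

The same skeleton instantiates differently per system: the verifier is an orchestration agent in VMAO, a schema validator in Tool-DC, a self-reflection pass in Legal-DC's framework, and a runtime policy hook in PRISM.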
4) Top 5 papers (with “why now”)
1) You Told Me to Do It: Measuring Instructional Text-induced Private Data Leakage in LLM Agents
- Quantifies a high-impact, realistic agent failure: README-embedded instructions can drive private-data exfiltration with ASR up to 85%.
- Provides a structured taxonomy (linguistic disguise × structural obfuscation × semantic abstraction) and a 500-README benchmark (ReadSecBench).
- Shows humans (15 participants) detected 0% of injected instructions under naturalistic review, and common scanners/classifiers struggle without high FPs.
- Skepticism: end-to-end results focus on a single deployed agent; some cells have small n (noted by authors).
2) KEPo: Knowledge Evolution Poison on Graph-based Retrieval-Augmented Generation
- Demonstrates GraphRAG-specific poisoning: forging temporally coherent “knowledge evolution” paths integrates poisoned facts into KG communities.
- Outperforms prior poisoning baselines across multiple GraphRAG frameworks; standard defenses (paraphrasing, instruction ignoring, prompt detection) don’t meaningfully reduce ASR.
- Multi-target coordinated poisoning links corpora into reinforcing communities, matching how real KGs cluster information.
- Skepticism: evaluated on simplified/open-source GraphRAG implementations; real-world feasibility depends on indexing/provenance controls.
3) Deactivating Refusal Triggers: Understanding and Mitigating Overrefusal in Safety Alignment
- Offers a concrete mechanism for overrefusal (refusal triggers) with behavioral + representation evidence (rejected benign queries closer to triggers).
- Mitigation is practical: repurpose triggers into trigger-matched benign supervision; reduces benign refusal rates even with far fewer benign samples than Alpaca-scale corpora.
- Works across SFT, prefilled SFT, and RLVR, targeting a ubiquitous deployment pain point.
- Skepticism: trigger extraction relies on an external LLM and evaluation uses automatic detectors (rule-based ASR, keyword RR).
4) Can Small Language Models Use What They Retrieve?
- Cleanly separates retrieval quality from utilization via oracle retrieval + KNOWN/UNKNOWN split.
- Finds sub-7B models waste most oracle retrieval on UNKNOWN questions (very low extraction rates) and retrieval can destroy KNOWN answers (large drops).
- Error analysis highlights irrelevant generation as dominant failure mode, informing where to invest (training vs retriever).
- Skepticism: conclusions are scoped to extractive QA and the chosen corpus; quantization may affect absolute numbers.
5) The Mirror Design Pattern: Strict Data Geometry over Model Scale for Prompt Injection Detection
- Shows that data geometry + a sparse linear model can be a strong, auditable L1 injection gate: high recall with sub-millisecond latency.
- Provides a concrete curation topology (matched cells across nuisance dimensions) and deployment artifact (compiled Rust dot-product).
- Fair comparison on the same holdout shows much higher recall than a semantic baseline (Prompt Guard 2) at far lower latency.
- Skepticism: high false positives on “hard benign” security-adjacent docs at default threshold; external validity beyond the curated geometry remains open.
5) Practical next steps
- Add “in-content harm” tests to your safety evals: run translation/summarization/polish tasks on harmful user-supplied content and measure harmful-response rates; test “wrapped” inputs against your guards.
- Instrument overrefusal debugging: mine refusal triggers from your harmful training set (or logs) and measure embedding/hidden-state similarity between benign refusals and trigger sets; try trigger-matched benign supervision.
- Treat agent security as lifecycle engineering: implement hook points (ingress, prompt build, pre/post tool call, persistence, egress) and log tamper-evident audit trails; measure p95 latency impact when escalations trigger.
- Deploy a fast, non-promptable L1 gate for injection-like patterns (e.g., compiled linear classifier) and reserve semantic/LLM scanning for escalations; tune thresholds using a “hard benign” set to control FPR.
- For GraphRAG, add KG-level provenance and anomaly checks: specifically look for temporally “evolutionary” narratives that connect anchors to new endpoints; evaluate against KEPo-style attacks rather than flat-text poisoning.
- If you ship RAG with small models, gate retrieval: use a confidence/knownness heuristic to avoid adding context when the model likely already knows; measure “knowledge destruction” rate as a first-class metric.
- Adopt perturbation-based reliability metrics in multimodal/video: include caption injection, subtitle desync, and temporal shuffles; track Resist Rate and Temporal Sensitivity, not just base accuracy.
- When using LLM-judges for RL, red-team the judge: search for adversarial response patterns that inflate judge scores (e.g., fabricated policy refusals) and rotate/ensemble judges or add adversarial training.
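The “fast, non-promptable L1 gate” from the list above is, at its core, a sparse dot product over token features. A toy sketch with made-up weights; a real gate learns them (e.g., via a linear SVM on a curated injection/benign corpus) and tunes the threshold on a hard-benign set:

```python
# Sparse linear injection gate: hash-free bag-of-words score vs a threshold.
# Weights, bias, and threshold below are illustrative, not learned values.
import re

WEIGHTS = {
    "ignore": 1.5, "previous": 1.0, "instructions": 1.0,
    "system": 0.8, "exfiltrate": 2.0,
}
BIAS = -2.5
THRESHOLD = 0.0

def l1_gate(text: str) -> bool:
    """True = flag the text for escalation to the semantic/LLM scanner."""
    tokens = re.findall(r"[a-z]+", text.lower())
    score = BIAS + sum(WEIGHTS.get(t, 0.0) for t in tokens)
    return score > THRESHOLD
```

Because the decision is a fixed dot product, the gate cannot be talked out of its verdict by the input it is scoring, and it runs in microseconds on the hot path; everything ambiguous escalates to the slower scanner.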
Generated from per-paper analyses; no external browsing.
