Daily AI Paper Report (2026-03-14)

Published: March 14, 2026

Chinese version: [中文]

Run stats

Candidates: 249
Selected: 30
Deepread completed: 30
Window (UTC): 2026-03-12T00:00:00Z → 2026-03-13T00:00:00Z (arxiv_announce, expanded=0)

Show selected papers

arXiv ID	Title / Links	Categories	Score	Why	Tags
`2603.11853`	OpenClaw PRISM: A Zero-Fork, Defense-in-Depth Runtime Security Layer for Tool-Augmented LLM Agents PDF	cs.CR	94	Defense-in-depth runtime layer for tool agents; lifecycle hooks + risk accumulation for real deployments	agent-security, tool-use, prompt-injection, runtime-enforcement, sandboxing, monitoring
`2603.11862`	You Told Me to Do It: Measuring Instructional Text-induced Private Data Leakage in LLM Agents PDF	cs.CR, cs.AI	93	Measures doc-instruction induced leakage in high-privilege agents; frames structural 'Trusted Executor' vuln	agent-security, data-leakage, prompt-injection, evaluation, taxonomy, autonomous-agents
`2603.11875`	The Mirror Design Pattern: Strict Data Geometry over Model Scale for Prompt Injection Detection PDF	cs.CR, cs.AI	93	Fast, auditable prompt-injection detector via strict data geometry; strong practical security angle	prompt-injection, security, dataset-curation, auditable-ml, linear-models, robustness
`2603.11481`	INFACT: A Diagnostic Benchmark for Induced Faithfulness and Factuality Hallucinations in Video-LLMs PDF	cs.CV, cs.AI	92	Diagnostic benchmark for Video-LLM hallucinations incl. induced corruptions + new robustness metrics	hallucinations, factuality, faithfulness, video-llm, robustness, benchmark, evaluation, corruption
`2603.12011`	Can RL Improve Generalization of LLM Agents? An Empirical Study PDF	cs.AI	92	Systematic study of RFT generalization/transfer/forgetting for LLM agents under environment shifts	llm-agents, reinforcement-learning, generalization, transfer, evaluation, distribution-shift
`2603.11619`	Taming OpenClaw: Security Analysis and Mitigation of Autonomous LLM Agent Threats PDF	cs.CR, cs.AI	90	Comprehensive threat analysis + mitigations for OpenClaw agents across lifecycle (memory/tool/supply chain)	agent-security, threat-modeling, prompt-injection, memory-poisoning, supply-chain, mitigations
`2603.11914`	Understanding LLM Behavior When Encountering User-Supplied Harmful Content in Harmless Tasks PDF	cs.CR, cs.AI	90	Evaluates content-level refusal when harmful user text appears inside benign tasks; new dataset + tests	safety, refusal, policy-compliance, harmful-content, evaluation, dataset
`2603.12109`	On Information Self-Locking in Reinforcement Learning for Active Reasoning of LLM agents PDF	cs.AI	90	Identifies RL failure mode in active reasoning (info self-locking) via action selection vs belief tracking	llm-agents, active-reasoning, reinforcement-learning, failure-modes, belief-tracking, agent-reliability
`2603.12230`	Security Considerations for Artificial Intelligence Agents PDF	cs.LG, cs.AI, cs.CR	88	Practitioner-informed mapping of frontier agent attack surfaces; useful guidance for real-world hardening	agent-security, attack-surface, confused-deputy, indirect-prompt-injection, deployment, best-practices
`2603.12152`	LifeSim: Long-Horizon User Life Simulator for Personalized Assistant Evaluation PDF	cs.CL	88	Long-horizon personalized assistant eval via BDI user simulator; 1,200 scenarios across life domains	agents, evaluation, personalization, user-simulation, long-horizon, benchmark
`2603.11394`	Stop Listening to Me! How Multi-turn Conversations Can Degrade Diagnostic Reasoning PDF	cs.CL, cs.AI, cs.LG	88	Shows multi-turn dialogue can degrade clinical reasoning; measures conviction vs flexibility (safety-relevant)	healthcare, multi-turn, robustness, evaluation, diagnostic-reasoning, reliability, conversation
`2603.11495`	Try, Check and Retry: A Divide-and-Conquer Framework for Boosting Long-context Tool-Calling Performance of LLMs PDF	cs.CL	88	Divide-and-conquer Try/Check/Retry boosts long-context tool-calling with many noisy tools	tool-use, agents, long-context, self-reflection, robustness, evaluation
`2603.11501`	KEPo: Knowledge Evolution Poison on Graph-based Retrieval-Augmented Generation PDF	cs.LG, cs.AI, cs.CR	87	Poisoning attack tailored to GraphRAG; highlights new RAG/knowledge-graph security failure modes	RAG-security, data-poisoning, GraphRAG, knowledge-graphs, adversarial-attacks, robustness
`2603.11987`	LABSHIELD: A Multimodal Benchmark for Safety-Critical Reasoning and Planning in Scientific Laboratories PDF	cs.AI	86	LABSHIELD benchmark for safety-critical multimodal lab planning; concrete eval for high-stakes agents	benchmarks, multimodal, embodied-agents, safety-evaluation, planning, hazard-detection
`2603.12149`	Linking Perception, Confidence and Accuracy in MLLMs PDF	cs.CV, cs.CL	86	Targets MLLM confidence miscalibration with RL + test-time scaling; reliability-critical for deployment	calibration, multimodal, uncertainty, reinforcement-learning, test-time-scaling, reliability
`2603.12237`	STAMP: Selective Task-Aware Mechanism for Text Privacy PDF	cs.LG, cs.CR, cs.IT	86	Token-level task-aware privatization with selective budgets + new embedding perturbation (polar mechanism)	privacy, differential-privacy, text, token-level, privacy-utility, security, embeddings
`2603.11513`	Can Small Language Models Use What They Retrieve? An Empirical Study of Retrieval Utilization Across Model Scale PDF	cs.CL	86	Measures whether small LMs actually use retrieved info; separates retrieval quality vs utilization failure	RAG, retrieval-utilization, small-models, factuality, evaluation, knowledge
`2603.11975`	HomeSafe-Bench: Evaluating Vision-Language Models on Unsafe Action Detection for Embodied Agents in Household Scenarios PDF	cs.CV, cs.AI, cs.CR	85	HomeSafe-Bench evaluates VLMs on unsafe action detection in household videos; fills dynamic safety gap	benchmarks, VLM, embodied-safety, unsafe-actions, video, household-robots
`2603.11388`	Deactivating Refusal Triggers: Understanding and Mitigating Overrefusal in Safety Alignment PDF	cs.AI	84	Analyzes overrefusal via 'refusal triggers' and proposes mitigation; improves safety/usability tradeoff	alignment, safety-training, overrefusal, refusal, post-training, reliability
`2603.12133`	TopoBench: Benchmarking LLMs on Hard Topological Reasoning PDF	cs.AI, cs.CL	84	Hard topological reasoning benchmark + error taxonomy from 750 CoT traces; useful for diagnosing failures	reasoning, benchmark, spatial-reasoning, error-analysis, chain-of-thought, evaluation
`2603.11445`	Verified Multi-Agent Orchestration: A Plan-Execute-Verify-Replan Framework for Complex Query Resolution PDF	cs.AI, cs.MA	84	Plan-execute-verify-replan multi-agent orchestration with DAG parallelism and verifier-driven replanning	agents, orchestration, verification, planning, multi-agent, evaluation, replanning
`2603.11611`	Fractional Rotation, Full Potential? Investigating Performance and Convergence of Partial RoPE PDF	cs.LG, cs.CL	84	Partial RoPE study shows up to 10x KV-cache memory savings with similar loss; relevant for long context	transformers, positional-embeddings, efficiency, long-context, memory, training-dynamics
`2603.11442`	GPT4o-Receipt: A Dataset and Human Study for AI-Generated Document Forensics PDF	cs.AI, cs.CV	83	Document forensics benchmark + human study; finds arithmetic errors as key detection signal for AI receipts	forensics, synthetic-documents, multimodal, benchmark, security, fraud, evaluation
`2603.11749`	Compression Favors Consistency, Not Truth: When and Why Language Models Prefer Correct Information PDF	cs.CL, cs.AI	83	Compression–consistency principle explains when LMs prefer correct info despite mixed-quality training data	theory, truthfulness, hallucinations, inductive-bias, compression, synthetic-data
`2603.12206`	CLASP: Defending Hybrid Large Language Models Against Hidden State Poisoning Attacks PDF	cs.CL	82	Defense for Mamba/SSM hidden-state poisoning via token-level detection using BOE features (low overhead)	SSM, Mamba, security, poisoning, adversarial-text, detection
`2603.12165`	QAQ: Bidirectional Semantic Coherence for Selecting High-Quality Synthetic Code Instructions PDF	cs.CL	82	Synthetic code-instruction filtering via bidirectional coherence (Q\|A); addresses hallucinated data noise	synthetic-data, data-selection, code, hallucinations, training, mutual-information
`2603.11564`	Where Matters More Than What: Decoding-aligned KV Cache Compression via Position-aware Pseudo Queries PDF	cs.CL	82	Decoding-aligned KV-cache compression using position-aware pseudo queries for long-context efficiency	long-context, kv-cache, compression, inference, efficiency, transformers, memory
`2603.11949`	Delayed Backdoor Attacks: Exploring the Temporal Dimension as a New Attack Surface in Pre-Trained Models PDF	cs.CR, cs.AI	81	Introduces delayed backdoors with temporal triggers; expands PTM threat model beyond immediate activation	backdoors, model-security, pretrained-models, temporal-attacks, threat-modeling
`2603.11793`	Locating Demographic Bias at the Attention-Head Level in CLIP's Vision Encoder PDF	cs.CV, cs.AI, cs.CY	81	Mechanistic fairness audit locates demographic bias to specific CLIP attention heads; ablation reduces bias	fairness, interpretability, mechanistic-audit, vision-transformers, CLIP, bias-mitigation
`2603.12232`	Incremental Neural Network Verification via Learned Conflicts PDF	cs.LO, cs.AI	80	Incremental neural net verification reusing learned conflicts across queries; improves safety assurance tooling	verification, formal-methods, neural-networks, branch-and-bound, safety-assurance

AI Paper Insight Brief

2026-03-14

0) Executive takeaways (read this first)

“Safety” failures are increasingly about interaction structure, not just content: multi-turn clinical dialogs degrade diagnosis (“conversation tax”), and agents executing README instructions leak data at high rates—both show that sequential decision-making amplifies risk even when single-shot benchmarks look fine.
Data- and representation-aware safety tuning beats generic augmentation: “refusal triggers” explain overrefusal mechanistically, and trigger-matched benign data sharply improves the safety–utility tradeoff versus generic benign corpora.
RAG/GraphRAG is a double-edged sword: small models often cannot use retrieved evidence even when it contains the answer (utilization bottleneck), while GraphRAG introduces a new poisoning surface where temporally coherent “knowledge evolution” injections can dominate retrieval and generation.
Runtime security for agents is moving from “filters” to “lifecycle enforcement”: OpenClaw threat taxonomies + PRISM’s hook-based enforcement and audit chaining point toward defense-in-depth architectures with measurable block-rate gains—but latency and policy maintenance remain real constraints.
Robustness diagnostics are getting more causal and more realistic: TopoBench uses causal error injections to identify which reasoning failures actually matter; INFACT and HomeSafe/LABSHIELD show that corruptions, temporal interventions, and embodied perception gaps (e.g., transparent glassware) break “clean” performance assumptions.

2) Key themes (clusters)

Theme: Alignment failures from benign wrappers and overgeneralization

Why it matters: Models can be unsafe or unusable even when they “follow policy” at the task level—either by refusing too much (overrefusal) or by complying with benign tasks that contain harmful content.
Representative papers:
- Deactivating Refusal Triggers: Understanding and Mitigating Overrefusal in Safety Alignment
- Understanding LLM Behavior When Encountering User-Supplied Harmful Content in Harmless Tasks
Common approach:
- Identify where safety behavior generalizes from (linguistic “triggers” vs. task framing).
- Construct targeted datasets (trigger-matched benign data; harmful-content-in-benign-task dataset) and measure refusal/harm rates.
- Use automated labeling/detectors (keyword RR, rule-based ASR; Moderation API + human validation).
Open questions / failure modes:
- Reliance on external models for extraction/labeling (e.g., GPT-4o trigger extraction; Moderation API judgments).
- How to get calibrated behavior (neither blanket refusal nor blind compliance) without brittle heuristics.
- Whether these mechanisms hold under domain shift and more naturalistic user interactions.

Theme: Multi-turn interaction tax (sycophancy, switching, and agentic leakage)

Why it matters: Sequential interactions create compounding error modes—models abandon correct answers, lose abstentions, or execute untrusted instructions—raising safety risk in healthcare and high-privilege agent settings.
Representative papers:
- Stop Listening to Me! How Multi-turn Conversations Can Degrade Diagnostic Reasoning
- You Told Me to Do It: Measuring Instructional Text-induced Private Data Leakage in LLM Agents
Common approach:
- Convert static tasks into multi-turn protocols (stick-or-switch binary turns; README-driven setup workflows).
- Measure conviction/abstention retention vs. switching; measure exfiltration ASR/RR/TSR.
- Stress-test across phrasing/structure/abstraction (linguistic disguise, link depth, semantic abstraction).
Open questions / failure modes:
- How to prevent “blind switching” while preserving flexibility to adopt correct late evidence.
- How to enforce trust boundaries for documentation and other “ambient instructions” without breaking usability.
- Generalization beyond perturbed MCQ setups and beyond one privileged agent implementation.

Theme: Agent security engineering: lifecycle defenses, not just classifiers

Why it matters: Tool-using agents have multiple ingress points (web, tools, files, memory) and high privileges; point defenses miss cross-stage attacks and persistence.
Representative papers:
Common approach:
- Map threats across an agent lifecycle (init/input/inference/decision/execution) and enforce controls at multiple hooks.
- Combine fast heuristics with escalation (scanner sidecar + optional LLM judge) and session-scoped risk accumulation.
- Add operational primitives: hot-reloadable policy, audit trails (tamper-evident chaining), tool/path governance.
Open questions / failure modes:
- Latency/cost blowups when escalating to heavier scanning (PRISM reports multi-second p95 in scanner-backed paths).
- Coverage gaps: string-level path checks (no symlink containment), incomplete obfuscation coverage, policy drift.
- Lack of large-scale, multi-tenant field validation and adaptive-adversary evaluation.

Theme: RAG reliability and GraphRAG poisoning

Why it matters: Retrieval can hurt (small models ignore or are distracted by context), while graph-based retrieval introduces new poisoning strategies that bypass naive RAG defenses.
Representative papers:
- Can Small Language Models Use What They Retrieve? An Empirical Study of Retrieval Utilization Across Model Scale
- KEPo: Knowledge Evolution Poison on Graph-based Retrieval-Augmented Generation
Common approach:
- Separate retrieval quality from utilization (oracle passage at rank 1; KNOWN/UNKNOWN split).
- Attack GraphRAG by crafting documents that integrate into KG structure via temporally coherent “evolution” narratives.
- Evaluate with ASR/CASR and defense baselines (paraphrasing, instruction ignoring, prompt detection).
Open questions / failure modes:
- For small models: how to prevent retrieval from destroying KNOWN answers (selective retrieval, RAG-aware tuning).
- For GraphRAG: how to validate temporal claims and detect anomalous evolution chains during KG construction.
- Over-reliance on LLM-based evaluators for attack success and safety judgments.

Theme: Robustness benchmarks that stress time, corruption, and embodiment

Why it matters: Clean accuracy is not predictive of reliability under corruptions, temporal interventions, or embodied perception constraints; safety-critical deployment needs these stressors.
Representative papers:
Common approach:
- Induce failures via controlled operators (caption injection, subtitle corruption, temporal shuffling/reversal; hazard keyframes).
- Use metrics that reflect stability/timeliness (Resist Rate, Temporal Sensitivity Score; Weighted Safety Score with phase scoring).
- Diagnose perception bottlenecks (e.g., transparent glassware) and MCQ-to-open-plan performance drops.
Open questions / failure modes:
- Domain gap from synthetic/simulated/generated videos to real robots and real lab operations.
- Temporal inertia: many models show near-zero sensitivity to order changes on factuality.
- Judge optimism/instability for plan scoring (LABSHIELD explicitly uses score + pass rate).

Theme: Efficiency + long-context: position is a lever

Why it matters: Long-context deployment is constrained by KV cache memory and positional encoding overhead; small architectural choices can yield large savings without large quality loss.
Representative papers:
- Where Matters More Than What: Decoding-aligned KV Cache Compression via Position-aware Pseudo Queries
- Fractional Rotation, Full Potential? Investigating Performance and Convergence of Partial RoPE
Common approach:
- Use position-aligned pseudo queries at prefill to approximate decoding attention and guide eviction (DapQ).
- Pretrain from scratch while varying RoPE rotated-dimension fraction to map stability/performance bands.
- Validate across long-context benchmarks and measure memory/latency impacts.
Open questions / failure modes:
- Sensitivity to positional alignment/window placement (DapQ) and to fraction thresholds (partial RoPE).
- How these methods interact with length extrapolation and other long-context tricks.
- Whether semantic content of pseudo queries can be optimized without losing the “lightweight” property.

Theme: Data-centric security detectors and forensics

Why it matters: Some security problems benefit from fast, auditable, non-promptable first-line detectors and from exploiting non-visual consistency checks (e.g., arithmetic) rather than “semantic understanding.”
Representative papers:
Common approach:
- Engineer datasets to control nuisance dimensions (Mirror’s matched cells across reasons × languages).
- Use lightweight, inspectable models (char n-gram linear SVM compiled to Rust; XGBoost on BOE features).
- Leverage structured consistency signals (receipt arithmetic integrity) that humans underuse.
Open questions / failure modes:
- Hard-benign false positives and paraphrase-attack recall ceilings for L1 detectors (Mirror).
- Domain transfer: résumé-only HiSPA detection (CLASP) and single-generator receipt artifacts (GPT4o-Receipt).
- Adaptive adversaries and evolving generators/models.

3) Technical synthesis

Multiple papers converge on a “clean benchmark ≠ deployed reliability” pattern: INFACT (Base vs induced modes), LABSHIELD (MCQ vs semi-open PRP), and clinical stick-or-switch (single-shot vs multi-turn) all show large gaps.
Temporal structure is a recurring vulnerability: delayed backdoors activate after cumulative triggers; Video-LLMs show temporal inertia (near-zero TSS); household safety needs early-warning keyframes; multi-turn diagnosis suffers compounding switches.
External LLMs are increasingly used as infrastructure (trigger extraction, evaluators, expert planners/judges), but this creates shared-model bias and reproducibility issues (noted in VMAO, receipt forensics, and several safety evaluations).
Data geometry and targeted matching appear as a general lever: refusal-trigger-matched benign data (overrefusal), Mirror’s matched cells (prompt injection), and QAQ’s stratified reverse-MI selection (synthetic code) all argue that what you match matters more than raw scale.
Retrieval is not automatically helpful: small models show negative expected EM change with retrieval and high “irrelevant generation” even under oracle passages; meanwhile GraphRAG’s structure can be exploited by coherence/temporal chaining attacks.
Verification/checking loops are trending across domains: Tool-DC’s schema validator, VMAO’s verifier-driven replanning, CA-TTS’s self-check modules, and agent runtime hooks (PRISM) all implement “try → check → retry” patterns at different layers.
Position dominates in two separate efficiency stories: DapQ uses positional pseudo queries to align eviction with decoding; partial RoPE shows ≥10% rotation reaches a lower-loss band similar to full RoPE with large cache savings.
Causal diagnostics are emerging: TopoBench’s injected error modes identify which reasoning behaviors actually reduce accuracy (premature commitment, constraint forgetting), moving beyond observational CoT labeling.
Safety screening is bifurcating into (a) fast, deterministic L1 gates (Mirror SVM; PRISM heuristics) and (b) slower semantic/LLM-based adjudication for residuals—mirroring classic security architectures.

4) Top 5 papers (with “why now”)

1) Deactivating Refusal Triggers: Understanding and Mitigating Overrefusal in Safety Alignment

Mechanistic framing of overrefusal via “refusal triggers” and representational similarity evidence.
Practical mitigation: trigger-matched benign data improves safety–utility across SFT/P-SFT/RLVR (e.g., large Avg↓ reductions vs Alpaca baselines).
Useful for teams seeing usability regressions after safety tuning and needing a data construction recipe.
Skepticism: trigger extraction depends on GPT-4o and evaluation relies on automatic ASR/RR detectors.

2) Stop Listening to Me! How Multi-turn Conversations Can Degrade Diagnostic Reasoning

Introduces stick-or-switch metrics (positive/negative conviction, flexibility) that expose multi-turn failure modes.
Shows large abstention collapse and switching/sycophancy patterns across many models; GPT-5.2 is best but not perfect.
Directly relevant to clinical deployments where interaction is inherently incremental.
Skepticism: uses perturbed MCQA rather than real conversation logs; limited internal confidence analysis.

3) KEPo: Knowledge Evolution Poison on Graph-based Retrieval-Augmented Generation

Demonstrates GraphRAG-specific poisoning via temporally coherent evolution narratives that integrate into KGs.
Shows strong ASR/CASR across multiple GraphRAG variants and that common defenses have little effect.
“Why now”: GraphRAG adoption is rising for timeliness/multi-hop; this is a concrete new attack surface.
Skepticism: black-box threat assumes ability to inject crawlable docs; evaluation uses GPT-4o-based measures.

4) You Told Me to Do It: Measuring Instructional Text-induced Private Data Leakage in LLM Agents

Defines the Trusted Executor Dilemma and quantifies README-embedded instruction attacks (ASR up to 85%).
Robustness across disguise/structure/abstraction; human reviewers detect 0% under naturalistic review framing.
Evaluates defenses and shows poor trade-offs (rule-based high FP; minimal LLM prompts under-detect).
Skepticism: some per-condition sample sizes are small; primary end-to-end focus is one commercial agent.

5) INFACT: A Diagnostic Benchmark for Induced Faithfulness and Factuality Hallucinations in Video-LLMs

Adds factuality + induced corruptions + temporal interventions for Video-LLMs with RR/TSS metrics.
Finds evidence corruption (caption injection) is especially damaging; many open models show near-zero factuality TSS.
“Why now”: video agents and multimodal assistants are moving into settings with unreliable subtitles/captions.
Skepticism: induced operators are proxies; temporal interventions are limited to shuffle/reversal on a subset.

5) Practical next steps

Add “interaction-structure” evals to safety gates: run stick-or-switch style multi-turn tests (conviction/abstention retention) for any high-stakes domain assistant, not just single-shot accuracy.
Instrument overrefusal via trigger mining: extract candidate refusal triggers from your harmful SFT/RLHF data and measure representation/semantic-distance dependence; try trigger-matched benign augmentation rather than generic benign corpora.
Harden agent setup workflows: treat README/docs as untrusted; require provenance-aware trust tiers and action-level confirmations for filesystem/network reads, especially during install/setup.
Adopt lifecycle hooks + auditability for tool agents: implement multi-hook enforcement (ingress/pre/post tool/outbound/maintenance), session-scoped risk accumulation, and tamper-evident audit chaining; measure block-rate vs latency.
For GraphRAG deployments: add KG-ingestion checks for anomalous temporal chaining and multi-source corroboration before integrating new edges; explicitly test against knowledge-evolution poisoning.
For small-model RAG: don’t assume retrieval helps—measure KNOWN/UNKNOWN splits and oracle utilization; consider selective retrieval (only when uncertain) and RAG-aware fine-tuning to reduce “irrelevant generation.”
For long-context inference: evaluate position-aligned KV eviction (pseudo-query scoring) and consider partial RoPE (≥10%) to reduce memory while monitoring stability bands and placement sensitivity.
For multimodal safety: incorporate induced corruptions (caption injection, temporal shuffles) and embodied perception stressors (transparent objects, early-warning timing) into acceptance tests; track stability metrics, not just base accuracy.

Generated from per-paper analyses; no external browsing.

Di Tang

AI Paper Insight Brief

2026-03-14

0) Executive takeaways (read this first)

2) Key themes (clusters)

Theme: Alignment failures from benign wrappers and overgeneralization

Theme: Multi-turn interaction tax (sycophancy, switching, and agentic leakage)

Theme: Agent security engineering: lifecycle defenses, not just classifiers

Theme: RAG reliability and GraphRAG poisoning

Theme: Robustness benchmarks that stress time, corruption, and embodiment

Theme: Efficiency + long-context: position is a lever

Theme: Data-centric security detectors and forensics

3) Technical synthesis

4) Top 5 papers (with “why now”)

5) Practical next steps