Daily AI Paper Report (2026-03-14)

Published:

Chinese version: [中文]

Run stats

  • Candidates: 249
  • Selected: 30
  • Deepread completed: 30
  • Window (UTC): 2026-03-12T00:00:00Z → 2026-03-13T00:00:00Z (arxiv_announce, expanded=0)
Show selected papers
arXiv IDTitle / LinksCategoriesScoreWhyTags
2603.11853OpenClaw PRISM: A Zero-Fork, Defense-in-Depth Runtime Security Layer for Tool-Augmented LLM Agents
PDF
cs.CR94Defense-in-depth runtime layer for tool agents; lifecycle hooks + risk accumulation for real deploymentsagent-security, tool-use, prompt-injection, runtime-enforcement, sandboxing, monitoring
2603.11862You Told Me to Do It: Measuring Instructional Text-induced Private Data Leakage in LLM Agents
PDF
cs.CR, cs.AI93Measures doc-instruction induced leakage in high-privilege agents; frames structural 'Trusted Executor' vulnagent-security, data-leakage, prompt-injection, evaluation, taxonomy, autonomous-agents
2603.11875The Mirror Design Pattern: Strict Data Geometry over Model Scale for Prompt Injection Detection
PDF
cs.CR, cs.AI93Fast, auditable prompt-injection detector via strict data geometry; strong practical security angleprompt-injection, security, dataset-curation, auditable-ml, linear-models, robustness
2603.11481INFACT: A Diagnostic Benchmark for Induced Faithfulness and Factuality Hallucinations in Video-LLMs
PDF
cs.CV, cs.AI92Diagnostic benchmark for Video-LLM hallucinations incl. induced corruptions + new robustness metricshallucinations, factuality, faithfulness, video-llm, robustness, benchmark, evaluation, corruption
2603.12011Can RL Improve Generalization of LLM Agents? An Empirical Study
PDF
cs.AI92Systematic study of RFT generalization/transfer/forgetting for LLM agents under environment shiftsllm-agents, reinforcement-learning, generalization, transfer, evaluation, distribution-shift
2603.11619Taming OpenClaw: Security Analysis and Mitigation of Autonomous LLM Agent Threats
PDF
cs.CR, cs.AI90Comprehensive threat analysis + mitigations for OpenClaw agents across lifecycle (memory/tool/supply chain)agent-security, threat-modeling, prompt-injection, memory-poisoning, supply-chain, mitigations
2603.11914Understanding LLM Behavior When Encountering User-Supplied Harmful Content in Harmless Tasks
PDF
cs.CR, cs.AI90Evaluates content-level refusal when harmful user text appears inside benign tasks; new dataset + testssafety, refusal, policy-compliance, harmful-content, evaluation, dataset
2603.12109On Information Self-Locking in Reinforcement Learning for Active Reasoning of LLM agents
PDF
cs.AI90Identifies RL failure mode in active reasoning (info self-locking) via action selection vs belief trackingllm-agents, active-reasoning, reinforcement-learning, failure-modes, belief-tracking, agent-reliability
2603.12230Security Considerations for Artificial Intelligence Agents
PDF
cs.LG, cs.AI, cs.CR88Practitioner-informed mapping of frontier agent attack surfaces; useful guidance for real-world hardeningagent-security, attack-surface, confused-deputy, indirect-prompt-injection, deployment, best-practices
2603.12152LifeSim: Long-Horizon User Life Simulator for Personalized Assistant Evaluation
PDF
cs.CL88Long-horizon personalized assistant eval via BDI user simulator; 1,200 scenarios across life domainsagents, evaluation, personalization, user-simulation, long-horizon, benchmark
2603.11394Stop Listening to Me! How Multi-turn Conversations Can Degrade Diagnostic Reasoning
PDF
cs.CL, cs.AI, cs.LG88Shows multi-turn dialogue can degrade clinical reasoning; measures conviction vs flexibility (safety-relevant)healthcare, multi-turn, robustness, evaluation, diagnostic-reasoning, reliability, conversation
2603.11495Try, Check and Retry: A Divide-and-Conquer Framework for Boosting Long-context Tool-Calling Performance of LLMs
PDF
cs.CL88Divide-and-conquer Try/Check/Retry boosts long-context tool-calling with many noisy toolstool-use, agents, long-context, self-reflection, robustness, evaluation
2603.11501KEPo: Knowledge Evolution Poison on Graph-based Retrieval-Augmented Generation
PDF
cs.LG, cs.AI, cs.CR87Poisoning attack tailored to GraphRAG; highlights new RAG/knowledge-graph security failure modesRAG-security, data-poisoning, GraphRAG, knowledge-graphs, adversarial-attacks, robustness
2603.11987LABSHIELD: A Multimodal Benchmark for Safety-Critical Reasoning and Planning in Scientific Laboratories
PDF
cs.AI86LABSHIELD benchmark for safety-critical multimodal lab planning; concrete eval for high-stakes agentsbenchmarks, multimodal, embodied-agents, safety-evaluation, planning, hazard-detection
2603.12149Linking Perception, Confidence and Accuracy in MLLMs
PDF
cs.CV, cs.CL86Targets MLLM confidence miscalibration with RL + test-time scaling; reliability-critical for deploymentcalibration, multimodal, uncertainty, reinforcement-learning, test-time-scaling, reliability
2603.12237STAMP: Selective Task-Aware Mechanism for Text Privacy
PDF
cs.LG, cs.CR, cs.IT86Token-level task-aware privatization with selective budgets + new embedding perturbation (polar mechanism)privacy, differential-privacy, text, token-level, privacy-utility, security, embeddings
2603.11513Can Small Language Models Use What They Retrieve? An Empirical Study of Retrieval Utilization Across Model Scale
PDF
cs.CL86Measures whether small LMs actually use retrieved info; separates retrieval quality vs utilization failureRAG, retrieval-utilization, small-models, factuality, evaluation, knowledge
2603.11975HomeSafe-Bench: Evaluating Vision-Language Models on Unsafe Action Detection for Embodied Agents in Household Scenarios
PDF
cs.CV, cs.AI, cs.CR85HomeSafe-Bench evaluates VLMs on unsafe action detection in household videos; fills dynamic safety gapbenchmarks, VLM, embodied-safety, unsafe-actions, video, household-robots
2603.11388Deactivating Refusal Triggers: Understanding and Mitigating Overrefusal in Safety Alignment
PDF
cs.AI84Analyzes overrefusal via 'refusal triggers' and proposes mitigation; improves safety/usability tradeoffalignment, safety-training, overrefusal, refusal, post-training, reliability
2603.12133TopoBench: Benchmarking LLMs on Hard Topological Reasoning
PDF
cs.AI, cs.CL84Hard topological reasoning benchmark + error taxonomy from 750 CoT traces; useful for diagnosing failuresreasoning, benchmark, spatial-reasoning, error-analysis, chain-of-thought, evaluation
2603.11445Verified Multi-Agent Orchestration: A Plan-Execute-Verify-Replan Framework for Complex Query Resolution
PDF
cs.AI, cs.MA84Plan-execute-verify-replan multi-agent orchestration with DAG parallelism and verifier-driven replanningagents, orchestration, verification, planning, multi-agent, evaluation, replanning
2603.11611Fractional Rotation, Full Potential? Investigating Performance and Convergence of Partial RoPE
PDF
cs.LG, cs.CL84Partial RoPE study shows up to 10x KV-cache memory savings with similar loss; relevant for long contexttransformers, positional-embeddings, efficiency, long-context, memory, training-dynamics
2603.11442GPT4o-Receipt: A Dataset and Human Study for AI-Generated Document Forensics
PDF
cs.AI, cs.CV83Document forensics benchmark + human study; finds arithmetic errors as key detection signal for AI receiptsforensics, synthetic-documents, multimodal, benchmark, security, fraud, evaluation
2603.11749Compression Favors Consistency, Not Truth: When and Why Language Models Prefer Correct Information
PDF
cs.CL, cs.AI83Compression–consistency principle explains when LMs prefer correct info despite mixed-quality training datatheory, truthfulness, hallucinations, inductive-bias, compression, synthetic-data
2603.12206CLASP: Defending Hybrid Large Language Models Against Hidden State Poisoning Attacks
PDF
cs.CL82Defense for Mamba/SSM hidden-state poisoning via token-level detection using BOE features (low overhead)SSM, Mamba, security, poisoning, adversarial-text, detection
2603.12165QAQ: Bidirectional Semantic Coherence for Selecting High-Quality Synthetic Code Instructions
PDF
cs.CL82Synthetic code-instruction filtering via bidirectional coherence (Q|A); addresses hallucinated data noisesynthetic-data, data-selection, code, hallucinations, training, mutual-information
2603.11564Where Matters More Than What: Decoding-aligned KV Cache Compression via Position-aware Pseudo Queries
PDF
cs.CL82Decoding-aligned KV-cache compression using position-aware pseudo queries for long-context efficiencylong-context, kv-cache, compression, inference, efficiency, transformers, memory
2603.11949Delayed Backdoor Attacks: Exploring the Temporal Dimension as a New Attack Surface in Pre-Trained Models
PDF
cs.CR, cs.AI81Introduces delayed backdoors with temporal triggers; expands PTM threat model beyond immediate activationbackdoors, model-security, pretrained-models, temporal-attacks, threat-modeling
2603.11793Locating Demographic Bias at the Attention-Head Level in CLIP's Vision Encoder
PDF
cs.CV, cs.AI, cs.CY81Mechanistic fairness audit locates demographic bias to specific CLIP attention heads; ablation reduces biasfairness, interpretability, mechanistic-audit, vision-transformers, CLIP, bias-mitigation
2603.12232Incremental Neural Network Verification via Learned Conflicts
PDF
cs.LO, cs.AI80Incremental neural net verification reusing learned conflicts across queries; improves safety assurance toolingverification, formal-methods, neural-networks, branch-and-bound, safety-assurance

AI Paper Insight Brief

2026-03-14

0) Executive takeaways (read this first)

  • “Safety” failures are increasingly about interaction structure, not just content: multi-turn clinical dialogs degrade diagnosis (“conversation tax”), and agents executing README instructions leak data at high rates—both show that sequential decision-making amplifies risk even when single-shot benchmarks look fine.
  • Data- and representation-aware safety tuning beats generic augmentation: “refusal triggers” explain overrefusal mechanistically, and trigger-matched benign data sharply improves the safety–utility tradeoff versus generic benign corpora.
  • RAG/GraphRAG is a double-edged sword: small models often cannot use retrieved evidence even when it contains the answer (utilization bottleneck), while GraphRAG introduces a new poisoning surface where temporally coherent “knowledge evolution” injections can dominate retrieval and generation.
  • Runtime security for agents is moving from “filters” to “lifecycle enforcement”: OpenClaw threat taxonomies + PRISM’s hook-based enforcement and audit chaining point toward defense-in-depth architectures with measurable block-rate gains—but latency and policy maintenance remain real constraints.
  • Robustness diagnostics are getting more causal and more realistic: TopoBench uses causal error injections to identify which reasoning failures actually matter; INFACT and HomeSafe/LABSHIELD show that corruptions, temporal interventions, and embodied perception gaps (e.g., transparent glassware) break “clean” performance assumptions.

2) Key themes (clusters)

Theme: Alignment failures from benign wrappers and overgeneralization

  • Why it matters: Models can be unsafe or unusable even when they “follow policy” at the task level—either by refusing too much (overrefusal) or by complying with benign tasks that contain harmful content.
  • Representative papers:
  • Common approach:
    • Identify where safety behavior generalizes from (linguistic “triggers” vs. task framing).
    • Construct targeted datasets (trigger-matched benign data; harmful-content-in-benign-task dataset) and measure refusal/harm rates.
    • Use automated labeling/detectors (keyword RR, rule-based ASR; Moderation API + human validation).
  • Open questions / failure modes:
    • Reliance on external models for extraction/labeling (e.g., GPT-4o trigger extraction; Moderation API judgments).
    • How to get calibrated behavior (neither blanket refusal nor blind compliance) without brittle heuristics.
    • Whether these mechanisms hold under domain shift and more naturalistic user interactions.

Theme: Multi-turn interaction tax (sycophancy, switching, and agentic leakage)

  • Why it matters: Sequential interactions create compounding error modes—models abandon correct answers, lose abstentions, or execute untrusted instructions—raising safety risk in healthcare and high-privilege agent settings.
  • Representative papers:
  • Common approach:
    • Convert static tasks into multi-turn protocols (stick-or-switch binary turns; README-driven setup workflows).
    • Measure conviction/abstention retention vs. switching; measure exfiltration ASR/RR/TSR.
    • Stress-test across phrasing/structure/abstraction (linguistic disguise, link depth, semantic abstraction).
  • Open questions / failure modes:
    • How to prevent “blind switching” while preserving flexibility to adopt correct late evidence.
    • How to enforce trust boundaries for documentation and other “ambient instructions” without breaking usability.
    • Generalization beyond perturbed MCQ setups and beyond one privileged agent implementation.

Theme: Agent security engineering: lifecycle defenses, not just classifiers

Theme: RAG reliability and GraphRAG poisoning

  • Why it matters: Retrieval can hurt (small models ignore or are distracted by context), while graph-based retrieval introduces new poisoning strategies that bypass naive RAG defenses.
  • Representative papers:
  • Common approach:
    • Separate retrieval quality from utilization (oracle passage at rank 1; KNOWN/UNKNOWN split).
    • Attack GraphRAG by crafting documents that integrate into KG structure via temporally coherent “evolution” narratives.
    • Evaluate with ASR/CASR and defense baselines (paraphrasing, instruction ignoring, prompt detection).
  • Open questions / failure modes:
    • For small models: how to prevent retrieval from destroying KNOWN answers (selective retrieval, RAG-aware tuning).
    • For GraphRAG: how to validate temporal claims and detect anomalous evolution chains during KG construction.
    • Over-reliance on LLM-based evaluators for attack success and safety judgments.

Theme: Robustness benchmarks that stress time, corruption, and embodiment

Theme: Efficiency + long-context: position is a lever

  • Why it matters: Long-context deployment is constrained by KV cache memory and positional encoding overhead; small architectural choices can yield large savings without large quality loss.
  • Representative papers:
  • Common approach:
    • Use position-aligned pseudo queries at prefill to approximate decoding attention and guide eviction (DapQ).
    • Pretrain from scratch while varying RoPE rotated-dimension fraction to map stability/performance bands.
    • Validate across long-context benchmarks and measure memory/latency impacts.
  • Open questions / failure modes:
    • Sensitivity to positional alignment/window placement (DapQ) and to fraction thresholds (partial RoPE).
    • How these methods interact with length extrapolation and other long-context tricks.
    • Whether semantic content of pseudo queries can be optimized without losing the “lightweight” property.

Theme: Data-centric security detectors and forensics

3) Technical synthesis

  • Multiple papers converge on a “clean benchmark ≠ deployed reliability” pattern: INFACT (Base vs induced modes), LABSHIELD (MCQ vs semi-open PRP), and clinical stick-or-switch (single-shot vs multi-turn) all show large gaps.
  • Temporal structure is a recurring vulnerability: delayed backdoors activate after cumulative triggers; Video-LLMs show temporal inertia (near-zero TSS); household safety needs early-warning keyframes; multi-turn diagnosis suffers compounding switches.
  • External LLMs are increasingly used as infrastructure (trigger extraction, evaluators, expert planners/judges), but this creates shared-model bias and reproducibility issues (noted in VMAO, receipt forensics, and several safety evaluations).
  • Data geometry and targeted matching appear as a general lever: refusal-trigger-matched benign data (overrefusal), Mirror’s matched cells (prompt injection), and QAQ’s stratified reverse-MI selection (synthetic code) all argue that what you match matters more than raw scale.
  • Retrieval is not automatically helpful: small models show negative expected EM change with retrieval and high “irrelevant generation” even under oracle passages; meanwhile GraphRAG’s structure can be exploited by coherence/temporal chaining attacks.
  • Verification/checking loops are trending across domains: Tool-DC’s schema validator, VMAO’s verifier-driven replanning, CA-TTS’s self-check modules, and agent runtime hooks (PRISM) all implement “try → check → retry” patterns at different layers.
  • Position dominates in two separate efficiency stories: DapQ uses positional pseudo queries to align eviction with decoding; partial RoPE shows ≥10% rotation reaches a lower-loss band similar to full RoPE with large cache savings.
  • Causal diagnostics are emerging: TopoBench’s injected error modes identify which reasoning behaviors actually reduce accuracy (premature commitment, constraint forgetting), moving beyond observational CoT labeling.
  • Safety screening is bifurcating into (a) fast, deterministic L1 gates (Mirror SVM; PRISM heuristics) and (b) slower semantic/LLM-based adjudication for residuals—mirroring classic security architectures.

4) Top 5 papers (with “why now”)

1) Deactivating Refusal Triggers: Understanding and Mitigating Overrefusal in Safety Alignment

  • Mechanistic framing of overrefusal via “refusal triggers” and representational similarity evidence.
  • Practical mitigation: trigger-matched benign data improves safety–utility across SFT/P-SFT/RLVR (e.g., large Avg↓ reductions vs Alpaca baselines).
  • Useful for teams seeing usability regressions after safety tuning and needing a data construction recipe.
  • Skepticism: trigger extraction depends on GPT-4o and evaluation relies on automatic ASR/RR detectors.

2) Stop Listening to Me! How Multi-turn Conversations Can Degrade Diagnostic Reasoning

  • Introduces stick-or-switch metrics (positive/negative conviction, flexibility) that expose multi-turn failure modes.
  • Shows large abstention collapse and switching/sycophancy patterns across many models; GPT-5.2 is best but not perfect.
  • Directly relevant to clinical deployments where interaction is inherently incremental.
  • Skepticism: uses perturbed MCQA rather than real conversation logs; limited internal confidence analysis.

3) KEPo: Knowledge Evolution Poison on Graph-based Retrieval-Augmented Generation

  • Demonstrates GraphRAG-specific poisoning via temporally coherent evolution narratives that integrate into KGs.
  • Shows strong ASR/CASR across multiple GraphRAG variants and that common defenses have little effect.
  • “Why now”: GraphRAG adoption is rising for timeliness/multi-hop; this is a concrete new attack surface.
  • Skepticism: black-box threat assumes ability to inject crawlable docs; evaluation uses GPT-4o-based measures.

4) You Told Me to Do It: Measuring Instructional Text-induced Private Data Leakage in LLM Agents

  • Defines the Trusted Executor Dilemma and quantifies README-embedded instruction attacks (ASR up to 85%).
  • Robustness across disguise/structure/abstraction; human reviewers detect 0% under naturalistic review framing.
  • Evaluates defenses and shows poor trade-offs (rule-based high FP; minimal LLM prompts under-detect).
  • Skepticism: some per-condition sample sizes are small; primary end-to-end focus is one commercial agent.

5) INFACT: A Diagnostic Benchmark for Induced Faithfulness and Factuality Hallucinations in Video-LLMs

  • Adds factuality + induced corruptions + temporal interventions for Video-LLMs with RR/TSS metrics.
  • Finds evidence corruption (caption injection) is especially damaging; many open models show near-zero factuality TSS.
  • “Why now”: video agents and multimodal assistants are moving into settings with unreliable subtitles/captions.
  • Skepticism: induced operators are proxies; temporal interventions are limited to shuffle/reversal on a subset.

5) Practical next steps

  • Add “interaction-structure” evals to safety gates: run stick-or-switch style multi-turn tests (conviction/abstention retention) for any high-stakes domain assistant, not just single-shot accuracy.
  • Instrument overrefusal via trigger mining: extract candidate refusal triggers from your harmful SFT/RLHF data and measure representation/semantic-distance dependence; try trigger-matched benign augmentation rather than generic benign corpora.
  • Harden agent setup workflows: treat README/docs as untrusted; require provenance-aware trust tiers and action-level confirmations for filesystem/network reads, especially during install/setup.
  • Adopt lifecycle hooks + auditability for tool agents: implement multi-hook enforcement (ingress/pre/post tool/outbound/maintenance), session-scoped risk accumulation, and tamper-evident audit chaining; measure block-rate vs latency.
  • For GraphRAG deployments: add KG-ingestion checks for anomalous temporal chaining and multi-source corroboration before integrating new edges; explicitly test against knowledge-evolution poisoning.
  • For small-model RAG: don’t assume retrieval helps—measure KNOWN/UNKNOWN splits and oracle utilization; consider selective retrieval (only when uncertain) and RAG-aware fine-tuning to reduce “irrelevant generation.”
  • For long-context inference: evaluate position-aligned KV eviction (pseudo-query scoring) and consider partial RoPE (≥10%) to reduce memory while monitoring stability bands and placement sensitivity.
  • For multimodal safety: incorporate induced corruptions (caption injection, temporal shuffles) and embodied perception stressors (transparent objects, early-warning timing) into acceptance tests; track stability metrics, not just base accuracy.

Generated from per-paper analyses; no external browsing.