Daily AI Paper Report (2026-04-10)

Published:

Chinese version: [中文]

Run stats

  • Candidates: 256
  • Selected: 30
  • Deepread completed: 30
  • Window (UTC): 2026-04-08T00:00:00Z → 2026-04-09T00:00:00Z (arxiv_announce, expanded=0)
Show selected papers
arXiv IDTitle / LinksCategoriesScoreWhyTags
2604.07223TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories
PDF
cs.CR, cs.AI, cs.CL, cs.LG, cs.SE95First benchmark for mid-trajectory tool-use guardrails; broad risks + multi-model eval.agent-safety, tool-use, guardrails, benchmark, prompt-injection, privacy, monitoring
2604.06820Beyond Surface Judgments: Human-Grounded Risk Evaluation of LLM-Generated Disinformation
PDF
cs.AI93Audits LLM disinfo judges vs humans; finds persistent gaps—key for safety eval validity.disinformation, evaluation, llm-judges, human-grounded, risk-assessment
2604.06550SkillSieve: A Hierarchical Triage Framework for Detecting Malicious AI Agent Skills
PDF
cs.CR, cs.AI92Practical triage to detect malicious agent skills across code+NL instructions at scale.agent-security, marketplaces, prompt-injection, static-analysis, LLM-auditing, supply-chain
2604.07036ReDAct: Uncertainty-Aware Deferral for LLM Agents
PDF
cs.CL, cs.LG, cs.MA92Uncertainty-based deferral for LLM agents: cheaper model escalates to reliable one to reduce hallucination cascadesagents, uncertainty, deferral, hallucinations, cost-quality tradeoff, calibration
2604.06756How Long Reasoning Chains Influence LLMs' Judgment of Answer Factuality
PDF
cs.CL91Shows reasoning-chain exposure can bias LLM judges; important for reliable eval pipelines.llm-judges, chain-of-thought, factuality, evaluation, bias
2604.06811SkillTrojan: Backdoor Attacks on Skill-Based Agent Systems
PDF
cs.CR, cs.AI90Backdoors via skill composition (not weights); realistic threat for modular agent stacks.agent-security, backdoors, tooling, skill-composition, supply-chain, red-teaming
2604.06742Evaluating LLM-Based 0-to-1 Software Generation in End-to-End CLI Tool Scenarios
PDF
cs.SE, cs.AI90CLI-Tool-Bench: end-to-end sandboxed eval for 0-to-1 agent software generation (100 repos).agents, software-generation, benchmark, evaluation, sandboxing, differential-testing
2604.06996Self-Preference Bias in Rubric-Based Evaluation of Large Language Models
PDF
cs.CL, cs.AI90Shows self-preference bias persists even with objective rubrics; important for LLM-as-judge and RSI evaluationevaluation, LLM-as-judge, bias, rubrics, benchmarking, recursive-improvement
2604.06746StructKV: Preserving the Structural Skeleton for Scalable Long-Context Inference
PDF
cs.CL89StructKV compresses KV cache via structure-aware global hubs; targets million-token long-context.llm, long-context, inference, kv-cache, efficiency, systems
2604.06840MirageBackdoor: A Stealthy Attack that Induces Think-Well-Answer-Wrong Reasoning
PDF
cs.CR88CoT backdoor that keeps reasoning clean but flips final answer; challenges process monitors.backdoors, chain-of-thought, evasion, model-security, monitoring, robustness
2604.06714Steering the Verifiability of Multimodal AI Hallucinations
PDF
cs.AI, cs.CL, cs.CV, cs.LG88Dataset + method to steer verifiability of MLLM hallucinations (obvious vs elusive).multimodal, hallucinations, verifiability, dataset, safety
2604.06753Select-then-Solve: Paradigm Routing as Inference-Time Optimization for LLM Agents
PDF
cs.CL88Paradigm routing for agents: per-task selection among CoT/ReAct/etc yields big gains; strong inference-time methodagents, routing, inference-time, reasoning, ReAct, benchmarking
2604.06831Towards Privacy-Preserving Large Language Model: Text-free Inference Through Alignment and Adaptation
PDF
cs.CR, cs.AI87Text-free LLM inference via client encoder + server projector; aims to preserve utility with privacy.privacy, llm, secure-inference, representation-learning, deployment
2604.06833FedDetox: Robust Federated SLM Alignment via On-Device Data Sanitization
PDF
cs.CR, cs.LG86Federated alignment can poison safety; proposes on-device sanitization for robust SLMs.federated-learning, alignment, data-poisoning, toxicity, on-device, SLMs
2604.06618PoC-Adapt: Semantic-Aware Automated Vulnerability Reproduction with LLM Multi-Agents and Reinforcement Learning-Driven Adaptive Policy
PDF
cs.CR86LLM multi-agent PoC exploit generation w/ semantic oracle + RL policy—high security impact.cybersecurity, agents, vulnerability, exploit-generation, verification, rl
2604.06552To Lie or Not to Lie? Investigating The Biased Spread of Global Lies by LLMs
PDF
cs.CL86GlobalLies dataset shows country/language-dependent misinformation compliance; large-scale eval.misinformation, safety, multilingual, dataset, evaluation, llm-behavior
2604.06710ATANT: An Evaluation Framework for AI Continuity
PDF
cs.AI, cs.IR86Continuity/memory evaluation without LLM judges; corpus + checks for long-term context persistence and updatesevaluation, memory, long-term context, RAG, agent reliability, datasets
2604.07343Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization
PDF
cs.CL, cs.LG84Benchmark for personalized reward models; targets pluralistic alignment evaluation gap.alignment, reward-models, personalization, evaluation, preference-learning, benchmarks
2604.06846MedDialBench: Benchmarking LLM Diagnostic Robustness under Parametric Adversarial Patient Behaviors
PDF
cs.CL, cs.AI84MedDialBench: graded adversarial patient behaviors to measure diagnostic robustness.benchmark, robustness, medical-dialogue, adversarial, evaluation
2604.07165Reason in Chains, Learn in Trees: Self-Rectification and Grafting for Multi-turn Agent Policy Optimization
PDF
cs.AI, cs.LG84T-STAR builds a cognitive tree to fix sparse-reward credit assignment in multi-turn agent RL.agents, rl, policy-optimization, credit-assignment, reasoning, post-training
2604.06633Argus: Reorchestrating Static Analysis via a Multi-Agent Ensemble for Full-Chain Security Vulnerability Detection
PDF
cs.CR, cs.CL, cs.SE82Multi-agent LLM workflow for SAST aiming to cut hallucinations/FPs and improve full-chain vulns.LLM-security, static-analysis, multi-agent, vulnerability-detection, software-security, tooling
2604.06793Evaluating Repository-level Software Documentation via Question Answering and Feature-Driven Development
PDF
cs.SE, cs.AI82SWD-Bench evaluates repo-level docs via QA + feature implementation, avoiding LLM-judge pitfalls.benchmark, software-engineering, documentation, evaluation, qa
2604.06834On the Step Length Confounding in LLM Reasoning Data Selection
PDF
cs.CL, cs.AI82Finds step-length confounding in reasoning-data selection; warns naturalness filters prefer verbosity.reasoning, data-selection, chain-of-thought, training, evaluation, methodology
2604.06812AGSC: Adaptive Granularity and Semantic Clustering for Uncertainty Quantification in Long-text Generation
PDF
cs.CL82Long-text uncertainty quantification using adaptive granularity + semantic clustering; targets hallucination riskuncertainty, hallucinations, long-form generation, NLI, reliability
2604.07253Designing Safe and Accountable GenAI as a Learning Companion with Women Banned from Formal Education
PDF
cs.CY, cs.AI80Participatory design on safe/accountable GenAI for women under surveillance—practical safety needs.human-factors, privacy, safety, accountability, surveillance, hci
2604.06736SQLStructEval: Structural Evaluation of LLM Text-to-SQL Generation
PDF
cs.CL, cs.DB80Structural reliability for Text-to-SQL; AST-based eval shows paraphrase sensitivity; structured generation helpsevaluation, text-to-SQL, program synthesis, robustness, structure, agents
2604.07321Syntax Is Easy, Semantics Is Hard: Evaluating LLMs for LTL Translation
PDF
cs.LO, cs.AI78Evaluates NL→LTL translation; highlights semantic failure modes for formal-policy tooling.formal-methods, evaluation, LLMs, temporal-logic, specification, reliability
2604.06912Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models
PDF
cs.CV, cs.AI78Query-aware adaptive high-res perception to cut MLLM compute; likely impactful efficiency idea.multimodal, efficiency, adaptive-computation, vision, long-context
2604.06696AgentGate: A Lightweight Structured Routing Engine for the Internet of Agents
PDF
cs.AI78AgentGate: constrained routing/dispatch for multi-agent systems under latency, privacy, cost.agents, routing, orchestration, systems, tool-use, deployment
2604.06799Beyond Accuracy: Diagnosing Algebraic Reasoning Failures in LLMs Across Nine Complexity Dimensions
PDF
cs.CL, cs.CY78Nine-dimension algebraic complexity framework to diagnose LLM reasoning failures; auto gen+verify for trackingevaluation, reasoning, algebra, diagnostics, benchmarks

AI Paper Insight Brief

2026-04-10

0) Executive takeaways (read this first)

  • Agent ecosystems are becoming a supply-chain security problem: two complementary papers show both defenses for malicious “skills” at marketplace scale (SkillSieve) and attacks that persist via reusable skill packages (SkillTrojan), implying you need registry scanning + provenance + runtime controls, not just prompt-level guardrails.
  • “LLM-as-judge” is increasingly a weak link in safety evaluation: human-grounded disinformation auditing finds judges agree with each other but poorly track humans (rank/cue mismatch), and rubric-based judging shows strong self-preference bias—even on objective rubrics—so evaluation pipelines need human anchoring and anti-bias design.
  • Structured intermediates + structural competence are emerging as the reliability lever: compile-style JSON→SQL improves both execution accuracy and structural stability; NL→LTL improves with Python/AST interfaces; TraceSafe finds guardrail performance correlates with structured-input competence, not jailbreak robustness.
  • Cost-aware agent routing is maturing from “which model?” to “which policy/paradigm/agent?”: AgentGate routes among agents with confidence-triggered fallback; Select-then-Solve routes among reasoning paradigms; ReDAct defers per action in sequential environments using uncertainty—together suggesting a unified “router stack” for production agents.
  • Long-horizon robustness is being attacked and defended at the process level: MirageBackdoor shows “think-well-answer-wrong” backdoors that evade CoT monitoring; judge studies show reasoning exposure can mislead; this pushes toward consistency checks between reasoning, actions, and outcomes rather than trusting fluent traces.
  • Efficiency work is targeting the real bottlenecks: StructKV improves long-context inference under tight KV budgets while speeding prefill; Q-Zoom reduces visual token cost via query-aware RoI refinement in a single prefill pass—both are deployment-relevant.

2) Key themes (clusters)

Theme: Skill / tool supply-chain security for agents

  • Why it matters: Skills/tools run with agent privileges (files/env/network). Marketplace-scale distribution makes malicious packages a high-leverage attack vector; defenses must handle both code and natural-language instructions and be robust to evasion.
  • Representative papers:
  • Common approach:
    • Treat “skills” as composite artifacts (instructions + scripts) with privileged execution.
    • Emphasize stealth/evasion: split payloads, conditional triggers, obfuscation, dormant behavior.
    • Use structured analysis pipelines (triage layers; trigger/payload construction frameworks).
  • Open questions / failure modes:
    • Runtime-fetched payloads and time-delayed attacks remain hard for static scanning (SkillSieve limitation).
    • How to detect backdoors that preserve clean visible behavior while executing side effects (SkillTrojan’s core stealth property).
    • Generalization of detectors trained on limited malicious-author diversity (SkillSieve).

Theme: LLM-as-judge validity and bias (especially for safety)

Theme: Structured representations as a reliability primitive (generation + guarding)

Theme: Routing, deferral, and meta-control for agent systems (edge-first)

Theme: Robustness & efficiency for long context and high-res multimodal

  • Why it matters: Frontier contexts (128K+) and high-res vision flood compute/memory; without principled pruning/routing, deployment costs explode and accuracy collapses under compression.
  • Representative papers:
  • Common approach:
    • Identify globally important tokens/regions (cross-layer centrality; self-distilled RoI heatmaps).
    • Decouple compute budget from storage budget (StructKV) or coarse from fine perception (Q-Zoom).
    • Operate with minimal extra passes (StructKV pivoting; Q-Zoom single prefill pass).
  • Open questions / failure modes:
    • Validation beyond 128K and on other architectures (MoE/SSMs) is untested for StructKV.
    • Gate/RPN hyperparameter sensitivity and base-resolution constraints can reduce throughput gains (Q-Zoom).

3) Technical synthesis

  • Multi-stage “cheap→expensive” pipelines are converging across security and systems: SkillSieve’s static triage→LLM decomposition→multi-LLM jury mirrors agent routing stacks (AgentGate confidence fallback; ReDAct deferral).
  • Structured outputs are doing double duty: they improve generation stability (SQL compile-style JSON→SQL) and guardrail effectiveness (TraceSafe shows structural competence is the main bottleneck).
  • Evaluation is shifting from single scalar metrics to diagnostic decompositions:
    • Structural variance metrics for Text-to-SQL (distinct ASTs, majority ratio, perturbation sensitivity).
    • Multi-axis parametric stress tests (MedDialBench’s behavior dimensions; algebra’s nine complexity dimensions).
    • Mid-trajectory risk taxonomies for tool traces (12 TraceSafe risk types).
  • Human grounding is becoming mandatory for reader-facing harms: disinformation risk evaluation shows LLM judges are systematically harsher and mis-rank items vs humans; prompt tweaks don’t fix it.
  • Reasoning traces are not trustworthy signals by default:
    • MirageBackdoor preserves plausible CoT while forcing wrong answers.
    • Judges can be misled by fluent reasoning; even strong judges can be swayed by high-quality wrong chains.
  • RAG helps but is uneven: misinformation mitigations via RAG reduce generation (up to ~53%) but vary by language/region due to information availability; Argus uses RAG to expand sink discovery in SAST.
  • Offline learning is entering agentic security automation: PoC-Adapt uses offline DDQN to reduce exploit-generation trial-and-error; T-STAR uses tree consolidation + targeted preference loss to improve multi-turn RL credit assignment.
  • Edge/low-cost deployment is a recurring constraint: SkillSieve runs on a $440 ARM board; AgentGate targets 3B–7B routers; FedDetox distills a safety “Guardian” into MobileBERT for on-device sanitization.
  • Uncertainty is being operationalized at different layers: AGSC for long-text factuality UQ via NLI + clustering; ReDAct for action-level deferral via token-probability UQ; both face “echo chamber”/systematic-error risks.

4) Top 5 papers (with “why now”)

1) SkillSieve: A Hierarchical Triage Framework for Detecting Malicious AI Agent Skills

  • Practical, cost-conscious pipeline: on-device static triage resolves ~86% of skills at ~39ms each; only ~14% escalate to LLM calls.
  • Strong benchmarked gains: end-to-end F1=0.800 vs regex baseline F1=0.421 on a labeled 400-skill set.
  • Real deployment signal: processed 49,592 skills in 31 minutes on an ARM single-board computer.
  • Skepticism: can’t catch runtime-fetched payloads/time-delayed attacks; LLM nondeterminism and limited training diversity may hurt generalization.

2) TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories

  • Fills a missing benchmark niche: static, step-labeled tool-call traces across 12 risk types (>1,000 traces) for mid-trajectory guard evaluation.
  • Key diagnostic: performance correlates strongly with structured-input competence (ρ≈0.80 with RAGTruth Data2txt) and near-zero with jailbreak robustness (ρ≈0.05).
  • Shows where guards fail: subtle interface inconsistencies remain low-accuracy even when prompt injection/key leaks are detected well.
  • Skepticism: static traces don’t capture interactive co-evolution between agent and guard; taxonomy will need continual updates.

3) Beyond Surface Judgments: Human-Grounded Risk Evaluation of LLM-Generated Disinformation

  • Directly audits proxy validity: 290 deceptive articles paired with 2,043 human ratings; eight frontier judges scored the same items.
  • Finds a dangerous pattern: judges agree with each other (≈0.81/0.69) but align weakly with humans (≈0.45 credibility, ≈0.24 sharing).
  • Identifies cue mismatch: judges overweight logical rigor and penalize emotional intensity relative to humans.
  • Skepticism: participant sample not population-representative; benchmark could be expanded for scenario coverage.

4) StructKV: Preserving the Structural Skeleton for Scalable Long-Context Inference

  • Concrete long-context win under tight budgets: at 10% KV retention, nearly matches full-context LongBench (48.61 vs 49.33 on LLaMA-3.1-8B) and improves RULER retrieval to 128K (80.1 vs 75.6).
  • Adds prefill acceleration (~1.87× at 32K) by projecting later layers onto a “structural skeleton”.
  • Cross-layer importance + pivot detection is a principled alternative to single-layer saliency pruning.
  • Skepticism: validated only up to 128K; bandwidth-heavy aggregation/pivoting may be hardware-sensitive.

5) SQLStructEval: Structural Evaluation of LLM Text-to-SQL Generation

  • Makes instability measurable: even when execution-correct, models produce multiple distinct ASTs (e.g., GPT-5-mini correct queries still average 1.378 distinct ASTs).
  • Shows perturbation brittleness: paraphrases can drive large structural shifts (AST sim 0.328; sensitive fraction 0.900 for GPT-5-mini).
  • Demonstrates a mitigation: compile-style JSON→SQL improves exec accuracy (0.785 vs 0.742) and structural similarity (0.632 vs 0.552).
  • Skepticism: limited to Spider; canonical ASTs don’t fully capture semantic equivalence.

5) Practical next steps

  • Harden skill/tool supply chains: adopt a SkillSieve-like layered scanner for your tool registry (static features + structured semantic checks + multi-model adjudication), and explicitly test against split/obfuscated payloads and conditional triggers.
  • Add runtime monitoring for what static scans miss: prioritize detection for runtime fetches and time-delayed behaviors (explicitly called out as hard in SkillSieve).
  • Benchmark your guardrails on traces, not just prompts: use TraceSafe-style step-labeled tool-call traces; measure per-risk-type performance, especially interface inconsistency detection.
  • Stop treating judge agreement as validity for reader-facing harms: for disinformation, incorporate periodic human rating calibration and track rank correlation to humans (not just average scores).
  • Mitigate judge bias in evaluation loops: test rubric-based SPB (self/family overestimation) and consider committee voting; treat “objective rubrics” as necessary but not sufficient.
  • Prefer structured intermediates in agent tool generation: for program-like outputs (SQL, logic specs, tool args), move to JSON/AST interfaces + deterministic compilation and validate with external oracles (execution, NuSMV).
  • Deploy a router stack: combine (a) edge-first structured routing (AgentGate), (b) paradigm routing (Select-then-Solve), and (c) per-step deferral via uncertainty (ReDAct) to control cost while preserving reliability.
  • Stress-test against “clean reasoning, wrong answer” threats: add consistency checks between reasoning/actions and outcomes; don’t rely on CoT plausibility as a safety signal (MirageBackdoor + judge susceptibility results).

Generated from per-paper analyses; no external browsing.