Daily AI Paper Report (2026-04-10)

Published: April 10, 2026

Chinese version: [中文]

Run stats

Candidates: 256
Selected: 30
Deepread completed: 30
Window (UTC): 2026-04-08T00:00:00Z → 2026-04-09T00:00:00Z (arxiv_announce, expanded=0)

Show selected papers

arXiv ID	Title / Links	Categories	Score	Why	Tags
`2604.07223`	TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories PDF	cs.CR, cs.AI, cs.CL, cs.LG, cs.SE	95	First benchmark for mid-trajectory tool-use guardrails; broad risks + multi-model eval.	agent-safety, tool-use, guardrails, benchmark, prompt-injection, privacy, monitoring
`2604.06820`	Beyond Surface Judgments: Human-Grounded Risk Evaluation of LLM-Generated Disinformation PDF	cs.AI	93	Audits LLM disinfo judges vs humans; finds persistent gaps—key for safety eval validity.	disinformation, evaluation, llm-judges, human-grounded, risk-assessment
`2604.06550`	SkillSieve: A Hierarchical Triage Framework for Detecting Malicious AI Agent Skills PDF	cs.CR, cs.AI	92	Practical triage to detect malicious agent skills across code+NL instructions at scale.	agent-security, marketplaces, prompt-injection, static-analysis, LLM-auditing, supply-chain
`2604.07036`	ReDAct: Uncertainty-Aware Deferral for LLM Agents PDF	cs.CL, cs.LG, cs.MA	92	Uncertainty-based deferral for LLM agents: cheaper model escalates to reliable one to reduce hallucination cascades	agents, uncertainty, deferral, hallucinations, cost-quality tradeoff, calibration
`2604.06756`	How Long Reasoning Chains Influence LLMs' Judgment of Answer Factuality PDF	cs.CL	91	Shows reasoning-chain exposure can bias LLM judges; important for reliable eval pipelines.	llm-judges, chain-of-thought, factuality, evaluation, bias
`2604.06811`	SkillTrojan: Backdoor Attacks on Skill-Based Agent Systems PDF	cs.CR, cs.AI	90	Backdoors via skill composition (not weights); realistic threat for modular agent stacks.	agent-security, backdoors, tooling, skill-composition, supply-chain, red-teaming
`2604.06742`	Evaluating LLM-Based 0-to-1 Software Generation in End-to-End CLI Tool Scenarios PDF	cs.SE, cs.AI	90	CLI-Tool-Bench: end-to-end sandboxed eval for 0-to-1 agent software generation (100 repos).	agents, software-generation, benchmark, evaluation, sandboxing, differential-testing
`2604.06996`	Self-Preference Bias in Rubric-Based Evaluation of Large Language Models PDF	cs.CL, cs.AI	90	Shows self-preference bias persists even with objective rubrics; important for LLM-as-judge and RSI evaluation	evaluation, LLM-as-judge, bias, rubrics, benchmarking, recursive-improvement
`2604.06746`	StructKV: Preserving the Structural Skeleton for Scalable Long-Context Inference PDF	cs.CL	89	StructKV compresses KV cache via structure-aware global hubs; targets million-token long-context.	llm, long-context, inference, kv-cache, efficiency, systems
`2604.06840`	MirageBackdoor: A Stealthy Attack that Induces Think-Well-Answer-Wrong Reasoning PDF	cs.CR	88	CoT backdoor that keeps reasoning clean but flips final answer; challenges process monitors.	backdoors, chain-of-thought, evasion, model-security, monitoring, robustness
`2604.06714`	Steering the Verifiability of Multimodal AI Hallucinations PDF	cs.AI, cs.CL, cs.CV, cs.LG	88	Dataset + method to steer verifiability of MLLM hallucinations (obvious vs elusive).	multimodal, hallucinations, verifiability, dataset, safety
`2604.06753`	Select-then-Solve: Paradigm Routing as Inference-Time Optimization for LLM Agents PDF	cs.CL	88	Paradigm routing for agents: per-task selection among CoT/ReAct/etc yields big gains; strong inference-time method	agents, routing, inference-time, reasoning, ReAct, benchmarking
`2604.06831`	Towards Privacy-Preserving Large Language Model: Text-free Inference Through Alignment and Adaptation PDF	cs.CR, cs.AI	87	Text-free LLM inference via client encoder + server projector; aims to preserve utility with privacy.	privacy, llm, secure-inference, representation-learning, deployment
`2604.06833`	FedDetox: Robust Federated SLM Alignment via On-Device Data Sanitization PDF	cs.CR, cs.LG	86	Federated alignment can poison safety; proposes on-device sanitization for robust SLMs.	federated-learning, alignment, data-poisoning, toxicity, on-device, SLMs
`2604.06618`	PoC-Adapt: Semantic-Aware Automated Vulnerability Reproduction with LLM Multi-Agents and Reinforcement Learning-Driven Adaptive Policy PDF	cs.CR	86	LLM multi-agent PoC exploit generation w/ semantic oracle + RL policy—high security impact.	cybersecurity, agents, vulnerability, exploit-generation, verification, rl
`2604.06552`	To Lie or Not to Lie? Investigating The Biased Spread of Global Lies by LLMs PDF	cs.CL	86	GlobalLies dataset shows country/language-dependent misinformation compliance; large-scale eval.	misinformation, safety, multilingual, dataset, evaluation, llm-behavior
`2604.06710`	ATANT: An Evaluation Framework for AI Continuity PDF	cs.AI, cs.IR	86	Continuity/memory evaluation without LLM judges; corpus + checks for long-term context persistence and updates	evaluation, memory, long-term context, RAG, agent reliability, datasets
`2604.07343`	Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization PDF	cs.CL, cs.LG	84	Benchmark for personalized reward models; targets pluralistic alignment evaluation gap.	alignment, reward-models, personalization, evaluation, preference-learning, benchmarks
`2604.06846`	MedDialBench: Benchmarking LLM Diagnostic Robustness under Parametric Adversarial Patient Behaviors PDF	cs.CL, cs.AI	84	MedDialBench: graded adversarial patient behaviors to measure diagnostic robustness.	benchmark, robustness, medical-dialogue, adversarial, evaluation
`2604.07165`	Reason in Chains, Learn in Trees: Self-Rectification and Grafting for Multi-turn Agent Policy Optimization PDF	cs.AI, cs.LG	84	T-STAR builds a cognitive tree to fix sparse-reward credit assignment in multi-turn agent RL.	agents, rl, policy-optimization, credit-assignment, reasoning, post-training
`2604.06633`	Argus: Reorchestrating Static Analysis via a Multi-Agent Ensemble for Full-Chain Security Vulnerability Detection PDF	cs.CR, cs.CL, cs.SE	82	Multi-agent LLM workflow for SAST aiming to cut hallucinations/FPs and improve full-chain vulns.	LLM-security, static-analysis, multi-agent, vulnerability-detection, software-security, tooling
`2604.06793`	Evaluating Repository-level Software Documentation via Question Answering and Feature-Driven Development PDF	cs.SE, cs.AI	82	SWD-Bench evaluates repo-level docs via QA + feature implementation, avoiding LLM-judge pitfalls.	benchmark, software-engineering, documentation, evaluation, qa
`2604.06834`	On the Step Length Confounding in LLM Reasoning Data Selection PDF	cs.CL, cs.AI	82	Finds step-length confounding in reasoning-data selection; warns naturalness filters prefer verbosity.	reasoning, data-selection, chain-of-thought, training, evaluation, methodology
`2604.06812`	AGSC: Adaptive Granularity and Semantic Clustering for Uncertainty Quantification in Long-text Generation PDF	cs.CL	82	Long-text uncertainty quantification using adaptive granularity + semantic clustering; targets hallucination risk	uncertainty, hallucinations, long-form generation, NLI, reliability
`2604.07253`	Designing Safe and Accountable GenAI as a Learning Companion with Women Banned from Formal Education PDF	cs.CY, cs.AI	80	Participatory design on safe/accountable GenAI for women under surveillance—practical safety needs.	human-factors, privacy, safety, accountability, surveillance, hci
`2604.06736`	SQLStructEval: Structural Evaluation of LLM Text-to-SQL Generation PDF	cs.CL, cs.DB	80	Structural reliability for Text-to-SQL; AST-based eval shows paraphrase sensitivity; structured generation helps	evaluation, text-to-SQL, program synthesis, robustness, structure, agents
`2604.07321`	Syntax Is Easy, Semantics Is Hard: Evaluating LLMs for LTL Translation PDF	cs.LO, cs.AI	78	Evaluates NL→LTL translation; highlights semantic failure modes for formal-policy tooling.	formal-methods, evaluation, LLMs, temporal-logic, specification, reliability
`2604.06912`	Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models PDF	cs.CV, cs.AI	78	Query-aware adaptive high-res perception to cut MLLM compute; likely impactful efficiency idea.	multimodal, efficiency, adaptive-computation, vision, long-context
`2604.06696`	AgentGate: A Lightweight Structured Routing Engine for the Internet of Agents PDF	cs.AI	78	AgentGate: constrained routing/dispatch for multi-agent systems under latency, privacy, cost.	agents, routing, orchestration, systems, tool-use, deployment
`2604.06799`	Beyond Accuracy: Diagnosing Algebraic Reasoning Failures in LLMs Across Nine Complexity Dimensions PDF	cs.CL, cs.CY	78	Nine-dimension algebraic complexity framework to diagnose LLM reasoning failures; auto gen+verify for tracking	evaluation, reasoning, algebra, diagnostics, benchmarks

AI Paper Insight Brief

2026-04-10

0) Executive takeaways (read this first)

Agent ecosystems are becoming a supply-chain security problem: two complementary papers show both defenses for malicious “skills” at marketplace scale (SkillSieve) and attacks that persist via reusable skill packages (SkillTrojan), implying you need registry scanning + provenance + runtime controls, not just prompt-level guardrails.
“LLM-as-judge” is increasingly a weak link in safety evaluation: human-grounded disinformation auditing finds judges agree with each other but poorly track humans (rank/cue mismatch), and rubric-based judging shows strong self-preference bias—even on objective rubrics—so evaluation pipelines need human anchoring and anti-bias design.
Structured intermediates + structural competence are emerging as the reliability lever: compile-style JSON→SQL improves both execution accuracy and structural stability; NL→LTL improves with Python/AST interfaces; TraceSafe finds guardrail performance correlates with structured-input competence, not jailbreak robustness.
Cost-aware agent routing is maturing from “which model?” to “which policy/paradigm/agent?”: AgentGate routes among agents with confidence-triggered fallback; Select-then-Solve routes among reasoning paradigms; ReDAct defers per action in sequential environments using uncertainty—together suggesting a unified “router stack” for production agents.
Long-horizon robustness is being attacked and defended at the process level: MirageBackdoor shows “think-well-answer-wrong” backdoors that evade CoT monitoring; judge studies show reasoning exposure can mislead; this pushes toward consistency checks between reasoning, actions, and outcomes rather than trusting fluent traces.
Efficiency work is targeting the real bottlenecks: StructKV improves long-context inference under tight KV budgets while speeding prefill; Q-Zoom reduces visual token cost via query-aware RoI refinement in a single prefill pass—both are deployment-relevant.

2) Key themes (clusters)

Theme: Skill / tool supply-chain security for agents

Why it matters: Skills/tools run with agent privileges (files/env/network). Marketplace-scale distribution makes malicious packages a high-leverage attack vector; defenses must handle both code and natural-language instructions and be robust to evasion.
Representative papers:
- SkillSieve: A Hierarchical Triage Framework for Detecting Malicious AI Agent Skills
- SkillTrojan: Backdoor Attacks on Skill-Based Agent Systems
Common approach:
- Treat “skills” as composite artifacts (instructions + scripts) with privileged execution.
- Emphasize stealth/evasion: split payloads, conditional triggers, obfuscation, dormant behavior.
- Use structured analysis pipelines (triage layers; trigger/payload construction frameworks).
Open questions / failure modes:
- Runtime-fetched payloads and time-delayed attacks remain hard for static scanning (SkillSieve limitation).
- How to detect backdoors that preserve clean visible behavior while executing side effects (SkillTrojan’s core stealth property).
- Generalization of detectors trained on limited malicious-author diversity (SkillSieve).

Theme: LLM-as-judge validity and bias (especially for safety)

Why it matters: Many safety and alignment workflows optimize against automated judges; if judges don’t track humans or are biased toward themselves/family, you can Goodhart the evaluation and ship systems that look safe but aren’t.
Representative papers:
Common approach:
- Pair model judgments with stronger references (human ratings; programmatic ground truth; controlled interventions).
- Measure not just accuracy but calibration, ranking fidelity, and cue dependence.
- Test prompt sensitivity and ensembling/jury mitigations.
Open questions / failure modes:
- High judge–judge agreement can coexist with low human alignment (disinformation audit).
- Rubric-based evaluation still exhibits self/family overestimation even with objective rubrics (IFEval).
- Exposing reasoning can increase overconfidence or be exploited by fluent-but-wrong rationales.

Theme: Structured representations as a reliability primitive (generation + guarding)

Why it matters: Execution correctness can hide instability; free-form text is hard to validate. Structured intermediates enable deterministic compilation, canonicalization, and better guardrail parsing.
Representative papers:
Common approach:
- Parse/canonicalize outputs into AST-like forms and evaluate structure distributions, not just end results.
- Use external oracles (SQL execution; NuSMV entailment) to separate syntax from semantics.
- Provide schemas/traces to guards; analyze which risk types are structurally obvious vs subtle.
Open questions / failure modes:
- Canonical ASTs don’t fully capture semantic equivalence (SQLStructEval limitation).
- NL→LTL semantic equivalence remains low on average despite syntactic plausibility; temporal operator scoping errors dominate.
- Subtle interface inconsistencies in tool traces remain hard for guards (TraceSafe).

Theme: Routing, deferral, and meta-control for agent systems (edge-first)

Why it matters: Real deployments need to trade off cost/latency/privacy vs correctness/safety. Routing is shifting from ad-hoc prompting to explicit, confidence-aware decision systems.
Representative papers:
Common approach:
- Constrain the action space (route vs plan vs direct vs escalate; choose among paradigms; defer per step).
- Use confidence/uncertainty signals to trigger fallback/escalation.
- Train lightweight routers on curated datasets or calibrated thresholds.
Open questions / failure modes:
- Confidence calibration and fallback policies can dominate behavior (AgentGate bottleneck).
- Deferral requires token-probabilities (ReDAct limitation) and calibration can shift once deferral changes trajectories.
- Paradigm implementations and router training sets are limited; generality across tool stacks remains open.

Theme: Robustness & efficiency for long context and high-res multimodal

Why it matters: Frontier contexts (128K+) and high-res vision flood compute/memory; without principled pruning/routing, deployment costs explode and accuracy collapses under compression.
Representative papers:
- StructKV: Preserving the Structural Skeleton for Scalable Long-Context Inference
- Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models
Common approach:
- Identify globally important tokens/regions (cross-layer centrality; self-distilled RoI heatmaps).
- Decouple compute budget from storage budget (StructKV) or coarse from fine perception (Q-Zoom).
- Operate with minimal extra passes (StructKV pivoting; Q-Zoom single prefill pass).
Open questions / failure modes:
- Validation beyond 128K and on other architectures (MoE/SSMs) is untested for StructKV.
- Gate/RPN hyperparameter sensitivity and base-resolution constraints can reduce throughput gains (Q-Zoom).

3) Technical synthesis

Multi-stage “cheap→expensive” pipelines are converging across security and systems: SkillSieve’s static triage→LLM decomposition→multi-LLM jury mirrors agent routing stacks (AgentGate confidence fallback; ReDAct deferral).
Structured outputs are doing double duty: they improve generation stability (SQL compile-style JSON→SQL) and guardrail effectiveness (TraceSafe shows structural competence is the main bottleneck).
Evaluation is shifting from single scalar metrics to diagnostic decompositions:
- Structural variance metrics for Text-to-SQL (distinct ASTs, majority ratio, perturbation sensitivity).
- Multi-axis parametric stress tests (MedDialBench’s behavior dimensions; algebra’s nine complexity dimensions).
- Mid-trajectory risk taxonomies for tool traces (12 TraceSafe risk types).
Human grounding is becoming mandatory for reader-facing harms: disinformation risk evaluation shows LLM judges are systematically harsher and mis-rank items vs humans; prompt tweaks don’t fix it.
Reasoning traces are not trustworthy signals by default:
- MirageBackdoor preserves plausible CoT while forcing wrong answers.
- Judges can be misled by fluent reasoning; even strong judges can be swayed by high-quality wrong chains.
RAG helps but is uneven: misinformation mitigations via RAG reduce generation (up to ~53%) but vary by language/region due to information availability; Argus uses RAG to expand sink discovery in SAST.
Offline learning is entering agentic security automation: PoC-Adapt uses offline DDQN to reduce exploit-generation trial-and-error; T-STAR uses tree consolidation + targeted preference loss to improve multi-turn RL credit assignment.
Edge/low-cost deployment is a recurring constraint: SkillSieve runs on a $440 ARM board; AgentGate targets 3B–7B routers; FedDetox distills a safety “Guardian” into MobileBERT for on-device sanitization.
Uncertainty is being operationalized at different layers: AGSC for long-text factuality UQ via NLI + clustering; ReDAct for action-level deferral via token-probability UQ; both face “echo chamber”/systematic-error risks.

4) Top 5 papers (with “why now”)

1) SkillSieve: A Hierarchical Triage Framework for Detecting Malicious AI Agent Skills

Practical, cost-conscious pipeline: on-device static triage resolves ~86% of skills at ~39ms each; only ~14% escalate to LLM calls.
Strong benchmarked gains: end-to-end F1=0.800 vs regex baseline F1=0.421 on a labeled 400-skill set.
Real deployment signal: processed 49,592 skills in 31 minutes on an ARM single-board computer.
Skepticism: can’t catch runtime-fetched payloads/time-delayed attacks; LLM nondeterminism and limited training diversity may hurt generalization.

2) TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories

Fills a missing benchmark niche: static, step-labeled tool-call traces across 12 risk types (>1,000 traces) for mid-trajectory guard evaluation.
Key diagnostic: performance correlates strongly with structured-input competence (ρ≈0.80 with RAGTruth Data2txt) and near-zero with jailbreak robustness (ρ≈0.05).
Shows where guards fail: subtle interface inconsistencies remain low-accuracy even when prompt injection/key leaks are detected well.
Skepticism: static traces don’t capture interactive co-evolution between agent and guard; taxonomy will need continual updates.

3) Beyond Surface Judgments: Human-Grounded Risk Evaluation of LLM-Generated Disinformation

Directly audits proxy validity: 290 deceptive articles paired with 2,043 human ratings; eight frontier judges scored the same items.
Finds a dangerous pattern: judges agree with each other (≈0.81/0.69) but align weakly with humans (≈0.45 credibility, ≈0.24 sharing).
Identifies cue mismatch: judges overweight logical rigor and penalize emotional intensity relative to humans.
Skepticism: participant sample not population-representative; benchmark could be expanded for scenario coverage.

4) StructKV: Preserving the Structural Skeleton for Scalable Long-Context Inference

Concrete long-context win under tight budgets: at 10% KV retention, nearly matches full-context LongBench (48.61 vs 49.33 on LLaMA-3.1-8B) and improves RULER retrieval to 128K (80.1 vs 75.6).
Adds prefill acceleration (~1.87× at 32K) by projecting later layers onto a “structural skeleton”.
Cross-layer importance + pivot detection is a principled alternative to single-layer saliency pruning.
Skepticism: validated only up to 128K; bandwidth-heavy aggregation/pivoting may be hardware-sensitive.

5) SQLStructEval: Structural Evaluation of LLM Text-to-SQL Generation

Makes instability measurable: even when execution-correct, models produce multiple distinct ASTs (e.g., GPT-5-mini correct queries still average 1.378 distinct ASTs).
Shows perturbation brittleness: paraphrases can drive large structural shifts (AST sim 0.328; sensitive fraction 0.900 for GPT-5-mini).
Demonstrates a mitigation: compile-style JSON→SQL improves exec accuracy (0.785 vs 0.742) and structural similarity (0.632 vs 0.552).
Skepticism: limited to Spider; canonical ASTs don’t fully capture semantic equivalence.

5) Practical next steps

Harden skill/tool supply chains: adopt a SkillSieve-like layered scanner for your tool registry (static features + structured semantic checks + multi-model adjudication), and explicitly test against split/obfuscated payloads and conditional triggers.
Add runtime monitoring for what static scans miss: prioritize detection for runtime fetches and time-delayed behaviors (explicitly called out as hard in SkillSieve).
Benchmark your guardrails on traces, not just prompts: use TraceSafe-style step-labeled tool-call traces; measure per-risk-type performance, especially interface inconsistency detection.
Stop treating judge agreement as validity for reader-facing harms: for disinformation, incorporate periodic human rating calibration and track rank correlation to humans (not just average scores).
Mitigate judge bias in evaluation loops: test rubric-based SPB (self/family overestimation) and consider committee voting; treat “objective rubrics” as necessary but not sufficient.
Prefer structured intermediates in agent tool generation: for program-like outputs (SQL, logic specs, tool args), move to JSON/AST interfaces + deterministic compilation and validate with external oracles (execution, NuSMV).
Deploy a router stack: combine (a) edge-first structured routing (AgentGate), (b) paradigm routing (Select-then-Solve), and (c) per-step deferral via uncertainty (ReDAct) to control cost while preserving reliability.
Stress-test against “clean reasoning, wrong answer” threats: add consistency checks between reasoning/actions and outcomes; don’t rely on CoT plausibility as a safety signal (MirageBackdoor + judge susceptibility results).

Generated from per-paper analyses; no external browsing.

Di Tang

AI Paper Insight Brief

2026-04-10

0) Executive takeaways (read this first)

2) Key themes (clusters)

Theme: Skill / tool supply-chain security for agents

Theme: LLM-as-judge validity and bias (especially for safety)

Theme: Structured representations as a reliability primitive (generation + guarding)

Theme: Routing, deferral, and meta-control for agent systems (edge-first)

Theme: Robustness & efficiency for long context and high-res multimodal

3) Technical synthesis

4) Top 5 papers (with “why now”)

5) Practical next steps