Daily AI Paper Report (2026-04-09)
Published:
Chinese version: [中文]
Run stats
- Candidates: 261
- Selected: 30
- Deepread completed: 30
- Window (UTC): 2026-04-07T00:00:00Z → 2026-04-08T00:00:00Z (arxiv_announce, expanded=0)
Selected papers
| arXiv ID | Title / Links | Categories | Score | Why | Tags |
|---|---|---|---|---|---|
| 2604.05292 | Broken by Default: A Formal Verification Study of Security Vulnerabilities in AI-Generated Code | cs.CR, cs.AI, cs.SE | 96 | Formal-verif study finds 55.8% AI code vulnerable; strong security methodology + dataset scale | code-security, formal-verification, LLM-coding, CWE, SMT, evaluation |
| 2604.05969 | A Formal Security Framework for MCP-Based AI Agents: Threat Taxonomy, Verification Models, and Defense Mechanisms | cs.CR, cs.AI | 95 | Formal security framework for MCP agent ecosystems: taxonomy, verification models, defenses. | agent-security, MCP, threat-modeling, formal-methods, tool-use, verification |
| 2604.05432 | Your LLM Agent Can Leak Your Data: Data Exfiltration via Backdoored Tool Use | cs.CR, cs.AI | 94 | Backdoored tool-use agents can exfiltrate stored context via memory/retrieval tool calls. | data-exfiltration, backdoors, tool-use, agent-security, memory, prompt-injection |
| 2604.05358 | LatentAudit: Real-Time White-Box Faithfulness Monitoring for Retrieval-Augmented Generation with Verifiable Deployment | cs.AI, cs.LG | 93 | White-box, real-time RAG faithfulness monitor using residual activations; verifiable deployment angle | RAG, faithfulness, monitoring, white-box, hallucinations, verification, residual-stream |
| 2604.06154 | Exclusive Unlearning | cs.CL | 93 | Unlearning-by-retention for broad harm removal; claims jailbreak robustness while keeping utility | unlearning, jailbreaks, safety, harmful-content, post-training |
| 2604.05485 | Auditable Agents | cs.AI | 92 | Defines actionable auditability dimensions for agents; focuses on evidence integrity & attribution. | auditability, accountability, agents, logging, governance, monitoring |
| 2604.05339 | Human Values Matter: Investigating How Misalignment Shapes Collective Behaviors in LLM Agent Communities | cs.CL | 92 | Multi-agent env to test how value misalignment changes collective behavior; direct agent-safety relevance | multi-agent, values, misalignment, emergent-behavior, simulation, agent-safety |
| 2604.05480 | Can You Trust the Vectors in Your Vector Database? Black-Hole Attack from Embedding Space Defects | cs.CR, cs.DB | 91 | Practical poisoning attack on vector DBs via centroid hubness; high relevance to RAG security | security, RAG, vector-database, data-poisoning, embeddings, retrieval-attacks, hubness |
| 2604.06091 | Social Dynamics as Critical Vulnerabilities that Undermine Objective Decision-Making in LLM Collectives | cs.CL, cs.AI, cs.MA | 91 | Shows social-psychology vulnerabilities in LLM collectives; adversaries sway representative agents | multi-agent, security, social-influence, robustness, adversarial-evaluation |
| 2604.06132 | Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents | cs.AI | 90 | Agent eval suite with trace-level evidence channels; targets safety/robustness gaps in benchmarks. | agent-evaluation, benchmarks, traces, robustness, multimodal, safety-eval |
| 2604.05995 | The Model Agreed, But Didn't Learn: Diagnosing Surface Compliance in Large Language Models | cs.CL, cs.AI, cs.LG | 90 | Diagnoses knowledge-editing evals: models can comply without real learning; improves reliability testing | knowledge-editing, evaluation, reliability, self-assessment, robustness |
| 2604.05279 | Pressure, What Pressure? Sycophancy Disentanglement in Language Models via Reward Decomposition | cs.AI | 89 | Targets sycophancy with reward decomposition separating pressure capitulation vs evidence blindness | alignment, sycophancy, reward-modeling, RLHF, DPO, robustness, evaluation |
| 2604.05793 | BodhiPromptShield: Pre-Inference Prompt Mediation for Suppressing Privacy Propagation in LLM/VLM Agents | cs.CR, cs.CV | 88 | Propagation-aware prompt privacy mediation across retrieval/memory/tools; benchmarked reductions. | privacy, agents, prompt-mediation, PII, tool-calls, RAG, memory |
| 2604.05779 | What Models Know, How Well They Know It: Knowledge-Weighted Fine-Tuning for Learning When to Say "I Don't Know" | cs.CL, cs.AI | 88 | Knowledge-weighted finetuning to reduce hallucinations and elicit 'I don't know' with new uncertainty metrics | hallucination, uncertainty, calibration, abstention, fine-tuning, reliability |
| 2604.05336 | TRACE: Capability-Targeted Agentic Training | cs.AI | 88 | Capability-targeted agent training from failure/success contrasts; practical agent self-improvement | agents, training, self-improvement, trajectory-learning, evaluation |
| 2604.05719 | Hackers or Hallucinators? A Comprehensive Analysis of LLM-Based Automated Penetration Testing | cs.CR, cs.AI, cs.SE | 86 | SoK + unified empirical eval of LLM automated pentesting frameworks; clarifies real capability. | cybersecurity, agents, SoK, autonomous-attacks, evaluation, dual-use |
| 2604.06126 | Gym-Anything: Turn any Software into an Agent Environment | cs.LG, cs.AI | 86 | Scales computer-use agent eval by auto-building software environments with audit agent verification | agents, computer-use, benchmarks, environment-generation, auditing, tool-use, evaluation |
| 2604.05557 | EpiBench: Benchmarking Multi-turn Research Workflows for Multimodal Agents | cs.CL | 86 | Episodic multi-turn multimodal benchmark for research workflows: search, figures/tables, cross-paper memory | agents, benchmark, multimodal, tool-use, search, long-horizon |
| 2604.05623 | DetailVerifyBench: A Benchmark for Dense Hallucination Localization in Long Image Captions | cs.CV, cs.CL, cs.MM | 86 | Benchmark for token-level hallucination localization in long captions; dense, multi-domain eval | hallucinations, multimodal, benchmark, evaluation, reliability |
| 2604.06019 | CritBench: A Framework for Evaluating Cybersecurity Capabilities of Large Language Models in IEC 61850 Digital Substation Environments | cs.CR, cs.AI | 85 | OT-focused LLM cyber capability eval in IEC 61850 substations; fills IT-only benchmark gap. | cybersecurity, OT-security, evaluation, agents, critical-infrastructure, dual-use |
| 2604.05955 | Does Pass Rate Tell the Whole Story? Evaluating Design Constraint Compliance in LLM-based Issue Resolution | cs.SE, cs.AI | 84 | Benchmark for issue-resolution beyond tests: explicit design-constraint compliance from real PRs | agents, software-engineering, code-agents, benchmarks, constraint-compliance, evaluation |
| 2604.05593 | Label Effects: Shared Heuristic Reliance in Trust Assessment by Humans and LLM-as-a-Judge | cs.AI, cs.CL | 84 | Shows LLM-as-judge trust is label-biased; counterfactual + attention analysis questions evaluator validity | LLM-judge, evaluation, bias, trust, human-factors, robustness |
| 2604.05483 | Can We Trust a Black-box LLM? LLM Untrustworthy Boundary Detection via Bias-Diffusion and Multi-Agent Reinforcement Learning | cs.AI, cs.CL | 84 | Black-box method to map topics where LLM becomes biased/untrustworthy using KG + multi-agent RL | bias, trustworthiness, black-box, red-teaming, reinforcement-learning |
| 2604.05872 | Swiss-Bench 003: Evaluating LLM Reliability and Adversarial Security for Swiss Regulatory Contexts | cs.CR, cs.AI, cs.CL | 83 | Swiss regulatory reliability+adversarial security benchmark across 4 languages and 808 items. | evaluation, reliability, adversarial, regulation, multilingual, prompt-leakage |
| 2604.05912 | FrontierFinance: A Long-Horizon Computer-Use Benchmark of Real-World Financial Tasks | cs.CL | 83 | Long-horizon computer-use benchmark for real finance workflows; useful for tracking agent capability | agents, benchmarks, computer-use, long-horizon, finance, evaluation, accountability |
| 2604.05952 | Towards Trustworthy Report Generation: A Deep Research Agent with Progressive Confidence Estimation and Calibration | cs.AI, cs.CL | 83 | Deep research agent with progressive confidence estimation/calibration to improve report trust | agents, calibration, uncertainty, trustworthiness, report-generation |
| 2604.06013 | Epistemic Blinding: An Inference-Time Protocol for Auditing Prior Contamination in LLM-Assisted Analysis | cs.AI, cs.CL | 82 | Inference-time protocol to audit memorized priors vs data-driven reasoning via entity blinding. | audit, data-contamination, epistemic, evaluation, grounding, scientific-LLMs |
| 2604.05522 | Cross-Modal Coreference Alignment: Enabling Reliable Information Transfer in Omni-LLMs | cs.CL | 82 | Cross-modal coreference dataset/tasks to improve omni-LLM alignment of referents; reliability for multimodal agents | multimodal, coreference, dataset, grounding, evaluation, omni-LLM |
| 2604.05333 | Graph of Skills: Dependency-Aware Structural Retrieval for Massive Agent Skills | cs.AI | 82 | Dependency-aware retrieval for massive skill libraries; reduces context bloat and agent errors | agents, tool-use, retrieval, skills, long-context-efficiency |
| 2604.05348 | From Retinal Evidence to Safe Decisions: RETINA-SAFE and ECRT for Hallucination Risk Triage in Medical LLMs | cs.AI | 81 | Medical hallucination risk triage benchmark + white-box detector for evidence conflict/gaps. | hallucinations, medical-safety, benchmarks, uncertainty, risk-triage, grounding |
AI Paper Insight Brief
2026-04-09
1) Executive takeaways (read this first)
- “White-box monitoring” is becoming a practical deployment primitive: two independent works show internal-state signals can triage hallucination/faithfulness with strong accuracy and low latency (medical evidence triage; RAG faithfulness monitoring with sub-ms overhead and optional zk verification).
- Agent security is shifting from prompt-injection to “tool + memory + retrieval” system exploits: backdoored tool-use can exfiltrate session memory via seemingly legitimate retrieval traffic, while vector DBs admit query-agnostic poisoning via centroid “black-hole” embeddings—both bypass content-focused defenses.
- Evaluation is moving from outcome-only to trace- and process-grounded auditing: new benchmarks/frameworks emphasize trajectory evidence, robustness under perturbations, and multi-turn workflows (Claw-Eval, EpiBench, FrontierFinance), repeatedly showing that output-only judging misses major safety/robustness failures.
- Targeted training signals beat monolithic rewards for social/agent failures: decomposed reward shaping reduces sycophancy under authority pressure; capability-targeted adapter training improves agent success by isolating deficits rather than optimizing a single environment reward.
- “Trust” failures increasingly look like social/organizational dynamics: multi-agent collectives and provenance labels systematically bias decisions (peer conformity/verbosity/expertise effects; “Human vs AI” labels shift trust ratings for both humans and LLM judges).
2) Key themes (clusters)
Theme: White-box reliability monitors (hallucination/faithfulness triage)
- Why it matters: Deployments need fast, local, evidence-conditioned checks without extra judge models or heavy sampling—especially in medical/RAG settings where unsupported claims are safety-critical.
- Representative papers:
- From Retinal Evidence to Safe Decisions: RETINA-SAFE and ECRT for Hallucination Risk Triage in Medical LLMs
- LatentAudit: Real-Time White-Box Faithfulness Monitoring for Retrieval-Augmented Generation with Verifiable Deployment
- What Models Know, How Well They Know It: Knowledge-Weighted Fine-Tuning for Learning When to Say “I Don’t Know”
- Common approach:
- Use paired conditions to isolate evidence dependence (CTX vs NOCTX passes; calibration splits; multi-sampled probing).
- Convert internal/model-derived signals into lightweight classifiers/threshold rules (XGBoost heads; Mahalanobis distance; instance-weighted loss).
- Optimize for high-recall triage policies and actionable subtyping (unsafe→gap vs contradiction; abstain via <IDK>).
- Open questions / failure modes:
- Generalization beyond studied settings (structured retinal evidence; 7–8B open-weight models; patient-disjoint splits not used in RETINA-SAFE).
- Monitors verify faithfulness to retrieved evidence, not the truth of that evidence (corpus poisoning remains a risk).
- Calibration/threshold brittleness near decision boundaries (quantization noise for verifiable deployment; subtle-evidence cases for subtype attribution).
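The monitor recipe above (internal signals → lightweight classifier or threshold rule) can be sketched with a diagonal-covariance Mahalanobis score over hidden activations. This is an illustrative stand-in, not the papers' exact pipelines: the feature dimension, synthetic activations, and the diagonal-covariance simplification are all assumptions.

```python
import math
import random

def fit_reference(activations):
    """Per-dimension mean/std of hidden activations from known-faithful generations."""
    dim, n = len(activations[0]), len(activations)
    mean = [sum(a[i] for a in activations) / n for i in range(dim)]
    std = [math.sqrt(sum((a[i] - mean[i]) ** 2 for a in activations) / n) + 1e-8
           for i in range(dim)]
    return mean, std

def monitor_score(x, mean, std):
    """Diagonal-covariance Mahalanobis distance: larger = farther from the faithful reference."""
    return math.sqrt(sum(((xi - m) / s) ** 2 for xi, m, s in zip(x, mean, std)))

random.seed(0)
# synthetic stand-ins for residual-stream activations of faithful answers
faithful = [[random.gauss(0.0, 1.0) for _ in range(8)] for _ in range(200)]
mean, std = fit_reference(faithful)

typical = [random.gauss(0.0, 1.0) for _ in range(8)]   # pattern like the reference set
drifted = [random.gauss(3.0, 1.0) for _ in range(8)]   # shifted pattern -> route to triage
assert monitor_score(drifted, mean, std) > monitor_score(typical, mean, std)
```

A deployment would pick the threshold on a calibration split to hit a target recall, which is where the brittleness noted above shows up.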
Theme: Agent-stack security: tool exfiltration + vector DB poisoning + formally proven code vulns
- Why it matters: Real-world agent stacks add new attack surfaces (memory, tools, retrieval, vector stores). Defenses that only inspect retrieved text or rely on static tools can miss the real channel.
- Representative papers:
- Your LLM Agent Can Leak Your Data: Data Exfiltration via Backdoored Tool Use
- Can You Trust the Vectors in Your Vector Database? Black-Hole Attack from Embedding Space Defects
- Broken by Default: A Formal Verification Study of Security Vulnerabilities in AI-Generated Code
- A Formal Security Framework for MCP-Based AI Agents: Threat Taxonomy, Verification Models, and Defense Mechanisms
- Common approach:
- Move from heuristic detection to provable/structural reasoning (SMT witnesses; geometric hubness theory; formal LTS security properties).
- Attack/defense evaluation at the system boundary (tool-call payloads, reranker delivery, ANN index behavior), not just model text.
- Emphasize end-to-end exploitability (ASAN-confirmed PoCs; delivery-through-stack rates; retrieval ranking manipulation).
- Open questions / failure modes:
- Practical mitigations are under-tested: MCP “reference architecture” is unimplemented; exfiltration defenses need egress/payload auditing validation.
- Detection/mitigation trade-offs: hubness transforms reduce attack success but can collapse recall; scalable detection adds extra k-NN overhead.
- “Secure prompting” is weak: security instructions reduced vulnerability rate only ~4pp in the formal code study.
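The hit-count detection idea from the vector-DB theme can be prototyped in a few lines: count how often each stored vector lands in queries' top-k; a vector planted near the corpus centroid accumulates an abnormal share of hits. The data, dimensions, and thresholds below are synthetic assumptions, not the paper's setup.

```python
import math
import random
from collections import Counter

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-12)

def topk_hit_counts(db, queries, k=10):
    """Per-vector count of appearances in the queries' top-k results (hit-count filter)."""
    hits = Counter()
    for q in queries:
        top = sorted(range(len(db)), key=lambda i: -cosine(q, db[i]))[:k]
        hits.update(top)
    return hits

random.seed(1)
dim, mu = 16, [2.0] * 16                 # embeddings clustered around a nonzero mean
db = [[m + random.gauss(0.0, 2.0) for m in mu] for _ in range(200)]
queries = [[m + random.gauss(0.0, 0.5) for m in mu] for _ in range(40)]

db.append(list(mu))                      # planted "black hole": a vector at the centroid
poison_idx = len(db) - 1

hits = topk_hit_counts(db, queries, k=10)
uniform = 10 * len(queries) / len(db)    # ~2 hits/vector if retrieval were uniform
# the planted vector dominates retrieval across unrelated queries
assert hits[poison_idx] == max(hits.values())
```

The noted trade-off applies here too: this filter adds an extra k-NN pass, and legitimate hub vectors in skewed corpora will also trip it.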
Theme: Trustworthy agent evaluation via traces, rubrics, and multi-turn workflows
- Why it matters: Pass rates and final-answer judging systematically overestimate readiness; real deployments require auditability, robustness under failures, and evidence-grounded multi-step behavior.
- Representative papers:
- Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents
- EpiBench: Benchmarking Multi-turn Research Workflows for Multimodal Agents
- FrontierFinance: A Long-Horizon Computer-Use Benchmark of Real-World Financial Tasks
- Common approach:
- Require process evidence (execution traces + audit logs + snapshots; evidence checklists; rubric-based grading).
- Stress long-horizon and tool-disabled phases to test memory/evidence reuse (EpiBench final turn; finance deliverables).
- Separate peak capability vs reliability (Pass@k vs Pass^k; robustness under injected failures).
- Open questions / failure modes:
- Cost/complexity of running full suites at scale (multi-trial runs; human baselines; heavy tool infrastructure).
- Judge bias persists even with rubrics (FrontierFinance judge overestimation; EpiBench relies on LLM judge despite agreement checks).
- Memory remains a dominant bottleneck: tool-disabled final turns sharply reduce success; robustness failures show up as inconsistency across trials.
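The "peak capability vs reliability" split above can be made concrete: empirically, Pass@k credits a task if any of its k trials succeeds, while Pass^k requires all k to succeed. A minimal sketch with synthetic trial data (the simple empirical definitions, not an unbiased estimator):

```python
def pass_at_k(trials):
    """Peak capability: a task counts if ANY of its trials succeeded."""
    return sum(any(t) for t in trials) / len(trials)

def pass_pow_k(trials):
    """Reliability floor: a task counts only if ALL of its trials succeeded."""
    return sum(all(t) for t in trials) / len(trials)

# per-task outcomes over k=3 trials (True = success)
trials = [
    [True, True, True],     # reliably solved
    [True, False, True],    # flaky
    [False, False, False],  # unsolved
    [False, True, False],   # flaky
]
print(pass_at_k(trials))    # 0.75 -- looks capable
print(pass_pow_k(trials))   # 0.25 -- the reliability floor is much lower
```

The gap between the two numbers is exactly the inconsistency-across-trials failure mode the bullet describes.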
Theme: Social pressure, collective dynamics, and trust heuristics
- Why it matters: Many failures are not “reasoning errors” but socially mediated: authority cues, majority influence, provenance labels, and population value composition can shift outcomes and induce harmful behaviors.
- Representative papers:
- Pressure, What Pressure? Sycophancy Disentanglement in Language Models via Reward Decomposition
- Social Dynamics as Critical Vulnerabilities that Undermine Objective Decision-Making in LLM Collectives
- Label Effects: Shared Heuristic Reliance in Trust Assessment by Humans and LLM-as-a-Judge
- Human Values Matter: Investigating How Misalignment Shapes Collective Behaviors in LLM Agent Communities
- Common approach:
- Operationalize social failure modes with controlled manipulations (authority pressure levels; adversary count; rhetorical style; value prevalence sweeps).
- Use contrastive setups to isolate causal drivers (opposing contexts; success vs failure rollouts; counterfactual label swaps).
- Measure both macro outcomes (community resilience, population stability) and micro behaviors (deception, betrayal, sycophancy).
- Open questions / failure modes:
- Transfer to real multi-turn, adversarial, and culturally diverse pressure forms is incomplete (sycophancy transfer weaker for emotional-investment latent pressure).
- Annotation/judging bias risks (LLM annotators for emergent behaviors; attention/gaze comparisons are correlational).
- Representative-agent aggregation is fragile to verbosity/expertise cues; needs robust aggregation protocols beyond “read peers and decide”.
Theme: Scaling agent capability via targeted retrieval and targeted training
- Why it matters: As skill libraries and environments scale, agents fail due to missing prerequisites or specific capability gaps; targeted retrieval/training improves efficiency and success under budgets.
- Representative papers:
- Graph of Skills: Dependency-Aware Structural Retrieval for Massive Agent Skills
- TRACE: Capability-Targeted Agentic Training
- Gym-Anything: Turn any Software into an Agent Environment
- Common approach:
- Replace flat retrieval with structure-aware selection (typed skill graphs + reverse-aware diffusion; budgeted hydration).
- Identify deficits from traces and train capability-specific adapters (LoRA per capability; routing at inference).
- Scale environments/tasks via automated creation + auditing loops and checklist verifiers.
- Open questions / failure modes:
- Graph quality and static structure can bottleneck GoS; TRACE depends on correctness of LLM-based capability labeling/routing (not fully measured).
- Long-horizon pass rates remain low even with large task corpora; auditing helps but doesn’t solve planning/verification deficits.
- Interaction with security: larger tool/skill surfaces increase attack exposure unless coupled with audit/egress controls.
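The "structure-aware selection + budgeted hydration" idea can be sketched as a prerequisite-closure walk over a typed skill graph: pull in retrieved skills plus their transitive dependencies until a context budget is exhausted. The skill library, costs, and traversal below are illustrative assumptions, not the GoS algorithm.

```python
from collections import deque

# skill -> (context cost in tokens, prerequisite skills); hypothetical library
SKILLS = {
    "parse_pdf":  (300, ["read_file"]),
    "read_file":  (100, []),
    "summarize":  (200, ["parse_pdf"]),
    "plot":       (250, ["load_table"]),
    "load_table": (150, ["read_file"]),
}

def hydrate(seeds, budget):
    """Breadth-first dependency closure: add each skill and its prerequisites
    while the token budget allows; skills over budget are skipped (a real system
    would also have to handle the dangling-prerequisite case this creates)."""
    chosen, spent, seen = [], 0, set()
    queue = deque(seeds)
    while queue:
        s = queue.popleft()
        if s in seen:
            continue
        cost, deps = SKILLS[s]
        if spent + cost > budget:
            continue
        seen.add(s)
        spent += cost
        chosen.append(s)
        queue.extend(deps)
    return chosen, spent

chosen, spent = hydrate(["summarize"], budget=700)
print(chosen, spent)   # ['summarize', 'parse_pdf', 'read_file'] 600
```

The point of the structure-aware step: flat retrieval would fetch only "summarize" and the agent would fail on the missing "parse_pdf"/"read_file" prerequisites.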
3) Technical synthesis
- Multiple papers converge on contrastive signal design to avoid “gradient/learning collapse”: sycophancy uses opposing contexts + pressured variants; TRACE uses success/failure rollout contrasts; blinding uses A/B anonymization; label-effects uses counterfactual swaps.
- GRPO appears as a recurring optimization primitive for agent/alignment training (sycophancy reward decomposition; TRACE per-capability adapters; CROSSOMNI SFT+GRPO for coreference thinking patterns).
- A clear pattern: process-grounded evaluation beats output-only judging. Claw-Eval quantifies miss rates for vanilla judges (safety/robustness), FrontierFinance shows rubric guidance improves judge-human correlation, and EpiBench forces memory-only final turns to expose hidden failures.
- “Trustworthiness” is increasingly decomposed into subtasks with explicit policies: safe/unsafe then gap vs contradiction (ECRT), safe vs risky faithfulness (LatentAudit), answer vs <IDK> (KWT), completion × safety × robustness (Claw-Eval).
- Security work is moving toward formal or quasi-formal witnesses: SMT SAT witnesses for exploitability; LTS properties for MCP; theoretical hubness conditions for vector poisoning, reducing reliance on pattern matching.
- Several results show asymmetries between generation and verification: models generate vulnerable code frequently but can detect many of their own proven vulns in review mode; agents can succeed when tools remain available but fail when forced to rely on stored evidence.
- Multi-agent systems show two distinct risk channels: population composition effects (values → tipping points) and interaction protocol effects (representative swayed by majority/verbosity/expertise).
- Benchmarks increasingly include reliability under perturbation (Claw-Eval error injection; AutoPT framework comparisons; long-horizon finance tasks; CUA-World-Long budgets).
- Privacy/security defenses are trending toward boundary controls (prompt mediation + restoration; egress/payload auditing; signed hash-chained logs) rather than only model-side alignment.
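The hash-chained-log idea in the last bullet can be sketched in a few lines: each entry commits to its predecessor's digest, so any retroactive edit breaks verification from that point on. Signatures are omitted for brevity; the event strings are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64

def append_entry(log, event):
    """Append an event whose digest commits to the previous entry's digest."""
    prev = log[-1]["digest"] if log else GENESIS
    body = json.dumps({"event": event, "prev": prev}, sort_keys=True)
    log.append({"event": event, "prev": prev,
                "digest": hashlib.sha256(body.encode()).hexdigest()})

def verify_chain(log):
    """Recompute every digest in order; False means some entry was altered."""
    prev = GENESIS
    for entry in log:
        body = json.dumps({"event": entry["event"], "prev": prev}, sort_keys=True)
        if entry["prev"] != prev or entry["digest"] != hashlib.sha256(body.encode()).hexdigest():
            return False
        prev = entry["digest"]
    return True

log = []
append_entry(log, "tool_call: search('q')")
append_entry(log, "memory_read: session")
assert verify_chain(log)

log[0]["event"] = "tool_call: search('benign')"   # tamper with history
assert not verify_chain(log)
```

A signed head digest (not shown) is what makes the chain externally auditable rather than merely self-consistent.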
4) Top 5 papers (with “why now”)
1) Broken by Default: A Formal Verification Study of Security Vulnerabilities in AI-Generated Code
- Formalizes exploitability with Z3 SMT witnesses (1,055 SAT findings) rather than heuristic flags.
- Shows high vulnerability rates across seven frontier models (mean 55.8%; integer arithmetic worst at 87%).
- Reveals a major tooling gap: six industry tools miss 97.8% of Z3-proven findings.
- Skepticism: benchmark scope (500 prompts, temp=0) and auxiliary ablations limited to a 50-prompt subcorpus.
2) Your LLM Agent Can Leak Your Data: Data Exfiltration via Backdoored Tool Use
- Demonstrates an end-to-end agentic exfiltration channel: session_memory → outbound retrieval with encoded payload.
- High trigger activation (ASR >94%) with minimal benign performance loss (<1% MT-Bench degradation).
- Shows reranker-aware rewriting restores delivery through rerankers and bypasses retrieval-stage defenses (delivery-through-stack ≈81–87%).
- Skepticism: attack requires outbound connectors + memory; multi-turn leakage estimates assume cooperative users and specific defense placements/configs.
3) LatentAudit: Real-Time White-Box Faithfulness Monitoring for Retrieval-Augmented Generation with Verifiable Deployment
- Low-latency white-box faithfulness monitor (e.g., 0.942 AUROC on PubMedQA with 0.77 ms overhead).
- Robust across model families/datasets and stress tests; no separate judge model (only tiny projector calibration).
- Optional zk-verifiable decision rule with fixed-point quantization (k=16 preserves ~99.8% AUROC).
- Skepticism: requires open weights/activations; verifies faithfulness to retrieved evidence, not evidence truth.
4) Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents
- Enforces trajectory-audited evaluation with three evidence channels and post-hoc judging firewall.
- Quantifies how output-only judges fail (miss 44% safety violations; 13% robustness failures).
- Separates peak vs reliability via Pass@k vs Pass^k and robustness via controlled error injection.
- Skepticism: limitations/costs of running the full suite at scale aren’t clearly enumerated in the provided analysis.
5) Pressure, What Pressure? Sycophancy Disentanglement in Language Models via Reward Decomposition
- Makes sycophancy trainable by decomposing reward into pressure resistance vs evidence responsiveness (plus auxiliary terms).
- Two-phase SFT+GRPO reduces answer-priming sycophancy ~15–17pp on SycophancyEval and improves stance consistency.
- Ablations suggest reward terms control independent behavioral axes, improving targeted correction.
- Skepticism: relies heavily on NLI scoring; transfer weaker for some latent pressure forms (e.g., emotional-investment).
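The "formal witness" idea from paper 1, a satisfying assignment that concretely proves a vulnerable state is reachable, can be illustrated with a toy boundary-value search. The real study queries an SMT solver (Z3) over path constraints; this brute-force stand-in and the guarded-addition example are illustrative assumptions.

```python
INT32_MAX = 2**31 - 1

def wrap_int32(x: int) -> int:
    """Model C's 32-bit two's-complement wraparound."""
    return ((x + 2**31) % 2**32) - 2**31

def find_witness():
    """Boundary-value search standing in for an SMT query: do inputs exist that pass
    a guard checking each operand individually (0 <= a, b <= INT32_MAX) while their
    32-bit sum still wraps negative? A returned pair is the analogue of a SAT witness."""
    candidates = [0, 1, INT32_MAX - 1, INT32_MAX]   # classic overflow boundary values
    for a in candidates:
        for b in candidates:
            if wrap_int32(a + b) < 0:               # per-operand guard misses this
                return a, b
    return None                                     # "UNSAT" over the probed values

witness = find_witness()
print(witness)   # (1, 2147483647): guard passes, sum wraps to -2**31
```

A real SMT witness additionally proves reachability along a specific program path, which is why the paper's findings are harder to dismiss than static-analyzer flags.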
5) Practical next steps
- For RAG deployments, prototype a white-box faithfulness monitor (Mahalanobis-style or CTX/NOCTX discrepancy features) and measure AUROC/latency under retrieval-miss and contradiction stress tests.
- Add egress controls + tool-call payload auditing to agent stacks: flag long opaque/base64-like URL parameters; separate privileges so memory-read and network-write can’t chain without explicit authorization.
- Run a vector DB poisoning red-team: inject centroid-near vectors at ~1% rate in a staging index and track MO@10/Recall@10; evaluate detection-by-hit-count filters vs hubness transforms.
- Replace output-only evaluation with trace-grounded scoring: log tool calls, server-side audit logs, and snapshots; compute reliability floors (Pass^k) under injected transient tool/service failures.
- For multi-agent “committee” systems, harden aggregation against majority/verbosity/expertise effects: cap rationale length, randomize/normalize peer formatting, and test representative accuracy vs adversary count and verbosity.
- In code-generation pipelines, incorporate formal exploitability checks (SMT-based where feasible) and exploit the generation–review asymmetry: require self-review plus formal witness validation before merge.
- When fine-tuning for factuality, consider knowledge-aware weighting + explicit abstention (e.g., <IDK> supervision) and track uncertainty-aware metrics (nAUPC, A-FPR, IDK-Precision), not just accuracy.
- For long-horizon professional agents (research/finance), enforce memory-only final turns in internal evals to expose evidence-reuse failures, then iterate on memory indexing and evidence minimality.
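The egress-auditing step above can be prototyped as an entropy filter over outbound URL query parameters: long, base64-alphabet, high-entropy values are flagged for review. Function names and threshold values here are illustrative assumptions, not a validated policy, and the flagged payload is a deliberately constructed stand-in.

```python
import math
import re
from urllib.parse import parse_qsl, urlparse

BASE64ISH = re.compile(r'^[A-Za-z0-9+/_\-=]+$')

def shannon_entropy(s: str) -> float:
    """Bits per character of the value's empirical character distribution."""
    probs = [s.count(c) / len(s) for c in set(s)]
    return -sum(p * math.log2(p) for p in probs)

def suspicious_params(url: str, min_len=24, min_entropy=4.0):
    """Flag query-string values that look like encoded payloads:
    long + base64 alphabet + high entropy."""
    flagged = []
    for key, val in parse_qsl(urlparse(url).query):
        if len(val) >= min_len and BASE64ISH.match(val) and shannon_entropy(val) >= min_entropy:
            flagged.append(key)
    return flagged

benign = "https://search.example/api?q=weather+in+paris&page=2"
# stand-in payload: 32 distinct base64 characters -> entropy exactly log2(32) = 5.0 bits
payload = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdef"
exfil = "https://search.example/api?q=news&sid=" + payload

print(suspicious_params(benign))   # []
print(suspicious_params(exfil))    # ['sid']
```

This only covers the URL-parameter channel; the privilege-separation half of the bullet (memory-read must not chain into network-write) needs enforcement in the tool runtime, not in a filter.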
Generated from per-paper analyses; no external browsing.
