Daily AI Paper Report (2026-04-25)
Published:
Chinese version: [中文]
Run stats
- Candidates: 221
- Selected: 30
- Deepread completed: 30
- Window (UTC): 2026-04-23T00:00:00Z → 2026-04-24T00:00:00Z (arxiv_announce, expanded=0)
Selected papers
| arXiv ID | Title / Links | Categories | Score | Why | Tags |
|---|---|---|---|---|---|
| 2604.21477 | MCP Pitfall Lab: Exposing Developer Pitfalls in MCP Tool Server Security under Multi-Vector Attacks | cs.CR | 95 | Protocol-aware MCP security testbed w/ reproducible pitfalls, traces, validators; multi-vector attacks | agents, MCP, tool-security, prompt-injection, supply-chain, benchmark, evaluation |
| 2604.21860 | Transient Turn Injection: Exposing Stateless Multi-Turn Vulnerabilities in Large Language Models | cs.CR, cs.AI | 93 | New multi-turn jailbreak exploiting stateless moderation; broad eval across frontier & OSS models | jailbreaks, multi-turn, moderation, adversarial, red-teaming, security |
| 2604.21211 | Subject-level Inference for Realistic Text Anonymization Evaluation | cs.CL | 93 | New benchmark shows span-masking can still leak identity via subject-level inference. | privacy, anonymization, PII, evaluation, inference-attacks, benchmarks |
| 2604.21308 | CI-Work: Benchmarking Contextual Integrity in Enterprise LLM Agents | cs.CR, cs.CL | 92 | Enterprise agent privacy benchmark grounded in contextual integrity; shows utility–leakage trade-off | agents, privacy, information-flow, benchmark, RAG, enterprise, evaluation |
| 2604.21255 | When Agents Look the Same: Quantifying Distillation-Induced Similarity in Tool-Use Behaviors | cs.CL | 92 | New metrics quantify distillation-driven homogenization in agent tool-use; useful for auditing ecosystem risk | agents, tool-use, distillation, behavioral-similarity, evaluation, model-auditing |
| 2604.21827 | Alignment has a Fantasia Problem | cs.AI, cs.HC | 91 | Alignment framing: users lack fixed goals; proposes intent-formation support to avoid failures. | alignment, HCI, goal-ambiguity, agent-assistants, human-factors |
| 2604.21829 | Black-Box Skill Stealing Attack from Proprietary LLM Agents: An Empirical Study | cs.CR | 90 | First empirical black-box study of stealing proprietary agent skills; taxonomy + attack surface | agents, model-extraction, prompt-stealing, IP, security, threat-model |
| 2604.21564 | Measuring Opinion Bias and Sycophancy via LLM-based Coercion | cs.CL | 90 | Open-source bench to elicit latent opinions/sycophancy in realistic multi-turn coercion settings | sycophancy, bias, evaluation, multi-turn, benchmarks, red-teaming |
| 2604.21229 | EngramaBench: Evaluating Long-Term Conversational Memory with Structured Graph Retrieval | cs.CL, cs.AI | 90 | Benchmark for long-term conversational memory + compares graph vs vector vs full-context; includes adversarial abstention | long-term-memory, benchmarks, RAG, graph-retrieval, evaluation, assistants |
| 2604.21840 | TraceScope: Interactive URL Triage via Decoupled Checklist Adjudication | cs.CR, cs.AI | 88 | Sandboxed operator+adjudicator agents for safe interactive phishing URL triage; evidence bundling | agentic-systems, sandboxing, cybersecurity, phishing, tool-use, evaluation |
| 2604.21523 | Seeing Isn't Believing: Uncovering Blind Spots in Evaluator Vision-Language Models | cs.CV, cs.CL | 88 | Benchmark exposes reliability blind spots of VLMs used as evaluators across I2T/T2I perturbations | VLM, LLM-as-judge, evaluation, robustness, hallucinations, benchmarks |
| 2604.21334 | Ideological Bias in LLMs' Economic Causal Reasoning | cs.AI, cs.CE, cs.CL, cs.LG, econ.GN | 88 | Large-scale eval of ideological bias in economic causal reasoning; ideology-contested subset from verified effects | bias, causal-reasoning, evaluation, economics, benchmarks, LLMs |
| 2604.21794 | Learning to Communicate: Toward End-to-End Optimization of Multi-Agent Language Systems | cs.AI, cs.CL, cs.MA | 88 | End-to-end learned latent inter-agent communication; could reshape multi-agent LLM system design. | multi-agent, communication, latent-interfaces, training, LLM-agents |
| 2604.21700 | Stealthy Backdoor Attacks against LLMs Based on Natural Style Triggers | cs.CR, cs.AI, cs.CL | 86 | Stealthy LLM backdoors via natural style triggers; clearer end-to-end threat model & pipeline | backdoors, data-poisoning, LLM-security, style-triggers, supply-chain |
| 2604.21911 | When Prompts Override Vision: Prompt-Induced Hallucinations in LVLMs | cs.CV, cs.AI, cs.CL, cs.LG | 86 | HalluScope isolates prompt-induced LVLM hallucinations; highlights instruction priors as key driver | LVLM, hallucinations, prompting, robustness, benchmark, grounding |
| 2604.21590 | AgenticQwen: Training Small Agentic Language Models with Dual Data Flywheels for Industrial-Scale Tool Use | cs.CL | 86 | Industrial small agentic LMs trained with multi-round RL + dual data flywheels for tool use; high practical impact | agents, tool-use, reinforcement-learning, small-models, synthetic-data, post-training |
| 2604.21593 | Language as a Latent Variable for Reasoning Optimization | cs.CL | 86 | Polyglot prompting/RL idea: language as latent variable can improve reasoning accuracy. | reasoning, multilingual, RLHF, GRPO, inference-strategies |
| 2604.21816 | Tool Attention Is All You Need: Dynamic Tool Gating and Lazy Schema Loading for Eliminating the MCP/Tools Tax in Scalable Agentic Workflows | cs.AI | 85 | Cuts MCP/tools token overhead via dynamic tool gating + lazy schema loading; claims big token savings | agents, tool-use, efficiency, long-context, MCP, systems |
| 2604.21375 | VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation | cs.CL, cs.AI, cs.SE | 85 | GUI agent framework with mandatory verifier + loop breaker to prevent premature stops and loops | agents, GUI automation, verification, reliability, tool-use, agent safety |
| 2604.21327 | Understanding and Mitigating Spurious Signal Amplification in Test-Time Reinforcement Learning for Math Reasoning | cs.LG, cs.AI, cs.CL | 85 | Analyzes spurious reward signals in test-time RL for math; proposes debias/denoise framework to reduce noise | test-time-training, reinforcement-learning, reasoning, robustness, math, optimization |
| 2604.21199 | ARFBench: Benchmarking Time Series Question Answering Ability for Software Incident Response | cs.LG, cs.CV | 85 | ARFBench TSQA for incident response; evaluates FMs on telemetry anomaly reasoning. | evaluation, benchmarks, time-series, incident-response, multimodal, ops |
| 2604.21571 | Separable Expert Architecture: Toward Privacy-Preserving LLM Personalization via Composable Adapters and Deletable User Proxies | cs.AI, cs.LG | 84 | Personalization w/ deletable per-user proxies enabling deterministic unlearning; reduces cross-user leak | privacy, unlearning, personalization, LoRA, adapters, data-deletion |
| 2604.21214 | SQLyzr: A Comprehensive Benchmark and Evaluation Platform for Text-to-SQL | cs.DB, cs.AI | 84 | Text-to-SQL evaluation platform with realistic workload alignment + fine-grained metrics beyond single score | text-to-sql, evaluation, benchmarks, databases, LLMs, metrics |
| 2604.21716 | From If-Statements to ML Pipelines: Revisiting Bias in Code-Generation | cs.CL, cs.SE | 83 | Shows codegen bias is underestimated: ML pipeline generation includes sensitive attrs in 87.7% cases | code generation, bias, fairness, evaluation, ML pipelines, safety |
| 2604.21421 | Differentially Private De-identification of Dutch Clinical Notes: A Comparative Evaluation | cs.CR, cs.AI, cs.CL | 83 | Comparative study of DP vs NER vs LLMs for clinical note de-ID (Dutch); directly relevant to privacy in LLM pipelines | privacy, differential-privacy, de-identification, clinical-NLP, LLMs, security |
| 2604.21344 | Beyond Single Plots: A Benchmark for Question Answering on Multi-Charts | cs.CL, cs.AI, cs.CV, cs.LG, cs.MA | 83 | PolyChartQA benchmark exposes large drop for VLMs on multi-chart reasoning. | multimodal, VLM, benchmark, chart-QA, reasoning, evaluation |
| 2604.21197 | Toward Efficient Membership Inference Attacks against Federated Large Language Models: A Projection Residual Approach | cs.LG | 81 | Membership inference tailored to federated LLM fine-tuning; projection-residual method on gradients | privacy, membership-inference, federated-learning, LLMs, security |
| 2604.21309 | When Bigger Isn't Better: A Comprehensive Fairness Evaluation of Political Bias in Multi-News Summarisation | cs.CL | 81 | Large fairness eval of political bias in multi-news summarization across 13 LLMs + metrics. | fairness, bias, summarization, evaluation, politics, LLMs |
| 2604.21854 | Bounding the Black Box: A Statistical Certification Framework for AI Risk Regulation | cs.AI | 80 | Proposes statistical certification to quantify/verify acceptable risk for AI regulation compliance | AI regulation, risk certification, assurance, governance, deployment safety |
| 2604.21769 | Who Defines "Best"? Towards Interactive, User-Defined Evaluation of LLM Leaderboards | cs.AI, cs.CY, cs.HC | 80 | Shows leaderboard rankings depend on prompt slices; proposes interactive user-defined evaluation of LLM leaderboards | evaluation, leaderboards, LMArena, benchmarking, human-preferences, governance |
AI Paper Insight Brief
2026-04-25
0) Executive takeaways (read this first)
- “Gradient-only” and “federated” are not privacy shields for LLM fine-tuning: a single round of PEFT gradients can enable near-perfect membership inference via a simple projection-residual test (ProjRes), and lightweight defenses only help when they also crush utility.
- Enterprise agent privacy is failing in realistic dense-retrieval workflows: CI-Work shows substantial leakage/violation rates and a clear privacy–utility coupling; “try harder / bigger model” can increase leakage (inverse scaling) and user pressure makes things worse.
- Tool/agent security is shifting from prompt injection to protocol + developer pitfalls + trace auditing: MCP Pitfall Lab shows deterministic static checks can eliminate many server-side pitfalls cheaply, while black-box “skill stealing” and stateless multi-turn attacks (TTI) demonstrate how much can leak through normal interfaces.
- Evaluation itself is a growing attack surface and failure point: evaluator VLMs miss obvious degradations (FOCUS), and multi-chart QA + time-series incident QA benchmarks show large capability gaps precisely where real-world reasoning is compositional and cross-context.
- Reliability gains are coming from “systems” not just models: GUI automation improves by enforcing completion verification + loop recovery (VLAA-GUI), and multi-agent systems improve by learning latent communication (DiffMAS) rather than exchanging only text.
- Bias/fairness findings are increasingly “non-monotonic with scale” and task-dependent: medium-sized models can be best for political fairness in summarization, and code-generation bias looks far worse when you evaluate realistic ML pipelines (feature selection) rather than toy if-statements.
2) Key themes (clusters)
Theme: Federated & personalized LLM privacy is brittle (and needs new primitives)
- Why it matters: Federated/PEFT deployments and personalization are moving into regulated domains, but both gradient leakage and “entangled weights” make deletion and privacy guarantees fragile.
- Representative papers:
- Toward Efficient Membership Inference Attacks against Federated Large Language Models: A Projection Residual Approach
- Separable Expert Architecture: Toward Privacy-Preserving LLM Personalization via Composable Adapters and Deletable User Proxies
- Common approach:
- Exploit/avoid PEFT structure (adapters/LoRA) as the key locus of privacy risk/control.
- Treat privacy as auditable signals: residuals from gradient subspaces (attack) vs. KL-to-baseline verification after deletion (defense-by-design).
- Emphasize single-round practicality (attacks) and deterministic deletion (architectures).
- Open questions / failure modes:
- Utility-preserving defenses against projection-style MIAs remain unclear (DP/pruning trade off sharply).
- Proxy artifacts concentrate user info (SEA) and become an exfiltration target.
- How these methods behave under stronger adversaries (semantic MIAs; cross-model proxy transfer) is unresolved.
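The projection-residual idea behind these attacks can be illustrated with a minimal NumPy sketch. This is not the paper's construction (it operates on hidden embeddings from real PEFT gradients); here the "member gradients" are synthetic vectors lying near a low-rank subspace, and the residual norm after projecting onto that fitted subspace separates members from non-members:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: member gradients lie near a low-rank subspace; non-member
# gradients do not. Dimensions are illustrative, not from the paper.
dim, rank = 64, 4
basis_true = rng.standard_normal((dim, rank))
members = basis_true @ rng.standard_normal((rank, 32)) + 0.01 * rng.standard_normal((dim, 32))
nonmembers = rng.standard_normal((dim, 32))

# Fit an orthonormal basis for the member-gradient subspace via SVD.
U, _, _ = np.linalg.svd(members, full_matrices=False)
Q = U[:, :rank]

def residual(g: np.ndarray) -> float:
    """Norm of the component of g orthogonal to the fitted subspace."""
    return float(np.linalg.norm(g - Q @ (Q.T @ g)))

member_scores = [residual(members[:, i]) for i in range(32)]
nonmember_scores = [residual(nonmembers[:, i]) for i in range(32)]

# Members have much smaller residuals, so a simple threshold works.
print(max(member_scores) < min(nonmember_scores))
```

The point of the sketch: when fine-tuning updates concentrate in a low-dimensional subspace, membership becomes a cheap geometric test, which is why DP noise only helps once it is large enough to wash out that structure.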
Theme: Contextual privacy for agents in enterprise/tool ecosystems
- Why it matters: Real agents operate over dense internal context and tool protocols; privacy failures are often systemic (retrieval density, user pressure, protocol surfaces), not just “bad prompts.”
- Representative papers:
- CI-Work: Benchmarking Contextual Integrity in Enterprise LLM Agents
- MCP Pitfall Lab: Exposing Developer Pitfalls in MCP Tool Server Security under Multi-Vector Attacks
- Black-Box Skill Stealing Attack from Proprietary LLM Agents: An Empirical Study
- Transient Turn Injection: Exposing Stateless Multi-Turn Vulnerabilities in Large Language Models
- Common approach:
- Build workflow-grounded benchmarks with explicit privacy objectives (Leakage/Violation/Conveyance; trace validators).
- Use trace-grounded evaluation rather than trusting agent narratives (MCP Pitfall Lab; also highlights narrative–trace divergence).
- Attack via normal interfaces: repeated black-box queries (skill stealing), stateless multi-turn accumulation (TTI), tool metadata/supply chain vectors (MCP pitfalls).
- Open questions / failure modes:
- Prompt defenses reduce leakage but often reduce utility; user pressure can create “lose–lose” outcomes (CI-Work).
- Detectors/filters show strong results on constructed sets (skill stealing), but robustness to real benign traffic distributions is uncertain.
- Stateless moderation is structurally vulnerable unless session-level aggregation is adopted (TTI).
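The session-level aggregation fix for stateless moderation can be sketched in a few lines. All names and thresholds here are illustrative, not from the TTI paper: each turn gets a risk score, a stateless filter only blocks when one turn crosses the threshold, and a session-level aggregator also tracks (decayed) accumulated risk:

```python
# Thresholds and decay are arbitrary illustrative values.
PER_TURN_THRESHOLD = 0.8
SESSION_THRESHOLD = 1.5
DECAY = 0.9  # older turns count slightly less

def stateless_block(turn_risks):
    """Per-turn moderation: blocks only if a single turn is risky enough."""
    return any(r >= PER_TURN_THRESHOLD for r in turn_risks)

def session_block(turn_risks):
    """Session-level moderation: also accumulates risk across turns."""
    score = 0.0
    for r in turn_risks:
        score = DECAY * score + r
        if r >= PER_TURN_THRESHOLD or score >= SESSION_THRESHOLD:
            return True
    return False

# A "distributed intent" session: every turn looks individually benign.
turns = [0.5, 0.6, 0.55, 0.6]
print(stateless_block(turns), session_block(turns))  # False True
```

The stateless filter passes every turn, while the aggregator flags the session once cumulative risk crosses the budget; this is the structural gap TTI-style attacks exploit.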
Theme: Benchmarks are getting more realistic—and models look worse on compositional, cross-context tasks
- Why it matters: As benchmarks move from synthetic/single-turn to real incidents, multi-chart figures, and long-term memory, the remaining gaps become clearer and more actionable for product teams.
- Representative papers:
- ARFBench: Benchmarking Time Series Question Answering Ability for Software Incident Response
- Beyond Single Plots: A Benchmark for Question Answering on Multi-Charts
- EngramaBench: Evaluating Long-Term Conversational Memory with Structured Graph Retrieval
- Who Defines “Best”? Towards Interactive, User-Defined Evaluation of LLM Leaderboards
- Common approach:
- Use realistic artifacts (production incidents; paper figures; multi-session histories; large preference logs).
- Slice performance by difficulty families (tiers, question types, cross-space vs single-space, etc.).
- Introduce system-level baselines (TSFM–VLM hybrids; graph memory vs vector retrieval vs full-context; decomposition+verification pipelines).
- Open questions / failure modes:
- Cross-series/time reasoning remains weak (ARFBench Tier III; EngramaBench temporal slice).
- Multi-chart localization + retrieval questions cause large drops; decomposition helps but costs more (PolyChartQA + VDSP).
- “Global leaderboard rank” can be misleading; slice weighting changes decisions (interactive leaderboard work).
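The slice-weighting point can be made concrete with a toy re-ranking (all scores invented): the same per-slice results produce different "best" models under uniform versus user-defined weights, which is the core argument of the interactive-leaderboard work.

```python
# Per-slice accuracy per model (invented numbers for illustration).
scores = {
    "model_a": {"math": 0.95, "coding": 0.60, "writing": 0.75},
    "model_b": {"math": 0.60, "coding": 0.85, "writing": 0.80},
}

def rank(weights):
    """Order models by slice-weighted average score, best first."""
    total = {
        m: sum(weights[s] * v for s, v in slices.items())
        for m, slices in scores.items()
    }
    return sorted(total, key=total.get, reverse=True)

uniform = {"math": 1/3, "coding": 1/3, "writing": 1/3}
code_heavy = {"math": 0.1, "coding": 0.7, "writing": 0.2}

# The leader flips when the user re-weights the slices.
print(rank(uniform)[0], rank(code_heavy)[0])  # model_a model_b
```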
Theme: Reliability via explicit verification, recovery, and learned coordination
- Why it matters: Many failures are procedural (premature stopping, loops, unstable adaptation, lossy communication). Explicit mechanisms and learnable interfaces are showing measurable gains.
- Representative papers:
- VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation
- Learning to Communicate: Toward End-to-End Optimization of Multi-Agent Language Systems
- Understanding and Mitigating Spurious Signal Amplification in Test-Time Reinforcement Learning for Math Reasoning
- Common approach:
- Add verifiers/gates (completion verifier; consensus refinement; stability metrics).
- Treat agent behavior as structured objects (KV traces; action loops; pseudo-label frequency regions).
- Use ablation-driven engineering to isolate which modules matter (VLAA-GUI; DDRL; DiffMAS step-count sensitivity).
- Open questions / failure modes:
- Verification can still miss dominant failure modes (VLAA-GUI notes false completion remains dominant for some backbones).
- Latent trace growth and step-count sensitivity can degrade performance (DiffMAS).
- TTRL-style methods may not generalize beyond math-like correctness signals (DDRL limitation).
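The multi-tier loop-breaking idea can be sketched as a simple check over an agent's action trace. This is a hypothetical helper, not VLAA-GUI's actual code: tier 1 catches an identical action repeated in a row, tier 2 catches a short action cycle recurring at the tail, and either signal would let the agent escalate to recovery instead of burning steps.

```python
def detect_loop(actions, max_repeat=3, max_cycle=3):
    """Return 'repeat', 'cycle', or None for an action trace."""
    # Tier 1: the same action repeated max_repeat times consecutively.
    run = 1
    for prev, cur in zip(actions, actions[1:]):
        run = run + 1 if cur == prev else 1
        if run >= max_repeat:
            return "repeat"
    # Tier 2: a short cycle (length 2..max_cycle) repeated twice at the tail.
    for k in range(2, max_cycle + 1):
        if len(actions) >= 2 * k and actions[-k:] == actions[-2 * k:-k]:
            return "cycle"
    return None

print(detect_loop(["click:ok", "click:ok", "click:ok"]))  # repeat
print(detect_loop(["open", "close", "open", "close"]))    # cycle
print(detect_loop(["open", "type", "click", "submit"]))   # None
```

In practice such a detector would gate an escalation ladder (retry with a new plan, broaden search, or abort), which is the "recover and search" half of the framework's design.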
Theme: Bias/fairness measurement is moving to “mechanism-relevant” tasks (and scale isn’t a fix)
- Why it matters: Bias can hide in realistic outputs (summaries, pipelines, causal reasoning). Evaluations that match real mechanisms reveal stronger and more directional failures.
- Representative papers:
- When Bigger Isn't Better: A Comprehensive Fairness Evaluation of Political Bias in Multi-News Summarisation
- Ideological Bias in LLMs' Economic Causal Reasoning
- From If-Statements to ML Pipelines: Revisiting Bias in Code-Generation
- Measuring Opinion Bias and Sycophancy via LLM-based Coercion
- Common approach:
- Evaluate directional asymmetries (intervention vs market sign accuracy; centrist underrepresentation).
- Move from toy proxies to realistic mechanisms (feature selection in ML pipelines; multi-doc viewpoint distributions; multi-turn debate pressure).
- Test mitigations (prompts, judge-selection, one-shot examples) and report limited/variable effectiveness.
- Open questions / failure modes:
- Prompt-based debiasing is inconsistent; entity sentiment preservation is resistant (FairNews).
- One-shot steering doesn’t reliably remove directional skew and can inflate confidence (economic causal reasoning).
- Multi-turn debate can dramatically increase sycophancy vs direct probes (llm-bias-bench).
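The "directional asymmetry" framing can be illustrated with a tiny invented example: compute sign accuracy separately on intervention-favoring and market-favoring causal claims, so that a model which looks fine on average still shows a measurable directional gap. The data and split are fabricated for illustration, not from the paper.

```python
def sign_accuracy(preds, golds):
    """Fraction of items where the predicted effect sign matches gold."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

# +1 = effect present, -1 = effect absent (toy labels, invented data).
intervention = {"gold": [1, 1, -1, 1], "pred": [1, 1, 1, 1]}
market = {"gold": [1, -1, -1, 1], "pred": [-1, -1, -1, -1]}

acc_i = sign_accuracy(intervention["pred"], intervention["gold"])
acc_m = sign_accuracy(market["pred"], market["gold"])

# The gap between the two slices is the directional skew.
print(acc_i, acc_m, acc_i - acc_m)  # 0.75 0.5 0.25
```

Reporting the gap, not just the pooled accuracy, is what makes the skew visible; one-shot steering that raises pooled accuracy can leave the gap intact.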
3) Technical synthesis
- Multiple papers converge on “auditability via traces/evidence”: MCP Pitfall Lab validates via MCP traces; TraceScope (URL triage) uses immutable evidence + checklist adjudication; EngramaBench annotates evidence IDs; this is a broader shift away from trusting model narratives.
- Single-round / low-history attacks are getting stronger: ProjRes needs only single-round gradients; skill stealing claims extraction with only a few interactions; TTI exploits per-turn stateless moderation.
- Utility–privacy coupling is now empirically quantified in agent settings (CI-Work correlation between conveyance and leakage/violation), echoing DP trade-offs in federated and clinical de-ID evaluations.
- Decomposition + verification is a recurring reliability pattern: VDSP for multi-chart QA, completion verifier + loop breaker for GUI agents, consensus off-policy refinement for test-time RL.
- “Bigger model” is not a universal fix: inverse scaling for leakage (CI-Work), medium-size best fairness trade-offs (FairNews), and evaluator VLMs still have large blind spots (FOCUS).
- Preference/judge-based evaluation is itself unreliable: FOCUS shows evaluator VLM failures, and the interactive leaderboard analysis shows that preference rankings vary by slice and that human raters prefer incorrect answers on deterministic math questions 26% of the time.
- Latent interfaces are emerging as a performance lever: DiffMAS trains KV-trace communication; this parallels other work that treats non-text internal structure as optimizable rather than fixed.
- Synthetic data is used heavily but with different roles: ARFBench uses synthetic post-training plus small real set; AgenticQwen uses dual flywheels; HalluVL-DPO uses large synthetic preference data—raising common questions about bias/transfer and evaluation realism.
- Security threat models are broadening from prompt injection to supply chain + protocol + tool metadata + multimodal (BADSTYLE style triggers; MCP Pitfall Lab; skill stealing).
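The trace-grounded-validation pattern running through these papers can be sketched as a policy check over a logged tool-call trace, rather than trusting the agent's narrative. Field names and the policy schema here are illustrative, not MCP Pitfall Lab's actual format:

```python
# Per-tool policy: which arguments a tool may receive and whether it may
# touch the network. Schema is invented for illustration.
POLICY = {
    "read_file": {"allowed_args": {"path"}, "may_reach_network": False},
    "http_get": {"allowed_args": {"url"}, "may_reach_network": True},
}

def validate_trace(trace):
    """Return (index, reason) pairs for every policy violation in a trace."""
    violations = []
    for i, call in enumerate(trace):
        rule = POLICY.get(call["tool"])
        if rule is None:
            violations.append((i, "unknown tool"))
            continue
        extra = set(call["args"]) - rule["allowed_args"]
        if extra:
            violations.append((i, f"unexpected args: {sorted(extra)}"))
        if call.get("network") and not rule["may_reach_network"]:
            violations.append((i, "network access from non-network tool"))
    return violations

trace = [
    {"tool": "read_file", "args": {"path": "/etc/motd"}, "network": False},
    # An exfiltration attempt: extra argument plus unexpected network use.
    {"tool": "read_file", "args": {"path": "/tmp/x", "upload_to": "evil"}, "network": True},
]
print(validate_trace(trace))
```

Because the check runs on the immutable trace, a narrative that claims "no data left the sandbox" cannot mask the second call; that is exactly the narrative–trace divergence the testbed documents.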
4) Top 5 papers (with “why now”)
1) Toward Efficient Membership Inference Attacks against Federated Large Language Models: A Projection Residual Approach
- Shows a single-round, no-shadow-model membership inference attack tailored to FedLLMs/PEFT using projection residuals on hidden embeddings.
- Reports near-perfect AUC (often 1.00) across multiple LLMs/datasets and strong gains over prior FL MIAs.
- Evaluates defenses and finds DP only helps at utility-destroying noise, pruning only partially helps.
- Skepticism / limitation: non-trivial runtime overhead (per-layer attacks) and no utility-preserving defense proposed.
2) CI-Work: Benchmarking Contextual Integrity in Enterprise LLM Agents
- Introduces an enterprise CI benchmark with dense retrieval trajectories and explicit Essential vs Sensitive entries.
- Finds substantial violation/leakage and a measurable privacy–utility trade-off, plus inverse scaling where larger models can leak more.
- Shows user pressure can sharply increase leakage and even reduce conveyance (“lose–lose”).
- Skepticism / limitation: synthetic scenarios and LLM-judge under-reporting mean leakage is likely a lower bound; org-specific norms not captured.
3) MCP Pitfall Lab: Exposing Developer Pitfalls in MCP Tool Server Security under Multi-Vector Attacks
- Operationalizes a developer pitfall taxonomy and provides trace-grounded validators for confidentiality/integrity objectives.
- Tier-1 static analyzer achieves F1=1.0 on statically checkable pitfall classes and is CI-friendly (~5.2 ms).
- Hardening reduces findings 29→0 with mean ~27 LOC changes; also documents frequent trace–narrative divergence.
- Skepticism / limitation: evaluation scope is small (few scenarios; preliminary corpus), and multimodal analysis is not yet thorough.
4) Seeing Isn’t Believing: Uncovering Blind Spots in Evaluator Vision-Language Models
- Releases FOCUS: >4,000 human-validated perturbation instances for meta-evaluating evaluator VLMs on I2T and T2I.
- Finds high evaluator failure rates, especially in single-answer scoring; pairwise comparison is more reliable.
- Shows reasoning budget doesn’t reliably help and evaluators can note errors in text but not reflect them in scores.
- Skepticism / limitation: gold outputs are model-generated (though manually reviewed); only four evaluator VLMs tested.
5) VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation
- Targets two dominant GUI-agent failures: premature completion and loops, via completion gating + independent verifier + multi-tier loop breaker + search.
- Reports 77.45% success on OSWorld-Verified (Opus 4.6), surpassing the reported human level (72.4%), plus strong WAA results.
- Provides ablations showing which modules reduce false completion and wasted steps.
- Skepticism / limitation: tool overhead can hurt under tight budgets for weaker backbones; false completion remains a dominant failure mode for some models.
5) Practical next steps
- For federated/PEFT deployments: add a red-team audit that explicitly tests single-round gradient leakage (ProjRes-style) before shipping; treat “no raw data sharing” as insufficient.
- For enterprise agents: measure Leakage/Violation/Conveyance under dense retrieval and user pressure conditions (CI-Work-style), not just on clean prompts; track whether scaling increases leakage.
- Adopt trace-grounded security QA for tool servers: integrate Tier-1 static checks (MCP Pitfall Lab) into CI, and require protocol trace logging so validators can detect exfiltration/integrity violations.
- Harden against black-box extraction: test for skill/package leakage with automated prompt suites; consider output filtering and inference hardening, but also evaluate semantic leakage (not just exact match).
- Fix stateless moderation gaps: implement session-level aggregation or risk scoring to detect distributed multi-turn intent (TTI), and benchmark against stateless multi-turn attacks.
- Stop trusting evaluator VLMs by default: validate your evaluator on perturbation suites (FOCUS-like); prefer pairwise paradigms when feasible and monitor justification–score inconsistencies.
- For GUI/agent reliability: add explicit completion criteria + independent verifier and loop escalation; log false-completion and wasted-step ratios as first-class metrics (VLAA-GUI).
- For fairness audits: evaluate on mechanism-relevant tasks (e.g., ML pipeline feature selection, multi-doc viewpoint preservation, directional causal sign) and don’t assume larger models reduce bias.
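A CI-Work-style measurement harness reduces to two rates per scenario. The metric names come from the brief; the matching logic below is naive substring search purely for illustration (a real harness would use an LLM judge or entity matching): conveyance is the share of essential entries that reach the reply, leakage the share of sensitive entries that do.

```python
def rate(entries, reply):
    """Fraction of entries that appear (case-insensitively) in the reply."""
    if not entries:
        return 0.0
    return sum(e.lower() in reply.lower() for e in entries) / len(entries)

# Invented scenario: what the agent should convey vs. what it must withhold.
essential = ["Q3 launch date", "budget owner"]
sensitive = ["salary band", "medical leave"]

reply = "The Q3 launch date is confirmed; the budget owner and salary band are attached."

conveyance = rate(essential, reply)
leakage = rate(sensitive, reply)
print(conveyance, leakage)  # 1.0 0.5
```

Tracking both numbers per condition (dense retrieval on/off, user pressure on/off, model size) is what surfaces the privacy–utility coupling and inverse-scaling effects the benchmark reports.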
Generated from per-paper analyses; no external browsing.
