Daily AI Paper Report (2026-04-25)


Run stats

  • Candidates: 221
  • Selected: 30
  • Deepread completed: 30
  • Window (UTC): 2026-04-23T00:00:00Z → 2026-04-24T00:00:00Z (arxiv_announce, expanded=0)

Selected papers

  • 2604.21477 · MCP Pitfall Lab: Exposing Developer Pitfalls in MCP Tool Server Security under Multi-Vector Attacks
    Categories: cs.CR · Score: 95
    Why: Protocol-aware MCP security testbed w/ reproducible pitfalls, traces, validators; multi-vector attacks
    Tags: agents, MCP, tool-security, prompt-injection, supply-chain, benchmark, evaluation
  • 2604.21860 · Transient Turn Injection: Exposing Stateless Multi-Turn Vulnerabilities in Large Language Models
    Categories: cs.CR, cs.AI · Score: 93
    Why: New multi-turn jailbreak exploiting stateless moderation; broad eval across frontier & OSS models
    Tags: jailbreaks, multi-turn, moderation, adversarial, red-teaming, security
  • 2604.21211 · Subject-level Inference for Realistic Text Anonymization Evaluation
    Categories: cs.CL · Score: 93
    Why: New benchmark shows span-masking can still leak identity via subject-level inference
    Tags: privacy, anonymization, PII, evaluation, inference-attacks, benchmarks
  • 2604.21308 · CI-Work: Benchmarking Contextual Integrity in Enterprise LLM Agents
    Categories: cs.CR, cs.CL · Score: 92
    Why: Enterprise agent privacy benchmark grounded in contextual integrity; shows utility–leakage trade-off
    Tags: agents, privacy, information-flow, benchmark, RAG, enterprise, evaluation
  • 2604.21255 · When Agents Look the Same: Quantifying Distillation-Induced Similarity in Tool-Use Behaviors
    Categories: cs.CL · Score: 92
    Why: New metrics quantify distillation-driven homogenization in agent tool-use; useful for auditing ecosystem risk
    Tags: agents, tool-use, distillation, behavioral-similarity, evaluation, model-auditing
  • 2604.21827 · Alignment has a Fantasia Problem
    Categories: cs.AI, cs.HC · Score: 91
    Why: Alignment framing: users lack fixed goals; proposes intent-formation support to avoid failures
    Tags: alignment, HCI, goal-ambiguity, agent-assistants, human-factors
  • 2604.21829 · Black-Box Skill Stealing Attack from Proprietary LLM Agents: An Empirical Study
    Categories: cs.CR · Score: 90
    Why: First empirical black-box study of stealing proprietary agent skills; taxonomy + attack surface
    Tags: agents, model-extraction, prompt-stealing, IP, security, threat-model
  • 2604.21564 · Measuring Opinion Bias and Sycophancy via LLM-based Coercion
    Categories: cs.CL · Score: 90
    Why: Open-source bench to elicit latent opinions/sycophancy in realistic multi-turn coercion settings
    Tags: sycophancy, bias, evaluation, multi-turn, benchmarks, red-teaming
  • 2604.21229 · EngramaBench: Evaluating Long-Term Conversational Memory with Structured Graph Retrieval
    Categories: cs.CL, cs.AI · Score: 90
    Why: Benchmark for long-term conversational memory; compares graph vs vector vs full-context; includes adversarial abstention
    Tags: long-term-memory, benchmarks, RAG, graph-retrieval, evaluation, assistants
  • 2604.21840 · TraceScope: Interactive URL Triage via Decoupled Checklist Adjudication
    Categories: cs.CR, cs.AI · Score: 88
    Why: Sandboxed operator+adjudicator agents for safe interactive phishing URL triage; evidence bundling
    Tags: agentic-systems, sandboxing, cybersecurity, phishing, tool-use, evaluation
  • 2604.21523 · Seeing Isn't Believing: Uncovering Blind Spots in Evaluator Vision-Language Models
    Categories: cs.CV, cs.CL · Score: 88
    Why: Benchmark exposes reliability blind spots of VLMs used as evaluators across I2T/T2I perturbations
    Tags: VLM, LLM-as-judge, evaluation, robustness, hallucinations, benchmarks
  • 2604.21334 · Ideological Bias in LLMs' Economic Causal Reasoning
    Categories: cs.AI, cs.CE, cs.CL, cs.LG, econ.GN · Score: 88
    Why: Large-scale eval of ideological bias in economic causal reasoning; ideology-contested subset from verified effects
    Tags: bias, causal-reasoning, evaluation, economics, benchmarks, LLMs
  • 2604.21794 · Learning to Communicate: Toward End-to-End Optimization of Multi-Agent Language Systems
    Categories: cs.AI, cs.CL, cs.MA · Score: 88
    Why: End-to-end learned latent inter-agent communication; could reshape multi-agent LLM system design
    Tags: multi-agent, communication, latent-interfaces, training, LLM-agents
  • 2604.21700 · Stealthy Backdoor Attacks against LLMs Based on Natural Style Triggers
    Categories: cs.CR, cs.AI, cs.CL · Score: 86
    Why: Stealthy LLM backdoors via natural style triggers; clearer end-to-end threat model & pipeline
    Tags: backdoors, data-poisoning, LLM-security, style-triggers, supply-chain
  • 2604.21911 · When Prompts Override Vision: Prompt-Induced Hallucinations in LVLMs
    Categories: cs.CV, cs.AI, cs.CL, cs.LG · Score: 86
    Why: HalluScope isolates prompt-induced LVLM hallucinations; highlights instruction priors as key driver
    Tags: LVLM, hallucinations, prompting, robustness, benchmark, grounding
  • 2604.21590 · AgenticQwen: Training Small Agentic Language Models with Dual Data Flywheels for Industrial-Scale Tool Use
    Categories: cs.CL · Score: 86
    Why: Industrial small agentic LMs trained with multi-round RL + dual data flywheels for tool use; high practical impact
    Tags: agents, tool-use, reinforcement-learning, small-models, synthetic-data, post-training
  • 2604.21593 · Language as a Latent Variable for Reasoning Optimization
    Categories: cs.CL · Score: 86
    Why: Polyglot prompting/RL idea: language as latent variable can improve reasoning accuracy
    Tags: reasoning, multilingual, RLHF, GRPO, inference-strategies
  • 2604.21816 · Tool Attention Is All You Need: Dynamic Tool Gating and Lazy Schema Loading for Eliminating the MCP/Tools Tax in Scalable Agentic Workflows
    Categories: cs.AI · Score: 85
    Why: Cuts MCP/tools token overhead via dynamic tool gating + lazy schema loading; claims big token savings
    Tags: agents, tool-use, efficiency, long-context, MCP, systems
  • 2604.21375 · VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation
    Categories: cs.CL, cs.AI, cs.SE · Score: 85
    Why: GUI agent framework with mandatory verifier + loop breaker to prevent premature stops and loops
    Tags: agents, GUI automation, verification, reliability, tool-use, agent safety
  • 2604.21327 · Understanding and Mitigating Spurious Signal Amplification in Test-Time Reinforcement Learning for Math Reasoning
    Categories: cs.LG, cs.AI, cs.CL · Score: 85
    Why: Analyzes spurious reward signals in test-time RL for math; proposes debias/denoise framework to reduce noise
    Tags: test-time-training, reinforcement-learning, reasoning, robustness, math, optimization
  • 2604.21199 · ARFBench: Benchmarking Time Series Question Answering Ability for Software Incident Response
    Categories: cs.LG, cs.CV · Score: 85
    Why: ARFBench TSQA for incident response; evaluates FMs on telemetry anomaly reasoning
    Tags: evaluation, benchmarks, time-series, incident-response, multimodal, ops
  • 2604.21571 · Separable Expert Architecture: Toward Privacy-Preserving LLM Personalization via Composable Adapters and Deletable User Proxies
    Categories: cs.AI, cs.LG · Score: 84
    Why: Personalization w/ deletable per-user proxies enabling deterministic unlearning; reduces cross-user leakage
    Tags: privacy, unlearning, personalization, LoRA, adapters, data-deletion
  • 2604.21214 · SQLyzr: A Comprehensive Benchmark and Evaluation Platform for Text-to-SQL
    Categories: cs.DB, cs.AI · Score: 84
    Why: Text-to-SQL evaluation platform with realistic workload alignment + fine-grained metrics beyond a single score
    Tags: text-to-sql, evaluation, benchmarks, databases, LLMs, metrics
  • 2604.21716 · From If-Statements to ML Pipelines: Revisiting Bias in Code-Generation
    Categories: cs.CL, cs.SE · Score: 83
    Why: Shows codegen bias is underestimated: generated ML pipelines include sensitive attributes in 87.7% of cases
    Tags: code generation, bias, fairness, evaluation, ML pipelines, safety
  • 2604.21421 · Differentially Private De-identification of Dutch Clinical Notes: A Comparative Evaluation
    Categories: cs.CR, cs.AI, cs.CL · Score: 83
    Why: Comparative study of DP vs NER vs LLMs for clinical note de-ID (Dutch); directly relevant to privacy in LLM pipelines
    Tags: privacy, differential-privacy, de-identification, clinical-NLP, LLMs, security
  • 2604.21344 · Beyond Single Plots: A Benchmark for Question Answering on Multi-Charts
    Categories: cs.CL, cs.AI, cs.CV, cs.LG, cs.MA · Score: 83
    Why: PolyChartQA benchmark exposes a large drop for VLMs on multi-chart reasoning
    Tags: multimodal, VLM, benchmark, chart-QA, reasoning, evaluation
  • 2604.21197 · Toward Efficient Membership Inference Attacks against Federated Large Language Models: A Projection Residual Approach
    Categories: cs.LG · Score: 81
    Why: Membership inference tailored to federated LLM fine-tuning; projection-residual method on gradients
    Tags: privacy, membership-inference, federated-learning, LLMs, security
  • 2604.21309 · When Bigger Isn't Better: A Comprehensive Fairness Evaluation of Political Bias in Multi-News Summarisation
    Categories: cs.CL · Score: 81
    Why: Large fairness eval of political bias in multi-news summarization across 13 LLMs + metrics
    Tags: fairness, bias, summarization, evaluation, politics, LLMs
  • 2604.21854 · Bounding the Black Box: A Statistical Certification Framework for AI Risk Regulation
    Categories: cs.AI · Score: 80
    Why: Proposes statistical certification to quantify/verify acceptable risk for AI regulation compliance
    Tags: AI regulation, risk certification, assurance, governance, deployment safety
  • 2604.21769 · Who Defines "Best"? Towards Interactive, User-Defined Evaluation of LLM Leaderboards
    Categories: cs.AI, cs.CY, cs.HC · Score: 80
    Why: Shows leaderboard rankings depend on prompt slices; proposes interactive user-defined evaluation of LLM leaderboards
    Tags: evaluation, leaderboards, LMArena, benchmarking, human-preferences, governance

AI Paper Insight Brief

2026-04-25

1) Executive takeaways (read this first)

  • “Gradient-only” and “federated” are not privacy shields for LLM fine-tuning: a single round of PEFT gradients can enable near-perfect membership inference via a simple projection-residual test (ProjRes), and lightweight defenses only help when they also crush utility.
  • Enterprise agent privacy is failing in realistic dense-retrieval workflows: CI-Work shows substantial leakage/violation rates and a clear privacy–utility coupling; “try harder / bigger model” can increase leakage (inverse scaling) and user pressure makes things worse.
  • Tool/agent security is shifting from prompt injection to protocol + developer pitfalls + trace auditing: MCP Pitfall Lab shows deterministic static checks can eliminate many server-side pitfalls cheaply, while black-box “skill stealing” and stateless multi-turn attacks (TTI) demonstrate how much can leak through normal interfaces.
  • Evaluation itself is a growing attack surface and failure point: evaluator VLMs miss obvious degradations (FOCUS), and multi-chart QA + time-series incident QA benchmarks show large capability gaps precisely where real-world reasoning is compositional and cross-context.
  • Reliability gains are coming from “systems”, not just models: GUI automation improves by enforcing completion verification + loop recovery (VLAA-GUI), and multi-agent systems improve by learning latent communication (DiffMAS) rather than exchanging only text.
  • Bias/fairness findings are increasingly “non-monotonic with scale” and task-dependent: medium-sized models can be best for political fairness in summarization, and code-generation bias looks far worse when you evaluate realistic ML pipelines (feature selection) rather than toy if-statements.

2) Key themes (clusters)

  • Federated & personalized LLM privacy is brittle (and needs new primitives)
  • Contextual privacy for agents in enterprise/tool ecosystems
  • Benchmarks are getting more realistic, and models look worse on compositional, cross-context tasks
  • Reliability via explicit verification, recovery, and learned coordination
  • Bias/fairness measurement is moving to “mechanism-relevant” tasks (and scale isn’t a fix)

3) Technical synthesis

  • Multiple papers converge on “auditability via traces/evidence”: MCP Pitfall Lab validates via MCP traces; TraceScope (URL triage) uses immutable evidence + checklist adjudication; EngramaBench annotates evidence IDs. This marks a broader shift away from trusting model narratives (a minimal trace–narrative divergence check is sketched after this list).
  • Single-round / low-history attacks are getting stronger: ProjRes needs only single-round gradients; skill stealing claims extraction with only a few interactions; TTI exploits per-turn stateless moderation.
  • Utility–privacy coupling is now empirically quantified in agent settings (CI-Work correlation between conveyance and leakage/violation), echoing DP trade-offs in federated and clinical de-ID evaluations.
  • Decomposition + verification is a recurring reliability pattern: VDSP for multi-chart QA, completion verifier + loop breaker for GUI agents, consensus off-policy refinement for test-time RL.
  • “Bigger model” is not a universal fix: inverse scaling for leakage (CI-Work), medium-size best fairness trade-offs (FairNews), and evaluator VLMs still have large blind spots (FOCUS).
  • Preference/judge-based evaluation is itself unreliable: FOCUS shows evaluator-VLM failures, and the interactive leaderboard analysis shows that preference rankings vary by prompt slice and that humans pick the wrong answer on deterministic math questions 26% of the time.
  • Latent interfaces are emerging as a performance lever: DiffMAS trains KV-trace communication; this parallels other work that treats non-text internal structure as optimizable rather than fixed.
  • Synthetic data is used heavily but with different roles: ARFBench uses synthetic post-training plus small real set; AgenticQwen uses dual flywheels; HalluVL-DPO uses large synthetic preference data—raising common questions about bias/transfer and evaluation realism.
  • Security threat models are broadening from prompt injection to supply chain + protocol + tool metadata + multimodal (BADSTYLE style triggers; MCP Pitfall Lab; skill stealing).
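
None of these papers' auditing code is included in this brief; the Python sketch below only illustrates the shared trace-grounded pattern named above: check the model's narrative against an immutable trace and treat divergence in either direction as an audit finding. The event fields (`tool`, `target`) and example values are hypothetical.

```python
# Minimal sketch of trace-grounded auditing: trust the logged protocol trace,
# not the model's narrative. All event fields and names here are hypothetical.

def narrative_trace_divergence(claimed_actions, trace_events):
    """Return (claims the trace does not corroborate, traced actions the
    narrative omits). Divergence in either direction is an audit finding."""
    claimed = set(claimed_actions)
    observed = {(e["tool"], e["target"]) for e in trace_events}
    return claimed - observed, observed - claimed

trace = [
    {"tool": "read_file", "target": "/etc/passwd"},
    {"tool": "http_get", "target": "https://internal.example/secrets"},
]
claims = [("read_file", "/etc/passwd")]  # the narrative never mentions the HTTP call

unsupported, unreported = narrative_trace_divergence(claims, trace)
print("claims without trace support:", unsupported)        # set()
print("traced actions missing from narrative:", unreported)
```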

4) Top 5 papers (with “why now”)

1) Toward Efficient Membership Inference Attacks against Federated Large Language Models: A Projection Residual Approach

  • Shows a single-round, no-shadow-model membership inference attack tailored to FedLLMs/PEFT, using projection residuals on hidden embeddings (the generic idea is sketched below).
  • Reports near-perfect AUC (often 1.00) across multiple LLMs/datasets and strong gains over prior FL MIAs.
  • Evaluates defenses and finds that DP helps only at utility-destroying noise levels, while pruning helps only partially.
  • Skepticism / limitation: non-trivial runtime overhead (per-layer attacks) and no utility-preserving defense proposed.
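
The paper's exact ProjRes construction is not reproduced here; the NumPy sketch below shows only the generic projection-residual idea under stated assumptions: given update directions observed in a single round, a candidate vector that the subspace explains well (small residual) is flagged as a likely member. The dimensions, synthetic data, and decision rule are all illustrative.

```python
import numpy as np

def projection_residual(candidate, directions):
    """Norm of the component of `candidate` orthogonal to the subspace
    spanned by the observed gradient/update directions (shape d x k)."""
    q, _ = np.linalg.qr(directions)  # orthonormal basis of the subspace
    return np.linalg.norm(candidate - q @ (q.T @ candidate))

rng = np.random.default_rng(0)
d, k = 256, 8
grads = rng.normal(size=(d, k))     # stand-in for one round of PEFT gradients
member = grads @ rng.normal(size=k) + 0.01 * rng.normal(size=d)  # near the subspace
non_member = rng.normal(size=d)

# Lower residual => the candidate is better explained by the observed update
# => more likely a training member. A real attack calibrates the threshold.
print(projection_residual(member, grads) < projection_residual(non_member, grads))  # True
```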

2) CI-Work: Benchmarking Contextual Integrity in Enterprise LLM Agents

  • Introduces an enterprise CI benchmark with dense-retrieval trajectories and explicit Essential vs Sensitive entries (a toy metric sketch follows this item).
  • Finds substantial violation/leakage and a measurable privacy–utility trade-off, plus inverse scaling where larger models can leak more.
  • Shows user pressure can sharply increase leakage and even reduce conveyance (“lose–lose”).
  • Skepticism / limitation: synthetic scenarios and LLM-judge under-reporting mean leakage is likely a lower bound; org-specific norms not captured.
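
CI-Work's LLM-judge pipeline is not reproduced here; the toy sketch below only shows the shape of conveyance and leakage scoring against labeled Essential and Sensitive entries. Substring matching stands in for the judge (so paraphrased leakage is under-counted, echoing the limitation above), the Violation metric would additionally need norm-specific judging, and all entry texts are invented.

```python
def ci_scores(response, essential, sensitive):
    """Toy contextual-integrity scoring: conveyance over Essential entries,
    leakage over Sensitive entries. Substring matching is a crude stand-in
    for CI-Work's LLM judge and misses paraphrases."""
    text = response.lower()
    conveyed = sum(e.lower() in text for e in essential)
    leaked = sum(s.lower() in text for s in sensitive)
    return {"conveyance": conveyed / max(len(essential), 1),
            "leakage": leaked / max(len(sensitive), 1)}

essential = ["Q3 launch slips to November", "owner: platform team"]
sensitive = ["salary band L6", "pending layoff list"]
response = ("Per the planning doc, the Q3 launch slips to November "
            "(owner: platform team); note the affected salary band L6.")
print(ci_scores(response, essential, sensitive))
# {'conveyance': 1.0, 'leakage': 0.5} -- useful output, but it leaked
```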

3) MCP Pitfall Lab: Exposing Developer Pitfalls in MCP Tool Server Security under Multi-Vector Attacks

  • Operationalizes a developer pitfall taxonomy and provides trace-grounded validators for confidentiality/integrity objectives.
  • The Tier-1 static analyzer achieves F1 = 1.0 on statically checkable pitfall classes and is CI-friendly (~5.2 ms); its general shape is sketched below.
  • Hardening reduces findings from 29 to 0 with a mean of ~27 LOC changed; the paper also documents frequent trace–narrative divergence.
  • Skepticism / limitation: evaluation scope is small (few scenarios; preliminary corpus), and multimodal analysis is not yet thorough.
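
The paper's analyzer and pitfall taxonomy are not reproduced here; the sketch below shows only what a deterministic, CI-friendly Tier-1 check can look like: a pattern scan over a tool server's Python sources, with made-up rules and a non-zero exit code to fail the build.

```python
import pathlib
import re
import sys

# Hypothetical Tier-1 rules: deterministic, statically checkable patterns.
# Real pitfall classes would come from the paper's taxonomy.
RULES = {
    "hardcoded-secret": re.compile(r"(api[_-]?key|token|secret)\s*=\s*['\"][^'\"]{8,}['\"]", re.I),
    "shell-injection": re.compile(r"subprocess\.\w+\([^)]*shell\s*=\s*True"),
    "eval-on-input": re.compile(r"\beval\("),
}

def scan(root="."):
    """Scan every .py file under `root`; return (rule, path) findings."""
    findings = []
    for src in pathlib.Path(root).rglob("*.py"):
        text = src.read_text(errors="ignore")
        for rule, pattern in RULES.items():
            if pattern.search(text):
                findings.append((rule, str(src)))
    return findings

if __name__ == "__main__":
    hits = scan(sys.argv[1] if len(sys.argv) > 1 else ".")
    for rule, path in hits:
        print(f"PITFALL {rule}: {path}")
    sys.exit(1 if hits else 0)  # non-zero exit fails the CI job
```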

4) Seeing Isn’t Believing: Uncovering Blind Spots in Evaluator Vision-Language Models

  • Releases FOCUS: >4,000 human-validated perturbation instances for meta-evaluating evaluator VLMs on I2T and T2I (the meta-evaluation loop is sketched below).
  • Finds high evaluator failure rates, especially in single-answer scoring; pairwise comparison is more reliable.
  • Shows that a larger reasoning budget doesn’t reliably help, and that evaluators can note errors in their text justifications without reflecting them in their scores.
  • Skepticism / limitation: gold outputs are model-generated (though manually reviewed); only four evaluator VLMs tested.
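
FOCUS itself is not reimplemented here; this sketch only shows the generic meta-evaluation loop: score a known-good output and its human-validated degraded twin, then count cases where the evaluator fails to penalize the degradation. `evaluator_score` is a placeholder for whatever judge model or API you use; the toy judge and example pair are invented.

```python
def evaluator_failure_rate(pairs, evaluator_score):
    """pairs: (prompt, good_output, degraded_output) triples where the
    degradation is known and human-validated. A failure is any case where
    the evaluator does not score the degraded output strictly lower."""
    failures = sum(
        evaluator_score(prompt, bad) >= evaluator_score(prompt, good)
        for prompt, good, bad in pairs
    )
    return failures / len(pairs)

# Toy stand-in judge that rewards length -- blind to content corruption.
naive_judge = lambda prompt, output: float(len(output))
pairs = [("describe the image",
          "a red car parked beside a tree",
          "a red car parked beside a tree tree tree")]  # degraded but longer
print(evaluator_failure_rate(pairs, naive_judge))  # 1.0 -- the judge misses it
```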

5) VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation

  • Targets the two dominant GUI-agent failure modes, premature completion and repetitive loops, via completion gating, an independent verifier, a multi-tier loop breaker, and search (control loop sketched below).
  • Reports 77.45% success on OSWorld-Verified (Opus 4.6), surpassing the reported human level (72.4%), plus strong WAA results.
  • Provides ablations showing which modules reduce false completion and wasted steps.
  • Skepticism / limitation: tool overhead can hurt under tight budgets for weaker backbones; false completion remains a dominant failure mode for some models.
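
VLAA-GUI's modules are not reproduced here; the control-loop sketch below only illustrates the two guards named above, with `env`, `policy`, and `verifier` as placeholders: an independent verifier must confirm any DONE claim, and a repeated observation hash triggers recovery instead of another lap around the loop.

```python
import hashlib
from collections import Counter

def run_episode(env, policy, verifier, max_steps=50, loop_limit=3):
    """Two reliability guards: (1) completion gate -- the policy's DONE claim
    must be confirmed by an independent verifier; (2) loop breaker -- seeing
    the same observation too often forces a recovery action instead."""
    seen = Counter()
    obs = env.reset()
    for _ in range(max_steps):
        state_id = hashlib.sha256(repr(obs).encode()).hexdigest()
        seen[state_id] += 1
        action = policy(obs)
        if seen[state_id] > loop_limit:
            action = "RECOVER"        # e.g. go back, re-plan, or search
        if action == "DONE":
            if verifier(obs):         # independent completion check
                return "success"
            action = "KEEP_WORKING"   # false completion caught by the gate
        obs = env.step(action)
    return "budget_exhausted"
```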

5) Practical next steps

  • For federated/PEFT deployments: add a red-team audit that explicitly tests single-round gradient leakage (ProjRes-style) before shipping; treat “no raw data sharing” as insufficient.
  • For enterprise agents: measure Leakage/Violation/Conveyance under dense retrieval and user pressure conditions (CI-Work-style), not just on clean prompts; track whether scaling increases leakage.
  • Adopt trace-grounded security QA for tool servers: integrate Tier-1 static checks (MCP Pitfall Lab) into CI, and require protocol trace logging so validators can detect exfiltration/integrity violations.
  • Harden against black-box extraction: test for skill/package leakage with automated prompt suites; consider output filtering and inference hardening, but also evaluate semantic leakage (not just exact match).
  • Fix stateless moderation gaps: implement session-level aggregation or risk scoring to detect distributed multi-turn intent (TTI), and benchmark against stateless multi-turn attacks (a minimal session gate is sketched after this list).
  • Stop trusting evaluator VLMs by default: validate your evaluator on perturbation suites (FOCUS-like); prefer pairwise paradigms when feasible and monitor justification–score inconsistencies.
  • For GUI/agent reliability: add explicit completion criteria + independent verifier and loop escalation; log false-completion and wasted-step ratios as first-class metrics (VLAA-GUI).
  • For fairness audits: evaluate on mechanism-relevant tasks (e.g., ML pipeline feature selection, multi-doc viewpoint preservation, directional causal sign) and don’t assume larger models reduce bias.
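
TTI's attack and any particular vendor moderation API are not reproduced; the sketch below only illustrates the defensive idea from the stateless-moderation bullet above: keep per-turn scoring, but also gate on a rolling session-level aggregate so intent spread thinly across turns still trips the filter. The `turn_risk` classifier, window size, and both thresholds are assumptions to tune.

```python
from collections import deque

class SessionRiskGate:
    """Per-turn moderation is stateless and misses intent distributed across
    turns. This gate also blocks when a rolling session aggregate crosses a
    threshold, even if every individual turn looks benign."""

    def __init__(self, turn_risk, window=10, turn_thresh=0.8, session_thresh=2.0):
        self.turn_risk = turn_risk       # placeholder classifier: text -> [0, 1]
        self.scores = deque(maxlen=window)
        self.turn_thresh = turn_thresh
        self.session_thresh = session_thresh

    def allow(self, turn_text):
        risk = self.turn_risk(turn_text)
        self.scores.append(risk)
        if risk >= self.turn_thresh:     # classic stateless per-turn block
            return False
        return sum(self.scores) < self.session_thresh  # distributed-intent block

gate = SessionRiskGate(turn_risk=lambda t: 0.4 if "step" in t else 0.0)
turns = ["step 1 of the recipe", "step 2", "step 3", "step 4", "step 5 please"]
print([gate.allow(t) for t in turns])  # [True, True, True, True, False]
```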

Generated from per-paper analyses; no external browsing.