Daily AI Paper Report (2026-04-25)
Published:
Chinese version: [中文]
Run stats
- Candidates: 221
- Selected: 30
- Deepread completed: 30
- Window (UTC): 2026-04-23T00:00:00Z → 2026-04-24T00:00:00Z (arxiv_announce, expanded=0)
Selected papers
| arXiv ID | Title / Links | Categories | Score | Why | Tags |
|---|---|---|---|---|---|
| 2604.21477 | MCP Pitfall Lab: Exposing Developer Pitfalls in MCP Tool Server Security under Multi-Vector Attacks | cs.CR | 95 | Protocol-aware MCP security testbed w/ reproducible pitfalls, traces, validators; multi-vector attacks | agents, MCP, tool-security, prompt-injection, supply-chain, benchmark, evaluation |
| 2604.21860 | Transient Turn Injection: Exposing Stateless Multi-Turn Vulnerabilities in Large Language Models | cs.CR, cs.AI | 93 | New multi-turn jailbreak exploiting stateless moderation; broad eval across frontier & OSS models | jailbreaks, multi-turn, moderation, adversarial, red-teaming, security |
| 2604.21211 | Subject-level Inference for Realistic Text Anonymization Evaluation | cs.CL | 93 | New benchmark shows span-masking can still leak identity via subject-level inference. | privacy, anonymization, PII, evaluation, inference-attacks, benchmarks |
| 2604.21308 | CI-Work: Benchmarking Contextual Integrity in Enterprise LLM Agents | cs.CR, cs.CL | 92 | Enterprise agent privacy benchmark grounded in contextual integrity; shows utility–leakage trade-off | agents, privacy, information-flow, benchmark, RAG, enterprise, evaluation |
| 2604.21255 | When Agents Look the Same: Quantifying Distillation-Induced Similarity in Tool-Use Behaviors | cs.CL | 92 | New metrics quantify distillation-driven homogenization in agent tool-use; useful for auditing ecosystem risk | agents, tool-use, distillation, behavioral-similarity, evaluation, model-auditing |
| 2604.21827 | Alignment has a Fantasia Problem | cs.AI, cs.HC | 91 | Alignment framing: users lack fixed goals; proposes intent-formation support to avoid failures. | alignment, HCI, goal-ambiguity, agent-assistants, human-factors |
| 2604.21829 | Black-Box Skill Stealing Attack from Proprietary LLM Agents: An Empirical Study | cs.CR | 90 | First empirical black-box study of stealing proprietary agent skills; taxonomy + attack surface | agents, model-extraction, prompt-stealing, IP, security, threat-model |
| 2604.21564 | Measuring Opinion Bias and Sycophancy via LLM-based Coercion | cs.CL | 90 | Open-source bench to elicit latent opinions/sycophancy in realistic multi-turn coercion settings | sycophancy, bias, evaluation, multi-turn, benchmarks, red-teaming |
| 2604.21229 | EngramaBench: Evaluating Long-Term Conversational Memory with Structured Graph Retrieval | cs.CL, cs.AI | 90 | Benchmark for long-term conversational memory + compares graph vs vector vs full-context; includes adversarial abstention | long-term-memory, benchmarks, RAG, graph-retrieval, evaluation, assistants |
| 2604.21840 | TraceScope: Interactive URL Triage via Decoupled Checklist Adjudication | cs.CR, cs.AI | 88 | Sandboxed operator+adjudicator agents for safe interactive phishing URL triage; evidence bundling | agentic-systems, sandboxing, cybersecurity, phishing, tool-use, evaluation |
| 2604.21523 | Seeing Isn't Believing: Uncovering Blind Spots in Evaluator Vision-Language Models | cs.CV, cs.CL | 88 | Benchmark exposes reliability blind spots of VLMs used as evaluators across I2T/T2I perturbations | VLM, LLM-as-judge, evaluation, robustness, hallucinations, benchmarks |
| 2604.21334 | Ideological Bias in LLMs' Economic Causal Reasoning | cs.AI, cs.CE, cs.CL, cs.LG, econ.GN | 88 | Large-scale eval of ideological bias in economic causal reasoning; ideology-contested subset from verified effects | bias, causal-reasoning, evaluation, economics, benchmarks, LLMs |
| 2604.21794 | Learning to Communicate: Toward End-to-End Optimization of Multi-Agent Language Systems | cs.AI, cs.CL, cs.MA | 88 | End-to-end learned latent inter-agent communication; could reshape multi-agent LLM system design. | multi-agent, communication, latent-interfaces, training, LLM-agents |
| 2604.21700 | Stealthy Backdoor Attacks against LLMs Based on Natural Style Triggers | cs.CR, cs.AI, cs.CL | 86 | Stealthy LLM backdoors via natural style triggers; clearer end-to-end threat model & pipeline | backdoors, data-poisoning, LLM-security, style-triggers, supply-chain |
| 2604.21911 | When Prompts Override Vision: Prompt-Induced Hallucinations in LVLMs | cs.CV, cs.AI, cs.CL, cs.LG | 86 | HalluScope isolates prompt-induced LVLM hallucinations; highlights instruction priors as key driver | LVLM, hallucinations, prompting, robustness, benchmark, grounding |
| 2604.21590 | AgenticQwen: Training Small Agentic Language Models with Dual Data Flywheels for Industrial-Scale Tool Use | cs.CL | 86 | Industrial small agentic LMs trained with multi-round RL + dual data flywheels for tool use; high practical impact | agents, tool-use, reinforcement-learning, small-models, synthetic-data, post-training |
| 2604.21593 | Language as a Latent Variable for Reasoning Optimization | cs.CL | 86 | Polyglot prompting/RL idea: language as latent variable can improve reasoning accuracy. | reasoning, multilingual, RLHF, GRPO, inference-strategies |
| 2604.21816 | Tool Attention Is All You Need: Dynamic Tool Gating and Lazy Schema Loading for Eliminating the MCP/Tools Tax in Scalable Agentic Workflows | cs.AI | 85 | Cuts MCP/tools token overhead via dynamic tool gating + lazy schema loading; claims big token savings | agents, tool-use, efficiency, long-context, MCP, systems |
| 2604.21375 | VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation | cs.CL, cs.AI, cs.SE | 85 | GUI agent framework with mandatory verifier + loop breaker to prevent premature stops and loops | agents, GUI automation, verification, reliability, tool-use, agent safety |
| 2604.21327 | Understanding and Mitigating Spurious Signal Amplification in Test-Time Reinforcement Learning for Math Reasoning | cs.LG, cs.AI, cs.CL | 85 | Analyzes spurious reward signals in test-time RL for math; proposes debias/denoise framework to reduce noise | test-time-training, reinforcement-learning, reasoning, robustness, math, optimization |
| 2604.21199 | ARFBench: Benchmarking Time Series Question Answering Ability for Software Incident Response | cs.LG, cs.CV | 85 | ARFBench TSQA for incident response; evaluates FMs on telemetry anomaly reasoning. | evaluation, benchmarks, time-series, incident-response, multimodal, ops |
| 2604.21571 | Separable Expert Architecture: Toward Privacy-Preserving LLM Personalization via Composable Adapters and Deletable User Proxies | cs.AI, cs.LG | 84 | Personalization w/ deletable per-user proxies enabling deterministic unlearning; reduces cross-user leak | privacy, unlearning, personalization, LoRA, adapters, data-deletion |
| 2604.21214 | SQLyzr: A Comprehensive Benchmark and Evaluation Platform for Text-to-SQL | cs.DB, cs.AI | 84 | Text-to-SQL evaluation platform with realistic workload alignment + fine-grained metrics beyond single score | text-to-sql, evaluation, benchmarks, databases, LLMs, metrics |
| 2604.21716 | From If-Statements to ML Pipelines: Revisiting Bias in Code-Generation | cs.CL, cs.SE | 83 | Shows codegen bias is underestimated: ML pipeline generation includes sensitive attrs in 87.7% cases | code generation, bias, fairness, evaluation, ML pipelines, safety |
| 2604.21421 | Differentially Private De-identification of Dutch Clinical Notes: A Comparative Evaluation | cs.CR, cs.AI, cs.CL | 83 | Comparative study of DP vs NER vs LLMs for clinical note de-ID (Dutch); directly relevant to privacy in LLM pipelines | privacy, differential-privacy, de-identification, clinical-NLP, LLMs, security |
| 2604.21344 | Beyond Single Plots: A Benchmark for Question Answering on Multi-Charts | cs.CL, cs.AI, cs.CV, cs.LG, cs.MA | 83 | PolyChartQA benchmark exposes large drop for VLMs on multi-chart reasoning. | multimodal, VLM, benchmark, chart-QA, reasoning, evaluation |
| 2604.21197 | Toward Efficient Membership Inference Attacks against Federated Large Language Models: A Projection Residual Approach | cs.LG | 81 | Membership inference tailored to federated LLM fine-tuning; projection-residual method on gradients | privacy, membership-inference, federated-learning, LLMs, security |
| 2604.21309 | When Bigger Isn't Better: A Comprehensive Fairness Evaluation of Political Bias in Multi-News Summarisation | cs.CL | 81 | Large fairness eval of political bias in multi-news summarization across 13 LLMs + metrics. | fairness, bias, summarization, evaluation, politics, LLMs |
| 2604.21854 | Bounding the Black Box: A Statistical Certification Framework for AI Risk Regulation | cs.AI | 80 | Proposes statistical certification to quantify/verify acceptable risk for AI regulation compliance | AI regulation, risk certification, assurance, governance, deployment safety |
| 2604.21769 | Who Defines "Best"? Towards Interactive, User-Defined Evaluation of LLM Leaderboards | cs.AI, cs.CY, cs.HC | 80 | Shows leaderboard rankings depend on prompt slices; proposes interactive user-defined evaluation of LLM leaderboards | evaluation, leaderboards, LMArena, benchmarking, human-preferences, governance |
AI Paper Insight Brief
2026-04-25
0) Executive takeaways (read this first)
- “Gradient-only” and “federated” are not privacy shields for LLM fine-tuning: a single round of PEFT gradients can enable near-perfect membership inference via a simple projection-residual test (ProjRes), and lightweight defenses only help when they also crush utility.
- Enterprise agent privacy is failing in realistic dense-retrieval workflows: CI-Work shows substantial leakage/violation rates and a clear privacy–utility coupling; “try harder / bigger model” can increase leakage (inverse scaling) and user pressure makes things worse.
- Tool/agent security is shifting from prompt injection to protocol + developer pitfalls + trace auditing: MCP Pitfall Lab shows deterministic static checks can eliminate many server-side pitfalls cheaply, while black-box “skill stealing” and stateless multi-turn attacks (TTI) demonstrate how much can leak through normal interfaces.
- Evaluation itself is a growing attack surface and failure point: evaluator VLMs miss obvious degradations (FOCUS), and multi-chart QA + time-series incident QA benchmarks show large capability gaps precisely where real-world reasoning is compositional and cross-context.
- Reliability gains are coming from “systems” not just models: GUI automation improves by enforcing completion verification + loop recovery (VLAA-GUI), and multi-agent systems improve by learning latent communication (DiffMAS) rather than exchanging only text.
- Bias/fairness findings are increasingly “non-monotonic with scale” and task-dependent: medium-sized models can be best for political fairness in summarization, and code-generation bias looks far worse when you evaluate realistic ML pipelines (feature selection) rather than toy if-statements.
2) Key themes (clusters)
Theme: Federated & personalized LLM privacy is brittle (and needs new primitives)
- Why it matters: Federated/PEFT deployments and personalization are moving into regulated domains, but both gradient leakage and “entangled weights” make deletion and privacy guarantees fragile.
- Representative papers:
- Toward Efficient Membership Inference Attacks against Federated Large Language Models: A Projection Residual Approach
- Separable Expert Architecture: Toward Privacy-Preserving LLM Personalization via Composable Adapters and Deletable User Proxies
- Common approach:
- Exploit/avoid PEFT structure (adapters/LoRA) as the key locus of privacy risk/control.
- Treat privacy as auditable signals: residuals from gradient subspaces (attack) vs. KL-to-baseline verification after deletion (defense-by-design).
- Emphasize single-round practicality (attacks) and deterministic deletion (architectures).
- Open questions / failure modes:
- Utility-preserving defenses against projection-style MIAs remain unclear (DP/pruning trade off sharply).
- Proxy artifacts concentrate user info (SEA) and become an exfiltration target.
- How these methods behave under stronger adversaries (semantic MIAs; cross-model proxy transfer) is unresolved.
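The projection-residual idea behind these attacks can be illustrated with a minimal NumPy sketch. This is not the paper's construction (it operates on hidden embeddings from real PEFT gradients); here the "member gradients" are synthetic vectors lying near a low-rank subspace, and the residual norm after projecting onto that fitted subspace separates members from non-members:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: member gradients lie near a low-rank subspace; non-member
# gradients do not. Dimensions are illustrative, not from the paper.
dim, rank = 64, 4
basis_true = rng.standard_normal((dim, rank))
members = basis_true @ rng.standard_normal((rank, 32)) + 0.01 * rng.standard_normal((dim, 32))
nonmembers = rng.standard_normal((dim, 32))

# Fit an orthonormal basis for the member-gradient subspace via SVD.
U, _, _ = np.linalg.svd(members, full_matrices=False)
Q = U[:, :rank]

def residual(g: np.ndarray) -> float:
    """Norm of the component of g orthogonal to the fitted subspace."""
    return float(np.linalg.norm(g - Q @ (Q.T @ g)))

member_scores = [residual(members[:, i]) for i in range(32)]
nonmember_scores = [residual(nonmembers[:, i]) for i in range(32)]

# Members have much smaller residuals, so a simple threshold works.
print(max(member_scores) < min(nonmember_scores))
```

The point of the sketch: when fine-tuning updates concentrate in a low-dimensional subspace, membership becomes a cheap geometric test, which is why DP noise only helps once it is large enough to wash out that structure.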
Theme: Contextual privacy for agents in enterprise/tool ecosystems
- Why it matters: Real agents operate over dense internal context and tool protocols; privacy failures are often systemic (retrieval density, user pressure, protocol surfaces), not just “bad prompts.”
- Representative papers:
- CI-Work: Benchmarking Contextual Integrity in Enterprise LLM Agents
- MCP Pitfall Lab: Exposing Developer Pitfalls in MCP Tool Server Security under Multi-Vector Attacks
- Black-Box Skill Stealing Attack from Proprietary LLM Agents: An Empirical Study
- Transient Turn Injection: Exposing Stateless Multi-Turn Vulnerabilities in Large Language Models
- Common approach:
- Build workflow-grounded benchmarks with explicit privacy objectives (Leakage/Violation/Conveyance; trace validators).
- Use trace-grounded evaluation rather than trusting agent narratives (MCP Pitfall Lab; also highlights narrative–trace divergence).
- Attack via normal interfaces: repeated black-box queries (skill stealing), stateless multi-turn accumulation (TTI), tool metadata/supply chain vectors (MCP pitfalls).
- Open questions / failure modes:
- Prompt defenses reduce leakage but often reduce utility; user pressure can create “lose–lose” outcomes (CI-Work).
- Detectors/filters show strong results on constructed sets (skill stealing), but robustness to real benign traffic distributions is uncertain.
- Stateless moderation is structurally vulnerable unless session-level aggregation is adopted (TTI).
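The session-level aggregation fix for stateless moderation can be sketched in a few lines. All names and thresholds here are illustrative, not from the TTI paper: each turn gets a risk score, a stateless filter only blocks when one turn crosses the threshold, and a session-level aggregator also tracks (decayed) accumulated risk:

```python
# Thresholds and decay are arbitrary illustrative values.
PER_TURN_THRESHOLD = 0.8
SESSION_THRESHOLD = 1.5
DECAY = 0.9  # older turns count slightly less

def stateless_block(turn_risks):
    """Per-turn moderation: blocks only if a single turn is risky enough."""
    return any(r >= PER_TURN_THRESHOLD for r in turn_risks)

def session_block(turn_risks):
    """Session-level moderation: also accumulates risk across turns."""
    score = 0.0
    for r in turn_risks:
        score = DECAY * score + r
        if r >= PER_TURN_THRESHOLD or score >= SESSION_THRESHOLD:
            return True
    return False

# A "distributed intent" session: every turn looks individually benign.
turns = [0.5, 0.6, 0.55, 0.6]
print(stateless_block(turns), session_block(turns))  # False True
```

The stateless filter passes every turn, while the aggregator flags the session once cumulative risk crosses the budget; this is the structural gap TTI-style attacks exploit.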
Theme: Benchmarks are getting more realistic—and models look worse on compositional, cross-context tasks
- Why it matters: As benchmarks move from synthetic/single-turn to real incidents, multi-chart figures, and long-term memory, the remaining gaps become clearer and more actionable for product teams.
- Representative papers:
- ARFBench: Benchmarking Time Series Question Answering Ability for Software Incident Response
- Beyond Single Plots: A Benchmark for Question Answering on Multi-Charts
- EngramaBench: Evaluating Long-Term Conversational Memory with Structured Graph Retrieval
- Who Defines “Best”? Towards Interactive, User-Defined Evaluation of LLM Leaderboards
- Common approach:
- Use realistic artifacts (production incidents; paper figures; multi-session histories; large preference logs).
- Slice performance by difficulty families (tiers, question types, cross-space vs single-space, etc.).
- Introduce system-level baselines (TSFM–VLM hybrids; graph memory vs vector retrieval vs full-context; decomposition+verification pipelines).
- Open questions / failure modes:
- Cross-series/time reasoning remains weak (ARFBench Tier III; EngramaBench temporal slice).
- Multi-chart localization + retrieval questions cause large drops; decomposition helps but costs more (PolyChartQA + VDSP).
- “Global leaderboard rank” can be misleading; slice weighting changes decisions (interactive leaderboard work).
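The slice-weighting point can be made concrete with a toy re-ranking (all scores invented): the same per-slice results produce different "best" models under uniform versus user-defined weights, which is the core argument of the interactive-leaderboard work.

```python
# Per-slice accuracy per model (invented numbers for illustration).
scores = {
    "model_a": {"math": 0.95, "coding": 0.60, "writing": 0.75},
    "model_b": {"math": 0.60, "coding": 0.85, "writing": 0.80},
}

def rank(weights):
    """Order models by slice-weighted average score, best first."""
    total = {
        m: sum(weights[s] * v for s, v in slices.items())
        for m, slices in scores.items()
    }
    return sorted(total, key=total.get, reverse=True)

uniform = {"math": 1/3, "coding": 1/3, "writing": 1/3}
code_heavy = {"math": 0.1, "coding": 0.7, "writing": 0.2}

# The leader flips when the user re-weights the slices.
print(rank(uniform)[0], rank(code_heavy)[0])  # model_a model_b
```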
Theme: Reliability via explicit verification, recovery, and learned coordination
- Why it matters: Many failures are procedural (premature stopping, loops, unstable adaptation, lossy communication). Explicit mechanisms and learnable interfaces are showing measurable gains.
- Representative papers:
- VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation
- Learning to Communicate: Toward End-to-End Optimization of Multi-Agent Language Systems
- Understanding and Mitigating Spurious Signal Amplification in Test-Time Reinforcement Learning for Math Reasoning
- Common approach:
- Add verifiers/gates (completion verifier; consensus refinement; stability metrics).
- Treat agent behavior as structured objects (KV traces; action loops; pseudo-label frequency regions).
- Use ablation-driven engineering to isolate which modules matter (VLAA-GUI; DDRL; DiffMAS step-count sensitivity).
- Open questions / failure modes:
- Verification can still miss dominant failure modes (VLAA-GUI notes false completion remains dominant for some backbones).
- Latent trace growth and step-count sensitivity can degrade performance (DiffMAS).
- TTRL-style methods may not generalize beyond math-like correctness signals (DDRL limitation).
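The multi-tier loop-breaking idea can be sketched as a simple check over an agent's action trace. This is a hypothetical helper, not VLAA-GUI's actual code: tier 1 catches an identical action repeated in a row, tier 2 catches a short action cycle recurring at the tail, and either signal would let the agent escalate to recovery instead of burning steps.

```python
def detect_loop(actions, max_repeat=3, max_cycle=3):
    """Return 'repeat', 'cycle', or None for an action trace."""
    # Tier 1: the same action repeated max_repeat times consecutively.
    run = 1
    for prev, cur in zip(actions, actions[1:]):
        run = run + 1 if cur == prev else 1
        if run >= max_repeat:
            return "repeat"
    # Tier 2: a short cycle (length 2..max_cycle) repeated twice at the tail.
    for k in range(2, max_cycle + 1):
        if len(actions) >= 2 * k and actions[-k:] == actions[-2 * k:-k]:
            return "cycle"
    return None

print(detect_loop(["click:ok", "click:ok", "click:ok"]))  # repeat
print(detect_loop(["open", "close", "open", "close"]))    # cycle
print(detect_loop(["open", "type", "click", "submit"]))   # None
```

In practice such a detector would gate an escalation ladder (retry with a new plan, broaden search, or abort), which is the "recover and search" half of the framework's design.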
Theme: Bias/fairness measurement is moving to “mechanism-relevant” tasks (and scale isn’t a fix)
- Why it matters: Bias can hide in realistic outputs (summaries, pipelines, causal reasoning). Evaluations that match real mechanisms reveal stronger and more directional failures.
- Representative papers:
- When Bigger Isn't Better: A Comprehensive Fairness Evaluation of Political Bias in Multi-News Summarisation
- Ideological Bias in LLMs' Economic Causal Reasoning
- From If-Statements to ML Pipelines: Revisiting Bias in Code-Generation
- Measuring Opinion Bias and Sycophancy via LLM-based Coercion
- Common approach:
- Evaluate directional asymmetries (intervention vs market sign accuracy; centrist underrepresentation).
- Move from toy proxies to realistic mechanisms (feature selection in ML pipelines; multi-doc viewpoint distributions; multi-turn debate pressure).
- Test mitigations (prompts, judge-selection, one-shot examples) and report limited/variable effectiveness.
- Open questions / failure modes:
- Prompt-based debiasing is inconsistent; entity sentiment preservation is resistant (FairNews).
- One-shot steering doesn’t reliably remove directional skew and can inflate confidence (economic causal reasoning).
- Multi-turn debate can dramatically increase sycophancy vs direct probes (llm-bias-bench).
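The "directional asymmetry" framing can be illustrated with a tiny invented example: compute sign accuracy separately on intervention-favoring and market-favoring causal claims, so that a model which looks fine on average still shows a measurable directional gap. The data and split are fabricated for illustration, not from the paper.

```python
def sign_accuracy(preds, golds):
    """Fraction of items where the predicted effect sign matches gold."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

# +1 = effect present, -1 = effect absent (toy labels, invented data).
intervention = {"gold": [1, 1, -1, 1], "pred": [1, 1, 1, 1]}
market = {"gold": [1, -1, -1, 1], "pred": [-1, -1, -1, -1]}

acc_i = sign_accuracy(intervention["pred"], intervention["gold"])
acc_m = sign_accuracy(market["pred"], market["gold"])

# The gap between the two slices is the directional skew.
print(acc_i, acc_m, acc_i - acc_m)  # 0.75 0.5 0.25
```

Reporting the gap, not just the pooled accuracy, is what makes the skew visible; one-shot steering that raises pooled accuracy can leave the gap intact.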
3) Technical synthesis
- Multiple papers converge on “auditability via traces/evidence”: MCP Pitfall Lab validates via MCP traces; TraceScope (URL triage) uses immutable evidence + checklist adjudication; EngramaBench annotates evidence IDs; this is a broader shift away from trusting model narratives.
- Single-round / low-history attacks are getting stronger: ProjRes needs only single-round gradients; skill stealing claims extraction with only a few interactions; TTI exploits per-turn stateless moderation.
- Utility–privacy coupling is now empirically quantified in agent settings (CI-Work correlation between conveyance and leakage/violation), echoing DP trade-offs in federated and clinical de-ID evaluations.
- Decomposition + verification is a recurring reliability pattern: VDSP for multi-chart QA, completion verifier + loop breaker for GUI agents, consensus off-policy refinement for test-time RL.
- “Bigger model” is not a universal fix: inverse scaling for leakage (CI-Work), medium-size best fairness trade-offs (FairNews), and evaluator VLMs still have large blind spots (FOCUS).
- Preference/judge-based evaluation is itself unreliable: FOCUS shows evaluator VLM failures, and the interactive leaderboard analysis shows that preference rankings vary by slice and that human raters prefer incorrect answers on deterministic math questions 26% of the time.
- Latent interfaces are emerging as a performance lever: DiffMAS trains KV-trace communication; this parallels other work that treats non-text internal structure as optimizable rather than fixed.
- Synthetic data is used heavily but with different roles: ARFBench uses synthetic post-training plus small real set; AgenticQwen uses dual flywheels; HalluVL-DPO uses large synthetic preference data—raising common questions about bias/transfer and evaluation realism.
- Security threat models are broadening from prompt injection to supply chain + protocol + tool metadata + multimodal (BADSTYLE style triggers; MCP Pitfall Lab; skill stealing).
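The trace-grounded-validation pattern running through these papers can be sketched as a policy check over a logged tool-call trace, rather than trusting the agent's narrative. Field names and the policy schema here are illustrative, not MCP Pitfall Lab's actual format:

```python
# Per-tool policy: which arguments a tool may receive and whether it may
# touch the network. Schema is invented for illustration.
POLICY = {
    "read_file": {"allowed_args": {"path"}, "may_reach_network": False},
    "http_get": {"allowed_args": {"url"}, "may_reach_network": True},
}

def validate_trace(trace):
    """Return (index, reason) pairs for every policy violation in a trace."""
    violations = []
    for i, call in enumerate(trace):
        rule = POLICY.get(call["tool"])
        if rule is None:
            violations.append((i, "unknown tool"))
            continue
        extra = set(call["args"]) - rule["allowed_args"]
        if extra:
            violations.append((i, f"unexpected args: {sorted(extra)}"))
        if call.get("network") and not rule["may_reach_network"]:
            violations.append((i, "network access from non-network tool"))
    return violations

trace = [
    {"tool": "read_file", "args": {"path": "/etc/motd"}, "network": False},
    # An exfiltration attempt: extra argument plus unexpected network use.
    {"tool": "read_file", "args": {"path": "/tmp/x", "upload_to": "evil"}, "network": True},
]
print(validate_trace(trace))
```

Because the check runs on the immutable trace, a narrative that claims "no data left the sandbox" cannot mask the second call; that is exactly the narrative–trace divergence the testbed documents.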
4) Top 5 papers (with “why now”)
1) Toward Efficient Membership Inference Attacks against Federated Large Language Models: A Projection Residual Approach
- Shows a single-round, no-shadow-model membership inference attack tailored to FedLLMs/PEFT using projection residuals on hidden embeddings.
- Reports near-perfect AUC (often 1.00) across multiple LLMs/datasets and strong gains over prior FL MIAs.
- Evaluates defenses and finds DP only helps at utility-destroying noise, pruning only partially helps.
- Skepticism / limitation: non-trivial runtime overhead (per-layer attacks) and no utility-preserving defense proposed.
2) CI-Work: Benchmarking Contextual Integrity in Enterprise LLM Agents
- Introduces an enterprise CI benchmark with dense retrieval trajectories and explicit Essential vs Sensitive entries.
- Finds substantial violation/leakage and a measurable privacy–utility trade-off, plus inverse scaling where larger models can leak more.
- Shows user pressure can sharply increase leakage and even reduce conveyance (“lose–lose”).
- Skepticism / limitation: synthetic scenarios and LLM-judge under-reporting mean leakage is likely a lower bound; org-specific norms not captured.
3) MCP Pitfall Lab: Exposing Developer Pitfalls in MCP Tool Server Security under Multi-Vector Attacks
- Operationalizes a developer pitfall taxonomy and provides trace-grounded validators for confidentiality/integrity objectives.
- Tier-1 static analyzer achieves F1=1.0 on statically checkable pitfall classes and is CI-friendly (~5.2 ms).
- Hardening reduces findings 29→0 with mean ~27 LOC changes; also documents frequent trace–narrative divergence.
- Skepticism / limitation: evaluation scope is small (few scenarios; preliminary corpus), and multimodal analysis is not yet thorough.
4) Seeing Isn’t Believing: Uncovering Blind Spots in Evaluator Vision-Language Models
- Releases FOCUS: >4,000 human-validated perturbation instances for meta-evaluating evaluator VLMs on I2T and T2I.
- Finds high evaluator failure rates, especially in single-answer scoring; pairwise comparison is more reliable.
- Shows reasoning budget doesn’t reliably help and evaluators can note errors in text but not reflect them in scores.
- Skepticism / limitation: gold outputs are model-generated (though manually reviewed); only four evaluator VLMs tested.
5) VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation
- Targets two dominant GUI-agent failures: premature completion and loops, via completion gating + independent verifier + multi-tier loop breaker + search.
- Reports 77.45% success on OSWorld-Verified (Opus 4.6), surpassing the reported human level (72.4%), plus strong WAA results.
- Provides ablations showing which modules reduce false completion and wasted steps.
- Skepticism / limitation: tool overhead can hurt under tight budgets for weaker backbones; false completion remains a dominant failure mode for some models.
5) Practical next steps
- For federated/PEFT deployments: add a red-team audit that explicitly tests single-round gradient leakage (ProjRes-style) before shipping; treat “no raw data sharing” as insufficient.
- For enterprise agents: measure Leakage/Violation/Conveyance under dense retrieval and user pressure conditions (CI-Work-style), not just on clean prompts; track whether scaling increases leakage.
- Adopt trace-grounded security QA for tool servers: integrate Tier-1 static checks (MCP Pitfall Lab) into CI, and require protocol trace logging so validators can detect exfiltration/integrity violations.
- Harden against black-box extraction: test for skill/package leakage with automated prompt suites; consider output filtering and inference hardening, but also evaluate semantic leakage (not just exact match).
- Fix stateless moderation gaps: implement session-level aggregation or risk scoring to detect distributed multi-turn intent (TTI), and benchmark against stateless multi-turn attacks.
- Stop trusting evaluator VLMs by default: validate your evaluator on perturbation suites (FOCUS-like); prefer pairwise paradigms when feasible and monitor justification–score inconsistencies.
- For GUI/agent reliability: add explicit completion criteria + independent verifier and loop escalation; log false-completion and wasted-step ratios as first-class metrics (VLAA-GUI).
- For fairness audits: evaluate on mechanism-relevant tasks (e.g., ML pipeline feature selection, multi-doc viewpoint preservation, directional causal sign) and don’t assume larger models reduce bias.
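A CI-Work-style measurement harness reduces to two rates per scenario. The metric names come from the brief; the matching logic below is naive substring search purely for illustration (a real harness would use an LLM judge or entity matching): conveyance is the share of essential entries that reach the reply, leakage the share of sensitive entries that do.

```python
def rate(entries, reply):
    """Fraction of entries that appear (case-insensitively) in the reply."""
    if not entries:
        return 0.0
    return sum(e.lower() in reply.lower() for e in entries) / len(entries)

# Invented scenario: what the agent should convey vs. what it must withhold.
essential = ["Q3 launch date", "budget owner"]
sensitive = ["salary band", "medical leave"]

reply = "The Q3 launch date is confirmed; the budget owner and salary band are attached."

conveyance = rate(essential, reply)
leakage = rate(sensitive, reply)
print(conveyance, leakage)  # 1.0 0.5
```

Tracking both numbers per condition (dense retrieval on/off, user pressure on/off, model size) is what surfaces the privacy–utility coupling and inverse-scaling effects the benchmark reports.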
Generated from per-paper analyses; no external browsing.
