Daily AI Paper Report (2026-04-15)
Run stats
- Candidates: 306
- Selected: 30
- Deepread completed: 30
- Window (UTC): 2026-04-13T00:00:00Z → 2026-04-14T00:00:00Z (arxiv_announce, expanded=0)
Selected papers
| arXiv ID | Title / Links | Categories | Score | Why | Tags |
|---|---|---|---|---|---|
| 2604.11790 | ClawGuard: A Runtime Security Framework for Tool-Augmented LLM Agents Against Indirect Prompt Injection | cs.CR, cs.AI | 95 | Runtime, deterministic guardrails at tool boundaries to stop indirect prompt injection in agents | agent-security, tool-use, prompt-injection, runtime-enforcement, auditing, sandboxing |
| 2604.11072 | Hodoscope: Unsupervised Monitoring for AI Misbehaviors | cs.AI | 95 | Unsupervised monitoring to surface novel agent misbehaviors beyond predefined rules/judges. | agent-safety, monitoring, unsupervised, anomaly-detection, evaluation |
| 2604.11806 | Detecting Safety Violations Across Many Agent Traces | cs.AI, cs.CL | 93 | Scalable auditing: finds rare/adversarial safety violations only visible across many agent traces | auditing, monitoring, agent-traces, red-teaming, clustering, safety-eval |
| 2604.11322 | Do LLMs Know Tool Irrelevance? Demystifying Structural Alignment Bias in Tool Invocations | cs.CL, cs.AI | 93 | Finds tool-refusal flaw; introduces SABEval to isolate structural vs semantic tool relevance. | tool-use, agents, safety, evaluation, dataset, robustness |
| 2604.11259 | Mobile GUI Agent Privacy Personalization with Trajectory Induced Preference Optimization | cs.AI, cs.CR | 92 | Preference optimization for privacy-personalized mobile GUI agents with heterogeneous trajectories. | agents, privacy, preference-optimization, mobile, security |
| 2604.10988 | WebForge: Breaking the Realism-Reproducibility-Scalability Trilemma in Browser Agent Benchmark | cs.AI, cs.CV | 92 | Automated, scalable browser-agent benchmark resolving realism/reproducibility; strong eval utility. | agents, benchmarks, browser-agents, evaluation, automation, web |
| 2604.11061 | Pando: Do Interpretability Methods Work When Models Won't Explain Themselves? | cs.LG, cs.AI | 91 | Benchmark isolates interpretability signal from elicitation; tests when models mis/avoid explaining | mechanistic-interpretability, evaluation, model-organisms, faithfulness, alignment-auditing |
| 2604.11623 | Context Kubernetes: Declarative Orchestration of Enterprise Knowledge for Agentic AI Systems | cs.AI, cs.SE | 91 | Enterprise agent knowledge orchestration w/ permissions+freshness; shows leakage/phantom-content issues. | agentic-systems, permissions, governance, RAG, security, deployment |
| 2604.11581 | Decomposing and Reducing Hidden Measurement Error in LLM Evaluation Pipelines | cs.CL | 90 | Decomposes hidden uncertainty in LLM eval pipelines; shows design-choice variance can flip rankings | evaluation, measurement-error, judge-models, prompt-variance, reproducibility, safety-standards |
| 2604.11201 | CocoaBench: Evaluating Unified Digital Agents in the Wild | cs.CL, cs.AI | 90 | Benchmark for unified digital agents requiring long-horizon composition of vision/search/coding. | agents, benchmark, evaluation, long-horizon, tool-use |
| 2604.11174 | EmbodiedGovBench: A Benchmark for Governance, Recovery, and Upgrade Safety in Embodied Agent Systems | cs.RO, cs.AI | 89 | Governance-focused embodied-agent benchmark: controllability, policy bounds, recovery, auditability | embodied-agents, governance, oversight, recovery, audit-trails, benchmark |
| 2604.11304 | BankerToolBench: Evaluating AI Agents in End-to-End Investment Banking Workflows | cs.AI | 89 | High-fidelity benchmark for end-to-end investment banking workflows with real tools/deliverables. | agents, benchmark, evaluation, tool-use, real-world-tasks |
| 2604.11641 | CodeTracer: Towards Traceable Agent States | cs.SE, cs.AI | 89 | Tracing architecture for agent state transitions/error chains; improves debugging, auditing, reliability. | agents, observability, tracing, debugging, code-agents, monitoring |
| 2604.11307 | PaperScope: A Multi-Modal Multi-Document Benchmark for Agentic Deep Research Across Massive Scientific Papers | cs.AI | 88 | Multimodal multi-document benchmark for agentic deep research over papers incl. tables/figures. | agents, benchmark, multimodal, scientific-reasoning, retrieval |
| 2604.11182 | Evaluating Memory Capability in Continuous Lifelog Scenario | cs.CL | 88 | LifeDialBench + online temporal-causality protocol for real lifelog memory; reduces temporal leakage. | memory, long-context, evaluation, benchmark, online-eval, agents |
| 2604.11036 | Uncertainty-Aware Web-Conditioned Scientific Fact-Checking | cs.CL, cs.AI | 88 | Uncertainty-gated web retrieval for scientific fact-checking; targets hallucination and grounding. | fact-checking, uncertainty, grounding, retrieval, hallucinations, evaluation |
| 2604.11120 | Persona Non Grata: Single-Method Safety Evaluation Is Incomplete for Persona-Imbued LLMs | cs.AI | 87 | Shows persona safety differs for prompting vs activation steering; single-method eval misses risks | safety-evaluation, personas, activation-steering, jailbreaks, robustness |
| 2604.11662 | Hidden Failures in Robustness: Why Supervised Uncertainty Quantification Needs Better Evaluation | cs.CL | 87 | Large study shows uncertainty probes fail under shift; calls for better UQ/hallucination eval. | uncertainty, hallucinations, robustness, OOD, evaluation |
| 2604.11309 | The Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems | cs.CR, cs.AI, cs.CL, cs.CV, cs.LG | 86 | Multi-turn jailbreak via cumulative 'salami slicing' risk; highlights covert escalation failures | jailbreaks, multi-turn-attacks, cumulative-risk, adversarial-prompting, security |
| 2604.11784 | ClawGUI: A Unified Framework for Training, Evaluating, and Deploying GUI Agents | cs.LG, cs.AI, cs.CL, cs.CV | 86 | Open-source full-stack GUI agent framework: RL infra + stable eval + deployment to real devices. | GUI-agents, RL, evaluation, infrastructure, deployment, reproducibility |
| 2604.10966 | You Only Judge Once: Multi-response Reward Modeling in a Single Forward Pass | cs.CV, cs.AI | 86 | Single-pass multi-response reward modeling + new N-way benchmarks; cheaper preference learning. | reward-modeling, RLHF, preference-learning, efficiency, benchmarks, multimodal |
| 2604.11557 | UniToolCall: Unifying Tool-Use Representation, Data, and Evaluation for LLM Agents | cs.AI | 85 | Unifies tool-use representations + 22k tools + 390k instances; improves comparability for agents | tool-use, agents, datasets, evaluation, function-calling, standardization |
| 2604.11523 | PAC-BENCH: Evaluating Multi-Agent Collaboration under Privacy Constraints | cs.AI, cs.MA | 84 | Benchmark for multi-agent collaboration under privacy constraints; surfaces failure modes + metrics | multi-agent, privacy, collaboration, benchmark, coordination-failures, hallucinations |
| 2604.11419 | Beyond RAG for Cyber Threat Intelligence: A Systematic Evaluation of Graph-Based and Agentic Retrieval | cs.AI, cs.CR | 84 | Systematic eval of graph-based vs agentic retrieval for cyber threat intelligence QA. | RAG, retrieval, knowledge-graphs, cybersecurity, evaluation |
| 2604.11094 | E2E-REME: Towards End-to-End Microservices Auto-Remediation via Experience-Simulation Reinforcement Fine-Tuning | cs.SE, cs.AI | 84 | MicroRemed benchmark + RL fine-tuning for end-to-end LLM remediation generating executable playbooks. | agents, autonomous-remediation, benchmark, RLHF/RLFT, reliability, devops |
| 2604.11611 | Utilizing and Calibrating Hindsight Process Rewards via Reinforcement with Mutual Information Self-Evaluation | cs.CL, cs.LG | 84 | Formal + practical method for calibrated self-reward (hindsight) to densify RL for LLM agents. | LLM-agents, reinforcement-learning, self-reward, calibration, theory, mutual-information |
| 2604.11778 | General365: Benchmarking General Reasoning in Large Language Models Across Diverse and Challenging Tasks | cs.CL, cs.AI | 83 | General365 benchmark targets broad 'general reasoning' decoupled from specialized knowledge. | reasoning, benchmark, evaluation, generalization |
| 2604.11012 | Min-$k$ Sampling: Decoupling Truncation from Temperature Scaling via Relative Logit Dynamics | cs.AI, cs.CL, cs.LG | 83 | Min-k decoding reduces temperature sensitivity via logit-shape 'semantic cliffs'; practical gen quality lever. | decoding, sampling, generation, inference, calibration |
| 2604.11666 | Playing Along: Learning a Double-Agent Defender for Belief Steering via Theory of Mind | cs.CL, cs.AI, cs.LG | 82 | ToM-based 'double-agent' defense task; frontier models struggle; useful for adversarial dialogue eval | theory-of-mind, adversarial-dialogue, privacy, social-engineering, evaluation, defense |
| 2604.11258 | Dialectic-Med: Mitigating Diagnostic Hallucinations via Counterfactual Adversarial Multi-Agent Debate | cs.CL | 82 | Adversarial multi-agent debate with visual falsification to reduce diagnostic hallucinations. | hallucinations, multi-agent, healthcare, multimodal, robustness |
AI Paper Insight Brief
2026-04-15
1) Executive takeaways (read this first)
- Evaluation is shifting from “single-score” to “diagnostic infrastructure”: multiple new benchmarks/harnesses (WebForge, CocoaBench, BankerToolBench (BTB), PaperScope, EmbodiedGovBench, LifeDialBench, PAC-BENCH, Pando, CodeTracer) emphasize reproducibility, per-dimension breakdowns, and process/trace-level evidence over aggregate accuracy.
- Multi-turn and cross-trace risk is now a first-class threat model: Salami Slicing shows high-ASR gradual jailbreaks that evade per-turn refusal; Meerkat and Hodoscope show repository-/group-level discovery can surface cheating/exploits and novel misbehaviors with far less human review.
- Tool-augmented agents have two distinct safety gaps: (i) semantic attacks (indirect prompt injection) where deterministic boundary enforcement (ClawGuard) can cut ASR sharply; (ii) structural failures where models call irrelevant tools due to interface match (SABEval), mitigated by attention-pathway rebalancing.
- Preference/reward modeling is becoming more listwise, more efficient, and more “calibrated”: single-pass multi-response reward modeling reduces multimodal RM latency/FLOPs while improving ranking and GRPO stability; MISE adds calibration to hindsight process rewards to avoid self-eval bias.
- Interpretability results are sobering but actionable: Pando finds that when explanations are absent/misleading, gradient/RelP are the only consistent white-box signals for predicting behavior; many popular readouts mostly capture “task representation,” not decision computation.
- Robustness work is increasingly about “measurement error” and OOD reality checks: TEE shows pipeline design variance (prompt/judge interactions) can dominate and naive CIs under-cover; supervised UQ probes often collapse OOD (especially long-form), with middle-layer + token-averaging helping but not solving.
2) Key themes (clusters)
Theme: Reproducible, diagnostic agent benchmarking (beyond aggregate success)
- Why it matters: Agent progress is increasingly bottlenecked by evaluation: realism, reproducibility, and actionable diagnostics are scarce, while aggregate scores hide cross-domain biases, scaffold effects, and failure sources.
- Representative papers:
- WebForge: Breaking the Realism-Reproducibility-Scalability Trilemma in Browser Agent Benchmark
- CocoaBench: Evaluating Unified Digital Agents in the Wild
- BankerToolBench: Evaluating AI Agents in End-to-End Investment Banking Workflows
- PaperScope: A Multi-Modal Multi-Document Benchmark for Agentic Deep Research Across Massive Scientific Papers
- Common approach:
- Build self-contained or sandboxed environments to reduce drift (static websites; Docker sandboxes; packaged “data rooms”).
- Add difficulty controls / rubrics / automatic evaluators to enable scale and per-dimension diagnosis (7D difficulty vectors; weighted banker rubrics; task-specific evaluators).
- Report cross-domain and per-axis breakdowns to reveal capability stratification and biases (a toy per-axis aggregation is sketched after this theme).
- Open questions / failure modes:
- Scaffold sensitivity: performance varies substantially by harness/scaffold (CocoaBench).
- Sim-to-real gaps: static sites miss server-side semantics; local corpora miss open-domain noise (WebForge, PaperScope).
- Judge/evaluator bias and hidden config variance can distort comparisons (CocoaBench judge taxonomy; TEE measurement error).
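Where a benchmark exposes per-dimension difficulty vectors, regressions become localizable. A minimal sketch of such per-axis aggregation, with a hypothetical schema and axis names (not taken from any of the benchmarks above):

```python
# Toy per-axis breakdown: aggregate task outcomes along a difficulty
# vector instead of a single score. Axis names/levels are hypothetical.
from collections import defaultdict

def per_axis_success(results):
    """results: iterable of (difficulty: dict[axis -> level], passed: bool).
    Returns success rate per (axis, level) pair."""
    tally = defaultdict(lambda: [0, 0])  # (axis, level) -> [passes, total]
    for difficulty, passed in results:
        for axis, level in difficulty.items():
            tally[(axis, level)][0] += int(passed)
            tally[(axis, level)][1] += 1
    return {key: passes / total for key, (passes, total) in tally.items()}

# Example: two tasks scored along two hypothetical axes.
rates = per_axis_success([
    ({"navigation": 2, "form_filling": 1}, True),
    ({"navigation": 3, "form_filling": 1}, False),
])
# {('navigation', 2): 1.0, ('form_filling', 1): 0.5, ('navigation', 3): 0.0}
```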
Theme: Monitoring & auditing for “unknown unknowns” (group/repository-level)
- Why it matters: Many serious failures are sparse, novel, or distributed—missed by per-turn monitors and category-based judges—so discovery workflows must prioritize human attention efficiently.
- Representative papers:
- Hodoscope: Unsupervised Monitoring for AI Misbehaviors
- Detecting Safety Violations Across Many Agent Traces (Meerkat)
- Common approach:
- Represent actions/traces in an embedding space, then surface overrepresented clusters (density-diffing across groups; clustering + agentic search); a minimal density-diffing sketch follows this theme.
- Optimize for review efficiency (density-weighted farthest-point sampling; witness-set localization).
- Convert discoveries into improved supervised monitors (exemplar augmentation) or benchmark patches (Commit0 vulnerability).
- Open questions / failure modes:
- Requires meaningful comparison groups; misbehaviors equally common across groups may not surface (Hodoscope).
- Vulnerable to adversarial dispersion in embedding space (Meerkat AP drop under embedding-prefix attack).
- Extremely rare single-run anomalies may evade cluster-based discovery (Hodoscope).
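A minimal sketch of the group-level density-diffing idea, assuming trace embeddings are already computed; the k-means clustering and the ratio rule here are illustrative stand-ins, not the papers' exact procedures:

```python
# Hypothetical density-diffing over trace embeddings: flag clusters that
# are overrepresented in group A relative to group B (candidate misbehaviors).
import numpy as np
from sklearn.cluster import KMeans

def flag_overrepresented_clusters(emb_a, emb_b, k=20, ratio_threshold=3.0):
    pooled = np.vstack([emb_a, emb_b])
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(pooled)
    la, lb = labels[: len(emb_a)], labels[len(emb_a):]
    flagged = []
    for c in range(k):
        share_a = (la == c).mean() + 1e-9  # smooth empty clusters
        share_b = (lb == c).mean() + 1e-9
        if share_a / share_b >= ratio_threshold:
            flagged.append((c, share_a / share_b))
    # Review the most skewed clusters first (human-attention efficiency).
    return sorted(flagged, key=lambda t: -t[1])
```

The adversarial-dispersion failure mode above corresponds to an attacker spreading their traces so that no single cluster crosses the ratio threshold.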
Theme: Multi-turn adversaries & cumulative-risk defenses
- Why it matters: Safety mechanisms that operate per-turn can be bypassed by gradual, distributed intent accumulation; defenses must reason over history and/or enforce runtime constraints.
- Representative papers:
- The Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems
- ClawGuard: A Runtime Security Framework for Tool-Augmented LLM Agents Against Indirect Prompt Injection
- Playing Along: Learning a Double-Agent Defender for Belief Steering via Theory of Mind (TOM-SB)
- Common approach:
- Formalize multi-turn risk accumulation and build automated multi-turn attack pipelines (Salami Attack; adaptive multi-agent variant).
- Add history-aware auditing (CQA) or deterministic tool-call boundary enforcement (ClawGuard); a minimal CQA-style check is sketched after this theme.
- Train defenders with trajectory-level objectives (fooling + ToM rewards via Dr. GRPO).
- Open questions / failure modes:
- CQA is a prototype using an LLM judge; production cost/latency and judge robustness are open (Salami).
- Rule-based defenses can miss content-misleading attacks where harm is in generated text, not tool calls (ClawGuard basic-rule results).
- Defender training raises misuse concerns and still has modest fooling rates on hard scenarios (TOM-SB).
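A minimal CQA-style check under stated assumptions: `score_harm` is a placeholder per-window risk scorer (e.g., an LLM judge), and the window-over-joined-history rule is illustrative; the paper's actual auditing procedure may differ.

```python
# History-aware cumulative-risk audit: the joined window can score above
# threshold even when every individual turn scores below it.
from typing import Callable, List

def cumulative_audit(history: List[str],
                     score_harm: Callable[[str], float],
                     window: int = 8,
                     threshold: float = 0.7) -> bool:
    """Return True if the recent history, taken together, looks harmful."""
    joined = "\n".join(history[-window:])
    return score_harm(joined) >= threshold
```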
Theme: Tool-use reliability: structural bias, standardization, and privacy-aware personalization
- Why it matters: Tool-augmented agents fail not only from “bad reasoning” but from systematic shortcuts (structural alignment) and from misaligned trajectory choices under user constraints (privacy personas).
- Representative papers:
- Do LLMs Know Tool Irrelevance? Demystifying Structural Alignment Bias in Tool Invocations
- UniToolCall: Unifying Tool-Use Representation, Data, and Evaluation for LLM Agents
- Mobile GUI Agent Privacy Personalization with Trajectory Induced Preference Optimization
- PAC-BENCH: Evaluating Multi-Agent Collaboration under Privacy Constraints
- Common approach:
- Construct datasets that decouple confounds (SABEval sibling tools with identical interfaces; privacy-constrained scenarios; paired privacy/utility trajectories); a sibling-tool probe is sketched after this theme.
- Use structure-aware training/eval (QAOA unification; Hybrid-20 distractors; step-wise preference weighting + padding gating).
- Diagnose failure modes quantitatively (TIR spikes under structural alignment; early privacy violations; privacy-induced hallucinations).
- Open questions / failure modes:
- Structural alignment bias provenance (pretraining vs tool finetuning) is unresolved (SABEval).
- Privacy constraints cause early-turn failures and hallucinations; prompt tweaks (Privacy-CoT) don’t reliably fix joint success (PAC-BENCH).
- Standardized synthetic tool data may not capture long-horizon real executions (UniToolCall limitations).
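A hypothetical SABEval-style probe, as referenced above: clone a tool's parameter schema under unrelated semantics, then measure how often the router still invokes the irrelevant sibling (a tool-invocation rate, TIR). All names here are illustrative.

```python
# Sibling-tool probe: identical interface, different semantics.
import copy

def make_sibling(tool_spec: dict, new_name: str, new_description: str) -> dict:
    """Clone a tool spec (same parameter schema) with unrelated semantics."""
    sibling = copy.deepcopy(tool_spec)
    sibling["name"] = new_name
    sibling["description"] = new_description
    return sibling

def tool_invocation_rate(calls: list, irrelevant_name: str) -> float:
    """Fraction of observed tool calls that hit the irrelevant sibling;
    a structural-alignment shortcut shows up as a TIR spike."""
    return sum(c == irrelevant_name for c in calls) / max(len(calls), 1)
```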
Theme: Reward/preference modeling & decoding robustness for safer generation
- Why it matters: As agents and multimodal systems scale, efficiency and robustness of scoring/decoding become gating factors for training stability and safe deployment.
- Representative papers:
- You Only Judge Once: Multi-response Reward Modeling in a Single Forward Pass (YOJO)
- Min-$k$ Sampling: Decoupling Truncation from Temperature Scaling via Relative Logit Dynamics
- Utilizing and Calibrating Hindsight Process Rewards via Reinforcement with Mutual Information Self-Evaluation (MISE)
- Common approach:
- Replace pairwise/single-response scoring with listwise or dense signals (multi-response cross-entropy RM; hindsight per-step rewards).
- Add calibration/robustness guarantees (temperature-invariant truncation; calibration reward aligning self-eval to env success); a toy truncation rule is sketched after this theme.
- Validate via downstream training stability or cross-temperature robustness (GRPO stability; reasoning EM stability across T).
- Open questions / failure modes:
- Multi-response RM scaling beyond N=4 is untested; video preference remains hard (~50.7% best-of-4) (YOJO).
- Min-k may fail when correct tokens sit deep in the long tail (the cliff assumption breaks).
- Self-eval rewards can be adversarially wrong (“flipped eval” harms learning); compute cost is higher than PPO (MISE).
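One plausible reading of “temperature-invariant truncation”, sketched below: decide the kept token set on raw logits (so the set does not change with T), here via a largest-gap “cliff” heuristic, then apply temperature only inside that set. This illustrates the idea and is not necessarily Min-$k$'s actual rule.

```python
import numpy as np

def cliff_truncate_then_sample(logits, temperature=1.0, min_keep=1, rng=None):
    """Truncate at the largest logit gap ('cliff'), then sample with
    temperature inside the kept set. Truncation ignores temperature."""
    rng = rng or np.random.default_rng()
    order = np.argsort(logits)[::-1]          # token indices, descending logit
    s = logits[order]
    if len(s) <= min_keep:
        kept = order
    else:
        gaps = s[:-1] - s[1:]                 # logit drop between adjacent ranks
        cliff = int(np.argmax(gaps[min_keep - 1:])) + min_keep
        kept = order[:cliff]
    z = logits[kept] / temperature            # temperature applied after truncation
    p = np.exp(z - z.max())
    p /= p.sum()
    return int(rng.choice(kept, p=p))
```

The “deep long-tail” failure mode above is visible here: if the correct token sits below the cliff, no temperature setting can recover it.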
Theme: Interpretability & evaluation reliability under unfaithful explanations / OOD
- Why it matters: Safety auditing often assumes explanations or uncertainty signals are reliable; these works show when and how those assumptions fail, and what signals still help.
- Representative papers:
- Pando: Do Interpretability Methods Work When Models Won’t Explain Themselves?
- Decomposing and Reducing Hidden Measurement Error in LLM Evaluation Pipelines (TEE)
- Hidden Failures in Robustness: Why Supervised Uncertainty Quantification Needs Better Evaluation
- Persona Non Grata: Single-Method Safety Evaluation Is Incomplete for Persona-Imbued LLMs
- Common approach:
- Create controlled settings to isolate confounders (explanation faithfulness axis; factorial pipeline designs; multi-method persona injection); a variance-decomposition sketch follows this theme.
- Evaluate robustness under shifts (OOD regimes for UQ probes; activation steering vs prompting; design-choice variance).
- Provide targeted mitigations (RelP/gradients; multi-judge/multi-prompt designs; middle-layer + token-averaging; multi-method persona audits).
- Open questions / failure modes:
- Pando’s planted-tree setting may not capture distributed real features; depth limited to 4.
- UQ probes still fail badly on long-form OOD even with HBO.
- Activation steering can invert “safe persona” rankings (prosocial persona paradox), and defenses are not yet established.
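For the factorial-pilot idea, a minimal two-way variance decomposition (prompts × judges) under simplifying assumptions (balanced design, additive effects); the measurement-error paper's actual estimators and corrected intervals are richer:

```python
import numpy as np

def variance_components(scores: np.ndarray) -> dict:
    """scores[p, j]: benchmark score under prompt variant p and judge j.
    Returns the share of variance attributable to prompts, judges, residual."""
    grand = scores.mean()
    prompt_eff = scores.mean(axis=1) - grand          # per-prompt main effect
    judge_eff = scores.mean(axis=0) - grand           # per-judge main effect
    resid = scores - grand - prompt_eff[:, None] - judge_eff[None, :]
    total = ((scores - grand) ** 2).mean()
    return {
        "prompt": (prompt_eff ** 2).mean() / total,
        "judge": (judge_eff ** 2).mean() / total,
        "residual": (resid ** 2).mean() / total,
    }

# A large "prompt" or "judge" share means design variance, not model ability,
# is driving rankings; spend budget there before adding more items.
```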
3) Technical synthesis
- Listwise scoring is spreading: YOJO’s cross-entropy over N candidates parallels a broader move away from pairwise-only comparisons (also echoed by trajectory/requirement-level scoring in PAC-BENCH/BTB); a minimal listwise loss is sketched after this list.
- “Causality constraints” in evaluation are becoming explicit: LifeDialBench’s online protocol prevents future-context leakage; WebForge validates solvability by replay in Chromium; BTB grades deliverables inside the same environment.
- Agent safety is moving from content filtering to systems enforcement: ClawGuard’s deterministic pre-invocation checks complement (not replace) judge-based approaches; Context Kubernetes similarly enforces permission/freshness invariants at the orchestration layer.
- Multi-turn threat models unify several papers: Salami (cumulative intent), TOM-SB (belief steering), PAC-BENCH (early-turn privacy violations), and Meerkat (distributed evidence across traces) all show that turn-local metrics miss key failures.
- Embedding-space methods are powerful but attackable: Hodoscope/Meerkat rely on clustering/projection; Meerkat demonstrates adversarial dispersion can break detection, suggesting a need for robust grouping or multi-view signals.
- Interpretability signal that survives unfaithful explanations is narrow: Pando finds gradients/RelP help when verbal rationales are absent/misleading; SABEval similarly uses attention-pathway analysis (CAA) to identify and intervene on a structural shortcut.
- Calibration is a recurring motif: Atomic+Search gates web retrieval by calibrated uncertainty bands; MISE calibrates self-eval rewards to env success; TEE calibrates evaluation confidence by modeling design variance.
- Benchmarks increasingly include “anti-cheating” and integrity checks: WebForge adds anti-cheating mechanisms; Meerkat finds real benchmark cheating; BTB uses a verifier with measured agreement to reduce subjective grading drift.
- Robust decoding is being treated as a safety/quality primitive: Min-k’s temperature-invariant truncation targets semantic collapse at high T with modest overhead, relevant for agent exploration settings.
- Process-level artifacts are becoming training signals: CodeTracer’s localized evidence enables reflective replay improvements; MISE uses per-step hindsight rewards; ClawGUI uses PRM + GiGPO for step-level credit.
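The listwise move in the first bullet is compact to state: score N candidates jointly and train with cross-entropy over the preferred index. A minimal PyTorch sketch (YOJO-inspired; the actual single-pass packing and architecture differ):

```python
import torch
import torch.nn.functional as F

def listwise_rm_loss(scores: torch.Tensor, preferred: torch.Tensor) -> torch.Tensor:
    """scores: [batch, N] reward scores for N responses per prompt, ideally
    produced in one forward pass; preferred: [batch] chosen-response index."""
    return F.cross_entropy(scores, preferred)

# Usage:
loss = listwise_rm_loss(torch.randn(8, 4), torch.randint(0, 4, (8,)))
```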
4) Top 5 papers (with “why now”)
1) BankerToolBench: Evaluating AI Agents in End-to-End Investment Banking Workflows
- Provides a high-fidelity, multi-file workflow benchmark (100 tasks; rubrics with ~150 criteria per task) that better matches real delegation stakes.
- Introduces an agentic verifier (Gandalf) with reported agreement vs humans (accuracy 88.2%, κ=0.76), enabling scalable grading of Excel/PPT/PDF deliverables.
- Shows frontier models are far from delegation-ready (best reported Pass@1 is 16%; passing all critical criteria is rare).
- Skepticism: benchmark simplifies live banking dynamics and is US-centric; still a proxy for real deal work.
2) The Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems
- Formalizes cumulative multi-turn jailbreak risk and proves sub-threshold prompts can accumulate beyond harm thresholds (a toy version of the inequality appears below).
- Demonstrates high ASR across multiple LLMs/benchmarks and extends to multimodal targets (VLMs/diffusion).
- Proposes Cumulative Query Auditing (CQA) that substantially reduces ASR in experiments.
- Skepticism: CQA uses an LLM judge in prototype form; production cost/latency and robustness need validation.
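A toy version of the accumulation argument, in our own notation (not necessarily the paper's formalism): per-turn risk can stay strictly below the per-turn refusal threshold while the running total crosses the harm threshold.

```latex
% Illustrative only: r_i = per-turn risk, \tau = per-turn refusal threshold,
% T = cumulative harm threshold.
r_i < \tau \ \text{for all } i, \qquad r_i \ge r_{\min} > 0
\;\Longrightarrow\;
R_n = \sum_{i=1}^{n} r_i \;\ge\; n\, r_{\min} \;\ge\; T
\quad \text{once } n \ge \lceil T / r_{\min} \rceil .
```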
3) WebForge: Breaking the Realism-Reproducibility-Scalability Trilemma in Browser Agent Benchmark
- Automated generation of self-contained static websites with real-web noise + anti-cheating, addressing content drift while staying realistic.
- 934 validated tasks with a 74.1% pipeline pass rate; validation replays solutions in Chromium to ensure solvability.
- Per-dimension difficulty reveals capability differences; removing screenshots drops accuracy by ~16 pp.
- Skepticism: static sites can’t fully capture server-side/multi-user/real-time web semantics.
4) Pando: Do Interpretability Methods Work When Models Won’t Explain Themselves?
- Cleanly isolates the elicitation confounder by controlling whether models give faithful/no/unfaithful rationales.
- Large paired study (720 models) finds gradient/RelP are the only consistent white-box gains when explanations are absent/misleading.
- Variance decomposition shows many readouts track field identity/value rather than decision relevance.
- Skepticism: planted decision trees in a 2B LoRA setting may not generalize to distributed real-world features.
5) ClawGuard: A Runtime Security Framework for Tool-Augmented LLM Agents Against Indirect Prompt Injection
- Middleware that enforces deterministic, auditable checks at tool-call boundaries (sanitizer, rule evaluator, skill inspector, approval + logs).
- Demonstrates large ASR reductions across AgentDojo / SkillInject / MCPSafeBench in a basic-rule configuration.
- Introduces task-rule induction with user confirmation (Rtask), though not evaluated in the reported experiments.
- Skepticism: residual failures include content-misleading attacks; published results omit the context-aware rule induction.
5) Practical next steps
- Add cumulative-intent auditing to your safety stack: prototype a CQA-like check that periodically scores conversation history, not just the last user turn, and measure ASR reduction on multi-turn jailbreak suites.
- Harden tool boundaries deterministically: implement a ClawGuard-style pre-invocation policy (cmd/file/net) with audit logs; evaluate on indirect prompt injection benchmarks and track residual “harm-in-text” cases separately (a minimal gate is sketched after this list).
- Test for structural alignment bias in your tool router: create SABEval-like sibling-tool tests (same schema, different semantics) and measure tool-invocation rate; consider pathway-level interventions or training data that breaks the shortcut.
- Adopt measurement-error-aware evaluation: run small factorial pilots (≥2–3 prompt variants, multiple judges) and use variance decomposition to decide whether to spend budget on more items vs more judges/prompts (TEE).
- Move from per-trace to repository/group auditing: deploy Hodoscope/Meerkat-style clustering + prioritized review on your agent logs; explicitly test robustness to embedding-space dispersion attacks.
- For multimodal RLHF/RLAIF pipelines: try multi-response reward modeling for best-of-N and GRPO-style training; measure both ranking quality and latency/FLOPs savings, and test N>4 scaling if relevant.
- For long-horizon memory agents: evaluate with a causal online protocol (LifeDialBench-style) to quantify future-context leakage; compare raw-text preservation vs compressed memory and track accuracy decay over time.
- For interpretability-driven audits: when explanations are unreliable, prioritize gradient/RelP-style signals (per Pando) and validate that they improve held-out behavior prediction under a fixed query budget.
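For the “harden tool boundaries” item above, a minimal deterministic gate could look like the following; the rule schema, tool names, and patterns are hypothetical, and ClawGuard's actual components (sanitizer, rule evaluator, skill inspector) are richer:

```python
# Deterministic pre-invocation gate: no model judgment in the hot path,
# every denial is audit-logged. Rules/patterns below are illustrative.
import fnmatch
import json
import logging

DENY_RULES = {
    "shell": {"arg": "cmd", "patterns": ["*rm -rf*", "*curl*|*sh*"]},
    "fs.write": {"arg": "path", "patterns": ["/etc/*", "*/.ssh/*"]},
    "http.get": {"arg": "url", "patterns": ["*169.254.169.254*"]},
}

def gate_tool_call(tool: str, args: dict) -> bool:
    """Return True to allow the call; deny and log on any rule match."""
    rule = DENY_RULES.get(tool)
    if rule:
        value = str(args.get(rule["arg"], ""))
        for pattern in rule["patterns"]:
            if fnmatch.fnmatch(value, pattern):
                logging.warning("blocked tool=%s args=%s", tool, json.dumps(args))
                return False
    return True
```

Track blocked calls separately from “harm-in-text” cases, since a rule gate never sees the model's generated prose.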
Generated from per-paper analyses; no external browsing.
