Daily AI Paper Report (2026-04-22)

Run stats

  • Candidates: 311
  • Selected: 30
  • Deepread completed: 30
  • Window (UTC): 2026-04-20T00:00:00Z → 2026-04-21T00:00:00Z (arxiv_announce, expanded=0)
Selected papers

  • 2604.18463 · Using large language models for embodied planning introduces systematic safety risks
    Categories: cs.AI, cs.LG, cs.RO | Score: 96
    Why: DESPITE benchmark shows LLM planning can be highly capable yet systematically unsafe in robotics tasks
    Tags: agent-safety, embodied-agents, robotics, planning, benchmark, risk-evaluation

  • 2604.18487 · Adversarial Humanities Benchmark: Results on Stylistic Robustness in Frontier Model Safety
    Categories: cs.CL, cs.AI | Score: 95
    Why: Large jailbreak benchmark; big ASR jump under stylistic obfuscation across 31 frontier models
    Tags: jailbreaks, robustness, benchmark, red-teaming, safety-eval, stylistic-attacks

  • 2604.18519 · LLM Safety From Within: Detecting Harmful Content with Internal Representations
    Categories: cs.AI | Score: 94
    Why: Guardrail via internal-layer features; big gains with tiny params; better OOD generalization
    Tags: safety, harmful-content-detection, internal-representations, interpretability, guard-models

  • 2604.18510 · Different Paths to Harmful Compliance: Behavioral Side Effects and Mechanistic Divergence Across LLM Jailbreaks
    Categories: cs.CR, cs.AI, cs.CL | Score: 93
    Why: Compares jailbreak routes; shows mechanistic/behavioral divergence despite similar harmful compliance
    Tags: jailbreaks, mechanistic-analysis, RLVR, SFT, abliteration, safety-failure-modes

  • 2604.17860 · TitanCA: Lessons from Orchestrating LLM Agents to Discover 100+ CVEs
    Categories: cs.CR | Score: 93
    Why: Real-world multi-agent vuln discovery; 203 zero-days/118 CVEs; strong security lessons
    Tags: agentic-security, vulnerability-discovery, LLM-agents, cybersecurity, red-teaming, software-security

  • 2604.18179 · Committed SAE-Feature Traces for Audited-Session Substitution Detection in Hosted LLMs
    Categories: cs.CR, cs.AI | Score: 93
    Why: Commit-open protocol using SAE feature traces to detect hosted LLM silent model substitution
    Tags: security, auditing, model-integrity, SAE, verification, hosted-llms

  • 2604.17691 · SafeAnchor: Preventing Cumulative Safety Erosion in Continual Domain Adaptation of Large Language Models
    Categories: cs.LG, cs.AI | Score: 92
    Why: Targets safety erosion under continual domain adaptation; anchors safety subspaces during LoRA updates
    Tags: alignment, continual-learning, safety-preservation, fine-tuning, LoRA

  • 2604.18248 · Beyond Pattern Matching: Seven Cross-Domain Techniques for Prompt Injection Detection
    Categories: cs.CR, cs.CL | Score: 90
    Why: Seven cross-domain prompt-injection detection ideas aimed at adaptive adversaries beyond regex/classifiers
    Tags: prompt-injection, agent-security, detection, adversarial-robustness, LLM-security

  • 2604.17730 · MHSafeEval: Role-Aware Interaction-Level Evaluation of Mental Health Safety in Large Language Models
    Categories: cs.CL, cs.AI, cs.HC | Score: 89
    Why: Interaction-level mental health safety eval with role-aware harm taxonomy for multi-turn counseling
    Tags: mental-health, safety-eval, multi-turn, harm-taxonomy, clinical-safety, agents

  • 2604.18231 · AgenTEE: Confidential LLM Agent Execution on Edge Devices
    Categories: cs.CR, cs.OS | Score: 88
    Why: TEE-based confidential execution for LLM agents on edge; reduces attack surface and protects prompts/state
    Tags: agent-security, TEE, confidential-computing, edge, system-prompts, privacy

  • 2604.18362 · ArbGraph: Conflict-Aware Evidence Arbitration for Reliable Long-Form Retrieval-Augmented Generation
    Categories: cs.CL, cs.IR | Score: 88
    Why: Pre-generation conflict arbitration for long-form RAG; explicit support/contradiction claim graph
    Tags: RAG, factuality, hallucinations, evidence-arbitration, long-form-generation

  • 2604.18164 · MM-JudgeBias: A Benchmark for Evaluating Compositional Biases in MLLM-as-a-Judge
    Categories: cs.CL, cs.AI, cs.CV | Score: 88
    Why: Benchmark for compositional bias in MLLM-as-judge; controlled perturbations + metrics
    Tags: evaluation, judge-models, multimodal, bias, robustness, benchmarks

  • 2604.18103 · Stability Implies Redundancy: Delta Attention Selective Halting for Efficient Long-Context Prefilling
    Categories: cs.AI | Score: 88
    Why: Training-free selective halting for long-context prefilling; big speedups while keeping accuracy
    Tags: llm-efficiency, long-context, attention, inference-optimization, flashattention-compatible

  • 2604.17768 · When Vision-Language Models Judge Without Seeing: Exposing Informativeness Bias
    Categories: cs.AI | Score: 87
    Why: Shows VLM judges ignore images (informativeness bias) and proposes a mitigation method
    Tags: evaluation, VLM-as-judge, multimodal, bias, grounding, reliability

  • 2604.18240 · AJ-Bench: Benchmarking Agent-as-a-Judge for Environment-Aware Evaluation
    Categories: cs.AI | Score: 86
    Why: Benchmark for Agent-as-a-Judge that interacts with tools/envs to verify behavior beyond static judging
    Tags: evaluation, agentic-systems, LLM-judge, verification, benchmarks, tool-use

  • 2604.17943 · Domain-oriented RAG Assessment (DoRA): Synthetic Benchmarking for RAG-based Question Answering on Defense Documents
    Categories: cs.CL | Score: 86
    Why: Defense-doc RAG benchmark with auditable evidence; reports large gains + hallucination reduction
    Tags: RAG, benchmark, attribution, hallucinations, domain-eval

  • 2604.17843 · Learning from AVA: Early Lessons from a Curated and Trustworthy Generative AI for Policy and Development Research
    Categories: cs.HC, cs.AI | Score: 86
    Why: Evidence-based multi-agent system with citations + abstention; large in-the-wild eval
    Tags: RAG, epistemic-humility, abstention, citations, deployment, misinformation

  • 2604.17866 · Latent Abstraction for Retrieval-Augmented Generation
    Categories: cs.CL, cs.AI | Score: 86
    Why: Unifies RAG in latent space: LLM generates dense retrieval vectors instead of text queries
    Tags: RAG, retrieval, latent-retrieval, grounding, hallucinations, architecture

  • 2604.18109 · FLiP: Towards understanding and interpreting multimodal multilingual sentence embeddings
    Categories: cs.CL, cs.SD | Score: 86
    Why: Shows lexical content recoverable from embeddings; strong privacy/interpretability diagnostic for encoders
    Tags: embeddings, interpretability, privacy-leakage, multilingual, multimodal, representation-analysis

  • 2604.17803 · Adversarial Arena: Crowdsourcing Data Generation through Interactive Competition
    Categories: cs.AI, cs.LG | Score: 84
    Why: Adversarial competition framework to generate diverse safety-alignment conversation data at scale
    Tags: data-generation, red-teaming, alignment-data, crowdsourcing, adversarial-training

  • 2604.17948 · RAVEN: Retrieval-Augmented Vulnerability Exploration Network for Memory Corruption Analysis in User Code and Binary Programs
    Categories: cs.CR, cs.AI, cs.MA | Score: 84
    Why: LLM-agent + RAG for vulnerability root-cause reports; structured template and curated security KB
    Tags: cybersecurity, agents, RAG, vulnerability-analysis, software-security

  • 2604.18235 · Negative Advantage Is a Double-Edged Sword: Calibrating Advantage in GRPO for Deep Search
    Categories: cs.CL, cs.AI | Score: 84
    Why: Analyzes GRPO instability for deep-search agents; proposes advantage calibration fix
    Tags: agents, RLHF, GRPO, training-stability, search-agents, credit-assignment

  • 2604.17761 · Contrastive Attribution in the Wild: An Interpretability Analysis of LLM Failures on Realistic Benchmarks
    Categories: cs.AI, cs.CL | Score: 84
    Why: Contrastive attribution framework to analyze real benchmark failures; cross-layer graphs for long context
    Tags: interpretability, attribution, debugging, llm-failures, evaluation

  • 2604.17957 · Process Reward Models Meet Planning: Generating Precise and Scalable Datasets for Step-Level Rewards
    Categories: cs.CL | Score: 83
    Why: Scales PRM data via PDDL planning; ~1M step-level rewards beyond math; reusable for reasoning eval
    Tags: process-reward-models, reasoning, datasets, planning, evaluation

  • 2604.18224 · WebCompass: Towards Multimodal Web Coding Evaluation for Code Language Models
    Categories: cs.SE, cs.AI | Score: 83
    Why: WebCompass benchmark for multimodal web coding lifecycle (gen/edit/repair); human-in-loop
    Tags: code-agents, evaluation, multimodal, benchmarks, web-development, repair

  • 2604.17739 · Tool Learning Needs Nothing More Than a Free 8B Language Model
    Categories: cs.LG, cs.CL | Score: 83
    Why: Data-free tool-agent training with simulated environments from free 8B LMs + adaptive curriculum
    Tags: tool-use, agents, rl, synthetic-environments, open-models, training

  • 2604.17769 · Reverse Constitutional AI: A Framework for Controllable Toxic Data Generation via Probability-Clamped RLAIF
    Categories: cs.CL, cs.AI | Score: 82
    Why: Automated toxic data synthesis via inverted constitution; probability-clamped RLAIF to curb reward hacking
    Tags: adversarial-data, RLAIF, toxicity, red-teaming, reward-hacking, safety-training

  • 2604.17886 · Latent Preference Modeling for Cross-Session Personalized Tool Calling
    Categories: cs.CL, cs.AI | Score: 82
    Why: Benchmark + method for cross-session personalized tool calling; big token savings vs full history
    Tags: agents, tool-use, personalization, memory, benchmarks

  • 2604.17817 · Do LLMs Need to See Everything? A Benchmark and Study of Failures in LLM-driven Smartphone Automation using Screentext vs. Screenshots
    Categories: cs.HC, cs.AI, cs.MA | Score: 82
    Why: DailyDroid benchmark + failure analysis for smartphone agents; compares text vs screenshots
    Tags: mobile-agents, evaluation, HCI, multimodal, failure-analysis, automation

  • 2604.18584 · MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval
    Categories: cs.AI, cs.DL, cs.IR, cs.LG | Score: 82
    Why: Large multilingual multimodal Olympiad math benchmark + paired retrieval set for equivalence/similarity
    Tags: benchmark, math-reasoning, multimodal, multilingual, retrieval, evaluation

AI Paper Insight Brief

2026-04-22

0) Executive takeaways (read this first)

  • Evaluation is shifting from “single response” to “interaction + environment”: multi-turn, role-conditioned mental-health red-teaming (MHSafeEval) and replayable agent-judge verification (AJ-Bench) both show large gaps that static judging misses.
  • Automated judges are demonstrably biased, and better protocols can partially correct this: VLM judges often ignore images and over-reward "informativeness"; BIRCH improves judge accuracy by ~9–10 points and reduces bias, but roughly doubles inference time.
  • Safety failures compound under realistic post-training pipelines: sequential LoRA domain adaptation can cause cumulative safety erosion; SafeAnchor retains ~93% of original safety while keeping domain performance near standard LoRA.
  • Security is becoming “agentic + operational” rather than benchmark-only: TitanCA reports 118 CVEs from an orchestrated LLM-agent pipeline; Adversarial Arena shows tournament-generated multi-turn data can materially improve secure coding/refusal metrics after fine-tuning.
  • RAG reliability work is moving earlier in the pipeline: ArbGraph arbitrates contradictory evidence before generation and improves long-form factual recall (e.g., 83.3–84.9% FR), while DoRA shows domain-grounded synthetic benchmarks + light LoRA SFT can halve hallucination in a defense-doc QA setting.
  • Two complementary safety primitives are emerging: (a) internal-representation guards (SIREN) that beat open guard models with far fewer trainable params (a minimal probe sketch follows this list), and (b) serving-time auditing (committed SAE traces + Merkle commitments) to detect hosted-model substitution with ≤2.1% overhead.
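
A minimal sketch of the first primitive, assuming a SIREN-like design in which a small trainable head reads the serving model's own hidden states; the class name, layer/pooling choice, and head sizes are illustrative, not the paper's:

```python
import torch
import torch.nn as nn

class HiddenStateGuard(nn.Module):
    """Tiny harmfulness classifier over a frozen LLM's internal activations.

    Instead of running a separate guard model, train a small head on
    mid-layer hidden states of the serving model itself. Layer choice,
    mean pooling, and head sizes here are illustrative assumptions.
    """
    def __init__(self, hidden_dim: int = 4096, n_classes: int = 2):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(hidden_dim, 256), nn.ReLU(), nn.Linear(256, n_classes)
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        pooled = hidden_states.mean(dim=1)  # mean-pool over sequence tokens
        return self.head(pooled)            # logits: safe vs harmful

guard = HiddenStateGuard()
print(guard(torch.randn(2, 128, 4096)).shape)  # stand-in mid-layer activations
```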

1) Key themes (clusters)

  • Continual alignment under sequential adaptation
  • High-fidelity safety evaluation beyond single-turn prompts
  • Judge reliability and bias (text + multimodal)
  • RAG robustness via domain grounding and conflict arbitration
  • Agent training and data generation at scale (simulated + competitive)
  • Security & privacy primitives for real deployments

2) Technical synthesis

  • “Closed-loop search” is becoming the default for finding failures: MHSafeEval uses MAP-Elites-like archives; AJ-Bench uses interactive verification; Adversarial Arena uses tournaments—each increases coverage vs static prompts.
  • Judging pipelines are being treated as systems with measurable biases: informativeness bias (IB) and image reliance (IRS) quantify judge failure; BIRCH mitigates via a truthful anchor rather than length equalization alone.
  • Safety preservation is moving from “one-shot fine-tune” to “continual control”: SafeAnchor combines Fisher-based subspace identification + orthogonal gradient projection + monitoring-triggered repair.
  • Representation-level safety is now both an attack surface and a defense surface: jailbreak routes diverge mechanistically (RLVR vs SFT vs abliteration), while SIREN leverages internal layers for better harmfulness detection.
  • RAG reliability is splitting into (a) benchmark realism and (b) evidence arbitration: DoRA focuses on contamination-aware, intent-diverse domain QA; ArbGraph focuses on contradiction resolution before generation.
  • Operational security pipelines emphasize calibration and precision: TitanCA’s confidence calibration reduces false positives (28%→20%) while maintaining recall under imbalance; this mirrors the broader trend of “trust-preserving” tooling.
  • Efficiency work targets the prefill bottleneck, not just decoding: DASH prunes stabilized tokens after a start layer and remains FlashAttention-compatible, enabling length-dependent speedups (e.g., a theoretical 1.83× at 16k tokens); a minimal pruning sketch follows this list.
  • Benchmarks increasingly include cost/latency as first-class metrics: DailyDroid quantifies multimodal cost blowups; BIRCH reports ~2× inference time; AgenTEE reports <5.15% overhead vs processes.
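
A minimal sketch of DASH-style stability-based pruning during prefill; the cosine criterion, threshold `tau`, and drop policy are assumptions rather than the paper's exact halting rule:

```python
import torch

def stable_token_mask(h_prev: torch.Tensor, h_curr: torch.Tensor,
                      tau: float = 0.99) -> torch.Tensor:
    """Flag prefill tokens whose hidden states have stabilized across layers.

    Tokens whose representation barely moves between consecutive layers
    (cosine similarity >= tau) are treated as redundant and can be dropped
    from deeper layers' attention. The cosine criterion and `tau` are
    assumptions, not DASH's exact halting rule.
    """
    cos = torch.nn.functional.cosine_similarity(h_prev, h_curr, dim=-1)
    return cos >= tau  # True => halt (stop updating) this token

# Toy usage: decide which tokens deeper layers may skip.
h_prev = torch.randn(1, 4096, 1024)                # (batch, seq, hidden)
h_curr = h_prev + 0.01 * torch.randn_like(h_prev)  # small inter-layer delta
keep = ~stable_token_mask(h_prev, h_curr)
print(f"kept {keep.sum().item()} / {keep.numel()} tokens for deeper layers")
```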

3) Top 5 papers (with "why now")

1) SafeAnchor: Preventing Cumulative Safety Erosion in Continual Domain Adaptation of Large Language Models

  • Shows that safety degradation compounds across sequential domain LoRA adaptations; SafeAnchor retains 85.2±0.9 safety vs the base model's 91.4 (≈93.2% retention) while keeping domain performance near standard LoRA.
  • Practical recipe: Fisher-based "safety subspace" identification + orthogonal gradient projection + probe-triggered repair (a minimal projection sketch follows this entry).
  • Improves adversarial robustness (GCG refusal 78.4±2.1 vs 54.6±2.6 for the best baseline).
  • Skepticism: evaluated mainly at 7B and short sequences (3 domains; some extension to T=5); depends on probe quality (LlamaGuard) and Fisher approximations.
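
A minimal sketch of the orthogonal-projection step in that recipe, assuming the safety subspace is handed to us as an orthonormal basis; the Fisher-based construction, LoRA plumbing, and probe-triggered repair are omitted, and all names are illustrative:

```python
import torch

def project_out_safety_subspace(grad: torch.Tensor,
                                U_safe: torch.Tensor) -> torch.Tensor:
    """Remove the gradient component lying in the safety subspace.

    `U_safe` (d x k, orthonormal columns) stands in for directions
    identified as safety-critical, e.g. top eigenvectors of a Fisher
    estimate on refusal data. Constraining updates to the orthogonal
    complement keeps adaptation from moving weights along those directions.
    """
    return grad - U_safe @ (U_safe.T @ grad)

d, k = 4096, 32
U_safe, _ = torch.linalg.qr(torch.randn(d, k))  # stand-in orthonormal basis
g = project_out_safety_subspace(torch.randn(d), U_safe)
# projected gradient is (numerically) orthogonal to every safety direction
print(torch.allclose(U_safe.T @ g, torch.zeros(k), atol=1e-4))
```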

2) MHSafeEval: Role-Aware Interaction-Level Evaluation of Mental Health Safety in Large Language Models

  • Reframes mental-health safety as trajectory-level harm discovery with a role×category taxonomy (28 behaviors).
  • Closed-loop search dramatically increases attack success vs seed-only prompts (e.g., GPT-3.5 ASR 0.603→0.943); a sketch of the archive-based search follows this entry.
  • Finds relational harms (dependency induction, gaslighting, overpathologizing) are easy to elicit even when comprehension is high.
  • Skepticism: relies on simulated interactions and LLM-based clinical judging (gpt-4o-mini); frontier-scale coverage limited by cost.
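
A minimal sketch of the MAP-Elites-style archive such a closed-loop search can use (per the synthesis note above), with caller-supplied stand-ins for mutation, scoring, and behavior descriptors; the paper's archive dimensions and operators may differ:

```python
import random

def map_elites_redteam(seeds, mutate, score, behavior, iters=1000):
    """Quality-diversity search over attack prompts, MAP-Elites style.

    An archive keyed by a behavior descriptor (e.g., a (role, harm_category)
    cell) keeps the highest-scoring attack per cell, so search pressure
    spreads across the taxonomy instead of collapsing onto one jailbreak.
    `mutate`, `score`, and `behavior` are caller-supplied stand-ins
    (e.g., LLM rewriting, judge harm score, role/category classifier).
    """
    archive = {}  # cell -> (score, prompt)
    for p in seeds:
        archive[behavior(p)] = (score(p), p)
    for _ in range(iters):
        _, parent = random.choice(list(archive.values()))
        child = mutate(parent)
        cell, s = behavior(child), score(child)
        if cell not in archive or s > archive[cell][0]:
            archive[cell] = (s, child)
    return archive

# Toy usage with trivial stand-ins (real runs would call an LLM + judge).
found = map_elites_redteam(["seed"], mutate=lambda p: p + "!",
                           score=len, behavior=lambda p: ("peer", len(p) % 3))
print(len(found), "cells filled")
```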

3) When Vision-Language Models Judge Without Seeing: Exposing Informativeness Bias

  • Quantifies that VLM judges often barely use images (IRS typically below 3–5%) and prefer "informative" but wrong answers; metric sketches follow this entry.
  • BIRCH mitigates via a truthful informative anchor; improves judge accuracy (e.g., GPT-4o 66.45%→75.78%) and reduces IB (e.g., Llama-3.2 IB 52.9%→35.9%).
  • Skepticism: anchor errors can propagate; compute roughly doubles and bias is reduced but not eliminated.
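
One plausible way to operationalize the two diagnostics above, with stand-in callables; these definitions are illustrative and the paper's exact IRS/IB formulas may differ:

```python
def image_reliance_score(judge, examples):
    """Fraction of verdicts that change when the image is withheld.

    If a VLM judge returns the same verdict with and without the image on
    most examples, it is effectively judging blind. `judge` is a stand-in
    callable: judge(question, answer, image) -> verdict.
    """
    flips = sum(
        judge(ex["question"], ex["answer"], ex["image"])
        != judge(ex["question"], ex["answer"], None)
        for ex in examples
    )
    return flips / len(examples)

def informativeness_bias(prefer, pairs):
    """Rate at which a pairwise judge picks a detailed-but-wrong answer over
    a terse-but-correct one. `prefer(a, b)` is a stand-in returning True
    when the judge prefers `a`."""
    wins = sum(prefer(p["informative_wrong"], p["terse_correct"]) for p in pairs)
    return wins / len(pairs)
```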

4) Committed SAE-Feature Traces for Audited-Session Substitution Detection in Hosted LLMs

  • Introduces a commit-open protocol (a Merkle commitment over SAE top-k feature sketches) that closes the "parallel-serve" loophole in probe-after-return verification; a commitment sketch follows this entry.
  • Reports detection across substitute classes and an SVIP comparison (SVIP misses 11/11; commit-open detects 11/11 in the rerun set).
  • Low serving overhead (≤2.1% at batch 32; 224-byte payload).
  • Skepticism: scoped to specific backbones/SAEs (1.7–9B) and threat model; flagship-scale and stronger white-box adaptation remain open.
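
A minimal sketch of the commit half of such a commit-open protocol: hash each token's top-k SAE feature sketch into a binary Merkle tree and publish the root before any probing. The serialization, tree shape, and opening/verification side here are assumptions:

```python
import hashlib, json

def leaf(sketch: dict) -> bytes:
    """Hash one token's top-k SAE feature sketch (indices + rounded values)."""
    return hashlib.sha256(json.dumps(sketch, sort_keys=True).encode()).digest()

def merkle_root(leaves: list[bytes]) -> bytes:
    """Binary Merkle root over per-token leaves (duplicate the last if odd)."""
    level = leaves
    while len(level) > 1:
        if len(level) % 2:
            level = level + [level[-1]]
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]

# Commit phase: the provider publishes the root alongside the response;
# open phase (not shown): the auditor recomputes sketches on the claimed
# model and checks selected leaves against the committed root.
tokens = [{"idx": [3, 17, 42], "val": [0.91, 0.55, 0.13]},
          {"idx": [8, 17, 99], "val": [0.77, 0.41, 0.09]}]
print(merkle_root([leaf(t) for t in tokens]).hex()[:16])
```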

5) Using large language models for embodied planning introduces systematic safety risks (DESPITE)

  • Deterministic PDDL benchmark (12,279 tasks) separates Feasibility from Safety Intention; shows safety awareness scales slowly (β_SI = 4.5) vs feasibility (β_F = 26.8).
  • Striking example: Gemini-3-Pro-Preview produces infeasible plans on only 0.4% of tasks yet dangerous plans on 28.7%.
  • Provides a clean decomposition, written out below: Safety ≈ Feasibility × Safety Intention (R² ≈ 0.99).
  • Skepticism: symbolic/deterministic setting (no perception, no continuous dynamics); interpret as lower bound for real robotics.
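
The decomposition, written out; the regression behind the β slopes is not reproduced here, so treat the functional form exactly as the summary states it:

```latex
% Decomposition as summarized above: a plan counts as safe only if it is
% both feasible and safety-intended, so aggregate rates factorize:
\[
  \mathrm{Safety} \;\approx\; \mathrm{Feasibility} \times \mathrm{SafetyIntention}
  \qquad (R^2 \approx 0.99)
\]
% with feasibility scaling much faster with capability (\beta_F = 26.8)
% than safety intention (\beta_{SI} = 4.5); hence the capable-but-unsafe regime.
```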

4) Practical next steps

  • If you do continual specialization: add a post-adaptation safety monitor (probe set + threshold) and test orthogonal-gradient constraints (SafeAnchor-style) for LoRA pipelines; track safety retention across multiple sequential domains, not just one.
  • If you rely on LLM/VLM judges: measure and report bias splits (informativeness-driven vs correctness-driven) and image reliance (IRS); consider anchor-based judging (BIRCH) when correctness must dominate.
  • For agent evaluation: adopt environment-replayable judge setups (AJ-Bench style) for at least one domain you care about; compare LLM-as-judge vs agent-as-judge F1 and budget sensitivity.
  • For RAG in sensitive domains: build a DoRA-like synthetic, evidence-linked regression set from your private corpus; then test whether light LoRA SFT improves both task metrics and hallucination diagnostics under a fixed retriever.
  • For long-form RAG: prototype pre-generation claim arbitration (ArbGraph-style) on a small slice; measure factual recall / hallucination vs your current “retrieve-then-generate” baseline.
  • For hosted-model integrity: evaluate whether a commit-before-open trace (e.g., SAE sketch + Merkle) is feasible in your serving stack; quantify overhead and decide what attacker classes you need to cover.
  • For red-teaming coverage: add stylistic obfuscation transformations (AHB-style) to your single-turn safety suite; track ∆ASR under rhetorical displacement as a robustness KPI (a minimal harness sketch follows).
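
A minimal harness for that ∆ASR KPI, with caller-supplied `model`, `judge`, and style `transforms` as stand-ins:

```python
def delta_asr(model, judge, prompts, transforms):
    """Worst-case attack-success-rate lift from stylistic obfuscation.

    Run each harmful prompt plainly and under each rhetorical transform
    (e.g., archaic prose, poetry, academic register), judge the responses,
    and report the largest ASR increase. `model`, `judge` (response -> bool),
    and `transforms` (prompt -> prompt) are caller-supplied stand-ins.
    """
    def asr(ps):
        return sum(judge(model(p)) for p in ps) / len(ps)

    base = asr(prompts)
    return max(asr([t(p) for p in prompts]) for t in transforms) - base
```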

Generated from per-paper analyses; no external browsing.