Daily AI Paper Report (2026-04-22)
Run stats
- Candidates: 311
- Selected: 30
- Deepread completed: 30
- Window (UTC): 2026-04-20T00:00:00Z → 2026-04-21T00:00:00Z (arxiv_announce, expanded=0)
Selected papers
| arXiv ID | Title / Links | Categories | Score | Why | Tags |
|---|---|---|---|---|---|
| 2604.18463 | Using large language models for embodied planning introduces systematic safety risks | cs.AI, cs.LG, cs.RO | 96 | DESPITE benchmark shows LLM planning can be highly capable yet systematically unsafe in robotics tasks | agent-safety, embodied-agents, robotics, planning, benchmark, risk-evaluation |
| 2604.18487 | Adversarial Humanities Benchmark: Results on Stylistic Robustness in Frontier Model Safety | cs.CL, cs.AI | 95 | Large jailbreak benchmark; big ASR jump under stylistic obfuscation across 31 frontier models | jailbreaks, robustness, benchmark, red-teaming, safety-eval, stylistic-attacks |
| 2604.18519 | LLM Safety From Within: Detecting Harmful Content with Internal Representations | cs.AI | 94 | Guardrail via internal-layer features; big gains with tiny params; better OOD generalization | safety, harmful-content-detection, internal-representations, interpretability, guard-models |
| 2604.18510 | Different Paths to Harmful Compliance: Behavioral Side Effects and Mechanistic Divergence Across LLM Jailbreaks | cs.CR, cs.AI, cs.CL | 93 | Compares jailbreak routes; shows mechanistic/behavioral divergence despite similar harmful compliance | jailbreaks, mechanistic-analysis, RLVR, SFT, abliteration, safety-failure-modes |
| 2604.17860 | TitanCA: Lessons from Orchestrating LLM Agents to Discover 100+ CVEs | cs.CR | 93 | Real-world multi-agent vuln discovery; 203 zero-days/118 CVEs; strong security lessons | agentic-security, vulnerability-discovery, LLM-agents, cybersecurity, red-teaming, software-security |
| 2604.18179 | Committed SAE-Feature Traces for Audited-Session Substitution Detection in Hosted LLMs | cs.CR, cs.AI | 93 | Commit-open protocol using SAE feature traces to detect hosted LLM silent model substitution | security, auditing, model-integrity, SAE, verification, hosted-llms |
| 2604.17691 | SafeAnchor: Preventing Cumulative Safety Erosion in Continual Domain Adaptation of Large Language Models | cs.LG, cs.AI | 92 | Targets safety erosion under continual domain adaptation; anchors safety subspaces during LoRA updates | alignment, continual-learning, safety-preservation, fine-tuning, LoRA |
| 2604.18248 | Beyond Pattern Matching: Seven Cross-Domain Techniques for Prompt Injection Detection | cs.CR, cs.CL | 90 | Seven cross-domain prompt-injection detection ideas aimed at adaptive adversaries beyond regex/classifiers | prompt-injection, agent-security, detection, adversarial-robustness, LLM-security |
| 2604.17730 | MHSafeEval: Role-Aware Interaction-Level Evaluation of Mental Health Safety in Large Language Models | cs.CL, cs.AI, cs.HC | 89 | Interaction-level mental health safety eval with role-aware harm taxonomy for multi-turn counseling | mental-health, safety-eval, multi-turn, harm-taxonomy, clinical-safety, agents |
| 2604.18231 | AgenTEE: Confidential LLM Agent Execution on Edge Devices | cs.CR, cs.OS | 88 | TEE-based confidential execution for LLM agents on edge; reduces attack surface and protects prompts/state | agent-security, TEE, confidential-computing, edge, system-prompts, privacy |
| 2604.18362 | ArbGraph: Conflict-Aware Evidence Arbitration for Reliable Long-Form Retrieval-Augmented Generation | cs.CL, cs.IR | 88 | Pre-generation conflict arbitration for long-form RAG; explicit support/contradiction claim graph | RAG, factuality, hallucinations, evidence-arbitration, long-form-generation |
| 2604.18164 | MM-JudgeBias: A Benchmark for Evaluating Compositional Biases in MLLM-as-a-Judge | cs.CL, cs.AI, cs.CV | 88 | Benchmark for compositional bias in MLLM-as-judge; controlled perturbations + metrics | evaluation, judge-models, multimodal, bias, robustness, benchmarks |
| 2604.18103 | Stability Implies Redundancy: Delta Attention Selective Halting for Efficient Long-Context Prefilling | cs.AI | 88 | Training-free selective halting for long-context prefilling; big speedups while keeping accuracy | llm-efficiency, long-context, attention, inference-optimization, flashattention-compatible |
| 2604.17768 | When Vision-Language Models Judge Without Seeing: Exposing Informativeness Bias | cs.AI | 87 | Shows VLM judges ignore images (informativeness bias) and proposes a mitigation method | evaluation, VLM-as-judge, multimodal, bias, grounding, reliability |
| 2604.18240 | AJ-Bench: Benchmarking Agent-as-a-Judge for Environment-Aware Evaluation | cs.AI | 86 | Benchmark for Agent-as-a-Judge that interacts with tools/envs to verify behavior beyond static judging | evaluation, agentic-systems, LLM-judge, verification, benchmarks, tool-use |
| 2604.17943 | Domain-oriented RAG Assessment (DoRA): Synthetic Benchmarking for RAG-based Question Answering on Defense Documents | cs.CL | 86 | Defense-doc RAG benchmark with auditable evidence; reports large gains + hallucination reduction | RAG, benchmark, attribution, hallucinations, domain-eval |
| 2604.17843 | Learning from AVA: Early Lessons from a Curated and Trustworthy Generative AI for Policy and Development Research | cs.HC, cs.AI | 86 | Evidence-based multi-agent system with citations + abstention; large in-the-wild eval | RAG, epistemic-humility, abstention, citations, deployment, misinformation |
| 2604.17866 | Latent Abstraction for Retrieval-Augmented Generation | cs.CL, cs.AI | 86 | Unifies RAG in latent space: LLM generates dense retrieval vectors instead of text queries | RAG, retrieval, latent-retrieval, grounding, hallucinations, architecture |
| 2604.18109 | FLiP: Towards understanding and interpreting multimodal multilingual sentence embeddings | cs.CL, cs.SD | 86 | Shows lexical content recoverable from embeddings; strong privacy/interpretability diagnostic for encoders | embeddings, interpretability, privacy-leakage, multilingual, multimodal, representation-analysis |
| 2604.17803 | Adversarial Arena: Crowdsourcing Data Generation through Interactive Competition | cs.AI, cs.LG | 84 | Adversarial competition framework to generate diverse safety-alignment conversation data at scale | data-generation, red-teaming, alignment-data, crowdsourcing, adversarial-training |
| 2604.17948 | RAVEN: Retrieval-Augmented Vulnerability Exploration Network for Memory Corruption Analysis in User Code and Binary Programs | cs.CR, cs.AI, cs.MA | 84 | LLM-agent + RAG for vulnerability root-cause reports; structured template and curated security KB | cybersecurity, agents, RAG, vulnerability-analysis, software-security |
| 2604.18235 | Negative Advantage Is a Double-Edged Sword: Calibrating Advantage in GRPO for Deep Search | cs.CL, cs.AI | 84 | Analyzes GRPO instability for deep-search agents; proposes advantage calibration fix | agents, RLHF, GRPO, training-stability, search-agents, credit-assignment |
| 2604.17761 | Contrastive Attribution in the Wild: An Interpretability Analysis of LLM Failures on Realistic Benchmarks | cs.AI, cs.CL | 84 | Contrastive attribution framework to analyze real benchmark failures; cross-layer graphs for long context | interpretability, attribution, debugging, llm-failures, evaluation |
| 2604.17957 | Process Reward Models Meet Planning: Generating Precise and Scalable Datasets for Step-Level Rewards | cs.CL | 83 | Scales PRM data via PDDL planning; ~1M step-level rewards beyond math; reusable for reasoning eval | process-reward-models, reasoning, datasets, planning, evaluation |
| 2604.18224 | WebCompass: Towards Multimodal Web Coding Evaluation for Code Language Models | cs.SE, cs.AI | 83 | WebCompass benchmark for multimodal web coding lifecycle (gen/edit/repair); human-in-loop | code-agents, evaluation, multimodal, benchmarks, web-development, repair |
| 2604.17739 | Tool Learning Needs Nothing More Than a Free 8B Language Model | cs.LG, cs.CL | 83 | Data-free tool-agent training with simulated environments from free 8B LMs + adaptive curriculum | tool-use, agents, rl, synthetic-environments, open-models, training |
| 2604.17769 | Reverse Constitutional AI: A Framework for Controllable Toxic Data Generation via Probability-Clamped RLAIF | cs.CL, cs.AI | 82 | Automated toxic data synthesis via inverted constitution; probability-clamped RLAIF to curb reward hacking | adversarial-data, RLAIF, toxicity, red-teaming, reward-hacking, safety-training |
| 2604.17886 | Latent Preference Modeling for Cross-Session Personalized Tool Calling | cs.CL, cs.AI | 82 | Benchmark + method for cross-session personalized tool calling; big token savings vs full history | agents, tool-use, personalization, memory, benchmarks |
| 2604.17817 | Do LLMs Need to See Everything? A Benchmark and Study of Failures in LLM-driven Smartphone Automation using Screentext vs. Screenshots | cs.HC, cs.AI, cs.MA | 82 | DailyDroid benchmark + failure analysis for smartphone agents; compares text vs screenshots | mobile-agents, evaluation, HCI, multimodal, failure-analysis, automation |
| 2604.18584 | MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval | cs.AI, cs.DL, cs.IR, cs.LG | 82 | Large multilingual multimodal Olympiad math benchmark + paired retrieval set for equivalence/similarity | benchmark, math-reasoning, multimodal, multilingual, retrieval, evaluation |
AI Paper Insight Brief
2026-04-22
0) Executive takeaways (read this first)
- Evaluation is shifting from “single response” to “interaction + environment”: multi-turn, role-conditioned mental-health red-teaming (MHSafeEval) and replayable agent-judge verification (AJ-Bench) both show large gaps that static judging misses.
- Automated judges are demonstrably biased and can be fixed (partially) with better protocols: VLM judges often ignore images and over-reward “informativeness”; BIRCH improves accuracy by ~9–10% and reduces bias but doubles inference time.
- Safety failures compound under realistic post-training pipelines: sequential LoRA domain adaptation can cause cumulative safety erosion; SafeAnchor retains ~93% of original safety while keeping domain performance near standard LoRA.
- Security is becoming “agentic + operational” rather than benchmark-only: TitanCA reports 118 CVEs from an orchestrated LLM-agent pipeline; Adversarial Arena shows tournament-generated multi-turn data can materially improve secure coding/refusal metrics after fine-tuning.
- RAG reliability work is moving earlier in the pipeline: ArbGraph arbitrates contradictory evidence before generation and improves long-form factual recall (e.g., 83.3–84.9% FR), while DoRA shows domain-grounded synthetic benchmarks + light LoRA SFT can halve hallucination in a defense-doc QA setting.
- Two complementary safety primitives are emerging: (a) internal-representation guards (SIREN) that beat open guard models with far fewer trainable params, and (b) serving-time auditing (committed SAE traces + Merkle) to detect hosted-model substitution with ≤2.1% overhead.
2) Key themes (clusters)
Theme: Continual alignment under sequential adaptation
- Why it matters: Real deployments repeatedly specialize models (medicine→law→code). Safety regressions can accumulate across steps, not just per-task.
- Representative papers: SafeAnchor (2604.17691); Different Paths to Harmful Compliance (2604.18510)
- Common approach:
- Treat safety as something that can be tracked/anchored in parameter or representation space (Fisher subspaces; refusal directions).
- Use targeted constraints/repairs rather than “re-align from scratch” (orthogonal gradient projection; directional repair).
- Open questions / failure modes:
- Does safety-subspace projection exhaust capacity over long adaptation sequences (T≫5) or larger models?
- Are defenses route-specific (RLVR vs SFT vs abliteration) such that “one fix” won’t generalize?
Theme: High-fidelity safety evaluation beyond single-turn prompts
- Why it matters: Many harms are relational, cumulative, or only visible when a judge can interact with the environment; single-turn datasets under-detect these failures.
- Representative papers: MHSafeEval (2604.17730); AJ-Bench (2604.18240); Adversarial Humanities Benchmark (2604.18487)
- Common approach:
- Closed-loop adversarial search over trajectories (MAP-Elites-like archives; interaction budgets).
- Explicit taxonomies (role×harm categories; verification dimensions; stylistic transformations).
- Replayable environments and tool access for judges (agentic verification).
- Open questions / failure modes:
- Reliance on LLM judges and simulated users (mental health) may mis-estimate real clinical harm.
- Environment instability and cost (agentic judging) limit scale; “thinking” can even hurt verification.
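The closed-loop archive search these evaluations share can be sketched as follows. This is an illustrative skeleton only — `mutate`, `evaluate`, and `cell_of` are placeholder callables, and no paper's exact algorithm is reproduced:

```python
import random

def closed_loop_search(seeds, mutate, evaluate, cell_of, budget=200, seed=0):
    """MAP-Elites-style archive search over attack trajectories.

    evaluate(x) -> harm score in [0, 1]; cell_of(x) -> behavior-cell key.
    The archive keeps the highest-scoring attack found per cell, so search
    pressure covers many behavior categories instead of collapsing onto one.
    """
    rng = random.Random(seed)
    archive = {}               # cell -> (score, attack)
    frontier = list(seeds)
    for _ in range(budget):
        parent = rng.choice(frontier)
        child = mutate(parent, rng)
        score, cell = evaluate(child), cell_of(child)
        if cell not in archive or score > archive[cell][0]:
            archive[cell] = (score, child)
            frontier.append(child)   # elites seed further mutations
    return archive

# Toy domain: attacks are integers, cells are residues mod 3,
# and "harm" peaks near 50.
archive = closed_loop_search(
    seeds=[0],
    mutate=lambda x, rng: x + rng.randint(-5, 5),
    evaluate=lambda x: max(0.0, 1 - abs(x - 50) / 100),
    cell_of=lambda x: x % 3,
)
```

The per-cell elitism is what distinguishes this from plain hill-climbing: coverage of the taxonomy is a first-class objective alongside attack strength.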
Theme: Judge reliability and bias (text + multimodal)
- Why it matters: Judges are used for evaluation and reward; systematic bias (e.g., preferring longer/more detailed answers) can mis-train models.
- Representative papers: When Vision-Language Models Judge Without Seeing (2604.17768); MM-JudgeBias (2604.18164); Learning from AVA (2604.17843)
- Common approach:
- Diagnose bias with explicit metrics/splits (IRS/IB/LB; abstention + provenance).
- Add protocol-level mitigations without retraining (anchor-based judging; verification + abstention).
- Open questions / failure modes:
- Anchor generation can itself be wrong and propagate errors; compute cost roughly doubles.
- How to validate judge improvements against human preferences at scale (especially multimodal)?
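One way to make such bias measurable is to count how often a judge picks the more "informative" answer on trap pairs where that answer is wrong. This is a hypothetical metric in the spirit of the IB scores discussed above; the papers' exact definitions may differ:

```python
def informativeness_bias(judgments):
    """Fraction of trap pairs (informative answer is incorrect) on which
    the judge still picked the informative answer.

    judgments: list of dicts with keys
      'picked_informative' (bool) and 'informative_is_correct' (bool)
    """
    traps = [j for j in judgments if not j["informative_is_correct"]]
    if not traps:
        return 0.0
    return sum(j["picked_informative"] for j in traps) / len(traps)

# Three trap pairs: the judge falls for two of them.
sample = [
    {"picked_informative": True,  "informative_is_correct": False},
    {"picked_informative": True,  "informative_is_correct": False},
    {"picked_informative": False, "informative_is_correct": False},
    {"picked_informative": True,  "informative_is_correct": True},  # not a trap
]
bias = informativeness_bias(sample)
```

Restricting the metric to trap pairs is the key design choice: on pairs where the informative answer is also correct, preferring it is not evidence of bias.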
Theme: RAG robustness via domain grounding and conflict arbitration
- Why it matters: In high-stakes domains, retrieval noise/contradictions drive hallucinations; better retrieval and evidence selection are needed.
- Representative papers: ArbGraph (2604.18362); DoRA (2604.17943); Latent Abstraction for Retrieval-Augmented Generation (2604.17866)
- Common approach:
- Make evidence explicit and auditable (evidence bundles; atomic claims; support/contradiction graphs).
- Add control policies for retrieval depth/stop decisions (control heads; arbitration budgets).
- Show lightweight adaptation (LoRA SFT) can materially improve faithfulness in-domain.
- Open questions / failure modes:
- Pre-generation arbitration adds latency; runtime/cost not fully quantified.
- Latent RAG claims efficiency but lacks reported latency/retrieval-call counts in the provided results.
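The support/contradiction-graph idea can be sketched with a simple vote over evidence edges. This is a simplified illustration of pre-generation arbitration, not ArbGraph's actual algorithm:

```python
from collections import Counter

def arbitrate_claims(claims, edges, margin=1):
    """Keep a claim only if supporting evidence outnumbers contradicting
    evidence by at least `margin`; contested claims are dropped before
    generation instead of being passed to the LLM.

    claims: iterable of claim ids
    edges:  list of (claim_id, evidence_id, label),
            label in {"support", "contradict"}
    """
    support, contra = Counter(), Counter()
    for claim, _, label in edges:
        (support if label == "support" else contra)[claim] += 1
    return {c for c in claims if support[c] - contra[c] >= margin}

edges = [
    ("c1", "e1", "support"), ("c1", "e2", "support"),
    ("c2", "e3", "support"), ("c2", "e4", "contradict"),
    ("c3", "e5", "contradict"),
]
kept = arbitrate_claims(["c1", "c2", "c3"], edges)   # only c1 survives
```

Resolving conflicts at the claim level, before any tokens are generated, is what distinguishes this family from post-hoc fact-checking of model output.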
Theme: Agent training and data generation at scale (simulated + competitive)
- Why it matters: Tool agents and safety training are bottlenecked by interactive data; scalable generation pipelines can unlock RL and alignment progress.
- Representative papers: Tool Learning Needs Nothing More Than a Free 8B Language Model (2604.17739); Adversarial Arena (2604.17803); Reverse Constitutional AI (2604.17769)
- Common approach:
- Replace expensive real environments with LM-simulated components (task/user/tool/verifier).
- Incentivize diversity and realism via competition (attacker/defender tournaments; diversity-weighted scoring).
- Stabilize adversarial optimization to avoid reward hacking (probability clamping).
- Open questions / failure modes:
- Simulator LM limitations and instability can abort trajectories; many curriculum hyperparameters.
- Toxic-data synthesis depends on AI judges and clamping bounds; downstream alignment impact not yet systematically measured.
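The probability-clamping stabilizer reduces to a one-line guard on the AI-judge signal. A generic sketch of the idea — the actual bounds and where the clamp sits in the RLAIF pipeline may differ from the paper:

```python
def clamped_reward(judge_prob, lo=0.05, hi=0.95):
    """Clamp an AI-judge probability before using it as an RLAIF reward.

    Capping both tails bounds the achievable reward, so the policy gains
    little from pushing the judge into saturated, easily-hacked extremes.
    """
    return min(max(judge_prob, lo), hi)

rewards = [clamped_reward(p) for p in (0.0, 0.5, 0.999)]
```

The point is not the arithmetic but the incentive shape: once the reward plateaus at `hi`, gradient pressure toward degenerate judge-exploiting outputs disappears.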
Theme: Security & privacy primitives for real deployments
- Why it matters: As LLMs move into production (hosted APIs, edge agents, code security), we need verifiable identity, confidentiality, and precision-focused pipelines.
- Representative papers: Committed SAE-Feature Traces (2604.18179); AgenTEE (2604.18231); TitanCA (2604.17860)
- Common approach:
- Bind claims to computation (Merkle commitments over SAE feature sketches; calibrated probe libraries).
- Hardware-backed isolation for agent components (Arm CCA realms + attestation + confidential shared memory).
- Multi-module orchestration emphasizing precision and calibration (match→filter→inspect→adapt).
- Open questions / failure modes:
- Auditing is scoped to ≤9B backbones and specific SAE/probe libraries; flagship-scale generalization is open.
- Edge confidential execution excludes side-channels/physical attacks; performance shown on prototype hardware.
3) Technical synthesis
- “Closed-loop search” is becoming the default for finding failures: MHSafeEval uses MAP-Elites-like archives; AJ-Bench uses interactive verification; Adversarial Arena uses tournaments—each increases coverage vs static prompts.
- Judging pipelines are being treated as systems with measurable biases: informativeness bias (IB) and image reliance (IRS) quantify judge failure; BIRCH mitigates via a truthful anchor rather than length equalization alone.
- Safety preservation is moving from “one-shot fine-tune” to “continual control”: SafeAnchor combines Fisher-based subspace identification + orthogonal gradient projection + monitoring-triggered repair.
- Representation-level safety is now both an attack surface and a defense surface: jailbreak routes diverge mechanistically (RLVR vs SFT vs abliteration), while SIREN leverages internal layers for better harmfulness detection.
- RAG reliability is splitting into (a) benchmark realism and (b) evidence arbitration: DoRA focuses on contamination-aware, intent-diverse domain QA; ArbGraph focuses on contradiction resolution before generation.
- Operational security pipelines emphasize calibration and precision: TitanCA’s confidence calibration reduces false positives (28%→20%) while maintaining recall under imbalance; this mirrors the broader trend of “trust-preserving” tooling.
- Efficiency work targets the prefill bottleneck, not just decoding: DASH prunes stabilized tokens after a start layer and remains FlashAttention-compatible, enabling length-dependent speedups (e.g., theoretical 1.83× at 16k tokens).
- Benchmarks increasingly include cost/latency as first-class metrics: DailyDroid quantifies multimodal cost blowups; BIRCH reports ~2× inference time; AgenTEE reports <5.15% overhead vs processes.
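The selective-halting criterion behind the prefill speedups can be sketched as a per-token stability test. This is an illustration of the delta-attention idea under assumed shapes, not DASH's exact criterion:

```python
import numpy as np

def select_halted_tokens(h_prev, h_curr, tau=0.01):
    """Flag tokens whose hidden states have stabilized between two
    consecutive layers; halted tokens can be frozen, shrinking the
    effective sequence length for the remaining prefill layers.

    h_prev, h_curr: (seq_len, d_model) hidden states at layers l-1 and l
    Returns a boolean mask over tokens (True = halt).
    """
    delta = np.linalg.norm(h_curr - h_prev, axis=-1)
    scale = np.linalg.norm(h_curr, axis=-1) + 1e-8
    return (delta / scale) < tau

rng = np.random.default_rng(0)
h_prev = rng.normal(size=(6, 16))
h_curr = h_prev.copy()
h_curr[3:] += rng.normal(scale=0.5, size=(3, 16))  # last 3 tokens still changing
mask = select_halted_tokens(h_prev, h_curr)        # first 3 tokens halt
```

Because the test needs only two adjacent layers' activations, it is training-free and can be applied after any chosen start layer.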
4) Top 5 papers (with “why now”)
1) SafeAnchor: Preventing Cumulative Safety Erosion in Continual Domain Adaptation of Large Language Models
- Shows safety can degrade cumulatively across sequential domain LoRA updates; SafeAnchor retains 85.2±0.9 safety vs base 91.4 (≈93.2% retention) while keeping domain performance near standard LoRA.
- Practical recipe: Fisher-based “safety subspace” + orthogonal gradient projection + probe-triggered repair.
- Improves adversarial robustness (GCG refusal 78.4±2.1 vs 54.6±2.6 best baseline).
- Skepticism: evaluated mainly at 7B and short sequences (3 domains; some extension to T=5); depends on probe quality (LlamaGuard) and Fisher approximations.
2) MHSafeEval: Role-Aware Interaction-Level Evaluation of Mental Health Safety in Large Language Models
- Reframes mental-health safety as trajectory-level harm discovery with a role×category taxonomy (28 behaviors).
- Closed-loop search dramatically increases attack success vs seed-only (e.g., GPT-3.5 ASR 0.603→0.943).
- Finds relational harms (dependency induction, gaslighting, overpathologizing) are easy to elicit even when comprehension is high.
- Skepticism: relies on simulated interactions and LLM-based clinical judging (gpt-4o-mini); frontier-scale coverage limited by cost.
3) When Vision-Language Models Judge Without Seeing: Exposing Informativeness Bias
- Quantifies that VLM judges often barely use images (IRS typically <3–5%) and prefer “informative” but wrong answers.
- BIRCH mitigates via a truthful informative anchor; improves judge accuracy (e.g., GPT-4o 66.45%→75.78%) and reduces IB (e.g., Llama-3.2 IB 52.9%→35.9%).
- Skepticism: anchor errors can propagate; compute roughly doubles and bias is reduced but not eliminated.
4) Committed SAE-Feature Traces for Audited-Session Substitution Detection in Hosted LLMs
- Introduces a commit-open protocol (Merkle commitment over SAE top-k feature sketches) that closes the “parallel-serve” loophole in probe-after-return verification.
- Reports detection across substitute classes and an SVIP comparison (SVIP misses 11/11; commit-open detects 11/11 in the rerun set).
- Low serving overhead (≤2.1% at batch 32; 224-byte payload).
- Skepticism: scoped to specific backbones/SAEs (1.7–9B) and threat model; flagship-scale and stronger white-box adaptation remain open.
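The commit half of the protocol is essentially a Merkle root over per-token feature sketches. A minimal sketch under assumed payloads (illustrative top-k feature-id lists; the paper's actual sketch format and hashing details differ):

```python
import hashlib
import json

def leaf_hash(sketch):
    # sketch: per-token list of top-k SAE feature ids (illustrative payload)
    return hashlib.sha256(json.dumps(sketch).encode()).digest()

def merkle_root(sketches):
    """Merkle root over per-token feature sketches. Committing this root
    before the response is returned prevents a host from deciding, after
    seeing the audit challenge, which model's trace to 'open'."""
    level = [leaf_hash(s) for s in sketches]
    while len(level) > 1:
        if len(level) % 2:               # duplicate last node on odd levels
            level.append(level[-1])
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]

sketches = [[3, 17, 42], [5, 17, 99], [1, 2, 3]]
root = merkle_root(sketches)   # 32-byte commitment sent with the response
```

Any later substitution changes some leaf, hence the root, so the auditor only needs the pre-committed root plus opened leaves along a challenge path.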
5) Using large language models for embodied planning introduces systematic safety risks (DESPITE)
- Deterministic PDDL benchmark (12,279 tasks) separates Feasibility from Safety Intention; shows safety awareness scales slowly (β_SI = 4.5) vs feasibility (β_F = 26.8).
- Striking example: Gemini-3-Pro-Preview is infeasible only 0.4% but produces dangerous plans 28.7%.
- Provides a clean decomposition: Safety ≈ Feasibility × Safety Intention (R²≈0.99).
- Skepticism: symbolic/deterministic setting (no perception, no continuous dynamics); interpret as lower bound for real robotics.
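The decomposition can be checked against the reported numbers in a few lines. The interpretation here is illustrative (treating "dangerous plan rate" as one minus safety intention); the paper's exact operationalization of each rate may differ:

```python
# Safety ≈ Feasibility × Safety Intention, using the Gemini-3-Pro-Preview
# figures reported above.
feasibility = 1 - 0.004        # infeasible on only 0.4% of tasks
safety_intention = 1 - 0.287   # dangerous plans on 28.7% of tasks
safety = feasibility * safety_intention
# A planner can be nearly always feasible yet safe only ~71% of the time.
```

This is the sense in which capability and safety decouple: the feasibility factor is near 1, so overall safety is dominated almost entirely by safety intention.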
5) Practical next steps
- If you do continual specialization: add a post-adaptation safety monitor (probe set + threshold) and test orthogonal-gradient constraints (SafeAnchor-style) for LoRA pipelines; track safety retention across multiple sequential domains, not just one.
- If you rely on LLM/VLM judges: measure and report bias splits (informativeness-driven vs correctness-driven) and image reliance (IRS); consider anchor-based judging (BIRCH) when correctness must dominate.
- For agent evaluation: adopt environment-replayable judge setups (AJ-Bench style) for at least one domain you care about; compare LLM-as-judge vs agent-as-judge F1 and budget sensitivity.
- For RAG in sensitive domains: build a DoRA-like synthetic, evidence-linked regression set from your private corpus; then test whether light LoRA SFT improves both task metrics and hallucination diagnostics under a fixed retriever.
- For long-form RAG: prototype pre-generation claim arbitration (ArbGraph-style) on a small slice; measure factual recall / hallucination vs your current “retrieve-then-generate” baseline.
- For hosted-model integrity: evaluate whether a commit-before-open trace (e.g., SAE sketch + Merkle) is feasible in your serving stack; quantify overhead and decide what attacker classes you need to cover.
- For red-teaming coverage: add stylistic obfuscation transformations (AHB-style) to your single-turn safety suite; track ∆ASR under rhetorical displacement as a robustness KPI.
Generated from per-paper analyses; no external browsing.
