Daily AI Paper Report (2026-04-23)
Run stats
- Candidates: 241
- Selected: 30
- Deepread completed: 30
- Window (UTC): 2026-04-21T00:00:00Z → 2026-04-22T00:00:00Z (arxiv_announce, expanded=0)
Selected papers
| arXiv ID | Title / Links | Categories | Score | Why | Tags |
|---|---|---|---|---|---|
| 2604.19657 | An AI Agent Execution Environment to Safeguard User Data | cs.CR, cs.AI, cs.OS | 95 | Execution environment w/ user-specified permissions to prevent agent data exfiltration/prompt injection. | agent-security, privacy, sandboxing, permissions, prompt-injection, confidential-computing |
| 2604.19001 | When Safety Fails Before the Answer: Benchmarking Harmful Behavior Detection in Reasoning Chains | cs.CL | 94 | Benchmark for sentence-level harmful behaviors inside reasoning traces; enables monitoring/intervention. | AI safety, reasoning, chain-of-thought, harm detection, benchmark, monitoring, jailbreaks |
| 2604.19461 | Involuntary In-Context Learning: Exploiting Few-Shot Pattern Completion to Bypass Safety Alignment in GPT-5.4 | cs.CR | 93 | New jailbreak class (IICL) with strong ablations and high bypass rates; important for safety training limits. | jailbreaks, in-context-attacks, robustness, red-teaming, alignment-failure-modes |
| 2604.18934 | AutomationBench | cs.AI | 92 | Benchmark for cross-app API agents incl. endpoint discovery + policy adherence; strong for agent eval/safety. | agents, benchmark, tool-use, api, policy-adherence, evaluation, automation |
| 2604.18946 | Reasoning Structure Matters for Safety Alignment of Reasoning Models | cs.AI | 92 | Targets LRM safety by changing reasoning structure; claims strong gains with only 1K SFT examples. | safety-alignment, reasoning-models, post-training, SFT, jailbreak-robustness |
| 2604.19656 | Pause or Fabricate? Training Language Models for Grounded Reasoning | cs.CL | 91 | RL framework to make models pause/clarify under missing premises, reducing confident fabrication. | grounded reasoning, hallucinations, RL, interactive RL, uncertainty, reliability, alignment |
| 2604.19274 | HarDBench: A Benchmark for Draft-Based Co-Authoring Jailbreak Attacks for Safe Human-LLM Collaborative Writing | cs.CL | 90 | Benchmark for draft-based co-authoring jailbreaks in high-risk domains; realistic collaborative writing threat. | benchmarks, jailbreaks, harmful-content, human-LLM-collaboration, safety-eval |
| 2604.19018 | Local Linearity of LLMs Enables Activation Steering via Model-Based Linear Optimal Control | cs.LG, cs.AI, eess.SY, math.OC, stat.ML | 90 | Formalizes activation steering as feedback control using local linearity; could improve reliable inference-time alignment. | activation-steering, inference-time-alignment, control, interpretability, robustness |
| 2604.19638 | SafetyALFRED: Evaluating Safety-Conscious Planning of Multimodal Large Language Models | cs.AI, cs.CL, cs.RO | 88 | Embodied safety benchmark measuring hazard mitigation planning (not just recognition); exposes alignment gap. | embodied-agents, planning, hazard-mitigation, multimodal, safety-evaluation, robotics |
| 2604.18976 | STAR-Teaming: A Strategy-Response Multiplex Network Approach to Automated LLM Red Teaming | cs.CL | 88 | Automated black-box red teaming via multi-agent strategy-response network; interpretable vuln mapping. | red teaming, jailbreaks, adversarial prompts, multi-agent, evaluation, security |
| 2604.19540 | Mesh Memory Protocol: Semantic Infrastructure for Multi-Agent LLM Systems | cs.MA, cs.AI | 88 | Protocol for cross-session multi-agent memory with field-level acceptance + traceability; key for reliable agents. | multi-agent, memory, provenance, traceability, coordination, agent-infrastructure |
| 2604.19295 | TEMPO: Scaling Test-time Training for Large Reasoning Models | cs.LG | 88 | Scales test-time training for LRMs via EM-style critic recalibration; addresses reward drift and diversity collapse. | test-time-training, reasoning, RL, critic-calibration, inference-scaling |
| 2604.19533 | Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps | cs.CR, cs.AI | 86 | Agentic SecOps threat-hunting benchmark with large event logs + RL env; strong real-world evaluation setup. | cybersecurity, agents, benchmark, tool-use, SQL, evaluation |
| 2604.19049 | Refute-or-Promote: An Adversarial Stage-Gated Multi-Agent Review Methodology for High-Precision LLM-Assisted Defect Discovery | cs.CR, cs.AI, cs.SE | 86 | Stage-gated adversarial multi-agent review to cut false positives in LLM defect discovery campaigns. | software security, agents, verification, LLM reliability, triage, vulnerability discovery |
| 2604.19354 | Do Agents Dream of Root Shells? Partial-Credit Evaluation of LLM Agents in Capture The Flag Challenges | cs.AI, cs.CR, cs.SE | 85 | CTF agent benchmark in isolated VMs with partial-credit scoring + full traces; useful for capability auditing. | cybersecurity, agent-evaluation, CTF, partial-credit, traces, offensive-security |
| 2604.19083 | ProjLens: Unveiling the Role of Projectors in Multimodal Model Safety | cs.CR, cs.AI | 84 | Interpretability for MLLM backdoors focusing on projector role; clarifies mechanisms and mitigation targets. | multimodal, backdoors, data-poisoning, interpretability, model-security |
| 2604.19561 | Detecting Data Contamination in Large Language Models | cs.AI | 84 | Unified evaluation of black-box membership inference for LLM data contamination; proposes new method. | privacy, membership inference, data contamination, copyright, LLM auditing, security |
| 2604.19299 | Rethinking Scale: Deployment Trade-offs of Small Language Models under Agent Paradigms | cs.CL, cs.AI | 84 | Large-scale study of <10B models under tool/multi-agent paradigms; practical deployment trade-offs. | small-models, agents, tool-use, multi-agent, efficiency, deployment, evaluation |
| 2604.19572 | A Self-Evolving Framework for Efficient Terminal Agents via Observational Context Compression | cs.CL | 84 | Self-evolving observation compression for terminal agents; reduces long-horizon token blowup and cost. | agents, long-horizon, context-compression, terminal-agents, efficiency |
| 2604.18982 | SAVOIR: Learning Social Savoir-Faire via Shapley-based Reward Attribution | cs.AI | 83 | Shapley-based reward attribution for social dialogue RL; principled credit assignment for language agents. | rl, credit-assignment, shapley, dialogue-agents, social, alignment |
| 2604.19405 | Lost in Translation: Do LVLM Judges Generalize Across Languages? | cs.CL | 82 | Large multilingual multimodal judge benchmark; critical for reward-model generalization and eval reliability. | evaluation, judge-models, reward-models, multilingual, vision-language, benchmarks |
| 2604.19047 | RARE: Redundancy-Aware Retrieval Evaluation Framework for High-Similarity Corpora | cs.CL, cs.AI, cs.IR | 82 | RAG retrieval eval that accounts for redundancy/high-similarity corpora; more realistic benchmarks. | RAG, retrieval evaluation, benchmarks, redundancy, IR, grounding |
| 2604.19089 | Towards Scalable Lifelong Knowledge Editing with Selective Knowledge Suppression | cs.AI | 82 | Lifelong knowledge editing with selective suppression to reduce forgetting; aims at scalable sequential updates. | knowledge-editing, continual-learning, hallucinations, model-updates, reliability |
| 2604.18963 | Distillation Traps and Guards: A Calibration Knob for LLM Distillability | cs.LG, cs.AI | 81 | Analyzes distillation failure modes and proposes calibration to control distillability; relevant to leakage risk. | distillation, calibration, model-leakage, post-training, reliability |
| 2604.19565 | Detecting Hallucinations in SpeechLLMs at Inference Time Using Attention Maps | cs.CL, cs.AI, cs.LG | 80 | Inference-time hallucination detection for SpeechLLMs using attention-map features; no gold labels needed. | hallucination detection, speech LLMs, inference-time, attention, reliability, monitoring |
| 2604.19254 | ShadowPEFT: Shadow Network for Parameter-Efficient Fine-Tuning | cs.CL, cs.AI | 80 | New PEFT method (shadow module) shifting adaptation to shared layer-level refinement; could improve tuning efficiency. | peft, fine-tuning, lora-alternative, efficiency, llm-training |
| 2604.19395 | Does Self-Consistency Improve the Recall of Encyclopedic Knowledge? | cs.CL | 80 | Creates MMLU knowledge-recall split; finds self-consistency boosts encyclopedic recall, not just reasoning. | evaluation, self-consistency, knowledge-recall, MMLU, prompting |
| 2604.18970 | Mechanistic Anomaly Detection via Functional Attribution | cs.LG, cs.CR | 79 | Mechanistic anomaly detection via functional attribution/influence; aims to detect anomalous internal behavior. | anomaly-detection, mechanistic-interpretability, influence-functions, model-monitoring, security |
| 2604.19728 | VLA Foundry: A Unified Framework for Training Vision-Language-Action Models | cs.RO, cs.AI, cs.CV, cs.LG, cs.SE | 79 | Unified open framework for LLM→VLM→VLA training; releases models + closed-loop eval in simulator. | vla, robotics, vision-language-action, framework, training-pipeline, open-source |
| 2604.19092 | RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation | cs.RO, cs.AI | 79 | Benchmark for world models grounded in executable robot actions; tests physical plausibility beyond realism. | benchmarks, world-models, robotics, embodied-agents, evaluation |
AI Paper Insight Brief
2026-04-23
0) Executive takeaways (read this first)
- Agentic “real work” benchmarks are exposing a large capability gap: cross-app business automation (<10% pass on AutomationBench), SOC threat hunting (best model ~3.82% submitted-flag recall), and VM-based CTF exploitation (best ~35% checkpoint completion) all show frontier models are far from reliable autonomy in high-stakes environments.
- Safety is shifting from output-only to process/mechanism-level control: sentence-level harm detection in reasoning traces (HARMTHOUGHTS) shows big performance collapse at fine granularity, while activation steering via closed-loop control (Activation-LQR) and functional-attribution anomaly detection (BIF/SGLD correlations) offer mechanism-aware levers.
- Jailbreaks are increasingly “structural,” not obfuscation-based: draft-based co-authoring prompts (HarDBench) and Involuntary In-Context Learning (IICL) bypass safety by exploiting completion/pattern mechanisms; defenses that only scan for encoded payloads or keywords will miss these.
- Practical alignment interventions are getting cheaper and more “surgical”: ALTTRAIN changes reasoning structure with ~1K SFT examples; LightEdit performs lifelong knowledge edits without parameter updates via selective retrieval + first-token suppression; both emphasize targeted control over broad RL.
- Evaluation realism is improving in RAG and robotics: redundancy-aware retrieval evaluation (RARE/RedQA) shows multi-hop retrieval collapses in high-similarity enterprise corpora; RoboWM-Bench operationalizes “executability” of world-model rollouts by converting predicted videos into actions and executing in real-to-sim.
1) Key themes (clusters)
Theme: Agent benchmarks that measure state change, not chat quality
- Why it matters: Production automation/security work is judged by deterministic end-state changes (records updated, flags submitted, shells obtained), and current models fail badly under these criteria.
- Representative papers:
- AutomationBench
- Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps
- Do Agents Dream of Root Shells? Partial-Credit Evaluation of LLM Agents in Capture The Flag Challenges
- Common approach:
- Simulated-but-realistic environments (multi-app REST APIs; Windows event logs in SQLite; isolated attacker/target VMs).
- Hard budgets (steps/tool calls; 50 SQL queries; time/step limits) and programmatic or rubric-based scoring.
- Emphasis on partial credit (checkpoints) or end-state assertions (no LLM-as-judge for AutomationBench); a minimal grading sketch follows below.
- Open questions / failure modes:
- Agents often “declare success” without achieving state changes (AutomationBench) or observe evidence but fail to submit/attribute it (Cyber Defense Benchmark).
- Rubric dependence and summarization sensitivity for partial-credit judging (DeepRed).
- How to train agents to improve without overfitting to benchmark-specific tool surfaces and hardening tricks.
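To make the "end-state, not chat" criterion concrete, here is a minimal Python sketch of deterministic episode grading. The state-snapshot shape and `Assertion` interface are hypothetical, not AutomationBench's actual harness:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Assertion:
    description: str
    check: Callable[[dict], bool]  # receives a snapshot of environment state

def grade_episode(final_state: dict, assertions: list[Assertion],
                  agent_claimed_success: bool) -> dict:
    """All-or-nothing grading: pass only if every end-state assertion holds.

    Also flags false "success" declarations, a failure mode several of these
    benchmarks report explicitly.
    """
    results = {a.description: a.check(final_state) for a in assertions}
    passed = all(results.values())
    return {
        "passed": passed,
        "per_assertion": results,
        "false_success_claim": agent_claimed_success and not passed,
    }

# Toy task: "create invoice INV-7 and mark ticket T-3 resolved".
assertions = [
    Assertion("invoice INV-7 exists",
              lambda s: any(i["id"] == "INV-7" for i in s["invoices"])),
    Assertion("ticket T-3 resolved",
              lambda s: s["tickets"]["T-3"]["status"] == "resolved"),
]
state = {"invoices": [{"id": "INV-7"}], "tickets": {"T-3": {"status": "open"}}}
print(grade_episode(state, assertions, agent_claimed_success=True))
# passed=False and false_success_claim=True: the agent "declared success".
```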
Theme: Process-level safety: detect and intervene inside reasoning
- Why it matters: Harm can emerge in intermediate reasoning steps even when final outputs look safer; output-only safety misses escalation dynamics and blocks targeted mitigations.
- Representative papers:
- When Safety Fails Before the Answer: Benchmarking Harmful Behavior Detection in Reasoning Chains
- Reasoning Structure Matters for Safety Alignment of Reasoning Models
- Local Linearity of LLMs Enables Activation Steering via Model-Based Linear Optimal Control
- Common approach:
- Fine-grained taxonomies/labels for reasoning behavior (16 sentence-level harm-propagation labels).
- Structural changes to reasoning pipelines (PU→HA→CR) via small SFT datasets.
- Inference-time activation interventions using explicit dynamical models (Jacobian-based LTV + LQR feedback); a generic LQR sketch follows below.
- Open questions / failure modes:
- Fine-grained detection remains weak (e.g., Macro-F1 of ~0.46–0.56 on 16-way labels).
- Steering is sensitive to hyperparameters and representational limits (e.g., “benign nonrefusal” unless interventions are applied token-wise).
- Generalization beyond evaluated model families and beyond text-only settings (ALTTRAIN multimodal untested).
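For intuition on the control framing, the sketch below is a textbook finite-horizon discrete-time LQR, the primitive this line of work builds on. The `A`/`B` matrices here are toy stand-ins; in the paper's setting they would come from Jacobian linearization of the model's layer dynamics, which is not reproduced here:

```python
import numpy as np

def lqr_gains(A, B, Q, R, horizon):
    """Backward Riccati recursion; returns time-ordered gains K_t for u_t = -K_t x_t."""
    P = Q.copy()
    gains = []
    for _ in range(horizon):
        BtP = B.T @ P
        K = np.linalg.solve(R + BtP @ B, BtP @ A)  # (R + B'PB)^{-1} B'PA
        P = Q + A.T @ P @ (A - B @ K)
        gains.append(K)
    return gains[::-1]

# Toy example: drive a 4-dim "activation" toward the origin (a stand-in for
# steering toward a target direction in representation space).
rng = np.random.default_rng(0)
d = 4
A = np.eye(d) + 0.1 * rng.standard_normal((d, d))  # local linearization
B = np.eye(d)                                      # additive intervention on activations
Q, R = np.eye(d), 0.1 * np.eye(d)                  # deviation cost vs. intervention cost
x = rng.standard_normal(d)
for K in lqr_gains(A, B, Q, R, horizon=8):
    x = A @ x + B @ (-K @ x)  # closed-loop update
print(np.linalg.norm(x))  # driven close to zero
```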
Theme: Structural jailbreaks and realistic misuse surfaces
- Why it matters: Real deployments (co-authoring, pattern completion) create attack surfaces that bypass conventional moderation and refusal training.
- Representative papers:
- HarDBench: A Benchmark for Draft-Based Co-Authoring Jailbreak Attacks for Safe Human-LLM Collaborative Writing
- Involuntary In-Context Learning: Exploiting Few-Shot Pattern Completion to Bypass Safety Alignment in GPT-5.4
- STAR-Teaming: A Strategy-Response Multiplex Network Approach to Automated LLM Red Teaming
- Common approach:
- Attack framing that exploits completion instincts (incomplete harmful drafts) or ICL pattern constraints (operator + validator).
- Automated red-teaming systems that learn which strategy families map to which unsafe response clusters.
- Preference-optimization mitigation targeting safety–utility balance (SUBA via KTO/GRPO).
- Open questions / failure modes:
- Moderation misses co-authoring intent shifts (HarDBench reports a large drop in moderation's unsafe-flag rate for CoJP prompts vs. direct harmful queries); a sketch of separated miss/refusal metrics follows below.
- IICL robustness appears bimodal across models; mechanistic explanation remains speculative without white-box analysis.
- Red-teaming pipelines depend on scorer reliability and can drift over time (STAR-Teaming limitations).
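When adding these probes to a safety suite, it helps to score the input filter and the model separately. A small sketch under a hypothetical logging schema (field names are ours, not from any of these papers):

```python
from dataclasses import dataclass

@dataclass
class ProbeResult:
    attack_family: str        # e.g. "coauthoring_draft" or "iicl_pattern"
    moderation_flagged: bool  # did the input filter fire?
    model_refused: bool       # did the model refuse?
    output_unsafe: bool       # judgment on the final output

def summarize(results: list[ProbeResult]) -> dict:
    """Per-family rates, keeping moderation misses distinct from refusals so a
    filter that never fires isn't masked by a model that often refuses."""
    summary: dict[str, dict] = {}
    for fam in sorted({r.attack_family for r in results}):
        rs = [r for r in results if r.attack_family == fam]
        n = len(rs)
        summary[fam] = {
            "moderation_miss_rate": sum(not r.moderation_flagged for r in rs) / n,
            "refusal_rate": sum(r.model_refused for r in rs) / n,
            "attack_success_rate": sum(r.output_unsafe for r in rs) / n,
        }
    return summary

results = [
    ProbeResult("coauthoring_draft", False, False, True),
    ProbeResult("coauthoring_draft", False, True, False),
    ProbeResult("direct_query", True, True, False),
]
print(summarize(results))
```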
Theme: Security & privacy controls beyond “trust the model”
- Why it matters: As agents touch private data and critical systems, defenses must hold even if prompts/models are adversarial; also, model extraction/distillation is both a capability tool and an IP risk.
- Representative papers:
- An AI Agent Execution Environment to Safeguard User Data
- Mechanistic Anomaly Detection via Functional Attribution
- Distillation Traps and Guards: A Calibration Knob for LLM Distillability
- Common approach:
- Deterministic enforcement layers (information-flow control over agent-generated code; persistent permissions + disclosure logs), sketched after this list.
- Mechanism-aware detection signals that are decorrelated from activation clustering (loss-trace correlations under localized posterior sampling).
- Post-hoc teacher calibration to control distillability (η knob to make teachers more/less distillable).
- Open questions / failure modes:
- Compute overhead (SGLD sampling for MAD; RL fine-tuning for distillability calibration).
- Trusted artifacts and UX burden (GAAP tool annotations; many permission prompts).
- Dual-use: undistillable teachers and backdoor detection methods can inform attackers as well as defenders.
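As a sketch of the deterministic-enforcement idea, the class below pairs a persistent permission store with an append-only disclosure log, so whether data may leave the sandbox never depends on model output. Class, method, and file-format choices are illustrative assumptions, not GAAP's API:

```python
import json
import time
from pathlib import Path

class DisclosureGate:
    """Deterministic gate consulted before any labeled data leaves the sandbox."""

    def __init__(self, perms_path: Path, log_path: Path):
        self.log_path = log_path
        # Persistent permissions: data label -> destinations the user granted.
        self.perms = (json.loads(perms_path.read_text())
                      if perms_path.exists() else {})

    def allow(self, data_label: str, destination: str) -> bool:
        # Pure policy lookup; the model's opinion is never consulted.
        return destination in self.perms.get(data_label, [])

    def disclose(self, data_label: str, destination: str, payload: str) -> bool:
        ok = self.allow(data_label, destination)
        entry = {"ts": time.time(), "label": data_label,
                 "dest": destination, "allowed": ok, "bytes": len(payload)}
        with self.log_path.open("a") as f:  # append-only audit trail
            f.write(json.dumps(entry) + "\n")
        return ok  # the caller transmits payload only when True
```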
Theme: Evaluation realism for retrieval and embodied world models
- Why it matters: Enterprise RAG corpora are redundant/high-similarity, and robotics world models must be executable—not just visually plausible.
- Representative papers:
- RARE: Redundancy-Aware Retrieval Evaluation Framework for High-Similarity Corpora
- RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation
- SafetyALFRED: Evaluating Safety-Conscious Planning of Multimodal Large Language Models
- Common approach:
- New benchmarks that isolate the missing axis (redundancy-aware evidence sets; executability via action extraction; QA-to-embodied mitigation gap); a redundancy-aware scoring sketch follows below.
- Multi-stage pipelines with explicit filtering/validation (CRRF for stable LLM judging; hierarchical step checkers for executability).
- Diagnostics that separate perception vs planning (metadata-augmented observations in SafetyALFRED).
- Open questions / failure modes:
- Retrieval collapses sharply with hop depth in high-overlap corpora (PerfRecall@10 at 4-hop drops to single digits in some domains).
- World-model rollouts can look good but fail under execution; fine-tuning helps but doesn’t fix contact/spatial inconsistencies.
- Embodied safety: strong hazard recognition does not translate to mitigation; multi-agent decoupling helps only partially.
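The redundancy-aware scoring idea can be reduced to fact-coverage recall: any retrieved passage that covers a required atomic fact earns credit, so a valid duplicate is not penalized. The coverage map is assumed to come from offline annotation; this captures the spirit of RARE-style evaluation, not its exact metric:

```python
def fact_recall_at_k(retrieved_ids: list[str],
                     required_facts: set[str],
                     fact_coverage: dict[str, set[str]],  # passage id -> facts covered
                     k: int) -> float:
    """Fraction of required atomic facts covered by the top-k retrieved passages."""
    covered: set[str] = set()
    for pid in retrieved_ids[:k]:
        covered |= fact_coverage.get(pid, set()) & required_facts
    return len(covered) / len(required_facts)

# p1 and p2 both paraphrase fact f1; retrieving either earns full credit,
# unlike single-canonical-passage labeling.
coverage = {"p1": {"f1"}, "p2": {"f1"}, "p3": {"f2"}}
print(fact_recall_at_k(["p2", "p3"], {"f1", "f2"}, coverage, k=10))  # 1.0
```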
2) Technical synthesis
- Deterministic, end-state grading (AutomationBench) and partial-credit checkpointing (DeepRed) are converging on a shared goal: measure agent progress without subjective LLM judging, or constrain LLM judging to rubric application.
- Multiple papers highlight capability fragmentation: different frontier models solve disjoint subsets of automation tasks (low Jaccard overlap), suggesting ensembles or routing could outperform single models even before training improvements; a quick overlap check is sketched after this list.
- Safety evaluation is moving “earlier in the pipeline”: HARMTHOUGHTS shows detectors that work for binary harmfulness degrade sharply for fine-grained behaviors, motivating sequence/context-aware detectors rather than sentence-independent classifiers.
- Two complementary mechanism tools emerge: activation-space control (Activation-LQR’s Jacobian/LQR closed-loop steering) and parameter-space attribution (BIF/SGLD loss-trace correlations) for detecting anomalous mechanisms like backdoors.
- Structural alignment interventions appear effective with low data: ALTTRAIN’s reasoning-structure SFT on ~1K examples reduces harmful responses while preserving capabilities, with ablations indicating HA is critical for safety.
- Jailbreak research is emphasizing prompt-structure vulnerabilities (IICL operator framing; co-authoring drafts) that bypass content filters; this aligns with the need for structure-aware defenses rather than keyword/payload detection.
- Retrieval evaluation is being redesigned for enterprise reality: RARE’s atomic-fact redundancy tracking and redundancy-aware gold sets show that “single canonical passage” labeling can mis-score valid retrieval.
- Test-time adaptation is becoming more principled: TEMPO frames TTT as EM with periodic critic recalibration to prevent reward drift and diversity collapse, showing sustained gains with more test-time iterations.
- Practical deployment work is quantifying system trade-offs: SLM agent paradigms show SAS improves normalized quality but reduces completion rate; MAS adds coordination failures and token overhead.
- Several works emphasize persistent state and policy as core infrastructure: GAAP’s disclosure log/permissions DB and Mesh Memory Protocol’s write-time remix + lineage both treat persistence as a first-class safety/reliability primitive.
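The fragmentation claim is cheap to quantify on your own runs; a sketch with made-up solved-task sets:

```python
from itertools import combinations

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 1.0

solved = {  # hypothetical per-model solved-task IDs
    "model_a": {1, 2, 3, 8},
    "model_b": {3, 4, 5},
    "model_c": {5, 6},
}
for (m1, s1), (m2, s2) in combinations(solved.items(), 2):
    print(f"{m1} vs {m2}: Jaccard = {jaccard(s1, s2):.2f}")

# Upper bound on what a perfect router/ensemble over these models could solve.
print("oracle-router solves:", len(set().union(*solved.values())), "tasks")
```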
3) Top 5 papers (with “why now”)
1) AutomationBench
- Introduces a cross-application automation benchmark requiring API discovery + policy adherence + deterministic state changes across ~47 apps and ~500 endpoints.
- Shows frontier models are <10% on private tasks, with distinct solved subsets across models (low overlap), indicating headroom and potential for routing/ensembles.
- Useful now because it matches how businesses evaluate automation: end-state correctness, not conversational plausibility.
- Skepticism / limitation: simulated APIs and synthetic tasks may diverge from production behavior; ongoing auditing/versioning needed.
2) Mechanistic Anomaly Detection via Functional Attribution
- Reframes anomaly/backdoor detection as functional attribution from trusted samples using Bayesian influence functions (SGLD loss-trace correlations).
- Reports strong results on BackdoorBench and near-perfect AUROC in several LLM backdoor settings, including robustness to activation obfuscation.
- Useful now as a decorrelated signal to activation-space detectors, addressing a known evasion route.
- Skepticism / limitation: computationally expensive (many SGLD draws) and requires a trusted reference set.
3) Reasoning Structure Matters for Safety Alignment of Reasoning Models
- Proposes ALTTRAIN: change reasoning from PU→SR to PU→HA→CR via SFT on ~1K structured examples (no RL).
- Reports substantial harmfulness reduction with minimal capability impact; ablations show HA is key and scaling data reduces over-refusal.
- Useful now as a low-cost alignment knob for reasoning models that tend to “solve even when harmful.”
- Skepticism / limitation: multimodal generalization untested; relies on HA sentences generated by an LLM and sampled from existing red-team data.
4) HarDBench: Draft-Based Co-Authoring Jailbreak Attacks
- Defines and benchmarks a realistic misuse mode: incomplete harmful drafts framed as editing requests that induce detailed harmful completions.
- Shows high ASR under co-authoring framing (e.g., GPT-4o reported ASR 96.75% under CoJP) and that moderation misses intent shifts.
- Provides SUBA (KTO/GRPO) that reduces ASR dramatically while largely preserving long-form writing utility.
- Skepticism / limitation: limited to four domains and fixed templates; multi-turn adaptive attacks not covered.
5) TEMPO: Scaling Test-time Training for Large Reasoning Models
- Addresses TTT reward drift by alternating critic recalibration on labeled data with policy refinement on unlabeled test questions (EM framing); the alternation is schematized below.
- Reports large gains on AIME 2024 (e.g., OLMO3-7B avg@16 33.0%→51.1%; Qwen3-14B 42.3%→65.8%) and preserved diversity where baselines collapse.
- Useful now because it turns extra inference-time compute into continued improvement, not plateauing.
- Skepticism / limitation: requires labeled calibration data and actor+critic compute/memory; domain coverage is mostly reasoning/math.
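A schematic of the alternation TEMPO is described as using, per the summary above. The objects are duck-typed placeholders and the bodies are not the paper's code; this shows only the loop structure:

```python
def test_time_training(policy, critic, calibration_set, test_questions,
                       rounds: int, refine_steps: int):
    """EM-style alternation: recalibrate the critic, then refine the policy.

    `policy` needs .sample(q) and .update(qs, samples, rewards);
    `critic` needs .fit(labeled_data, policy) and .score(q, sample).
    All are hypothetical interfaces for illustration.
    """
    for _ in range(rounds):
        # E-step analogue: refit the critic on labeled data so its reward
        # estimates don't drift as the policy moves.
        critic.fit(calibration_set, policy)
        # M-step analogue: improve the policy against the fixed critic
        # using unlabeled test questions.
        for _ in range(refine_steps):
            samples = [policy.sample(q) for q in test_questions]
            rewards = [critic.score(q, s) for q, s in zip(test_questions, samples)]
            policy.update(test_questions, samples, rewards)
    return policy
```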
4) Practical next steps
- Adopt end-state evaluation for internal agent work: replicate AutomationBench-style deterministic assertions (no partial credit) for your own tool/API workflows; track false “success” declarations explicitly.
- Instrument process-level safety: log and classify intermediate reasoning steps (HARMTHOUGHTS-style) and measure where harm emerges; don’t rely on final-output labels alone.
- Red-team with structural attacks: add co-authoring draft prompts (HarDBench) and operator/validator ICL prompts (IICL) to your safety suite; measure moderation miss rates separately from model refusal.
- Try low-cost structural alignment: prototype ALTTRAIN-like PU→HA→CR formatting with small SFT sets; evaluate over-refusal and multi-turn escalation robustness.
- Combine mechanism signals: ensemble activation-space steering/detection (e.g., behavior vectors, Activation-LQR) with functional-attribution anomaly detection (BIF correlations) to reduce correlated blind spots.
- For privacy-sensitive agents, enforce determinism outside the model: evaluate GAAP-style IFC/taint tracking with persistent permissions + disclosure logs for any workflow touching secrets; measure user prompt burden (permission prompts) as a first-class metric.
- If you deploy RAG in enterprise corpora: test retrieval under redundancy/high similarity (RARE/RedQA style) and report hop-depth curves; avoid single-canonical-passage labeling when redundancy is high.
- If exploring test-time adaptation: implement TEMPO’s periodic critic recalibration and monitor diversity collapse (pass@K, entropy) as a guardrail; a standard pass@k estimator is sketched below.
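For the pass@K guardrail, the standard unbiased estimator (from the Codex evaluation methodology) is easy to drop into any harness: with n samples per problem and c correct, pass@k = 1 - C(n-c, k) / C(n, k):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k from n samples with c correct."""
    if n - c < k:  # every size-k draw must contain at least one success
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Diversity collapse shows up as pass@1 rising while pass@16 stalls or falls.
print(pass_at_k(n=16, c=4, k=1))   # 0.25
print(pass_at_k(n=16, c=4, k=16))  # 1.0
```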
Generated from per-paper analyses; no external browsing.
