Daily AI Paper Report (2026-04-23)

Published:

Chinese version: [Chinese]

Run stats

  • Candidates: 241
  • Selected: 30
  • Deepread completed: 30
  • Window (UTC): 2026-04-21T00:00:00Z → 2026-04-22T00:00:00Z (arxiv_announce, expanded=0)
Selected papers
Each entry lists arXiv ID | title | categories | score, then Why and Tags.

  • 2604.19657 | An AI Agent Execution Environment to Safeguard User Data | cs.CR, cs.AI, cs.OS | 95
    Why: Execution environment w/ user-specified permissions to prevent agent data exfiltration/prompt injection.
    Tags: agent-security, privacy, sandboxing, permissions, prompt-injection, confidential-computing
  • 2604.19001 | When Safety Fails Before the Answer: Benchmarking Harmful Behavior Detection in Reasoning Chains | cs.CL | 94
    Why: Benchmark for sentence-level harmful behaviors inside reasoning traces; enables monitoring/intervention.
    Tags: AI safety, reasoning, chain-of-thought, harm detection, benchmark, monitoring, jailbreaks
  • 2604.19461 | Involuntary In-Context Learning: Exploiting Few-Shot Pattern Completion to Bypass Safety Alignment in GPT-5.4 | cs.CR | 93
    Why: New jailbreak class (IICL) with strong ablations and high bypass rates; important for safety training limits.
    Tags: jailbreaks, in-context-attacks, robustness, red-teaming, alignment-failure-modes
  • 2604.18934 | AutomationBench | cs.AI | 92
    Why: Benchmark for cross-app API agents incl. endpoint discovery + policy adherence; strong for agent eval/safety.
    Tags: agents, benchmark, tool-use, api, policy-adherence, evaluation, automation
  • 2604.18946 | Reasoning Structure Matters for Safety Alignment of Reasoning Models | cs.AI | 92
    Why: Targets LRM safety by changing reasoning structure; claims strong gains with only 1K SFT examples.
    Tags: safety-alignment, reasoning-models, post-training, SFT, jailbreak-robustness
  • 2604.19656 | Pause or Fabricate? Training Language Models for Grounded Reasoning | cs.CL | 91
    Why: RL framework to make models pause/clarify under missing premises, reducing confident fabrication.
    Tags: grounded reasoning, hallucinations, RL, interactive RL, uncertainty, reliability, alignment
  • 2604.19274 | HarDBench: A Benchmark for Draft-Based Co-Authoring Jailbreak Attacks for Safe Human-LLM Collaborative Writing | cs.CL | 90
    Why: Benchmark for draft-based co-authoring jailbreaks in high-risk domains; realistic collaborative writing threat.
    Tags: benchmarks, jailbreaks, harmful-content, human-LLM-collaboration, safety-eval
  • 2604.19018 | Local Linearity of LLMs Enables Activation Steering via Model-Based Linear Optimal Control | cs.LG, cs.AI, eess.SY, math.OC, stat.ML | 90
    Why: Formalizes activation steering as feedback control using local linearity; could improve reliable inference-time alignment.
    Tags: activation-steering, inference-time-alignment, control, interpretability, robustness
  • 2604.19638 | SafetyALFRED: Evaluating Safety-Conscious Planning of Multimodal Large Language Models | cs.AI, cs.CL, cs.RO | 88
    Why: Embodied safety benchmark measuring hazard mitigation planning (not just recognition); exposes alignment gap.
    Tags: embodied-agents, planning, hazard-mitigation, multimodal, safety-evaluation, robotics
  • 2604.18976 | STAR-Teaming: A Strategy-Response Multiplex Network Approach to Automated LLM Red Teaming | cs.CL | 88
    Why: Automated black-box red teaming via multi-agent strategy-response network; interpretable vuln mapping.
    Tags: red teaming, jailbreaks, adversarial prompts, multi-agent, evaluation, security
  • 2604.19540 | Mesh Memory Protocol: Semantic Infrastructure for Multi-Agent LLM Systems | cs.MA, cs.AI | 88
    Why: Protocol for cross-session multi-agent memory with field-level acceptance + traceability; key for reliable agents.
    Tags: multi-agent, memory, provenance, traceability, coordination, agent-infrastructure
  • 2604.19295 | TEMPO: Scaling Test-time Training for Large Reasoning Models | cs.LG | 88
    Why: Scales test-time training for LRMs via EM-style critic recalibration; addresses reward drift and diversity collapse.
    Tags: test-time-training, reasoning, RL, critic-calibration, inference-scaling
  • 2604.19533 | Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps | cs.CR, cs.AI | 86
    Why: Agentic SecOps threat-hunting benchmark with large event logs + RL env; strong real-world evaluation setup.
    Tags: cybersecurity, agents, benchmark, tool-use, SQL, evaluation
  • 2604.19049 | Refute-or-Promote: An Adversarial Stage-Gated Multi-Agent Review Methodology for High-Precision LLM-Assisted Defect Discovery | cs.CR, cs.AI, cs.SE | 86
    Why: Stage-gated adversarial multi-agent review to cut false positives in LLM defect discovery campaigns.
    Tags: software security, agents, verification, LLM reliability, triage, vulnerability discovery
  • 2604.19354 | Do Agents Dream of Root Shells? Partial-Credit Evaluation of LLM Agents in Capture The Flag Challenges | cs.AI, cs.CR, cs.SE | 85
    Why: CTF agent benchmark in isolated VMs with partial-credit scoring + full traces; useful for capability auditing.
    Tags: cybersecurity, agent-evaluation, CTF, partial-credit, traces, offensive-security
  • 2604.19083 | ProjLens: Unveiling the Role of Projectors in Multimodal Model Safety | cs.CR, cs.AI | 84
    Why: Interpretability for MLLM backdoors focusing on projector role; clarifies mechanisms and mitigation targets.
    Tags: multimodal, backdoors, data-poisoning, interpretability, model-security
  • 2604.19561 | Detecting Data Contamination in Large Language Models | cs.AI | 84
    Why: Unified evaluation of black-box membership inference for LLM data contamination; proposes new method.
    Tags: privacy, membership inference, data contamination, copyright, LLM auditing, security
  • 2604.19299 | Rethinking Scale: Deployment Trade-offs of Small Language Models under Agent Paradigms | cs.CL, cs.AI | 84
    Why: Large-scale study of <10B models under tool/multi-agent paradigms; practical deployment trade-offs.
    Tags: small-models, agents, tool-use, multi-agent, efficiency, deployment, evaluation
  • 2604.19572 | A Self-Evolving Framework for Efficient Terminal Agents via Observational Context Compression | cs.CL | 84
    Why: Self-evolving observation compression for terminal agents; reduces long-horizon token blowup and cost.
    Tags: agents, long-horizon, context-compression, terminal-agents, efficiency
  • 2604.18982 | SAVOIR: Learning Social Savoir-Faire via Shapley-based Reward Attribution | cs.AI | 83
    Why: Shapley-based reward attribution for social dialogue RL; principled credit assignment for language agents.
    Tags: rl, credit-assignment, shapley, dialogue-agents, social, alignment
  • 2604.19405 | Lost in Translation: Do LVLM Judges Generalize Across Languages? | cs.CL | 82
    Why: Large multilingual multimodal judge benchmark; critical for reward-model generalization and eval reliability.
    Tags: evaluation, judge-models, reward-models, multilingual, vision-language, benchmarks
  • 2604.19047 | RARE: Redundancy-Aware Retrieval Evaluation Framework for High-Similarity Corpora | cs.CL, cs.AI, cs.IR | 82
    Why: RAG retrieval eval that accounts for redundancy/high-similarity corpora; more realistic benchmarks.
    Tags: RAG, retrieval evaluation, benchmarks, redundancy, IR, grounding
  • 2604.19089 | Towards Scalable Lifelong Knowledge Editing with Selective Knowledge Suppression | cs.AI | 82
    Why: Lifelong knowledge editing with selective suppression to reduce forgetting; aims at scalable sequential updates.
    Tags: knowledge-editing, continual-learning, hallucinations, model-updates, reliability
  • 2604.18963 | Distillation Traps and Guards: A Calibration Knob for LLM Distillability | cs.LG, cs.AI | 81
    Why: Analyzes distillation failure modes and proposes calibration to control distillability; relevant to leakage risk.
    Tags: distillation, calibration, model-leakage, post-training, reliability
  • 2604.19565 | Detecting Hallucinations in SpeechLLMs at Inference Time Using Attention Maps | cs.CL, cs.AI, cs.LG | 80
    Why: Inference-time hallucination detection for SpeechLLMs using attention-map features; no gold needed.
    Tags: hallucination detection, speech LLMs, inference-time, attention, reliability, monitoring
  • 2604.19254 | ShadowPEFT: Shadow Network for Parameter-Efficient Fine-Tuning | cs.CL, cs.AI | 80
    Why: New PEFT method (shadow module) shifting adaptation to shared layer-level refinement; could improve tuning efficiency.
    Tags: peft, fine-tuning, lora-alternative, efficiency, llm-training
  • 2604.19395 | Does Self-Consistency Improve the Recall of Encyclopedic Knowledge? | cs.CL | 80
    Why: Creates MMLU knowledge-recall split; finds self-consistency boosts encyclopedic recall, not just reasoning.
    Tags: evaluation, self-consistency, knowledge-recall, MMLU, prompting
  • 2604.18970 | Mechanistic Anomaly Detection via Functional Attribution | cs.LG, cs.CR | 79
    Why: Mechanistic anomaly detection via functional attribution/influence; aims to detect anomalous internal behavior.
    Tags: anomaly-detection, mechanistic-interpretability, influence-functions, model-monitoring, security
  • 2604.19728 | VLA Foundry: A Unified Framework for Training Vision-Language-Action Models | cs.RO, cs.AI, cs.CV, cs.LG, cs.SE | 79
    Why: Unified open framework for LLM→VLM→VLA training; releases models + closed-loop eval in simulator.
    Tags: vla, robotics, vision-language-action, framework, training-pipeline, open-source
  • 2604.19092 | RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation | cs.RO, cs.AI | 79
    Why: Benchmark for world models grounded in executable robot actions; tests physical plausibility beyond realism.
    Tags: benchmarks, world-models, robotics, embodied-agents, evaluation

AI Paper Insight Brief

2026-04-23

0) Executive takeaways (read this first)

  • Agentic “real work” benchmarks are exposing a large capability gap: cross-app business automation (<10% pass on AutomationBench), SOC threat hunting (best model ~3.82% submitted-flag recall), and VM-based CTF exploitation (best ~35% checkpoint completion) all show frontier models are far from reliable autonomy in high-stakes environments.
  • Safety is shifting from output-only to process/mechanism-level control: sentence-level harm detection in reasoning traces (HARMTHOUGHTS) shows big performance collapse at fine granularity, while activation steering via closed-loop control (Activation-LQR) and functional-attribution anomaly detection (BIF/SGLD correlations) offer mechanism-aware levers.
  • Jailbreaks are increasingly “structural,” not obfuscation-based: draft-based co-authoring prompts (HarDBench) and Involuntary In-Context Learning (IICL) bypass safety by exploiting completion/pattern mechanisms; defenses that only scan for encoded payloads or keywords will miss these.
  • Practical alignment interventions are getting cheaper and more “surgical”: ALTTRAIN changes reasoning structure with ~1K SFT examples; LightEdit performs lifelong knowledge edits without parameter updates via selective retrieval + first-token suppression; both emphasize targeted control over broad RL.
  • Evaluation realism is improving in RAG and robotics: redundancy-aware retrieval evaluation (RARE/RedQA) shows multi-hop retrieval collapses in high-similarity enterprise corpora; RoboWM-Bench operationalizes “executability” of world-model rollouts by converting predicted videos into actions and executing in real-to-sim.

2) Key themes (clusters)

Theme: Agent benchmarks that measure state change, not chat quality

  • Why it matters: Production automation/security work is judged by deterministic end-state changes (records updated, flags submitted, shells obtained), and current models fail badly under these criteria.
  • Representative papers: AutomationBench (2604.18934), Cyber Defense Benchmark (2604.19533), Do Agents Dream of Root Shells? (2604.19354).
  • Common approach:
    • Simulated-but-realistic environments (multi-app REST APIs; Windows event logs in SQLite; isolated attacker/target VMs).
    • Hard budgets (steps/tool calls; 50 SQL queries; time/step limits) and programmatic or rubric-based scoring.
    • Emphasis on partial credit (checkpoints) or end-state assertions (no LLM-as-judge for AutomationBench).
  • Open questions / failure modes:
    • Agents often “declare success” without achieving state changes (AutomationBench) or observe evidence but fail to submit/attribute it (Cyber Defense Benchmark).
    • Rubric dependence and summarization sensitivity for partial-credit judging (DeepRed).
    • How to train agents to improve without overfitting to benchmark-specific tool surfaces and hardening tricks.
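The grading conventions above (end-state assertions, partial-credit checkpoints, no LLM judge) can be sketched generically. Everything in this sketch is illustrative: the `Checkpoint` class, the toy CRM state, and the checkpoint names are made up here, not taken from any of the benchmarks:

```python
# Sketch: deterministic end-state grading with optional partial credit.
# The checkpoint names and the fake CRM state are illustrative only.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Checkpoint:
    name: str
    weight: float
    passed: Callable[[dict], bool]  # assertion over the final environment state

def grade(final_state: dict, checkpoints: list[Checkpoint],
          partial_credit: bool = True) -> float:
    """Score an agent run purely from end-state assertions (no LLM judging)."""
    hits = [cp.weight for cp in checkpoints if cp.passed(final_state)]
    if partial_credit:
        return sum(hits) / sum(cp.weight for cp in checkpoints)
    # All-or-nothing: the AutomationBench-style criterion.
    return 1.0 if len(hits) == len(checkpoints) else 0.0

# Toy end state after an agent run against a simulated CRM API.
state = {"record_42": {"status": "closed", "owner": "alice"},
         "audit_log": ["update record_42"]}

checkpoints = [
    Checkpoint("record closed", 2.0, lambda s: s["record_42"]["status"] == "closed"),
    Checkpoint("owner reassigned", 1.0, lambda s: s["record_42"]["owner"] == "bob"),
]

print(grade(state, checkpoints))                        # partial credit: 2/3
print(grade(state, checkpoints, partial_credit=False))  # strict end-state: 0.0
```

Grading from the final state rather than the transcript also makes the false-"success" failure mode directly measurable: compare the agent's claimed outcome against `grade(...)`.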

Theme: Process-level safety: detect and intervene inside reasoning

Theme: Structural jailbreaks and realistic misuse surfaces

Theme: Security & privacy controls beyond “trust the model”

  • Why it matters: As agents touch private data and critical systems, defenses must hold even if prompts/models are adversarial; also, model extraction/distillation is both a capability tool and an IP risk.
  • Representative papers: An AI Agent Execution Environment to Safeguard User Data (2604.19657), Mechanistic Anomaly Detection via Functional Attribution (2604.18970), Distillation Traps and Guards (2604.18963).
  • Common approach:
    • Deterministic enforcement layers (information-flow control over agent-generated code; persistent permissions + disclosure logs).
    • Mechanism-aware detection signals that are decorrelated from activation clustering (loss-trace correlations under localized posterior sampling).
    • Post-hoc teacher calibration to control distillability (η knob to make teachers more/less distillable).
  • Open questions / failure modes:
    • Compute overhead (SGLD sampling for MAD; RL fine-tuning for distillability calibration).
    • Trusted artifacts and UX burden (GAAP tool annotations; many permission prompts).
    • Dual-use: undistillable teachers and backdoor detection methods can inform attackers as well as defenders.
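The "deterministic enforcement layers" pattern is worth seeing in miniature: taint values derived from private data and let a layer outside the model decide, from persistent permissions, whether a tainted value may reach a sink, logging every disclosure. The tool names and permission model below are hypothetical stand-ins, not GAAP's actual design:

```python
# Sketch: deterministic information-flow control over agent tool calls.
# Tool names and the (source, sink) permission model are illustrative.

class Tainted(str):
    """A value derived from private data. A real IFC system propagates taint
    through slicing/concat; here it is tracked only at the object level."""

PERMISSIONS: set[tuple[str, str]] = set()   # user-granted (taint_source, sink_tool)
DISCLOSURE_LOG: list[tuple[str, str]] = []  # persistent audit trail

def read_secret(name: str) -> Tainted:
    value = Tainted(f"<secret:{name}>")
    value.source = name
    return value

def call_tool(tool: str, arg: str) -> None:
    """Enforcement happens outside the model: a tainted argument reaches an
    exfiltration-capable sink only if (source, tool) was granted."""
    if isinstance(arg, Tainted):
        grant = (arg.source, tool)
        if grant not in PERMISSIONS:
            raise PermissionError(f"blocked: {arg.source!r} -> {tool!r}")
        DISCLOSURE_LOG.append(grant)
    print(f"{tool} executed")

token = read_secret("api_token")
PERMISSIONS.add(("api_token", "internal_vault.store"))

call_tool("internal_vault.store", token)   # allowed, and logged
try:
    call_tool("web.post", token)           # prompt-injected exfiltration attempt
except PermissionError as e:
    print(e)
```

The key property: the check holds even if the model is fully compromised, because the policy lives in the runtime, not the prompt. The UX cost shows up as the size of `PERMISSIONS` the user must curate.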

Theme: Evaluation realism for retrieval and embodied world models

3) Technical synthesis

  • Deterministic, end-state grading (AutomationBench) and partial-credit checkpointing (DeepRed) are converging on a shared goal: measure agent progress without subjective LLM judging, or constrain LLM judging to rubric application.
  • Multiple papers highlight capability fragmentation: different frontier models solve disjoint subsets of automation tasks (low Jaccard overlap), suggesting ensembles or routing could outperform single models even before training improvements.
  • Safety evaluation is moving “earlier in the pipeline”: HARMTHOUGHTS shows detectors that work for binary harmfulness degrade sharply for fine-grained behaviors, motivating sequence/context-aware detectors rather than sentence-independent classifiers.
  • Two complementary mechanism tools emerge: activation-space control (Activation-LQR’s Jacobian/LQR closed-loop steering) and parameter-space attribution (BIF/SGLD loss-trace correlations) for detecting anomalous mechanisms like backdoors.
  • Structural alignment interventions appear effective with low data: ALTTRAIN’s reasoning-structure SFT on ~1K examples reduces harmful responses while preserving capabilities, with ablations indicating HA is critical for safety.
  • Jailbreak research is emphasizing prompt-structure vulnerabilities (IICL operator framing; co-authoring drafts) that bypass content filters; this aligns with the need for structure-aware defenses rather than keyword/payload detection.
  • Retrieval evaluation is being redesigned for enterprise reality: RARE’s atomic-fact redundancy tracking and redundancy-aware gold sets show that “single canonical passage” labeling can mis-score valid retrieval.
  • Test-time adaptation is becoming more principled: TEMPO frames TTT as EM with periodic critic recalibration to prevent reward drift and diversity collapse, showing sustained gains with more test-time iterations.
  • Practical deployment work is quantifying system trade-offs: SLM agent paradigms show SAS improves normalized quality but reduces completion rate; MAS adds coordination failures and token overhead.
  • Several works emphasize persistent state and policy as core infrastructure: GAAP’s disclosure log/permissions DB and Mesh Memory Protocol’s write-time remix + lineage both treat persistence as a first-class safety/reliability primitive.
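One synthesis point, activation steering as closed-loop control, can be illustrated on a toy linear system: under local linearity, treat the activation update as h_{t+1} = A h_t + B u_t and compute an LQR feedback gain that drives h toward a target. The matrices below are synthetic stand-ins, not the paper's Jacobians or implementation:

```python
# Toy sketch: activation steering as discrete-time LQR (synthetic dynamics).
import numpy as np

def lqr_gain(A, B, Q, R, iters=200):
    """Solve the discrete-time Riccati equation by fixed-point iteration."""
    P = Q.copy()
    for _ in range(iters):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
    return K

rng = np.random.default_rng(0)
d = 4
A = 0.9 * np.eye(d) + 0.05 * rng.standard_normal((d, d))  # local linearization
B = np.eye(d)                                             # steering enters additively
K = lqr_gain(A, B, Q=np.eye(d), R=0.1 * np.eye(d))

h = rng.standard_normal(d)      # current activation
h_star = np.zeros(d)            # target (e.g. projection onto a "safe" direction)
for _ in range(20):
    u = -K @ (h - h_star)       # feedback steering vector, recomputed each step
    h = A @ h + B @ u
print(np.linalg.norm(h - h_star))  # driven close to the target
```

The contrast with standard activation steering is the feedback term: a fixed steering vector is open-loop, while `u = -K(h - h_star)` adapts to where the activation actually is at each step.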

4) Top 5 papers (with “why now”)

1) AutomationBench

  • Introduces a cross-application automation benchmark requiring API discovery + policy adherence + deterministic state changes across ~47 apps and ~500 endpoints.
  • Shows frontier models are <10% on private tasks, with distinct solved subsets across models (low overlap), indicating headroom and potential for routing/ensembles.
  • Useful now because it matches how businesses evaluate automation: end-state correctness, not conversational plausibility.
  • Skepticism / limitation: simulated APIs and synthetic tasks may diverge from production behavior; ongoing auditing/versioning needed.
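The "low overlap" observation suggests a simple routing/ensemble estimate: even before any training, the union of tasks solved by several models upper-bounds what a perfect router could achieve. A minimal sketch with made-up solved sets (the Jaccard computation is generic; none of these numbers come from the paper):

```python
# Sketch: quantify capability fragmentation across models via Jaccard overlap,
# and the oracle-routing upper bound (union of solved tasks). Toy data only.

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

solved = {                      # hypothetical per-model solved task IDs
    "model_a": {1, 2, 5, 9},
    "model_b": {2, 6, 7},
    "model_c": {3, 6, 10, 11},
}

names = sorted(solved)
for i, m in enumerate(names):
    for n in names[i + 1:]:
        print(m, n, round(jaccard(solved[m], solved[n]), 3))

union = set().union(*solved.values())
best_single = max(len(s) for s in solved.values())
print(f"best single model: {best_single}, oracle router: {len(union)}")
```

Low pairwise Jaccard with a large union is exactly the regime where routing pays off; a real analysis would also need a router that predicts which model to use without oracle knowledge.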

2) Mechanistic Anomaly Detection via Functional Attribution

  • Reframes anomaly/backdoor detection as functional attribution from trusted samples using Bayesian influence functions (SGLD loss-trace correlations).
  • Reports strong results on BackdoorBench and near-perfect AUROC in several LLM backdoor settings, including robustness to activation obfuscation.
  • Useful now as a decorrelated signal to activation-space detectors, addressing a known evasion route.
  • Skepticism / limitation: computationally expensive (many SGLD draws) and requires a trusted reference set.
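The loss-trace idea can be miniaturized: draw perturbed parameters (standing in here for SGLD posterior samples), record each sample's loss across draws, and score a query by its maximum correlation with trusted samples' traces. This is a scalar toy of the intuition, not the paper's method:

```python
# Toy sketch of functional-attribution anomaly detection: correlate a query's
# loss trace across parameter draws with traces of trusted samples. Gaussian
# perturbations stand in for SGLD sampling of a real model posterior.
import numpy as np

rng = np.random.default_rng(1)

def loss_trace(x, draws):
    """Per-draw loss of sample x under each perturbed parameter w."""
    return np.array([(x - w) ** 2 for w in draws])

draws = 1.0 + 0.3 * rng.standard_normal(64)        # stand-in posterior draws
trusted = [rng.normal(1.0, 0.1) for _ in range(8)]  # behave like training data
trusted_traces = np.stack([loss_trace(x, draws) for x in trusted])

def anomaly_score(x) -> float:
    """Low max-correlation with trusted loss traces => anomalous mechanism."""
    t = loss_trace(x, draws)
    corrs = [np.corrcoef(t, tt)[0, 1] for tt in trusted_traces]
    return 1.0 - max(corrs)

print(anomaly_score(1.0))   # in-distribution: small score
print(anomaly_score(-3.0))  # off-mechanism sample: larger score
```

The decorrelation argument carries over: this signal lives in how the loss moves with the parameters, not in activation geometry, so activation-space obfuscation does not directly defeat it.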

3) Reasoning Structure Matters for Safety Alignment of Reasoning Models

  • Proposes ALTTRAIN: change reasoning from PU→SR to PU→HA→CR via SFT on ~1K structured examples (no RL).
  • Reports substantial harmfulness reduction with minimal capability impact; ablations show HA is key and scaling data reduces over-refusal.
  • Useful now as a low-cost alignment knob for reasoning models that tend to “solve even when harmful.”
  • Skepticism / limitation: multimodal generalization untested; relies on HA sentences generated by an LLM and sampled from existing red-team data.

4) HarDBench: Draft-Based Co-Authoring Jailbreak Attacks

  • Defines and benchmarks a realistic misuse mode: incomplete harmful drafts framed as editing requests that induce detailed harmful completions.
  • Shows high ASR under co-authoring framing (e.g., GPT-4o reported ASR 96.75% under CoJP) and that moderation misses intent shifts.
  • Provides SUBA (KTO/GRPO) that reduces ASR dramatically while largely preserving long-form writing utility.
  • Skepticism / limitation: limited to four domains and fixed templates; multi-turn adaptive attacks not covered.

5) TEMPO: Scaling Test-time Training for Large Reasoning Models

  • Addresses TTT reward drift by alternating critic recalibration on labeled data with policy refinement on unlabeled test questions (EM framing).
  • Reports large gains on AIME 2024 (e.g., OLMO3-7B avg@16 33.0%→51.1%; Qwen3-14B 42.3%→65.8%) and preserved diversity where baselines collapse.
  • Useful now because it turns extra inference-time compute into continued improvement, not plateauing.
  • Skepticism / limitation: requires labeled calibration data and actor+critic compute/memory; domain coverage is mostly reasoning/math.
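The EM-style alternation is easy to show in skeleton form: refine the policy on unlabeled test questions using a critic that drifts, and periodically refit the critic on a small labeled set. The reward model, drift, and update rules below are toy stand-ins chosen to make the loop runnable, not TEMPO's objectives:

```python
# Skeleton of test-time training with periodic critic recalibration.
# Everything numeric here is a toy stand-in for policy/critic updates.
import random

random.seed(0)
TRUE_OPT = 7.0          # hidden correct answer (toy)
critic_target = 7.0     # critic's belief about the optimum; drifts at test time
policy_mean = 0.0       # toy "policy" parameter

def critic(ans: float) -> float:
    return -abs(ans - critic_target)

labeled = [(TRUE_OPT, 0.0), (TRUE_OPT - 2, -2.0)]  # small labeled calibration set

for step in range(1, 61):
    # Policy refinement on unlabeled questions: sample, keep critic-best, update.
    cands = [random.gauss(policy_mean, 1.0) for _ in range(8)]
    best = max(cands, key=critic)
    policy_mean += 0.3 * (best - policy_mean)
    critic_target += 0.2                 # reward drift if left unchecked
    if step % 10 == 0:
        # Periodic recalibration: refit the critic to match labeled scores.
        critic_target = min(
            (t / 100 for t in range(-1000, 1001)),
            key=lambda t: sum((-abs(a - t) - s) ** 2 for a, s in labeled),
        )

# Bounded near TRUE_OPT; without recalibration the critic's target would
# drift to 7 + 0.2 * 60 = 19 and drag the policy with it.
print(round(policy_mean, 1))
```

The structural point survives the toy: drift accumulates during unlabeled refinement, and only the labeled-data recalibration step resets it, which is why the labeled calibration set is a hard requirement.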

5) Practical next steps

  • Adopt end-state evaluation for internal agent work: replicate AutomationBench-style deterministic assertions (no partial credit) for your own tool/API workflows; track false “success” declarations explicitly.
  • Instrument process-level safety: log and classify intermediate reasoning steps (HARMTHOUGHTS-style) and measure where harm emerges; don’t rely on final-output labels alone.
  • Red-team with structural attacks: add co-authoring draft prompts (HarDBench) and operator/validator ICL prompts (IICL) to your safety suite; measure moderation miss rates separately from model refusal.
  • Try low-cost structural alignment: prototype ALTTRAIN-like PU→HA→CR formatting with small SFT sets; evaluate over-refusal and multi-turn escalation robustness.
  • Combine mechanism signals: ensemble activation-space steering/detection (e.g., behavior vectors, Activation-LQR) with functional-attribution anomaly detection (BIF correlations) to reduce correlated blind spots.
  • For privacy-sensitive agents, enforce determinism outside the model: evaluate GAAP-style IFC/taint tracking with persistent permissions + disclosure logs for any workflow touching secrets; measure user prompt burden (permission prompts) as a first-class metric.
  • If you deploy RAG in enterprise corpora: test retrieval under redundancy/high similarity (RARE/RedQA style) and report hop-depth curves; avoid single-canonical-passage labeling when redundancy is high.
  • If exploring test-time adaptation: implement TEMPO’s periodic critic recalibration and monitor diversity collapse (pass@K, entropy) as a guardrail.
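For the last step's diversity guardrail, the standard unbiased pass@k estimator is easy to drop in (a generic sketch, not tied to TEMPO's code): with n samples per question of which c are correct, pass@k = 1 - C(n-c, k) / C(n, k).

```python
# Unbiased pass@k estimator. Tracked across test-time-training iterations,
# rising pass@1 with falling pass@k signals diversity collapse.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn without replacement
    from the n generated samples is correct."""
    if n - c < k:
        return 1.0  # fewer incorrect samples than k: some draw must be correct
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(16, 4, 1))   # 0.25
print(pass_at_k(16, 4, 8))   # ~0.962: diversity still paying off at k=8
```

Pairing this with per-question sample entropy gives the two guardrail curves in one pass over the sampled generations.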

Generated from per-paper analyses; no external browsing.