Daily AI Paper Report (2026-04-23)
Run stats
- Candidates: 241
- Selected: 30
- Deepread completed: 30
- Window (UTC): 2026-04-21T00:00:00Z → 2026-04-22T00:00:00Z (arxiv_announce, expanded=0)
Selected papers
| arXiv ID | Title / Links | Categories | Score | Why | Tags |
|---|---|---|---|---|---|
| 2604.19657 | An AI Agent Execution Environment to Safeguard User Data | cs.CR, cs.AI, cs.OS | 95 | Execution environment w/ user-specified permissions to prevent agent data exfiltration/prompt injection. | agent-security, privacy, sandboxing, permissions, prompt-injection, confidential-computing |
| 2604.19001 | When Safety Fails Before the Answer: Benchmarking Harmful Behavior Detection in Reasoning Chains | cs.CL | 94 | Benchmark for sentence-level harmful behaviors inside reasoning traces; enables monitoring/intervention. | AI safety, reasoning, chain-of-thought, harm detection, benchmark, monitoring, jailbreaks |
| 2604.19461 | Involuntary In-Context Learning: Exploiting Few-Shot Pattern Completion to Bypass Safety Alignment in GPT-5.4 | cs.CR | 93 | New jailbreak class (IICL) with strong ablations and high bypass rates; important for safety training limits. | jailbreaks, in-context-attacks, robustness, red-teaming, alignment-failure-modes |
| 2604.18934 | AutomationBench | cs.AI | 92 | Benchmark for cross-app API agents incl. endpoint discovery + policy adherence; strong for agent eval/safety. | agents, benchmark, tool-use, api, policy-adherence, evaluation, automation |
| 2604.18946 | Reasoning Structure Matters for Safety Alignment of Reasoning Models | cs.AI | 92 | Targets LRM safety by changing reasoning structure; claims strong gains with only 1K SFT examples. | safety-alignment, reasoning-models, post-training, SFT, jailbreak-robustness |
| 2604.19656 | Pause or Fabricate? Training Language Models for Grounded Reasoning | cs.CL | 91 | RL framework to make models pause/clarify under missing premises, reducing confident fabrication. | grounded reasoning, hallucinations, RL, interactive RL, uncertainty, reliability, alignment |
| 2604.19274 | HarDBench: A Benchmark for Draft-Based Co-Authoring Jailbreak Attacks for Safe Human-LLM Collaborative Writing | cs.CL | 90 | Benchmark for draft-based co-authoring jailbreaks in high-risk domains; realistic collaborative writing threat. | benchmarks, jailbreaks, harmful-content, human-LLM-collaboration, safety-eval |
| 2604.19018 | Local Linearity of LLMs Enables Activation Steering via Model-Based Linear Optimal Control | cs.LG, cs.AI, eess.SY, math.OC, stat.ML | 90 | Formalizes activation steering as feedback control using local linearity; could improve reliable inference-time alignment. | activation-steering, inference-time-alignment, control, interpretability, robustness |
| 2604.19638 | SafetyALFRED: Evaluating Safety-Conscious Planning of Multimodal Large Language Models | cs.AI, cs.CL, cs.RO | 88 | Embodied safety benchmark measuring hazard mitigation planning (not just recognition); exposes alignment gap. | embodied-agents, planning, hazard-mitigation, multimodal, safety-evaluation, robotics |
| 2604.18976 | STAR-Teaming: A Strategy-Response Multiplex Network Approach to Automated LLM Red Teaming | cs.CL | 88 | Automated black-box red teaming via multi-agent strategy-response network; interpretable vuln mapping. | red teaming, jailbreaks, adversarial prompts, multi-agent, evaluation, security |
| 2604.19540 | Mesh Memory Protocol: Semantic Infrastructure for Multi-Agent LLM Systems | cs.MA, cs.AI | 88 | Protocol for cross-session multi-agent memory with field-level acceptance + traceability; key for reliable agents. | multi-agent, memory, provenance, traceability, coordination, agent-infrastructure |
| 2604.19295 | TEMPO: Scaling Test-time Training for Large Reasoning Models | cs.LG | 88 | Scales test-time training for LRMs via EM-style critic recalibration; addresses reward drift and diversity collapse. | test-time-training, reasoning, RL, critic-calibration, inference-scaling |
| 2604.19533 | Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps | cs.CR, cs.AI | 86 | Agentic SecOps threat-hunting benchmark with large event logs + RL env; strong real-world evaluation setup. | cybersecurity, agents, benchmark, tool-use, SQL, evaluation |
| 2604.19049 | Refute-or-Promote: An Adversarial Stage-Gated Multi-Agent Review Methodology for High-Precision LLM-Assisted Defect Discovery | cs.CR, cs.AI, cs.SE | 86 | Stage-gated adversarial multi-agent review to cut false positives in LLM defect discovery campaigns. | software security, agents, verification, LLM reliability, triage, vulnerability discovery |
| 2604.19354 | Do Agents Dream of Root Shells? Partial-Credit Evaluation of LLM Agents in Capture The Flag Challenges | cs.AI, cs.CR, cs.SE | 85 | CTF agent benchmark in isolated VMs with partial-credit scoring + full traces; useful for capability auditing. | cybersecurity, agent-evaluation, CTF, partial-credit, traces, offensive-security |
| 2604.19083 | ProjLens: Unveiling the Role of Projectors in Multimodal Model Safety | cs.CR, cs.AI | 84 | Interpretability for MLLM backdoors focusing on projector role; clarifies mechanisms and mitigation targets. | multimodal, backdoors, data-poisoning, interpretability, model-security |
| 2604.19561 | Detecting Data Contamination in Large Language Models | cs.AI | 84 | Unified evaluation of black-box membership inference for LLM data contamination; proposes new method. | privacy, membership inference, data contamination, copyright, LLM auditing, security |
| 2604.19299 | Rethinking Scale: Deployment Trade-offs of Small Language Models under Agent Paradigms | cs.CL, cs.AI | 84 | Large-scale study of <10B models under tool/multi-agent paradigms; practical deployment trade-offs. | small-models, agents, tool-use, multi-agent, efficiency, deployment, evaluation |
| 2604.19572 | A Self-Evolving Framework for Efficient Terminal Agents via Observational Context Compression | cs.CL | 84 | Self-evolving observation compression for terminal agents; reduces long-horizon token blowup and cost. | agents, long-horizon, context-compression, terminal-agents, efficiency |
| 2604.18982 | SAVOIR: Learning Social Savoir-Faire via Shapley-based Reward Attribution | cs.AI | 83 | Shapley-based reward attribution for social dialogue RL; principled credit assignment for language agents. | rl, credit-assignment, shapley, dialogue-agents, social, alignment |
| 2604.19405 | Lost in Translation: Do LVLM Judges Generalize Across Languages? | cs.CL | 82 | Large multilingual multimodal judge benchmark; critical for reward-model generalization and eval reliability. | evaluation, judge-models, reward-models, multilingual, vision-language, benchmarks |
| 2604.19047 | RARE: Redundancy-Aware Retrieval Evaluation Framework for High-Similarity Corpora | cs.CL, cs.AI, cs.IR | 82 | RAG retrieval eval that accounts for redundancy/high-similarity corpora; more realistic benchmarks. | RAG, retrieval evaluation, benchmarks, redundancy, IR, grounding |
| 2604.19089 | Towards Scalable Lifelong Knowledge Editing with Selective Knowledge Suppression | cs.AI | 82 | Lifelong knowledge editing with selective suppression to reduce forgetting; aims at scalable sequential updates. | knowledge-editing, continual-learning, hallucinations, model-updates, reliability |
| 2604.18963 | Distillation Traps and Guards: A Calibration Knob for LLM Distillability | cs.LG, cs.AI | 81 | Analyzes distillation failure modes and proposes calibration to control distillability; relevant to leakage risk. | distillation, calibration, model-leakage, post-training, reliability |
| 2604.19565 | Detecting Hallucinations in SpeechLLMs at Inference Time Using Attention Maps | cs.CL, cs.AI, cs.LG | 80 | Inference-time hallucination detection for SpeechLLMs using attention-map features; no gold labels needed. | hallucination detection, speech LLMs, inference-time, attention, reliability, monitoring |
| 2604.19254 | ShadowPEFT: Shadow Network for Parameter-Efficient Fine-Tuning | cs.CL, cs.AI | 80 | New PEFT method (shadow module) shifting adaptation to shared layer-level refinement; could improve tuning efficiency. | peft, fine-tuning, lora-alternative, efficiency, llm-training |
| 2604.19395 | Does Self-Consistency Improve the Recall of Encyclopedic Knowledge? | cs.CL | 80 | Creates MMLU knowledge-recall split; finds self-consistency boosts encyclopedic recall, not just reasoning. | evaluation, self-consistency, knowledge-recall, MMLU, prompting |
| 2604.18970 | Mechanistic Anomaly Detection via Functional Attribution | cs.LG, cs.CR | 79 | Mechanistic anomaly detection via functional attribution/influence; aims to detect anomalous internal behavior. | anomaly-detection, mechanistic-interpretability, influence-functions, model-monitoring, security |
| 2604.19728 | VLA Foundry: A Unified Framework for Training Vision-Language-Action Models | cs.RO, cs.AI, cs.CV, cs.LG, cs.SE | 79 | Unified open framework for LLM→VLM→VLA training; releases models + closed-loop eval in simulator. | vla, robotics, vision-language-action, framework, training-pipeline, open-source |
| 2604.19092 | RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation | cs.RO, cs.AI | 79 | Benchmark for world models grounded in executable robot actions; tests physical plausibility beyond realism. | benchmarks, world-models, robotics, embodied-agents, evaluation |
AI Paper Insight Brief
2026-04-23
0) Executive takeaways (read this first)
- Agentic “real work” benchmarks are exposing a large capability gap: cross-app business automation (<10% pass on AutomationBench), SOC threat hunting (best model ~3.82% submitted-flag recall), and VM-based CTF exploitation (best ~35% checkpoint completion) all show frontier models are far from reliable autonomy in high-stakes environments.
- Safety is shifting from output-only to process/mechanism-level control: sentence-level harm detection in reasoning traces (HARMTHOUGHTS) shows big performance collapse at fine granularity, while activation steering via closed-loop control (Activation-LQR) and functional-attribution anomaly detection (BIF/SGLD correlations) offer mechanism-aware levers.
- Jailbreaks are increasingly “structural,” not obfuscation-based: draft-based co-authoring prompts (HarDBench) and Involuntary In-Context Learning (IICL) bypass safety by exploiting completion/pattern mechanisms; defenses that only scan for encoded payloads or keywords will miss these.
- Practical alignment interventions are getting cheaper and more “surgical”: ALTTRAIN changes reasoning structure with ~1K SFT examples; LightEdit performs lifelong knowledge edits without parameter updates via selective retrieval + first-token suppression; both emphasize targeted control over broad RL.
- Evaluation realism is improving in RAG and robotics: redundancy-aware retrieval evaluation (RARE/RedQA) shows multi-hop retrieval collapses in high-similarity enterprise corpora; RoboWM-Bench operationalizes “executability” of world-model rollouts by converting predicted videos into actions and executing in real-to-sim.
1) Key themes (clusters)
Theme: Agent benchmarks that measure state change, not chat quality
- Why it matters: Production automation/security work is judged by deterministic end-state changes (records updated, flags submitted, shells obtained), and current models fail badly under these criteria.
- Representative papers:
- AutomationBench
- Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps
- Do Agents Dream of Root Shells? Partial-Credit Evaluation of LLM Agents in Capture The Flag Challenges
- Common approach:
- Simulated-but-realistic environments (multi-app REST APIs; Windows event logs in SQLite; isolated attacker/target VMs).
- Hard budgets (steps/tool calls; 50 SQL queries; time/step limits) and programmatic or rubric-based scoring.
- Emphasis on partial credit (checkpoints) or end-state assertions (no LLM-as-judge for AutomationBench); a minimal grading sketch follows below.
- Open questions / failure modes:
- Agents often “declare success” without achieving state changes (AutomationBench) or observe evidence but fail to submit/attribute it (Cyber Defense Benchmark).
- Rubric dependence and summarization sensitivity for partial-credit judging (DeepRed).
- How to train agents to improve without overfitting to benchmark-specific tool surfaces and hardening tricks.
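To make the "end-state, not chat" criterion concrete, here is a minimal Python sketch of deterministic episode grading. The state-snapshot shape and `Assertion` interface are hypothetical, not AutomationBench's actual harness:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Assertion:
    description: str
    check: Callable[[dict], bool]  # receives a snapshot of environment state

def grade_episode(final_state: dict, assertions: list[Assertion],
                  agent_claimed_success: bool) -> dict:
    """All-or-nothing grading: pass only if every end-state assertion holds.

    Also flags false "success" declarations, a failure mode several of these
    benchmarks report explicitly.
    """
    results = {a.description: a.check(final_state) for a in assertions}
    passed = all(results.values())
    return {
        "passed": passed,
        "per_assertion": results,
        "false_success_claim": agent_claimed_success and not passed,
    }

# Toy task: "create invoice INV-7 and mark ticket T-3 resolved".
assertions = [
    Assertion("invoice INV-7 exists",
              lambda s: any(i["id"] == "INV-7" for i in s["invoices"])),
    Assertion("ticket T-3 resolved",
              lambda s: s["tickets"]["T-3"]["status"] == "resolved"),
]
state = {"invoices": [{"id": "INV-7"}], "tickets": {"T-3": {"status": "open"}}}
print(grade_episode(state, assertions, agent_claimed_success=True))
# passed=False and false_success_claim=True: the agent "declared success".
```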
Theme: Process-level safety: detect and intervene inside reasoning
- Why it matters: Harm can emerge in intermediate reasoning steps even when final outputs look safer; output-only safety misses escalation dynamics and blocks targeted mitigations.
- Representative papers:
- When Safety Fails Before the Answer: Benchmarking Harmful Behavior Detection in Reasoning Chains
- Reasoning Structure Matters for Safety Alignment of Reasoning Models
- Local Linearity of LLMs Enables Activation Steering via Model-Based Linear Optimal Control
- Common approach:
- Fine-grained taxonomies/labels for reasoning behavior (16 sentence-level harm-propagation labels).
- Structural changes to reasoning pipelines (PU→HA→CR) via small SFT datasets.
- Inference-time activation interventions using explicit dynamical models (Jacobian-based LTV + LQR feedback); a generic LQR sketch follows below.
- Open questions / failure modes:
- Fine-grained detection remains weak (e.g., Macro-F1 of ~0.46–0.56 on 16-way labels).
- Steering is sensitive to hyperparameters and representational limits (e.g., “benign nonrefusal” unless interventions are applied token-wise).
- Generalization beyond evaluated model families and beyond text-only settings (ALTTRAIN multimodal untested).
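For intuition on the control framing, the sketch below is a textbook finite-horizon discrete-time LQR, the primitive this line of work builds on. The `A`/`B` matrices here are toy stand-ins; in the paper's setting they would come from Jacobian linearization of the model's layer dynamics, which is not reproduced here:

```python
import numpy as np

def lqr_gains(A, B, Q, R, horizon):
    """Backward Riccati recursion; returns time-ordered gains K_t for u_t = -K_t x_t."""
    P = Q.copy()
    gains = []
    for _ in range(horizon):
        BtP = B.T @ P
        K = np.linalg.solve(R + BtP @ B, BtP @ A)  # (R + B'PB)^{-1} B'PA
        P = Q + A.T @ P @ (A - B @ K)
        gains.append(K)
    return gains[::-1]

# Toy example: drive a 4-dim "activation" toward the origin (a stand-in for
# steering toward a target direction in representation space).
rng = np.random.default_rng(0)
d = 4
A = np.eye(d) + 0.1 * rng.standard_normal((d, d))  # local linearization
B = np.eye(d)                                      # additive intervention on activations
Q, R = np.eye(d), 0.1 * np.eye(d)                  # deviation cost vs. intervention cost
x = rng.standard_normal(d)
for K in lqr_gains(A, B, Q, R, horizon=8):
    x = A @ x + B @ (-K @ x)  # closed-loop update
print(np.linalg.norm(x))  # driven close to zero
```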
Theme: Structural jailbreaks and realistic misuse surfaces
- Why it matters: Real deployments (co-authoring, pattern completion) create attack surfaces that bypass conventional moderation and refusal training.
- Representative papers:
- HarDBench: A Benchmark for Draft-Based Co-Authoring Jailbreak Attacks for Safe Human-LLM Collaborative Writing
- Involuntary In-Context Learning: Exploiting Few-Shot Pattern Completion to Bypass Safety Alignment in GPT-5.4
- STAR-Teaming: A Strategy-Response Multiplex Network Approach to Automated LLM Red Teaming
- Common approach:
- Attack framing that exploits completion instincts (incomplete harmful drafts) or ICL pattern constraints (operator + validator).
- Automated red-teaming systems that learn which strategy families map to which unsafe response clusters.
- Preference-optimization mitigation targeting safety–utility balance (SUBA via KTO/GRPO).
- Open questions / failure modes:
- Moderation misses co-authoring intent shifts (HarDBench reports a large drop in moderation's unsafe-flag rate for CoJP prompts vs. direct harmful queries); a sketch of separated miss/refusal metrics follows below.
- IICL robustness appears bimodal across models; mechanistic explanation remains speculative without white-box analysis.
- Red-teaming pipelines depend on scorer reliability and can drift over time (STAR-Teaming limitations).
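When adding these probes to a safety suite, it helps to score the input filter and the model separately. A small sketch under a hypothetical logging schema (field names are ours, not from any of these papers):

```python
from dataclasses import dataclass

@dataclass
class ProbeResult:
    attack_family: str        # e.g. "coauthoring_draft" or "iicl_pattern"
    moderation_flagged: bool  # did the input filter fire?
    model_refused: bool       # did the model refuse?
    output_unsafe: bool       # judgment on the final output

def summarize(results: list[ProbeResult]) -> dict:
    """Per-family rates, keeping moderation misses distinct from refusals so a
    filter that never fires isn't masked by a model that often refuses."""
    summary: dict[str, dict] = {}
    for fam in sorted({r.attack_family for r in results}):
        rs = [r for r in results if r.attack_family == fam]
        n = len(rs)
        summary[fam] = {
            "moderation_miss_rate": sum(not r.moderation_flagged for r in rs) / n,
            "refusal_rate": sum(r.model_refused for r in rs) / n,
            "attack_success_rate": sum(r.output_unsafe for r in rs) / n,
        }
    return summary

results = [
    ProbeResult("coauthoring_draft", False, False, True),
    ProbeResult("coauthoring_draft", False, True, False),
    ProbeResult("direct_query", True, True, False),
]
print(summarize(results))
```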
Theme: Security & privacy controls beyond “trust the model”
- Why it matters: As agents touch private data and critical systems, defenses must hold even if prompts/models are adversarial; also, model extraction/distillation is both a capability tool and an IP risk.
- Representative papers:
- An AI Agent Execution Environment to Safeguard User Data
- Mechanistic Anomaly Detection via Functional Attribution
- Distillation Traps and Guards: A Calibration Knob for LLM Distillability
- Common approach:
- Deterministic enforcement layers (information-flow control over agent-generated code; persistent permissions + disclosure logs), sketched after this list.
- Mechanism-aware detection signals that are decorrelated from activation clustering (loss-trace correlations under localized posterior sampling).
- Post-hoc teacher calibration to control distillability (η knob to make teachers more/less distillable).
- Open questions / failure modes:
- Compute overhead (SGLD sampling for MAD; RL fine-tuning for distillability calibration).
- Trusted artifacts and UX burden (GAAP tool annotations; many permission prompts).
- Dual-use: undistillable teachers and backdoor detection methods can inform attackers as well as defenders.
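As a sketch of the deterministic-enforcement idea, the class below pairs a persistent permission store with an append-only disclosure log, so whether data may leave the sandbox never depends on model output. Class, method, and file-format choices are illustrative assumptions, not GAAP's API:

```python
import json
import time
from pathlib import Path

class DisclosureGate:
    """Deterministic gate consulted before any labeled data leaves the sandbox."""

    def __init__(self, perms_path: Path, log_path: Path):
        self.log_path = log_path
        # Persistent permissions: data label -> destinations the user granted.
        self.perms = (json.loads(perms_path.read_text())
                      if perms_path.exists() else {})

    def allow(self, data_label: str, destination: str) -> bool:
        # Pure policy lookup; the model's opinion is never consulted.
        return destination in self.perms.get(data_label, [])

    def disclose(self, data_label: str, destination: str, payload: str) -> bool:
        ok = self.allow(data_label, destination)
        entry = {"ts": time.time(), "label": data_label,
                 "dest": destination, "allowed": ok, "bytes": len(payload)}
        with self.log_path.open("a") as f:  # append-only audit trail
            f.write(json.dumps(entry) + "\n")
        return ok  # the caller transmits payload only when True
```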
Theme: Evaluation realism for retrieval and embodied world models
- Why it matters: Enterprise RAG corpora are redundant/high-similarity, and robotics world models must be executable—not just visually plausible.
- Representative papers:
- RARE: Redundancy-Aware Retrieval Evaluation Framework for High-Similarity Corpora
- RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation
- SafetyALFRED: Evaluating Safety-Conscious Planning of Multimodal Large Language Models
- Common approach:
- New benchmarks that isolate the missing axis (redundancy-aware evidence sets; executability via action extraction; QA-to-embodied mitigation gap); a redundancy-aware scoring sketch follows below.
- Multi-stage pipelines with explicit filtering/validation (CRRF for stable LLM judging; hierarchical step checkers for executability).
- Diagnostics that separate perception vs planning (metadata-augmented observations in SafetyALFRED).
- Open questions / failure modes:
- Retrieval collapses sharply with hop depth in high-overlap corpora (PerfRecall@10 at 4-hop drops to single digits in some domains).
- World-model rollouts can look good but fail under execution; fine-tuning helps but doesn’t fix contact/spatial inconsistencies.
- Embodied safety: strong hazard recognition does not translate to mitigation; multi-agent decoupling helps only partially.
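The redundancy-aware scoring idea can be reduced to fact-coverage recall: any retrieved passage that covers a required atomic fact earns credit, so a valid duplicate is not penalized. The coverage map is assumed to come from offline annotation; this captures the spirit of RARE-style evaluation, not its exact metric:

```python
def fact_recall_at_k(retrieved_ids: list[str],
                     required_facts: set[str],
                     fact_coverage: dict[str, set[str]],  # passage id -> facts covered
                     k: int) -> float:
    """Fraction of required atomic facts covered by the top-k retrieved passages."""
    covered: set[str] = set()
    for pid in retrieved_ids[:k]:
        covered |= fact_coverage.get(pid, set()) & required_facts
    return len(covered) / len(required_facts)

# p1 and p2 both paraphrase fact f1; retrieving either earns full credit,
# unlike single-canonical-passage labeling.
coverage = {"p1": {"f1"}, "p2": {"f1"}, "p3": {"f2"}}
print(fact_recall_at_k(["p2", "p3"], {"f1", "f2"}, coverage, k=10))  # 1.0
```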
2) Technical synthesis
- Deterministic, end-state grading (AutomationBench) and partial-credit checkpointing (DeepRed) are converging on a shared goal: measure agent progress without subjective LLM judging, or constrain LLM judging to rubric application.
- Multiple papers highlight capability fragmentation: different frontier models solve disjoint subsets of automation tasks (low Jaccard overlap), suggesting ensembles or routing could outperform single models even before training improvements; a quick overlap check is sketched after this list.
- Safety evaluation is moving “earlier in the pipeline”: HARMTHOUGHTS shows detectors that work for binary harmfulness degrade sharply for fine-grained behaviors, motivating sequence/context-aware detectors rather than sentence-independent classifiers.
- Two complementary mechanism tools emerge: activation-space control (Activation-LQR’s Jacobian/LQR closed-loop steering) and parameter-space attribution (BIF/SGLD loss-trace correlations) for detecting anomalous mechanisms like backdoors.
- Structural alignment interventions appear effective with low data: ALTTRAIN’s reasoning-structure SFT on ~1K examples reduces harmful responses while preserving capabilities, with ablations indicating HA is critical for safety.
- Jailbreak research is emphasizing prompt-structure vulnerabilities (IICL operator framing; co-authoring drafts) that bypass content filters; this aligns with the need for structure-aware defenses rather than keyword/payload detection.
- Retrieval evaluation is being redesigned for enterprise reality: RARE’s atomic-fact redundancy tracking and redundancy-aware gold sets show that “single canonical passage” labeling can mis-score valid retrieval.
- Test-time adaptation is becoming more principled: TEMPO frames TTT as EM with periodic critic recalibration to prevent reward drift and diversity collapse, showing sustained gains with more test-time iterations.
- Practical deployment work is quantifying system trade-offs: SLM agent paradigms show SAS improves normalized quality but reduces completion rate; MAS adds coordination failures and token overhead.
- Several works emphasize persistent state and policy as core infrastructure: GAAP’s disclosure log/permissions DB and Mesh Memory Protocol’s write-time remix + lineage both treat persistence as a first-class safety/reliability primitive.
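The fragmentation claim is cheap to quantify on your own runs; a sketch with made-up solved-task sets:

```python
from itertools import combinations

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 1.0

solved = {  # hypothetical per-model solved-task IDs
    "model_a": {1, 2, 3, 8},
    "model_b": {3, 4, 5},
    "model_c": {5, 6},
}
for (m1, s1), (m2, s2) in combinations(solved.items(), 2):
    print(f"{m1} vs {m2}: Jaccard = {jaccard(s1, s2):.2f}")

# Upper bound on what a perfect router/ensemble over these models could solve.
print("oracle-router solves:", len(set().union(*solved.values())), "tasks")
```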
3) Top 5 papers (with “why now”)
1) AutomationBench
- Introduces a cross-application automation benchmark requiring API discovery + policy adherence + deterministic state changes across ~47 apps and ~500 endpoints.
- Shows frontier models are <10% on private tasks, with distinct solved subsets across models (low overlap), indicating headroom and potential for routing/ensembles.
- Useful now because it matches how businesses evaluate automation: end-state correctness, not conversational plausibility.
- Skepticism / limitation: simulated APIs and synthetic tasks may diverge from production behavior; ongoing auditing/versioning needed.
2) Mechanistic Anomaly Detection via Functional Attribution
- Reframes anomaly/backdoor detection as functional attribution from trusted samples using Bayesian influence functions (SGLD loss-trace correlations).
- Reports strong results on BackdoorBench and near-perfect AUROC in several LLM backdoor settings, including robustness to activation obfuscation.
- Useful now as a decorrelated signal to activation-space detectors, addressing a known evasion route.
- Skepticism / limitation: computationally expensive (many SGLD draws) and requires a trusted reference set.
3) Reasoning Structure Matters for Safety Alignment of Reasoning Models
- Proposes ALTTRAIN: change reasoning from PU→SR to PU→HA→CR via SFT on ~1K structured examples (no RL).
- Reports substantial harmfulness reduction with minimal capability impact; ablations show HA is key and scaling data reduces over-refusal.
- Useful now as a low-cost alignment knob for reasoning models that tend to “solve even when harmful.”
- Skepticism / limitation: multimodal generalization untested; relies on HA sentences generated by an LLM and sampled from existing red-team data.
4) HarDBench: Draft-Based Co-Authoring Jailbreak Attacks
- Defines and benchmarks a realistic misuse mode: incomplete harmful drafts framed as editing requests that induce detailed harmful completions.
- Shows high ASR under co-authoring framing (e.g., GPT-4o reported ASR 96.75% under CoJP) and that moderation misses intent shifts.
- Provides SUBA (KTO/GRPO) that reduces ASR dramatically while largely preserving long-form writing utility.
- Skepticism / limitation: limited to four domains and fixed templates; multi-turn adaptive attacks not covered.
5) TEMPO: Scaling Test-time Training for Large Reasoning Models
- Addresses TTT reward drift by alternating critic recalibration on labeled data with policy refinement on unlabeled test questions (EM framing); the alternation is schematized below.
- Reports large gains on AIME 2024 (e.g., OLMO3-7B avg@16 33.0%→51.1%; Qwen3-14B 42.3%→65.8%) and preserved diversity where baselines collapse.
- Useful now because it turns extra inference-time compute into continued improvement, not plateauing.
- Skepticism / limitation: requires labeled calibration data and actor+critic compute/memory; domain coverage is mostly reasoning/math.
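A schematic of the alternation TEMPO is described as using, per the summary above. The objects are duck-typed placeholders and the bodies are not the paper's code; this shows only the loop structure:

```python
def test_time_training(policy, critic, calibration_set, test_questions,
                       rounds: int, refine_steps: int):
    """EM-style alternation: recalibrate the critic, then refine the policy.

    `policy` needs .sample(q) and .update(qs, samples, rewards);
    `critic` needs .fit(labeled_data, policy) and .score(q, sample).
    All are hypothetical interfaces for illustration.
    """
    for _ in range(rounds):
        # E-step analogue: refit the critic on labeled data so its reward
        # estimates don't drift as the policy moves.
        critic.fit(calibration_set, policy)
        # M-step analogue: improve the policy against the fixed critic
        # using unlabeled test questions.
        for _ in range(refine_steps):
            samples = [policy.sample(q) for q in test_questions]
            rewards = [critic.score(q, s) for q, s in zip(test_questions, samples)]
            policy.update(test_questions, samples, rewards)
    return policy
```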
4) Practical next steps
- Adopt end-state evaluation for internal agent work: replicate AutomationBench-style deterministic assertions (no partial credit) for your own tool/API workflows; track false “success” declarations explicitly.
- Instrument process-level safety: log and classify intermediate reasoning steps (HARMTHOUGHTS-style) and measure where harm emerges; don’t rely on final-output labels alone.
- Red-team with structural attacks: add co-authoring draft prompts (HarDBench) and operator/validator ICL prompts (IICL) to your safety suite; measure moderation miss rates separately from model refusal.
- Try low-cost structural alignment: prototype ALTTRAIN-like PU→HA→CR formatting with small SFT sets; evaluate over-refusal and multi-turn escalation robustness.
- Combine mechanism signals: ensemble activation-space steering/detection (e.g., behavior vectors, Activation-LQR) with functional-attribution anomaly detection (BIF correlations) to reduce correlated blind spots.
- For privacy-sensitive agents, enforce determinism outside the model: evaluate GAAP-style IFC/taint tracking with persistent permissions + disclosure logs for any workflow touching secrets; measure user prompt burden (permission prompts) as a first-class metric.
- If you deploy RAG in enterprise corpora: test retrieval under redundancy/high similarity (RARE/RedQA style) and report hop-depth curves; avoid single-canonical-passage labeling when redundancy is high.
- If exploring test-time adaptation: implement TEMPO’s periodic critic recalibration and monitor diversity collapse (pass@K, entropy) as a guardrail; a standard pass@k estimator is sketched below.
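For the pass@K guardrail, the standard unbiased estimator (from the Codex evaluation methodology) is easy to drop into any harness: with n samples per problem and c correct, pass@k = 1 - C(n-c, k) / C(n, k):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k from n samples with c correct."""
    if n - c < k:  # every size-k draw must contain at least one success
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Diversity collapse shows up as pass@1 rising while pass@16 stalls or falls.
print(pass_at_k(n=16, c=4, k=1))   # 0.25
print(pass_at_k(n=16, c=4, k=16))  # 1.0
```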
Generated from per-paper analyses; no external browsing.
