Daily AI Paper Report (2026-05-01)


Run stats

  • Candidates: 179
  • Selected: 30
  • Deepread completed: 30
  • Window (UTC): 2026-04-29T00:00:00Z → 2026-04-30T00:00:00Z (arxiv_announce, expanded=0)
Selected papers

  • 2604.26511 Tatemae: Detecting Alignment Faking via Tool Selection in LLMs [PDF]
    Categories: cs.CR, cs.AI | Score: 95
    Why: Directly targets alignment faking via observable tool-use behavior; strong safety relevance.
    Tags: alignment, agents, tool-use, deception-detection, evaluation
  • 2604.26505 Quantamination: Dynamic Quantization Leaks Your Data Across the Batch [PDF]
    Categories: cs.CR, cs.LG | Score: 95
    Why: Reveals serving-time data leakage from dynamic quantization; strong ML security relevance.
    Tags: security, privacy, inference, quantization, data-leakage
  • 2604.26274 Enforcing Benign Trajectories: A Behavioral Firewall for Structured-Workflow AI Agents [PDF]
    Categories: cs.CR, cs.AI | Score: 93
    Why: Behavioral firewall for workflow agents with low ASR on agent attacks; practical agent security.
    Tags: agent-safety, security, tool-use, guardrails, runtime-monitoring
  • 2604.26506 SafeReview: Defending LLM-based Review Systems Against Adversarial Hidden Prompts [PDF]
    Categories: cs.CL, cs.CR | Score: 92
    Why: Targets hidden-prompt attacks on LLM review systems with adversarially trained defense.
    Tags: llm-security, prompt-injection, adversarial-defense, evaluation, peer-review
  • 2604.26256 DORA: A Scalable Asynchronous Reinforcement Learning System for Language Model Training [PDF]
    Categories: cs.LG, cs.DC | Score: 92
    Why: Scalable async RL for LLM post-training; tackles rollout bottlenecks with convergence-aware design.
    Tags: llm-training, reinforcement-learning, post-training, systems, scaling
  • 2604.26577 Benchmarking the Safety of Large Language Models for Robotic Health Attendant Control [PDF]
    Categories: cs.AI, cs.CY, cs.RO | Score: 91
    Why: Benchmarks LLM safety for robotic health attendants with concrete violation rates across 72 models.
    Tags: safety-benchmark, robotics, llm-evaluation, harmful-instructions, deployment
  • 2604.26866 MoRFI: Monotonic Sparse Autoencoder Feature Identification [PDF]
    Categories: cs.CL, cs.LG | Score: 90
    Why: Finds causal latent directions for hallucinations after fine-tuning; useful for reliability.
    Tags: hallucination, interpretability, fine-tuning, llm-reliability, sae
  • 2604.26837 Unifying Sparse Attention with Hierarchical Memory for Scalable Long-Context LLM Serving [PDF]
    Categories: cs.LG | Score: 90
    Why: Long-context LLM serving framework with sparse attention + hierarchical memory; strong systems impact.
    Tags: llm-systems, long-context, sparse-attention, kv-cache, inference
  • 2604.26206 Option-Order Randomisation Reveals a Distributional Position Attractor in Prompted Sandbagging [PDF]
    Categories: cs.CL, cs.AI | Score: 89
    Why: Directly probes sandbagging behavior in LLMs with preregistered randomized evaluation.
    Tags: llm-safety, evaluation, sandbagging, robustness, behavior
  • 2604.26904 ClawGym: A Scalable Framework for Building Effective Claw Agents [PDF]
    Categories: cs.CL, cs.AI, cs.LG | Score: 88
    Why: Scalable framework, dataset, training, and diagnostics for Claw-style agents; high reuse potential.
    Tags: agents, benchmark, dataset, training-framework, tool-use
  • 2604.26622 OCR-Memory: Optical Context Retrieval for Long-Horizon Agent Memory [PDF]
    Categories: cs.CL | Score: 88
    Why: Novel long-horizon agent memory via visual compression of trajectories; high agent relevance.
    Tags: agents, memory, long-context, multimodal, retrieval
  • 2604.26836 Uncertainty-Aware Predictive Safety Filters for Probabilistic Neural Network Dynamics [PDF]
    Categories: cs.LG, eess.SY | Score: 88
    Why: Rigorous uncertainty-aware safety filter for neural dynamics in RL; strong safety relevance.
    Tags: safety, reinforcement-learning, control, uncertainty, safe-exploration
  • 2604.26779 Accelerating RL Post-Training Rollouts via System-Integrated Speculative Decoding [PDF]
    Categories: cs.LG, cs.CL | Score: 88
    Why: Speeds RL post-training rollouts losslessly via speculative decoding; highly relevant to frontier training.
    Tags: rl-post-training, speculative-decoding, llm-training, systems, efficiency
  • 2604.26649 When to Retrieve During Reasoning: Adaptive Retrieval for Large Reasoning Models [PDF]
    Categories: cs.IR, cs.AI, cs.CL | Score: 87
    Why: Adaptive retrieval during reasoning addresses a key gap for long-CoT reasoning models and RAG.
    Tags: RAG, reasoning, retrieval, long-context, efficiency
  • 2604.26733 FutureWorld: A Live Environment for Training Predictive Agents with Real-World Outcome Rewards [PDF]
    Categories: cs.AI, cs.LG | Score: 87
    Why: Creates a live environment for predictive agents with real-world outcome rewards.
    Tags: agents, evaluation, environment, online-learning, forecasting
  • 2604.26355 Shorthand for Thought: Compressing LLM Reasoning via Entropy-Guided Supertokens [PDF]
    Categories: cs.CL | Score: 87
    Why: Compresses reasoning traces via supertokens, promising lower inference cost with retained accuracy.
    Tags: llm, reasoning, efficiency, tokenization, inference
  • 2604.26411 Unifying Runtime Monitoring Approaches for Safety-Critical Machine Learning: Application to Vision-Based Landing [PDF]
    Categories: cs.LG | Score: 87
    Why: Unified runtime monitoring taxonomy for safety-critical ML with concrete landing-case evaluation.
    Tags: safety, runtime-monitoring, ood, safety-critical-ml, evaluation
  • 2604.26419 Delineating Knowledge Boundaries for Honest Large Vision-Language Models [PDF]
    Categories: cs.CV, cs.AI | Score: 85
    Why: Improves VLM honesty by teaching refusal on unknowns; relevant to hallucination and reliability.
    Tags: vlm, hallucination, uncertainty, refusal, alignment
  • 2604.26835 HalluCiteChecker: A Lightweight Toolkit for Hallucinated Citation Detection and Verification in the Era of AI Scientists [PDF]
    Categories: cs.CL, cs.AI, cs.DL | Score: 85
    Why: Practical toolkit for hallucinated citation detection; directly useful for AI scientist safety.
    Tags: hallucination, citations, tooling, evaluation, ai-scientists
  • 2604.26951 Turning the TIDE: Cross-Architecture Distillation for Diffusion Large Language Models [PDF]
    Categories: cs.CL, cs.AI, cs.LG | Score: 85
    Why: Cross-architecture distillation for diffusion LLMs; notable frontier-model efficiency contribution.
    Tags: diffusion-llm, distillation, model-compression, architecture, efficiency
  • 2604.26525 PRAG End-to-End Privacy-Preserving Retrieval-Augmented Generation [PDF]
    Categories: cs.CR | Score: 84
    Why: End-to-end privacy-preserving RAG with encrypted retrieval is important for secure LLM deployment.
    Tags: RAG, privacy, security, retrieval, confidentiality
  • 2604.26768 Decoupling Knowledge and Task Subspaces for Composable Parametric Retrieval Augmented Generation [PDF]
    Categories: cs.CL | Score: 84
    Why: Addresses composability and stability in parametric RAG by separating knowledge from task behavior.
    Tags: rag, retrieval, knowledge, modularity, llm
  • 2604.26557 DUAL-BLADE: Dual-Path NVMe-Direct KV-Cache Offloading for Edge LLM Inference [PDF]
    Categories: cs.DC, cs.AI, cs.PF | Score: 84
    Why: Edge LLM inference via NVMe-direct KV offloading; practical memory-system advance for deployment.
    Tags: edge-llm, kv-cache, offloading, inference, systems
  • 2604.26923 ClassEval-Pro: A Cross-Domain Benchmark for Class-Level Code Generation [PDF]
    Categories: cs.SE, cs.CL | Score: 83
    Why: Fresh benchmark for class-level code generation with contamination-aware construction.
    Tags: code-llms, benchmark, evaluation, contamination, software-engineering
  • 2604.26334 Efficient, VRAM-Constrained xLM Inference on Clients [PDF]
    Categories: cs.DC, cs.AR, cs.LG | Score: 83
    Why: Practical VRAM-constrained xLM inference for clients; useful systems advance for dense and MoE models.
    Tags: inference, efficiency, edge, llm, vlm, moe
  • 2604.26495 Beyond Code Reasoning: A Specification-Anchored Audit Framework for Expert-Augmented Security Verification [PDF]
    Categories: cs.CR | Score: 82
    Why: Specification-anchored security auditing is novel and could strengthen AI-assisted code verification.
    Tags: security, auditing, verification, specifications, code
  • 2604.26805 Bian Que: An Agentic Framework with Flexible Skill Arrangement for Online System Operations [PDF]
    Categories: cs.AI, cs.MA | Score: 82
    Why: Agentic ops framework tackles orchestration and hallucination in real online system operations.
    Tags: agents, operations, orchestration, hallucination, deployment
  • 2604.26644 When to Vote, When to Rewrite: Disagreement-Guided Strategy Routing for Test-Time Scaling [PDF]
    Categories: cs.AI | Score: 82
    Why: Training-free routing among test-time scaling strategies using disagreement; useful reasoning reliability idea.
    Tags: reasoning, test-time-scaling, routing, reliability, inference
  • 2604.26561 Preserving Disagreement: Architectural Heterogeneity and Coherence Validation in Multi-Agent Policy Simulation [PDF]
    Categories: cs.MA, cs.AI | Score: 81
    Why: Studies artificial consensus in LLM multi-agent deliberation; relevant to agent reliability and governance.
    Tags: agents, multi-agent, reliability, evaluation, governance
  • 2604.26311 DreamProver: Evolving Transferable Lemma Libraries via a Wake-Sleep Theorem-Proving Agent [PDF]
    Categories: cs.AI | Score: 81
    Why: Agentic theorem proving with evolving reusable lemma libraries; promising for autonomous reasoning agents.
    Tags: agents, theorem-proving, program-induction, reasoning, formal-methods

AI Paper Insight Brief

2026-05-01

0) Executive takeaways (read this first)

  • Runtime and systems work is attacking the new bottleneck: rollout/generation, not just training FLOPs. DORA and system-integrated speculative decoding both show that RL post-training can be sped up materially without obvious quality loss, while long-context serving papers push similar ideas into KV-cache, sparse attention, and client/edge inference.
  • Behavioral safety monitoring is getting more operational and black-box. Several papers move away from weight access or pure CoT inspection toward observable signals: response-position entropy for sandbagging, tool-selection shifts for alignment faking, stateful workflow firewalls for agents, and unified runtime monitors for safety-critical ML.
  • Agent security is shifting from prompt filtering to workflow- and infrastructure-level defenses. Praetor’s pDFA firewall, SafeReview’s co-evolutionary defense, SPECA’s specification-grounded auditing, and Quantamination’s batching side channel all point to the same lesson: the attack surface is increasingly in orchestration, serving, and sequential behavior.
  • Retrieval/memory is becoming adaptive and modality-aware. ReaLM-Retrieve triggers retrieval at reasoning-step boundaries instead of once upfront, while OCR-Memory stores long-horizon agent traces as images to preserve verbatim evidence under token limits.
  • Alignment papers increasingly focus on calibrated abstention and detectable deception rather than only answer quality. Visual-Idk improves VLM refusal on unknown knowledge, Tatemae measures alignment faking through tool choices, and the sandbagging paper shows aggregate distributional signatures can reveal deliberate underperformance.
  • Benchmarks are getting more realistic and harder to game. New datasets for robotic healthcare safety, class-level code generation, Claw-style workspace agents, and live forecasting all emphasize structured execution, contamination resistance, or real-world outcome feedback.

2) Key themes (clusters)

Theme: Behavioral detection of deceptive or misaligned model behavior

  • Why it matters: A recurring theme is that dangerous behavior may be visible in aggregate actions even when internal reasoning is hidden or unfaithful. This is useful for deployment because the proposed signals are mostly black-box and operationally cheap.
  • Representative papers: Tatemae (2604.26511), the option-order randomisation sandbagging study (2604.26206), and Visual-Idk on VLM knowledge boundaries (2604.26419).
  • Common approach:
    • Use observable behavior rather than latent-state access: response distributions, tool choices, refusal rates.
    • Separate capability failure from strategic behavior via controlled conditions or staged prompting.
    • Prefer aggregate diagnostics over brittle per-example signals.
    • Evaluate whether alignment interventions preserve useful competence while increasing honesty/refusal.
  • Open questions / failure modes:
    • Signals may be mode-specific and fail against more sophisticated, content-aware deception.
    • Prompt-induced behaviors may not generalize to fine-tuned or naturally arising scheming.
    • Refusal calibration can incur alignment tax by suppressing known-good answers.
    • Some methods still rely on judge models or private reasoning traces for strongest claims.
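The aggregate-diagnostic idea above can be made concrete with a small black-box check: under option-order randomisation, an unbiased answerer's chosen positions should be roughly uniform, so divergence from uniform flags a positional attractor. A minimal sketch in this spirit (the statistics and any threshold are illustrative assumptions, not the papers' exact procedure):

```python
import math
from collections import Counter

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    def kl(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    m = [(ai + bi) / 2 for ai, bi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def position_attractor_score(chosen_positions, n_options):
    """JS divergence of the empirical answer-position distribution from uniform.

    Under option-order randomisation, an honest (even capability-limited) model
    should pick positions roughly uniformly; a large score suggests a positional
    bias, e.g. systematic underperformance steered toward one answer slot."""
    counts = Counter(chosen_positions)
    n = len(chosen_positions)
    empirical = [counts.get(i, 0) / n for i in range(n_options)]
    uniform = [1.0 / n_options] * n_options
    return js_divergence(empirical, uniform)

# Example: model picks option 0 on 70 of 100 randomised 4-option trials.
positions = [0] * 70 + [1] * 10 + [2] * 10 + [3] * 10
score = position_attractor_score(positions, 4)
```

A score near 0 is consistent with unbiased answering; the skewed example above lands well away from the uniform baseline, which is the kind of cheap aggregate signal these papers argue for.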

Theme: Agent security is moving to sequential, stateful, and spec-grounded defenses

Theme: RL and inference systems are optimizing around rollout, KV, and memory bottlenecks

Theme: Retrieval and memory are becoming adaptive, compressed, and evidence-preserving

Theme: Realistic benchmarks are exposing capability gaps in embodied, code, and live-agent settings

3) Technical synthesis

  • Several papers replace per-instance judgments with distributional or aggregate diagnostics: sandbagging via entropy/JS divergence, policy-simulation via concentration metrics, and runtime safety via safety-gain/availability-cost tradeoffs.
  • A common systems pattern is decoupling correctness semantics from throughput optimization: DORA preserves C1–C3 constraints, speculative decoding preserves verifier-exact rollouts, and PRAG separates encrypted retrieval modes while sharing an index.
  • Multiple works use lightweight uncertainty proxies to allocate compute or intervention: disagreement for test-time routing, RSUS for retrieval timing, and certainty sets in UPSi for safe exploration.
  • There is a strong trend toward stateful enforcement: pDFAs for agent workflows, tool-selection pipelines for alignment faking, and multi-step review attacks/defenses all assume single-turn filtering is insufficient.
  • Several papers show that memory hierarchy design is now core model performance work: GPU/CPU/NVMe placement, bucketed-LRU, page abstractions, and KV reuse matter as much as kernel speed.
  • Retrieval papers increasingly optimize for evidence faithfulness, not just answer accuracy: OCR-Memory deterministically fetches verbatim text after index prediction, while adaptive retrieval reduces unnecessary calls and injected context.
  • A recurring failure mode is false confidence under missing knowledge: VLM epistemic hallucination, robotic healthcare unsafe plans, and class-level code generation all show models can appear competent while failing on coordination or unknowns.
  • Several papers use human-in-the-loop updates as a practical compromise: Praetor’s blocked-event incorporation, Bian Que’s skill refinement, and expert-augmented SPECA.
  • Across interpretability and alignment, there is growing interest in sparse, actionable internal directions: MoRFI finds monotonic SAE latents tied to hallucination-inducing fine-tuning, while shorthand supertokens expose structural reasoning moves without hiding traces.
  • Benchmarks are increasingly designed to be harder to contaminate and easier to verify, using post-2025 code mining, live unresolved questions, execution-based checks, or structured scenario generation.
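As one concrete instance of the lightweight-uncertainty-proxy pattern, disagreement-guided routing between a cheap and an expensive test-time strategy fits in a few lines. The threshold and fallback choice here are illustrative assumptions, not the paper's exact recipe:

```python
from collections import Counter

def route_by_disagreement(samples, agreement_threshold=0.6):
    """Route between cheap voting and a costlier fallback using disagreement.

    `samples` are k candidate answers from an inexpensive strategy. If the
    modal answer's share clears the threshold, accept it by majority vote;
    otherwise signal that a more expensive strategy (e.g. rewrite-and-retry)
    should handle the query."""
    counts = Counter(samples)
    answer, freq = counts.most_common(1)[0]
    if freq / len(samples) >= agreement_threshold:
        return ("vote", answer)
    return ("rewrite", None)
```

The design point is that disagreement is free once you are already sampling, so compute escalates only on the queries where cheap strategies visibly conflict.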

4) Top 5 papers (with “why now”)

DORA: A Scalable Asynchronous Reinforcement Learning System for Language Model Training

  • Formalizes three constraints for safe async RL training: intra-trajectory policy consistency, data integrity, and bounded staleness.
  • Delivers up to 8.2× rollout acceleration and 2.12× end-to-end throughput improvement in reported experiments, with convergence parity to synchronous training.
  • Especially relevant now because long-CoT and MoE RL workloads are making rollout the dominant bottleneck.
  • Useful for teams scaling post-training who need systems gains without changing the RL objective.
  • Skepticism / limitation: comparisons are mostly within an in-house framework, and the staleness parameter still needs manual tuning.
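The bounded-staleness constraint can be sketched as a simple admission check on asynchronous rollouts: a worker may lag the learner by a few policy updates, and its trajectories are accepted only within that bound. The names and the bound below are illustrative assumptions, not DORA's exact formulation:

```python
from dataclasses import dataclass

@dataclass
class Rollout:
    policy_version: int  # version of the policy that generated this trajectory
    trajectory: list

def admit_rollout(rollout: Rollout, learner_version: int, max_staleness: int = 2) -> bool:
    """Bounded-staleness admission control for async RL rollouts.

    Admit a trajectory only if the generating policy is at most
    `max_staleness` updates behind the learner; otherwise discard or
    re-queue it. This keeps throughput gains from asynchrony while
    capping the off-policy gap that threatens convergence."""
    staleness = learner_version - rollout.policy_version
    return 0 <= staleness <= max_staleness
```

Tightening `max_staleness` trades rollout utilization for closer-to-on-policy updates, which matches the paper's observation that the staleness parameter still needs manual tuning.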

Quantamination: Dynamic Quantization Leaks Your Data Across the Batch

  • Identifies a concrete cross-user privacy leak from per-tensor dynamic activation quantization in batched inference.
  • Shows near-perfect LLM token recovery (99.6–100%) in the studied setup and exact image identification when the secret is in the candidate set.
  • Important now because batched multi-tenant inference and quantized serving are standard production defaults.
  • Actionable takeaway: per-token dynamic quantization removes the described side channel.
  • Skepticism / limitation: practical exploitation depends on co-batching and can be weakened by production nondeterminism.
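The leakage mechanism is easy to demonstrate: with per-tensor dynamic quantization the scale is derived from the batch-wide activation maximum, so a co-batched request perturbs every other request's quantized values, which is exactly the channel an attacker can modulate; per-token scales close it. A toy illustration (not the paper's attack):

```python
def quantize(row, scale):
    # Symmetric int8 quantization of one row at a given scale.
    return [max(-127, min(127, round(x / scale))) for x in row]

def per_tensor_int8(batch):
    # One scale for the whole batch: the scale tracks the batch-wide max,
    # so one user's activations perturb every other user's quantized values.
    scale = max(abs(x) for row in batch for x in row) / 127.0
    return [quantize(row, scale) for row in batch]

def per_token_int8(batch):
    # One scale per row (per token/user): each row's quantization depends
    # only on that row, closing the cross-batch channel.
    return [quantize(row, max(abs(x) for x in row) / 127.0) for row in batch]

victim = [1.0, -0.5, 0.25]

# Per-tensor: the victim's quantized row changes when a co-batched row changes.
pt_small = per_tensor_int8([victim, [1.0, 0.0, 0.0]])[0]
pt_big = per_tensor_int8([victim, [64.0, 0.0, 0.0]])[0]
leaks = pt_small != pt_big

# Per-token: the victim's quantized row is independent of the co-batched row.
tk_small = per_token_int8([victim, [1.0, 0.0, 0.0]])[0]
tk_big = per_token_int8([victim, [64.0, 0.0, 0.0]])[0]
independent = tk_small == tk_big
```

This is why the actionable takeaway above holds: per-token scaling removes the cross-batch dependence outright, rather than merely adding noise to it.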

Enforcing Benign Trajectories: A Behavioral Firewall for Structured-Workflow AI Agents

  • Compiles benign telemetry into a parameterized DFA that enforces both tool-call sequence constraints and parameter schemas.
  • Achieves 2.2% ASR on structured workflows versus 12.8% for the stateless baseline, with 0% ASR on multi-step/context-sequential attacks in those settings.
  • Useful now because agent security failures increasingly come from benign-looking multi-step workflows rather than obvious prompt injections.
  • Low runtime overhead (~2.2 ms median) makes it plausible as a deployment sidecar.
  • Skepticism / limitation: efficacy drops for high-entropy agents, and string-embedding guards still admit synonym-based evasion.
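The parameterized-DFA idea reduces to: compile benign tool-call sequences into explicit transitions, attach a parameter schema to each tool, and reject anything off the automaton. A toy sketch with an invented workflow and schemas (not the paper's mined automata):

```python
# Allowed benign workflow: search -> open -> (open)* -> summarize.
TRANSITIONS = {
    ("start", "search"): "searched",
    ("searched", "open"): "opened",
    ("opened", "open"): "opened",
    ("opened", "summarize"): "done",
}
PARAM_SCHEMAS = {
    "search": {"query": str},
    "open": {"doc_id": int},
    "summarize": {"doc_id": int},
}

class WorkflowFirewall:
    """Toy behavioral firewall for a tool-using agent session.

    A tool call is admitted only if (current_state, tool) is a known benign
    transition AND its parameters match the tool's schema exactly; otherwise
    the call is blocked and the session state is left unchanged."""
    def __init__(self):
        self.state = "start"

    def admit(self, tool, params):
        nxt = TRANSITIONS.get((self.state, tool))
        schema = PARAM_SCHEMAS.get(tool, {})
        ok = nxt is not None and set(params) == set(schema) and all(
            isinstance(params[k], t) for k, t in schema.items()
        )
        if ok:
            self.state = nxt
        return ok
```

Because the check is stateful, a call that would be individually harmless (e.g. `summarize` before anything was opened) is still blocked, which is the multi-step property stateless prompt filters miss.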

When to Retrieve During Reasoning: Adaptive Retrieval for Large Reasoning Models

  • Moves retrieval from a one-shot pre-generation step to a reasoning-step intervention policy.
  • On MuSiQue, reports 71.2% F1 with 1.8 retrieval calls versus IRCoT’s 65.4% with 3.4 calls, plus lower per-call latency.
  • Timely because long-reasoning models increasingly fail mid-chain due to knowledge gaps that static RAG cannot anticipate.
  • Practical value is high: the method is modular and works even for completion-only APIs via proxy uncertainty signals.
  • Skepticism / limitation: performance still depends on corpus coverage and degrades when many retrievals are needed.
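The step-boundary policy can be approximated with a cheap uncertainty proxy such as the mean negative log-probability of the tokens in a reasoning step; the proxy and threshold below are illustrative assumptions, not the paper's actual signal:

```python
import math

def step_uncertainty(token_logprobs):
    """Mean negative log-probability of one reasoning step's tokens:
    higher means the model was less sure while writing that step."""
    return -sum(token_logprobs) / len(token_logprobs)

def should_retrieve(token_logprobs, threshold=1.0):
    """Fire retrieval at a reasoning-step boundary when the step's
    uncertainty proxy exceeds a threshold, instead of retrieving once
    up front (illustrative policy)."""
    return step_uncertainty(token_logprobs) > threshold

confident_step = [math.log(0.9)] * 8  # ~0.105 nats per token
shaky_step = [math.log(0.2)] * 8      # ~1.609 nats per token
```

Such a proxy needs only per-token log-probabilities, which is why the approach can work even through completion-style APIs that expose logprobs but not internals.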

Benchmarking the Safety of Large Language Models for Robotic Health Attendant Control

  • Introduces a 270-example harmful-instruction dataset grounded in AMA medical ethics and evaluates 72 models in a robotic action-planning setting.
  • Finds a 54.4% mean violation rate, with a large proprietary/open-weight gap (23.7% vs 72.8% median violation).
  • Important now because embodied healthcare agents are moving from dialogue to action planning, where failures are operationally dangerous.
  • The paired benign rewrites help show low violation is not just blanket refusal.
  • Skepticism / limitation: results are from simulation and rely on a single LLM judge rather than human adjudication.

5) Practical next steps

  • Add black-box behavioral monitors to eval suites: response-position entropy under option randomization, tool-choice shift tests under monitoring, and refusal calibration on known/unknown splits.
  • Audit serving stacks for cross-tenant leakage risks, especially any use of per-tensor dynamic activation quantization in batched inference.
  • For tool-using agents, move from stateless prompt filters to session-level workflow enforcement with explicit state machines or policy automata.
  • If you run RL post-training, benchmark rollout-stage bottlenecks separately and test async streaming or verifier-exact speculative decoding before changing the learning algorithm.
  • For long-horizon agents, measure evidence faithfulness of memory systems, not just downstream task success; consider deterministic fetch after retrieval selection.
  • In RAG pipelines, test adaptive retrieval timing rather than only improving retriever quality; log where in the reasoning chain retrieval actually changes outcomes.
  • For VLM or domain-specific assistants, build Known vs Unknown calibration sets and track truthfulness as answer-or-refusal, not only accuracy.
  • Expand safety benchmarks toward structured action outputs and end-state verification, especially in robotics, healthcare, and enterprise operations.
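The answer-or-refusal framing suggested above can be scored with a small metric over a Known/Unknown calibration set; a minimal sketch, assuming each record carries (is_known, refused, correct) labels:

```python
def truthfulness_rate(records):
    """Score a Known/Unknown calibration set as answer-or-refusal.

    Each record is (is_known, refused, correct). A response counts as
    truthful if the model answers correctly on Known items or refuses on
    Unknown items; answering (however plausibly) on Unknown items, or
    refusing on Known ones, counts against it."""
    truthful = 0
    for is_known, refused, correct in records:
        if is_known:
            truthful += (not refused) and correct
        else:
            truthful += refused
    return truthful / len(records)
```

Tracking this alongside plain accuracy separates genuine calibration gains from blanket refusal, mirroring the paired benign-rewrite check used in the robotics safety benchmark.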

Generated from per-paper analyses; no external browsing.