Daily AI Paper Report (2026-05-01)
Published:
Chinese version: [中文]
Run stats
- Candidates: 179
- Selected: 30
- Deepread completed: 30
- Window (UTC): 2026-04-29T00:00:00Z → 2026-04-30T00:00:00Z (arxiv_announce, expanded=0)
Selected papers
| arXiv ID | Title / Links | Categories | Score | Why | Tags |
|---|---|---|---|---|---|
| 2604.26511 | Tatemae: Detecting Alignment Faking via Tool Selection in LLMs | cs.CR, cs.AI | 95 | Directly targets alignment faking via observable tool-use behavior; strong safety relevance. | alignment, agents, tool-use, deception-detection, evaluation |
| 2604.26505 | Quantamination: Dynamic Quantization Leaks Your Data Across the Batch | cs.CR, cs.LG | 95 | Reveals serving-time data leakage from dynamic quantization; strong ML security relevance. | security, privacy, inference, quantization, data-leakage |
| 2604.26274 | Enforcing Benign Trajectories: A Behavioral Firewall for Structured-Workflow AI Agents | cs.CR, cs.AI | 93 | Behavioral firewall for workflow agents with low ASR on agent attacks; practical agent security. | agent-safety, security, tool-use, guardrails, runtime-monitoring |
| 2604.26506 | SafeReview: Defending LLM-based Review Systems Against Adversarial Hidden Prompts | cs.CL, cs.CR | 92 | Targets hidden-prompt attacks on LLM review systems with adversarially trained defense. | llm-security, prompt-injection, adversarial-defense, evaluation, peer-review |
| 2604.26256 | DORA: A Scalable Asynchronous Reinforcement Learning System for Language Model Training | cs.LG, cs.DC | 92 | Scalable async RL for LLM post-training; tackles rollout bottlenecks with convergence-aware design. | llm-training, reinforcement-learning, post-training, systems, scaling |
| 2604.26577 | Benchmarking the Safety of Large Language Models for Robotic Health Attendant Control | cs.AI, cs.CY, cs.RO | 91 | Benchmarks LLM safety for robotic health attendants with concrete violation rates across 72 models. | safety-benchmark, robotics, llm-evaluation, harmful-instructions, deployment |
| 2604.26866 | MoRFI: Monotonic Sparse Autoencoder Feature Identification | cs.CL, cs.LG | 90 | Finds causal latent directions for hallucinations after fine-tuning; useful for reliability. | hallucination, interpretability, fine-tuning, llm-reliability, sae |
| 2604.26837 | Unifying Sparse Attention with Hierarchical Memory for Scalable Long-Context LLM Serving | cs.LG | 90 | Long-context LLM serving framework with sparse attention + hierarchical memory; strong systems impact. | llm-systems, long-context, sparse-attention, kv-cache, inference |
| 2604.26206 | Option-Order Randomisation Reveals a Distributional Position Attractor in Prompted Sandbagging | cs.CL, cs.AI | 89 | Directly probes sandbagging behavior in LLMs with preregistered randomized evaluation. | llm-safety, evaluation, sandbagging, robustness, behavior |
| 2604.26904 | ClawGym: A Scalable Framework for Building Effective Claw Agents | cs.CL, cs.AI, cs.LG | 88 | Scalable framework, dataset, training, and diagnostics for Claw-style agents; high reuse potential. | agents, benchmark, dataset, training-framework, tool-use |
| 2604.26622 | OCR-Memory: Optical Context Retrieval for Long-Horizon Agent Memory | cs.CL | 88 | Novel long-horizon agent memory via visual compression of trajectories; high agent relevance. | agents, memory, long-context, multimodal, retrieval |
| 2604.26836 | Uncertainty-Aware Predictive Safety Filters for Probabilistic Neural Network Dynamics | cs.LG, eess.SY | 88 | Rigorous uncertainty-aware safety filter for neural dynamics in RL; strong safety relevance. | safety, reinforcement-learning, control, uncertainty, safe-exploration |
| 2604.26779 | Accelerating RL Post-Training Rollouts via System-Integrated Speculative Decoding | cs.LG, cs.CL | 88 | Speeds RL post-training rollouts losslessly via speculative decoding; highly relevant to frontier training. | rl-post-training, speculative-decoding, llm-training, systems, efficiency |
| 2604.26649 | When to Retrieve During Reasoning: Adaptive Retrieval for Large Reasoning Models | cs.IR, cs.AI, cs.CL | 87 | Adaptive retrieval during reasoning addresses a key gap for long-CoT reasoning models and RAG. | RAG, reasoning, retrieval, long-context, efficiency |
| 2604.26733 | FutureWorld: A Live Environment for Training Predictive Agents with Real-World Outcome Rewards | cs.AI, cs.LG | 87 | Creates a live environment for predictive agents with real-world outcome rewards. | agents, evaluation, environment, online-learning, forecasting |
| 2604.26355 | Shorthand for Thought: Compressing LLM Reasoning via Entropy-Guided Supertokens | cs.CL | 87 | Compresses reasoning traces via supertokens, promising lower inference cost with retained accuracy. | llm, reasoning, efficiency, tokenization, inference |
| 2604.26411 | Unifying Runtime Monitoring Approaches for Safety-Critical Machine Learning: Application to Vision-Based Landing | cs.LG | 87 | Unified runtime monitoring taxonomy for safety-critical ML with concrete landing-case evaluation. | safety, runtime-monitoring, ood, safety-critical-ml, evaluation |
| 2604.26419 | Delineating Knowledge Boundaries for Honest Large Vision-Language Models | cs.CV, cs.AI | 85 | Improves VLM honesty by teaching refusal on unknowns; relevant to hallucination and reliability. | vlm, hallucination, uncertainty, refusal, alignment |
| 2604.26835 | HalluCiteChecker: A Lightweight Toolkit for Hallucinated Citation Detection and Verification in the Era of AI Scientists | cs.CL, cs.AI, cs.DL | 85 | Practical toolkit for hallucinated citation detection; directly useful for AI scientist safety. | hallucination, citations, tooling, evaluation, ai-scientists |
| 2604.26951 | Turning the TIDE: Cross-Architecture Distillation for Diffusion Large Language Models | cs.CL, cs.AI, cs.LG | 85 | Cross-architecture distillation for diffusion LLMs; notable frontier-model efficiency contribution. | diffusion-llm, distillation, model-compression, architecture, efficiency |
| 2604.26525 | PRAG: End-to-End Privacy-Preserving Retrieval-Augmented Generation | cs.CR | 84 | End-to-end privacy-preserving RAG with encrypted retrieval is important for secure LLM deployment. | RAG, privacy, security, retrieval, confidentiality |
| 2604.26768 | Decoupling Knowledge and Task Subspaces for Composable Parametric Retrieval Augmented Generation | cs.CL | 84 | Addresses composability and stability in parametric RAG by separating knowledge from task behavior. | rag, retrieval, knowledge, modularity, llm |
| 2604.26557 | DUAL-BLADE: Dual-Path NVMe-Direct KV-Cache Offloading for Edge LLM Inference | cs.DC, cs.AI, cs.PF | 84 | Edge LLM inference via NVMe-direct KV offloading; practical memory-system advance for deployment. | edge-llm, kv-cache, offloading, inference, systems |
| 2604.26923 | ClassEval-Pro: A Cross-Domain Benchmark for Class-Level Code Generation | cs.SE, cs.CL | 83 | Fresh benchmark for class-level code generation with contamination-aware construction. | code-llms, benchmark, evaluation, contamination, software-engineering |
| 2604.26334 | Efficient, VRAM-Constrained xLM Inference on Clients | cs.DC, cs.AR, cs.LG | 83 | Practical VRAM-constrained xLM inference for clients; useful systems advance for dense and MoE models. | inference, efficiency, edge, llm, vlm, moe |
| 2604.26495 | Beyond Code Reasoning: A Specification-Anchored Audit Framework for Expert-Augmented Security Verification | cs.CR | 82 | Specification-anchored security auditing is novel and could strengthen AI-assisted code verification. | security, auditing, verification, specifications, code |
| 2604.26805 | Bian Que: An Agentic Framework with Flexible Skill Arrangement for Online System Operations | cs.AI, cs.MA | 82 | Agentic ops framework tackles orchestration and hallucination in real online system operations. | agents, operations, orchestration, hallucination, deployment |
| 2604.26644 | When to Vote, When to Rewrite: Disagreement-Guided Strategy Routing for Test-Time Scaling | cs.AI | 82 | Training-free routing among test-time scaling strategies using disagreement; useful reasoning reliability idea. | reasoning, test-time-scaling, routing, reliability, inference |
| 2604.26561 | Preserving Disagreement: Architectural Heterogeneity and Coherence Validation in Multi-Agent Policy Simulation | cs.MA, cs.AI | 81 | Studies artificial consensus in LLM multi-agent deliberation; relevant to agent reliability and governance. | agents, multi-agent, reliability, evaluation, governance |
| 2604.26311 | DreamProver: Evolving Transferable Lemma Libraries via a Wake-Sleep Theorem-Proving Agent | cs.AI | 81 | Agentic theorem proving with evolving reusable lemma libraries; promising for autonomous reasoning agents. | agents, theorem-proving, program-induction, reasoning, formal-methods |
AI Paper Insight Brief
2026-05-01
0) Executive takeaways (read this first)
- Runtime and systems work is attacking the new bottleneck: rollout/generation, not just training FLOPs. DORA and system-integrated speculative decoding both show that RL post-training can be sped up materially without obvious quality loss, while long-context serving papers push similar ideas into KV-cache, sparse attention, and client/edge inference.
- Behavioral safety monitoring is getting more operational and black-box. Several papers move away from weight access or pure CoT inspection toward observable signals: response-position entropy for sandbagging, tool-selection shifts for alignment faking, stateful workflow firewalls for agents, and unified runtime monitors for safety-critical ML.
- Agent security is shifting from prompt filtering to workflow- and infrastructure-level defenses. Praetor’s pDFA firewall, SafeReview’s co-evolutionary defense, SPECA’s specification-grounded auditing, and Quantamination’s batching side channel all point to the same lesson: the attack surface is increasingly in orchestration, serving, and sequential behavior.
- Retrieval/memory is becoming adaptive and modality-aware. ReaLM-Retrieve triggers retrieval at reasoning-step boundaries instead of once upfront, while OCR-Memory stores long-horizon agent traces as images to preserve verbatim evidence under token limits.
- Alignment papers increasingly focus on calibrated abstention and detectable deception rather than only answer quality. Visual-Idk improves VLM refusal on unknown knowledge, Tatemae measures alignment faking through tool choices, and the sandbagging paper shows aggregate distributional signatures can reveal deliberate underperformance.
- Benchmarks are getting more realistic and harder to game. New datasets for robotic healthcare safety, class-level code generation, Claw-style workspace agents, and live forecasting all emphasize structured execution, contamination resistance, or real-world outcome feedback.
2) Key themes (clusters)
Theme: Behavioral detection of deceptive or misaligned model behavior
- Why it matters: A recurring theme is that dangerous behavior may be visible in aggregate actions even when internal reasoning is hidden or unfaithful. This is useful for deployment because the proposed signals are mostly black-box and operationally cheap.
- Representative papers:
- Tatemae: Detecting Alignment Faking via Tool Selection in LLMs
- Option-Order Randomisation Reveals a Distributional Position Attractor in Prompted Sandbagging
- Delineating Knowledge Boundaries for Honest Large Vision-Language Models
- Unifying Runtime Monitoring Approaches for Safety-Critical Machine Learning: Application to Vision-Based Landing
- Common approach:
- Use observable behavior rather than latent-state access: response distributions, tool choices, refusal rates.
- Separate capability failure from strategic behavior via controlled conditions or staged prompting.
- Prefer aggregate diagnostics over brittle per-example signals.
- Evaluate whether alignment interventions preserve useful competence while increasing honesty/refusal.
- Open questions / failure modes:
- Signals may be mode-specific and fail against more sophisticated, content-aware deception.
- Prompt-induced behaviors may not generalize to fine-tuned or naturally arising scheming.
- Refusal calibration can incur alignment tax by suppressing known-good answers.
- Some methods still rely on judge models or private reasoning traces for strongest claims.
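The aggregate-diagnostic idea above can be made concrete with a small sketch: rerun multiple-choice items with randomized option order, record which position the model picks, and measure how far the empirical position distribution drifts from uniform. This is an illustrative reconstruction of a position-attractor test, not the sandbagging paper's exact statistic:

```python
import math
from collections import Counter

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions (base 2)."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def position_attractor_score(chosen_positions, n_options=4):
    """JS divergence between the empirical answer-position distribution
    (under option-order randomization) and uniform. Near 0 means chosen
    positions look uniform, as expected from a model answering on content;
    larger values suggest a distributional position attractor."""
    counts = Counter(chosen_positions)
    total = len(chosen_positions)
    empirical = [counts.get(i, 0) / total for i in range(n_options)]
    uniform = [1.0 / n_options] * n_options
    return js_divergence(empirical, uniform)
```

The signal is black-box and cheap: it needs only final answers under randomized presentations, not weights or reasoning traces.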
Theme: Agent security is moving to sequential, stateful, and spec-grounded defenses
- Why it matters: Single-step prompt filters are increasingly inadequate for tool-using agents. The stronger defenses model workflows, specifications, and serving-time side channels rather than just text content.
- Representative papers:
- Enforcing Benign Trajectories: A Behavioral Firewall for Structured-Workflow AI Agents
- Beyond Code Reasoning: A Specification-Anchored Audit Framework for Expert-Augmented Security Verification
- Quantamination: Dynamic Quantization Leaks Your Data Across the Batch
- SafeReview: Defending LLM-based Review Systems Against Adversarial Hidden Prompts
- Common approach:
- Model attacks as multi-step processes rather than isolated malicious strings.
- Encode allowed behavior as state machines, typed properties, or adversarially trained invariances.
- Evaluate on realistic deployment surfaces: batched inference, long documents, protocol specs, tool APIs.
- Emphasize low-latency enforcement and practical integration.
- Open questions / failure modes:
- Stateful defenses depend on clean profiling data and can be weakened by telemetry poisoning or concept drift.
- Embedding-based semantic guards remain vulnerable to paraphrase/synonym evasion.
- Some strong results require expert augmentation or fine-tuning access.
- Infrastructure mitigations may trade off throughput, latency, or compatibility.
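A minimal sketch of the stateful-firewall idea: compile allowed behavior into a parameterized DFA whose transitions are (state, tool) pairs, with per-tool parameter validators. The workflow, tool names, and schemas below are hypothetical examples for illustration, not taken from the paper:

```python
# (state, tool) -> next state: the compiled benign workflow profile.
ALLOWED = {
    ("start", "search_tickets"): "searched",
    ("searched", "read_ticket"): "read",
    ("read", "post_reply"): "done",
}

# Per-tool parameter validators (the "parameterized" part of the pDFA).
SCHEMAS = {
    "search_tickets": lambda p: isinstance(p.get("query"), str),
    "read_ticket": lambda p: isinstance(p.get("ticket_id"), int),
    "post_reply": lambda p: isinstance(p.get("body"), str) and len(p["body"]) < 4096,
}

def check_trajectory(calls):
    """Return (ok, reason). Rejects any tool call whose sequence position
    or parameters fall outside the compiled benign profile."""
    state = "start"
    for tool, params in calls:
        nxt = ALLOWED.get((state, tool))
        if nxt is None:
            return False, f"illegal transition: {tool} from state {state}"
        if not SCHEMAS[tool](params):
            return False, f"parameter schema violation: {tool}"
        state = nxt
    return True, "ok"
```

The key property is that a malicious call is rejected even when its text looks benign, because it arrives at the wrong point in the workflow.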
Theme: RL and inference systems are optimizing around rollout, KV, and memory bottlenecks
- Why it matters: As models get longer-context and more agentic, the dominant cost is often generation, cache movement, or memory hierarchy inefficiency. Multiple papers show large gains from system co-design rather than algorithm changes alone.
- Representative papers:
- DORA: A Scalable Asynchronous Reinforcement Learning System for Language Model Training
- Accelerating RL Post-Training Rollouts via System-Integrated Speculative Decoding
- DUAL-BLADE: Dual-Path NVMe-Direct KV-Cache Offloading for Edge LLM Inference
- Unifying Sparse Attention with Hierarchical Memory for Scalable Long-Context LLM Serving
- Common approach:
- Overlap or decouple slow stages via asynchrony, speculation, or streaming.
- Treat KV-cache movement as a first-class systems problem.
- Use hierarchical memory and locality-aware scheduling to reduce GPU pressure.
- Preserve training semantics where possible, e.g. verifier-exact rollouts or bounded staleness.
- Open questions / failure modes:
- Gains depend heavily on draft quality, staleness settings, workload skew, and hardware topology.
- Some methods reduce marginal benefit when combined with other overlap mechanisms.
- Large-scale claims are sometimes based on simulation or in-house frameworks.
- Metadata, orchestration, and migration complexity can become new bottlenecks.
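The "verifier-exact" property of speculative decoding can be illustrated with a toy greedy-decoding loop: the draft proposes tokens, the target verifies them, and the output is guaranteed to match what the target alone would have produced. The models here are deterministic stand-ins, not real LLM calls:

```python
def speculative_greedy_step(target_next, draft_next, prefix, k=4):
    """One speculative step under greedy decoding.

    target_next / draft_next: callables mapping a token sequence to that
    model's argmax next token (toy stand-ins for real model forward passes).
    Accepts the longest draft prefix the target agrees with, then appends
    the target's own token at the first disagreement -- so the result is
    identical to plain greedy decoding with the target model.
    """
    # 1) Draft proposes k tokens autoregressively.
    draft = []
    ctx = list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        draft.append(t)
        ctx.append(t)
    # 2) Target verifies the proposals (parallel in practice, sequential here).
    out = list(prefix)
    for t in draft:
        expected = target_next(out)
        if t == expected:
            out.append(t)            # accepted: a "free" token
        else:
            out.append(expected)     # rejected: fall back to the target's token
            break
    else:
        out.append(target_next(out))  # bonus token when all k were accepted
    return out
```

Because the accept rule defers to the target at every position, speedups come only from parallel verification, not from changing what gets generated, which is why such rollouts can feed RL training unchanged.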
Theme: Retrieval and memory are becoming adaptive, compressed, and evidence-preserving
- Why it matters: One-shot retrieval and text-only memory are poor fits for long reasoning chains and long-horizon agents. New work focuses on retrieving at the right time and preserving exact evidence under tight context budgets.
- Representative papers:
- When to Retrieve During Reasoning: Adaptive Retrieval for Large Reasoning Models
- OCR-Memory: Optical Context Retrieval for Long-Horizon Agent Memory
- PRAG: End-to-End Privacy-Preserving Retrieval-Augmented Generation
- Decoupling Knowledge and Task Subspaces for Composable Parametric Retrieval Augmented Generation
- Common approach:
- Trigger retrieval during reasoning, not only before it.
- Separate evidence storage from reasoning context, often with compression or alternate modalities.
- Optimize for faithfulness and retrieval efficiency, not just recall.
- Make retrieval modules more composable by disentangling task behavior from knowledge storage.
- Open questions / failure modes:
- Retrieval quality still depends on corpus coverage and uncertainty estimation quality.
- Some methods trade lower token use for higher latency, storage, or setup cost.
- Parametric retrieval composability results remain preliminary in some settings.
- Privacy-preserving retrieval remains expensive, especially in interactive modes.
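The retrieve-during-reasoning pattern can be sketched as a gate at each reasoning-step boundary, driven by a cheap uncertainty proxy. Mean token entropy and the threshold below are illustrative stand-ins; the papers' actual signals (e.g. RSUS) may differ:

```python
import math

def step_entropy(token_probs):
    """Mean per-token entropy (bits) over one reasoning step -- a cheap
    uncertainty proxy computable from logprobs alone."""
    h = 0.0
    for dist in token_probs:
        h += -sum(p * math.log2(p) for p in dist if p > 0)
    return h / max(len(token_probs), 1)

def maybe_retrieve(step_text, token_probs, retrieve_fn, threshold=2.0):
    """Trigger retrieval only when the just-finished reasoning step looks
    uncertain, instead of retrieving once up front. Returns the retrieved
    evidence to inject before the next step, or None to skip the call."""
    if step_entropy(token_probs) > threshold:
        return retrieve_fn(step_text)
    return None
```

This is also why the approach ports to completion-only APIs: the gate needs only per-token probabilities (or a proxy for them), not model internals.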
Theme: Realistic benchmarks are exposing capability gaps in embodied, code, and live-agent settings
- Why it matters: Several papers argue current benchmarks understate failure modes because they ignore structured execution, long-horizon state, or real-world feedback. The new benchmarks are more deployment-shaped and often more pessimistic.
- Representative papers:
- Benchmarking the Safety of Large Language Models for Robotic Health Attendant Control
- ClawGym: A Scalable Framework for Building Effective Claw Agents
- ClassEval-Pro: A Cross-Domain Benchmark for Class-Level Code Generation
- FutureWorld: A Live Environment for Training Predictive Agents with Real-World Outcome Rewards
- Common approach:
- Build verifiable environments with structured outputs, execution harnesses, or realized outcomes.
- Measure end-state correctness, safety violations, or calibration, not just text similarity.
- Use paired benign/harmful or difficulty-calibrated task construction.
- Stress long-horizon coordination, cross-method dependencies, and real-world uncertainty.
- Open questions / failure modes:
- Many evaluations still rely on simulation, LLM judges, or short training horizons.
- Domain coverage is often narrow despite strong realism within-domain.
- Benchmark release can create teaching-to-the-test or adversarial adaptation risks.
- Some settings remain expensive to evaluate repeatedly.
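The paired benign/harmful construction used by several of these benchmarks can be scored with two numbers rather than one: violation rate on harmful items and over-refusal rate on their benign rewrites. A minimal sketch, with a hypothetical result schema:

```python
def safety_report(results):
    """results: list of dicts with keys
       'harmful' (True for a harmful item, False for its benign rewrite) and
       'action'  ('comply' | 'refuse').
    Reports violation rate on harmful items and over-refusal on benign
    ones, so blanket refusal cannot masquerade as safety."""
    harmful = [r for r in results if r["harmful"]]
    benign = [r for r in results if not r["harmful"]]
    violation = sum(r["action"] == "comply" for r in harmful) / len(harmful)
    over_refusal = sum(r["action"] == "refuse" for r in benign) / len(benign)
    return {"violation_rate": violation, "over_refusal_rate": over_refusal}
```

A model that refuses everything scores 0% violation but 100% over-refusal, which is exactly the failure mode the paired construction is designed to expose.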
3) Technical synthesis
- Several papers replace per-instance judgments with distributional or aggregate diagnostics: sandbagging via entropy/JS divergence, policy-simulation via concentration metrics, and runtime safety via safety-gain/availability-cost tradeoffs.
- A common systems pattern is decoupling correctness semantics from throughput optimization: DORA preserves C1–C3 constraints, speculative decoding preserves verifier-exact rollouts, and PRAG separates encrypted retrieval modes while sharing an index.
- Multiple works use lightweight uncertainty proxies to allocate compute or intervention: disagreement for test-time routing, RSUS for retrieval timing, and certainty sets in UPSi for safe exploration.
- There is a strong trend toward stateful enforcement: pDFAs for agent workflows, tool-selection pipelines for alignment faking, and multi-step review attacks/defenses all assume single-turn filtering is insufficient.
- Several papers show that memory hierarchy design is now core model performance work: GPU/CPU/NVMe placement, bucketed-LRU, page abstractions, and KV reuse matter as much as kernel speed.
- Retrieval papers increasingly optimize for evidence faithfulness, not just answer accuracy: OCR-Memory deterministically fetches verbatim text after index prediction, while adaptive retrieval reduces unnecessary calls and injected context.
- A recurring failure mode is false confidence under missing knowledge: VLM epistemic hallucination, robotic healthcare unsafe plans, and class-level code generation all show models can appear competent while failing on coordination or unknowns.
- Several papers use human-in-the-loop updates as a practical compromise: Praetor’s blocked-event incorporation, Bian Que’s skill refinement, and expert-augmented SPECA.
- Across interpretability and alignment, there is growing interest in sparse, actionable internal directions: MoRFI finds monotonic SAE latents tied to hallucination-inducing fine-tuning, while shorthand supertokens expose structural reasoning moves without hiding traces.
- Benchmarks are increasingly designed to be harder to contaminate and easier to verify, using post-2025 code mining, live unresolved questions, execution-based checks, or structured scenario generation.
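The disagreement-as-uncertainty-proxy pattern noted above can be sketched as a router: sample several answers, majority-vote when they mostly agree, and escalate to an expensive rewrite step when they do not. The threshold and rewrite_fn are illustrative stand-ins, not the test-time-scaling paper's exact policy:

```python
from collections import Counter

def route_by_disagreement(samples, rewrite_fn, agree_threshold=0.6):
    """Training-free strategy routing: samples is a list of sampled final
    answers. Returns (answer, strategy_used)."""
    counts = Counter(samples)
    answer, top = counts.most_common(1)[0]
    if top / len(samples) >= agree_threshold:
        return answer, "vote"              # cheap path: consensus is strong
    return rewrite_fn(samples), "rewrite"  # disagreement: spend more compute
```

The appeal is that the routing signal is free: the samples needed to measure disagreement are the same ones the cheap voting strategy would consume anyway.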
4) Top 5 papers (with “why now”)
DORA: A Scalable Asynchronous Reinforcement Learning System for Language Model Training
- Formalizes three constraints for safe async RL training: intra-trajectory policy consistency, data integrity, and bounded staleness.
- Delivers up to 8.2× rollout acceleration and 2.12× end-to-end throughput improvement in reported experiments, with convergence parity to synchronous training.
- Especially relevant now because long-CoT and MoE RL workloads are making rollout the dominant bottleneck.
- Useful for teams scaling post-training who need systems gains without changing the RL objective.
- Skepticism / limitation: comparisons are mostly within an in-house framework, and the staleness parameter still needs manual tuning.
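Bounded staleness can be illustrated with a simplified filter that drops rollouts whose generating-policy version lags the learner too far; DORA's actual C1–C3 mechanisms are richer than this sketch:

```python
def filter_bounded_staleness(rollouts, learner_version, max_staleness=2):
    """Keep only rollouts whose generating-policy version is within
    max_staleness updates of the current learner. Each rollout is a
    (policy_version, trajectory) pair; too-stale trajectories are set
    aside because they would bias the policy update."""
    kept, dropped = [], []
    for version, traj in rollouts:
        if learner_version - version <= max_staleness:
            kept.append(traj)
        else:
            dropped.append(traj)
    return kept, dropped
```

The manual-tuning limitation noted above lives in `max_staleness`: too small and the async pipeline stalls waiting for fresh rollouts, too large and convergence parity with synchronous training is at risk.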
Quantamination: Dynamic Quantization Leaks Your Data Across the Batch
- Identifies a concrete cross-user privacy leak from per-tensor dynamic activation quantization in batched inference.
- Shows near-perfect LLM token recovery (99.6–100%) in the studied setup and exact image identification when the secret is in the candidate set.
- Important now because batched multi-tenant inference and quantized serving are standard production defaults.
- Actionable takeaway: per-token dynamic quantization removes the described side channel.
- Skepticism / limitation: practical exploitation depends on co-batching and can be weakened by production nondeterminism.
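The side channel is easy to see in a toy model: with a per-tensor dynamic scale, one user's extreme activation changes how every co-batched row is quantized, while per-token scales keep rows independent. A self-contained illustration in plain Python (not a real serving stack):

```python
def quantize_per_tensor(batch, levels=127):
    """One shared scale for the whole batch: any row's extreme value
    changes how every other row is quantized -- the cross-user channel."""
    scale = max(abs(x) for row in batch for x in row) / levels
    return [[round(x / scale) for x in row] for row in batch], scale

def quantize_per_token(batch, levels=127):
    """One scale per row: each user's quantization is independent of
    whatever else happens to be co-batched."""
    out = []
    for row in batch:
        scale = max(abs(x) for x in row) / levels
        out.append([round(x / scale) for x in row])
    return out
```

By probing how their own quantized outputs shift, a co-batched attacker can learn about other rows' magnitudes under the per-tensor scheme; the per-token variant removes that dependence, matching the paper's mitigation.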
Enforcing Benign Trajectories: A Behavioral Firewall for Structured-Workflow AI Agents
- Compiles benign telemetry into a parameterized DFA that enforces both tool-call sequence constraints and parameter schemas.
- Achieves 2.2% ASR on structured workflows versus 12.8% for the stateless baseline, with 0% ASR on multi-step/context-sequential attacks in those settings.
- Useful now because agent security failures increasingly come from benign-looking multi-step workflows rather than obvious prompt injections.
- Low runtime overhead (~2.2 ms median) makes it plausible as a deployment sidecar.
- Skepticism / limitation: efficacy drops for high-entropy agents, and string-embedding guards still admit synonym-based evasion.
When to Retrieve During Reasoning: Adaptive Retrieval for Large Reasoning Models
- Moves retrieval from a one-shot pre-generation step to a reasoning-step intervention policy.
- On MuSiQue, reports 71.2% F1 with 1.8 retrieval calls versus IRCoT’s 65.4% with 3.4 calls, plus lower per-call latency.
- Timely because long-reasoning models increasingly fail mid-chain due to knowledge gaps that static RAG cannot anticipate.
- Practical value is high: the method is modular and works even for completion-only APIs via proxy uncertainty signals.
- Skepticism / limitation: performance still depends on corpus coverage and degrades when many retrievals are needed.
Benchmarking the Safety of Large Language Models for Robotic Health Attendant Control
- Introduces a 270-example harmful-instruction dataset grounded in AMA medical ethics and evaluates 72 models in a robotic action-planning setting.
- Finds a 54.4% mean violation rate, with a large proprietary/open-weight gap (23.7% vs 72.8% median violation).
- Important now because embodied healthcare agents are moving from dialogue to action planning, where failures are operationally dangerous.
- The paired benign rewrites help show low violation is not just blanket refusal.
- Skepticism / limitation: results are from simulation and rely on a single LLM judge rather than human adjudication.
5) Practical next steps
- Add black-box behavioral monitors to eval suites: response-position entropy under option randomization, tool-choice shift tests under monitoring, and refusal calibration on known/unknown splits.
- Audit serving stacks for cross-tenant leakage risks, especially any use of per-tensor dynamic activation quantization in batched inference.
- For tool-using agents, move from stateless prompt filters to session-level workflow enforcement with explicit state machines or policy automata.
- If you run RL post-training, benchmark rollout-stage bottlenecks separately and test async streaming or verifier-exact speculative decoding before changing the learning algorithm.
- For long-horizon agents, measure evidence faithfulness of memory systems, not just downstream task success; consider deterministic fetch after retrieval selection.
- In RAG pipelines, test adaptive retrieval timing rather than only improving retriever quality; log where in the reasoning chain retrieval actually changes outcomes.
- For VLM or domain-specific assistants, build Known vs Unknown calibration sets and track truthfulness as answer-or-refusal, not only accuracy.
- Expand safety benchmarks toward structured action outputs and end-state verification, especially in robotics, healthcare, and enterprise operations.
Generated from per-paper analyses; no external browsing.
