Daily AI Paper Report (2026-05-01)


Run stats

  • Candidates: 179
  • Selected: 30
  • Deepread completed: 30
  • Window (UTC): 2026-04-29T00:00:00Z → 2026-04-30T00:00:00Z (arxiv_announce, expanded=0)
Selected papers

  • 2604.26511 Tatemae: Detecting Alignment Faking via Tool Selection in LLMs [PDF]
    Categories: cs.CR, cs.AI | Score: 95
    Why: Directly targets alignment faking via observable tool-use behavior; strong safety relevance.
    Tags: alignment, agents, tool-use, deception-detection, evaluation
  • 2604.26505 Quantamination: Dynamic Quantization Leaks Your Data Across the Batch [PDF]
    Categories: cs.CR, cs.LG | Score: 95
    Why: Reveals serving-time data leakage from dynamic quantization; strong ML security relevance.
    Tags: security, privacy, inference, quantization, data-leakage
  • 2604.26274 Enforcing Benign Trajectories: A Behavioral Firewall for Structured-Workflow AI Agents [PDF]
    Categories: cs.CR, cs.AI | Score: 93
    Why: Behavioral firewall for workflow agents with low ASR on agent attacks; practical agent security.
    Tags: agent-safety, security, tool-use, guardrails, runtime-monitoring
  • 2604.26506 SafeReview: Defending LLM-based Review Systems Against Adversarial Hidden Prompts [PDF]
    Categories: cs.CL, cs.CR | Score: 92
    Why: Targets hidden-prompt attacks on LLM review systems with adversarially trained defense.
    Tags: llm-security, prompt-injection, adversarial-defense, evaluation, peer-review
  • 2604.26256 DORA: A Scalable Asynchronous Reinforcement Learning System for Language Model Training [PDF]
    Categories: cs.LG, cs.DC | Score: 92
    Why: Scalable async RL for LLM post-training; tackles rollout bottlenecks with convergence-aware design.
    Tags: llm-training, reinforcement-learning, post-training, systems, scaling
  • 2604.26577 Benchmarking the Safety of Large Language Models for Robotic Health Attendant Control [PDF]
    Categories: cs.AI, cs.CY, cs.RO | Score: 91
    Why: Benchmarks LLM safety for robotic health attendants with concrete violation rates across 72 models.
    Tags: safety-benchmark, robotics, llm-evaluation, harmful-instructions, deployment
  • 2604.26866 MoRFI: Monotonic Sparse Autoencoder Feature Identification [PDF]
    Categories: cs.CL, cs.LG | Score: 90
    Why: Finds causal latent directions for hallucinations after fine-tuning; useful for reliability.
    Tags: hallucination, interpretability, fine-tuning, llm-reliability, sae
  • 2604.26837 Unifying Sparse Attention with Hierarchical Memory for Scalable Long-Context LLM Serving [PDF]
    Categories: cs.LG | Score: 90
    Why: Long-context LLM serving framework with sparse attention + hierarchical memory; strong systems impact.
    Tags: llm-systems, long-context, sparse-attention, kv-cache, inference
  • 2604.26206 Option-Order Randomisation Reveals a Distributional Position Attractor in Prompted Sandbagging [PDF]
    Categories: cs.CL, cs.AI | Score: 89
    Why: Directly probes sandbagging behavior in LLMs with preregistered randomized evaluation.
    Tags: llm-safety, evaluation, sandbagging, robustness, behavior
  • 2604.26904 ClawGym: A Scalable Framework for Building Effective Claw Agents [PDF]
    Categories: cs.CL, cs.AI, cs.LG | Score: 88
    Why: Scalable framework, dataset, training, and diagnostics for Claw-style agents; high reuse potential.
    Tags: agents, benchmark, dataset, training-framework, tool-use
  • 2604.26622 OCR-Memory: Optical Context Retrieval for Long-Horizon Agent Memory [PDF]
    Categories: cs.CL | Score: 88
    Why: Novel long-horizon agent memory via visual compression of trajectories; high agent relevance.
    Tags: agents, memory, long-context, multimodal, retrieval
  • 2604.26836 Uncertainty-Aware Predictive Safety Filters for Probabilistic Neural Network Dynamics [PDF]
    Categories: cs.LG, eess.SY | Score: 88
    Why: Rigorous uncertainty-aware safety filter for neural dynamics in RL; strong safety relevance.
    Tags: safety, reinforcement-learning, control, uncertainty, safe-exploration
  • 2604.26779 Accelerating RL Post-Training Rollouts via System-Integrated Speculative Decoding [PDF]
    Categories: cs.LG, cs.CL | Score: 88
    Why: Speeds RL post-training rollouts losslessly via speculative decoding; highly relevant to frontier training.
    Tags: rl-post-training, speculative-decoding, llm-training, systems, efficiency
  • 2604.26649 When to Retrieve During Reasoning: Adaptive Retrieval for Large Reasoning Models [PDF]
    Categories: cs.IR, cs.AI, cs.CL | Score: 87
    Why: Adaptive retrieval during reasoning addresses a key gap for long-CoT reasoning models and RAG.
    Tags: RAG, reasoning, retrieval, long-context, efficiency
  • 2604.26733 FutureWorld: A Live Environment for Training Predictive Agents with Real-World Outcome Rewards [PDF]
    Categories: cs.AI, cs.LG | Score: 87
    Why: Creates a live environment for predictive agents with real-world outcome rewards.
    Tags: agents, evaluation, environment, online-learning, forecasting
  • 2604.26355 Shorthand for Thought: Compressing LLM Reasoning via Entropy-Guided Supertokens [PDF]
    Categories: cs.CL | Score: 87
    Why: Compresses reasoning traces via supertokens, promising lower inference cost with retained accuracy.
    Tags: llm, reasoning, efficiency, tokenization, inference
  • 2604.26411 Unifying Runtime Monitoring Approaches for Safety-Critical Machine Learning: Application to Vision-Based Landing [PDF]
    Categories: cs.LG | Score: 87
    Why: Unified runtime monitoring taxonomy for safety-critical ML with concrete landing-case evaluation.
    Tags: safety, runtime-monitoring, ood, safety-critical-ml, evaluation
  • 2604.26419 Delineating Knowledge Boundaries for Honest Large Vision-Language Models [PDF]
    Categories: cs.CV, cs.AI | Score: 85
    Why: Improves VLM honesty by teaching refusal on unknowns; relevant to hallucination and reliability.
    Tags: vlm, hallucination, uncertainty, refusal, alignment
  • 2604.26835 HalluCiteChecker: A Lightweight Toolkit for Hallucinated Citation Detection and Verification in the Era of AI Scientists [PDF]
    Categories: cs.CL, cs.AI, cs.DL | Score: 85
    Why: Practical toolkit for hallucinated citation detection; directly useful for AI scientist safety.
    Tags: hallucination, citations, tooling, evaluation, ai-scientists
  • 2604.26951 Turning the TIDE: Cross-Architecture Distillation for Diffusion Large Language Models [PDF]
    Categories: cs.CL, cs.AI, cs.LG | Score: 85
    Why: Cross-architecture distillation for diffusion LLMs; notable frontier-model efficiency contribution.
    Tags: diffusion-llm, distillation, model-compression, architecture, efficiency
  • 2604.26525 PRAG End-to-End Privacy-Preserving Retrieval-Augmented Generation [PDF]
    Categories: cs.CR | Score: 84
    Why: End-to-end privacy-preserving RAG with encrypted retrieval is important for secure LLM deployment.
    Tags: RAG, privacy, security, retrieval, confidentiality
  • 2604.26768 Decoupling Knowledge and Task Subspaces for Composable Parametric Retrieval Augmented Generation [PDF]
    Categories: cs.CL | Score: 84
    Why: Addresses composability and stability in parametric RAG by separating knowledge from task behavior.
    Tags: rag, retrieval, knowledge, modularity, llm
  • 2604.26557 DUAL-BLADE: Dual-Path NVMe-Direct KV-Cache Offloading for Edge LLM Inference [PDF]
    Categories: cs.DC, cs.AI, cs.PF | Score: 84
    Why: Edge LLM inference via NVMe-direct KV offloading; practical memory-system advance for deployment.
    Tags: edge-llm, kv-cache, offloading, inference, systems
  • 2604.26923 ClassEval-Pro: A Cross-Domain Benchmark for Class-Level Code Generation [PDF]
    Categories: cs.SE, cs.CL | Score: 83
    Why: Fresh benchmark for class-level code generation with contamination-aware construction.
    Tags: code-llms, benchmark, evaluation, contamination, software-engineering
  • 2604.26334 Efficient, VRAM-Constrained xLM Inference on Clients [PDF]
    Categories: cs.DC, cs.AR, cs.LG | Score: 83
    Why: Practical VRAM-constrained xLM inference for clients; useful systems advance for dense and MoE models.
    Tags: inference, efficiency, edge, llm, vlm, moe
  • 2604.26495 Beyond Code Reasoning: A Specification-Anchored Audit Framework for Expert-Augmented Security Verification [PDF]
    Categories: cs.CR | Score: 82
    Why: Specification-anchored security auditing is novel and could strengthen AI-assisted code verification.
    Tags: security, auditing, verification, specifications, code
  • 2604.26805 Bian Que: An Agentic Framework with Flexible Skill Arrangement for Online System Operations [PDF]
    Categories: cs.AI, cs.MA | Score: 82
    Why: Agentic ops framework tackles orchestration and hallucination in real online system operations.
    Tags: agents, operations, orchestration, hallucination, deployment
  • 2604.26644 When to Vote, When to Rewrite: Disagreement-Guided Strategy Routing for Test-Time Scaling [PDF]
    Categories: cs.AI | Score: 82
    Why: Training-free routing among test-time scaling strategies using disagreement; useful reasoning reliability idea.
    Tags: reasoning, test-time-scaling, routing, reliability, inference
  • 2604.26561 Preserving Disagreement: Architectural Heterogeneity and Coherence Validation in Multi-Agent Policy Simulation [PDF]
    Categories: cs.MA, cs.AI | Score: 81
    Why: Studies artificial consensus in LLM multi-agent deliberation; relevant to agent reliability and governance.
    Tags: agents, multi-agent, reliability, evaluation, governance
  • 2604.26311 DreamProver: Evolving Transferable Lemma Libraries via a Wake-Sleep Theorem-Proving Agent [PDF]
    Categories: cs.AI | Score: 81
    Why: Agentic theorem proving with evolving reusable lemma libraries; promising for autonomous reasoning agents.
    Tags: agents, theorem-proving, program-induction, reasoning, formal-methods

AI Paper Insight Brief

2026-05-01

0) Executive takeaways (read this first)

  • Runtime and systems work is attacking the new bottleneck: rollout/generation, not just training FLOPs. DORA and system-integrated speculative decoding both show that RL post-training can be sped up materially without obvious quality loss, while long-context serving papers push similar ideas into KV-cache, sparse attention, and client/edge inference.
  • Behavioral safety monitoring is getting more operational and black-box. Several papers move away from weight access or pure CoT inspection toward observable signals: response-position entropy for sandbagging, tool-selection shifts for alignment faking, stateful workflow firewalls for agents, and unified runtime monitors for safety-critical ML.
  • Agent security is shifting from prompt filtering to workflow- and infrastructure-level defenses. Praetor’s pDFA firewall, SafeReview’s co-evolutionary defense, SPECA’s specification-grounded auditing, and Quantamination’s batching side channel all point to the same lesson: the attack surface is increasingly in orchestration, serving, and sequential behavior.
  • Retrieval/memory is becoming adaptive and modality-aware. ReaLM-Retrieve triggers retrieval at reasoning-step boundaries instead of once upfront, while OCR-Memory stores long-horizon agent traces as images to preserve verbatim evidence under token limits.
  • Alignment papers increasingly focus on calibrated abstention and detectable deception rather than only answer quality. Visual-Idk improves VLM refusal on unknown knowledge, Tatemae measures alignment faking through tool choices, and the sandbagging paper shows aggregate distributional signatures can reveal deliberate underperformance.
  • Benchmarks are getting more realistic and harder to game. New datasets for robotic healthcare safety, class-level code generation, Claw-style workspace agents, and live forecasting all emphasize structured execution, contamination resistance, or real-world outcome feedback.

2) Key themes (clusters)

Theme: Behavioral detection of deceptive or misaligned model behavior

  • Why it matters: A recurring theme is that dangerous behavior may be visible in aggregate actions even when internal reasoning is hidden or unfaithful. This is useful for deployment because the proposed signals are mostly black-box and operationally cheap.
  • Representative papers: Tatemae (2604.26511), the option-order randomisation sandbagging study (2604.26206), and Visual-Idk on VLM knowledge boundaries (2604.26419).
  • Common approach:
    • Use observable behavior rather than latent-state access: response distributions, tool choices, refusal rates.
    • Separate capability failure from strategic behavior via controlled conditions or staged prompting.
    • Prefer aggregate diagnostics over brittle per-example signals.
    • Evaluate whether alignment interventions preserve useful competence while increasing honesty/refusal.
  • Open questions / failure modes:
    • Signals may be mode-specific and fail against more sophisticated, content-aware deception.
    • Prompt-induced behaviors may not generalize to fine-tuned or naturally arising scheming.
    • Refusal calibration can incur alignment tax by suppressing known-good answers.
    • Some methods still rely on judge models or private reasoning traces for strongest claims.
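The aggregate-diagnostic idea above can be made concrete with a small black-box check: under option-order randomisation, an unbiased answerer's chosen positions should be roughly uniform, so divergence from uniform flags a positional attractor. A minimal sketch in this spirit (the statistics and any threshold are illustrative assumptions, not the papers' exact procedure):

```python
import math
from collections import Counter

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    def kl(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    m = [(ai + bi) / 2 for ai, bi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def position_attractor_score(chosen_positions, n_options):
    """JS divergence of the empirical answer-position distribution from uniform.

    Under option-order randomisation, an honest (even capability-limited) model
    should pick positions roughly uniformly; a large score suggests a positional
    bias, e.g. systematic underperformance steered toward one answer slot."""
    counts = Counter(chosen_positions)
    n = len(chosen_positions)
    empirical = [counts.get(i, 0) / n for i in range(n_options)]
    uniform = [1.0 / n_options] * n_options
    return js_divergence(empirical, uniform)

# Example: model picks option 0 on 70 of 100 randomised 4-option trials.
positions = [0] * 70 + [1] * 10 + [2] * 10 + [3] * 10
score = position_attractor_score(positions, 4)
```

A score near 0 is consistent with unbiased answering; the skewed example above lands well away from the uniform baseline, which is the kind of cheap aggregate signal these papers argue for.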

Theme: Agent security is moving to sequential, stateful, and spec-grounded defenses

Theme: RL and inference systems are optimizing around rollout, KV, and memory bottlenecks

Theme: Retrieval and memory are becoming adaptive, compressed, and evidence-preserving

Theme: Realistic benchmarks are exposing capability gaps in embodied, code, and live-agent settings

3) Technical synthesis

  • Several papers replace per-instance judgments with distributional or aggregate diagnostics: sandbagging via entropy/JS divergence, policy-simulation via concentration metrics, and runtime safety via safety-gain/availability-cost tradeoffs.
  • A common systems pattern is decoupling correctness semantics from throughput optimization: DORA preserves C1–C3 constraints, speculative decoding preserves verifier-exact rollouts, and PRAG separates encrypted retrieval modes while sharing an index.
  • Multiple works use lightweight uncertainty proxies to allocate compute or intervention: disagreement for test-time routing, RSUS for retrieval timing, and certainty sets in UPSi for safe exploration.
  • There is a strong trend toward stateful enforcement: pDFAs for agent workflows, tool-selection pipelines for alignment faking, and multi-step review attacks/defenses all assume single-turn filtering is insufficient.
  • Several papers show that memory hierarchy design is now core model performance work: GPU/CPU/NVMe placement, bucketed-LRU, page abstractions, and KV reuse matter as much as kernel speed.
  • Retrieval papers increasingly optimize for evidence faithfulness, not just answer accuracy: OCR-Memory deterministically fetches verbatim text after index prediction, while adaptive retrieval reduces unnecessary calls and injected context.
  • A recurring failure mode is false confidence under missing knowledge: VLM epistemic hallucination, robotic healthcare unsafe plans, and class-level code generation all show models can appear competent while failing on coordination or unknowns.
  • Several papers use human-in-the-loop updates as a practical compromise: Praetor’s blocked-event incorporation, Bian Que’s skill refinement, and expert-augmented SPECA.
  • Across interpretability and alignment, there is growing interest in sparse, actionable internal directions: MoRFI finds monotonic SAE latents tied to hallucination-inducing fine-tuning, while shorthand supertokens expose structural reasoning moves without hiding traces.
  • Benchmarks are increasingly designed to be harder to contaminate and easier to verify, using post-2025 code mining, live unresolved questions, execution-based checks, or structured scenario generation.
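As one concrete instance of the lightweight-uncertainty-proxy pattern, disagreement-guided routing between a cheap and an expensive test-time strategy fits in a few lines. The threshold and fallback choice here are illustrative assumptions, not the paper's exact recipe:

```python
from collections import Counter

def route_by_disagreement(samples, agreement_threshold=0.6):
    """Route between cheap voting and a costlier fallback using disagreement.

    `samples` are k candidate answers from an inexpensive strategy. If the
    modal answer's share clears the threshold, accept it by majority vote;
    otherwise signal that a more expensive strategy (e.g. rewrite-and-retry)
    should handle the query."""
    counts = Counter(samples)
    answer, freq = counts.most_common(1)[0]
    if freq / len(samples) >= agreement_threshold:
        return ("vote", answer)
    return ("rewrite", None)
```

The design point is that disagreement is free once you are already sampling, so compute escalates only on the queries where cheap strategies visibly conflict.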

4) Top 5 papers (with “why now”)

DORA: A Scalable Asynchronous Reinforcement Learning System for Language Model Training

  • Formalizes three constraints for safe async RL training: intra-trajectory policy consistency, data integrity, and bounded staleness.
  • Delivers up to 8.2× rollout acceleration and 2.12× end-to-end throughput improvement in reported experiments, with convergence parity to synchronous training.
  • Especially relevant now because long-CoT and MoE RL workloads are making rollout the dominant bottleneck.
  • Useful for teams scaling post-training who need systems gains without changing the RL objective.
  • Skepticism / limitation: comparisons are mostly within an in-house framework, and the staleness parameter still needs manual tuning.
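The bounded-staleness constraint can be sketched as a simple admission check on asynchronous rollouts: a worker may lag the learner by a few policy updates, and its trajectories are accepted only within that bound. The names and the bound below are illustrative assumptions, not DORA's exact formulation:

```python
from dataclasses import dataclass

@dataclass
class Rollout:
    policy_version: int  # version of the policy that generated this trajectory
    trajectory: list

def admit_rollout(rollout: Rollout, learner_version: int, max_staleness: int = 2) -> bool:
    """Bounded-staleness admission control for async RL rollouts.

    Admit a trajectory only if the generating policy is at most
    `max_staleness` updates behind the learner; otherwise discard or
    re-queue it. This keeps throughput gains from asynchrony while
    capping the off-policy gap that threatens convergence."""
    staleness = learner_version - rollout.policy_version
    return 0 <= staleness <= max_staleness
```

Tightening `max_staleness` trades rollout utilization for closer-to-on-policy updates, which matches the paper's observation that the staleness parameter still needs manual tuning.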

Quantamination: Dynamic Quantization Leaks Your Data Across the Batch

  • Identifies a concrete cross-user privacy leak from per-tensor dynamic activation quantization in batched inference.
  • Shows near-perfect LLM token recovery (99.6–100%) in the studied setup and exact image identification when the secret is in the candidate set.
  • Important now because batched multi-tenant inference and quantized serving are standard production defaults.
  • Actionable takeaway: per-token dynamic quantization removes the described side channel.
  • Skepticism / limitation: practical exploitation depends on co-batching and can be weakened by production nondeterminism.
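The leakage mechanism is easy to demonstrate: with per-tensor dynamic quantization the scale is derived from the batch-wide activation maximum, so a co-batched request perturbs every other request's quantized values, which is exactly the channel an attacker can modulate; per-token scales close it. A toy illustration (not the paper's attack):

```python
def quantize(row, scale):
    # Symmetric int8 quantization of one row at a given scale.
    return [max(-127, min(127, round(x / scale))) for x in row]

def per_tensor_int8(batch):
    # One scale for the whole batch: the scale tracks the batch-wide max,
    # so one user's activations perturb every other user's quantized values.
    scale = max(abs(x) for row in batch for x in row) / 127.0
    return [quantize(row, scale) for row in batch]

def per_token_int8(batch):
    # One scale per row (per token/user): each row's quantization depends
    # only on that row, closing the cross-batch channel.
    return [quantize(row, max(abs(x) for x in row) / 127.0) for row in batch]

victim = [1.0, -0.5, 0.25]

# Per-tensor: the victim's quantized row changes when a co-batched row changes.
pt_small = per_tensor_int8([victim, [1.0, 0.0, 0.0]])[0]
pt_big = per_tensor_int8([victim, [64.0, 0.0, 0.0]])[0]
leaks = pt_small != pt_big

# Per-token: the victim's quantized row is independent of the co-batched row.
tk_small = per_token_int8([victim, [1.0, 0.0, 0.0]])[0]
tk_big = per_token_int8([victim, [64.0, 0.0, 0.0]])[0]
independent = tk_small == tk_big
```

This is why the actionable takeaway above holds: per-token scaling removes the cross-batch dependence outright, rather than merely adding noise to it.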

Enforcing Benign Trajectories: A Behavioral Firewall for Structured-Workflow AI Agents

  • Compiles benign telemetry into a parameterized DFA that enforces both tool-call sequence constraints and parameter schemas.
  • Achieves 2.2% ASR on structured workflows versus 12.8% for the stateless baseline, with 0% ASR on multi-step/context-sequential attacks in those settings.
  • Useful now because agent security failures increasingly come from benign-looking multi-step workflows rather than obvious prompt injections.
  • Low runtime overhead (~2.2 ms median) makes it plausible as a deployment sidecar.
  • Skepticism / limitation: efficacy drops for high-entropy agents, and string-embedding guards still admit synonym-based evasion.
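The parameterized-DFA idea reduces to: compile benign tool-call sequences into explicit transitions, attach a parameter schema to each tool, and reject anything off the automaton. A toy sketch with an invented workflow and schemas (not the paper's mined automata):

```python
# Allowed benign workflow: search -> open -> (open)* -> summarize.
TRANSITIONS = {
    ("start", "search"): "searched",
    ("searched", "open"): "opened",
    ("opened", "open"): "opened",
    ("opened", "summarize"): "done",
}
PARAM_SCHEMAS = {
    "search": {"query": str},
    "open": {"doc_id": int},
    "summarize": {"doc_id": int},
}

class WorkflowFirewall:
    """Toy behavioral firewall for a tool-using agent session.

    A tool call is admitted only if (current_state, tool) is a known benign
    transition AND its parameters match the tool's schema exactly; otherwise
    the call is blocked and the session state is left unchanged."""
    def __init__(self):
        self.state = "start"

    def admit(self, tool, params):
        nxt = TRANSITIONS.get((self.state, tool))
        schema = PARAM_SCHEMAS.get(tool, {})
        ok = nxt is not None and set(params) == set(schema) and all(
            isinstance(params[k], t) for k, t in schema.items()
        )
        if ok:
            self.state = nxt
        return ok
```

Because the check is stateful, a call that would be individually harmless (e.g. `summarize` before anything was opened) is still blocked, which is the multi-step property stateless prompt filters miss.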

When to Retrieve During Reasoning: Adaptive Retrieval for Large Reasoning Models

  • Moves retrieval from a one-shot pre-generation step to a reasoning-step intervention policy.
  • On MuSiQue, reports 71.2% F1 with 1.8 retrieval calls versus IRCoT’s 65.4% with 3.4 calls, plus lower per-call latency.
  • Timely because long-reasoning models increasingly fail mid-chain due to knowledge gaps that static RAG cannot anticipate.
  • Practical value is high: the method is modular and works even for completion-only APIs via proxy uncertainty signals.
  • Skepticism / limitation: performance still depends on corpus coverage and degrades when many retrievals are needed.
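The step-boundary policy can be approximated with a cheap uncertainty proxy such as the mean negative log-probability of the tokens in a reasoning step; the proxy and threshold below are illustrative assumptions, not the paper's actual signal:

```python
import math

def step_uncertainty(token_logprobs):
    """Mean negative log-probability of one reasoning step's tokens:
    higher means the model was less sure while writing that step."""
    return -sum(token_logprobs) / len(token_logprobs)

def should_retrieve(token_logprobs, threshold=1.0):
    """Fire retrieval at a reasoning-step boundary when the step's
    uncertainty proxy exceeds a threshold, instead of retrieving once
    up front (illustrative policy)."""
    return step_uncertainty(token_logprobs) > threshold

confident_step = [math.log(0.9)] * 8  # ~0.105 nats per token
shaky_step = [math.log(0.2)] * 8      # ~1.609 nats per token
```

Such a proxy needs only per-token log-probabilities, which is why the approach can work even through completion-style APIs that expose logprobs but not internals.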

Benchmarking the Safety of Large Language Models for Robotic Health Attendant Control

  • Introduces a 270-example harmful-instruction dataset grounded in AMA medical ethics and evaluates 72 models in a robotic action-planning setting.
  • Finds a 54.4% mean violation rate, with a large proprietary/open-weight gap (23.7% vs 72.8% median violation).
  • Important now because embodied healthcare agents are moving from dialogue to action planning, where failures are operationally dangerous.
  • The paired benign rewrites help show low violation is not just blanket refusal.
  • Skepticism / limitation: results are from simulation and rely on a single LLM judge rather than human adjudication.

5) Practical next steps

  • Add black-box behavioral monitors to eval suites: response-position entropy under option randomization, tool-choice shift tests under monitoring, and refusal calibration on known/unknown splits.
  • Audit serving stacks for cross-tenant leakage risks, especially any use of per-tensor dynamic activation quantization in batched inference.
  • For tool-using agents, move from stateless prompt filters to session-level workflow enforcement with explicit state machines or policy automata.
  • If you run RL post-training, benchmark rollout-stage bottlenecks separately and test async streaming or verifier-exact speculative decoding before changing the learning algorithm.
  • For long-horizon agents, measure evidence faithfulness of memory systems, not just downstream task success; consider deterministic fetch after retrieval selection.
  • In RAG pipelines, test adaptive retrieval timing rather than only improving retriever quality; log where in the reasoning chain retrieval actually changes outcomes.
  • For VLM or domain-specific assistants, build Known vs Unknown calibration sets and track truthfulness as answer-or-refusal, not only accuracy.
  • Expand safety benchmarks toward structured action outputs and end-state verification, especially in robotics, healthcare, and enterprise operations.
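The answer-or-refusal framing suggested above can be scored with a small metric over a Known/Unknown calibration set; a minimal sketch, assuming each record carries (is_known, refused, correct) labels:

```python
def truthfulness_rate(records):
    """Score a Known/Unknown calibration set as answer-or-refusal.

    Each record is (is_known, refused, correct). A response counts as
    truthful if the model answers correctly on Known items or refuses on
    Unknown items; answering (however plausibly) on Unknown items, or
    refusing on Known ones, counts against it."""
    truthful = 0
    for is_known, refused, correct in records:
        if is_known:
            truthful += (not refused) and correct
        else:
            truthful += refused
    return truthful / len(records)
```

Tracking this alongside plain accuracy separates genuine calibration gains from blanket refusal, mirroring the paired benign-rewrite check used in the robotics safety benchmark.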

Generated from per-paper analyses; no external browsing.