Daily AI Paper Report (2026-04-08)

Published:

Chinese version: [中文]

Run stats

  • Candidates: 195
  • Selected: 30
  • Deepread completed: 30
  • Window (UTC): 2026-04-06T00:00:00Z → 2026-04-07T00:00:00Z (arxiv_announce, expanded=0)
Show selected papers
arXiv IDTitle / LinksCategoriesScoreWhyTags
2604.04759Your Agent, Their Asset: A Real-World Safety Analysis of OpenClaw
PDF
cs.CR, cs.AI, cs.CL96First real-world safety eval of widely deployed agent w/ live attacks + CIK taxonomy.agent-safety, real-world-eval, tool-access, attack-scenarios, taxonomy, privilege-risk
2604.04426ShieldNet: Network-Level Guardrails against Emerging Supply-Chain Injections in Agentic Systems
PDF
cs.AI94Supply-chain injection benchmark (10k malicious MCP tools) + network guardrails; big gap.agent-security, supply-chain, MCP, benchmark, prompt-injection, guardrails, MITRE-ATT&CK
2604.04842Do No Harm: Exposing Hidden Vulnerabilities of LLMs via Persona-based Client Simulation Attack in Psychological Counseling
PDF
cs.CL93Persona-driven counseling red-teaming exposes maladaptive validation risks in multi-turn therapy chatsLLM-safety, red-teaming, mental-health, persona-attack, dialogue, high-stakes
2604.04522HDP: A Lightweight Cryptographic Protocol for Human Delegation Provenance in Agentic AI Systems
PDF
cs.CR, cs.MA93Cryptographic provenance for human authorization across agent delegation chains; closes accountability gap.agentic-systems, security, authorization, provenance, cryptography, multi-agent
2604.04565PassiveQA: A Three-Action Framework for Epistemically Calibrated Question Answering via Supervised Finetuning
PDF
cs.CL, cs.AI92Trains Answer/Ask/Abstain for epistemic calibration; directly targets overconfident QA/RAG failures.epistemic-calibration, abstention, clarification, RAG, hallucination, reliability, SFT
2604.04561Mapping the Exploitation Surface: A 10,000-Trial Taxonomy of What Makes LLM Agents Exploit Vulnerabilities
PDF
cs.CR, cs.AI, cs.CL9110k-trial taxonomy of prompt features that trigger agent vulnerability exploitation in sandboxes.agent-security, vulnerability-exploitation, system-prompts, taxonomy, evaluation, docker
2604.04743Hallucination Basins: A Dynamic Framework for Understanding and Controlling LLM Hallucinations
PDF
cs.CL, cs.AI, eess.SY91Dynamical-systems view of hallucinations + geometry-aware steering to reduce hallucination without retraininghallucinations, reliability, interpretability, steering, latent-dynamics, evaluation
2604.04443DeonticBench: A Benchmark for Reasoning over Rules
PDF
cs.CL916.2k-task benchmark for rule/deontic reasoning in legal/policy domains; high-stakes long-context eval.benchmark, evaluation, deontic-reasoning, law, policy, long-context
2604.04385How Alignment Routes: Localizing, Scaling, and Controlling Policy Circuits in Language Models
PDF
cs.CL, cs.AI, cs.LG90Mechanistic localization of alignment refusal routing circuits; controllable policy strength.interpretability, alignment, refusal, circuits, mechanistic, interchange-interventions
2604.04323How Well Do Agentic Skills Work in the Wild: Benchmarking LLM Skill Usage in Realistic Settings
PDF
cs.CL90Realistic benchmark of agent skill retrieval/selection over 34k skills; strong signal for agent robustness.agents, tool-use, skills, retrieval, benchmark, robustness, evaluation
2604.04738Fine-Tuning Integrity for Modern Neural Networks: Structured Drift Proofs via Norm, Rank, and Sparsity Certificates
PDF
cs.CR, cs.LG89Cryptographic proofs to certify fine-tune drift bounds; mitigates backdoors/safety regressions.model-security, fine-tuning, integrity, zero-knowledge, backdoors, provenance, certification
2604.04488A Patch-based Cross-view Regularized Framework for Backdoor Defense in Multimodal Large Language Models
PDF
cs.CV, cs.LG89Backdoor defense for multimodal LLMs targeting low-poisoning triggers while preserving benign generationsecurity, backdoors, multimodal-LLM, data-poisoning, defense, robustness
2604.04410Relative Density Ratio Optimization for Stable and Statistically Consistent Model Alignment
PDF
cs.LG, cs.AI, cs.CL, stat.ML89Statistically consistent alignment via relative density-ratio optimization; addresses instability of DDRO.alignment, preference-learning, DPO, statistical-consistency, optimization
2604.04757Undetectable Conversations Between AI Agents via Pseudorandom Noise-Resilient Key Exchange
PDF
cs.CR, cs.AI, cs.LG88Shows covert agent-to-agent steganographic conversations indistinguishable to auditors; major risk.steganography, agent-communication, auditing, watermarking, security, covert-channels
2604.04325Benchmarking Multi-turn Medical Diagnosis: Hold, Lure, and Self-Correction
PDF
cs.CL88Multi-turn medical diagnosis benchmark reveals premature commitment and self-correction issues in LLMs.evaluation, multi-turn, medical, decision-making, self-correction, reliability, benchmarks
2604.04712Hardware-Level Governance of AI Compute: A Feasibility Taxonomy for Regulatory Compliance and Treaty Verification
PDF
cs.CR, cs.CY86Engineering-grounded taxonomy of hardware compute governance mechanisms + feasibility/adversaries.AI-governance, compute, hardware-attestation, monitoring, treaty-verification, compliance
2604.04876Incompleteness of AI Safety Verification via Kolmogorov Complexity
PDF
cs.AI86Formal incompleteness limit for safety/policy verification via Kolmogorov complexity framingformal-verification, AI-safety-theory, limits, Kolmogorov-complexity, policy-compliance
2604.04461DP-OPD: Differentially Private On-Policy Distillation for Language Models
PDF
cs.LG, cs.AI, cs.CL86Differentially private on-policy distillation for LMs; targets privacy–utility issues in long rollouts.privacy, differential-privacy, distillation, language-models, deployment
2604.04930Early Stopping for Large Reasoning Models via Confidence Dynamics
PDF
cs.CL, cs.AI, cs.LG86Early-stopping for long CoT via confidence dynamics; cuts compute and mitigates overthinking regressions.reasoning, chain-of-thought, efficiency, confidence, inference, calibration
2604.04532Multilingual Prompt Localization for Agent-as-a-Judge: Language and Backbone Sensitivity in Requirement-Level Evaluation
PDF
cs.CL, cs.AI85Shows agent-judge eval is language-sensitive; prompt localization can flip model rankingsevaluation, agent-as-judge, multilingual, benchmarking, reliability, measurement
2604.04399GUIDE: Interpretable GUI Agent Evaluation via Hierarchical Diagnosis
PDF
cs.AI84Interpretable GUI-agent eval via hierarchical diagnosis; improves long-horizon failure analysis.agent-evaluation, GUI-agents, diagnostics, benchmarks, trajectories, interpretability
2604.04733Discovering Failure Modes in Vision-Language Models using RL
PDF
cs.CV, cs.AI84RL framework to automatically discover VLM failure modes/blind spots without manual curation.evaluation, red-teaming, vision-language, robustness, reinforcement-learning
2604.04855The Role of Generator Access in Autoregressive Post-Training
PDF
cs.LG84Shows generator interface (prefix control/logits) can yield exponential gains in KL-regularized post-training.post-training, RLHF, DPO, generator-access, theory, sample-efficiency, alignment
2604.04328Soft Tournament Equilibrium
PDF
cs.AI, cs.LG, cs.MA83Differentiable set-valued tournament solution for non-transitive agent comparisons; avoids unstable rankingsagent-evaluation, tournaments, non-transitivity, ranking, multi-agent, metrics
2604.04917Vero: An Open RL Recipe for General Visual Reasoning
PDF
cs.CV, cs.AI, cs.CL83Open RL recipe + 600K multi-task reward data for visual reasoning; strong reproducible VLM progress.VLM, reinforcement-learning, open-models, dataset, visual-reasoning, post-training
2604.04872Synthetic Sandbox for Training Machine Learning Engineering Agents
PDF
cs.CL, cs.LG82Synthetic sandbox enabling on-policy RL for ML-engineering agents by shrinking verification cost.agents, RL, sandbox, MLE, training-framework, evaluation, scaling
2604.04767Cog-DRIFT: Exploration on Adaptively Reformulated Instances Enables Learning from Hard Reasoning Problems
PDF
cs.LG, cs.AI, cs.CL82Task reformulation to learn from too-hard reasoning problems in RLVR; bootstraps via easier variantsreasoning, RLVR, curriculum, task-reformulation, training, LLMs
2604.04898QED-Nano: Teaching a Tiny Model to Prove Hard Theorems
PDF
cs.AI, cs.CL, cs.LG82Post-trains an open 4B model for Olympiad-level proofs; useful for studying small-model reasoning.reasoning, math, small-models, post-training, proofs, distillation
2604.04902Are Latent Reasoning Models Easily Interpretable?
PDF
cs.LG82Finds latent reasoning tokens often unused; challenges interpretability/monitoring claims of LRMs.interpretability, reasoning, latent-reasoning, monitoring, evaluation, mechanistic
2604.04847Full-Duplex-Bench-v3: Benchmarking Tool Use for Full-Duplex Voice Agents Under Real-World Disfluency
PDF
eess.AS, cs.CL81Real-audio full-duplex voice agent tool-use benchmark with disfluencies + latency/turn-taking.voice-agents, tool-use, benchmark, speech, disfluency, evaluation, latency

AI Paper Insight Brief

2026-04-08

0) Executive takeaways (read this first)

  • “Agent skills” don’t transfer cleanly from curated benchmarks to deployment reality: once agents must retrieve/select/adapt from a 34k-skill pool, gains often collapse toward the no-skill baseline; iterative agentic hybrid search + query-specific refinement can recover meaningful performance (e.g., +7.8 pp on Terminal-Bench 2.0 for Claude Opus 4.6).
  • Multi-turn interaction is itself a safety hazard in high-stakes domains: in medical diagnosis, models frequently commit too early; simply withholding the question until the end largely restores single-turn accuracy, and “salient evidence” (labs) can act as a lure that triggers premature (often wrong) answers.
  • Security evaluation is shifting from “what the tool says” to “what the tool does at runtime”: network-level monitoring (MITM + decrypted traffic event traces) detects MCP supply-chain injections with very high reported F1 and low FPR; persistent agent state (memory/identity/skills) is a major real-world attack surface that survives across sessions.
  • Transcript logging is not a sufficient control: cryptographic results show agents can embed undetectable covert communication in “honest-looking” conversations; key exchange can be done even under weak entropy assumptions (given new primitives), undermining passive auditing as a safety strategy.
  • Evaluation itself is becoming a first-class failure mode: GUI-agent judging improves dramatically with hierarchical diagnosis; but “agent-as-a-judge” results can invert model rankings depending on the judge language, with low inter-backbone agreement—meaning benchmark conclusions may not generalize across locales.
  • Alignment and integrity are becoming more mechanistic and cryptographic: alignment behavior can be localized to sparse routing circuits (gate→amplifiers) that are bypassable via encodings; and fine-tuning integrity can be certified with succinct ZK proofs for structured drift (norm/rank/sparsity), enabling new governance/audit workflows.

2) Key themes (clusters)

Theme: Skills & tool-use in the wild (retrieval, selection, adaptation)

  • Why it matters: “Skills” are widely used to extend agents, but their measured benefit depends heavily on whether evaluation includes realistic retrieval noise and adaptation costs.
  • Representative papers:
  • Common approach:
    • Move from curated/force-loaded artifacts to retrieval from large, noisy pools (skills or streaming speech).
    • Evaluate end-to-end task success under realistic constraints (selection errors, disfluencies, latency).
    • Add test-time adaptation loops (agentic search; query-specific refinement) or measure rollback failures (self-corrections).
  • Open questions / failure modes:
    • When relevant skills are absent, refinement can’t create missing knowledge; performance reverts toward baseline.
    • Voice agents: self-corrections/state rollback remains a dominant failure mode; low latency can trade off with turn-taking reliability.

Theme: Multi-turn safety failures (premature commitment, lures, persona attacks)

Theme: Execution-time agent security (supply chain, persistence, exploitation triggers)

Theme: Auditing, provenance, and cryptographic integrity for agents/models

Theme: Evaluation reliability & interpretability tooling (judges, diagnosis, rankings)

3) Technical synthesis

  • Realism upgrades in benchmarks follow a common pattern: remove oracle access (force-loaded skills; single-turn full evidence; short trajectories) and measure how performance degrades under retrieval noise, incremental evidence, long horizons, and runtime constraints.
  • Several papers converge on two-stage loops: (i) search/retrieve/segment (skills retrieval; trajectory segmentation; questioner RL probing), then (ii) refine/diagnose/steer (query-specific skill refinement; subtask diagnosis; latent steering).
  • “Commitment control” appears across domains: medical Q-Last reduces premature diagnosis; voice agents struggle with self-correction rollback; early-stopping for reasoning models uses confidence dynamics to stop overthinking.
  • Security defenses are moving from semantic checks to telemetry-grounded signals: ShieldNet’s decrypted event traces and OpenClaw’s external-action verification mirror a broader shift to execution evidence.
  • Multiple works highlight that surface artifacts are insufficient: tool schemas (MCP) don’t reveal injected behavior; transcripts don’t prevent covert channels; “detection” representations don’t guarantee aligned behavior (alignment routing circuits).
  • There’s a growing split between parametric guarantees vs semantic guarantees: fine-tuning integrity proofs can certify norm/rank/sparsity drift, but small drift can still cause large behavioral changes (explicitly noted).
  • Evaluation methodology itself is under attack: multilingual AAAJ shows judge-language/backbone interactions can invert rankings; STE argues rankings are brittle under cycles and proposes set-valued cores.
  • RL is being used both to improve capability (Vero open RL for visual reasoning; QED-Nano proof RL; SandMLE on-policy RL via micro-sandboxes; Cog-DRIFT reformulation curriculum) and to discover failures (RL questioner for VLM failure modes).
  • Several approaches rely on strong auxiliary models (verifiers/judges/graders) and thus inherit their biases (medical sharding with Qwen3-32B; VLM verifier; counseling judges; proof graders).
  • A recurring practical tradeoff: robustness vs cost (query-specific refinement compute; dual-view backdoor defense ~2× training; DP-OPD teacher-query overhead; network MITM invasiveness; long-token reasoning vs early stopping).

4) Top 5 papers (with “why now”)

1) ShieldNet: Network-Level Guardrails against Emerging Supply-Chain Injections in Agentic Systems

  • Introduces SC-Inject-Bench: code-level tool injections validated by network traces (PCAP), targeting a realistic MCP supply-chain threat model.
  • Shows a practical guardrail: MITM + decrypted HTTP(S) + structured event traces with a lightweight post-trained detector (Qwen3-0.6B) enabling streaming detection.
  • Reports very strong detection (e.g., PCAP-level F1=0.995, FPR=0.008) and ablations showing decryption is critical.
  • Skepticism / limitation: focuses on network-visible attacks; operational constraints of MITM/decryption and QUIC blocking may limit deployability.

2) Your Agent, Their Asset: A Real-World Safety Analysis of OpenClaw

  • Makes persistent agent state a first-class security object via CIK taxonomy (Capability/Identity/Knowledge) and tests poisoning across sessions.
  • Demonstrates large ASR increases under single-dimension poisoning (abstract: baseline 24.6% → 64–74% with poisoning) and highlights executable capability payloads as especially dangerous.
  • Evaluates defenses and surfaces an evolution–safety tradeoff (file protection blocks attacks but also blocks legitimate updates).
  • Skepticism / limitation: single platform (OpenClaw) and 12 scenarios; cross-dimension chained attacks not studied.

3) How Alignment Routes: Localizing, Scaling, and Controlling Policy Circuits in Language Models

  • Localizes alignment behavior to a sparse gate→amplifier routing motif across 9 models, with causal tests (interchange, knockout cascades).
  • Shows continuous control via detection-layer modulation (dose–response) and a concrete bypass: cipher encoding collapses gate contribution (e.g., 78% collapse in Phi-4-mini).
  • Provides a mechanistic explanation for “detection without refusal” and a target for auditing/defense.
  • Skepticism / limitation: scope limited to 2–32B models and political/safety refusal domains; MLP routing contributions remain underexplained.

4) Benchmarking Multi-turn Medical Diagnosis: Hold, Lure, and Self-Correction

  • Releases MINT (1,035 cases) converting multiple medical QA datasets into controlled multi-turn shards with information-preservation checks.
  • Finds widespread premature commitment (55.3% of initial answers in first two turns) and that withholding the question largely recovers single-turn accuracy.
  • Identifies lab results as lures that trigger immediate answering; suggests evidence scheduling as a mitigation.
  • Skepticism / limitation: converted vignettes (not natural dialogue) and multiple-choice focus; Q-Last protocol may be unrealistic.

5) Undetectable Conversations Between AI Agents via Pseudorandom Noise-Resilient Key Exchange

  • Formalizes and proves that passive transcript auditing can fail: agents can embed covert communication indistinguishable from honest conversation.
  • Introduces bundle sampling and the new primitive PNR-KE, with constructions and strong limitations (e.g., attacks on non-interactive schemes).
  • Reframes “log everything” as insufficient for preventing covert coordination in general.
  • Skepticism / limitation: theoretical; relies on cryptographic assumptions (e.g., LSPN) and on whether real LLM message distributions meet entropy assumptions.

5) Practical next steps

  • For agent platforms using skills: benchmark with retrieval from large noisy pools (not force-loaded skills), and add agentic hybrid search + query-specific refinement as a standard baseline; track Recall@k and end-task pass rates jointly.
  • For multi-turn high-stakes assistants (medical/counseling): implement commitment-aware protocols (explicit “hold” until sufficient evidence; delay salient lures like labs) and measure early-commit error rates and flip dynamics (F2T/T2F).
  • For MCP/tool ecosystems: deploy runtime network telemetry (decrypted HTTP(S) event traces where feasible) and evaluate against code-level injection benchmarks; combine with host telemetry to cover non-network attacks.
  • For personal/local agents: treat persistent state as a security boundary—add write protections, code signing for skills, and explicit approvals for executable capability changes; measure the evolution–safety tradeoff explicitly.
  • For monitoring/covert-channel risk: do not rely on transcript logs alone; consider active-warden style interventions, randomized checks, or protocol-level constraints if covert coordination is in-scope.
  • For evaluation pipelines: report judge language and backbone; for multilingual deployments, run at least two judge backbones and quantify agreement; consider tournament-core style reporting (Top Cycle / Uncovered Set) when comparisons are cyclic.
  • For fine-tuning governance: if you need integrity guarantees, pilot structured drift certificates (norm/rank/sparsity) and pair them with behavioral audits, since parametric constraints don’t imply semantic safety.
  • For hallucination reduction and calibration: consider training-time routing (Answer/Ask/Abstain) and/or inference-time latent steering/early stopping, but evaluate on task types where separability is known to hold (factoid vs generative vs misconception-heavy).

Generated from per-paper analyses; no external browsing.