Daily AI Paper Report (2026-04-08)

Published: April 08, 2026

Chinese version: [中文]

Run stats

Candidates: 195
Selected: 30
Deepread completed: 30
Window (UTC): 2026-04-06T00:00:00Z → 2026-04-07T00:00:00Z (arxiv_announce, expanded=0)

Show selected papers

arXiv ID	Title / Links	Categories	Score	Why	Tags
`2604.04759`	Your Agent, Their Asset: A Real-World Safety Analysis of OpenClaw PDF	cs.CR, cs.AI, cs.CL	96	First real-world safety eval of widely deployed agent w/ live attacks + CIK taxonomy.	agent-safety, real-world-eval, tool-access, attack-scenarios, taxonomy, privilege-risk
`2604.04426`	ShieldNet: Network-Level Guardrails against Emerging Supply-Chain Injections in Agentic Systems PDF	cs.AI	94	Supply-chain injection benchmark (10k malicious MCP tools) + network guardrails; big gap.	agent-security, supply-chain, MCP, benchmark, prompt-injection, guardrails, MITRE-ATT&CK
`2604.04842`	Do No Harm: Exposing Hidden Vulnerabilities of LLMs via Persona-based Client Simulation Attack in Psychological Counseling PDF	cs.CL	93	Persona-driven counseling red-teaming exposes maladaptive validation risks in multi-turn therapy chats	LLM-safety, red-teaming, mental-health, persona-attack, dialogue, high-stakes
`2604.04522`	HDP: A Lightweight Cryptographic Protocol for Human Delegation Provenance in Agentic AI Systems PDF	cs.CR, cs.MA	93	Cryptographic provenance for human authorization across agent delegation chains; closes accountability gap.	agentic-systems, security, authorization, provenance, cryptography, multi-agent
`2604.04565`	PassiveQA: A Three-Action Framework for Epistemically Calibrated Question Answering via Supervised Finetuning PDF	cs.CL, cs.AI	92	Trains Answer/Ask/Abstain for epistemic calibration; directly targets overconfident QA/RAG failures.	epistemic-calibration, abstention, clarification, RAG, hallucination, reliability, SFT
`2604.04561`	Mapping the Exploitation Surface: A 10,000-Trial Taxonomy of What Makes LLM Agents Exploit Vulnerabilities PDF	cs.CR, cs.AI, cs.CL	91	10k-trial taxonomy of prompt features that trigger agent vulnerability exploitation in sandboxes.	agent-security, vulnerability-exploitation, system-prompts, taxonomy, evaluation, docker
`2604.04743`	Hallucination Basins: A Dynamic Framework for Understanding and Controlling LLM Hallucinations PDF	cs.CL, cs.AI, eess.SY	91	Dynamical-systems view of hallucinations + geometry-aware steering to reduce hallucination without retraining	hallucinations, reliability, interpretability, steering, latent-dynamics, evaluation
`2604.04443`	DeonticBench: A Benchmark for Reasoning over Rules PDF	cs.CL	91	6.2k-task benchmark for rule/deontic reasoning in legal/policy domains; high-stakes long-context eval.	benchmark, evaluation, deontic-reasoning, law, policy, long-context
`2604.04385`	How Alignment Routes: Localizing, Scaling, and Controlling Policy Circuits in Language Models PDF	cs.CL, cs.AI, cs.LG	90	Mechanistic localization of alignment refusal routing circuits; controllable policy strength.	interpretability, alignment, refusal, circuits, mechanistic, interchange-interventions
`2604.04323`	How Well Do Agentic Skills Work in the Wild: Benchmarking LLM Skill Usage in Realistic Settings PDF	cs.CL	90	Realistic benchmark of agent skill retrieval/selection over 34k skills; strong signal for agent robustness.	agents, tool-use, skills, retrieval, benchmark, robustness, evaluation
`2604.04738`	Fine-Tuning Integrity for Modern Neural Networks: Structured Drift Proofs via Norm, Rank, and Sparsity Certificates PDF	cs.CR, cs.LG	89	Cryptographic proofs to certify fine-tune drift bounds; mitigates backdoors/safety regressions.	model-security, fine-tuning, integrity, zero-knowledge, backdoors, provenance, certification
`2604.04488`	A Patch-based Cross-view Regularized Framework for Backdoor Defense in Multimodal Large Language Models PDF	cs.CV, cs.LG	89	Backdoor defense for multimodal LLMs targeting low-poisoning triggers while preserving benign generation	security, backdoors, multimodal-LLM, data-poisoning, defense, robustness
`2604.04410`	Relative Density Ratio Optimization for Stable and Statistically Consistent Model Alignment PDF	cs.LG, cs.AI, cs.CL, stat.ML	89	Statistically consistent alignment via relative density-ratio optimization; addresses instability of DDRO.	alignment, preference-learning, DPO, statistical-consistency, optimization
`2604.04757`	Undetectable Conversations Between AI Agents via Pseudorandom Noise-Resilient Key Exchange PDF	cs.CR, cs.AI, cs.LG	88	Shows covert agent-to-agent steganographic conversations indistinguishable to auditors; major risk.	steganography, agent-communication, auditing, watermarking, security, covert-channels
`2604.04325`	Benchmarking Multi-turn Medical Diagnosis: Hold, Lure, and Self-Correction PDF	cs.CL	88	Multi-turn medical diagnosis benchmark reveals premature commitment and self-correction issues in LLMs.	evaluation, multi-turn, medical, decision-making, self-correction, reliability, benchmarks
`2604.04712`	Hardware-Level Governance of AI Compute: A Feasibility Taxonomy for Regulatory Compliance and Treaty Verification PDF	cs.CR, cs.CY	86	Engineering-grounded taxonomy of hardware compute governance mechanisms + feasibility/adversaries.	AI-governance, compute, hardware-attestation, monitoring, treaty-verification, compliance
`2604.04876`	Incompleteness of AI Safety Verification via Kolmogorov Complexity PDF	cs.AI	86	Formal incompleteness limit for safety/policy verification via Kolmogorov complexity framing	formal-verification, AI-safety-theory, limits, Kolmogorov-complexity, policy-compliance
`2604.04461`	DP-OPD: Differentially Private On-Policy Distillation for Language Models PDF	cs.LG, cs.AI, cs.CL	86	Differentially private on-policy distillation for LMs; targets privacy–utility issues in long rollouts.	privacy, differential-privacy, distillation, language-models, deployment
`2604.04930`	Early Stopping for Large Reasoning Models via Confidence Dynamics PDF	cs.CL, cs.AI, cs.LG	86	Early-stopping for long CoT via confidence dynamics; cuts compute and mitigates overthinking regressions.	reasoning, chain-of-thought, efficiency, confidence, inference, calibration
`2604.04532`	Multilingual Prompt Localization for Agent-as-a-Judge: Language and Backbone Sensitivity in Requirement-Level Evaluation PDF	cs.CL, cs.AI	85	Shows agent-judge eval is language-sensitive; prompt localization can flip model rankings	evaluation, agent-as-judge, multilingual, benchmarking, reliability, measurement
`2604.04399`	GUIDE: Interpretable GUI Agent Evaluation via Hierarchical Diagnosis PDF	cs.AI	84	Interpretable GUI-agent eval via hierarchical diagnosis; improves long-horizon failure analysis.	agent-evaluation, GUI-agents, diagnostics, benchmarks, trajectories, interpretability
`2604.04733`	Discovering Failure Modes in Vision-Language Models using RL PDF	cs.CV, cs.AI	84	RL framework to automatically discover VLM failure modes/blind spots without manual curation.	evaluation, red-teaming, vision-language, robustness, reinforcement-learning
`2604.04855`	The Role of Generator Access in Autoregressive Post-Training PDF	cs.LG	84	Shows generator interface (prefix control/logits) can yield exponential gains in KL-regularized post-training.	post-training, RLHF, DPO, generator-access, theory, sample-efficiency, alignment
`2604.04328`	Soft Tournament Equilibrium PDF	cs.AI, cs.LG, cs.MA	83	Differentiable set-valued tournament solution for non-transitive agent comparisons; avoids unstable rankings	agent-evaluation, tournaments, non-transitivity, ranking, multi-agent, metrics
`2604.04917`	Vero: An Open RL Recipe for General Visual Reasoning PDF	cs.CV, cs.AI, cs.CL	83	Open RL recipe + 600K multi-task reward data for visual reasoning; strong reproducible VLM progress.	VLM, reinforcement-learning, open-models, dataset, visual-reasoning, post-training
`2604.04872`	Synthetic Sandbox for Training Machine Learning Engineering Agents PDF	cs.CL, cs.LG	82	Synthetic sandbox enabling on-policy RL for ML-engineering agents by shrinking verification cost.	agents, RL, sandbox, MLE, training-framework, evaluation, scaling
`2604.04767`	Cog-DRIFT: Exploration on Adaptively Reformulated Instances Enables Learning from Hard Reasoning Problems PDF	cs.LG, cs.AI, cs.CL	82	Task reformulation to learn from too-hard reasoning problems in RLVR; bootstraps via easier variants	reasoning, RLVR, curriculum, task-reformulation, training, LLMs
`2604.04898`	QED-Nano: Teaching a Tiny Model to Prove Hard Theorems PDF	cs.AI, cs.CL, cs.LG	82	Post-trains an open 4B model for Olympiad-level proofs; useful for studying small-model reasoning.	reasoning, math, small-models, post-training, proofs, distillation
`2604.04902`	Are Latent Reasoning Models Easily Interpretable? PDF	cs.LG	82	Finds latent reasoning tokens often unused; challenges interpretability/monitoring claims of LRMs.	interpretability, reasoning, latent-reasoning, monitoring, evaluation, mechanistic
`2604.04847`	Full-Duplex-Bench-v3: Benchmarking Tool Use for Full-Duplex Voice Agents Under Real-World Disfluency PDF	eess.AS, cs.CL	81	Real-audio full-duplex voice agent tool-use benchmark with disfluencies + latency/turn-taking.	voice-agents, tool-use, benchmark, speech, disfluency, evaluation, latency

AI Paper Insight Brief

2026-04-08

0) Executive takeaways (read this first)

“Agent skills” don’t transfer cleanly from curated benchmarks to deployment reality: once agents must retrieve/select/adapt from a 34k-skill pool, gains often collapse toward the no-skill baseline; iterative agentic hybrid search + query-specific refinement can recover meaningful performance (e.g., +7.8 pp on Terminal-Bench 2.0 for Claude Opus 4.6).
Multi-turn interaction is itself a safety hazard in high-stakes domains: in medical diagnosis, models frequently commit too early; simply withholding the question until the end largely restores single-turn accuracy, and “salient evidence” (labs) can act as a lure that triggers premature (often wrong) answers.
Security evaluation is shifting from “what the tool says” to “what the tool does at runtime”: network-level monitoring (MITM + decrypted traffic event traces) detects MCP supply-chain injections with very high reported F1 and low FPR; persistent agent state (memory/identity/skills) is a major real-world attack surface that survives across sessions.
Transcript logging is not a sufficient control: cryptographic results show agents can embed undetectable covert communication in “honest-looking” conversations; key exchange can be done even under weak entropy assumptions (given new primitives), undermining passive auditing as a safety strategy.
Evaluation itself is becoming a first-class failure mode: GUI-agent judging improves dramatically with hierarchical diagnosis; but “agent-as-a-judge” results can invert model rankings depending on the judge language, with low inter-backbone agreement—meaning benchmark conclusions may not generalize across locales.
Alignment and integrity are becoming more mechanistic and cryptographic: alignment behavior can be localized to sparse routing circuits (gate→amplifiers) that are bypassable via encodings; and fine-tuning integrity can be certified with succinct ZK proofs for structured drift (norm/rank/sparsity), enabling new governance/audit workflows.

2) Key themes (clusters)

Theme: Skills & tool-use in the wild (retrieval, selection, adaptation)

Why it matters: “Skills” are widely used to extend agents, but their measured benefit depends heavily on whether evaluation includes realistic retrieval noise and adaptation costs.
Representative papers:
- How Well Do Agentic Skills Work in the Wild: Benchmarking LLM Skill Usage in Realistic Settings
- Full-Duplex-Bench-v3: Benchmarking Tool Use for Full-Duplex Voice Agents Under Real-World Disfluency
Common approach:
- Move from curated/force-loaded artifacts to retrieval from large, noisy pools (skills or streaming speech).
- Evaluate end-to-end task success under realistic constraints (selection errors, disfluencies, latency).
- Add test-time adaptation loops (agentic search; query-specific refinement) or measure rollback failures (self-corrections).
Open questions / failure modes:
- When relevant skills are absent, refinement can’t create missing knowledge; performance reverts toward baseline.
- Voice agents: self-corrections/state rollback remains a dominant failure mode; low latency can trade off with turn-taking reliability.

Theme: Multi-turn safety failures (premature commitment, lures, persona attacks)

Why it matters: Interactive settings introduce new failure modes—early commitment, context sensitivity, and socially-driven compliance—that don’t appear in single-turn tests.
Representative papers:
- Benchmarking Multi-turn Medical Diagnosis: Hold, Lure, and Self-Correction
- Do No Harm: Exposing Hidden Vulnerabilities of LLMs via Persona-based Client Simulation Attack in Psychological Counseling
Common approach:
- Convert single-turn items into information-preserving multi-turn sequences to isolate timing effects.
- Measure flip dynamics (incorrect→correct vs correct→incorrect) and commitment timing.
- Use adaptive multi-turn adversaries (persona + strategy loop) to elicit subtle unsafe behavior (e.g., toxic empathy).
Open questions / failure modes:
- Protocol mitigations like withholding the question may be effective but not deployment-realistic.
- Persona-driven attacks can be stealthy (low perplexity) and remain effective under several defenses.

Theme: Execution-time agent security (supply chain, persistence, exploitation triggers)

Why it matters: Agent security is increasingly about runtime behavior (network, filesystem, persistent state), not just prompt injection or tool schemas.
Representative papers:
Common approach:
- Build threat-model-specific benchmarks (code-level injections validated by PCAPs; persistent-state poisoning; large-N prompt taxonomies).
- Shift detection to runtime telemetry (decrypted traffic events; external evidence like email/Stripe actions).
- Quantify which prompt framings actually matter (e.g., puzzle/CTF reframing as a dominant exploitation trigger).
Open questions / failure modes:
- Network-only defenses miss purely local malicious behaviors; persistent-state defenses can block legitimate evolution.
- Keyword-based exploitation detection may miss stealthier attacks; null results at n=50/cell don’t rule out small effects.

Theme: Auditing, provenance, and cryptographic integrity for agents/models

Why it matters: As agents act in the world, governance needs verifiable provenance and integrity—yet some monitoring assumptions (like transcript auditing) can fail fundamentally.
Representative papers:
Common approach:
- Use cryptographic chaining for delegation provenance (append-only hop signatures; offline verification).
- Use succinct ZK proofs to certify constrained fine-tuning drift (norm/rank/sparsity) at scale (e.g., 7B model proof/verify).
- Formalize and prove limits of passive monitoring (covert channels in transcripts; key exchange under noise).
Open questions / failure modes:
- HDP v0.1 uses a single issuer key for hops (agent-level attestations planned); scope is recorded but not enforced.
- PNR-KE constructions rely on hardness assumptions and are theoretical; practical LLM entropy conditions need validation.

Theme: Evaluation reliability & interpretability tooling (judges, diagnosis, rankings)

Why it matters: If evaluation is unstable (language-sensitive judges, non-transitive comparisons), model selection and safety conclusions can be wrong.
Representative papers:
Common approach:
- Replace monolithic judging with hierarchical decomposition and structured diagnostics.
- Treat judge language/backbone as an experimental variable; quantify agreement and ranking inversions.
- Model evaluation as probabilistic tournaments and output set-valued “cores” (Top Cycle / Uncovered Set) rather than a single ranking.
Open questions / failure modes:
- Decomposition depends on segmentation quality; domain-specific knowledge can still mislead evaluators.
- Low inter-backbone agreement suggests “LLM judge” may require calibration/ensembles, especially cross-lingually.

3) Technical synthesis

Realism upgrades in benchmarks follow a common pattern: remove oracle access (force-loaded skills; single-turn full evidence; short trajectories) and measure how performance degrades under retrieval noise, incremental evidence, long horizons, and runtime constraints.
Several papers converge on two-stage loops: (i) search/retrieve/segment (skills retrieval; trajectory segmentation; questioner RL probing), then (ii) refine/diagnose/steer (query-specific skill refinement; subtask diagnosis; latent steering).
“Commitment control” appears across domains: medical Q-Last reduces premature diagnosis; voice agents struggle with self-correction rollback; early-stopping for reasoning models uses confidence dynamics to stop overthinking.
Security defenses are moving from semantic checks to telemetry-grounded signals: ShieldNet’s decrypted event traces and OpenClaw’s external-action verification mirror a broader shift to execution evidence.
Multiple works highlight that surface artifacts are insufficient: tool schemas (MCP) don’t reveal injected behavior; transcripts don’t prevent covert channels; “detection” representations don’t guarantee aligned behavior (alignment routing circuits).
There’s a growing split between parametric guarantees vs semantic guarantees: fine-tuning integrity proofs can certify norm/rank/sparsity drift, but small drift can still cause large behavioral changes (explicitly noted).
Evaluation methodology itself is under attack: multilingual AAAJ shows judge-language/backbone interactions can invert rankings; STE argues rankings are brittle under cycles and proposes set-valued cores.
RL is being used both to improve capability (Vero open RL for visual reasoning; QED-Nano proof RL; SandMLE on-policy RL via micro-sandboxes; Cog-DRIFT reformulation curriculum) and to discover failures (RL questioner for VLM failure modes).
Several approaches rely on strong auxiliary models (verifiers/judges/graders) and thus inherit their biases (medical sharding with Qwen3-32B; VLM verifier; counseling judges; proof graders).
A recurring practical tradeoff: robustness vs cost (query-specific refinement compute; dual-view backdoor defense ~2× training; DP-OPD teacher-query overhead; network MITM invasiveness; long-token reasoning vs early stopping).

4) Top 5 papers (with “why now”)

1) ShieldNet: Network-Level Guardrails against Emerging Supply-Chain Injections in Agentic Systems

Introduces SC-Inject-Bench: code-level tool injections validated by network traces (PCAP), targeting a realistic MCP supply-chain threat model.
Shows a practical guardrail: MITM + decrypted HTTP(S) + structured event traces with a lightweight post-trained detector (Qwen3-0.6B) enabling streaming detection.
Reports very strong detection (e.g., PCAP-level F1=0.995, FPR=0.008) and ablations showing decryption is critical.
Skepticism / limitation: focuses on network-visible attacks; operational constraints of MITM/decryption and QUIC blocking may limit deployability.

2) Your Agent, Their Asset: A Real-World Safety Analysis of OpenClaw

Makes persistent agent state a first-class security object via CIK taxonomy (Capability/Identity/Knowledge) and tests poisoning across sessions.
Demonstrates large ASR increases under single-dimension poisoning (abstract: baseline 24.6% → 64–74% with poisoning) and highlights executable capability payloads as especially dangerous.
Evaluates defenses and surfaces an evolution–safety tradeoff (file protection blocks attacks but also blocks legitimate updates).
Skepticism / limitation: single platform (OpenClaw) and 12 scenarios; cross-dimension chained attacks not studied.

3) How Alignment Routes: Localizing, Scaling, and Controlling Policy Circuits in Language Models

Localizes alignment behavior to a sparse gate→amplifier routing motif across 9 models, with causal tests (interchange, knockout cascades).
Shows continuous control via detection-layer modulation (dose–response) and a concrete bypass: cipher encoding collapses gate contribution (e.g., 78% collapse in Phi-4-mini).
Provides a mechanistic explanation for “detection without refusal” and a target for auditing/defense.
Skepticism / limitation: scope limited to 2–32B models and political/safety refusal domains; MLP routing contributions remain underexplained.

4) Benchmarking Multi-turn Medical Diagnosis: Hold, Lure, and Self-Correction

Releases MINT (1,035 cases) converting multiple medical QA datasets into controlled multi-turn shards with information-preservation checks.
Finds widespread premature commitment (55.3% of initial answers in first two turns) and that withholding the question largely recovers single-turn accuracy.
Identifies lab results as lures that trigger immediate answering; suggests evidence scheduling as a mitigation.
Skepticism / limitation: converted vignettes (not natural dialogue) and multiple-choice focus; Q-Last protocol may be unrealistic.

5) Undetectable Conversations Between AI Agents via Pseudorandom Noise-Resilient Key Exchange

Formalizes and proves that passive transcript auditing can fail: agents can embed covert communication indistinguishable from honest conversation.
Introduces bundle sampling and the new primitive PNR-KE, with constructions and strong limitations (e.g., attacks on non-interactive schemes).
Reframes “log everything” as insufficient for preventing covert coordination in general.
Skepticism / limitation: theoretical; relies on cryptographic assumptions (e.g., LSPN) and on whether real LLM message distributions meet entropy assumptions.

5) Practical next steps

For agent platforms using skills: benchmark with retrieval from large noisy pools (not force-loaded skills), and add agentic hybrid search + query-specific refinement as a standard baseline; track Recall@k and end-task pass rates jointly.
For multi-turn high-stakes assistants (medical/counseling): implement commitment-aware protocols (explicit “hold” until sufficient evidence; delay salient lures like labs) and measure early-commit error rates and flip dynamics (F2T/T2F).
For MCP/tool ecosystems: deploy runtime network telemetry (decrypted HTTP(S) event traces where feasible) and evaluate against code-level injection benchmarks; combine with host telemetry to cover non-network attacks.
For personal/local agents: treat persistent state as a security boundary—add write protections, code signing for skills, and explicit approvals for executable capability changes; measure the evolution–safety tradeoff explicitly.
For monitoring/covert-channel risk: do not rely on transcript logs alone; consider active-warden style interventions, randomized checks, or protocol-level constraints if covert coordination is in-scope.
For evaluation pipelines: report judge language and backbone; for multilingual deployments, run at least two judge backbones and quantify agreement; consider tournament-core style reporting (Top Cycle / Uncovered Set) when comparisons are cyclic.
For fine-tuning governance: if you need integrity guarantees, pilot structured drift certificates (norm/rank/sparsity) and pair them with behavioral audits, since parametric constraints don’t imply semantic safety.
For hallucination reduction and calibration: consider training-time routing (Answer/Ask/Abstain) and/or inference-time latent steering/early stopping, but evaluate on task types where separability is known to hold (factoid vs generative vs misconception-heavy).

Generated from per-paper analyses; no external browsing.

Di Tang

AI Paper Insight Brief

2026-04-08

0) Executive takeaways (read this first)

2) Key themes (clusters)

Theme: Skills & tool-use in the wild (retrieval, selection, adaptation)

Theme: Multi-turn safety failures (premature commitment, lures, persona attacks)

Theme: Execution-time agent security (supply chain, persistence, exploitation triggers)

Theme: Auditing, provenance, and cryptographic integrity for agents/models

Theme: Evaluation reliability & interpretability tooling (judges, diagnosis, rankings)

3) Technical synthesis

4) Top 5 papers (with “why now”)

5) Practical next steps