Daily AI Paper Report (2026-04-08)
Published:
Chinese version: [中文]
Run stats
- Candidates: 195
- Selected: 30
- Deepread completed: 30
- Window (UTC): 2026-04-06T00:00:00Z → 2026-04-07T00:00:00Z (arxiv_announce, expanded=0)
Show selected papers
| arXiv ID | Title / Links | Categories | Score | Why | Tags |
|---|---|---|---|---|---|
2604.04759 | Your Agent, Their Asset: A Real-World Safety Analysis of OpenClaw | cs.CR, cs.AI, cs.CL | 96 | First real-world safety eval of widely deployed agent w/ live attacks + CIK taxonomy. | agent-safety, real-world-eval, tool-access, attack-scenarios, taxonomy, privilege-risk |
2604.04426 | ShieldNet: Network-Level Guardrails against Emerging Supply-Chain Injections in Agentic Systems | cs.AI | 94 | Supply-chain injection benchmark (10k malicious MCP tools) + network guardrails; big gap. | agent-security, supply-chain, MCP, benchmark, prompt-injection, guardrails, MITRE-ATT&CK |
2604.04842 | Do No Harm: Exposing Hidden Vulnerabilities of LLMs via Persona-based Client Simulation Attack in Psychological Counseling | cs.CL | 93 | Persona-driven counseling red-teaming exposes maladaptive validation risks in multi-turn therapy chats | LLM-safety, red-teaming, mental-health, persona-attack, dialogue, high-stakes |
2604.04522 | HDP: A Lightweight Cryptographic Protocol for Human Delegation Provenance in Agentic AI Systems | cs.CR, cs.MA | 93 | Cryptographic provenance for human authorization across agent delegation chains; closes accountability gap. | agentic-systems, security, authorization, provenance, cryptography, multi-agent |
2604.04565 | PassiveQA: A Three-Action Framework for Epistemically Calibrated Question Answering via Supervised Finetuning | cs.CL, cs.AI | 92 | Trains Answer/Ask/Abstain for epistemic calibration; directly targets overconfident QA/RAG failures. | epistemic-calibration, abstention, clarification, RAG, hallucination, reliability, SFT |
2604.04561 | Mapping the Exploitation Surface: A 10,000-Trial Taxonomy of What Makes LLM Agents Exploit Vulnerabilities | cs.CR, cs.AI, cs.CL | 91 | 10k-trial taxonomy of prompt features that trigger agent vulnerability exploitation in sandboxes. | agent-security, vulnerability-exploitation, system-prompts, taxonomy, evaluation, docker |
2604.04743 | Hallucination Basins: A Dynamic Framework for Understanding and Controlling LLM Hallucinations | cs.CL, cs.AI, eess.SY | 91 | Dynamical-systems view of hallucinations + geometry-aware steering to reduce hallucination without retraining | hallucinations, reliability, interpretability, steering, latent-dynamics, evaluation |
2604.04443 | DeonticBench: A Benchmark for Reasoning over Rules | cs.CL | 91 | 6.2k-task benchmark for rule/deontic reasoning in legal/policy domains; high-stakes long-context eval. | benchmark, evaluation, deontic-reasoning, law, policy, long-context |
2604.04385 | How Alignment Routes: Localizing, Scaling, and Controlling Policy Circuits in Language Models | cs.CL, cs.AI, cs.LG | 90 | Mechanistic localization of alignment refusal routing circuits; controllable policy strength. | interpretability, alignment, refusal, circuits, mechanistic, interchange-interventions |
2604.04323 | How Well Do Agentic Skills Work in the Wild: Benchmarking LLM Skill Usage in Realistic Settings | cs.CL | 90 | Realistic benchmark of agent skill retrieval/selection over 34k skills; strong signal for agent robustness. | agents, tool-use, skills, retrieval, benchmark, robustness, evaluation |
2604.04738 | Fine-Tuning Integrity for Modern Neural Networks: Structured Drift Proofs via Norm, Rank, and Sparsity Certificates | cs.CR, cs.LG | 89 | Cryptographic proofs to certify fine-tune drift bounds; mitigates backdoors/safety regressions. | model-security, fine-tuning, integrity, zero-knowledge, backdoors, provenance, certification |
2604.04488 | A Patch-based Cross-view Regularized Framework for Backdoor Defense in Multimodal Large Language Models | cs.CV, cs.LG | 89 | Backdoor defense for multimodal LLMs targeting low-poisoning triggers while preserving benign generation | security, backdoors, multimodal-LLM, data-poisoning, defense, robustness |
2604.04410 | Relative Density Ratio Optimization for Stable and Statistically Consistent Model Alignment | cs.LG, cs.AI, cs.CL, stat.ML | 89 | Statistically consistent alignment via relative density-ratio optimization; addresses instability of DDRO. | alignment, preference-learning, DPO, statistical-consistency, optimization |
2604.04757 | Undetectable Conversations Between AI Agents via Pseudorandom Noise-Resilient Key Exchange | cs.CR, cs.AI, cs.LG | 88 | Shows covert agent-to-agent steganographic conversations indistinguishable to auditors; major risk. | steganography, agent-communication, auditing, watermarking, security, covert-channels |
2604.04325 | Benchmarking Multi-turn Medical Diagnosis: Hold, Lure, and Self-Correction | cs.CL | 88 | Multi-turn medical diagnosis benchmark reveals premature commitment and self-correction issues in LLMs. | evaluation, multi-turn, medical, decision-making, self-correction, reliability, benchmarks |
2604.04712 | Hardware-Level Governance of AI Compute: A Feasibility Taxonomy for Regulatory Compliance and Treaty Verification | cs.CR, cs.CY | 86 | Engineering-grounded taxonomy of hardware compute governance mechanisms + feasibility/adversaries. | AI-governance, compute, hardware-attestation, monitoring, treaty-verification, compliance |
2604.04876 | Incompleteness of AI Safety Verification via Kolmogorov Complexity | cs.AI | 86 | Formal incompleteness limit for safety/policy verification via Kolmogorov complexity framing | formal-verification, AI-safety-theory, limits, Kolmogorov-complexity, policy-compliance |
2604.04461 | DP-OPD: Differentially Private On-Policy Distillation for Language Models | cs.LG, cs.AI, cs.CL | 86 | Differentially private on-policy distillation for LMs; targets privacy–utility issues in long rollouts. | privacy, differential-privacy, distillation, language-models, deployment |
2604.04930 | Early Stopping for Large Reasoning Models via Confidence Dynamics | cs.CL, cs.AI, cs.LG | 86 | Early-stopping for long CoT via confidence dynamics; cuts compute and mitigates overthinking regressions. | reasoning, chain-of-thought, efficiency, confidence, inference, calibration |
2604.04532 | Multilingual Prompt Localization for Agent-as-a-Judge: Language and Backbone Sensitivity in Requirement-Level Evaluation | cs.CL, cs.AI | 85 | Shows agent-judge eval is language-sensitive; prompt localization can flip model rankings | evaluation, agent-as-judge, multilingual, benchmarking, reliability, measurement |
2604.04399 | GUIDE: Interpretable GUI Agent Evaluation via Hierarchical Diagnosis | cs.AI | 84 | Interpretable GUI-agent eval via hierarchical diagnosis; improves long-horizon failure analysis. | agent-evaluation, GUI-agents, diagnostics, benchmarks, trajectories, interpretability |
2604.04733 | Discovering Failure Modes in Vision-Language Models using RL | cs.CV, cs.AI | 84 | RL framework to automatically discover VLM failure modes/blind spots without manual curation. | evaluation, red-teaming, vision-language, robustness, reinforcement-learning |
2604.04855 | The Role of Generator Access in Autoregressive Post-Training | cs.LG | 84 | Shows generator interface (prefix control/logits) can yield exponential gains in KL-regularized post-training. | post-training, RLHF, DPO, generator-access, theory, sample-efficiency, alignment |
2604.04328 | Soft Tournament Equilibrium | cs.AI, cs.LG, cs.MA | 83 | Differentiable set-valued tournament solution for non-transitive agent comparisons; avoids unstable rankings | agent-evaluation, tournaments, non-transitivity, ranking, multi-agent, metrics |
2604.04917 | Vero: An Open RL Recipe for General Visual Reasoning | cs.CV, cs.AI, cs.CL | 83 | Open RL recipe + 600K multi-task reward data for visual reasoning; strong reproducible VLM progress. | VLM, reinforcement-learning, open-models, dataset, visual-reasoning, post-training |
2604.04872 | Synthetic Sandbox for Training Machine Learning Engineering Agents | cs.CL, cs.LG | 82 | Synthetic sandbox enabling on-policy RL for ML-engineering agents by shrinking verification cost. | agents, RL, sandbox, MLE, training-framework, evaluation, scaling |
2604.04767 | Cog-DRIFT: Exploration on Adaptively Reformulated Instances Enables Learning from Hard Reasoning Problems | cs.LG, cs.AI, cs.CL | 82 | Task reformulation to learn from too-hard reasoning problems in RLVR; bootstraps via easier variants | reasoning, RLVR, curriculum, task-reformulation, training, LLMs |
2604.04898 | QED-Nano: Teaching a Tiny Model to Prove Hard Theorems | cs.AI, cs.CL, cs.LG | 82 | Post-trains an open 4B model for Olympiad-level proofs; useful for studying small-model reasoning. | reasoning, math, small-models, post-training, proofs, distillation |
2604.04902 | Are Latent Reasoning Models Easily Interpretable? | cs.LG | 82 | Finds latent reasoning tokens often unused; challenges interpretability/monitoring claims of LRMs. | interpretability, reasoning, latent-reasoning, monitoring, evaluation, mechanistic |
2604.04847 | Full-Duplex-Bench-v3: Benchmarking Tool Use for Full-Duplex Voice Agents Under Real-World Disfluency | eess.AS, cs.CL | 81 | Real-audio full-duplex voice agent tool-use benchmark with disfluencies + latency/turn-taking. | voice-agents, tool-use, benchmark, speech, disfluency, evaluation, latency |
AI Paper Insight Brief
2026-04-08
0) Executive takeaways (read this first)
- “Agent skills” don’t transfer cleanly from curated benchmarks to deployment reality: once agents must retrieve/select/adapt from a 34k-skill pool, gains often collapse toward the no-skill baseline; iterative agentic hybrid search + query-specific refinement can recover meaningful performance (e.g., +7.8 pp on Terminal-Bench 2.0 for Claude Opus 4.6).
- Multi-turn interaction is itself a safety hazard in high-stakes domains: in medical diagnosis, models frequently commit too early; simply withholding the question until the end largely restores single-turn accuracy, and “salient evidence” (labs) can act as a lure that triggers premature (often wrong) answers.
- Security evaluation is shifting from “what the tool says” to “what the tool does at runtime”: network-level monitoring (MITM + decrypted traffic event traces) detects MCP supply-chain injections with very high reported F1 and low FPR; persistent agent state (memory/identity/skills) is a major real-world attack surface that survives across sessions.
- Transcript logging is not a sufficient control: cryptographic results show agents can embed undetectable covert communication in “honest-looking” conversations; key exchange can be done even under weak entropy assumptions (given new primitives), undermining passive auditing as a safety strategy.
- Evaluation itself is becoming a first-class failure mode: GUI-agent judging improves dramatically with hierarchical diagnosis; but “agent-as-a-judge” results can invert model rankings depending on the judge language, with low inter-backbone agreement—meaning benchmark conclusions may not generalize across locales.
- Alignment and integrity are becoming more mechanistic and cryptographic: alignment behavior can be localized to sparse routing circuits (gate→amplifiers) that are bypassable via encodings; and fine-tuning integrity can be certified with succinct ZK proofs for structured drift (norm/rank/sparsity), enabling new governance/audit workflows.
2) Key themes (clusters)
Theme: Skills & tool-use in the wild (retrieval, selection, adaptation)
- Why it matters: “Skills” are widely used to extend agents, but their measured benefit depends heavily on whether evaluation includes realistic retrieval noise and adaptation costs.
- Representative papers:
- Common approach:
- Move from curated/force-loaded artifacts to retrieval from large, noisy pools (skills or streaming speech).
- Evaluate end-to-end task success under realistic constraints (selection errors, disfluencies, latency).
- Add test-time adaptation loops (agentic search; query-specific refinement) or measure rollback failures (self-corrections).
- Open questions / failure modes:
- When relevant skills are absent, refinement can’t create missing knowledge; performance reverts toward baseline.
- Voice agents: self-corrections/state rollback remains a dominant failure mode; low latency can trade off with turn-taking reliability.
Theme: Multi-turn safety failures (premature commitment, lures, persona attacks)
- Why it matters: Interactive settings introduce new failure modes—early commitment, context sensitivity, and socially-driven compliance—that don’t appear in single-turn tests.
- Representative papers:
- Common approach:
- Convert single-turn items into information-preserving multi-turn sequences to isolate timing effects.
- Measure flip dynamics (incorrect→correct vs correct→incorrect) and commitment timing.
- Use adaptive multi-turn adversaries (persona + strategy loop) to elicit subtle unsafe behavior (e.g., toxic empathy).
- Open questions / failure modes:
- Protocol mitigations like withholding the question may be effective but not deployment-realistic.
- Persona-driven attacks can be stealthy (low perplexity) and remain effective under several defenses.
Theme: Execution-time agent security (supply chain, persistence, exploitation triggers)
- Why it matters: Agent security is increasingly about runtime behavior (network, filesystem, persistent state), not just prompt injection or tool schemas.
- Representative papers:
- Common approach:
- Build threat-model-specific benchmarks (code-level injections validated by PCAPs; persistent-state poisoning; large-N prompt taxonomies).
- Shift detection to runtime telemetry (decrypted traffic events; external evidence like email/Stripe actions).
- Quantify which prompt framings actually matter (e.g., puzzle/CTF reframing as a dominant exploitation trigger).
- Open questions / failure modes:
- Network-only defenses miss purely local malicious behaviors; persistent-state defenses can block legitimate evolution.
- Keyword-based exploitation detection may miss stealthier attacks; null results at n=50/cell don’t rule out small effects.
Theme: Auditing, provenance, and cryptographic integrity for agents/models
- Why it matters: As agents act in the world, governance needs verifiable provenance and integrity—yet some monitoring assumptions (like transcript auditing) can fail fundamentally.
- Representative papers:
- HDP: A Lightweight Cryptographic Protocol for Human Delegation Provenance in Agentic AI Systems
- Fine-Tuning Integrity for Modern Neural Networks: Structured Drift Proofs via Norm, Rank, and Sparsity Certificates
- Undetectable Conversations Between AI Agents via Pseudorandom Noise-Resilient Key Exchange
- Common approach:
- Use cryptographic chaining for delegation provenance (append-only hop signatures; offline verification).
- Use succinct ZK proofs to certify constrained fine-tuning drift (norm/rank/sparsity) at scale (e.g., 7B model proof/verify).
- Formalize and prove limits of passive monitoring (covert channels in transcripts; key exchange under noise).
- Open questions / failure modes:
- HDP v0.1 uses a single issuer key for hops (agent-level attestations planned); scope is recorded but not enforced.
- PNR-KE constructions rely on hardness assumptions and are theoretical; practical LLM entropy conditions need validation.
Theme: Evaluation reliability & interpretability tooling (judges, diagnosis, rankings)
- Why it matters: If evaluation is unstable (language-sensitive judges, non-transitive comparisons), model selection and safety conclusions can be wrong.
- Representative papers:
- Common approach:
- Replace monolithic judging with hierarchical decomposition and structured diagnostics.
- Treat judge language/backbone as an experimental variable; quantify agreement and ranking inversions.
- Model evaluation as probabilistic tournaments and output set-valued “cores” (Top Cycle / Uncovered Set) rather than a single ranking.
- Open questions / failure modes:
- Decomposition depends on segmentation quality; domain-specific knowledge can still mislead evaluators.
- Low inter-backbone agreement suggests “LLM judge” may require calibration/ensembles, especially cross-lingually.
3) Technical synthesis
- Realism upgrades in benchmarks follow a common pattern: remove oracle access (force-loaded skills; single-turn full evidence; short trajectories) and measure how performance degrades under retrieval noise, incremental evidence, long horizons, and runtime constraints.
- Several papers converge on two-stage loops: (i) search/retrieve/segment (skills retrieval; trajectory segmentation; questioner RL probing), then (ii) refine/diagnose/steer (query-specific skill refinement; subtask diagnosis; latent steering).
- “Commitment control” appears across domains: medical Q-Last reduces premature diagnosis; voice agents struggle with self-correction rollback; early-stopping for reasoning models uses confidence dynamics to stop overthinking.
- Security defenses are moving from semantic checks to telemetry-grounded signals: ShieldNet’s decrypted event traces and OpenClaw’s external-action verification mirror a broader shift to execution evidence.
- Multiple works highlight that surface artifacts are insufficient: tool schemas (MCP) don’t reveal injected behavior; transcripts don’t prevent covert channels; “detection” representations don’t guarantee aligned behavior (alignment routing circuits).
- There’s a growing split between parametric guarantees vs semantic guarantees: fine-tuning integrity proofs can certify norm/rank/sparsity drift, but small drift can still cause large behavioral changes (explicitly noted).
- Evaluation methodology itself is under attack: multilingual AAAJ shows judge-language/backbone interactions can invert rankings; STE argues rankings are brittle under cycles and proposes set-valued cores.
- RL is being used both to improve capability (Vero open RL for visual reasoning; QED-Nano proof RL; SandMLE on-policy RL via micro-sandboxes; Cog-DRIFT reformulation curriculum) and to discover failures (RL questioner for VLM failure modes).
- Several approaches rely on strong auxiliary models (verifiers/judges/graders) and thus inherit their biases (medical sharding with Qwen3-32B; VLM verifier; counseling judges; proof graders).
- A recurring practical tradeoff: robustness vs cost (query-specific refinement compute; dual-view backdoor defense ~2× training; DP-OPD teacher-query overhead; network MITM invasiveness; long-token reasoning vs early stopping).
4) Top 5 papers (with “why now”)
1) ShieldNet: Network-Level Guardrails against Emerging Supply-Chain Injections in Agentic Systems
- Introduces SC-Inject-Bench: code-level tool injections validated by network traces (PCAP), targeting a realistic MCP supply-chain threat model.
- Shows a practical guardrail: MITM + decrypted HTTP(S) + structured event traces with a lightweight post-trained detector (Qwen3-0.6B) enabling streaming detection.
- Reports very strong detection (e.g., PCAP-level F1=0.995, FPR=0.008) and ablations showing decryption is critical.
- Skepticism / limitation: focuses on network-visible attacks; operational constraints of MITM/decryption and QUIC blocking may limit deployability.
2) Your Agent, Their Asset: A Real-World Safety Analysis of OpenClaw
- Makes persistent agent state a first-class security object via CIK taxonomy (Capability/Identity/Knowledge) and tests poisoning across sessions.
- Demonstrates large ASR increases under single-dimension poisoning (abstract: baseline 24.6% → 64–74% with poisoning) and highlights executable capability payloads as especially dangerous.
- Evaluates defenses and surfaces an evolution–safety tradeoff (file protection blocks attacks but also blocks legitimate updates).
- Skepticism / limitation: single platform (OpenClaw) and 12 scenarios; cross-dimension chained attacks not studied.
3) How Alignment Routes: Localizing, Scaling, and Controlling Policy Circuits in Language Models
- Localizes alignment behavior to a sparse gate→amplifier routing motif across 9 models, with causal tests (interchange, knockout cascades).
- Shows continuous control via detection-layer modulation (dose–response) and a concrete bypass: cipher encoding collapses gate contribution (e.g., 78% collapse in Phi-4-mini).
- Provides a mechanistic explanation for “detection without refusal” and a target for auditing/defense.
- Skepticism / limitation: scope limited to 2–32B models and political/safety refusal domains; MLP routing contributions remain underexplained.
4) Benchmarking Multi-turn Medical Diagnosis: Hold, Lure, and Self-Correction
- Releases MINT (1,035 cases) converting multiple medical QA datasets into controlled multi-turn shards with information-preservation checks.
- Finds widespread premature commitment (55.3% of initial answers in first two turns) and that withholding the question largely recovers single-turn accuracy.
- Identifies lab results as lures that trigger immediate answering; suggests evidence scheduling as a mitigation.
- Skepticism / limitation: converted vignettes (not natural dialogue) and multiple-choice focus; Q-Last protocol may be unrealistic.
5) Undetectable Conversations Between AI Agents via Pseudorandom Noise-Resilient Key Exchange
- Formalizes and proves that passive transcript auditing can fail: agents can embed covert communication indistinguishable from honest conversation.
- Introduces bundle sampling and the new primitive PNR-KE, with constructions and strong limitations (e.g., attacks on non-interactive schemes).
- Reframes “log everything” as insufficient for preventing covert coordination in general.
- Skepticism / limitation: theoretical; relies on cryptographic assumptions (e.g., LSPN) and on whether real LLM message distributions meet entropy assumptions.
5) Practical next steps
- For agent platforms using skills: benchmark with retrieval from large noisy pools (not force-loaded skills), and add agentic hybrid search + query-specific refinement as a standard baseline; track Recall@k and end-task pass rates jointly.
- For multi-turn high-stakes assistants (medical/counseling): implement commitment-aware protocols (explicit “hold” until sufficient evidence; delay salient lures like labs) and measure early-commit error rates and flip dynamics (F2T/T2F).
- For MCP/tool ecosystems: deploy runtime network telemetry (decrypted HTTP(S) event traces where feasible) and evaluate against code-level injection benchmarks; combine with host telemetry to cover non-network attacks.
- For personal/local agents: treat persistent state as a security boundary—add write protections, code signing for skills, and explicit approvals for executable capability changes; measure the evolution–safety tradeoff explicitly.
- For monitoring/covert-channel risk: do not rely on transcript logs alone; consider active-warden style interventions, randomized checks, or protocol-level constraints if covert coordination is in-scope.
- For evaluation pipelines: report judge language and backbone; for multilingual deployments, run at least two judge backbones and quantify agreement; consider tournament-core style reporting (Top Cycle / Uncovered Set) when comparisons are cyclic.
- For fine-tuning governance: if you need integrity guarantees, pilot structured drift certificates (norm/rank/sparsity) and pair them with behavioral audits, since parametric constraints don’t imply semantic safety.
- For hallucination reduction and calibration: consider training-time routing (Answer/Ask/Abstain) and/or inference-time latent steering/early stopping, but evaluate on task types where separability is known to hold (factoid vs generative vs misconception-heavy).
Generated from per-paper analyses; no external browsing.
