Daily AI Paper Report (2026-04-13)
Run stats
- Candidates: 3253
- Selected: 30
- Deepread completed: 30
- Window (UTC): 2026-04-10T00:00:00Z → 2026-04-11T00:00:00Z (weekend_backlog_sat, expanded=0)
Selected papers
| arXiv ID | Title / Links | Categories | Score | Why | Tags |
|---|---|---|---|---|---|
| 2604.07835 | Silencing the Guardrails: Inference-Time Jailbreaking via Dynamic Contextual Representation Ablation | cs.AI | 96 | New efficient inference-time jailbreak method via hidden-state subspace ablation; high safety relevance | jailbreaks, inference-time attacks, representation engineering, guardrails, robustness |
| 2604.08401 | Verify Before You Commit: Towards Faithful Reasoning in LLM Agents via Self-Auditing | cs.AI, cs.CL | 93 | Self-auditing verification for LLM-agent beliefs to prevent drift in long-horizon tool use. | llm-agents, self-auditing, faithful-reasoning, verification, agent-safety, long-horizon |
| 2604.08291 | VCAO: Verifier-Centered Agentic Orchestration for Strategic OS Vulnerability Discovery | cs.GT, cs.CR, cs.OS | 92 | Agentic orchestration with verifiers for OS vuln discovery; strong security/agent workflow framing | agents, cybersecurity, vulnerability discovery, verification, tool-augmented LLMs, game theory |
| 2604.08304 | Securing Retrieval-Augmented Generation: A Taxonomy of Attacks, Defenses, and Future Directions | cs.CR, cs.AI | 92 | Clear secure-RAG framing + taxonomy across pipeline stages; useful for audits/defenses. | RAG, security, taxonomy, prompt-injection, data-poisoning, threat-modeling, LLM-systems |
| 2604.08455 | KnowU-Bench: Towards Interactive, Proactive, and Personalized Mobile Agent Evaluation | cs.AI | 92 | Interactive benchmark for proactive/personalized mobile agents incl. consent/when-to-act decisions | agents, evaluation, mobile, personalization, proactivity, human-in-the-loop, safety |
| 2604.08388 | Awakening the Sleeping Agent: Lean-Specific Agentic Data Reactivates General Tool Use in Goedel Prover | cs.AI | 92 | Shows agentic tool-use can collapse after SFT and be restored with ~100 traces | agents, tool-use, capability-recovery, fine-tuning, formal-math, function-calling |
| 2604.07778 | The Accountability Horizon: An Impossibility Theorem for Governing Human-Agent Collectives | cs.AI | 90 | Formal impossibility result for accountability in human-agent collectives as autonomy grows | agent-governance, accountability, formal-methods, causal-models, multi-agent |
| 2604.07745 | The Cartesian Cut in Agentic AI | cs.AI, q-bio.NC | 90 | Conceptual framework for where control lives in LLM agents; governance implications. | agents, agent-architecture, governance, control, conceptual |
| 2604.08326 | ProMedical: Hierarchical Fine-Grained Criteria Modeling for Medical LLM Alignment via Explicit Injection | cs.AI | 89 | Fine-grained medical alignment w/ explicit criteria + multidimensional reward model; useful safety pattern | alignment, medical LLMs, reward modeling, rubrics, safety constraints, datasets |
| 2604.07853 | QaRL: Rollout-Aligned Quantization-Aware RL for Fast and Stable Training under Training–Inference Mismatch | cs.LG, cs.AI | 89 | Quantization-aware RL aligns rollout precision to stabilize LLM RL and speed training | LLM-RL, post-training, quantization, efficiency, training-inference-mismatch |
| 2604.05795 | Measuring What Matters!! Assessing Therapeutic Principles in Mental-Health Conversation | cs.CL | 88 | FAITH-M benchmark scores therapist responses on expert therapeutic principles; strong safety eval value. | evaluation, mental-health, alignment, safety, benchmark, rubric, clinical |
| 2604.08457 | CrashSight: A Phase-Aware, Infrastructure-Centric Video Benchmark for Traffic Crash Scene Understanding and Reasoning | cs.CV, cs.AI, cs.RO | 88 | Safety-critical VLM benchmark for real crash videos; tests grounding + causal/mechanistic reasoning | evaluation, vlm, autonomous-driving, safety, video, reasoning, benchmark |
| 2604.08124 | Beyond Stochastic Exploration: What Makes Training Data Valuable for Agentic Search | cs.AI | 88 | Improves RL-trained LLM search agents via hierarchical experience; targets stability/efficiency. | agents, search, reinforcement-learning, reasoning, training-stability |
| 2604.07264 | Validated Intent Compilation for Constrained Routing in LEO Mega-Constellations | cs.CR, cs.AI | 86 | LLM intent compiler + verifier loop for safety-critical routing constraints; strong benchmark results | LLM, program-synthesis, verification, tool-use, networking, constraints, reliability |
| 2604.06693 | Aegon: Auditable AI Content Access with Ledger-Bound Tokens and Hardware-Attested Mobile Receipts | cs.CR, cs.CY | 86 | Auditable content-access protocol with append-only ledger proofs; practical governance infra. | AI-governance, auditing, content-licensing, cryptography, transparency-logs, attestation, JWT |
| 2604.08340 | PokeGym: A Visually-Driven Long-Horizon Benchmark for Vision-Language Models | cs.CV, cs.AI | 86 | Long-horizon VLM benchmark in complex 3D game with strict RGB-only isolation and evaluator | benchmarks, embodied-agents, vision-language-models, long-horizon, evaluation |
| 2604.04604 | AI Agents Under EU Law | cs.CY, cs.AI, cs.CR, cs.MA | 86 | Systematic mapping of EU AI Act+GDPR etc. obligations for autonomous AI agents. | ai-agents, governance, regulation, EU-AI-Act, GDPR, compliance, risk-management |
| 2604.07054 | Sell More, Play Less: Benchmarking LLM Realistic Selling Skill | cs.CL | 86 | Realistic sales benchmark + auto eval; tests goal-directed persuasion in multi-turn dialogs | benchmark, dialogue, persuasion, evaluation, user-simulation, DPO |
| 2604.08519 | Cram Less to Fit More: Training Data Pruning Improves Memorization of Facts | cs.CL, stat.ML | 86 | Data pruning to improve factual memorization; info-theoretic framing of capacity limits | factuality, hallucinations, data-selection, memorization, scaling, information-theory |
| 2604.07007 | AgentCity: Constitutional Governance for Autonomous Agent Economies via Separation of Power | cs.MA, cs.AI, cs.CY | 84 | Governance architecture for open agent economies using separation-of-powers; novel but blockchain-heavy | agent governance, multi-agent systems, auditing, mechanism design, smart contracts |
| 2604.07967 | AtomEval: Atomic Evaluation of Adversarial Claims in Fact Verification | cs.CL, cs.AI | 84 | AtomEval detects semantic corruption in adversarial claim rewrites; improves fact-checking robustness eval. | fact-verification, adversarial-evaluation, metrics, robustness, nlp-eval |
| 2604.08003 | Rethinking Entropy Allocation in LLM-based ASR: Understanding the Dynamics between Speech Encoders and LLMs | eess.AS, cs.CL, cs.SD | 84 | Entropy-allocation view for LLM-ASR; targets hallucinations + latency with principled training strategy | hallucinations, ASR, LLM, reliability, training, evaluation-metrics |
| 2604.08539 | OpenVLThinkerV2: A Generalist Multimodal Reasoning Model for Multi-domain Visual Tasks | cs.CV, cs.AI, cs.CL | 84 | New RL objective (Gaussian GRPO) to stabilize multi-task reward topologies for multimodal reasoning | multimodal, rl, post-training, optimization, reasoning, grpo |
| 2603.17692 | Can Blindfolded LLMs Still Trade? An Anonymization-First Framework for Portfolio Optimization | cs.LG, cs.AI, q-fin.CP, q-fin.PM | 84 | Anonymization-first eval for LLM trading agents to reduce memorization/survivorship bias. | llm-agents, evaluation, data-leakage, memorization, finance, multi-agent |
| 2604.06148 | Who Governs the Machine? A Machine Identity Governance Taxonomy (MIGT) for AI Systems Operating Across Enterprise and Geopolitical Boundaries | cs.CR, cs.AI, cs.MA | 84 | Taxonomy for machine identity governance; relevant to agent credentials, tokens, and abuse. | agent-security, machine-identities, access-control, credentials, governance, risk-taxonomy, enterprise |
| 2604.07892 | Data Selection for Multi-turn Dialogue Instruction Tuning | cs.CL, cs.AI | 84 | Dialogue-level data selection for multi-turn instruction tuning; tackles noise and drift. | instruction-tuning, data-selection, multi-turn, post-training, dataset-quality |
| 2603.22709 | Who Spoke What When? Evaluating Spoken Language Models for Conversational ASR with Semantic and Overlap-Aware Metrics | cs.CL, eess.AS | 84 | New semantic+overlap-aware ASR metrics; probes LLM robustness in multi-speaker settings | evaluation, speech, ASR, LLM-robustness, metrics, overlap, semantic-fidelity |
| 2604.06814 | OmniTabBench: Mapping the Empirical Frontiers of GBDTs, Neural Networks, and Foundation Models for Tabular Data at Scale | cs.LG, cs.AI | 84 | 3030-dataset tabular benchmark; large-scale comparison of GBDT/NN/foundation models | benchmark, tabular, evaluation, foundation-models, GBDT |
| 2604.08417 | Vulnerability Detection with Interprocedural Context in Multiple Languages: Assessing Effectiveness and Cost of Modern LLMs | cs.SE, cs.CR | 84 | Empirical study of LLM vuln detection with interprocedural context; cost vs accuracy | security, vulnerability-detection, code-LLMs, evaluation, interprocedural-analysis |
| 2604.01554 | EXHIB: A Benchmark for Realistic and Diverse Evaluation of Function Similarity in the Wild | cs.CR, cs.LG, cs.SE | 82 | EXHIB benchmark for binary function similarity; broad, realistic security eval suite with model comparisons | benchmarks, software security, binary analysis, evaluation, vulnerability analysis |
AI Paper Insight Brief
2026-04-13
0) Executive takeaways (read this first)
- Evaluation is shifting from surface metrics to “meaning-/structure-preserving” metrics: tcpSemER for conversational ASR and AtomEval for adversarial fact verification both show that common metrics can dramatically misstate progress/robustness when paraphrase or semantic corruption is involved.
- Agent safety is increasingly about the interfaces and governance layers around models: EU-law mapping for agents, machine-identity governance (MIGT), and RAG security taxonomies all converge on “external actions + toolchains + identity + auditability” as the real compliance/security boundary.
- Inference-time and training-time mismatches are a recurring failure mode: quantized rollouts destabilize RL (QaRL/TBPO), LLM-ASR joint training can drift into hallucination (entropy allocation + IA-SFT), and heavy SFT can suppress tool-use (“agentic collapse”)—all pointing to the need for explicit alignment between what’s optimized and what’s deployed.
- Long-horizon embodied/GUI agents still fail on low-level recovery and initiative calibration: PokeGym identifies deadlock/collision recovery as the dominant bottleneck; KnowU-Bench shows large drops on personalized/proactive tasks even for strong models.
- Security research is becoming more “systems + economics”: VCAO’s game-theoretic orchestration improves validated vuln yield per budget; EXHIB exposes generalization gaps in binary function similarity detection (BFSD) across firmware/semantic variation; interprocedural context in LLM vuln detection often hurts accuracy while doubling cost.
2) Key themes (clusters)
Theme: Validity-aware evaluation (semantic/structure > surface form)
- Why it matters: As models become more fluent, they can “look wrong” while being semantically right (ASR paraphrase) or “look similar” while changing the proposition (adversarial claim rewrites). Robust evaluation needs invariances aligned to the task’s truth conditions.
- Representative papers:
- Who Spoke What When? Evaluating Spoken Language Models for Conversational ASR with Semantic and Overlap-Aware Metrics
- AtomEval: Atomic Evaluation of Adversarial Claims in Fact Verification
- EXHIB: A Benchmark for Realistic and Diverse Evaluation of Function Similarity in the Wild
- OmniTabBench: Mapping the Empirical Frontiers of GBDTs, Neural Networks, and Foundation Models for Tabular Data at Scale
- Common approach:
- Replace brittle string metrics with semantic or atomic units (tcpSemER embeddings; SROM tuples + hard structural gate).
- Decompose performance by hard regimes (overlap vs non-overlap; low/mid/high binary variation; dataset metafeatures).
- Use negative controls / audits to detect artifacts (e.g., leakage checks in other domains; here: validity-aware success vs raw ASR).
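As a toy illustration of the first bullet, here is a permutation-invariant, embedding-based error score. This is not tcpSemER (which additionally enforces time-collar constraints and diarization-aware matching); the `semantic_error` name, the brute-force matching, and the use of precomputed segment embeddings are all illustrative assumptions:

```python
import itertools
import math

def semantic_error(ref_embs, hyp_embs):
    """Permutation-invariant mean cosine distance between matched segments.

    ref_embs / hyp_embs: equal-length lists of segment embedding vectors
    (e.g., from a sentence encoder). The best assignment of hypothesis
    segments to reference segments is used, so segment ordering cannot
    inflate the error the way raw string alignment can.
    """
    def cos_dist(a, b):
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return 1.0 - sum(x * y for x, y in zip(a, b)) / (na * nb)

    # Brute-force over assignments; fine for toy segment counts only.
    best = min(
        sum(cos_dist(r, h) for r, h in zip(ref_embs, perm))
        for perm in itertools.permutations(hyp_embs)
    )
    return best / len(ref_embs)
```

Swapping two semantically identical segments yields an error near zero, whereas a surface-form metric would count every reordered word as wrong.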
- Open questions / failure modes:
- Semantic metrics can under-penalize critical details (e.g., diarization/time alignment still matters in ASR).
- Atomic decomposition quality becomes a bottleneck (AtomEval depends on extractor accuracy).
- Benchmarks can still encode selection bias (OmniTabBench filters out “too easy/hard” datasets; EXHIB’s coverage still leaves room for broader semantic diversity).
Theme: Agent governance, compliance, and identity as first-class engineering
- Why it matters: For tool-using agents, risk is triggered by external actions (data flows, privileges, cross-party chains), not model internals. Compliance and security require inventories, identity controls, and audit trails that survive runtime drift.
- Representative papers:
- AI Agents Under EU Law
- Who Governs the Machine? A Machine Identity Governance Taxonomy (MIGT) for AI Systems Operating Across Enterprise and Geopolitical Boundaries
- Securing Retrieval-Augmented Generation: A Taxonomy of Attacks, Defenses, and Future Directions
- Aegon: Auditable AI Content Access with Ledger-Bound Tokens and Hardware-Attested Mobile Receipts
- Common approach:
- Map systems via taxonomies tied to triggers/surfaces (EU legal triggers by agent action; RAG pipeline stages; identity risk domains).
- Emphasize auditability (external-action inventories; tamper-evident logs; provenance events; attested receipts).
- Treat machine identities / non-human identities (NHIs) as core security objects (registries, cryptographic IDs, JIT credentials, tamper-evident trails).
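To make the auditability bullets concrete, here is a minimal hash-chained external-action log. The `append_action`/`verify_chain` functions are hypothetical; real systems like Aegon bind tokens to a ledger and add hardware attestation, which this sketch omits:

```python
import hashlib
import json

def _digest(action, prev_hash):
    """Deterministic SHA-256 over the action payload plus the previous hash."""
    payload = json.dumps({"action": action, "prev": prev_hash}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def append_action(log, action):
    """Append an external-action record whose hash covers the previous
    entry, so any later edit to earlier entries breaks the chain."""
    prev_hash = log[-1]["hash"] if log else "0" * 64
    log.append({"action": action, "prev": prev_hash,
                "hash": _digest(action, prev_hash)})
    return log

def verify_chain(log):
    """Recompute every hash; return False if any entry was altered."""
    prev = "0" * 64
    for rec in log:
        if rec["prev"] != prev or rec["hash"] != _digest(rec["action"], prev):
            return False
        prev = rec["hash"]
    return True
```

Note this gives tamper evidence, not completeness: as the Aegon discussion below points out, a platform can still omit events entirely, which a hash chain alone cannot detect.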
- Open questions / failure modes:
- Draft standards and “standards-free” timelines create uncertainty for implementers (EU harmonised standards still evolving).
- Audit logs can be tamper-evident but not complete (Aegon: platforms can omit provenance events).
- Cross-jurisdiction conflicts may be irreconcilable in practice (MIGT’s conflict registry highlights governance friction).
Theme: Stabilizing agent training & deployment under mismatch and drift
- Why it matters: Many failures come from mismatch (quantized sampler vs full-precision learner; encoder drift vs LLM priors; specialization suppressing tool use). Fixes increasingly combine systems alignment + objective design + staged training.
- Representative papers:
- QaRL: Rollout-Aligned Quantization-Aware RL for Fast and Stable Training under Training–Inference Mismatch
- Rethinking Entropy Allocation in LLM-based ASR: Understanding the Dynamics between Speech Encoders and LLMs
- Awakening the Sleeping Agent: Lean-Specific Agentic Data Reactivates General Tool Use in Goedel Prover
- OpenVLThinkerV2: A Generalist Multimodal Reasoning Model for Multi-domain Visual Tasks
- Common approach:
- Align training forward pass with deployment inference (QaRL low-bit forward + STE; publish low-bit tensors to sampler).
- Use distribution/advantage shaping to tame heavy tails and inter-task imbalance (TBPO sequence-level clipping; G2RPO Gaussian mapping).
- Stage training to preserve grounding and prevent drift (CTC pretrain + IA-SFT hot-swapping; freeze/align stages).
- Use small targeted data to recover suppressed capabilities (100 Lean agentic traces restore BFCL tool use).
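The rollout-alignment idea in the first bullet can be sketched with a straight-through estimator (STE): the learner's forward pass uses the same low-bit weights as the sampler, while gradients flow as if quantization were the identity. This is a toy scalar version under stated assumptions, not QaRL's implementation; `fake_quantize` and `ste_forward_backward` are illustrative names, and real systems run low-bit GPU kernels:

```python
def fake_quantize(w, bits=4):
    """Symmetric uniform quantization of a weight vector to `bits` levels."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(min(w)), abs(max(w))) / qmax or 1.0  # avoid zero scale
    return [round(x / scale) * scale for x in w]

def ste_forward_backward(w, x, grad_out):
    """Dot-product layer: forward with quantized weights (matching the
    rollout sampler), backward pretending quantization is the identity,
    so gradients land on the full-precision master weights."""
    wq = fake_quantize(w)
    y = sum(wi * xi for wi, xi in zip(wq, x))   # low-bit forward
    grad_w = [grad_out * xi for xi in x]        # STE: dy/dw_i ~= x_i
    return y, grad_w
```

The point of the pattern is that the policy whose log-probs enter the RL objective is arithmetically the same one that generated the rollouts, removing one source of importance-ratio pathology.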
- Open questions / failure modes:
- Sequence-level masking/clipping may introduce sample inefficiency (QaRL notes cost/overhead).
- Hot-swapping thresholds and multi-stage pipelines add operational complexity (ASR entropy allocation approach).
- Capability recovery can harm abstention/irrelevance detection if training is one-sided (Goedel: irrelevance accuracy drops sharply).
Theme: Long-horizon interactive agents: recovery, proactivity, and persuasion
- Why it matters: Real deployments require agents to recover from execution failures, ask clarifying questions, and calibrate initiative/consent—capabilities not captured by static benchmarks.
- Representative papers:
- PokeGym: A Visually-Driven Long-Horizon Benchmark for Vision-Language Models
- KnowU-Bench: Towards Interactive, Proactive, and Personalized Mobile Agent Evaluation
- Sell More, Play Less: Benchmarking LLM Realistic Selling Skill
- Verify Before You Commit: Towards Faithful Reasoning in LLM Agents via Self-Auditing
- Common approach:
- Build interactive benchmarks with controlled scenarios + automated scoring (AOB evaluator in PokeGym; emulator + hybrid judge in KnowU).
- Diagnose failures with process metrics (ineffective moves/deadlocks; clarify/partial preference satisfaction; role inversion).
- Add verification layers before committing actions/memory (SAVER audit–repair loop).
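The verification-layer bullet can be sketched as a generic audit-and-repair loop. This is a hypothetical reduction of the SAVER-style pattern: `checks` and `repair` are caller-supplied stand-ins for the paper's typed-violation machinery, and the bounded-retry policy is an assumption:

```python
def verify_before_commit(proposal, checks, repair, max_repairs=2):
    """Audit a proposed belief/action before committing it.

    checks: list of (name, predicate) pairs; all must pass to commit.
    repair: function(proposal, violation_names) -> repaired proposal.
    Returns (committed_value_or_None, list of violation lists seen).
    """
    history = []
    for _ in range(max_repairs + 1):
        violations = [name for name, check in checks if not check(proposal)]
        if not violations:
            return proposal, history          # safe to commit
        history.append(violations)
        proposal = repair(proposal, violations)  # targeted minimal repair
    return None, history                       # refuse to commit
```

Refusing to commit after bounded repairs (rather than committing the last attempt) is the conservative choice for agent memory, at the cost of the extra audit compute noted below.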
- Open questions / failure modes:
- Simulator dependence (KnowU uses an LLM user simulator; SalesLLM shows simulator choice affects long-horizon outcomes).
- Low-level control brittleness remains dominant (PokeGym: collisions/deadlocks; parametric control brittle).
- Verification adds compute and depends on the same model family for auditing/repair (SAVER overhead and reliance).
Theme: Security & robustness in the wild (benchmarks + orchestration + cost)
- Why it matters: Security tasks are dominated by distribution shift (firmware/semantic BFSD), orchestration under budget (vuln discovery), and cost/latency constraints (LLM vuln detection context expansion).
- Representative papers:
- EXHIB: A Benchmark for Realistic and Diverse Evaluation of Function Similarity in the Wild
- VCAO: Verifier-Centered Agentic Orchestration for Strategic OS Vulnerability Discovery
- Vulnerability Detection with Interprocedural Context in Multiple Languages: Assessing Effectiveness and Cost of Modern LLMs
- Silencing the Guardrails: Inference-Time Jailbreaking via Dynamic Contextual Representation Ablation
- Common approach:
- Stress models on harder, more realistic variation (semantic equivalence, firmware, obfuscation).
- Treat security as decision-making under uncertainty (Stackelberg game + Bayesian updates + cascaded verification).
- Quantify cost/performance explicitly (token costs; solver time; runtime trade-offs).
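A heavily simplified sketch of the cost-aware decision-making bullets: greedy allocation by expected validated severity per unit cost. VCAO itself solves a DOBSS-derived Stackelberg MILP with Bayesian belief updates; the `allocate_budget` function below is only an assumption-laden stand-in for the budget-aware idea:

```python
def allocate_budget(probes, budget):
    """Greedy knapsack-style tool allocation.

    probes: list of (name, p_valid_finding, severity, cost) tuples.
    Ranks probes by expected validated severity per unit cost and
    takes them until the budget is exhausted.
    """
    ranked = sorted(probes, key=lambda p: p[1] * p[2] / p[3], reverse=True)
    plan, spent = [], 0.0
    for name, p, sev, cost in ranked:
        if spent + cost <= budget:
            plan.append(name)
            spent += cost
    return plan, spent
```

Even this toy makes the framing visible: a cheap, high-yield probe outranks an expensive one with a nominally higher severity, which is the "per validated finding, per budget" lens the theme describes.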
- Open questions / failure modes:
- White-box attacks (CRA) highlight fragility but may not translate to black-box settings.
- More context can degrade performance (interprocedural prompts) while increasing cost—suggesting need for selective retrieval rather than naive expansion.
- Benchmarks reveal gaps but don’t automatically yield fixes (EXHIB shows semantic variation remains hard).
3) Technical synthesis
- Multiple papers converge on “pipeline-level invariants”: tcpSemER preserves time collars + permutation invariance; AtomEval enforces relation-structure consistency; RAG security frames threats by pipeline stage; EU agent compliance centers on external-action inventories.
- Decomposition is the new default: overlap vs non-overlap error attribution (CASR), low/mid/high binary variation (EXHIB), metafeature-conditioned winners (OmniTabBench), failure taxonomies (PokeGym deadlocks; KnowU clarify/partial; CrashSight category gaps).
- Mismatch correction appears in three distinct forms:
- Systems mismatch (quantized sampler vs BF16 learner → QaRL aligned low-bit forward).
- Representation drift mismatch (speech encoder becomes too semantic → CTC pretrain + IA-SFT hot-swapping).
- Capability suppression mismatch (domain SFT suppresses tool use → tiny agentic trace reactivation).
- Robustness often requires “hard gates” + “soft scores”: AtomEval hard relation gate + soft degradations; SAVER typed violations + minimal repair; LEO intent compiler uses deterministic 8-pass validator with ACCEPT/REJECT/ABSTAIN.
- Graph structure keeps showing up as a stabilizer/accelerator: SemGAT in anonymized trading; a GAT router distilled from Dijkstra for LEO; attack graphs in VCAO. In both the finance and routing cases, semantic edges are used to propagate relational constraints.
- Cost-aware evaluation is becoming standard: vulnerability detection paper reports token-cost totals and shows context doubles tokens; QaRL reports per-step speedups; VCAO reports MILP solve time (<5s for ~75k vars).
- “Overlap / concurrency” is a core unsolved regime: CASR shows overlap regions dominate errors (~90% of error from ~32% overlap); similar “concurrency” issues appear in multi-agent governance (accountability horizon) and toolchains (RAG trust boundaries).
- Inference-time attacks are moving into representation space: CRA uses gradient-attributed masking to suppress refusal subspaces, suggesting defenses must consider activation integrity, not just prompt filtering.
- Benchmarks increasingly include intervention studies (PokeGym forced recovery improves SR; MDS shows long-dialogue robustness; CrashSight shows fine-tuning gains but persistent perceptual bottlenecks).
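The "hard gates + soft scores" pattern above can be captured in a few lines. This is a generic illustration, not any one paper's scorer; the `gated_score` name and the mean aggregation are assumptions:

```python
def gated_score(candidate, hard_gates, soft_scorers):
    """Score a candidate only if it passes every binary validity gate.

    hard_gates: predicates that must all hold (e.g., relation-structure
    preserved, no typed violations); otherwise the score is zero.
    soft_scorers: graded quality functions averaged for gated candidates.
    """
    if not all(gate(candidate) for gate in hard_gates):
        return 0.0
    scores = [scorer(candidate) for scorer in soft_scorers]
    return sum(scores) / len(scores)
```

The design point is that graded quality can never compensate for a validity failure, which is exactly what keeps semantically corrupted rewrites or unsafe commits from "scoring well" on soft metrics.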
4) Top 5 papers (with “why now”)
1) QaRL: Rollout-Aligned Quantization-Aware RL for Fast and Stable Training under Training–Inference Mismatch
- Aligns learner forward-pass arithmetic with quantized rollout engines to reduce PPO instability from mismatch.
- TBPO introduces sequence-level ratios + dual clipping to suppress “error-token” ratio explosions under quantized decoding.
- Demonstrates near-BF16 recovery while keeping most throughput gains (e.g., Qwen3-30B-A3B: 45.7 → 51.2 vs BF16 52.1).
- Skepticism: still slower than pure quantized-rollout training (1.3× vs 1.4× on MoE) and relies on low-bit kernel availability.
2) Silencing the Guardrails: Inference-Time Jailbreaking via Dynamic Contextual Representation Ablation
- Training-free, inference-time jailbreak that targets refusal subspaces via gradient attribution and masking.
- Large attack-success-rate gains reported across multiple 7B aligned models (e.g., Llama-2-7B-Chat ASR-O 53.0%; λ≈1.0 gives RRSR 96.3%).
- Highlights a concrete latent-space attack surface distinct from prompt-only jailbreaks.
- Skepticism: assumes white-box access to activations/gradients; quality degrades at high suppression strengths.
3) Who Spoke What When? Evaluating Spoken Language Models for Conversational ASR with Semantic and Overlap-Aware Metrics
- Introduces tcpSemER (time-constrained, permutation-invariant semantic error) and overlap-aware tcpWER decomposition.
- Shows overlap dominates errors (NSF1: ~32% overlap accounts for ~90% of error), and semantic metrics reduce sensitivity to normalization.
- Provides a realistic comparison of modular vs LLM-based CASR under increasing overlap/speaker counts.
- Skepticism: primarily evaluation; does not propose architectural fixes for overlap handling.
4) KnowU-Bench: Towards Interactive, Proactive, and Personalized Mobile Agent Evaluation
- Online Android benchmark that tests preference elicitation, proactivity/consent, and post-rejection restraint—beyond navigation.
- Shows strong models drop sharply on hard personalized tasks (e.g., Claude Sonnet 4.6: 60.4% overall vs 44.2% hard personalized).
- Hybrid evaluation (rule checks + LLM judge) better aligns with human ratings than rules alone.
- Skepticism: simulator dependence (LLM user simulator) and synthetic/curated profiles/logs may limit ecological validity.
5) VCAO: Verifier-Centered Agentic Orchestration for Strategic OS Vulnerability Discovery
- Frames vuln discovery as repeated Bayesian Stackelberg game; allocates tool budget via DOBSS-derived MILP + belief updates.
- Claims large gains in severity-weighted validated findings per budget (2.7× vs coverage-only fuzzing) and reduces false positives to ~15.1%.
- Includes a six-layer orchestration architecture and a stated online regret bound.
- Skepticism: relies on rational-attacker assumptions and calibrated tool likelihoods; attack-path enumeration is exponential and needs heuristics.
5) Practical next steps
- Adopt validity-aware metrics in your eval stack: for multi-speaker ASR, add tcpSemER + overlap decomposition; for adversarial fact verification, add atomic-structure validity checks (AtomEval-style) to avoid counting “semantic drift” as successful attacks.
- Instrument agent systems around external actions: build an “external-action inventory” (EU-law paper’s Step 0) and map it to identity, logging, and trust boundaries (MIGT + RAG security taxonomy).
- Harden against representation-space jailbreaks: if you operate open-weight models or internal deployments, test CRA-like activation ablations in a red-team setting to understand whether refusal relies on low-rank directions.
- If doing RL with quantized rollouts, measure mismatch-induced ratio pathologies (token/sequence ratios, error-token frequency) and consider aligned low-bit forward passes + sequence-level clipping/masking (QaRL/TBPO).
- For long-horizon VLM/GUI agents, track process metrics (deadlocks/ineffective moves; clarify rate; intervention/passivity) and run targeted interventions (e.g., deterministic recovery primitives) rather than only improving high-level planning.
- For specialized tool-using models, test for “agentic collapse” after heavy SFT; try small targeted agentic trace injections (including explicit no-tool negatives) to recover tool use without destroying domain skill.
- In security tooling, avoid naive context expansion: interprocedural context can degrade detection while doubling tokens; instead, experiment with selective retrieval of only the most relevant callers/callees and measure cost-per-validated-finding.
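The mismatch diagnostic suggested in the quantized-rollout bullet can be sketched as follows. Function names and the default threshold are illustrative, not taken from QaRL/TBPO:

```python
import math

def sequence_ratio(learner_logps, sampler_logps):
    """Sequence-level importance ratio between learner and sampler
    policies: the product of per-token ratios, computed in log space
    for numerical stability."""
    return math.exp(sum(l - s for l, s in zip(learner_logps, sampler_logps)))

def flag_pathological(batch, threshold=4.0):
    """Return indices of sequences whose ratio leaves [1/threshold,
    threshold], i.e., candidates for mismatch-induced ratio explosion.

    batch: list of (learner_logps, sampler_logps) pairs per sequence.
    """
    flags = []
    for i, (lp, sp) in enumerate(batch):
        r = sequence_ratio(lp, sp)
        if r > threshold or r < 1.0 / threshold:
            flags.append(i)
    return flags
```

Tracking the flagged fraction over training gives an early signal of the instability the brief attributes to quantized sampler/full-precision learner mismatch, before it shows up as reward collapse.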
Generated from per-paper analyses; no external browsing.
