Daily AI Paper Report (2026-04-13)
Run stats
- Candidates: 3253
- Selected: 30
- Deepread completed: 30
- Window (UTC): 2026-04-10T00:00:00Z → 2026-04-11T00:00:00Z (weekend_backlog_sat, expanded=0)
Selected papers
| arXiv ID | Title / Links | Categories | Score | Why | Tags |
|---|---|---|---|---|---|
| 2604.07835 | Silencing the Guardrails: Inference-Time Jailbreaking via Dynamic Contextual Representation Ablation | cs.AI | 96 | New efficient inference-time jailbreak method via hidden-state subspace ablation; high safety relevance | jailbreaks, inference-time attacks, representation engineering, guardrails, robustness |
| 2604.08401 | Verify Before You Commit: Towards Faithful Reasoning in LLM Agents via Self-Auditing | cs.AI, cs.CL | 93 | Self-auditing verification for LLM-agent beliefs to prevent drift in long-horizon tool use. | llm-agents, self-auditing, faithful-reasoning, verification, agent-safety, long-horizon |
| 2604.08291 | VCAO: Verifier-Centered Agentic Orchestration for Strategic OS Vulnerability Discovery | cs.GT, cs.CR, cs.OS | 92 | Agentic orchestration with verifiers for OS vuln discovery; strong security/agent workflow framing | agents, cybersecurity, vulnerability discovery, verification, tool-augmented LLMs, game theory |
| 2604.08304 | Securing Retrieval-Augmented Generation: A Taxonomy of Attacks, Defenses, and Future Directions | cs.CR, cs.AI | 92 | Clear secure-RAG framing + taxonomy across pipeline stages; useful for audits/defenses. | RAG, security, taxonomy, prompt-injection, data-poisoning, threat-modeling, LLM-systems |
| 2604.08455 | KnowU-Bench: Towards Interactive, Proactive, and Personalized Mobile Agent Evaluation | cs.AI | 92 | Interactive benchmark for proactive/personalized mobile agents incl. consent/when-to-act decisions | agents, evaluation, mobile, personalization, proactivity, human-in-the-loop, safety |
| 2604.08388 | Awakening the Sleeping Agent: Lean-Specific Agentic Data Reactivates General Tool Use in Goedel Prover | cs.AI | 92 | Shows agentic tool-use can collapse after SFT and be restored with ~100 traces | agents, tool-use, capability-recovery, fine-tuning, formal-math, function-calling |
| 2604.07778 | The Accountability Horizon: An Impossibility Theorem for Governing Human-Agent Collectives | cs.AI | 90 | Formal impossibility result for accountability in human-agent collectives as autonomy grows | agent-governance, accountability, formal-methods, causal-models, multi-agent |
| 2604.07745 | The Cartesian Cut in Agentic AI | cs.AI, q-bio.NC | 90 | Conceptual framework for where control lives in LLM agents; governance implications. | agents, agent-architecture, governance, control, conceptual |
| 2604.08326 | ProMedical: Hierarchical Fine-Grained Criteria Modeling for Medical LLM Alignment via Explicit Injection | cs.AI | 89 | Fine-grained medical alignment w/ explicit criteria + multidimensional reward model; useful safety pattern | alignment, medical LLMs, reward modeling, rubrics, safety constraints, datasets |
| 2604.07853 | QaRL: Rollout-Aligned Quantization-Aware RL for Fast and Stable Training under Training–Inference Mismatch | cs.LG, cs.AI | 89 | Quantization-aware RL aligns rollout precision to stabilize LLM RL and speed training | LLM-RL, post-training, quantization, efficiency, training-inference-mismatch |
| 2604.05795 | Measuring What Matters!! Assessing Therapeutic Principles in Mental-Health Conversation | cs.CL | 88 | FAITH-M benchmark scores therapist responses on expert therapeutic principles; strong safety eval value. | evaluation, mental-health, alignment, safety, benchmark, rubric, clinical |
| 2604.08457 | CrashSight: A Phase-Aware, Infrastructure-Centric Video Benchmark for Traffic Crash Scene Understanding and Reasoning | cs.CV, cs.AI, cs.RO | 88 | Safety-critical VLM benchmark for real crash videos; tests grounding + causal/mechanistic reasoning | evaluation, vlm, autonomous-driving, safety, video, reasoning, benchmark |
| 2604.08124 | Beyond Stochastic Exploration: What Makes Training Data Valuable for Agentic Search | cs.AI | 88 | Improves RL-trained LLM search agents via hierarchical experience; targets stability/efficiency. | agents, search, reinforcement-learning, reasoning, training-stability |
| 2604.07264 | Validated Intent Compilation for Constrained Routing in LEO Mega-Constellations | cs.CR, cs.AI | 86 | LLM intent compiler + verifier loop for safety-critical routing constraints; strong benchmark results | LLM, program-synthesis, verification, tool-use, networking, constraints, reliability |
| 2604.06693 | Aegon: Auditable AI Content Access with Ledger-Bound Tokens and Hardware-Attested Mobile Receipts | cs.CR, cs.CY | 86 | Auditable content-access protocol with append-only ledger proofs; practical governance infra. | AI-governance, auditing, content-licensing, cryptography, transparency-logs, attestation, JWT |
| 2604.08340 | PokeGym: A Visually-Driven Long-Horizon Benchmark for Vision-Language Models | cs.CV, cs.AI | 86 | Long-horizon VLM benchmark in complex 3D game with strict RGB-only isolation and evaluator | benchmarks, embodied-agents, vision-language-models, long-horizon, evaluation |
| 2604.04604 | AI Agents Under EU Law | cs.CY, cs.AI, cs.CR, cs.MA | 86 | Systematic mapping of EU AI Act+GDPR etc. obligations for autonomous AI agents. | ai-agents, governance, regulation, EU-AI-Act, GDPR, compliance, risk-management |
| 2604.07054 | Sell More, Play Less: Benchmarking LLM Realistic Selling Skill | cs.CL | 86 | Realistic sales benchmark + auto eval; tests goal-directed persuasion in multi-turn dialogs | benchmark, dialogue, persuasion, evaluation, user-simulation, DPO |
| 2604.08519 | Cram Less to Fit More: Training Data Pruning Improves Memorization of Facts | cs.CL, stat.ML | 86 | Data pruning to improve factual memorization; info-theoretic framing of capacity limits | factuality, hallucinations, data-selection, memorization, scaling, information-theory |
| 2604.07007 | AgentCity: Constitutional Governance for Autonomous Agent Economies via Separation of Power | cs.MA, cs.AI, cs.CY | 84 | Governance architecture for open agent economies using separation-of-powers; novel but blockchain-heavy | agent governance, multi-agent systems, auditing, mechanism design, smart contracts |
| 2604.07967 | AtomEval: Atomic Evaluation of Adversarial Claims in Fact Verification | cs.CL, cs.AI | 84 | AtomEval detects semantic corruption in adversarial claim rewrites; improves fact-checking robustness eval. | fact-verification, adversarial-evaluation, metrics, robustness, nlp-eval |
| 2604.08003 | Rethinking Entropy Allocation in LLM-based ASR: Understanding the Dynamics between Speech Encoders and LLMs | eess.AS, cs.CL, cs.SD | 84 | Entropy-allocation view for LLM-ASR; targets hallucinations + latency with principled training strategy | hallucinations, ASR, LLM, reliability, training, evaluation-metrics |
| 2604.08539 | OpenVLThinkerV2: A Generalist Multimodal Reasoning Model for Multi-domain Visual Tasks | cs.CV, cs.AI, cs.CL | 84 | New RL objective (Gaussian GRPO) to stabilize multi-task reward topologies for multimodal reasoning | multimodal, rl, post-training, optimization, reasoning, grpo |
| 2603.17692 | Can Blindfolded LLMs Still Trade? An Anonymization-First Framework for Portfolio Optimization | cs.LG, cs.AI, q-fin.CP, q-fin.PM | 84 | Anonymization-first eval for LLM trading agents to reduce memorization/survivorship bias. | llm-agents, evaluation, data-leakage, memorization, finance, multi-agent |
| 2604.06148 | Who Governs the Machine? A Machine Identity Governance Taxonomy (MIGT) for AI Systems Operating Across Enterprise and Geopolitical Boundaries | cs.CR, cs.AI, cs.MA | 84 | Taxonomy for machine identity governance; relevant to agent credentials, tokens, and abuse. | agent-security, machine-identities, access-control, credentials, governance, risk-taxonomy, enterprise |
| 2604.07892 | Data Selection for Multi-turn Dialogue Instruction Tuning | cs.CL, cs.AI | 84 | Dialogue-level data selection for multi-turn instruction tuning; tackles noise and drift. | instruction-tuning, data-selection, multi-turn, post-training, dataset-quality |
| 2603.22709 | Who Spoke What When? Evaluating Spoken Language Models for Conversational ASR with Semantic and Overlap-Aware Metrics | cs.CL, eess.AS | 84 | New semantic+overlap-aware ASR metrics; probes LLM robustness in multi-speaker settings | evaluation, speech, ASR, LLM-robustness, metrics, overlap, semantic-fidelity |
| 2604.06814 | OmniTabBench: Mapping the Empirical Frontiers of GBDTs, Neural Networks, and Foundation Models for Tabular Data at Scale | cs.LG, cs.AI | 84 | 3030-dataset tabular benchmark; large-scale comparison of GBDT/NN/foundation models | benchmark, tabular, evaluation, foundation-models, GBDT |
| 2604.08417 | Vulnerability Detection with Interprocedural Context in Multiple Languages: Assessing Effectiveness and Cost of Modern LLMs | cs.SE, cs.CR | 84 | Empirical study of LLM vuln detection with interprocedural context; cost vs accuracy | security, vulnerability-detection, code-LLMs, evaluation, interprocedural-analysis |
| 2604.01554 | EXHIB: A Benchmark for Realistic and Diverse Evaluation of Function Similarity in the Wild | cs.CR, cs.LG, cs.SE | 82 | EXHIB benchmark for binary function similarity; broad, realistic security eval suite with model comparisons | benchmarks, software security, binary analysis, evaluation, vulnerability analysis |
AI Paper Insight Brief
2026-04-13
0) Executive takeaways (read this first)
- Evaluation is shifting from surface metrics to “meaning-/structure-preserving” metrics: tcpSemER for conversational ASR and AtomEval for adversarial fact verification both show that common metrics can dramatically misstate progress/robustness when paraphrase or semantic corruption is involved.
- Agent safety is increasingly about the interfaces and governance layers around models: EU-law mapping for agents, machine-identity governance (MIGT), and RAG security taxonomies all converge on “external actions + toolchains + identity + auditability” as the real compliance/security boundary.
- Inference-time and training-time mismatches are a recurring failure mode: quantized rollouts destabilize RL (QaRL/TBPO), LLM-ASR joint training can drift into hallucination (entropy allocation + IA-SFT), and heavy SFT can suppress tool-use (“agentic collapse”)—all pointing to the need for explicit alignment between what’s optimized and what’s deployed.
- Long-horizon embodied/GUI agents still fail on low-level recovery and initiative calibration: PokeGym identifies deadlock/collision recovery as the dominant bottleneck; KnowU-Bench shows large drops on personalized/proactive tasks even for strong models.
- Security research is becoming more “systems + economics”: VCAO’s game-theoretic orchestration improves validated vuln yield per budget; EXHIB exposes generalization gaps in binary function similarity detection (BFSD) across firmware/semantic variation; interprocedural context in LLM vuln detection often hurts accuracy while doubling cost.
2) Key themes (clusters)
Theme: Validity-aware evaluation (semantic/structure > surface form)
- Why it matters: As models become more fluent, they can “look wrong” while being semantically right (ASR paraphrase) or “look similar” while changing the proposition (adversarial claim rewrites). Robust evaluation needs invariances aligned to the task’s truth conditions.
- Representative papers:
- Who Spoke What When? Evaluating Spoken Language Models for Conversational ASR with Semantic and Overlap-Aware Metrics
- AtomEval: Atomic Evaluation of Adversarial Claims in Fact Verification
- EXHIB: A Benchmark for Realistic and Diverse Evaluation of Function Similarity in the Wild
- OmniTabBench: Mapping the Empirical Frontiers of GBDTs, Neural Networks, and Foundation Models for Tabular Data at Scale
- Common approach:
- Replace brittle string metrics with semantic or atomic units (tcpSemER embeddings; SROM tuples + hard structural gate).
- Decompose performance by hard regimes (overlap vs non-overlap; low/mid/high binary variation; dataset metafeatures).
- Use negative controls / audits to detect artifacts (e.g., leakage checks in other domains; here: validity-aware success vs raw ASR).
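As a toy illustration of the first bullet, here is a permutation-invariant, embedding-based error score. This is not tcpSemER (which additionally enforces time-collar constraints and diarization-aware matching); the `semantic_error` name, the brute-force matching, and the use of precomputed segment embeddings are all illustrative assumptions:

```python
import itertools
import math

def semantic_error(ref_embs, hyp_embs):
    """Permutation-invariant mean cosine distance between matched segments.

    ref_embs / hyp_embs: equal-length lists of segment embedding vectors
    (e.g., from a sentence encoder). The best assignment of hypothesis
    segments to reference segments is used, so segment ordering cannot
    inflate the error the way raw string alignment can.
    """
    def cos_dist(a, b):
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return 1.0 - sum(x * y for x, y in zip(a, b)) / (na * nb)

    # Brute-force over assignments; fine for toy segment counts only.
    best = min(
        sum(cos_dist(r, h) for r, h in zip(ref_embs, perm))
        for perm in itertools.permutations(hyp_embs)
    )
    return best / len(ref_embs)
```

Swapping two semantically identical segments yields an error near zero, whereas a surface-form metric would count every reordered word as wrong.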
- Open questions / failure modes:
- Semantic metrics can under-penalize critical details (e.g., diarization/time alignment still matters in ASR).
- Atomic decomposition quality becomes a bottleneck (AtomEval depends on extractor accuracy).
- Benchmarks can still encode selection bias (OmniTabBench filters out “too easy/hard” datasets; EXHIB’s coverage still leaves room for broader semantic diversity).
Theme: Agent governance, compliance, and identity as first-class engineering
- Why it matters: For tool-using agents, risk is triggered by external actions (data flows, privileges, cross-party chains), not model internals. Compliance and security require inventories, identity controls, and audit trails that survive runtime drift.
- Representative papers:
- AI Agents Under EU Law
- Who Governs the Machine? A Machine Identity Governance Taxonomy (MIGT) for AI Systems Operating Across Enterprise and Geopolitical Boundaries
- Securing Retrieval-Augmented Generation: A Taxonomy of Attacks, Defenses, and Future Directions
- Aegon: Auditable AI Content Access with Ledger-Bound Tokens and Hardware-Attested Mobile Receipts
- Common approach:
- Map systems via taxonomies tied to triggers/surfaces (EU legal triggers by agent action; RAG pipeline stages; identity risk domains).
- Emphasize auditability (external-action inventories; tamper-evident logs; provenance events; attested receipts).
- Treat machine identities / non-human identities (NHIs) as core security objects (registries, cryptographic IDs, JIT credentials, tamper-evident trails).
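To make the auditability bullets concrete, here is a minimal hash-chained external-action log. The `append_action`/`verify_chain` functions are hypothetical; real systems like Aegon bind tokens to a ledger and add hardware attestation, which this sketch omits:

```python
import hashlib
import json

def _digest(action, prev_hash):
    """Deterministic SHA-256 over the action payload plus the previous hash."""
    payload = json.dumps({"action": action, "prev": prev_hash}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def append_action(log, action):
    """Append an external-action record whose hash covers the previous
    entry, so any later edit to earlier entries breaks the chain."""
    prev_hash = log[-1]["hash"] if log else "0" * 64
    log.append({"action": action, "prev": prev_hash,
                "hash": _digest(action, prev_hash)})
    return log

def verify_chain(log):
    """Recompute every hash; return False if any entry was altered."""
    prev = "0" * 64
    for rec in log:
        if rec["prev"] != prev or rec["hash"] != _digest(rec["action"], prev):
            return False
        prev = rec["hash"]
    return True
```

Note this gives tamper evidence, not completeness: as the Aegon discussion below points out, a platform can still omit events entirely, which a hash chain alone cannot detect.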
- Open questions / failure modes:
- Draft standards and “standards-free” timelines create uncertainty for implementers (EU harmonised standards still evolving).
- Audit logs can be tamper-evident but not complete (Aegon: platforms can omit provenance events).
- Cross-jurisdiction conflicts may be irreconcilable in practice (MIGT’s conflict registry highlights governance friction).
Theme: Stabilizing agent training & deployment under mismatch and drift
- Why it matters: Many failures come from mismatch (quantized sampler vs full-precision learner; encoder drift vs LLM priors; specialization suppressing tool use). Fixes increasingly combine systems alignment + objective design + staged training.
- Representative papers:
- QaRL: Rollout-Aligned Quantization-Aware RL for Fast and Stable Training under Training–Inference Mismatch
- Rethinking Entropy Allocation in LLM-based ASR: Understanding the Dynamics between Speech Encoders and LLMs
- Awakening the Sleeping Agent: Lean-Specific Agentic Data Reactivates General Tool Use in Goedel Prover
- OpenVLThinkerV2: A Generalist Multimodal Reasoning Model for Multi-domain Visual Tasks
- Common approach:
- Align training forward pass with deployment inference (QaRL low-bit forward + STE; publish low-bit tensors to sampler).
- Use distribution/advantage shaping to tame heavy tails and inter-task imbalance (TBPO sequence-level clipping; G2RPO Gaussian mapping).
- Stage training to preserve grounding and prevent drift (CTC pretrain + IA-SFT hot-swapping; freeze/align stages).
- Use small targeted data to recover suppressed capabilities (100 Lean agentic traces restore BFCL tool use).
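The rollout-alignment idea in the first bullet can be sketched with a straight-through estimator (STE): the learner's forward pass uses the same low-bit weights as the sampler, while gradients flow as if quantization were the identity. This is a toy scalar version under stated assumptions, not QaRL's implementation; `fake_quantize` and `ste_forward_backward` are illustrative names, and real systems run low-bit GPU kernels:

```python
def fake_quantize(w, bits=4):
    """Symmetric uniform quantization of a weight vector to `bits` levels."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(min(w)), abs(max(w))) / qmax or 1.0  # avoid zero scale
    return [round(x / scale) * scale for x in w]

def ste_forward_backward(w, x, grad_out):
    """Dot-product layer: forward with quantized weights (matching the
    rollout sampler), backward pretending quantization is the identity,
    so gradients land on the full-precision master weights."""
    wq = fake_quantize(w)
    y = sum(wi * xi for wi, xi in zip(wq, x))   # low-bit forward
    grad_w = [grad_out * xi for xi in x]        # STE: dy/dw_i ~= x_i
    return y, grad_w
```

The point of the pattern is that the policy whose log-probs enter the RL objective is arithmetically the same one that generated the rollouts, removing one source of importance-ratio pathology.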
- Open questions / failure modes:
- Sequence-level masking/clipping may introduce sample inefficiency (QaRL notes cost/overhead).
- Hot-swapping thresholds and multi-stage pipelines add operational complexity (ASR entropy allocation approach).
- Capability recovery can harm abstention/irrelevance detection if training is one-sided (Goedel: irrelevance accuracy drops sharply).
Theme: Long-horizon interactive agents: recovery, proactivity, and persuasion
- Why it matters: Real deployments require agents to recover from execution failures, ask clarifying questions, and calibrate initiative/consent—capabilities not captured by static benchmarks.
- Representative papers:
- PokeGym: A Visually-Driven Long-Horizon Benchmark for Vision-Language Models
- KnowU-Bench: Towards Interactive, Proactive, and Personalized Mobile Agent Evaluation
- Sell More, Play Less: Benchmarking LLM Realistic Selling Skill
- Verify Before You Commit: Towards Faithful Reasoning in LLM Agents via Self-Auditing
- Common approach:
- Build interactive benchmarks with controlled scenarios + automated scoring (AOB evaluator in PokeGym; emulator + hybrid judge in KnowU).
- Diagnose failures with process metrics (ineffective moves/deadlocks; clarify/partial preference satisfaction; role inversion).
- Add verification layers before committing actions/memory (SAVER audit–repair loop).
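The verification-layer bullet can be sketched as a generic audit-and-repair loop. This is a hypothetical reduction of the SAVER-style pattern: `checks` and `repair` are caller-supplied stand-ins for the paper's typed-violation machinery, and the bounded-retry policy is an assumption:

```python
def verify_before_commit(proposal, checks, repair, max_repairs=2):
    """Audit a proposed belief/action before committing it.

    checks: list of (name, predicate) pairs; all must pass to commit.
    repair: function(proposal, violation_names) -> repaired proposal.
    Returns (committed_value_or_None, list of violation lists seen).
    """
    history = []
    for _ in range(max_repairs + 1):
        violations = [name for name, check in checks if not check(proposal)]
        if not violations:
            return proposal, history          # safe to commit
        history.append(violations)
        proposal = repair(proposal, violations)  # targeted minimal repair
    return None, history                       # refuse to commit
```

Refusing to commit after bounded repairs (rather than committing the last attempt) is the conservative choice for agent memory, at the cost of the extra audit compute noted below.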
- Open questions / failure modes:
- Simulator dependence (KnowU uses an LLM user simulator; SalesLLM shows simulator choice affects long-horizon outcomes).
- Low-level control brittleness remains dominant (PokeGym: collisions/deadlocks; parametric control brittle).
- Verification adds compute and depends on the same model family for auditing/repair (SAVER overhead and reliance).
Theme: Security & robustness in the wild (benchmarks + orchestration + cost)
- Why it matters: Security tasks are dominated by distribution shift (firmware/semantic BFSD), orchestration under budget (vuln discovery), and cost/latency constraints (LLM vuln detection context expansion).
- Representative papers:
- EXHIB: A Benchmark for Realistic and Diverse Evaluation of Function Similarity in the Wild
- VCAO: Verifier-Centered Agentic Orchestration for Strategic OS Vulnerability Discovery
- Vulnerability Detection with Interprocedural Context in Multiple Languages: Assessing Effectiveness and Cost of Modern LLMs
- Silencing the Guardrails: Inference-Time Jailbreaking via Dynamic Contextual Representation Ablation
- Common approach:
- Stress models on harder, more realistic variation (semantic equivalence, firmware, obfuscation).
- Treat security as decision-making under uncertainty (Stackelberg game + Bayesian updates + cascaded verification).
- Quantify cost/performance explicitly (token costs; solver time; runtime trade-offs).
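A heavily simplified sketch of the cost-aware decision-making bullets: greedy allocation by expected validated severity per unit cost. VCAO itself solves a DOBSS-derived Stackelberg MILP with Bayesian belief updates; the `allocate_budget` function below is only an assumption-laden stand-in for the budget-aware idea:

```python
def allocate_budget(probes, budget):
    """Greedy knapsack-style tool allocation.

    probes: list of (name, p_valid_finding, severity, cost) tuples.
    Ranks probes by expected validated severity per unit cost and
    takes them until the budget is exhausted.
    """
    ranked = sorted(probes, key=lambda p: p[1] * p[2] / p[3], reverse=True)
    plan, spent = [], 0.0
    for name, p, sev, cost in ranked:
        if spent + cost <= budget:
            plan.append(name)
            spent += cost
    return plan, spent
```

Even this toy makes the framing visible: a cheap, high-yield probe outranks an expensive one with a nominally higher severity, which is the "per validated finding, per budget" lens the theme describes.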
- Open questions / failure modes:
- White-box attacks (CRA) highlight fragility but may not translate to black-box settings.
- More context can degrade performance (interprocedural prompts) while increasing cost—suggesting need for selective retrieval rather than naive expansion.
- Benchmarks reveal gaps but don’t automatically yield fixes (EXHIB shows semantic variation remains hard).
3) Technical synthesis
- Multiple papers converge on “pipeline-level invariants”: tcpSemER preserves time collars + permutation invariance; AtomEval enforces relation-structure consistency; RAG security frames threats by pipeline stage; EU agent compliance centers on external-action inventories.
- Decomposition is the new default: overlap vs non-overlap error attribution (CASR), low/mid/high binary variation (EXHIB), metafeature-conditioned winners (OmniTabBench), failure taxonomies (PokeGym deadlocks; KnowU clarify/partial; CrashSight category gaps).
- Mismatch correction appears in three distinct forms:
- Systems mismatch (quantized sampler vs BF16 learner → QaRL aligned low-bit forward).
- Representation drift mismatch (speech encoder becomes too semantic → CTC pretrain + IA-SFT hot-swapping).
- Capability suppression mismatch (domain SFT suppresses tool use → tiny agentic trace reactivation).
- Robustness often requires “hard gates” + “soft scores”: AtomEval hard relation gate + soft degradations; SAVER typed violations + minimal repair; LEO intent compiler uses deterministic 8-pass validator with ACCEPT/REJECT/ABSTAIN.
- Graph structure keeps showing up as a stabilizer/accelerator: SemGAT in anonymized trading; a GAT router distilled from Dijkstra for LEO; attack graphs in VCAO. In both the finance and routing cases, semantic edges are used to propagate relational constraints.
- Cost-aware evaluation is becoming standard: vulnerability detection paper reports token-cost totals and shows context doubles tokens; QaRL reports per-step speedups; VCAO reports MILP solve time (<5s for ~75k vars).
- “Overlap / concurrency” is a core unsolved regime: CASR shows overlap regions dominate errors (~90% of error from ~32% overlap); similar “concurrency” issues appear in multi-agent governance (accountability horizon) and toolchains (RAG trust boundaries).
- Inference-time attacks are moving into representation space: CRA uses gradient-attributed masking to suppress refusal subspaces, suggesting defenses must consider activation integrity, not just prompt filtering.
- Benchmarks increasingly include intervention studies (PokeGym forced recovery improves SR; MDS shows long-dialogue robustness; CrashSight shows fine-tuning gains but persistent perceptual bottlenecks).
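The "hard gates + soft scores" pattern above can be captured in a few lines. This is a generic illustration, not any one paper's scorer; the `gated_score` name and the mean aggregation are assumptions:

```python
def gated_score(candidate, hard_gates, soft_scorers):
    """Score a candidate only if it passes every binary validity gate.

    hard_gates: predicates that must all hold (e.g., relation-structure
    preserved, no typed violations); otherwise the score is zero.
    soft_scorers: graded quality functions averaged for gated candidates.
    """
    if not all(gate(candidate) for gate in hard_gates):
        return 0.0
    scores = [scorer(candidate) for scorer in soft_scorers]
    return sum(scores) / len(scores)
```

The design point is that graded quality can never compensate for a validity failure, which is exactly what keeps semantically corrupted rewrites or unsafe commits from "scoring well" on soft metrics.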
4) Top 5 papers (with “why now”)
1) QaRL: Rollout-Aligned Quantization-Aware RL for Fast and Stable Training under Training–Inference Mismatch
- Aligns learner forward-pass arithmetic with quantized rollout engines to reduce PPO instability from mismatch.
- TBPO introduces sequence-level ratios + dual clipping to suppress “error-token” ratio explosions under quantized decoding.
- Demonstrates near-BF16 recovery while keeping most throughput gains (e.g., Qwen3-30B-A3B: 45.7 → 51.2 vs BF16 52.1).
- Skepticism: still slower than pure quantized-rollout training (1.3× vs 1.4× on MoE) and relies on low-bit kernel availability.
2) Silencing the Guardrails: Inference-Time Jailbreaking via Dynamic Contextual Representation Ablation
- Training-free, inference-time jailbreak that targets refusal subspaces via gradient attribution and masking.
- Large attack-success-rate gains reported across multiple 7B aligned models (e.g., Llama-2-7B-Chat ASR-O 53.0%; λ≈1.0 gives RRSR 96.3%).
- Highlights a concrete latent-space attack surface distinct from prompt-only jailbreaks.
- Skepticism: assumes white-box access to activations/gradients; quality degrades at high suppression strengths.
3) Who Spoke What When? Evaluating Spoken Language Models for Conversational ASR with Semantic and Overlap-Aware Metrics
- Introduces tcpSemER (time-constrained, permutation-invariant semantic error) and overlap-aware tcpWER decomposition.
- Shows overlap dominates errors (NSF1: ~32% overlap accounts for ~90% of error), and semantic metrics reduce sensitivity to normalization.
- Provides a realistic comparison of modular vs LLM-based CASR under increasing overlap/speaker counts.
- Skepticism: primarily evaluation; does not propose architectural fixes for overlap handling.
4) KnowU-Bench: Towards Interactive, Proactive, and Personalized Mobile Agent Evaluation
- Online Android benchmark that tests preference elicitation, proactivity/consent, and post-rejection restraint—beyond navigation.
- Shows strong models drop sharply on hard personalized tasks (e.g., Claude Sonnet 4.6: 60.4% overall vs 44.2% hard personalized).
- Hybrid evaluation (rule checks + LLM judge) better aligns with human ratings than rules alone.
- Skepticism: simulator dependence (LLM user simulator) and synthetic/curated profiles/logs may limit ecological validity.
5) VCAO: Verifier-Centered Agentic Orchestration for Strategic OS Vulnerability Discovery
- Frames vuln discovery as repeated Bayesian Stackelberg game; allocates tool budget via DOBSS-derived MILP + belief updates.
- Claims large gains in severity-weighted validated findings per budget (2.7× vs coverage-only fuzzing) and reduces false positives to ~15.1%.
- Includes a six-layer orchestration architecture and a stated online regret bound.
- Skepticism: relies on rational-attacker assumptions and calibrated tool likelihoods; attack-path enumeration is exponential and needs heuristics.
5) Practical next steps
- Adopt validity-aware metrics in your eval stack: for multi-speaker ASR, add tcpSemER + overlap decomposition; for adversarial fact verification, add atomic-structure validity checks (AtomEval-style) to avoid counting “semantic drift” as successful attacks.
- Instrument agent systems around external actions: build an “external-action inventory” (EU-law paper’s Step 0) and map it to identity, logging, and trust boundaries (MIGT + RAG security taxonomy).
- Harden against representation-space jailbreaks: if you operate open-weight models or internal deployments, test CRA-like activation ablations in a red-team setting to understand whether refusal relies on low-rank directions.
- If doing RL with quantized rollouts, measure mismatch-induced ratio pathologies (token/sequence ratios, error-token frequency) and consider aligned low-bit forward passes + sequence-level clipping/masking (QaRL/TBPO).
- For long-horizon VLM/GUI agents, track process metrics (deadlocks/ineffective moves; clarify rate; intervention/passivity) and run targeted interventions (e.g., deterministic recovery primitives) rather than only improving high-level planning.
- For specialized tool-using models, test for “agentic collapse” after heavy SFT; try small targeted agentic trace injections (including explicit no-tool negatives) to recover tool use without destroying domain skill.
- In security tooling, avoid naive context expansion: interprocedural context can degrade detection while doubling tokens; instead, experiment with selective retrieval of only the most relevant callers/callees and measure cost-per-validated-finding.
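The mismatch diagnostic suggested in the quantized-rollout bullet can be sketched as follows. Function names and the default threshold are illustrative, not taken from QaRL/TBPO:

```python
import math

def sequence_ratio(learner_logps, sampler_logps):
    """Sequence-level importance ratio between learner and sampler
    policies: the product of per-token ratios, computed in log space
    for numerical stability."""
    return math.exp(sum(l - s for l, s in zip(learner_logps, sampler_logps)))

def flag_pathological(batch, threshold=4.0):
    """Return indices of sequences whose ratio leaves [1/threshold,
    threshold], i.e., candidates for mismatch-induced ratio explosion.

    batch: list of (learner_logps, sampler_logps) pairs per sequence.
    """
    flags = []
    for i, (lp, sp) in enumerate(batch):
        r = sequence_ratio(lp, sp)
        if r > threshold or r < 1.0 / threshold:
            flags.append(i)
    return flags
```

Tracking the flagged fraction over training gives an early signal of the instability the brief attributes to quantized sampler/full-precision learner mismatch, before it shows up as reward collapse.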
Generated from per-paper analyses; no external browsing.
