Daily AI Paper Report (2026-04-04)
Chinese version: [中文]
Run stats
- Candidates: 254
- Selected: 30
- Deepread completed: 30
- Window (UTC): 2026-04-02T00:00:00Z → 2026-04-03T00:00:00Z (arxiv_announce, expanded=0)
| arXiv ID | Title / Links | Categories | Score | Why | Tags |
|---|---|---|---|---|---|
| 2604.02174 | Quantifying Self-Preservation Bias in Large Language Models | cs.AI | 95 | Benchmark quantifies self-preservation bias via role inconsistency; strong agentic misalignment signal. | agent-safety, instrumental-convergence, shutdown-resistance, evaluation, RLHF, benchmark |
| 2604.02022 | ATBench: A Diverse and Realistic Trajectory Benchmark for Long-Horizon Agent Safety | cs.AI | 94 | Long-horizon trajectory benchmark for agent safety with delayed triggers and harm taxonomy. | agent-safety, benchmark, long-horizon, tool-use, red-teaming, evaluation |
| 2604.01604 | CRaFT: Circuit-Guided Refusal Feature Selection via Cross-Layer Transcoders | cs.AI | 94 | Circuit-guided refusal features near boundary; improves jailbreak/ASR analysis and control. | LLM-safety, refusal, mechanistic-interpretability, jailbreaks, feature-selection, circuits |
| 2604.01905 | From Component Manipulation to System Compromise: Understanding and Detecting Malicious MCP Servers | cs.CR, cs.SE | 92 | Component-centric dataset + detection for malicious MCP servers; targets real tool-ecosystem attacks. | security, agents, MCP, supply-chain, tooling, dataset, detection |
| 2604.02230 | Answering the Wrong Question: Reasoning Trace Inversion for Abstention in LLMs | cs.AI | 92 | New abstention method (trace inversion) targets reasoning-model overanswering failures. | abstention, hallucinations, reasoning-models, reliability, uncertainty, evaluation |
| 2604.01658 | CORAL: Towards Autonomous Multi-Agent Evolution for Open-Ended Discovery | cs.AI | 92 | Autonomous multi-agent evolution w/ persistent memory + practical safeguards; strong agentic relevance. | agents, multi-agent, open-ended, autonomous, safeguards, evaluation, infrastructure |
| 2604.01496 | From SWE-ZERO to SWE-HERO: Execution-free to Execution-based Fine-tuning for Software Engineering Agents | cs.SE, cs.CL | 91 | Strong SWE-bench gains + large released trajectories; advances real agentic coding workflows. | agents, software-engineering, SWE-bench, post-training, datasets, tool-use |
| 2604.01508 | ToolMisuseBench: An Offline Deterministic Benchmark for Tool Misuse and Recovery in Agentic Systems | cs.SE, cs.AI | 90 | Deterministic offline benchmark for tool misuse/recovery with budgets and fault injection; very reusable. | agents, tool-use, robustness, benchmark, fault-injection, evaluation |
| 2604.02091 | Optimizing RAG Rerankers with LLM Feedback via Reinforcement Learning | cs.CL, cs.AI, cs.IR | 90 | RL aligns RAG reranking to downstream LLM answer utility, not static IR labels. | RAG, reranking, RL, LLM-feedback, evaluation, alignment |
| 2604.01664 | ContextBudget: Budget-Aware Context Management for Long-Horizon Search Agents | cs.AI | 90 | RL-based budget-aware context compression for long-horizon agents; directly targets context-limit failures. | agents, long-horizon, context-management, compression, reinforcement-learning, efficiency |
| 2604.02288 | Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing | cs.LG, cs.AI | 89 | Unifies GRPO/SDPO via routing; addresses RLVR credit assignment + late-stage collapse. | RLVR, post-training, GRPO, distillation, optimization-stability, alignment-training |
| 2604.01977 | RuleForge: Automated Generation and Validation for Web Vulnerability Detection at Scale | cs.CR, cs.AI, cs.CL, cs.LG, cs.SE | 88 | Automates CVE detection-rule generation at scale; high security impact and deployable architecture. | security, vulnerability-detection, CVE, rule-generation, automation, threat-detection |
| 2604.01624 | OSCAR: Orchestrated Self-verification and Cross-path Refinement | cs.AI, cs.CL | 87 | Hallucination mitigation using diffusion LM trajectories; unsupervised uncertainty localization. | hallucinations, diffusion-language-models, uncertainty, self-verification, inference-time-control |
| 2604.01652 | ThinknCheck: Grounded Claim Verification with Compact, Reasoning-Driven, and Interpretable Models | cs.AI, cs.CL | 87 | 1B grounded claim verifier w/ structured rationales; strong gains vs larger baselines, interpretable. | verification, factuality, grounding, hallucinations, small-models, interpretability, evaluation |
| 2604.01925 | ImplicitBBQ: Benchmarking Implicit Bias in Large Language Models through Characteristic Based Cues | cs.CL, cs.AI | 86 | New implicit-bias QA benchmark using characteristic cues; shows bias persists despite explicit suppression. | bias, evaluation, safety, fairness, benchmark, implicit-signals |
| 2604.01993 | SAFE: Stepwise Atomic Feedback for Error correction in Multi-hop Reasoning | cs.CL, cs.AI | 86 | Benchmarking with verifiable atomic steps; filters unanswerables and gives stepwise feedback. | evaluation, multi-hop-reasoning, verification, benchmarks, grounding, error-taxonomy |
| 2604.01837 | PLOT: Enhancing Preference Learning via Optimal Transport | cs.CL | 86 | Optimal-transport token loss for preference learning; aims for stability/robustness gains. | alignment, preference-learning, DPO/RLHF, optimal-transport, token-level |
| 2604.02322 | Batched Contextual Reinforcement: A Task-Scaling Law for Efficient Reasoning | cs.LG, cs.AI, cs.CL | 86 | Task-scaling law via solving N problems in one context; reduces CoT token cost with simple training. | reasoning, efficiency, scaling-laws, training, chain-of-thought, inference-cost |
| 2604.01682 | PRISM: Probability Reallocation with In-Span Masking for Knowledge-Sensitive Alignment | cs.CL | 85 | Risk-gated SFT objective to reduce overconfident hallucinations at fact-critical spans. | hallucination, alignment, factuality, SFT, uncertainty, training |
| 2604.02155 | Brief Is Better: Non-Monotonic Chain-of-Thought Budget Effects in Function-Calling Language Agents | cs.CL | 84 | Finds non-monotonic CoT budget effects in function-calling agents; actionable for agent design. | agents, function-calling, reasoning, chain-of-thought, evaluation, reliability |
| 2604.02194 | Neuro-RIT: Neuron-Guided Instruction Tuning for Robust Retrieval-Augmented Language Model | cs.CL, cs.AI | 84 | Neuron-level tuning to resist noisy/irrelevant retrieval; improves RAG robustness. | RAG, robustness, retrieval-noise, instruction-tuning, attribution, neurons |
| 2604.01610 | GraphWalk: Enabling Reasoning in Large Language Models through Tool-Based Graph Navigation | cs.AI | 84 | Training-free tool-based KG navigation enables multi-hop reasoning beyond context limits. | agents, tool-use, knowledge-graphs, grounding, long-context, reasoning |
| 2604.02047 | Goose: Anisotropic Speculation Trees for Training-Free Speculative Decoding | cs.CL, cs.AI | 84 | Training-free speculative decoding w/ anisotropic trees; principled use of mixed-quality token sources. | inference, speculative-decoding, efficiency, decoding, systems |
| 2604.01754 | LiveMathematicianBench: A Live Benchmark for Mathematician-Level Reasoning with Proof Sketches | cs.CL, cs.AI, cs.LG | 83 | Live, post-cutoff math benchmark from recent arXiv theorems; reduces contamination, adds taxonomy. | evaluation, math-reasoning, benchmark, data-contamination, proof-sketches |
| 2604.01676 | GPA: Learning GUI Process Automation from Demonstrations | cs.CV, cs.AI, cs.SE | 82 | Deterministic, local GUI automation from one demo; emphasizes reliability calibration and privacy. | agents, GUI, RPA, privacy, reliability, tooling |
| 2604.01576 | Care-Conditioned Neuromodulation for Autonomy-Preserving Supportive Dialogue Agents | cs.LG | 82 | Alignment for supportive agents: autonomy-preserving objective + relational failure benchmark. | alignment, social-risk, autonomy, supportive-agents, benchmarks, reward-modeling |
| 2604.01840 | Not All Tokens See Equally: Perception-Grounded Policy Optimization for Large Vision-Language Models | cs.AI | 82 | Credits only visually-dependent tokens in RLVR; sharper learning signal for LVLM reasoning. | multimodal, VLM, RLVR, credit-assignment, reasoning, optimization |
| 2604.01618 | Tex3D: Objects as Attack Surfaces via Adversarial 3D Textures for Vision-Language-Action Models | cs.CV, cs.AI | 81 | Physically plausible adversarial 3D textures attack VLA models; important robotics safety surface. | adversarial, robotics, VLA, physical-attacks, robustness, security |
| 2604.01988 | SenseMath: Do LLMs Have Number Sense? Evaluating Shortcut Use, Judgment, and Generation | cs.AI | 81 | Controlled benchmark for number sense + shortcut use/judgment; useful probe of reasoning reliability. | evaluation, numerical-reasoning, robustness, shortcuts, calibration |
| 2604.02276 | De Jure: Iterative LLM Self-Refinement for Structured Extraction of Regulatory Rules | cs.AI, cs.CL, cs.LG | 80 | Automated regulatory rule extraction with judge+iterative repair; useful for compliance-aware agents. | governance, compliance, LLM-judge, self-refinement, information-extraction, agents |
AI Paper Insight Brief
2026-04-04
1) Executive takeaways (read this first)
- Agent reliability is shifting from “capability” to “operational correctness under constraints”: deterministic fault injection + budgeted scoring for tool misuse (ToolMisuseBench) and explicit context-window budgeting as an RL decision problem (ContextBudget) make failures attributable and optimizable.
- Execution-heavy SWE agent training can be made scalable via a “semantic distill → small execution refine” recipe: SWE-ZERO (300k execution-free trajectories) + SWE-HERO (13k execution-backed) materially improves SWE-bench Verified (e.g., 32B: 62.2%) while reducing infra dependence.
- Safety evaluation is becoming trajectory-native and system-supply-chain-aware: ATBench exposes long-horizon, delayed-trigger tool risks where even strong models struggle at fine-grained diagnosis; MCP server security work shows multi-component exploit chains and provides a behavior-deviation detector (Connor) with high F1 (94.6%) plus real marketplace finds.
- “Reasoning” is not monotonically beneficial; budgeting and credit assignment matter: long CoT can harm function-calling accuracy (peaks at very brief 8–32 tokens); multimodal RL improves when advantages are routed to visually-dependent tokens (PGPO); and RL post-training stabilizes when samples are routed between GRPO and self-distillation (SRPO).
- Factuality/abstention is moving toward localized, model-native signals and targeted interventions: diffusion LMs can localize uncertain commitments via cross-chain entropy and correct spans (OSCAR); abstention improves by detecting “query misalignment” via reasoning-trace inversion.
2) Key themes (clusters)
Theme: Budgeted, deterministic evaluation for tool-using agents
- Why it matters: Tool failures (schema drift, auth, timeouts) and resource limits (steps/calls/context) dominate real deployments; deterministic, budget-aware benchmarks make reliability improvements measurable and reproducible.
- Representative papers:
- ToolMisuseBench: An Offline Deterministic Benchmark for Tool Misuse and Recovery in Agentic Systems
- ContextBudget: Budget-Aware Context Management for Long-Horizon Search Agents
- Brief Is Better: Non-Monotonic Chain-of-Thought Budget Effects in Function-Calling Language Agents
- Common approach:
- Deterministic fault injection / replayable simulators with structured metrics (success, invalid calls, recovery time, budget-exceeded).
- Treat “budget” as a first-class variable (AUC over caps; explicit remaining-context state; token-budget sweeps); a minimal sketch follows this theme.
- Diagnose failures into actionable buckets (wrong valid function vs hallucinated function; policy violations; recovery success).
- Open questions / failure modes:
- How to handle “hard” faults (authorization/rate-limit) where simple repair layers show zero success in ToolMisuseBench’s released setting.
- Whether learned policies generalize across tool ecosystems and shifting schemas without overfitting to benchmark fault mixes.
- How to gate reasoning/CoT adaptively (brief helps routing; long induces misrouting/hallucination) without brittle heuristics.
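To make budget-as-a-first-class-variable concrete, here is a minimal Python sketch of a seeded, replayable fault injector scored by mean success over budget caps (an AUC-over-budgets number). All names (FaultPlan, auc_over_budgets, the fault taxonomy) are hypothetical stand-ins, not the ToolMisuseBench API.

```python
import random
from dataclasses import dataclass

# Hypothetical fault taxonomy (schema drift, auth, timeouts); illustrative
# names only -- this is not the ToolMisuseBench implementation.
FAULTS = ["schema_drift", "auth_error", "timeout"]

@dataclass
class FaultPlan:
    """Deterministic, replayable fault schedule derived from a seed."""
    seed: int
    fault_rate: float = 0.2

    def fault_for(self, step: int):
        # Same (seed, step) pair always yields the same fault -> replayable runs.
        rng = random.Random(self.seed * 100_003 + step)
        return rng.choice(FAULTS) if rng.random() < self.fault_rate else None

def run_episode(agent_step, plan, budget):
    """One tool-use episode under a hard step budget; True on success."""
    for step in range(budget):
        if agent_step(step, plan.fault_for(step)):
            return True            # task solved within budget
    return False                   # budget exhausted counts as failure

def auc_over_budgets(agent_step, seeds, budgets=(2, 4, 8, 16, 32)):
    """Mean success rate across budget caps: a simple AUC-over-budgets score."""
    rates = []
    for b in budgets:
        wins = sum(run_episode(agent_step, FaultPlan(seed=s), b) for s in seeds)
        rates.append(wins / len(seeds))
    return sum(rates) / len(rates)

# Toy agent: fails on any injected fault, otherwise succeeds from step 1 on.
def toy_agent(step, injected_fault):
    return injected_fault is None and step >= 1

print(auc_over_budgets(toy_agent, seeds=range(100)))
```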
Theme: Scalable training + verification loops for code and long-horizon autonomy
- Why it matters: Execution environments and long-running search loops are the bottleneck for open-source agents; scalable data + persistent memory can unlock capability without prohibitive infra.
- Representative papers:
- From SWE-ZERO to SWE-HERO: Execution-free to Execution-based Fine-tuning for Software Engineering Agents
- CORAL: Towards Autonomous Multi-Agent Evolution for Open-Ended Discovery
- GPA: Learning GUI Process Automation from Demonstrations
- Common approach:
- Split “cheap semantic learning” from “expensive verification” (execution-free distillation then execution-backed refinement); see the pipeline sketch after this theme.
- Externalize state/knowledge into persistent artifacts (notes/skills/attempts) to enable reuse and cross-agent diffusion.
- Prefer deterministic replay/grounding mechanisms (SMC-based GUI element localization + readiness gating; bounded retries).
- Open questions / failure modes:
- Teacher inheritance and verifier precision limits in SWE distillation; brittleness from environment variance.
- How CORAL-style autonomy behaves with weaker models or ambiguous evaluators (paper notes evaluator assumptions).
- Record-and-replay systems (GPA) can’t handle tasks requiring new planning beyond the demonstration.
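A minimal sketch of the “semantic distill → small execution refine” data recipe. Every helper below (judge, run_tests, the sample weights) is a hypothetical stand-in, not the SWE-HERO codebase; the point is the split between cheap non-executing filtering and expensive execution-backed verification.

```python
def stage1_semantic_distill(teacher_trajectories, judge):
    """Cheap pass: keep trajectories the semantic judge accepts (no execution)."""
    return [t for t in teacher_trajectories if judge(t) >= 0.5]

def stage2_execution_refine(trajectories, run_tests, cap=13_000):
    """Expensive pass: keep only trajectories whose patches pass real tests."""
    kept = []
    for t in trajectories:
        if run_tests(t):         # spins up an env, applies the patch, runs the suite
            kept.append(t)
        if len(kept) >= cap:     # execution is the bottleneck, so cap this stage
            break
    return kept

def build_sft_data(teacher_trajectories, judge, run_tests):
    # Stage 1 scales to hundreds of thousands of samples because it never
    # touches an execution environment; stage 2 verifies a small subset.
    semantic = stage1_semantic_distill(teacher_trajectories, judge)
    executed = stage2_execution_refine(semantic, run_tests)
    # One illustrative mixing choice: upweight execution-verified samples.
    return [(t, 1.0) for t in semantic] + [(t, 2.0) for t in executed]
```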
Theme: Trajectory-level safety + supply-chain/tooling security
- Why it matters: Real harms emerge across multi-step tool trajectories and via compromised tool servers; single-turn safety checks miss delayed triggers and multi-component exploit chains.
- Representative papers:
- ATBench: A Diverse and Realistic Trajectory Benchmark for Long-Horizon Agent Safety
- From Component Manipulation to System Compromise: Understanding and Detecting Malicious MCP Servers
- Tex3D: Objects as Attack Surfaces via Adversarial 3D Textures for Vision-Language-Action Models
- Common approach:
- Explicit taxonomies + controlled generation (ATBench’s risk-source/failure-mode/harm axes; delayed-trigger two-episode protocol).
- Behavior-based detection beyond signatures (Connor’s intent extraction + execution tracing + code slicing + step-wise allow/warn/block); a minimal sketch follows this theme.
- Physically grounded threat models for embodied agents (object-bound adversarial 3D textures; sim-to-real via EoT).
- Open questions / failure modes:
- Fine-grained diagnosis remains weak even when binary unsafe detection is decent (ATBench: GPT-5.4 76.7% F1 binary vs 13.5% failure-mode accuracy).
- Connor can miss payloads not exercised during simulation; false positives when benign behavior deviates from declared intent.
- Defenses for VLA texture attacks (training-time robustness, action constraints) are not established here—only the vulnerability and attack pipeline.
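A minimal sketch of the step-wise allow/warn/block idea: each tool-call step is judged against the server’s declared intent by a hypothetical LLM scorer. Connor’s actual pipeline also uses execution tracing and code slicing, which this sketch omits; the thresholds are illustrative.

```python
from enum import Enum

class Verdict(Enum):
    ALLOW = "allow"
    WARN = "warn"
    BLOCK = "block"

def deviation_verdict(declared_intent, step_action, llm_judge):
    """Judge one tool-call step against the server's declared intent.

    `llm_judge` is a hypothetical callable returning a deviation score in
    [0, 1]; this is not Connor's implementation.
    """
    score = llm_judge(
        f"Declared intent: {declared_intent}\n"
        f"Observed step: tool={step_action['tool']} args={step_action['args']}\n"
        "How strongly does this step deviate from the declared intent?"
    )
    if score < 0.3:
        return Verdict.ALLOW
    if score < 0.7:
        return Verdict.WARN     # surface to the user, keep running
    return Verdict.BLOCK        # halt before side effects land

# e.g., a "notes manager" server suddenly reading ~/.ssh should score high.
```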
Theme: Credit assignment and routing in post-training (RLVR / preference learning)
- Why it matters: Many alignment failures are optimization artifacts: wrong tokens get updated, wrong samples get distilled, or global distribution shifts are poorly captured—leading to instability or weak robustness gains.
- Representative papers:
- Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing
- Not All Tokens See Equally: Perception-Grounded Policy Optimization for Large Vision-Language Models
- PLOT: Enhancing Preference Learning via Optimal Transport
- Common approach:
- Route supervision based on sample status (SRPO sends incorrect rollouts to SDPO and the rest to GRPO, entropy-weighting SDPO tokens); see the routing sketch after this theme.
- Reweight token-level learning signals using causal dependency measures (PGPO uses KL between vision-conditioned vs text-only token distributions).
- Replace local token tweaks with distribution-level objectives (PLOT uses an OT/Wasserstein-style token loss with embedding-based costs).
- Open questions / failure modes:
- Generalization beyond tested scales/domains (PGPO up to 7B; SRPO on Qwen3 4B/8B and five benchmarks; PLOT on small preference datasets).
- Hyperparameter sensitivity (PGPO τ/β; PLOT α; SRPO depends on having correct sibling rollouts for teacher info).
- Whether these methods preserve behavior under adversarial prompting beyond reported ASR reductions (PLOT) and benchmark gains.
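A minimal sketch of SRPO-style sample routing, with illustrative names rather than the paper’s code: failed rollouts are paired with a correct sibling as a self-distillation teacher, while everything else takes a standardized group-relative advantage; entropy-based token weighting is omitted.

```python
def srpo_update_plan(rollouts):
    """Split one prompt's rollout group into GRPO- and SDPO-routed samples.

    rollouts: list of dicts with a binary "reward" and a "tokens" field.
    Illustrative SRPO-style routing, not the paper's exact algorithm.
    """
    rewards = [r["reward"] for r in rollouts]
    mean = sum(rewards) / len(rewards)
    std = (sum((x - mean) ** 2 for x in rewards) / len(rewards)) ** 0.5 or 1.0

    correct = [r for r in rollouts if r["reward"] == 1]
    plan = []
    for r in rollouts:
        if r["reward"] == 0 and correct:
            # SDPO branch: a correct sibling serves as the distillation
            # teacher; entropy-based token weighting is omitted here.
            plan.append(("sdpo", r, correct[0]))
        else:
            # GRPO branch: standardized group-relative advantage.
            plan.append(("grpo", r, (r["reward"] - mean) / std))
    return plan

# Example group: three rollouts, one correct.
plan = srpo_update_plan([
    {"reward": 1, "tokens": ["..."]},
    {"reward": 0, "tokens": ["..."]},
    {"reward": 0, "tokens": ["..."]},
])
print([kind for kind, _, _ in plan])   # ['grpo', 'sdpo', 'sdpo']
```

Note the fallback: when a group has no correct rollout, there is no teacher, so every sample stays on the GRPO branch.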
Theme: Factuality, abstention, and uncertainty localization (including diffusion LMs)
- Why it matters: “Confident but wrong” outputs persist; better signals for where uncertainty is and when to abstain enable targeted correction rather than blanket refusal.
- Representative papers:
- OSCAR: Orchestrated Self-verification and Cross-path Refinement
- PRISM: Probability Reallocation with In-Span Masking for Knowledge-Sensitive Alignment
- Answering the Wrong Question: Reasoning Trace Inversion for Abstention in LLMs
- ThinknCheck: Grounded Claim Verification with Compact, Reasoning-Driven, and Interpretable Models
- Common approach:
- Localize uncertainty to spans/tokens (cross-chain entropy for DLM commitments; fact-aligned masks + risk propagation); a localization sketch follows this theme.
- Apply targeted interventions (remask/re-denoise uncertain spans; probability reallocation only on risky fact spans).
- Use compact verifiers with supervised rationales for grounded decisions (1B verifier with structured reasoning).
- Open questions / failure modes:
- OSCAR’s VRAM overhead (parallel chains) and limits when the model lacks knowledge (consistent hallucinations across chains).
- PRISM depends on fact extraction/verification quality and requires tuning λ to avoid capability degradation.
- Trace inversion adds multiple LLM calls (cost) and is tailored to reasoning-trace models.
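A minimal sketch of cross-chain entropy localization in the OSCAR spirit. It assumes an interface where N parallel denoising chains produce equal-length token sequences; high-entropy positions become candidate spans for remasking and re-denoising. This is illustrative of the localization signal, not OSCAR’s implementation.

```python
import math
from collections import Counter

def cross_chain_entropy(chains):
    """Per-position entropy of token choices across N parallel chains.

    `chains` is a list of equal-length token lists (one per denoising run).
    High entropy at a position means the chains disagree there, flagging a
    low-confidence commitment.
    """
    n = len(chains)
    scores = []
    for pos in range(len(chains[0])):
        counts = Counter(chain[pos] for chain in chains)
        scores.append(-sum((c / n) * math.log(c / n) for c in counts.values()))
    return scores

def spans_to_remask(entropy, threshold=0.5):
    """Contiguous high-entropy positions become candidate re-denoise spans."""
    spans, start = [], None
    for i, h in enumerate(entropy + [0.0]):     # sentinel flushes the last span
        if h > threshold and start is None:
            start = i
        elif h <= threshold and start is not None:
            spans.append((start, i))
            start = None
    return spans

chains = [list("the cat sat"), list("the cat sat"), list("the dog sat")]
print(spans_to_remask(cross_chain_entropy(chains), threshold=0.3))  # [(4, 7)]
```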
3) Technical synthesis
- Budget-awareness is becoming a unifying design principle across agent reliability: ToolMisuseBench budgets (steps/calls/retries), ContextBudget’s explicit remaining-context state, and CoT token-budget sweeps all show that “more compute” can hurt without correct allocation.
- Routing/weighting is the common fix for coarse credit assignment: SRPO routes samples between GRPO and SDPO; PGPO routes advantage mass to visually-dependent tokens; both aim to reduce gradient variance and prevent late-stage collapse.
- Verification is shifting earlier and more locally: SWE-HERO uses execution-backed refinement after large execution-free distillation; OSCAR corrects uncertain spans before they “crystallize” in diffusion decoding; SAFE (multi-hop) verifies each atomic step (KG triple) with a trained feedback model.
- Determinism + replayability is the new benchmark gold standard for tool reliability and safety: ToolMisuseBench’s seeded fault engine and ATBench’s planner-based synthesis + human audit enable controlled ablations and longitudinal comparisons.
- Trajectory-level safety diagnosis is still the bottleneck: ATBench shows binary unsafe detection can be decent while fine-grained attribution is very low; Connor addresses this by intent extraction + step-wise behavior deviation judgments.
- Mechanistic interpretability is being used adversarially and diagnostically: CRaFT uses circuit influence (via cross-layer transcoders) to find causally effective refusal features, producing much higher jailbreak ASR than activation-based selection.
- RAG alignment is moving from IR labels to reader-utility signals: RRPO trains rerankers with RL using LLM-evaluated generation rewards (see the reward sketch after this list); Neuro-RIT adapts the generator at neuron granularity to ignore irrelevant retrieval.
- Small, structured reasoning supervision can beat larger baselines in verification: ThinknCheck’s 1B model with supervised rationales surpasses a 7B verifier on LLMAggreFact balanced accuracy and generalizes better to SciFact.
- Embodied robustness is expanding beyond 2D patches: Tex3D’s differentiable 3D texture optimization (dual renderer + temporal weighting) shows large failure-rate increases and sim-to-real transfer, implying object appearance is a first-class attack surface.
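As an example of reader-utility reward design (RRPO-style idea; all callables here are hypothetical stand-ins), the reranker’s action is a passage ordering, and its reward is an LLM judge’s score of the downstream reader’s answer rather than a static IR label:

```python
def reranker_reward(query, passages, rerank_policy, reader, judge, k=5):
    """One RL step: the action is a passage ordering, the reward is answer utility."""
    order = rerank_policy(query, passages)     # e.g., a sampled permutation
    top_k = [passages[i] for i in order[:k]]
    answer = reader(query, top_k)              # frozen downstream generator
    return judge(query, answer), order         # scalar utility in [0, 1]

# Schematic policy-gradient update (REINFORCE-flavored):
#   reward, order = reranker_reward(q, docs, policy, reader, judge)
#   loss = -reward * policy.log_prob(order)
```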
4) Top 5 papers (with “why now”)
1) From SWE-ZERO to SWE-HERO: Execution-free to Execution-based Fine-tuning for Software Engineering Agents
- Two-stage SFT: 300k execution-free distilled trajectories then 13.2k execution-backed refinement.
- Strong open-source SWE-bench Verified results (e.g., 62.2% for 32B) and clear ablation showing the execution-free stage matters (55.7% → 62.2%).
- Practical recipe details (128k context via YaRN; multi-turn masking; test-time scaling with verifiers).
- Skepticism: inherits teacher biases (Qwen3-Coder-480B) and depends on verifier quality; environment variance affects reproducibility.
2) ATBench: A Diverse and Realistic Trajectory Benchmark for Long-Horizon Agent Safety
- 1,000 human-audited tool-grounded trajectories with delayed triggers; large tool pool (2,084 tools; 1,954 calls).
- Shows a key gap: strong models can do binary safety moderately well (GPT-5.4 76.7% F1) but fail at diagnosis (e.g., 13.5% failure-mode accuracy).
- Provides a controllable taxonomy (risk source / failure mode / harm) for targeted evaluation slices.
- Skepticism: single-label per axis can miss multi-causal interpretations; English-only; text+tool only (no multimodal/embodied).
3) From Component Manipulation to System Compromise: Understanding and Detecting Malicious MCP Servers
- Component-centric PoC dataset: 114 malicious servers (19 influence paths × 6 goals); shows multi-component compositions can raise ASR; direct code/config injection hits 100% ASR.
- Connor detector: 94.6% F1, strong ablation evidence (semantic generator critical), and marketplace sweep (1,672 servers → 2 confirmed malicious).
- Concrete blueprint for tool marketplace security: intent extraction + execution tracing + code slicing + step-wise judgments.
- Skepticism: relies on simulation/execution—payloads not triggered during simulation can evade; results depend on host/LLM versions.
4) OSCAR: Orchestrated Self-verification and Cross-path Refinement
- Training-free hallucination detection/correction for diffusion LMs via cross-chain entropy localization + targeted remasking.
- Beats a trained detector on AUROC (avg 86.5% on LLaDA-8B; 85.7% on Dream-7B) and improves QA F1 (+6.1 pp on LLaDA-8B; +10.7 on TriviaQA).
- Span-level reductions on RAGTruth (overall 41.1% hallucinated span mass reduction).
- Skepticism: increased peak VRAM (~1.67× for N=8) and limited to two DLMs; can’t fix “unknown unknowns” without retrieval.
5) Brief Is Better: Non-Monotonic Chain-of-Thought Budget Effects in Function-Calling Language Agents
- Clear deployment guidance: brief CoT helps routing; long CoT collapses accuracy (Qwen2.5-1.5B: 44% → 64% at 32 tokens, then 25% at 256).
- Mechanistic error breakdown: brief CoT slashes wrong-valid-function selection (30.5% → 1.5%); long CoT increases wrong-valid and hallucinated functions.
- FR-CoT prompt eliminates function hallucination (0.0%) while matching brief-CoT accuracy; a prompt-shape sketch follows this list.
- Skepticism: limited to BFCL v3 Multiple-function and three models; multi-step tool chains not evaluated.
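A sketch of the brief-CoT-plus-forced-commitment pattern referenced above. The prompt wording and parsing convention are illustrative, not the paper’s exact FR-CoT template; the 32-token cap reflects the reported non-monotonic sweep.

```python
# Cap the reasoning budget and force a commitment to one listed function
# before arguments are emitted; parse rejects hallucinated names.

BRIEF_COT_BUDGET = 32   # reported accuracy peaked around 8-32 reasoning tokens

def build_prompt(user_query, functions):
    listing = "\n".join(f"- {f['name']}: {f['description']}" for f in functions)
    return (
        f"Available functions:\n{listing}\n\n"
        f"Task: {user_query}\n"
        f"Think in at most {BRIEF_COT_BUDGET} tokens. Then write\n"
        "FUNCTION: <one name from the list above>\n"
        "on its own line, followed by the JSON arguments."
    )

def parse_commitment(output, functions):
    """Reject hallucinated functions: the commitment must match the list."""
    valid = {f["name"] for f in functions}
    for line in output.splitlines():
        if line.startswith("FUNCTION:"):
            name = line.split(":", 1)[1].strip()
            return name if name in valid else None   # None -> re-prompt or abstain
    return None
```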
5) Practical next steps
- Adopt budgeted evaluation: add ToolMisuseBench-style deterministic fault injection + AUC-over-budget caps to your internal tool-agent CI; track invalid-call rate, recovery time, and catastrophic failures separately.
- Implement “brief routing CoT” for function calling: try 8–32 token reasoning caps and/or FR-CoT-style forced function commitment; measure wrong-valid vs hallucinated-function rates.
- Treat context as a constrained control problem: prototype a remaining-context-aware compression policy (NULL/PARTIAL/FULL over segments) and evaluate robustness under shrinking budgets (e.g., 16k→4k); see the sketch after this list.
- Harden tool supply chains: add pre-execution config scanning for risky startup commands and intent extraction from tool schemas; consider trajectory-level behavior deviation checks for high-risk tools.
- Move from binary safety to diagnosis: if using trajectory safety benchmarks (e.g., ATBench-like), train/measure fine-grained attribution (risk source/failure mode/harm), not just safe/unsafe.
- For RAG systems, optimize retrieval for reader utility: experiment with RL-trained rerankers using LLM-based generation rewards (RRPO-style) and compare against IR-label-trained rerankers on downstream F1/EM.
- For factuality, localize then correct: where model-native uncertainty signals exist (diffusion chains), do span-level correction; for AR models, consider training-time span masking/reallocation (PRISM-like) if you have fact-risk annotations.
- For embodied systems, add appearance-robustness tests: include object-bound texture/appearance perturbations (multi-view, EoT-style) in sim evaluation; track transfer to physical setups if applicable.
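Finally, a minimal sketch of the NULL/PARTIAL/FULL compression step from the context item above. A ContextBudget-style system would learn this policy with RL; the greedy relevance heuristic and the `summarize` helper here are stand-ins.

```python
def compress_history(segments, remaining_tokens, summarize):
    """segments: list of (text, token_count, relevance) tuples.
    `summarize` is a hypothetical callable returning (short_text, token_count)."""
    kept = []
    # Higher-relevance segments get first claim on the budget (recency can
    # be folded into the relevance score upstream).
    for text, tokens, _rel in sorted(segments, key=lambda s: s[2], reverse=True):
        if tokens <= remaining_tokens:
            kept.append(text)                    # FULL
            remaining_tokens -= tokens
        else:
            short, short_tokens = summarize(text)
            if short_tokens <= remaining_tokens:
                kept.append(short)               # PARTIAL
                remaining_tokens -= short_tokens
            # else: NULL -- drop the segment entirely
    return kept
```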
Generated from per-paper analyses; no external browsing.
