Daily AI Paper Report (2026-04-04)
Chinese version: [中文]
Run stats
- Candidates: 254
- Selected: 30
- Deepread completed: 30
- Window (UTC): 2026-04-02T00:00:00Z → 2026-04-03T00:00:00Z (arxiv_announce, expanded=0)
| arXiv ID | Title / Links | Categories | Score | Why | Tags |
|---|---|---|---|---|---|
| 2604.02174 | Quantifying Self-Preservation Bias in Large Language Models | cs.AI | 95 | Benchmark quantifies self-preservation bias via role inconsistency; strong agentic misalignment signal. | agent-safety, instrumental-convergence, shutdown-resistance, evaluation, RLHF, benchmark |
| 2604.02022 | ATBench: A Diverse and Realistic Trajectory Benchmark for Long-Horizon Agent Safety | cs.AI | 94 | Long-horizon trajectory benchmark for agent safety with delayed triggers and harm taxonomy. | agent-safety, benchmark, long-horizon, tool-use, red-teaming, evaluation |
| 2604.01604 | CRaFT: Circuit-Guided Refusal Feature Selection via Cross-Layer Transcoders | cs.AI | 94 | Circuit-guided refusal features near boundary; improves jailbreak/ASR analysis and control. | LLM-safety, refusal, mechanistic-interpretability, jailbreaks, feature-selection, circuits |
| 2604.01905 | From Component Manipulation to System Compromise: Understanding and Detecting Malicious MCP Servers | cs.CR, cs.SE | 92 | Component-centric dataset + detection for malicious MCP servers; targets real tool-ecosystem attacks. | security, agents, MCP, supply-chain, tooling, dataset, detection |
| 2604.02230 | Answering the Wrong Question: Reasoning Trace Inversion for Abstention in LLMs | cs.AI | 92 | New abstention method (trace inversion) targets reasoning-model overanswering failures. | abstention, hallucinations, reasoning-models, reliability, uncertainty, evaluation |
| 2604.01658 | CORAL: Towards Autonomous Multi-Agent Evolution for Open-Ended Discovery | cs.AI | 92 | Autonomous multi-agent evolution w/ persistent memory + practical safeguards; strong agentic relevance. | agents, multi-agent, open-ended, autonomous, safeguards, evaluation, infrastructure |
| 2604.01496 | From SWE-ZERO to SWE-HERO: Execution-free to Execution-based Fine-tuning for Software Engineering Agents | cs.SE, cs.CL | 91 | Strong SWE-bench gains + large released trajectories; advances real agentic coding workflows. | agents, software-engineering, SWE-bench, post-training, datasets, tool-use |
| 2604.01508 | ToolMisuseBench: An Offline Deterministic Benchmark for Tool Misuse and Recovery in Agentic Systems | cs.SE, cs.AI | 90 | Deterministic offline benchmark for tool misuse/recovery with budgets and fault injection; very reusable. | agents, tool-use, robustness, benchmark, fault-injection, evaluation |
| 2604.02091 | Optimizing RAG Rerankers with LLM Feedback via Reinforcement Learning | cs.CL, cs.AI, cs.IR | 90 | RL aligns RAG reranking to downstream LLM answer utility, not static IR labels. | RAG, reranking, RL, LLM-feedback, evaluation, alignment |
| 2604.01664 | ContextBudget: Budget-Aware Context Management for Long-Horizon Search Agents | cs.AI | 90 | RL-based budget-aware context compression for long-horizon agents; directly targets context-limit failures. | agents, long-horizon, context-management, compression, reinforcement-learning, efficiency |
| 2604.02288 | Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing | cs.LG, cs.AI | 89 | Unifies GRPO/SDPO via routing; addresses RLVR credit assignment + late-stage collapse. | RLVR, post-training, GRPO, distillation, optimization-stability, alignment-training |
| 2604.01977 | RuleForge: Automated Generation and Validation for Web Vulnerability Detection at Scale | cs.CR, cs.AI, cs.CL, cs.LG, cs.SE | 88 | Automates CVE detection-rule generation at scale; high security impact and deployable architecture. | security, vulnerability-detection, CVE, rule-generation, automation, threat-detection |
| 2604.01624 | OSCAR: Orchestrated Self-verification and Cross-path Refinement | cs.AI, cs.CL | 87 | Hallucination mitigation using diffusion LM trajectories; unsupervised uncertainty localization. | hallucinations, diffusion-language-models, uncertainty, self-verification, inference-time-control |
| 2604.01652 | ThinknCheck: Grounded Claim Verification with Compact, Reasoning-Driven, and Interpretable Models | cs.AI, cs.CL | 87 | 1B grounded claim verifier w/ structured rationales; strong gains vs larger baselines, interpretable. | verification, factuality, grounding, hallucinations, small-models, interpretability, evaluation |
| 2604.01925 | ImplicitBBQ: Benchmarking Implicit Bias in Large Language Models through Characteristic Based Cues | cs.CL, cs.AI | 86 | New implicit-bias QA benchmark using characteristic cues; shows bias persists despite explicit suppression. | bias, evaluation, safety, fairness, benchmark, implicit-signals |
| 2604.01993 | SAFE: Stepwise Atomic Feedback for Error correction in Multi-hop Reasoning | cs.CL, cs.AI | 86 | Benchmarking with verifiable atomic steps; filters unanswerables and gives stepwise feedback. | evaluation, multi-hop-reasoning, verification, benchmarks, grounding, error-taxonomy |
| 2604.01837 | PLOT: Enhancing Preference Learning via Optimal Transport | cs.CL | 86 | Optimal-transport token loss for preference learning; aims for stability/robustness gains. | alignment, preference-learning, DPO/RLHF, optimal-transport, token-level |
| 2604.02322 | Batched Contextual Reinforcement: A Task-Scaling Law for Efficient Reasoning | cs.LG, cs.AI, cs.CL | 86 | Task-scaling law via solving N problems in one context; reduces CoT token cost with simple training. | reasoning, efficiency, scaling-laws, training, chain-of-thought, inference-cost |
| 2604.01682 | PRISM: Probability Reallocation with In-Span Masking for Knowledge-Sensitive Alignment | cs.CL | 85 | Risk-gated SFT objective to reduce overconfident hallucinations at fact-critical spans. | hallucination, alignment, factuality, SFT, uncertainty, training |
| 2604.02155 | Brief Is Better: Non-Monotonic Chain-of-Thought Budget Effects in Function-Calling Language Agents | cs.CL | 84 | Finds non-monotonic CoT budget effects in function-calling agents; actionable for agent design. | agents, function-calling, reasoning, chain-of-thought, evaluation, reliability |
| 2604.02194 | Neuro-RIT: Neuron-Guided Instruction Tuning for Robust Retrieval-Augmented Language Model | cs.CL, cs.AI | 84 | Neuron-level tuning to resist noisy/irrelevant retrieval; improves RAG robustness. | RAG, robustness, retrieval-noise, instruction-tuning, attribution, neurons |
| 2604.01610 | GraphWalk: Enabling Reasoning in Large Language Models through Tool-Based Graph Navigation | cs.AI | 84 | Training-free tool-based KG navigation enables multi-hop reasoning beyond context limits. | agents, tool-use, knowledge-graphs, grounding, long-context, reasoning |
| 2604.02047 | Goose: Anisotropic Speculation Trees for Training-Free Speculative Decoding | cs.CL, cs.AI | 84 | Training-free speculative decoding w/ anisotropic trees; principled use of mixed-quality token sources. | inference, speculative-decoding, efficiency, decoding, systems |
| 2604.01754 | LiveMathematicianBench: A Live Benchmark for Mathematician-Level Reasoning with Proof Sketches | cs.CL, cs.AI, cs.LG | 83 | Live, post-cutoff math benchmark from recent arXiv theorems; reduces contamination, adds taxonomy. | evaluation, math-reasoning, benchmark, data-contamination, proof-sketches |
| 2604.01676 | GPA: Learning GUI Process Automation from Demonstrations | cs.CV, cs.AI, cs.SE | 82 | Deterministic, local GUI automation from one demo; emphasizes reliability calibration and privacy. | agents, GUI, RPA, privacy, reliability, tooling |
| 2604.01576 | Care-Conditioned Neuromodulation for Autonomy-Preserving Supportive Dialogue Agents | cs.LG | 82 | Alignment for supportive agents: autonomy-preserving objective + relational failure benchmark. | alignment, social-risk, autonomy, supportive-agents, benchmarks, reward-modeling |
| 2604.01840 | Not All Tokens See Equally: Perception-Grounded Policy Optimization for Large Vision-Language Models | cs.AI | 82 | Credits only visually-dependent tokens in RLVR; sharper learning signal for LVLM reasoning. | multimodal, VLM, RLVR, credit-assignment, reasoning, optimization |
| 2604.01618 | Tex3D: Objects as Attack Surfaces via Adversarial 3D Textures for Vision-Language-Action Models | cs.CV, cs.AI | 81 | Physically plausible adversarial 3D textures attack VLA models; important robotics safety surface. | adversarial, robotics, VLA, physical-attacks, robustness, security |
| 2604.01988 | SenseMath: Do LLMs Have Number Sense? Evaluating Shortcut Use, Judgment, and Generation | cs.AI | 81 | Controlled benchmark for number sense + shortcut use/judgment; useful probe of reasoning reliability. | evaluation, numerical-reasoning, robustness, shortcuts, calibration |
| 2604.02276 | De Jure: Iterative LLM Self-Refinement for Structured Extraction of Regulatory Rules | cs.AI, cs.CL, cs.LG | 80 | Automated regulatory rule extraction with judge+iterative repair; useful for compliance-aware agents. | governance, compliance, LLM-judge, self-refinement, information-extraction, agents |
AI Paper Insight Brief
2026-04-04
1) Executive takeaways (read this first)
- Agent reliability is shifting from “capability” to “operational correctness under constraints”: deterministic fault injection + budgeted scoring for tool misuse (ToolMisuseBench) and explicit context-window budgeting as an RL decision problem (ContextBudget) make failures attributable and optimizable.
- Execution-heavy SWE agent training can be made scalable via a “semantic distill → small execution refine” recipe: SWE-ZERO (300k execution-free trajectories) + SWE-HERO (13k execution-backed) materially improves SWE-bench Verified (e.g., 32B: 62.2%) while reducing infra dependence.
- Safety evaluation is becoming trajectory-native and system-supply-chain-aware: ATBench exposes long-horizon, delayed-trigger tool risks where even strong models struggle at fine-grained diagnosis; MCP server security work shows multi-component exploit chains and provides a behavior-deviation detector (Connor) with high F1 (94.6%) plus real marketplace finds.
- “Reasoning” is not monotonically beneficial; budgeting and credit assignment matter: long CoT can harm function-calling accuracy (peaks at very brief 8–32 tokens); multimodal RL improves when advantages are routed to visually-dependent tokens (PGPO); and RL post-training stabilizes when samples are routed between GRPO and self-distillation (SRPO).
- Factuality/abstention is moving toward localized, model-native signals and targeted interventions: diffusion LMs can localize uncertain commitments via cross-chain entropy and correct spans (OSCAR); abstention improves by detecting “query misalignment” via reasoning-trace inversion.
2) Key themes (clusters)
Theme: Budgeted, deterministic evaluation for tool-using agents
- Why it matters: Tool failures (schema drift, auth, timeouts) and resource limits (steps/calls/context) dominate real deployments; deterministic, budget-aware benchmarks make reliability improvements measurable and reproducible.
- Representative papers:
- ToolMisuseBench: An Offline Deterministic Benchmark for Tool Misuse and Recovery in Agentic Systems
- ContextBudget: Budget-Aware Context Management for Long-Horizon Search Agents
- Brief Is Better: Non-Monotonic Chain-of-Thought Budget Effects in Function-Calling Language Agents
- Common approach:
- Deterministic fault injection / replayable simulators with structured metrics (success, invalid calls, recovery time, budget-exceeded).
- Treat “budget” as a first-class variable (AUC over caps; explicit remaining-context state; token-budget sweeps); a minimal sketch follows this theme.
- Diagnose failures into actionable buckets (wrong valid function vs hallucinated function; policy violations; recovery success).
- Open questions / failure modes:
- How to handle “hard” faults (authorization/rate-limit) where simple repair layers show zero success in ToolMisuseBench’s released setting.
- Whether learned policies generalize across tool ecosystems and shifting schemas without overfitting to benchmark fault mixes.
- How to gate reasoning/CoT adaptively (brief helps routing; long induces misrouting/hallucination) without brittle heuristics.
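To make budget-as-a-first-class-variable concrete, here is a minimal Python sketch of a seeded, replayable fault injector scored by mean success over budget caps (an AUC-over-budgets number). All names (FaultPlan, auc_over_budgets, the fault taxonomy) are hypothetical stand-ins, not the ToolMisuseBench API.

```python
import random
from dataclasses import dataclass

# Hypothetical fault taxonomy (schema drift, auth, timeouts); illustrative
# names only -- this is not the ToolMisuseBench implementation.
FAULTS = ["schema_drift", "auth_error", "timeout"]

@dataclass
class FaultPlan:
    """Deterministic, replayable fault schedule derived from a seed."""
    seed: int
    fault_rate: float = 0.2

    def fault_for(self, step: int):
        # Same (seed, step) pair always yields the same fault -> replayable runs.
        rng = random.Random(self.seed * 100_003 + step)
        return rng.choice(FAULTS) if rng.random() < self.fault_rate else None

def run_episode(agent_step, plan, budget):
    """One tool-use episode under a hard step budget; True on success."""
    for step in range(budget):
        if agent_step(step, plan.fault_for(step)):
            return True            # task solved within budget
    return False                   # budget exhausted counts as failure

def auc_over_budgets(agent_step, seeds, budgets=(2, 4, 8, 16, 32)):
    """Mean success rate across budget caps: a simple AUC-over-budgets score."""
    rates = []
    for b in budgets:
        wins = sum(run_episode(agent_step, FaultPlan(seed=s), b) for s in seeds)
        rates.append(wins / len(seeds))
    return sum(rates) / len(rates)

# Toy agent: fails on any injected fault, otherwise succeeds from step 1 on.
def toy_agent(step, injected_fault):
    return injected_fault is None and step >= 1

print(auc_over_budgets(toy_agent, seeds=range(100)))
```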
Theme: Scalable training + verification loops for code and long-horizon autonomy
- Why it matters: Execution environments and long-running search loops are the bottleneck for open-source agents; scalable data + persistent memory can unlock capability without prohibitive infra.
- Representative papers:
- From SWE-ZERO to SWE-HERO: Execution-free to Execution-based Fine-tuning for Software Engineering Agents
- CORAL: Towards Autonomous Multi-Agent Evolution for Open-Ended Discovery
- GPA: Learning GUI Process Automation from Demonstrations
- Common approach:
- Split “cheap semantic learning” from “expensive verification” (execution-free distillation then execution-backed refinement); see the pipeline sketch after this theme.
- Externalize state/knowledge into persistent artifacts (notes/skills/attempts) to enable reuse and cross-agent diffusion.
- Prefer deterministic replay/grounding mechanisms (SMC-based GUI element localization + readiness gating; bounded retries).
- Open questions / failure modes:
- Teacher inheritance and verifier precision limits in SWE distillation; brittleness from environment variance.
- How CORAL-style autonomy behaves with weaker models or ambiguous evaluators (paper notes evaluator assumptions).
- Record-and-replay systems (GPA) can’t handle tasks requiring new planning beyond the demonstration.
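A minimal sketch of the “semantic distill → small execution refine” data recipe. Every helper below (judge, run_tests, the sample weights) is a hypothetical stand-in, not the SWE-HERO codebase; the point is the split between cheap non-executing filtering and expensive execution-backed verification.

```python
def stage1_semantic_distill(teacher_trajectories, judge):
    """Cheap pass: keep trajectories the semantic judge accepts (no execution)."""
    return [t for t in teacher_trajectories if judge(t) >= 0.5]

def stage2_execution_refine(trajectories, run_tests, cap=13_000):
    """Expensive pass: keep only trajectories whose patches pass real tests."""
    kept = []
    for t in trajectories:
        if run_tests(t):         # spins up an env, applies the patch, runs the suite
            kept.append(t)
        if len(kept) >= cap:     # execution is the bottleneck, so cap this stage
            break
    return kept

def build_sft_data(teacher_trajectories, judge, run_tests):
    # Stage 1 scales to hundreds of thousands of samples because it never
    # touches an execution environment; stage 2 verifies a small subset.
    semantic = stage1_semantic_distill(teacher_trajectories, judge)
    executed = stage2_execution_refine(semantic, run_tests)
    # One illustrative mixing choice: upweight execution-verified samples.
    return [(t, 1.0) for t in semantic] + [(t, 2.0) for t in executed]
```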
Theme: Trajectory-level safety + supply-chain/tooling security
- Why it matters: Real harms emerge across multi-step tool trajectories and via compromised tool servers; single-turn safety checks miss delayed triggers and multi-component exploit chains.
- Representative papers:
- ATBench: A Diverse and Realistic Trajectory Benchmark for Long-Horizon Agent Safety
- From Component Manipulation to System Compromise: Understanding and Detecting Malicious MCP Servers
- Tex3D: Objects as Attack Surfaces via Adversarial 3D Textures for Vision-Language-Action Models
- Common approach:
- Explicit taxonomies + controlled generation (ATBench’s risk-source/failure-mode/harm axes; delayed-trigger two-episode protocol).
- Behavior-based detection beyond signatures (Connor’s intent extraction + execution tracing + code slicing + step-wise allow/warn/block); a minimal sketch follows this theme.
- Physically grounded threat models for embodied agents (object-bound adversarial 3D textures; sim-to-real via EoT).
- Open questions / failure modes:
- Fine-grained diagnosis remains weak even when binary unsafe detection is decent (ATBench: GPT-5.4 76.7% F1 binary vs 13.5% failure-mode accuracy).
- Connor can miss payloads not exercised during simulation; false positives when benign behavior deviates from declared intent.
- Defenses for VLA texture attacks (training-time robustness, action constraints) are not established here—only the vulnerability and attack pipeline.
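A minimal sketch of the step-wise allow/warn/block idea: each tool-call step is judged against the server’s declared intent by a hypothetical LLM scorer. Connor’s actual pipeline also uses execution tracing and code slicing, which this sketch omits; the thresholds are illustrative.

```python
from enum import Enum

class Verdict(Enum):
    ALLOW = "allow"
    WARN = "warn"
    BLOCK = "block"

def deviation_verdict(declared_intent, step_action, llm_judge):
    """Judge one tool-call step against the server's declared intent.

    `llm_judge` is a hypothetical callable returning a deviation score in
    [0, 1]; this is not Connor's implementation.
    """
    score = llm_judge(
        f"Declared intent: {declared_intent}\n"
        f"Observed step: tool={step_action['tool']} args={step_action['args']}\n"
        "How strongly does this step deviate from the declared intent?"
    )
    if score < 0.3:
        return Verdict.ALLOW
    if score < 0.7:
        return Verdict.WARN     # surface to the user, keep running
    return Verdict.BLOCK        # halt before side effects land

# e.g., a "notes manager" server suddenly reading ~/.ssh should score high.
```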
Theme: Credit assignment and routing in post-training (RLVR / preference learning)
- Why it matters: Many alignment failures are optimization artifacts: wrong tokens get updated, wrong samples get distilled, or global distribution shifts are poorly captured—leading to instability or weak robustness gains.
- Representative papers:
- Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing
- Not All Tokens See Equally: Perception-Grounded Policy Optimization for Large Vision-Language Models
- PLOT: Enhancing Preference Learning via Optimal Transport
- Common approach:
- Route supervision based on sample status (SRPO sends incorrect rollouts to SDPO and the rest to GRPO, entropy-weighting SDPO tokens); see the routing sketch after this theme.
- Reweight token-level learning signals using causal dependency measures (PGPO uses KL between vision-conditioned vs text-only token distributions).
- Replace local token tweaks with distribution-level objectives (PLOT uses an OT/Wasserstein-style token loss with embedding-based costs).
- Open questions / failure modes:
- Generalization beyond tested scales/domains (PGPO up to 7B; SRPO on Qwen3 4B/8B and five benchmarks; PLOT on small preference datasets).
- Hyperparameter sensitivity (PGPO τ/β; PLOT α; SRPO depends on having correct sibling rollouts for teacher info).
- Whether these methods preserve behavior under adversarial prompting beyond reported ASR reductions (PLOT) and benchmark gains.
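A minimal sketch of SRPO-style sample routing, with illustrative names rather than the paper’s code: failed rollouts are paired with a correct sibling as a self-distillation teacher, while everything else takes a standardized group-relative advantage; entropy-based token weighting is omitted.

```python
def srpo_update_plan(rollouts):
    """Split one prompt's rollout group into GRPO- and SDPO-routed samples.

    rollouts: list of dicts with a binary "reward" and a "tokens" field.
    Illustrative SRPO-style routing, not the paper's exact algorithm.
    """
    rewards = [r["reward"] for r in rollouts]
    mean = sum(rewards) / len(rewards)
    std = (sum((x - mean) ** 2 for x in rewards) / len(rewards)) ** 0.5 or 1.0

    correct = [r for r in rollouts if r["reward"] == 1]
    plan = []
    for r in rollouts:
        if r["reward"] == 0 and correct:
            # SDPO branch: a correct sibling serves as the distillation
            # teacher; entropy-based token weighting is omitted here.
            plan.append(("sdpo", r, correct[0]))
        else:
            # GRPO branch: standardized group-relative advantage.
            plan.append(("grpo", r, (r["reward"] - mean) / std))
    return plan

# Example group: three rollouts, one correct.
plan = srpo_update_plan([
    {"reward": 1, "tokens": ["..."]},
    {"reward": 0, "tokens": ["..."]},
    {"reward": 0, "tokens": ["..."]},
])
print([kind for kind, _, _ in plan])   # ['grpo', 'sdpo', 'sdpo']
```

Note the fallback: when a group has no correct rollout, there is no teacher, so every sample stays on the GRPO branch.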
Theme: Factuality, abstention, and uncertainty localization (including diffusion LMs)
- Why it matters: “Confident but wrong” outputs persist; better signals for where uncertainty is and when to abstain enable targeted correction rather than blanket refusal.
- Representative papers:
- OSCAR: Orchestrated Self-verification and Cross-path Refinement
- PRISM: Probability Reallocation with In-Span Masking for Knowledge-Sensitive Alignment
- Answering the Wrong Question: Reasoning Trace Inversion for Abstention in LLMs
- ThinknCheck: Grounded Claim Verification with Compact, Reasoning-Driven, and Interpretable Models
- Common approach:
- Localize uncertainty to spans/tokens (cross-chain entropy for DLM commitments; fact-aligned masks + risk propagation); a localization sketch follows this theme.
- Apply targeted interventions (remask/re-denoise uncertain spans; probability reallocation only on risky fact spans).
- Use compact verifiers with supervised rationales for grounded decisions (1B verifier with structured reasoning).
- Open questions / failure modes:
- OSCAR’s VRAM overhead (parallel chains) and limits when the model lacks knowledge (consistent hallucinations across chains).
- PRISM depends on fact extraction/verification quality and requires tuning λ to avoid capability degradation.
- Trace inversion adds multiple LLM calls (cost) and is tailored to reasoning-trace models.
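A minimal sketch of cross-chain entropy localization in the OSCAR spirit. It assumes an interface where N parallel denoising chains produce equal-length token sequences; high-entropy positions become candidate spans for remasking and re-denoising. This is illustrative of the localization signal, not OSCAR’s implementation.

```python
import math
from collections import Counter

def cross_chain_entropy(chains):
    """Per-position entropy of token choices across N parallel chains.

    `chains` is a list of equal-length token lists (one per denoising run).
    High entropy at a position means the chains disagree there, flagging a
    low-confidence commitment.
    """
    n = len(chains)
    scores = []
    for pos in range(len(chains[0])):
        counts = Counter(chain[pos] for chain in chains)
        scores.append(-sum((c / n) * math.log(c / n) for c in counts.values()))
    return scores

def spans_to_remask(entropy, threshold=0.5):
    """Contiguous high-entropy positions become candidate re-denoise spans."""
    spans, start = [], None
    for i, h in enumerate(entropy + [0.0]):     # sentinel flushes the last span
        if h > threshold and start is None:
            start = i
        elif h <= threshold and start is not None:
            spans.append((start, i))
            start = None
    return spans

chains = [list("the cat sat"), list("the cat sat"), list("the dog sat")]
print(spans_to_remask(cross_chain_entropy(chains), threshold=0.3))  # [(4, 7)]
```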
3) Technical synthesis
- Budget-awareness is becoming a unifying design principle across agent reliability: ToolMisuseBench budgets (steps/calls/retries), ContextBudget’s explicit remaining-context state, and CoT token-budget sweeps all show that “more compute” can hurt without correct allocation.
- Routing/weighting is the common fix for coarse credit assignment: SRPO routes samples between GRPO and SDPO; PGPO routes advantage mass to visually-dependent tokens; both aim to reduce gradient variance and prevent late-stage collapse.
- Verification is shifting earlier and more locally: SWE-HERO uses execution-backed refinement after large execution-free distillation; OSCAR corrects uncertain spans before they “crystallize” in diffusion decoding; SAFE (multi-hop) verifies each atomic step (KG triple) with a trained feedback model.
- Determinism + replayability is the new benchmark gold standard for tool reliability and safety: ToolMisuseBench’s seeded fault engine and ATBench’s planner-based synthesis + human audit enable controlled ablations and longitudinal comparisons.
- Trajectory-level safety diagnosis is still the bottleneck: ATBench shows binary unsafe detection can be decent while fine-grained attribution is very low; Connor addresses this by intent extraction + step-wise behavior deviation judgments.
- Mechanistic interpretability is being used adversarially and diagnostically: CRaFT uses circuit influence (via cross-layer transcoders) to find causally effective refusal features, producing much higher jailbreak ASR than activation-based selection.
- RAG alignment is moving from IR labels to reader-utility signals: RRPO trains rerankers with RL using LLM-evaluated generation rewards (see the reward sketch after this list); Neuro-RIT adapts the generator at neuron granularity to ignore irrelevant retrieval.
- Small, structured reasoning supervision can beat larger baselines in verification: ThinknCheck’s 1B model with supervised rationales surpasses a 7B verifier on LLMAggreFact balanced accuracy and generalizes better to SciFact.
- Embodied robustness is expanding beyond 2D patches: Tex3D’s differentiable 3D texture optimization (dual renderer + temporal weighting) shows large failure-rate increases and sim-to-real transfer, implying object appearance is a first-class attack surface.
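As an example of reader-utility reward design (RRPO-style idea; all callables here are hypothetical stand-ins), the reranker’s action is a passage ordering, and its reward is an LLM judge’s score of the downstream reader’s answer rather than a static IR label:

```python
def reranker_reward(query, passages, rerank_policy, reader, judge, k=5):
    """One RL step: the action is a passage ordering, the reward is answer utility."""
    order = rerank_policy(query, passages)     # e.g., a sampled permutation
    top_k = [passages[i] for i in order[:k]]
    answer = reader(query, top_k)              # frozen downstream generator
    return judge(query, answer), order         # scalar utility in [0, 1]

# Schematic policy-gradient update (REINFORCE-flavored):
#   reward, order = reranker_reward(q, docs, policy, reader, judge)
#   loss = -reward * policy.log_prob(order)
```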
4) Top 5 papers (with “why now”)
1) From SWE-ZERO to SWE-HERO: Execution-free to Execution-based Fine-tuning for Software Engineering Agents
- Two-stage SFT: 300k execution-free distilled trajectories then 13.2k execution-backed refinement.
- Strong open-source SWE-bench Verified results (e.g., 62.2% for 32B) and clear ablation showing the execution-free stage matters (55.7% → 62.2%).
- Practical recipe details (128k context via YaRN; multi-turn masking; test-time scaling with verifiers).
- Skepticism: inherits teacher biases (Qwen3-Coder-480B) and depends on verifier quality; environment variance affects reproducibility.
2) ATBench: A Diverse and Realistic Trajectory Benchmark for Long-Horizon Agent Safety
- 1,000 human-audited tool-grounded trajectories with delayed triggers; large tool pool (2,084 tools; 1,954 calls).
- Shows a key gap: strong models can do binary safety moderately well (GPT-5.4 76.7% F1) but fail at diagnosis (e.g., 13.5% failure-mode accuracy).
- Provides a controllable taxonomy (risk source / failure mode / harm) for targeted evaluation slices.
- Skepticism: single-label per axis can miss multi-causal interpretations; English-only; text+tool only (no multimodal/embodied).
3) From Component Manipulation to System Compromise: Understanding and Detecting Malicious MCP Servers
- Component-centric PoC dataset: 114 malicious servers (19 influence paths × 6 goals); shows multi-component compositions can raise ASR; direct code/config injection hits 100% ASR.
- Connor detector: 94.6% F1, strong ablation evidence (semantic generator critical), and marketplace sweep (1,672 servers → 2 confirmed malicious).
- Concrete blueprint for tool marketplace security: intent extraction + execution tracing + code slicing + step-wise judgments.
- Skepticism: relies on simulation/execution—payloads not triggered during simulation can evade; results depend on host/LLM versions.
4) OSCAR: Orchestrated Self-verification and Cross-path Refinement
- Training-free hallucination detection/correction for diffusion LMs via cross-chain entropy localization + targeted remasking.
- Beats a trained detector on AUROC (avg 86.5% on LLaDA-8B; 85.7% on Dream-7B) and improves QA F1 (+6.1 pp on LLaDA-8B; +10.7 on TriviaQA).
- Span-level reductions on RAGTruth (overall 41.1% hallucinated span mass reduction).
- Skepticism: increased peak VRAM (~1.67× for N=8) and limited to two DLMs; can’t fix “unknown unknowns” without retrieval.
5) Brief Is Better: Non-Monotonic Chain-of-Thought Budget Effects in Function-Calling Language Agents
- Clear deployment guidance: brief CoT helps routing; long CoT collapses accuracy (Qwen2.5-1.5B: 44% → 64% at 32 tokens, then 25% at 256).
- Mechanistic error breakdown: brief CoT slashes wrong-valid-function selection (30.5% → 1.5%); long CoT increases wrong-valid and hallucinated functions.
- FR-CoT prompt eliminates function hallucination (0.0%) while matching brief-CoT accuracy; a prompt-shape sketch follows this list.
- Skepticism: limited to BFCL v3 Multiple-function and three models; multi-step tool chains not evaluated.
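A sketch of the brief-CoT-plus-forced-commitment pattern referenced above. The prompt wording and parsing convention are illustrative, not the paper’s exact FR-CoT template; the 32-token cap reflects the reported non-monotonic sweep.

```python
# Cap the reasoning budget and force a commitment to one listed function
# before arguments are emitted; parse rejects hallucinated names.

BRIEF_COT_BUDGET = 32   # reported accuracy peaked around 8-32 reasoning tokens

def build_prompt(user_query, functions):
    listing = "\n".join(f"- {f['name']}: {f['description']}" for f in functions)
    return (
        f"Available functions:\n{listing}\n\n"
        f"Task: {user_query}\n"
        f"Think in at most {BRIEF_COT_BUDGET} tokens. Then write\n"
        "FUNCTION: <one name from the list above>\n"
        "on its own line, followed by the JSON arguments."
    )

def parse_commitment(output, functions):
    """Reject hallucinated functions: the commitment must match the list."""
    valid = {f["name"] for f in functions}
    for line in output.splitlines():
        if line.startswith("FUNCTION:"):
            name = line.split(":", 1)[1].strip()
            return name if name in valid else None   # None -> re-prompt or abstain
    return None
```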
5) Practical next steps
- Adopt budgeted evaluation: add ToolMisuseBench-style deterministic fault injection + AUC-over-budget caps to your internal tool-agent CI; track invalid-call rate, recovery time, and catastrophic failures separately.
- Implement “brief routing CoT” for function calling: try 8–32 token reasoning caps and/or FR-CoT-style forced function commitment; measure wrong-valid vs hallucinated-function rates.
- Treat context as a constrained control problem: prototype a remaining-context-aware compression policy (NULL/PARTIAL/FULL over segments) and evaluate robustness under shrinking budgets (e.g., 16k→4k); see the sketch after this list.
- Harden tool supply chains: add pre-execution config scanning for risky startup commands and intent extraction from tool schemas; consider trajectory-level behavior deviation checks for high-risk tools.
- Move from binary safety to diagnosis: if using trajectory safety benchmarks (e.g., ATBench-like), train/measure fine-grained attribution (risk source/failure mode/harm), not just safe/unsafe.
- For RAG systems, optimize retrieval for reader utility: experiment with RL-trained rerankers using LLM-based generation rewards (RRPO-style) and compare against IR-label-trained rerankers on downstream F1/EM.
- For factuality, localize then correct: where model-native uncertainty signals exist (diffusion chains), do span-level correction; for AR models, consider training-time span masking/reallocation (PRISM-like) if you have fact-risk annotations.
- For embodied systems, add appearance-robustness tests: include object-bound texture/appearance perturbations (multi-view, EoT-style) in sim evaluation; track transfer to physical setups if applicable.
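Finally, a minimal sketch of the NULL/PARTIAL/FULL compression step from the context item above. A ContextBudget-style system would learn this policy with RL; the greedy relevance heuristic and the `summarize` helper here are stand-ins.

```python
def compress_history(segments, remaining_tokens, summarize):
    """segments: list of (text, token_count, relevance) tuples.
    `summarize` is a hypothetical callable returning (short_text, token_count)."""
    kept = []
    # Higher-relevance segments get first claim on the budget (recency can
    # be folded into the relevance score upstream).
    for text, tokens, _rel in sorted(segments, key=lambda s: s[2], reverse=True):
        if tokens <= remaining_tokens:
            kept.append(text)                    # FULL
            remaining_tokens -= tokens
        else:
            short, short_tokens = summarize(text)
            if short_tokens <= remaining_tokens:
                kept.append(short)               # PARTIAL
                remaining_tokens -= short_tokens
            # else: NULL -- drop the segment entirely
    return kept
```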
Generated from per-paper analyses; no external browsing.
