Daily AI Paper Report (2026-04-04)

Published:

Chinese version: [中文]

Run stats

  • Candidates: 254
  • Selected: 30
  • Deepread completed: 30
  • Window (UTC): 2026-04-02T00:00:00Z → 2026-04-03T00:00:00Z (arxiv_announce, expanded=0)
Show selected papers
arXiv IDTitle / LinksCategoriesScoreWhyTags
2604.02174Quantifying Self-Preservation Bias in Large Language Models
PDF
cs.AI95Benchmark quantifies self-preservation bias via role inconsistency; strong agentic misalignment signal.agent-safety, instrumental-convergence, shutdown-resistance, evaluation, RLHF, benchmark
2604.02022ATBench: A Diverse and Realistic Trajectory Benchmark for Long-Horizon Agent Safety
PDF
cs.AI94Long-horizon trajectory benchmark for agent safety with delayed triggers and harm taxonomy.agent-safety, benchmark, long-horizon, tool-use, red-teaming, evaluation
2604.01604CRaFT: Circuit-Guided Refusal Feature Selection via Cross-Layer Transcoders
PDF
cs.AI94Circuit-guided refusal features near boundary; improves jailbreak/ASR analysis and controlLLM-safety, refusal, mechanistic-interpretability, jailbreaks, feature-selection, circuits
2604.01905From Component Manipulation to System Compromise: Understanding and Detecting Malicious MCP Servers
PDF
cs.CR, cs.SE92Component-centric dataset + detection for malicious MCP servers; targets real tool-ecosystem attacks.security, agents, MCP, supply-chain, tooling, dataset, detection
2604.02230Answering the Wrong Question: Reasoning Trace Inversion for Abstention in LLMs
PDF
cs.AI92New abstention method (trace inversion) targets reasoning-model overanswering failuresabstention, hallucinations, reasoning-models, reliability, uncertainty, evaluation
2604.01658CORAL: Towards Autonomous Multi-Agent Evolution for Open-Ended Discovery
PDF
cs.AI92Autonomous multi-agent evolution w/ persistent memory + practical safeguards; strong agentic relevanceagents, multi-agent, open-ended, autonomous, safeguards, evaluation, infrastructure
2604.01496From SWE-ZERO to SWE-HERO: Execution-free to Execution-based Fine-tuning for Software Engineering Agents
PDF
cs.SE, cs.CL91Strong SWE-bench gains + large released trajectories; advances real agentic coding workflowsagents, software-engineering, SWE-bench, post-training, datasets, tool-use
2604.01508ToolMisuseBench: An Offline Deterministic Benchmark for Tool Misuse and Recovery in Agentic Systems
PDF
cs.SE, cs.AI90Deterministic offline benchmark for tool misuse/recovery with budgets and fault injection; very reusable.agents, tool-use, robustness, benchmark, fault-injection, evaluation
2604.02091Optimizing RAG Rerankers with LLM Feedback via Reinforcement Learning
PDF
cs.CL, cs.AI, cs.IR90RL aligns RAG reranking to downstream LLM answer utility, not static IR labelsRAG, reranking, RL, LLM-feedback, evaluation, alignment
2604.01664ContextBudget: Budget-Aware Context Management for Long-Horizon Search Agents
PDF
cs.AI90RL-based budget-aware context compression for long-horizon agents; directly targets context-limit failuresagents, long-horizon, context-management, compression, reinforcement-learning, efficiency
2604.02288Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing
PDF
cs.LG, cs.AI89Unifies GRPO/SDPO via routing; addresses RLVR credit assignment + late-stage collapseRLVR, post-training, GRPO, distillation, optimization-stability, alignment-training
2604.01977RuleForge: Automated Generation and Validation for Web Vulnerability Detection at Scale
PDF
cs.CR, cs.AI, cs.CL, cs.LG, cs.SE88Automates CVE detection-rule generation at scale; high security impact and deployable architecturesecurity, vulnerability-detection, CVE, rule-generation, automation, threat-detection
2604.01624OSCAR: Orchestrated Self-verification and Cross-path Refinement
PDF
cs.AI, cs.CL87Hallucination mitigation using diffusion LM trajectories; unsupervised uncertainty localizationhallucinations, diffusion-language-models, uncertainty, self-verification, inference-time-control
2604.01652ThinknCheck: Grounded Claim Verification with Compact, Reasoning-Driven, and Interpretable Models
PDF
cs.AI, cs.CL871B grounded claim verifier w/ structured rationales; strong gains vs larger baselines, interpretableverification, factuality, grounding, hallucinations, small-models, interpretability, evaluation
2604.01925ImplicitBBQ: Benchmarking Implicit Bias in Large Language Models through Characteristic Based Cues
PDF
cs.CL, cs.AI86New implicit-bias QA benchmark using characteristic cues; shows bias persists despite explicit suppression.bias, evaluation, safety, fairness, benchmark, implicit-signals
2604.01993SAFE: Stepwise Atomic Feedback for Error correction in Multi-hop Reasoning
PDF
cs.CL, cs.AI86Benchmarking with verifiable atomic steps; filters unanswerables and gives stepwise feedbackevaluation, multi-hop-reasoning, verification, benchmarks, grounding, error-taxonomy
2604.01837PLOT: Enhancing Preference Learning via Optimal Transport
PDF
cs.CL86Optimal-transport token loss for preference learning; aims for stability/robustness gainsalignment, preference-learning, DPO/RLHF, optimal-transport, token-level
2604.02322Batched Contextual Reinforcement: A Task-Scaling Law for Efficient Reasoning
PDF
cs.LG, cs.AI, cs.CL86Task-scaling law via solving N problems in one context; reduces CoT token cost with simple trainingreasoning, efficiency, scaling-laws, training, chain-of-thought, inference-cost
2604.01682PRISM: Probability Reallocation with In-Span Masking for Knowledge-Sensitive Alignment
PDF
cs.CL85Risk-gated SFT objective to reduce overconfident hallucinations at fact-critical spans.hallucination, alignment, factuality, SFT, uncertainty, training
2604.02155Brief Is Better: Non-Monotonic Chain-of-Thought Budget Effects in Function-Calling Language Agents
PDF
cs.CL84Finds non-monotonic CoT budget effects in function-calling agents; actionable for agent design.agents, function-calling, reasoning, chain-of-thought, evaluation, reliability
2604.02194Neuro-RIT: Neuron-Guided Instruction Tuning for Robust Retrieval-Augmented Language Model
PDF
cs.CL, cs.AI84Neuron-level tuning to resist noisy/irrelevant retrieval; improves RAG robustnessRAG, robustness, retrieval-noise, instruction-tuning, attribution, neurons
2604.01610GraphWalk: Enabling Reasoning in Large Language Models through Tool-Based Graph Navigation
PDF
cs.AI84Training-free tool-based KG navigation enables multi-hop reasoning beyond context limitsagents, tool-use, knowledge-graphs, grounding, long-context, reasoning
2604.02047Goose: Anisotropic Speculation Trees for Training-Free Speculative Decoding
PDF
cs.CL, cs.AI84Training-free speculative decoding w/ anisotropic trees; principled use of mixed-quality token sourcesinference, speculative-decoding, efficiency, decoding, systems
2604.01754LiveMathematicianBench: A Live Benchmark for Mathematician-Level Reasoning with Proof Sketches
PDF
cs.CL, cs.AI, cs.LG83Live, post-cutoff math benchmark from recent arXiv theorems; reduces contamination, adds taxonomyevaluation, math-reasoning, benchmark, data-contamination, proof-sketches
2604.01676GPA: Learning GUI Process Automation from Demonstrations
PDF
cs.CV, cs.AI, cs.SE82Deterministic, local GUI automation from one demo; emphasizes reliability calibration and privacy.agents, GUI, RPA, privacy, reliability, tooling
2604.01576Care-Conditioned Neuromodulation for Autonomy-Preserving Supportive Dialogue Agents
PDF
cs.LG82Alignment for supportive agents: autonomy-preserving objective + relational failure benchmarkalignment, social-risk, autonomy, supportive-agents, benchmarks, reward-modeling
2604.01840Not All Tokens See Equally: Perception-Grounded Policy Optimization for Large Vision-Language Models
PDF
cs.AI82Credits only visually-dependent tokens in RLVR; sharper learning signal for LVLM reasoningmultimodal, VLM, RLVR, credit-assignment, reasoning, optimization
2604.01618Tex3D: Objects as Attack Surfaces via Adversarial 3D Textures for Vision-Language-Action Models
PDF
cs.CV, cs.AI81Physically plausible adversarial 3D textures attack VLA models; important robotics safety surface.adversarial, robotics, VLA, physical-attacks, robustness, security
2604.01988SenseMath: Do LLMs Have Number Sense? Evaluating Shortcut Use, Judgment, and Generation
PDF
cs.AI81Controlled benchmark for number sense + shortcut use/judgment; useful probe of reasoning reliabilityevaluation, numerical-reasoning, robustness, shortcuts, calibration
2604.02276De Jure: Iterative LLM Self-Refinement for Structured Extraction of Regulatory Rules
PDF
cs.AI, cs.CL, cs.LG80Automated regulatory rule extraction with judge+iterative repair; useful for compliance-aware agents.governance, compliance, LLM-judge, self-refinement, information-extraction, agents

AI Paper Insight Brief

2026-04-04

0) Executive takeaways (read this first)

  • Agent reliability is shifting from “capability” to “operational correctness under constraints”: deterministic fault injection + budgeted scoring for tool misuse (ToolMisuseBench) and explicit context-window budgeting as an RL decision problem (ContextBudget) make failures attributable and optimizable.
  • Execution-heavy SWE agent training can be made scalable via a “semantic distill → small execution refine” recipe: SWE-ZERO (300k execution-free trajectories) + SWE-HERO (13k execution-backed) materially improves SWE-bench Verified (e.g., 32B: 62.2%) while reducing infra dependence.
  • Safety evaluation is becoming trajectory-native and system-supply-chain-aware: ATBench exposes long-horizon, delayed-trigger tool risks where even strong models struggle at fine-grained diagnosis; MCP server security work shows multi-component exploit chains and provides a behavior-deviation detector (Connor) with high F1 (94.6%) plus real marketplace finds.
  • “Reasoning” is not monotonicly beneficial—budgeting and credit assignment matter: long CoT can harm function-calling accuracy (peaks at very brief 8–32 tokens); multimodal RL improves when advantages are routed to visually-dependent tokens (PGPO); RL post-training stabilizes when routing samples between GRPO and self-distillation (SRPO).
  • Factuality/abstention is moving toward localized, model-native signals and targeted interventions: diffusion LMs can localize uncertain commitments via cross-chain entropy and correct spans (OSCAR); abstention improves by detecting “query misalignment” via reasoning-trace inversion.

2) Key themes (clusters)

Theme: Budgeted, deterministic evaluation for tool-using agents

Theme: Scalable training + verification loops for code and long-horizon autonomy

  • Why it matters: Execution environments and long-running search loops are the bottleneck for open-source agents; scalable data + persistent memory can unlock capability without prohibitive infra.
  • Representative papers:
  • Common approach:
    • Split “cheap semantic learning” from “expensive verification” (execution-free distillation then execution-backed refinement).
    • Externalize state/knowledge into persistent artifacts (notes/skills/attempts) to enable reuse and cross-agent diffusion.
    • Prefer deterministic replay/grounding mechanisms (SMC-based GUI element localization + readiness gating; bounded retries).
  • Open questions / failure modes:
    • Teacher inheritance and verifier precision limits in SWE distillation; brittleness from environment variance.
    • How CORAL-style autonomy behaves with weaker models or ambiguous evaluators (paper notes evaluator assumptions).
    • Record-and-replay systems (GPA) can’t handle tasks requiring new planning beyond the demonstration.

Theme: Trajectory-level safety + supply-chain/tooling security

Theme: Credit assignment and routing in post-training (RLVR / preference learning)

  • Why it matters: Many alignment failures are optimization artifacts: wrong tokens get updated, wrong samples get distilled, or global distribution shifts are poorly captured—leading to instability or weak robustness gains.
  • Representative papers:
  • Common approach:
    • Route supervision based on sample status (SRPO sends incorrect rollouts to SDPO, others to GRPO; entropy-weight SDPO tokens).
    • Reweight token-level learning signals using causal dependency measures (PGPO uses KL between vision-conditioned vs text-only token distributions).
    • Replace local token tweaks with distribution-level objectives (PLOT uses an OT/Wasserstein-style token loss with embedding-based costs).
  • Open questions / failure modes:
    • Generalization beyond tested scales/domains (PGPO up to 7B; SRPO on Qwen3 4B/8B and five benchmarks; PLOT on small preference datasets).
    • Hyperparameter sensitivity (PGPO τ/β; PLOT α; SRPO depends on having correct sibling rollouts for teacher info).
    • Whether these methods preserve behavior under adversarial prompting beyond reported ASR reductions (PLOT) and benchmark gains.

Theme: Factuality, abstention, and uncertainty localization (including diffusion LMs)

3) Technical synthesis

  • Budget-awareness is becoming a unifying design principle across agent reliability: ToolMisuseBench budgets (steps/calls/retries), ContextBudget’s explicit remaining-context state, and CoT token-budget sweeps all show that “more compute” can hurt without correct allocation.
  • Routing/weighting is the common fix for coarse credit assignment: SRPO routes samples between GRPO and SDPO; PGPO routes advantage mass to visually-dependent tokens; both aim to reduce gradient variance and prevent late-stage collapse.
  • Verification is shifting earlier and more locally: SWE-HERO uses execution-backed refinement after large execution-free distillation; OSCAR corrects uncertain spans before they “crystallize” in diffusion decoding; SAFE (multi-hop) verifies each atomic step (KG triple) with a trained feedback model.
  • Determinism + replayability is the new benchmark gold standard for tool reliability and safety: ToolMisuseBench’s seeded fault engine and ATBench’s planner-based synthesis + human audit enable controlled ablations and longitudinal comparisons.
  • Trajectory-level safety diagnosis is still the bottleneck: ATBench shows binary unsafe detection can be decent while fine-grained attribution is very low; Connor addresses this by intent extraction + step-wise behavior deviation judgments.
  • Mechanistic interpretability is being used adversarially and diagnostically: CRaFT uses circuit influence (via cross-layer transcoders) to find causally effective refusal features, producing much higher jailbreak ASR than activation-based selection.
  • RAG alignment is moving from IR labels to reader-utility signals: RRPO trains rerankers with RL using LLM-evaluated generation rewards; Neuro-RIT adapts the generator at neuron granularity to ignore irrelevant retrieval.
  • Small, structured reasoning supervision can beat larger baselines in verification: ThinknCheck’s 1B model with supervised rationales surpasses a 7B verifier on LLMAggreFact balanced accuracy and generalizes better to SciFact.
  • Embodied robustness is expanding beyond 2D patches: Tex3D’s differentiable 3D texture optimization (dual renderer + temporal weighting) shows large failure-rate increases and sim-to-real transfer, implying object appearance is a first-class attack surface.

4) Top 5 papers (with “why now”)

1) From SWE-ZERO to SWE-HERO: Execution-free to Execution-based Fine-tuning for Software Engineering Agents

  • Two-stage SFT: 300k execution-free distilled trajectories then 13.2k execution-backed refinement.
  • Strong open-source SWE-bench Verified results (e.g., 62.2% for 32B) and clear ablation showing the execution-free stage matters (55.7% → 62.2%).
  • Practical recipe details (128k context via YaRN; multi-turn masking; test-time scaling with verifiers).
  • Skepticism: inherits teacher biases (Qwen3-Coder-480B) and depends on verifier quality; environment variance affects reproducibility.

2) ATBench: A Diverse and Realistic Trajectory Benchmark for Long-Horizon Agent Safety

  • 1,000 human-audited tool-grounded trajectories with delayed triggers; large tool pool (2,084 tools; 1,954 calls).
  • Shows a key gap: strong models can do binary safety moderately well (GPT-5.4 76.7% F1) but fail at diagnosis (e.g., 13.5% failure-mode accuracy).
  • Provides a controllable taxonomy (risk source / failure mode / harm) for targeted evaluation slices.
  • Skepticism: single-label per axis can miss multi-causal interpretations; English-only; text+tool only (no multimodal/embodied).

3) From Component Manipulation to System Compromise: Understanding and Detecting Malicious MCP Servers

  • Component-centric PoC dataset: 114 malicious servers (19 influence paths × 6 goals); shows multi-component compositions can raise ASR; direct code/config injection hits 100% ASR.
  • Connor detector: 94.6% F1, strong ablation evidence (semantic generator critical), and marketplace sweep (1,672 servers → 2 confirmed malicious).
  • Concrete blueprint for tool marketplace security: intent extraction + execution tracing + code slicing + step-wise judgments.
  • Skepticism: relies on simulation/execution—payloads not triggered during simulation can evade; results depend on host/LLM versions.

4) OSCAR: Orchestrated Self-verification and Cross-path Refinement

  • Training-free hallucination detection/correction for diffusion LMs via cross-chain entropy localization + targeted remasking.
  • Beats a trained detector on AUROC (avg 86.5% on LLaDA-8B; 85.7% on Dream-7B) and improves QA F1 (+6.1 pp on LLaDA-8B; +10.7 on TriviaQA).
  • Span-level reductions on RAGTruth (overall 41.1% hallucinated span mass reduction).
  • Skepticism: increased peak VRAM (~1.67× for N=8) and limited to two DLMs; can’t fix “unknown unknowns” without retrieval.

5) Brief Is Better: Non-Monotonic Chain-of-Thought Budget Effects in Function-Calling Language Agents

  • Clear deployment guidance: brief CoT helps routing; long CoT collapses accuracy (Qwen2.5-1.5B: 44% → 64% at 32 tokens, then 25% at 256).
  • Mechanistic error breakdown: brief CoT slashes wrong-valid-function selection (30.5% → 1.5%); long CoT increases wrong-valid and hallucinated functions.
  • FR-CoT prompt eliminates function hallucination (0.0%) while matching brief-CoT accuracy.
  • Skepticism: limited to BFCL v3 Multiple-function and three models; multi-step tool chains not evaluated.

5) Practical next steps

  • Adopt budgeted evaluation: add ToolMisuseBench-style deterministic fault injection + AUC-over-budget caps to your internal tool-agent CI; track invalid-call rate, recovery time, and catastrophic failures separately.
  • Implement “brief routing CoT” for function calling: try 8–32 token reasoning caps and/or FR-CoT-style forced function commitment; measure wrong-valid vs hallucinated-function rates.
  • Treat context as a constrained control problem: prototype a remaining-context-aware compression policy (NULL/PARTIAL/FULL over segments) and evaluate robustness under shrinking budgets (e.g., 16k→4k).
  • Harden tool supply chains: add pre-execution config scanning for risky startup commands and intent extraction from tool schemas; consider trajectory-level behavior deviation checks for high-risk tools.
  • Move from binary safety to diagnosis: if using trajectory safety benchmarks (e.g., ATBench-like), train/measure fine-grained attribution (risk source/failure mode/harm), not just safe/unsafe.
  • For RAG systems, optimize retrieval for reader utility: experiment with RL-trained rerankers using LLM-based generation rewards (RRPO-style) and compare against IR-label-trained rerankers on downstream F1/EM.
  • For factuality, localize then correct: where model-native uncertainty signals exist (diffusion chains), do span-level correction; for AR models, consider training-time span masking/reallocation (PRISM-like) if you have fact-risk annotations.
  • For embodied systems, add appearance-robustness tests: include object-bound texture/appearance perturbations (multi-view, EoT-style) in sim evaluation; track transfer to physical setups if applicable.

Generated from per-paper analyses; no external browsing.