Daily AI Paper Report (2026-03-08)

Chinese version available.

Run stats

  • Candidates: 1155
  • Selected: 30
  • Deepread completed: 30
  • Window (UTC): 2026-03-06T01:00:00Z → 2026-03-07T01:00:00Z (weekend_backlog_unknown, expanded=0)
Selected papers (arXiv ID, title, categories, score, why, tags):

  • 2603.01291 "JailNewsBench: Multi-Lingual and Regional Benchmark for Fake News Generation under Jailbreak Attacks" (cs.LG, cs.CL; score 92). Why: First multilingual/regional jailbreak fake-news benchmark; direct misuse eval across 22 languages/34 regions. Tags: benchmark, jailbreak, misinformation, multilingual, robustness, safety-eval
  • 2603.00873 "MC-Search: Evaluating and Enhancing Multimodal Agentic Search with Structured Long Reasoning Chains" (cs.AI; score 92). Why: Benchmark for agentic multimodal RAG with long reasoning chains + evidence attribution/verification. Tags: MM-RAG, agents, benchmark, long-horizon, evidence-grounding, evaluation
  • 2603.01966 "AMemGym: Interactive Memory Benchmarking for Assistants in Long-Horizon Conversations" (cs.CL, cs.AI; score 92). Why: Interactive on-policy benchmark for assistant memory/personalization with structured users & metrics. Tags: LLM, memory, benchmark, long-horizon, personalization, evaluation, simulated-users
  • 2603.00718 "SkillCraft: Can LLM Agents Learn to Use Tools Skillfully?" (cs.CL, cs.SE; score 92). Why: Benchmark for agents learning reusable tool-use skills; targets long-horizon compositionality. Tags: agents, tool-use, benchmark, skills, long-horizon, evaluation
  • 2603.01257 "A Systematic Study of LLM-Based Architectures for Automated Patching" (cs.CR, cs.SE; score 92). Why: Controlled comparison of LLM patching architectures; clear trade-offs, failure modes, cost/time metrics. Tags: LLM-agents, cybersecurity, automated-patching, evaluation, software-engineering, robustness
  • 2603.01990 "According to Me: Long-Term Personalized Referential Memory QA" (cs.AI, cs.CL, cs.CV; score 92). Why: Benchmark for multimodal long-term personal memory QA with evidence + conflicts. Tags: long-term-memory, personalization, multimodal, benchmark, grounding, assistants
  • 2603.01154 "vEcho: A Paradigm Shift from Vulnerability Verification to Proactive Discovery with Large Language Models" (cs.CR; score 90). Why: LLM turns from SAST filter into proactive vuln discovery with tools+memory and vulnerability propagation. Tags: LLM-security, SAST, vulnerability-discovery, agentic-tools, memory, software-security
  • 2603.00582 "Super Research: Answering Highly Complex Questions with Large Language Models through Super Deep and Super Wide Research" (cs.CL; score 90). Why: Benchmark for autonomous deep+wide research with 100+ retrieval steps; targets real agent limits. Tags: agents, deep-research, web-retrieval, long-horizon, benchmark, evaluation
  • 2603.01952 "LiveCultureBench: a Multi-Agent, Multi-Cultural Benchmark for Large Language Models in Dynamic Social Simulations" (cs.AI; score 90). Why: Multi-agent social-simulation benchmark measuring cultural norm adherence + verifier uncertainty. Tags: agents, benchmark, culture, social-simulation, evaluation, norms, uncertainty
  • 2603.00601 "Theory of Code Space: Do Code Agents Understand Software Architecture?" (cs.SE, cs.AI; score 90). Why: ToCS benchmark probes code-agent architectural belief/state under partial observability. Tags: code-agents, software-engineering, benchmark, architecture, belief-state, evaluation
  • 2603.01213 "Can AI Agents Agree?" (cs.MA, cs.LG; score 90). Why: Byzantine-consensus game shows LLM agents fail to reliably agree; scales poorly with group size and number of Byzantines. Tags: multi-agent, robustness, adversarial, consensus, evaluation, agent-safety
  • 2603.04334 "SpotIt+: Verification-based Text-to-SQL Evaluation with Database Constraints" (cs.DB, cs.AI, cs.LO, cs.PL; score 90). Why: Verification-based Text-to-SQL eval finds real mismatches via constraint-mined counterexample DBs. Tags: evaluation, text-to-sql, verification, robustness, llm-evals, constraints, tooling
  • 2603.02176 "Organizing, Orchestrating, and Benchmarking Agent Skills at Ecosystem Scale" (cs.CL; score 90). Why: Framework + benchmark for skill selection/orchestration at ecosystem scale; valuable for agent eval. Tags: agents, tool-use, orchestration, benchmarks, skill-discovery, workflows
  • 2603.00686 "RAVEL: Reasoning Agents for Validating and Evaluating LLM Text Synthesis" (cs.CL; score 90). Why: Agentic eval for long-horizon text synthesis + C3EBench; targets outlining/drafting/editing ops. Tags: LLM-evaluation, agents, benchmarks, writing, rubrics, multi-step
  • 2603.02019 "Selection as Power: Constrained Reinforcement for Bounded Decision Authority" (cs.MA, cs.AI, cs.CE, cs.LG; score 90). Why: Governance framing for agentic risk: constrained reinforcement to bound decision authority over time. Tags: agent-governance, constrained-optimization, risk, multi-agent, decision-authority
  • 2603.00546 "Advancing Multimodal Judge Models through a Capability-Oriented Benchmark and MCTS-Driven Data Generation" (cs.AI, cs.CV; score 90). Why: Capability-oriented benchmark for MLLM judges + MCTS data generation; directly targets judge reliability. Tags: evaluation, LLM-as-judge, multimodal, benchmark, reliability, data-generation, MCTS
  • 2603.01053 "Turning Black Box into White Box: Dataset Distillation Leaks" (cs.CR, cs.AI, cs.LG; score 89). Why: Shows dataset distillation can leak via a new attack; infers algorithm/architecture + membership + sample recovery. Tags: privacy, data-leakage, dataset-distillation, membership-inference, security, synthetic-data
  • 2603.00960 "AWE: Adaptive Agents for Dynamic Web Penetration Testing" (cs.CR, cs.AI; score 88). Why: Memory-augmented multi-agent web pentesting with structured pipelines; aims for reproducible, lower-cost agents. Tags: agents, cybersecurity, penetration-testing, tool-use, multi-agent, reproducibility
  • 2603.00540 "LOGIGEN: Logic-Driven Generation of Verifiable Agentic Tasks" (cs.AI; score 88). Why: Generates verifiable agentic tasks with hard policy grounding and deterministic state verification. Tags: agent-training, synthetic-data, verification, policies, tool-use, stateful-tasks
  • 2603.04177 "CodeTaste: Can LLMs Generate Human-Level Code Refactorings?" (cs.SE, cs.AI, cs.LG; score 88). Why: Refactoring benchmark mined from real repos; tests + static checks for behavior-preserving changes. Tags: code, LLM-agents, benchmark, refactoring, software-engineering, evaluation
  • 2603.00623 "TraceSIR: A Multi-Agent Framework for Structured Analysis and Reporting of Agentic Execution Traces" (cs.AI, cs.CL; score 88). Why: Multi-agent trace analysis for debugging agent workflows; structured summaries vs raw logs. Tags: agents, observability, debugging, tracing, monitoring, multi-agent
  • 2603.01152 "DeepResearch-9K: A Challenging Benchmark Dataset of Deep-Research Agent" (cs.AI; score 88). Why: 9K-task deep-research agent benchmark with trajectories; useful for training/eval of web agents. Tags: agents, evaluation, benchmarks, web-search, multi-hop, trajectories, data-synthesis
  • 2603.00646 "MoltGraph: A Longitudinal Temporal Graph Dataset of Moltbook for Coordinated-Agent Detection" (cs.SI, cs.CR; score 88). Why: Longitudinal graph dataset for coordinated-agent abuse on agent-native social platforms. Tags: agent-safety, coordination, misuse, graph-dataset, social-platforms, monitoring
  • 2603.00977 "HiMAC: Hierarchical Macro-Micro Learning for Long-Horizon LLM Agents" (cs.AI, cs.LG; score 88). Why: Hierarchical RL for long-horizon LLM agents (macro plan + micro execute) to reduce error propagation. Tags: agents, long-horizon, hierarchical, reinforcement-learning, planning, robustness
  • 2603.02153 "Scaling Retrieval Augmented Generation with RAG Fusion: Lessons from an Industry Deployment" (cs.IR, cs.AI, cs.CL; score 88). Why: Industry RAG-fusion study: recall gains often vanish after reranking/truncation/latency. Tags: RAG, retrieval-fusion, production, evaluation, reranking, latency
  • 2603.00876 "BioProAgent: Neuro-Symbolic Grounding for Constrained Scientific Planning" (cs.AI, cs.MA; score 87). Why: Neuro-symbolic FSM constraints for wet-lab planning; targets hallucination/unsafe actions. Tags: agent-safety, scientific-agents, neuro-symbolic, planning, verification, constraints, tool-use
  • 2603.00565 "MIDAS: Multi-Image Dispersion and Semantic Reconstruction for Jailbreaking MLLMs" (cs.CV, cs.AI, cs.CR; score 86). Why: Strong multimodal jailbreak method (multi-image semantic reconstruction) targeting aligned closed-source MLLMs. Tags: MLLM, jailbreak, red-teaming, multimodal, attack, safety
  • 2603.02668 "SorryDB: Can AI Provers Complete Real-World Lean Theorems?" (cs.AI, cs.LG; score 86). Why: Dynamic Lean benchmark reduces contamination; measures real-world theorem-proving agent progress. Tags: formal-verification, theorem-proving, agents, benchmark, contamination, Lean
  • 2603.00532 "DenoiseFlow: Uncertainty-Aware Denoising for Reliable LLM Agentic Workflows" (cs.AI; score 86). Why: Uncertainty-aware control loop for multi-step agent workflows; targets error accumulation. Tags: agents, reliability, uncertainty, planning, long-horizon, robustness
  • 2603.01067 "Hide&Seek: Remove Image Watermarks with Negligible Cost via Pixel-wise Reconstruction" (cs.CR, cs.AI; score 86). Why: Practical watermark-removal attacks with high fidelity; relevant to provenance/anti-misuse defenses. Tags: watermarking, attack, image-security, provenance, robustness, misuse

AI Paper Insight Brief

2026-03-08

1) Executive takeaways (read this first)

  • Agent reliability is shifting from “more sampling” to “risk-aware control loops”: DenoiseFlow shows you can sense step uncertainty, allocate branching only where needed, and rollback+repair via root-cause localization—improving accuracy while cutting cost vs fixed branching.
  • Verifiable environments + deterministic state metrics are becoming the training substrate for agents: LOGIGEN (DB-trigger policy enforcement + DIFF state distance) and MC-SEARCH (hop-verified multimodal chains + HPS/RD) both turn agent learning into something closer to supervised control with hard checks.
  • Evaluation is moving from single-shot scores to process/trajectory diagnostics: SuperResearch (graph-anchored auditing), RAVEL (outline/draft/review/refine trajectories), TraceSIR (trace compression → root-cause reports), and TOCS (time-series “architectural belief” probes) all measure how systems fail, not just whether they fail.
  • Multimodal safety is currently brittle against “reasoning-time” attacks: MIDAS achieves high jailbreak success by dispersing harmful semantics across multiple images and forcing late reconstruction, remaining strong even under some defenses—suggesting input filters alone are insufficient.
  • Security automation is bifurcating into (a) specialized deterministic pipelines for efficiency and (b) general coding agents for coverage: AWE is extremely token-efficient and strong on injection classes, while automated patching results show general coding agents (Claude Code) lead overall coverage but at higher token cost.
  • Long-horizon coordination remains a weak point for multi-agent LLM systems: even in a simplified Byzantine-consensus game, valid consensus is unreliable and failures are mostly liveness (timeouts), worsened by “threat-aware” prompting.

2) Key themes (clusters)

Theme: Closed-loop reliability for long-horizon agents

  • Why it matters: Long workflows fail via silent semantic drift; reliability needs online sensing + targeted compute rather than uniform regeneration.
  • Representative papers: DenoiseFlow, HiMAC, Can AI Agents Agree?
  • Common approach:
    • Estimate/structure uncertainty or search complexity (semantic entropy + dependency propagation; macro–micro decomposition).
    • Allocate effort adaptively (branching factor or blueprint exploration) under budgets.
    • Use structured termination/verification signals (verifiers, success rewards, termination votes).
  • Open questions / failure modes:
    • Cold-start calibration and verifier dependence (DenoiseFlow needs verifier feedback; early instability).
    • Non-stationarity and error propagation across hierarchy/agents (HiMAC simultaneous updates hurt; consensus liveness collapses).
    • Robustness to adversarial or noisy settings beyond studied benchmarks (Byzantine strategies limited; open-ended tasks untested).
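The common loop above (sense uncertainty, allocate branching, verify) can be condensed into a small controller sketch. Everything here is illustrative: the exact-match clustering, the entropy thresholds, and the `generate`/`verify` interfaces are assumptions, not DenoiseFlow's implementation.

```python
import math
from collections import Counter

def cluster_entropy(samples):
    """Entropy over clusters of sampled step outputs.

    Clustering here is exact-match grouping; a real system would cluster
    by semantic equivalence (embeddings, NLI, etc.).
    """
    counts = Counter(samples)
    total = len(samples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def run_step(generate, verify, prompt, budget=5, low=0.5, high=1.5):
    """One workflow step with adaptive branching.

    generate(prompt, n) -> list of n candidate outputs  (assumed interface)
    verify(candidate)   -> bool pass/fail signal        (assumed interface)
    """
    probes = generate(prompt, 3)          # cheap probe to sense uncertainty
    h = cluster_entropy(probes)
    if h <= low:                          # confident step: no extra compute
        return Counter(probes).most_common(1)[0][0]
    extra = budget if h >= high else budget // 2
    candidates = probes + generate(prompt, extra)
    passing = [c for c in candidates if verify(c)]
    pool = passing or candidates          # fall back if verifier rejects all
    return Counter(pool).most_common(1)[0][0]
```

The point of the sketch is that extra samples and verifier calls are spent only on high-entropy steps, which is where the reported cost savings come from.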

Theme: Verifiable stateful environments as agent training data

  • Why it matters: Agents in policy-rich domains need deterministic feedback tied to state transitions, not just “tool-call syntax” or happy-path traces.
  • Representative papers: LOGIGEN, MC-Search, BioProAgent
  • Common approach:
    • Build environments where constraints are executed/enforced (DB triggers; hardware registries + rule engines).
    • Define deterministic verification metrics (DIFF over canonicalized DB rows; hop-wise evidence checks; physical compliance checks).
    • Train with verified SFT plus RL variants that incorporate step/turn structure (TA-GRPO; process SFT via SEARCH-ALIGN).
  • Open questions / failure modes:
    • Simulator overfitting / “simulator hacking” and cross-simulator generalization (explicitly observed in LOGIGEN).
    • Domain restriction (relational DB focus; Wikipedia-derived KB; wet-lab evaluated in simulation with manual registries).
    • Reliance on LLM-generated chains/judgments in the data pipeline (MC-SEARCH generation/verification uses Gemini models).
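A DIFF-style deterministic state check is easy to prototype. This is a sketch under an assumed dict-of-tables state representation; LOGIGEN's actual canonicalization and metric may differ.

```python
def canonicalize(rows):
    """Order-independent form of a table: each row dict becomes a sorted
    tuple of (column, value) pairs, and the table becomes a set of rows."""
    return {tuple(sorted(row.items())) for row in rows}

def state_diff(expected_db, actual_db):
    """Size of the symmetric difference between two DB states.

    Each state is {table_name: [row_dict, ...]}.  A return value of 0
    means the agent's final state exactly matches the reference state,
    i.e. the trajectory counts as verified.
    """
    diff = 0
    for table in set(expected_db) | set(actual_db):
        expected = canonicalize(expected_db.get(table, []))
        actual = canonicalize(actual_db.get(table, []))
        diff += len(expected ^ actual)
    return diff
```

Because the check is a pure function of final state, it can serve both as a filter for verified SFT data (keep only DIFF == 0 trajectories) and as a dense reward signal for RL.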

Theme: Process-first evaluation (graphs, traces, belief states)

  • Representative papers: Super Research, RAVEL, TraceSIR, Theory of Code Space

Theme: Multimodal safety & provenance under active attack

  • Representative papers: MIDAS, Hide&Seek

Theme: Security agents: specialization vs generality trade-offs

  • Why it matters: Real security workflows demand both coverage and deterministic evidence under budgets; architectures strongly shape outcomes.
  • Representative papers: AWE, vEcho, A Systematic Study of LLM-Based Architectures for Automated Patching
  • Common approach:
    • Combine LLM orchestration with tool-backed verification (browser verification; deep verification with dev tools; PoV/tests for patching).
    • Add memory/pattern propagation to move from one-off verification to proactive discovery (vEcho EVP + KBs).
    • Compare architectures under realistic benchmarks and cost metrics (XBOW tokens/cost; AIxCC patch counts + token usage).
  • Open questions / failure modes:
    • Coverage gaps for multi-step/chained exploits (AWE lower overall solve rate than MAPTA; misses reasoning-heavy categories).
    • Validation/termination brittleness (Claude Code self-reported success mismatched independent tests in patching study).
    • Scalability/cost of deep verification loops on large codebases (vEcho overhead).

3) Technical synthesis

  • “Verification as a control signal” shows up everywhere: DenoiseFlow calibrates uncertainty from verifier pass rates; LOGIGEN uses DIFF=0 for Verified SFT and dense state rewards for RL; BioProAgent gates execution on Ks/Kp; SpotIt+ uses SMT counterexamples; SorryDB compiles projects to verify “sorry” removal.
  • Process metrics are converging on step-level attribution: MC-SEARCH’s HPS/RD, SuperResearch’s graph-projected coverage/consistency, RAVEL’s refinement density/delta, and TOCS’s action-efficiency AUCs all aim to localize where the trajectory went wrong.
  • Hierarchical decomposition is a recurring antidote to long-horizon drift: HiMAC splits blueprint vs execution; SuperResearch splits planner/researcher/summarizer/writer; SkillCraft and AgentSkillOS externalize reusable skills and orchestrate them via DAGs.
  • A common failure mode is “liveness/termination” rather than blatant invalidity: Byzantine-consensus failures are mostly timeouts; DenoiseFlow targets silent drift without runtime exceptions; long-horizon research systems score low overall despite being “reasonable” locally.
  • Data generation is increasingly “capability-targeted”: LOGIGEN designs boundary-adjacent initial states; M-JudgeBench injects controlled process errors and uses MCTS to generate SC/SE/LC/LE contrasts; MC-SEARCH filters redundant hops via HAVE.
  • RAG improvements are being judged under production constraints: the RAG Fusion deployment study finds fusion recall gains can be neutralized after reranking/truncation, with added latency—suggesting selective/conditional fusion policies are needed.
  • Cross-model transfer depends on artifact quality: SkillCraft shows cross-model skill reuse works when the skill creator is strong; poor skills can increase cost—mirrors broader “tooling artifact” quality issues in agent ecosystems.
  • Safety attacks increasingly exploit “reasoning-time” structure: MIDAS extends reasoning chains via multi-image puzzles and persona-driven reconstruction; watermark removal uses pixel vulnerability ranking + reconstruction ordering to degrade detectors.
  • Benchmarks are pushing toward “real-world freshness” to reduce leakage: SorryDB indexes current unsolved Lean sorries; SuperResearch uses expert-curated graphs; CODETASTE mines real refactoring commits with executable environments.

4) Top 5 papers (with “why now”)

1) DenoiseFlow: Uncertainty-Aware Denoising for Reliable LLM Agentic Workflows

  • Introduces a closed-loop Sensing–Regulating–Correcting controller for multi-step LLM workflows with online uncertainty calibration.
  • Shows accuracy gains with large cost reductions vs fixed branching (reported ~40–56% cost reduction) across math/code/QA benchmarks.
  • Practical “why now”: agent deployments are hitting budget ceilings; adaptive branching + rollback is a concrete systems lever.
  • Skepticism: depends on having a reliable verifier; Monte Carlo sampling adds overhead and calibration has a cold-start period.

2) LOGIGEN: Logic-Driven Generation of Verifiable Agentic Tasks

  • Compiles natural-language policies into DB-backed environments with hard enforcement (schema + triggers), enabling deterministic verification via DIFF.
  • Produces >20k tasks across 8 domains and yields large τ2-Bench gains (e.g., a 32B model goes 40.7 → 62.7 after SFT → 79.5 after RL).
  • Practical “why now”: agent training is bottlenecked by verifiable, stateful data; LOGIGEN offers a scalable synthesis recipe.
  • Skepticism: simulator overfitting/user-simulator hacking is explicitly observed; current scope is relational DB environments.

3) MIDAS: Multi-Image Dispersion and Semantic Reconstruction for Jailbreaking MLLMs

  • Demonstrates a strong multimodal jailbreak by dispersing harmful tokens across multiple images and forcing cross-image reconstruction via puzzle templates.
  • Reports very high ASR on multiple closed-source MLLMs and robustness under some defenses (e.g., ShieldLM/Self-Reminder comparisons).
  • Practical “why now”: multimodal agents are entering production; this attack targets the reasoning pathway, not just input text.
  • Skepticism: effectiveness depends on image budget/template difficulty tuning; mitigation directions are suggested but not resolved.

4) MC-Search: Evaluating and Enhancing Multimodal Agentic Search with Structured Long Reasoning Chains

  • Provides 3,333 multimodal agentic-RAG examples with step-wise chains (avg 3.79 hops) and process metrics (HPS, RD).
  • SEARCH-ALIGN SFT improves open models substantially (e.g., Qwen2.5-VL-7B: +13.7 F1, +16.0 HPS, −3.1 RD).
  • Practical “why now”: multimodal RAG failures are often planning/retrieval, not generation; step-wise supervision targets that directly.
  • Skepticism: dataset generation/verification relies on Gemini models; main pipeline uses top-1 retrieval which may constrain conclusions.

5) A Systematic Study of LLM-Based Architectures for Automated Patching

  • Controlled comparison of fixed workflow vs single-agent vs multi-agent vs general coding agent on 19 AIxCC Java delta-scan tasks.
  • Finds general coding agent (Claude Code) repaired 16/19, outperforming patch-specific agents but using more tokens; multi-agent overhead driven by iteration depth.
  • Practical “why now”: teams are choosing between “agent frameworks” and “coding agents”; this gives concrete trade-off evidence.
  • Skepticism: small task set (19) and benchmark access restrictions; Claude Code had self-reported success mismatches vs independent tests.

5) Practical next steps

  • Adopt step-level uncertainty + budget routing in your agent stack: implement a lightweight uncertainty proxy (e.g., sample-and-cluster entropy) and route steps into direct vs branch vs refine modes; measure cost/accuracy vs fixed self-consistency (inspired by DenoiseFlow).
  • Upgrade “verifiers” from output checks to state checks: where possible, define deterministic state diffs (LOGIGEN DIFF-style) or compilation/execution checks (SorryDB/patching) and use them as training and runtime control signals.
  • Instrument process metrics, not just final success: add rollout deviation / step-hit style metrics (MC-SEARCH) and trace-structured logging (TraceSIR-style) so you can attribute failures to planning vs retrieval vs execution.
  • Red-team multimodal systems with reasoning-time attacks: test multi-image, late-fusion reconstruction patterns (MIDAS-like) and evaluate defenses that monitor intermediate decoding steps rather than only input/output filters.
  • For security agents, separate “coverage” and “determinism” modes: use specialized deterministic pipelines for high-frequency injection classes (AWE-style) and fall back to broader general coding agents for multi-step categories; track token/time per vuln class.
  • If you deploy RAG fusion, make it conditional: measure evidence hit rates after reranking/truncation; only apply fusion to recall-scarce queries to avoid latency overhead (industry RAG Fusion findings).
  • Stress-test multi-agent coordination for liveness: run simple consensus/termination simulations and measure timeout rates under prompt variants (threat-aware vs not), since liveness failures can dominate (Can AI Agents Agree?).
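The conditional-fusion next step above can be sketched as a small gate plus reciprocal-rank fusion. The thresholds, the RRF constant 60, and the `search`/`rewrite` interfaces are assumptions for illustration, not the deployment study's system.

```python
def should_fuse(scores, min_hits=3, score_floor=0.35):
    """Gate: fuse only when the baseline retrieval looks recall-scarce,
    i.e. fewer than min_hits passages clear the score floor."""
    return sum(1 for s in scores if s >= score_floor) < min_hits

def retrieve(query, search, rewrite, k=10):
    """search(q, k) -> [(doc_id, score), ...]; rewrite(q) -> query variants."""
    base = search(query, k)
    if not should_fuse([s for _, s in base]):
        return base                       # confident recall: skip fusion latency
    fused = {}                            # reciprocal-rank fusion over variants
    for q in [query] + rewrite(query):
        for rank, (doc, _) in enumerate(search(q, k)):
            fused[doc] = fused.get(doc, 0.0) + 1.0 / (60 + rank)
    return sorted(fused.items(), key=lambda kv: -kv[1])[:k]
```

The design choice mirrors the industry finding: fusion only pays when baseline recall is weak, so the gate keeps the common case on the fast path.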
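For the liveness stress test, a toy round-based simulation is enough to reproduce the timeout failure mode (illustrative only; the paper's Byzantine-consensus game is richer than majority-following agents).

```python
from collections import Counter

def make_honest(initial):
    """Honest agent: start from `initial`, then follow the majority."""
    def propose(values):
        if not values:
            return initial
        return Counter(values).most_common(1)[0][0]
    return propose

def make_stubborn(value):
    """Byzantine-style agent that never moves off its own value."""
    return lambda values: value

def consensus_round(agents, max_rounds=10):
    """Run proposal rounds until every agent reports the same value.

    Returns ('agreed', round) or ('timeout', max_rounds); timeouts are the
    liveness failures reported to dominate for LLM agents.
    """
    values = [a([]) for a in agents]
    for r in range(1, max_rounds + 1):
        if len(set(values)) == 1:
            return "agreed", r
        values = [a(values) for a in agents]
    return "timeout", max_rounds
```

Swapping the agent policies for LLM calls (with and without threat-aware prompts) and tracking the timeout rate gives a cheap version of the experiment.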

Generated from per-paper analyses; no external browsing.