Daily AI Paper Report (2026-03-08)

Published: March 08, 2026

Run stats

Candidates: 1155
Selected: 30
Deepread completed: 30
Window (UTC): 2026-03-06T01:00:00Z → 2026-03-07T01:00:00Z (weekend_backlog_unknown, expanded=0)

Selected papers

arXiv ID	Title / Links	Categories	Score	Why	Tags
`2603.01291`	JailNewsBench: Multi-Lingual and Regional Benchmark for Fake News Generation under Jailbreak Attacks PDF	cs.LG, cs.CL	92	First multilingual/regional jailbreak fake-news benchmark; direct misuse eval across 22 languages/34 regions	benchmark, jailbreak, misinformation, multilingual, robustness, safety-eval
`2603.00873`	MC-Search: Evaluating and Enhancing Multimodal Agentic Search with Structured Long Reasoning Chains PDF	cs.AI	92	Benchmark for agentic multimodal RAG with long reasoning chains + evidence attribution/verification.	MM-RAG, agents, benchmark, long-horizon, evidence-grounding, evaluation
`2603.01966`	AMemGym: Interactive Memory Benchmarking for Assistants in Long-Horizon Conversations PDF	cs.CL, cs.AI	92	Interactive on-policy benchmark for assistant memory/personalization with structured users & metrics	LLM, memory, benchmark, long-horizon, personalization, evaluation, simulated-users
`2603.00718`	SkillCraft: Can LLM Agents Learn to Use Tools Skillfully? PDF	cs.CL, cs.SE	92	Benchmark for agents learning reusable tool-use skills; targets long-horizon compositionality.	agents, tool-use, benchmark, skills, long-horizon, evaluation
`2603.01257`	A Systematic Study of LLM-Based Architectures for Automated Patching PDF	cs.CR, cs.SE	92	Controlled comparison of LLM patching architectures; clear trade-offs, failure modes, cost/time metrics	LLM-agents, cybersecurity, automated-patching, evaluation, software-engineering, robustness
`2603.01990`	According to Me: Long-Term Personalized Referential Memory QA PDF	cs.AI, cs.CL, cs.CV	92	Benchmark for multimodal long-term personal memory QA with evidence + conflicts.	long-term-memory, personalization, multimodal, benchmark, grounding, assistants
`2603.01154`	vEcho: A Paradigm Shift from Vulnerability Verification to Proactive Discovery with Large Language Models PDF	cs.CR	90	LLM turns from SAST filter into proactive vuln discovery with tools+memory and vulnerability propagation	LLM-security, SAST, vulnerability-discovery, agentic-tools, memory, software-security
`2603.00582`	Super Research: Answering Highly Complex Questions with Large Language Models through Super Deep and Super Wide Research PDF	cs.CL	90	Benchmark for autonomous deep+wide research with 100+ retrieval steps; targets real agent limits.	agents, deep-research, web-retrieval, long-horizon, benchmark, evaluation
`2603.01952`	LiveCultureBench: a Multi-Agent, Multi-Cultural Benchmark for Large Language Models in Dynamic Social Simulations PDF	cs.AI	90	Multi-agent social sim benchmark measuring cultural norm adherence + verifier uncertainty	agents, benchmark, culture, social-simulation, evaluation, norms, uncertainty
`2603.00601`	Theory of Code Space: Do Code Agents Understand Software Architecture? PDF	cs.SE, cs.AI	90	ToCS benchmark probes code-agent architectural belief/state under partial observability.	code-agents, software-engineering, benchmark, architecture, belief-state, evaluation
`2603.01213`	Can AI Agents Agree? PDF	cs.MA, cs.LG	90	Byzantine-consensus game shows LLM agents fail to reliably agree; scales poorly with group/Byzantines	multi-agent, robustness, adversarial, consensus, evaluation, agent-safety
`2603.04334`	SpotIt+: Verification-based Text-to-SQL Evaluation with Database Constraints PDF	cs.DB, cs.AI, cs.LO, cs.PL	90	Verification-based Text-to-SQL eval finds real mismatches via constraint-mined counterexample DBs	evaluation, text-to-sql, verification, robustness, llm-evals, constraints, tooling
`2603.02176`	Organizing, Orchestrating, and Benchmarking Agent Skills at Ecosystem Scale PDF	cs.CL	90	Framework + benchmark for skill selection/orchestration at ecosystem scale; agent eval value	agents, tool-use, orchestration, benchmarks, skill-discovery, workflows
`2603.00686`	RAVEL: Reasoning Agents for Validating and Evaluating LLM Text Synthesis PDF	cs.CL	90	Agentic eval for long-horizon text synthesis + C3EBench; targets outlining/drafting/editing ops	LLM-evaluation, agents, benchmarks, writing, rubrics, multi-step
`2603.02019`	Selection as Power: Constrained Reinforcement for Bounded Decision Authority PDF	cs.MA, cs.AI, cs.CE, cs.LG	90	Governance framing for agentic risk: constrained reinforcement to bound decision authority over time	agent-governance, constrained-optimization, risk, multi-agent, decision-authority
`2603.00546`	Advancing Multimodal Judge Models through a Capability-Oriented Benchmark and MCTS-Driven Data Generation PDF	cs.AI, cs.CV	90	Capability-oriented benchmark for MLLM judges + MCTS data gen; directly targets judge reliability	evaluation, LLM-as-judge, multimodal, benchmark, reliability, data-generation, MCTS
`2603.01053`	Turning Black Box into White Box: Dataset Distillation Leaks PDF	cs.CR, cs.AI, cs.LG	89	Shows dataset distillation can leak via new attack; infers algorithm/arch + membership + sample recovery	privacy, data-leakage, dataset-distillation, membership-inference, security, synthetic-data
`2603.00960`	AWE: Adaptive Agents for Dynamic Web Penetration Testing PDF	cs.CR, cs.AI	88	Memory-augmented multi-agent web pentesting with structured pipelines; aims for reproducible, lower-cost agents	agents, cybersecurity, penetration-testing, tool-use, multi-agent, reproducibility
`2603.00540`	LOGIGEN: Logic-Driven Generation of Verifiable Agentic Tasks PDF	cs.AI	88	Generates verifiable agentic tasks with hard policy grounding and deterministic state verification.	agent-training, synthetic-data, verification, policies, tool-use, stateful-tasks
`2603.04177`	CodeTaste: Can LLMs Generate Human-Level Code Refactorings? PDF	cs.SE, cs.AI, cs.LG	88	Refactoring benchmark mined from real repos; tests + static checks for behavior-preserving changes	code, LLM-agents, benchmark, refactoring, software-engineering, evaluation
`2603.00623`	TraceSIR: A Multi-Agent Framework for Structured Analysis and Reporting of Agentic Execution Traces PDF	cs.AI, cs.CL	88	Multi-agent trace analysis for debugging agent workflows; structured summaries vs raw logs.	agents, observability, debugging, tracing, monitoring, multi-agent
`2603.01152`	DeepResearch-9K: A Challenging Benchmark Dataset of Deep-Research Agent PDF	cs.AI	88	9K deep-research agent benchmark w/ trajectories; useful for training/eval of web agents	agents, evaluation, benchmarks, web-search, multi-hop, trajectories, data-synthesis
`2603.00646`	MoltGraph: A Longitudinal Temporal Graph Dataset of Moltbook for Coordinated-Agent Detection PDF	cs.SI, cs.CR	88	Longitudinal graph dataset for coordinated-agent abuse on agent-native social platforms	agent-safety, coordination, misuse, graph-dataset, social-platforms, monitoring
`2603.00977`	HiMAC: Hierarchical Macro-Micro Learning for Long-Horizon LLM Agents PDF	cs.AI, cs.LG	88	Hierarchical RL for long-horizon LLM agents (macro plan + micro execute) to reduce error propagation	agents, long-horizon, hierarchical, reinforcement-learning, planning, robustness
`2603.02153`	Scaling Retrieval Augmented Generation with RAG Fusion: Lessons from an Industry Deployment PDF	cs.IR, cs.AI, cs.CL	88	Industry RAG-fusion study: recall gains often vanish after rerank/truncation/latency.	RAG, retrieval-fusion, production, evaluation, reranking, latency
`2603.00876`	BioProAgent: Neuro-Symbolic Grounding for Constrained Scientific Planning PDF	cs.AI, cs.MA	87	Neuro-symbolic FSM constraints for wet-lab planning; targets hallucination/unsafe actions	agent-safety, scientific-agents, neuro-symbolic, planning, verification, constraints, tool-use
`2603.00565`	MIDAS: Multi-Image Dispersion and Semantic Reconstruction for Jailbreaking MLLMs PDF	cs.CV, cs.AI, cs.CR	86	Strong multimodal jailbreak method (multi-image semantic reconstruction) targeting aligned closed MLLMs	MLLM, jailbreak, red-teaming, multimodal, attack, safety
`2603.02668`	SorryDB: Can AI Provers Complete Real-World Lean Theorems? PDF	cs.AI, cs.LG	86	Dynamic Lean benchmark reduces contamination; measures real-world theorem-proving agent progress.	formal-verification, theorem-proving, agents, benchmark, contamination, Lean
`2603.00532`	DenoiseFlow: Uncertainty-Aware Denoising for Reliable LLM Agentic Workflows PDF	cs.AI	86	Uncertainty-aware control loop for multi-step agent workflows; targets error accumulation.	agents, reliability, uncertainty, planning, long-horizon, robustness
`2603.01067`	Hide&Seek: Remove Image Watermarks with Negligible Cost via Pixel-wise Reconstruction PDF	cs.CR, cs.AI	86	Practical watermark removal attacks with high fidelity; relevant to provenance/anti-misuse defenses	watermarking, attack, image-security, provenance, robustness, misuse

AI Paper Insight Brief

2026-03-08

0) Executive takeaways (read this first)

Agent reliability is shifting from “more sampling” to “risk-aware control loops”: DenoiseFlow shows you can sense step uncertainty, allocate branching only where needed, and rollback+repair via root-cause localization—improving accuracy while cutting cost vs fixed branching.
Verifiable environments + deterministic state metrics are becoming the training substrate for agents: LOGIGEN (DB-trigger policy enforcement + DIFF state distance) and MC-Search (hop-verified multimodal chains + HPS/RD) both turn agent learning into something closer to supervised control with hard checks.
Evaluation is moving from single-shot scores to process/trajectory diagnostics: Super Research (graph-anchored auditing), RAVEL (outline/draft/review/refine trajectories), TraceSIR (trace compression → root-cause reports), and ToCS (time-series “architectural belief” probes) all measure how systems fail, not just whether they fail.
Multimodal safety is currently brittle against “reasoning-time” attacks: MIDAS achieves high jailbreak success by dispersing harmful semantics across multiple images and forcing late reconstruction, remaining strong even under some defenses—suggesting input filters alone are insufficient.
Security automation is bifurcating into (a) specialized deterministic pipelines for efficiency and (b) general coding agents for coverage: AWE is extremely token-efficient and strong on injection classes, while automated patching results show general coding agents (Claude Code) lead overall coverage but at higher token cost.
Long-horizon coordination remains a weak point for multi-agent LLM systems: even in a simplified Byzantine-consensus game, valid consensus is unreliable and failures are mostly liveness (timeouts), worsened by “threat-aware” prompting.

2) Key themes (clusters)

Theme: Closed-loop reliability for long-horizon agents

Why it matters: Long workflows fail via silent semantic drift; reliability needs online sensing + targeted compute rather than uniform regeneration.
Representative papers:
Common approach:
- Estimate/structure uncertainty or search complexity (semantic entropy + dependency propagation; macro–micro decomposition).
- Allocate effort adaptively (branching factor or blueprint exploration) under budgets.
- Use structured termination/verification signals (verifiers, success rewards, termination votes).
Open questions / failure modes:
- Cold-start calibration and verifier dependence (DenoiseFlow needs verifier feedback; early instability).
- Non-stationarity and error propagation across hierarchy/agents (HiMAC simultaneous updates hurt; consensus liveness collapses).
- Robustness to adversarial or noisy settings beyond studied benchmarks (Byzantine strategies limited; open-ended tasks untested).

Theme: Verifiable stateful environments as agent training data

Why it matters: Agents in policy-rich domains need deterministic feedback tied to state transitions, not just “tool-call syntax” or happy-path traces.
Representative papers:
Common approach:
- Build environments where constraints are executed/enforced (DB triggers; hardware registries + rule engines).
- Define deterministic verification metrics (DIFF over canonicalized DB rows; hop-wise evidence checks; physical compliance checks).
- Train with verified SFT plus RL variants that incorporate step/turn structure (TA-GRPO; process SFT via SEARCH-ALIGN).
Open questions / failure modes:
- Simulator overfitting / “simulator hacking” and cross-simulator generalization (explicitly observed in LOGIGEN).
- Domain restriction (relational DB focus; Wikipedia-derived KB; wet-lab evaluated in simulation with manual registries).
- Reliance on LLM-generated chains/judgments in the data pipeline (MC-Search generation/verification uses Gemini models).

Theme: Process-first evaluation (graphs, traces, belief states)

Why it matters: As agents become multi-step, outcome-only metrics hide whether failures come from planning, retrieval, memory, or execution drift—blocking targeted fixes.
Representative papers:
Common approach:
- Represent intermediate structure explicitly (research graphs; Thought–Action–Observation traces; architectural dependency graphs; synthesis action primitives).
- Score multiple dimensions beyond correctness (coverage/consistency/citation health; report RCA quality; dependency F1 + calibration; trajectory efficiency/refinement density).
- Use tooling to support audits (graph visualizers; standardized trace formats; structured JSON probes).
Open questions / failure modes:
- Evaluation still depends on LLM judges in key places (Super Research coverage uses LLM marking; RAVEL uses a GPT-5.2-1120 judge).
- Externalization gap: agents may “know” but fail to serialize beliefs (ToCS belief externalization bottleneck; invariant F1=0.0).
- Scaling costs (Super Research is compute- and human-intensive; TraceSIR token/latency overhead).

Theme: Multimodal safety & provenance under active attack

Why it matters: Multimodal systems are being attacked via structured reasoning paths (not just prompt injection), and provenance defenses (watermarks) can be removed efficiently.
Representative papers:
Common approach:
- Attack by delaying/obfuscating harmful semantics until late in the reasoning chain (multi-image dispersion + reconstruction; pixel vulnerability masking + reconstruction).
- Evaluate across multiple models/defenses with attack success and quality metrics (ASR, harmfulness ratings; PSNR/SSIM; multilingual ASR + sub-metric harm scores).
- Emphasize black-box practicality (single-shot MIDAS; query-free Hide&Seek).
Open questions / failure modes:
- Defense needs to be process-aware (MIDAS suggests intermediate-state monitoring; current defenses insufficient).
- Generalization limits and attacker assumptions (Hide&Seek generalizes poorly cross-domain unless trained; JailNewsBench constrained by legal/ethical region coverage).
- Evaluator dependence and safety trade-offs (JailNewsBench uses ensemble LLM judges; examples withheld for safety).

Theme: Security agents: specialization vs generality trade-offs

Why it matters: Real security workflows demand both coverage and deterministic evidence under budgets; architectures strongly shape outcomes.
Representative papers:
Common approach:
- Combine LLM orchestration with tool-backed verification (browser verification; deep verification with dev tools; PoV/tests for patching).
- Add memory/pattern propagation to move from one-off verification to proactive discovery (vEcho EVP + KBs).
- Compare architectures under realistic benchmarks and cost metrics (XBOW tokens/cost; AIxCC patch counts + token usage).
Open questions / failure modes:
- Coverage gaps for multi-step/chained exploits (AWE lower overall solve rate than MAPTA; misses reasoning-heavy categories).
- Validation/termination brittleness (Claude Code self-reported success mismatched independent tests in patching study).
- Scalability/cost of deep verification loops on large codebases (vEcho overhead).

3) Technical synthesis

“Verification as a control signal” shows up everywhere: DenoiseFlow calibrates uncertainty from verifier pass rates; LOGIGEN uses DIFF=0 for Verified SFT and dense state rewards for RL; BioProAgent gates execution on Ks/Kp; SpotIt+ uses SMT counterexamples; SorryDB compiles projects to verify “sorry” removal.
Process metrics are converging on step-level attribution: MC-Search’s HPS/RD, Super Research’s graph-projected coverage/consistency, RAVEL’s refinement density/delta, and ToCS’s action-efficiency AUCs all aim to localize where the trajectory went wrong.
Hierarchical decomposition is a recurring antidote to long-horizon drift: HiMAC splits blueprint vs execution; Super Research splits planner/researcher/summarizer/writer; SkillCraft and AgentSkillOS externalize reusable skills and orchestrate them via DAGs.
A common failure mode is “liveness/termination” rather than blatant invalidity: Byzantine-consensus failures are mostly timeouts; DenoiseFlow targets silent drift without runtime exceptions; long-horizon research systems score low overall despite being “reasonable” locally.
Data generation is increasingly “capability-targeted”: LOGIGEN designs boundary-adjacent initial states; M-JudgeBench injects controlled process errors and uses MCTS to generate SC/SE/LC/LE contrasts; MC-Search filters redundant hops via HAVE.
RAG improvements are being judged under production constraints: the RAG Fusion deployment study finds fusion recall gains can be neutralized after reranking/truncation, with added latency—suggesting selective/conditional fusion policies are needed.
Cross-model transfer depends on artifact quality: SkillCraft shows cross-model skill reuse works when the skill creator is strong; poor skills can increase cost—mirrors broader “tooling artifact” quality issues in agent ecosystems.
Safety attacks increasingly exploit “reasoning-time” structure: MIDAS extends reasoning chains via multi-image puzzles and persona-driven reconstruction; watermark removal uses pixel vulnerability ranking + reconstruction ordering to degrade detectors.
Benchmarks are pushing toward “real-world freshness” to reduce leakage: SorryDB indexes current unsolved Lean sorries; Super Research uses expert-curated graphs; CodeTaste mines real refactoring commits with executable environments.
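
The RAG Fusion point in the synthesis above suggests selective/conditional fusion. A minimal sketch of that idea follows, assuming reciprocal rank fusion (RRF) as the fusion rule, which the deployment study may or may not use. The `search` and `rewrite` callables, the `min_hits` gate, and the constant `k=60` are hypothetical choices for illustration, not values taken from the paper.

```python
from collections import defaultdict
from typing import Callable

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: each doc scores sum(1 / (k + rank)) over the lists it appears in."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

def retrieve(
    query: str,
    search: Callable[[str], list[str]],   # hypothetical: query -> ranked doc ids
    rewrite: Callable[[str], list[str]],  # hypothetical: query -> query variants
    min_hits: int = 3,                    # hypothetical recall-scarcity gate
    top_n: int = 5,
) -> list[str]:
    """Run fusion only when the baseline retrieval looks recall-scarce."""
    base = search(query)
    if len(base) >= min_hits:
        return base[:top_n]  # enough evidence: skip fusion and its extra latency
    rankings = [base] + [search(variant) for variant in rewrite(query)]
    return reciprocal_rank_fusion(rankings)[:top_n]
```

The gate here is deliberately crude (a hit count). In practice the signal would more plausibly be the post-rerank evidence hit rate that the study recommends measuring.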

4) Top 5 papers (with “why now”)

1) DenoiseFlow: Uncertainty-Aware Denoising for Reliable LLM Agentic Workflows

Introduces a closed-loop Sensing–Regulating–Correcting controller for multi-step LLM workflows with online uncertainty calibration.
Shows accuracy gains with large cost reductions vs fixed branching (reported ~40–56% cost reduction) across math/code/QA benchmarks.
Practical “why now”: agent deployments are hitting budget ceilings; adaptive branching + rollback is a concrete systems lever.
Skepticism: depends on having a reliable verifier; Monte Carlo sampling adds overhead and calibration has a cold-start period.
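
The sensing and routing idea above can be illustrated with a small uncertainty proxy. This is a sketch under stated assumptions, not DenoiseFlow's implementation: it clusters sampled answers by normalized string equality (a stand-in for semantic clustering), computes Shannon entropy over the clusters, and routes on two hypothetical thresholds (`low`, `high`) that would need calibration on real data.

```python
import math
from collections import Counter

def normalize(answer: str) -> str:
    """Crude stand-in for semantic clustering: ignore case and extra whitespace."""
    return " ".join(answer.lower().split())

def cluster_entropy(samples: list[str]) -> float:
    """Shannon entropy (in nats) over clusters of equivalent sampled answers."""
    counts = Counter(normalize(s) for s in samples)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def route_step(samples: list[str], low: float = 0.3, high: float = 0.9) -> str:
    """Map step uncertainty to a mode; both thresholds are hypothetical."""
    h = cluster_entropy(samples)
    if h <= low:
        return "direct"   # samples agree: accept without extra compute
    if h <= high:
        return "branch"   # moderate disagreement: spend a few extra branches
    return "refine"       # high disagreement: escalate to repair/refinement

print(route_step(["42", " 42", "42 "]))   # all samples fall in one cluster
print(route_step(["yes", "yes", "no"]))   # two clusters, 2:1 split
print(route_step(["a", "b", "c"]))        # three distinct clusters
```

Comparing this router against fixed self-consistency at equal budget is the measurement that matters; the entropy proxy itself is the cheapest part to swap out.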

2) LOGIGEN: Logic-Driven Generation of Verifiable Agentic Tasks

Compiles natural-language policies into DB-backed environments with hard enforcement (schema + triggers), enabling deterministic verification via DIFF.
Produces >20k tasks across 8 domains and large τ2-Bench gains (e.g., 32B: 40.7 → 62.7 after SFT → 79.5 after RL).
Practical “why now”: agent training is bottlenecked by verifiable, stateful data; LOGIGEN offers a scalable synthesis recipe.
Skepticism: simulator overfitting/user-simulator hacking is explicitly observed; current scope is relational DB environments.
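
To make the deterministic verification idea above concrete, here is a rough sketch of a DIFF-style state distance. It assumes a state is a mapping from table name to a list of row dicts and counts rows that appear in only one of the two states. LOGIGEN's actual DIFF definition is not given in this brief, so the canonicalization and the counting rule are assumptions made for illustration.

```python
def canonicalize(rows: list[dict]) -> set[tuple]:
    """Order-independent, hashable form of rows: sorted (column, value) pairs.

    Assumes values are hashable; duplicate rows collapse into one.
    """
    return {tuple(sorted(row.items())) for row in rows}

def state_diff(actual: dict[str, list[dict]], expected: dict[str, list[dict]]) -> int:
    """Count rows present in exactly one of the two states, across all tables."""
    distance = 0
    for table in actual.keys() | expected.keys():
        a = canonicalize(actual.get(table, []))
        e = canonicalize(expected.get(table, []))
        distance += len(a ^ e)  # symmetric difference: missing plus unexpected rows
    return distance

expected = {"orders": [{"id": 1, "status": "refunded"}]}
same = {"orders": [{"status": "refunded", "id": 1}]}   # column order differs only
wrong = {"orders": [{"id": 1, "status": "open"}]}

print(state_diff(same, expected))    # 0: usable as a hard pass signal
print(state_diff(wrong, expected))   # 2: one missing row plus one unexpected row
```

A distance of zero can serve as a filter for verified trajectories, while the nonzero count gives a graded signal; whether that matches the paper's dense reward shaping is not something this sketch claims.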

3) MIDAS: Multi-Image Dispersion and Semantic Reconstruction for Jailbreaking MLLMs

Demonstrates a strong multimodal jailbreak by dispersing harmful tokens across multiple images and forcing cross-image reconstruction via puzzle templates.
Reports very high ASR on multiple closed-source MLLMs and robustness under some defenses (e.g., ShieldLM/Self-Reminder comparisons).
Practical “why now”: multimodal agents are entering production; this attack targets the reasoning pathway, not just input text.
Skepticism: effectiveness depends on image budget/template difficulty tuning; mitigation directions are suggested but not resolved.

4) MC-Search: Evaluating and Enhancing Multimodal Agentic Search with Structured Long Reasoning Chains

Provides 3,333 multimodal agentic-RAG examples with step-wise chains (avg 3.79 hops) and process metrics (HPS, RD).
SEARCH-ALIGN SFT improves open models substantially (e.g., Qwen2.5-VL-7B: +13.7 F1, +16.0 HPS, −3.1 RD).
Practical “why now”: multimodal RAG failures are often planning/retrieval, not generation; step-wise supervision targets that directly.
Skepticism: dataset generation/verification relies on Gemini models; main pipeline uses top-1 retrieval which may constrain conclusions.

5) A Systematic Study of LLM-Based Architectures for Automated Patching

Controlled comparison of fixed workflow vs single-agent vs multi-agent vs general coding agent on 19 AIxCC Java delta-scan tasks.
Finds general coding agent (Claude Code) repaired 16/19, outperforming patch-specific agents but using more tokens; multi-agent overhead driven by iteration depth.
Practical “why now”: teams are choosing between “agent frameworks” and “coding agents”; this gives concrete trade-off evidence.
Skepticism: small task set (19) and benchmark access restrictions; Claude Code had self-reported success mismatches vs independent tests.

5) Practical next steps

Adopt step-level uncertainty + budget routing in your agent stack: implement a lightweight uncertainty proxy (e.g., sample-and-cluster entropy) and route steps into direct vs branch vs refine modes; measure cost/accuracy vs fixed self-consistency (inspired by DenoiseFlow).
Upgrade “verifiers” from output checks to state checks: where possible, define deterministic state diffs (LOGIGEN DIFF-style) or compilation/execution checks (SorryDB/patching) and use them as training and runtime control signals.
Instrument process metrics, not just final success: add rollout deviation / step-hit style metrics (MC-Search) and trace-structured logging (TraceFormat-like) so you can attribute failures to planning vs retrieval vs execution.
Red-team multimodal systems with reasoning-time attacks: test multi-image, late-fusion reconstruction patterns (MIDAS-like) and evaluate defenses that monitor intermediate decoding steps rather than only input/output filters.
For security agents, separate “coverage” and “determinism” modes: use specialized deterministic pipelines for high-frequency injection classes (AWE-style) and fall back to broader general coding agents for multi-step categories; track token/time per vuln class.
If you deploy RAG fusion, make it conditional: measure evidence hit rates after reranking/truncation; only apply fusion to recall-scarce queries to avoid latency overhead (industry RAG Fusion findings).
Stress-test multi-agent coordination for liveness: run simple consensus/termination simulations and measure timeout rates under prompt variants (threat-aware vs not), since liveness failures can dominate (Can AI Agents Agree?).
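
The liveness stress test in the last step above could start from a harness like the one below. It is a toy sketch, not the game from Can AI Agents Agree?: there are no Byzantine agents, consensus simply means every agent emits the same value in a round, and the two step functions are deterministic stubs standing in for LLM-backed agents under different prompt variants. All names here are hypothetical.

```python
from collections import Counter
from typing import Callable, Optional

# An agent step maps (agent_id, round_no, previous_round_values) -> value.
AgentStep = Callable[[int, int, list[Optional[int]]], int]

def run_game(step: AgentStep, n_agents: int, max_rounds: int) -> str:
    """Play rounds until all agents emit the same value, or give up."""
    values: list[Optional[int]] = [None] * n_agents
    for round_no in range(max_rounds):
        # Every agent sees only the previous round's values.
        values = [step(i, round_no, values) for i in range(n_agents)]
        if len(set(values)) == 1:
            return "consensus"
    return "timeout"

def timeout_rates(variants: dict[str, AgentStep], n_agents: int = 4,
                  games: int = 20, max_rounds: int = 10) -> dict[str, float]:
    """Fraction of games that end in a timeout, per prompt variant."""
    rates = {}
    for name, step in variants.items():
        outcomes = Counter(run_game(step, n_agents, max_rounds) for _ in range(games))
        rates[name] = outcomes["timeout"] / games
    return rates

def cooperative(agent_id: int, round_no: int, prev: list[Optional[int]]) -> int:
    """Stub: propose own id first, then adopt the smallest value seen."""
    return agent_id if round_no == 0 else min(v for v in prev if v is not None)

def stubborn(agent_id: int, round_no: int, prev: list[Optional[int]]) -> int:
    """Stub: never changes its proposal, so the group never converges."""
    return agent_id

rates = timeout_rates({"cooperative": cooperative, "stubborn": stubborn})
print(rates)
```

With real agents the step function is stochastic, so the number of games per variant needs to be large enough for the timeout rates to be comparable across prompts.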

Generated from per-paper analyses; no external browsing.

Di Tang
