Daily AI Paper Report (2026-03-29)
Published:
Chinese version: [中文]
Run stats
- Candidates: 1744
- Selected: 30
- Deepread completed: 30
- Window (UTC): 2026-03-27T00:00:00Z → 2026-03-28T00:00:00Z (weekend_backlog_unknown, expanded=0)
Selected papers
| arXiv ID | Title / Links | Categories | Score | Why | Tags |
|---|---|---|---|---|---|
| 2603.23951 | From AI Assistant to AI Scientist: Autonomous Discovery of LLM-RL Algorithms with LLM Agents | cs.CL | 95 | Closed-loop LLM agents discover improved LLM-RL algorithms; strong automation + eval/iteration framework. | LLM-agents, RLHF, policy-optimization, auto-research, evaluation, algorithm-discovery |
| 2603.23007 | AgentRAE: Remote Action Execution through Notification-based Visual Backdoors against Screenshots-based Mobile GUI Agents | cs.CR, cs.AI | 94 | Concrete backdoor for mobile GUI agents via notifications; high-impact agent security threat model. | agent-security, mobile-agents, backdoors, visual-triggers, remote-action-execution, red-teaming |
| 2603.22869 | Chain-of-Authorization: Internalizing Authorization into Large Language Models via Reasoning Trajectories | cs.AI | 92 | Internalizes fine-grained authorization in LLM reasoning; targets data leakage and access-boundary failures. | authorization, access-control, LLM-safety, data-leakage, reasoning-trajectories, security |
| 2603.24477 | Composer 2 Technical Report | cs.SE, cs.LG | 92 | Agentic SWE model + RL in real tool harness; likely strong frontier agent capability signal | agentic-coding, software-engineering, reinforcement-learning, tool-use, long-horizon, frontier-llm |
| 2603.24579 | MARCH: Multi-Agent Reinforced Self-Check for LLM Hallucination | cs.CL | 90 | Multi-agent asymmetry to reduce LLM-judge confirmation bias for RAG hallucination checking | hallucination, RAG, LLM-judge, multi-agent, verification, reliability |
| 2603.21636 | Silicon Bureaucracy and AI Test-Oriented Education: Contamination Sensitivity and Score Confidence in LLM Benchmarks | cs.AI, cs.CL | 90 | Audit framework for benchmark contamination sensitivity & score confidence; key for LLM eval integrity | LLM-evaluation, benchmarking, data-contamination, leakage, audit, measurement |
| 2603.24221 | Environment-Grounded Multi-Agent Workflow for Autonomous Penetration Testing | cs.RO, cs.AI | 90 | Environment-grounded multi-agent LLM pentesting for robots; concrete security workflow + memory graph. | agent-security, penetration-testing, cybersecurity, robotics, multi-agent, tool-use |
| 2603.23231 | PERMA: Benchmarking Personalized Memory Agents via Event-Driven Preference and Realistic Task Environments | cs.AI | 88 | Benchmark for personalized memory agents with evolving preferences; more realistic than pure retrieval tests. | agents, memory, personalization, evaluation, benchmarks, long-term-consistency |
| 2603.24058 | Mitigating Object Hallucinations in LVLMs via Attention Imbalance Rectification | cs.CV, cs.AI | 88 | Targets LVLM object hallucination via attention-imbalance rectification; reliability for high-stakes vision. | hallucinations, vision-language, reliability, attention, calibration, safety |
| 2603.21630 | EnterpriseLab: A Full-Stack Platform for developing and deploying agents in Enterprises | cs.AI | 86 | Full-stack closed-loop platform for enterprise agents: tools+data synthesis+training+eval in one. | agents, enterprise, tool-use, MCP, data-synthesis, evaluation, deployment |
| 2603.23129 | Polaris: A Gödel Agent Framework for Small Language Models through Experience-Abstracted Policy Repair | cs.LG | 86 | Gödel-style self-improving agent for small models via auditable policy patches; relevant to safe autonomy. | agents, self-improvement, policy-repair, small-models, auditing, reliability |
| 2603.22862 | The Evolution of Tool Use in LLM Agents: From Single-Tool Call to Multi-Tool Orchestration | cs.SE, cs.CL | 86 | Comprehensive review of multi-tool LLM agent orchestration incl. safety/cost/verifiability constraints | llm-agents, tool-use, orchestration, survey, safety, verification |
| 2603.08369 | M$^3$-ACE: Rectifying Visual Perception in Multimodal Math Reasoning via Multi-Agentic Context Engineering | cs.AI | 86 | Multi-agent context engineering to correct perception errors in multimodal math reasoning | multimodal, VLM, math-reasoning, multi-agent, perception, robustness |
| 2603.23448 | Code Review Agent Benchmark | cs.SE, cs.AI | 86 | New benchmark/dataset for code review agents; timely for agentic SE quality assurance. | agents, benchmark, code-review, software-engineering, evaluation, datasets |
| 2603.24481 | Multi-Agent Reasoning with Consistency Verification Improves Uncertainty Calibration in Medical MCQA | cs.AI, cs.CL, cs.LG | 86 | Multi-agent verification + weighted fusion improves uncertainty calibration for medical MCQA | uncertainty, calibration, verification, multi-agent, medical, reliability |
| 2603.19195 | How Auditory Knowledge in LLM Backbones Shapes Audio Language Models: A Holistic Evaluation | eess.AS, cs.CL, cs.SD | 86 | Holistic eval of LLM backbones' auditory knowledge + new benchmark (AKB-2000) for audio LMs. | audio-language-models, LLM-backbones, evaluation, benchmark, probing, multimodal |
| 2603.21475 | Unified-MAS: Universally Generating Domain-Specific Nodes for Empowering Automatic Multi-Agent Systems | cs.AI | 86 | Decouples agent node creation from orchestration; targets knowledge-intensive MAS generation bottleneck. | multi-agent, agent-architecture, orchestration, domain-experts, automation |
| 2603.24034 | From Oracle to Noisy Context: Mitigating Contextual Exposure Bias in Speech-LLMs | cs.CL, cs.AI | 86 | Mitigates contextual exposure bias in Speech-LLMs using noisy history + dropout + DPO on failures. | speech-LLM, robustness, DPO, distribution-shift, evaluation, alignment |
| 2603.23472 | Byzantine-Robust and Differentially Private Federated Optimization under Weaker Assumptions | cs.LG, cs.CR, math.OC | 84 | Unified DP + Byzantine-robust federated optimization with weaker assumptions and guarantees. | federated-learning, differential-privacy, byzantine-robustness, secure-ml, optimization |
| 2603.22651 | Benchmarking Multi-Agent LLM Architectures for Financial Document Processing: A Comparative Study of Orchestration Patterns, Cost-Accuracy Tradeoffs and Production Scaling Strategies | cs.AI, cs.CL, cs.LG | 84 | Large-scale benchmark of multi-agent orchestration patterns with cost/latency/accuracy tradeoffs. | multi-agent, orchestration, benchmark, evaluation, LLMs, cost-latency, document-IE |
| 2603.15080 | Open Biomedical Knowledge Graphs at Scale: Construction, Federation, and AI Agent Access with Samyama Graph Database | cs.DB, cs.AI, q-bio.QM | 84 | Open biomedical KGs + federation + explicit AI-agent access layer; reusable infra at scale | knowledge-graphs, agents, tool-use, data-infrastructure, biomedicine, RAG |
| 2603.22999 | PaperVoyager : Building Interactive Web with Visual Language Models | cs.CL | 84 | Benchmark + agent that turns papers into executable interactive web systems; strong tool-use/document agent angle. | agents, tool-use, document-understanding, benchmark, evaluation, web-synthesis |
| 2603.23983 | SafeFlow: Real-Time Text-Driven Humanoid Whole-Body Control via Physics-Guided Rectified Flow and Selective Safety Gating | cs.RO, cs.AI, eess.SY | 84 | Text-driven humanoid control with explicit safety gating and physics guidance; addresses OOD unsafe motions. | robot-safety, agents, humanoids, safety-gating, OOD-robustness, control |
| 2603.24558 | LensWalk: Agentic Video Understanding by Planning How You See in Videos | cs.CV, cs.AI | 83 | Agentic video understanding with reason-plan-observe control of perception; likely reusable framework. | agentic, video-understanding, planning, active-perception, VLM-tools, efficiency |
| 2603.17265 | LED: A Benchmark for Evaluating Layout Error Detection in Document Analysis | cs.CV, cs.CL | 82 | LED benchmark targets structural layout errors beyond IoU; reusable eval for doc/LMM systems. | benchmark, evaluation, multimodal, document-ai, hallucination, robustness |
| 2603.22918 | EVA: Efficient Reinforcement Learning for End-to-End Video Agent | cs.CV, cs.AI, cs.CL | 82 | RL-based planning-before-perception for long videos; efficiency gains for multimodal agents. | video-agents, reinforcement-learning, planning, multimodal, efficiency, long-context |
| 2603.21574 | Adaptive Robust Estimator for Multi-Agent Reinforcement Learning | cs.AI | 82 | Robust MARL for collaborative reasoning; tackles noisy/heavy-tailed rewards and structured critique loops | multi-agent, reinforcement-learning, robust-estimation, llm-reasoning, credit-assignment |
| 2603.17811 | Dropout Robustness and Cognitive Profiling of Transformer Models via Stochastic Inference | cs.LG, cs.AI | 82 | Systematic MC-dropout reliability study across 19 transformers; links variability to reasoning/memory | uncertainty, MC-dropout, reliability, transformers, stochastic-inference, evaluation |
| 2603.23406 | Beyond Preset Identities: How Agents Form Stances and Boundaries in Generative Societies | cs.AI, cs.CL, cs.HC | 82 | Measures stance formation/identity negotiation in generative multi-agent societies; new metrics. | multi-agent, social-simulation, evaluation, trust, persuasion, agent-behavior |
| 2603.24167 | Walma: Learning to See Memory Corruption in WebAssembly | cs.CR, cs.LG | 82 | ML-based WebAssembly memory attestation vs adversarial host; concrete security evaluation on CVEs | security, webassembly, memory-corruption, attestation, robustness, systems |
AI Paper Insight Brief
2026-03-29
0) Executive takeaways (read this first)
- “Perception is the bottleneck” is now measurable and fixable without retraining: multi-agent context engineering that cross-checks intermediate evidence (not just final answers) materially improves multimodal math accuracy (M$^3$-ACE).
- Benchmarks are shifting from “did you get the box/answer” to “did you detect the structural failure mode”: LED reframes document layout evaluation around error types (missing/merge/split/etc.), exposing that strong VLMs still struggle on fine-grained structural diagnosis.
- Inference-time stochasticity is not a free uncertainty win: MC Dropout often reduces accuracy (10/19 models) and disproportionately harms “memory” vs “reasoning,” so uncertainty methods must be architecture/task-aware.
- Agent safety is increasingly about system surfaces (tools, GUIs, permissions), not just text: notification-icon visual backdoors can hijack mobile GUI agents at high ASR (AgentRAE), while internalized authorization trajectories can enforce permission boundaries (Chain-of-Authorization).
- Closed-loop, environment-grounded training/evaluation is becoming the practical differentiator: EnterpriseLab (tool environments + executable synthesis + trajectory RL) and finance orchestration benchmarking show that architecture + cost controls dominate production viability.
- Claim-level, bias-resistant verification is emerging as a scalable anti-hallucination training signal: MARCH uses an information-asymmetric Checker (blinded to the Solver output) + strict per-claim reward to lift an 8B model’s RAG factuality by ~20 points on reported averages.
1) Key themes (clusters)
Theme: Multi-agent evidence/consensus as a robustness primitive
- Why it matters: Many failures persist because models commit early to wrong intermediate state (visual evidence, critiques, rewards). Cross-agent disagreement signals and structured reconciliation can selectively spend compute where it matters.
- Representative papers:
- M$^3$-ACE: Rectifying Visual Perception in Multimodal Math Reasoning via Multi-Agentic Context Engineering
- MARCH: Multi-Agent Reinforced Self-Check for LLM Hallucination
- Multi-Agent Reasoning with Consistency Verification Improves Uncertainty Calibration in Medical MCQA
- Common approach:
- Separate intermediate artifacts (VE lists; critique deltas; verification Qs) from final answers and make them first-class objects.
- Use disagreement/consensus (conflict ratios, majority votes, inconsistency scores) to gate extra iterations or weight fusion.
- Add structure/tools around agent interaction (Summary/Refine tools; staged answer–critique–rewrite; verification protocols).
- Open questions / failure modes:
- Heuristic thresholds and gating policies (e.g., conflict ratio > 0.2) may be brittle across domains/models.
- “Consistency ≠ correctness”: verification can reward coherent but wrong reasoning (noted in medical MCQA).
- Compute/latency overhead and scaling behavior under real-time constraints are often underreported.
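The disagreement-gated fusion pattern above can be sketched in a few lines. This is a minimal illustration, assuming categorical per-agent answers and a caller-supplied refinement step; `conflict_ratio`, `fuse_with_gating`, and `refine_fn` are hypothetical names, and the 0.2 threshold is the heuristic noted above, not a validated default.

```python
from collections import Counter

def conflict_ratio(answers):
    """Fraction of agent answers that disagree with the majority vote."""
    counts = Counter(answers)
    _, majority_count = counts.most_common(1)[0]
    return 1.0 - majority_count / len(answers)

def fuse_with_gating(answers, refine_fn, threshold=0.2, max_rounds=2):
    """Accept the majority answer when agents agree; otherwise spend extra
    compute on refinement rounds until consensus or the budget runs out."""
    for _ in range(max_rounds):
        if conflict_ratio(answers) <= threshold:
            break
        answers = refine_fn(answers)  # e.g. re-query agents with pooled evidence
    return Counter(answers).most_common(1)[0][0]
```

The point of the gate is that only high-conflict samples pay for extra rounds; low-conflict samples exit immediately.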
Theme: Evaluation is becoming diagnostic (error types, contamination sensitivity, executable oracles)
- Why it matters: Aggregate scores hide where systems fail (structural layout errors, contamination sensitivity, unverifiable code review). New benchmarks aim to expose failure modes with more actionable signals.
- Representative papers:
- LED: A Benchmark for Evaluating Layout Error Detection in Document Analysis
- Silicon Bureaucracy and AI Test-Oriented Education: Contamination Sensitivity and Score Confidence in LLM Benchmarks
- Code Review Agent Benchmark
- Common approach:
- Replace single metrics with typed error taxonomies and hierarchical tasks (LED T1–T3).
- Stress-test score confidence via controlled perturbation/aggregation (router–worker noisy rewrite audit).
- Use executable evaluation oracles (convert review comments → tests that must fail-before/pass-after).
- Open questions / failure modes:
- Synthetic construction and imbalance (LED Missing dominates) may skew conclusions.
- Contamination audits currently shown on small samples (n=100) and MCQ-only settings.
- Test-based oracles depend on environment reconstruction and a coding agent, conflating capabilities.
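The fail-before/pass-after criterion for review-derived tests can be sketched as a small validity check. The names `is_valid_oracle` and `double_test` are illustrative, not the benchmark's API; the buggy/fixed implementations are toy stand-ins.

```python
def is_valid_oracle(test_fn, buggy_impl, fixed_impl):
    """A review-derived test is a usable oracle only if it fails on the
    pre-fix implementation and passes on the post-fix one."""
    def passes(impl):
        try:
            test_fn(impl)
            return True
        except AssertionError:
            return False
    return (not passes(buggy_impl)) and passes(fixed_impl)

def double_test(impl):
    """Toy test derived from a review comment like 'should double, not add 2'."""
    assert impl(3) == 6
```

A test that passes on both versions (or fails on both) carries no signal about the review comment and is rejected.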
Theme: Tool/GUI agent security & governance is moving “inside the model”
- Why it matters: As agents act in real systems, the attack surface includes UI pixels, tool schemas, and permission boundaries. Defenses need causal enforcement, not just prompts.
- Representative papers:
- AgentRAE: Remote Action Execution through Notification-based Visual Backdoors against Screenshots-based Mobile GUI Agents
- Chain-of-Authorization: Internalizing Authorization into Large Language Models via Reasoning Trajectories
- The Evolution of Tool Use in LLM Agents: From Single-Tool Call to Multi-Tool Orchestration
- Common approach:
- Formalize system-level threat models (supply-chain poisoning; permission mismatch; transaction semantics).
- Enforce information barriers or causal steps (authorization trajectory before answering; Checker blinded to Solver).
- Evaluate against adaptive attacks/defenses (PAIR jailbreaks; pruning/finetuning defenses for backdoors).
- Open questions / failure modes:
- Backdoor evaluations are largely offline; online interactive agent dynamics remain under-tested.
- CoA requires fine-tuning + permission token engineering; real permission taxonomies are messy and dynamic.
- Tool orchestration safety needs transaction semantics and replayable audits, but unified standards remain open.
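The "authorize before acting" ordering can be illustrated with an external wrapper. Note the difference from the paper: CoA internalizes this ordering into the model via fine-tuned reasoning trajectories, whereas this sketch (hypothetical `authorize_then_act`, toy ACL) only shows the causal structure of resource review → identity → decision → action.

```python
def authorize_then_act(request, acl, act_fn):
    """Emit an explicit authorization trajectory and execute the action
    only on an ALLOW decision. `acl` maps user -> set of permitted
    resources; a real permission taxonomy would be far messier."""
    allowed = request["resource"] in acl.get(request["user"], set())
    trajectory = {
        "resource": request["resource"],
        "identity": request["user"],
        "decision": "ALLOW" if allowed else "DENY",
    }
    result = act_fn(request) if allowed else None
    return trajectory, result
```

The key property is that the action is causally downstream of the decision, so a DENY cannot be bypassed by prompt content.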
Theme: Planning-before-perception and adaptive observation for long-horizon video agents
- Why it matters: Long videos break fixed-context VLM pipelines; agents must decide what to look at and how densely to sample to control cost while preserving evidence.
- Representative papers:
- LensWalk: Agentic Video Understanding by Planning How You See in Videos
- EVA: Efficient Reinforcement Learning for End-to-End Video Agent
- Common approach:
- Iterative plan–observe loops with parameterized observation actions (time window, frames, resize; tool choice).
- Staged training or modular toolkits (SFT→KTO→GRPO; Scan/Focus/Stitch tools + timestamp anchors).
- Explicit efficiency targets (visual token budgets; fewer frames; avoid heavy preprocessing).
- Open questions / failure modes:
- Reward hacking and sampling pathologies remain (EVA mitigates but doesn’t eliminate).
- Planner stagnation (static repetition) and premature conclusions (LensWalk failure modes).
- Dependence on tool interfaces and observer quality; generalization to new tools/modalities is unclear.
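A parameterized observation action can be sketched as budget-constrained frame sampling. This assumes a fixed per-frame token cost; the function name and budgeting rule are illustrative, not either paper's planner.

```python
def plan_observation(start, end, fps, token_budget, tokens_per_frame):
    """Choose sampling timestamps for the window [start, end] (seconds)
    so the sampled frames fit a visual-token budget: dense sampling for
    short windows, sparse strides when the window is long."""
    total_frames = max(1, int((end - start) * fps))
    max_frames = max(1, token_budget // tokens_per_frame)
    stride = max(1, -(-total_frames // max_frames))  # ceil(total / max)
    return [start + i / fps for i in range(0, total_frames, stride)][:max_frames]
```

For example, a 10 s window at 2 fps with a 100-token budget and 25 tokens/frame yields four evenly spaced timestamps instead of all 20 frames.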
Theme: Training-time and decode-time robustness interventions (noise, attention, DP/Byzantine)
- Why it matters: Robustness failures often come from distribution shift (noisy context), attention pathologies (hallucination), or adversarial participants (federated learning). Lightweight interventions can be high leverage.
- Representative papers:
- From Oracle to Noisy Context: Mitigating Contextual Exposure Bias in Speech-LLMs
- Mitigating Object Hallucinations in LVLMs via Attention Imbalance Rectification
- Byzantine-Robust and Differentially Private Federated Optimization under Weaker Assumptions
- Common approach:
- Train on realistic noise (teacher ASR hypotheses) + regularize reliance (context dropout) + align with preferences (DPO).
- Decode-time head-specific attention interventions (AIR) to reduce hallucinations without retraining.
- Robust aggregation + clipping + momentum + error feedback with high-probability guarantees (Byz-Clip21-SGD2M).
- Open questions / failure modes:
- Hyperparameter sensitivity (AIR λ/β tradeoffs; CoA learning-rate sensitivity; DPO scaling γ).
- Teacher-bias in noise modeling (single ASR teacher) and limited scenario coverage (no overlapping speakers).
- Theory-to-practice gaps: constraints/hyperparameter restrictions and unproven variants (example-wise clipping).
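The aggregation side of the federated item can be sketched as norm clipping followed by a coordinate-wise median. This is a simplified stand-in for intuition only, not Byz-Clip21-SGD2M itself, which additionally uses momentum and error feedback.

```python
import math

def clip_to_norm(update, tau):
    """Scale an update so its L2 norm is at most tau."""
    norm = math.sqrt(sum(x * x for x in update))
    scale = min(1.0, tau / norm) if norm > 0 else 1.0
    return [x * scale for x in update]

def robust_aggregate(updates, tau):
    """Clip each client update, then take the coordinate-wise median
    (upper median for even client counts), so a minority of Byzantine
    clients cannot drag the aggregate arbitrarily far."""
    clipped = [clip_to_norm(u, tau) for u in updates]
    dim = len(clipped[0])
    mid = len(clipped) // 2
    return [sorted(u[i] for u in clipped)[mid] for i in range(dim)]
```

Clipping bounds the damage any single client can do; the median discards the extremes that survive clipping.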
2) Technical synthesis
- Intermediate-representation auditing is converging across modalities: VE lists (math vision), claim QA pairs (RAG factuality), verification questions (medical MCQA), and graph memories (pentesting) all serve as auditable state that can be cross-checked.
- Information asymmetry is a recurring anti-bias tool: MARCH blinds the Checker to the Solver output; CoA forces an explicit authorization trajectory before content; both aim to prevent “seeing the answer first” bias.
- Selective compute is the dominant systems pattern: M$^3$-ACE iterates only on ~10% disputed samples; finance orchestration shows hierarchical “knee” + caching/routing; safety gates in robotics execute only when stable/OOD-safe.
- Robust statistics are entering RL-for-LLMs: ARE replaces batch-mean normalization with median-of-block robust estimation; POISE discovers normalization/validity masking mechanisms for GRPO variants.
- Prompt/configuration sensitivity is now benchmarked explicitly: LED measures prompt robustness (CV/NR) across P1/P2/P3; dropout-at-inference shows architecture-dependent volatility; these suggest “one prompt/one setting” reporting is insufficient.
- Decoding-time interventions are gaining credibility: AIR reduces CHAIR hallucination metrics substantially while preserving/improving MM-Vet; this parallels other “training-free” fixes like M$^3$-ACE’s context engineering.
- Environment-grounded evaluation is becoming the gold standard for agents: EnterpriseLab executes trajectories against tool containers; pentesting workflow grounds memory in observed outputs; code review benchmark uses executable tests.
- Security threats are increasingly visual and supply-chain for agents: AgentRAE shows tiny notification icons can be robust triggers; defenses that assume text-only triggers or static prompts are incomplete.
- Calibration/uncertainty remains tricky without labels: MARC improves ECE via consistency verification, but the paper notes failure when consistency rewards wrong knowledge—highlighting the need for grounding beyond self-consistency.
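The median-of-block idea mentioned for ARE (replacing batch-mean normalization with a robust location estimate) can be sketched as follows; block count and naming are illustrative, not ARE's exact estimator.

```python
import statistics

def median_of_blocks(rewards, num_blocks=4):
    """Robust location estimate for heavy-tailed rewards: average within
    blocks, then take the median of block means. A single outlier can
    corrupt at most one block, so it cannot move the median far."""
    size = max(1, len(rewards) // num_blocks)
    block_means = [statistics.fmean(rewards[i:i + size])
                   for i in range(0, len(rewards), size)]
    return statistics.median(block_means)
```

With rewards `[1, 1, 1, 1, 1, 1, 1, 100]`, the plain mean is ~13.4, while the median-of-blocks stays at 1: the outlier inflates one block mean but not the median over blocks.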
3) Top 5 papers (with “why now”)
1) M$^3$-ACE: Rectifying Visual Perception in Multimodal Math Reasoning via Multi-Agentic Context Engineering
- Decouples visual evidence extraction from reasoning and uses multi-agent VE cross-validation with Summary/Refine tools.
- Reports strong gains on MathVision (e.g., Gemini-3 Pro 85.0% → 89.1%) and large jumps for weaker models (e.g., GPT-5 72.0% → 82.2%).
- Selective iteration: refine stage keeps high-consensus subset near 90% accuracy while only ~10% samples loop.
- Skepticism: depends on access to multiple strong multimodal models; heuristic thresholds and compute/latency trade-offs aren’t fully quantified.
2) MARCH: Multi-Agent Reinforced Self-Check for LLM Hallucination
- Introduces Solver–Proposer–Checker with the Checker blinded to reduce confirmation bias; trains via dual-trajectory PPO.
- Large factuality gains reported: RAGTruth/FaithBench average 55.20% → ~75% (+~20).
- Uses strict Zero-Tolerance Reward to enforce per-claim correctness (all claims must match).
- Skepticism: verification focus is prioritized for numeric/quantitative claims; proposer reward-hacking (shrinking claims) is a known risk.
3) AgentRAE: Remote Action Execution through Notification-based Visual Backdoors against Screenshots-based Mobile GUI Agents
- Shows a practical trigger surface: native notification icons as covert backdoor triggers for screenshot-based agents.
- Two-phase poisoning (contrastive trigger separation + balanced poison loss) achieves high ASR (>90% in many settings), scaling to 9 targets.
- Evaluates defenses (fine-pruning, fine-tuning, NAD) and finds ASR remains high post-defense.
- Skepticism: evaluations are offline on two open-source agents/datasets; online timing/interaction effects are not tested.
4) LED: A Benchmark for Evaluating Layout Error Detection in Document Analysis
- Defines 8 structural layout error types and builds a synthetic injection benchmark with 3 hierarchical tasks (doc detect → type classify → element classify).
- Finds Gemini 2.5 variants best and most prompt-stable; GPT models drop sharply on fine-grained tasks.
- Provides prompt/input configuration comparisons (image+JSON best; boxes-only weakest).
- Skepticism: synthetic + imbalanced error distribution (Missing dominates) and single-source injection modeling may limit generality.
5) EnterpriseLab: A Full-Stack Platform for developing and deploying agents in Enterprises
- Integrates MCP tool environments, executable trajectory synthesis from schemas, and training (SFT/DPO/Agentic GRPO) in a closed loop.
- Reports Qwen3-8B Agentic GRPO competitive with GPT-4o on EnterpriseArena execution accuracy (0.43 vs 0.45) and claims ~8–10× inference cost reduction.
- Shows adaptation via incremental trajectories after schema/API changes.
- Skepticism: scope is tool/API environments (not GUI); performance depends on base model capability and synthesis quality.
4) Practical next steps
- Adopt “intermediate artifact logging” as a default: store VE lists / claim lists / tool-call plans and measure disagreement rates; use them to trigger selective re-tries (as in M$^3$-ACE).
- Add an information-asymmetric verifier path in RAG: implement a Checker that only sees retrieved docs + atomized questions (not the draft answer) and track factuality deltas vs standard self-critique.
- Run a contamination-sensitivity audit before trusting leaderboard deltas: replicate router–worker noisy rewrite tests on your key MCQ benchmarks and report “violation breadth” alongside accuracy.
- For tool agents, treat permissions as first-class tokens + trajectories: prototype CoA-style “resource review → identity → decision” outputs and enforce that downstream answer/tool calls are conditioned on that trajectory.
- Harden GUI agents against visual trigger surfaces: add notification-aware preprocessing (mask/crop notification regions) and evaluate against icon-trigger backdoor scenarios similar to AgentRAE.
- If using MC Dropout for uncertainty, benchmark memory-heavy vs reasoning-heavy tasks separately: measure mean+std under stochastic inference; avoid enabling dropout blindly for specialized checkpoints.
- For long-video agents, measure “evidence efficiency” not just accuracy: track frames used / visual tokens / number of observation turns; add stagnation detectors for static repetition and premature stopping.
- Prefer executable oracles where possible: for code review or agent actions, convert evaluation into tests or environment-grounded success metrics rather than text similarity.
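The evidence-efficiency bookkeeping from the long-video bullet above can be sketched as a small per-episode tracker; field names and the stagnation rule are illustrative, not any paper's instrumentation.

```python
from dataclasses import dataclass, field

@dataclass
class EvidenceEfficiency:
    """Track cost-side metrics for a long-video agent episode,
    reported alongside (not instead of) accuracy."""
    frames_used: int = 0
    visual_tokens: int = 0
    observation_turns: int = 0
    plans: list = field(default_factory=list)

    def record(self, frames, tokens, plan):
        """Log one observation turn: frames sampled, tokens spent, plan issued."""
        self.frames_used += frames
        self.visual_tokens += tokens
        self.observation_turns += 1
        self.plans.append(plan)

    def stagnated(self, window=3):
        """Stagnation detector: the last `window` plans are identical,
        suggesting static repetition rather than progress."""
        tail = self.plans[-window:]
        return len(tail) == window and len(set(tail)) == 1
```

A stagnation flag can then trigger a forced plan change or early termination instead of burning the remaining frame budget on repeats.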
Generated from per-paper analyses; no external browsing.
