Daily AI Paper Report (2026-03-29)

Chinese version: [中文]

Run stats

  • Candidates: 1744
  • Selected: 30
  • Deepread completed: 30
  • Window (UTC): 2026-03-27T00:00:00Z → 2026-03-28T00:00:00Z (weekend_backlog_unknown, expanded=0)
Selected papers (arXiv ID · Title (PDF) · Categories · Score · Why · Tags)

  • 2603.23951 · From AI Assistant to AI Scientist: Autonomous Discovery of LLM-RL Algorithms with LLM Agents (PDF) · cs.CL · 95 · Why: Closed-loop LLM agents discover improved LLM-RL algorithms; strong automation + eval/iteration framework. · Tags: LLM-agents, RLHF, policy-optimization, auto-research, evaluation, algorithm-discovery
  • 2603.23007 · AgentRAE: Remote Action Execution through Notification-based Visual Backdoors against Screenshots-based Mobile GUI Agents (PDF) · cs.CR, cs.AI · 94 · Why: Concrete backdoor for mobile GUI agents via notifications; high-impact agent security threat model. · Tags: agent-security, mobile-agents, backdoors, visual-triggers, remote-action-execution, red-teaming
  • 2603.22869 · Chain-of-Authorization: Internalizing Authorization into Large Language Models via Reasoning Trajectories (PDF) · cs.AI · 92 · Why: Internalizes fine-grained authorization in LLM reasoning; targets data leakage and access-boundary failures. · Tags: authorization, access-control, LLM-safety, data-leakage, reasoning-trajectories, security
  • 2603.24477 · Composer 2 Technical Report (PDF) · cs.SE, cs.LG · 92 · Why: Agentic SWE model + RL in real tool harness; likely strong frontier agent capability signal. · Tags: agentic-coding, software-engineering, reinforcement-learning, tool-use, long-horizon, frontier-llm
  • 2603.24579 · MARCH: Multi-Agent Reinforced Self-Check for LLM Hallucination (PDF) · cs.CL · 90 · Why: Multi-agent asymmetry to reduce LLM-judge confirmation bias for RAG hallucination checking. · Tags: hallucination, RAG, LLM-judge, multi-agent, verification, reliability
  • 2603.21636 · Silicon Bureaucracy and AI Test-Oriented Education: Contamination Sensitivity and Score Confidence in LLM Benchmarks (PDF) · cs.AI, cs.CL · 90 · Why: Audit framework for benchmark contamination sensitivity & score confidence; key for LLM eval integrity. · Tags: LLM-evaluation, benchmarking, data-contamination, leakage, audit, measurement
  • 2603.24221 · Environment-Grounded Multi-Agent Workflow for Autonomous Penetration Testing (PDF) · cs.RO, cs.AI · 90 · Why: Environment-grounded multi-agent LLM pentesting for robots; concrete security workflow + memory graph. · Tags: agent-security, penetration-testing, cybersecurity, robotics, multi-agent, tool-use
  • 2603.23231 · PERMA: Benchmarking Personalized Memory Agents via Event-Driven Preference and Realistic Task Environments (PDF) · cs.AI · 88 · Why: Benchmark for personalized memory agents with evolving preferences; more realistic than pure retrieval tests. · Tags: agents, memory, personalization, evaluation, benchmarks, long-term-consistency
  • 2603.24058 · Mitigating Object Hallucinations in LVLMs via Attention Imbalance Rectification (PDF) · cs.CV, cs.AI · 88 · Why: Targets LVLM object hallucination via attention-imbalance rectification; reliability for high-stakes vision. · Tags: hallucinations, vision-language, reliability, attention, calibration, safety
  • 2603.21630 · EnterpriseLab: A Full-Stack Platform for developing and deploying agents in Enterprises (PDF) · cs.AI · 86 · Why: Full-stack closed-loop platform for enterprise agents: tools + data synthesis + training + eval in one. · Tags: agents, enterprise, tool-use, MCP, data-synthesis, evaluation, deployment
  • 2603.23129 · Polaris: A Gödel Agent Framework for Small Language Models through Experience-Abstracted Policy Repair (PDF) · cs.LG · 86 · Why: Gödel-style self-improving agent for small models via auditable policy patches; relevant to safe autonomy. · Tags: agents, self-improvement, policy-repair, small-models, auditing, reliability
  • 2603.22862 · The Evolution of Tool Use in LLM Agents: From Single-Tool Call to Multi-Tool Orchestration (PDF) · cs.SE, cs.CL · 86 · Why: Comprehensive review of multi-tool LLM agent orchestration, incl. safety/cost/verifiability constraints. · Tags: llm-agents, tool-use, orchestration, survey, safety, verification
  • 2603.08369 · M$^3$-ACE: Rectifying Visual Perception in Multimodal Math Reasoning via Multi-Agentic Context Engineering (PDF) · cs.AI · 86 · Why: Multi-agent context engineering to correct perception errors in multimodal math reasoning. · Tags: multimodal, VLM, math-reasoning, multi-agent, perception, robustness
  • 2603.23448 · Code Review Agent Benchmark (PDF) · cs.SE, cs.AI · 86 · Why: New benchmark/dataset for code review agents; timely for agentic SE quality assurance. · Tags: agents, benchmark, code-review, software-engineering, evaluation, datasets
  • 2603.24481 · Multi-Agent Reasoning with Consistency Verification Improves Uncertainty Calibration in Medical MCQA (PDF) · cs.AI, cs.CL, cs.LG · 86 · Why: Multi-agent verification + weighted fusion improves uncertainty calibration for medical MCQA. · Tags: uncertainty, calibration, verification, multi-agent, medical, reliability
  • 2603.19195 · How Auditory Knowledge in LLM Backbones Shapes Audio Language Models: A Holistic Evaluation (PDF) · eess.AS, cs.CL, cs.SD · 86 · Why: Holistic eval of LLM backbones' auditory knowledge + new benchmark (AKB-2000) for audio LMs. · Tags: audio-language-models, LLM-backbones, evaluation, benchmark, probing, multimodal
  • 2603.21475 · Unified-MAS: Universally Generating Domain-Specific Nodes for Empowering Automatic Multi-Agent Systems (PDF) · cs.AI · 86 · Why: Decouples agent node creation from orchestration; targets knowledge-intensive MAS generation bottleneck. · Tags: multi-agent, agent-architecture, orchestration, domain-experts, automation
  • 2603.24034 · From Oracle to Noisy Context: Mitigating Contextual Exposure Bias in Speech-LLMs (PDF) · cs.CL, cs.AI · 86 · Why: Mitigates contextual exposure bias in Speech-LLMs using noisy history + dropout + DPO on failures. · Tags: speech-LLM, robustness, DPO, distribution-shift, evaluation, alignment
  • 2603.23472 · Byzantine-Robust and Differentially Private Federated Optimization under Weaker Assumptions (PDF) · cs.LG, cs.CR, math.OC · 84 · Why: Unified DP + Byzantine-robust federated optimization with weaker assumptions and guarantees. · Tags: federated-learning, differential-privacy, byzantine-robustness, secure-ml, optimization
  • 2603.22651 · Benchmarking Multi-Agent LLM Architectures for Financial Document Processing: A Comparative Study of Orchestration Patterns, Cost-Accuracy Tradeoffs and Production Scaling Strategies (PDF) · cs.AI, cs.CL, cs.LG · 84 · Why: Large-scale benchmark of multi-agent orchestration patterns with cost/latency/accuracy tradeoffs. · Tags: multi-agent, orchestration, benchmark, evaluation, LLMs, cost-latency, document-IE
  • 2603.15080 · Open Biomedical Knowledge Graphs at Scale: Construction, Federation, and AI Agent Access with Samyama Graph Database (PDF) · cs.DB, cs.AI, q-bio.QM · 84 · Why: Open biomedical KGs + federation + explicit AI-agent access layer; reusable infra at scale. · Tags: knowledge-graphs, agents, tool-use, data-infrastructure, biomedicine, RAG
  • 2603.22999 · PaperVoyager: Building Interactive Web with Visual Language Models (PDF) · cs.CL · 84 · Why: Benchmark + agent that turns papers into executable interactive web systems; strong tool-use/document agent angle. · Tags: agents, tool-use, document-understanding, benchmark, evaluation, web-synthesis
  • 2603.23983 · SafeFlow: Real-Time Text-Driven Humanoid Whole-Body Control via Physics-Guided Rectified Flow and Selective Safety Gating (PDF) · cs.RO, cs.AI, eess.SY · 84 · Why: Text-driven humanoid control with explicit safety gating and physics guidance; addresses OOD unsafe motions. · Tags: robot-safety, agents, humanoids, safety-gating, OOD-robustness, control
  • 2603.24558 · LensWalk: Agentic Video Understanding by Planning How You See in Videos (PDF) · cs.CV, cs.AI · 83 · Why: Agentic video understanding with reason-plan-observe control of perception; likely reusable framework. · Tags: agentic, video-understanding, planning, active-perception, VLM-tools, efficiency
  • 2603.17265 · LED: A Benchmark for Evaluating Layout Error Detection in Document Analysis (PDF) · cs.CV, cs.CL · 82 · Why: LED benchmark targets structural layout errors beyond IoU; reusable eval for doc/LMM systems. · Tags: benchmark, evaluation, multimodal, document-ai, hallucination, robustness
  • 2603.22918 · EVA: Efficient Reinforcement Learning for End-to-End Video Agent (PDF) · cs.CV, cs.AI, cs.CL · 82 · Why: RL-based planning-before-perception for long videos; efficiency gains for multimodal agents. · Tags: video-agents, reinforcement-learning, planning, multimodal, efficiency, long-context
  • 2603.21574 · Adaptive Robust Estimator for Multi-Agent Reinforcement Learning (PDF) · cs.AI · 82 · Why: Robust MARL for collaborative reasoning; tackles noisy/heavy-tailed rewards and structured critique loops. · Tags: multi-agent, reinforcement-learning, robust-estimation, llm-reasoning, credit-assignment
  • 2603.17811 · Dropout Robustness and Cognitive Profiling of Transformer Models via Stochastic Inference (PDF) · cs.LG, cs.AI · 82 · Why: Systematic MC-dropout reliability study across 19 transformers; links variability to reasoning/memory. · Tags: uncertainty, MC-dropout, reliability, transformers, stochastic-inference, evaluation
  • 2603.23406 · Beyond Preset Identities: How Agents Form Stances and Boundaries in Generative Societies (PDF) · cs.AI, cs.CL, cs.HC · 82 · Why: Measures stance formation/identity negotiation in generative multi-agent societies; new metrics. · Tags: multi-agent, social-simulation, evaluation, trust, persuasion, agent-behavior
  • 2603.24167 · Walma: Learning to See Memory Corruption in WebAssembly (PDF) · cs.CR, cs.LG · 82 · Why: ML-based WebAssembly memory attestation vs adversarial host; concrete security evaluation on CVEs. · Tags: security, webassembly, memory-corruption, attestation, robustness, systems

AI Paper Insight Brief

2026-03-29

0) Executive takeaways (read this first)

  • “Perception is the bottleneck” is now measurable and fixable without retraining: multi-agent context engineering that cross-checks intermediate evidence (not just final answers) materially improves multimodal math accuracy (M$^3$-ACE).
  • Benchmarks are shifting from “did you get the box/answer” to “did you detect the structural failure mode”: LED reframes document layout evaluation around error types (missing/merge/split/etc.), exposing that strong VLMs still struggle on fine-grained structural diagnosis.
  • Inference-time stochasticity is not a free uncertainty win: MC Dropout often reduces accuracy (10/19 models) and disproportionately harms “memory” vs “reasoning,” so uncertainty methods must be architecture/task-aware.
  • Agent safety is increasingly about system surfaces (tools, GUIs, permissions), not just text: notification-icon visual backdoors can hijack mobile GUI agents at high ASR (AgentRAE), while internalized authorization trajectories can enforce permission boundaries (Chain-of-Authorization).
  • Closed-loop, environment-grounded training/evaluation is becoming the practical differentiator: EnterpriseLab (tool environments + executable synthesis + trajectory RL) and finance orchestration benchmarking show that architecture + cost controls dominate production viability.
  • Claim-level, bias-resistant verification is emerging as a scalable anti-hallucination training signal: MARCH uses an information-asymmetric Checker (blinded to the Solver output) + strict per-claim reward to lift an 8B model’s RAG factuality by ~20 points on reported averages.

2) Key themes (clusters)

Theme: Multi-agent evidence/consensus as a robustness primitive

Theme: Evaluation is becoming diagnostic (error types, contamination sensitivity, executable oracles)

Theme: Tool/GUI agent security & governance is moving “inside the model”

Theme: Planning-before-perception and adaptive observation for long-horizon video agents

  • Why it matters: Long videos break fixed-context VLM pipelines; agents must decide what to look at and how densely to sample to control cost while preserving evidence.
  • Representative papers: LensWalk (2603.24558), EVA (2603.22918).
  • Common approach:
    • Iterative plan–observe loops with parameterized observation actions (time window, frames, resize; tool choice).
    • Staged training or modular toolkits (SFT→KTO→GRPO; Scan/Focus/Stitch tools + timestamp anchors).
    • Explicit efficiency targets (visual token budgets; fewer frames; avoid heavy preprocessing).
  • Open questions / failure modes:
    • Reward hacking and sampling pathologies remain (EVA mitigates but doesn’t eliminate).
    • Planner stagnation (static repetition) and premature conclusions (LensWalk failure modes).
    • Dependence on tool interfaces and observer quality; generalization to new tools/modalities is unclear.
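
The plan–observe loop above can be sketched in a few lines of stdlib Python; `planner` and `observer` are hypothetical callables standing in for the policy model and the sampling toolkit, and the budget and stagnation checks are illustrative simplifications, not the exact mechanisms of LensWalk or EVA.

```python
from dataclasses import dataclass

@dataclass
class Observation:
    start_s: float    # time-window start (seconds)
    end_s: float      # time-window end
    num_frames: int   # frames to sample inside the window

def run_episode(planner, observer, question, budget_tokens=2048, max_turns=8):
    """Iterative plan-observe loop with a visual-token budget and a
    stagnation detector (stop when the planner repeats itself)."""
    spent, history, last_plan = 0, [], None
    for _ in range(max_turns):
        plan = planner(question, history)   # Observation, or a final answer str
        if isinstance(plan, str):           # planner commits to an answer
            return plan
        if plan == last_plan:               # static repetition: bail out
            break
        last_plan = plan
        evidence, cost = observer(plan)     # sample frames, report tokens used
        if spent + cost > budget_tokens:    # explicit efficiency target
            break
        spent += cost
        history.append((plan, evidence))
    return planner(question, history, force_answer=True)
```

Tracking `spent` alongside accuracy yields the "evidence efficiency" numbers (frames used, visual tokens, observation turns) that the failure-mode bullets argue should be reported.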

Theme: Training-time and decode-time robustness interventions (noise, attention, DP/Byzantine)

3) Technical synthesis

  • Intermediate-representation auditing is converging across modalities: VE lists (math vision), claim QA pairs (RAG factuality), verification questions (medical MCQA), and graph memories (pentesting) all serve as auditable state that can be cross-checked.
  • Information asymmetry is a recurring anti-bias tool: MARCH blinds the Checker to the Solver output; CoA forces an explicit authorization trajectory before content; both aim to prevent “seeing the answer first” bias.
  • Selective compute is the dominant systems pattern: M$^3$-ACE iterates only on ~10% disputed samples; finance orchestration shows hierarchical “knee” + caching/routing; safety gates in robotics execute only when stable/OOD-safe.
  • Robust statistics are entering RL-for-LLMs: ARE replaces batch-mean normalization with median-of-block robust estimation; POISE discovers normalization/validity masking mechanisms for GRPO variants.
  • Prompt/configuration sensitivity is now benchmarked explicitly: LED measures prompt robustness (CV/NR) across P1/P2/P3; dropout-at-inference shows architecture-dependent volatility; these suggest “one prompt/one setting” reporting is insufficient.
  • Decoding-time interventions are gaining credibility: AIR reduces CHAIR hallucination metrics substantially while preserving/improving MM-Vet; this parallels other “training-free” fixes like M$^3$-ACE’s context engineering.
  • Environment-grounded evaluation is becoming the gold standard for agents: EnterpriseLab executes trajectories against tool containers; pentesting workflow grounds memory in observed outputs; code review benchmark uses executable tests.
  • Security threats are increasingly visual and supply-chain for agents: AgentRAE shows tiny notification icons can be robust triggers; defenses that assume text-only triggers or static prompts are incomplete.
  • Calibration/uncertainty remains tricky without labels: MARC (the medical-MCQA consistency-verification method) improves ECE via consistency verification, but the paper notes failures when consistency rewards wrong knowledge, highlighting the need for grounding beyond self-consistency.
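
To make the robust-statistics bullet concrete, here is a minimal median-of-block-means baseline; ARE's actual estimator is not reproduced here, so treat this as an illustrative stand-in for the batch-mean normalization it replaces.

```python
import statistics

def robust_baseline(rewards, block_size=4):
    """Median of block means: a heavy-tail-robust alternative to the
    batch-mean baseline in group-normalized RL (GRPO-style). A single
    outlier corrupts at most one block mean, which the median ignores."""
    blocks = [rewards[i:i + block_size] for i in range(0, len(rewards), block_size)]
    return statistics.median(statistics.fmean(b) for b in blocks)

def advantages(rewards, block_size=4):
    baseline = robust_baseline(rewards, block_size)
    return [r - baseline for r in rewards]

# One spurious reward of 100.0 drags the plain mean to ~8.8, while the
# median-of-blocks baseline stays at 0.75, close to the typical reward.
rewards = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 100.0, 0.0, 1.0, 0.0, 1.0, 0.0]
```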

4) Top 5 papers (with “why now”)

1) M$^3$-ACE: Rectifying Visual Perception in Multimodal Math Reasoning via Multi-Agentic Context Engineering

  • Decouples visual evidence extraction from reasoning and uses multi-agent VE cross-validation with Summary/Refine tools.
  • Reports strong gains on MathVision (e.g., Gemini-3 Pro 85.0% → 89.1%) and large jumps for weaker models (e.g., GPT-5 72.0% → 82.2%).
  • Selective iteration: the refine stage keeps the high-consensus subset near 90% accuracy while only ~10% of samples loop.
  • Skepticism: depends on access to multiple strong multimodal models; heuristic thresholds and compute/latency trade-offs aren’t fully quantified.
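
A minimal sketch of the VE cross-validation and selective-iteration pattern described above; `cross_validate`, `refine`, and `solver` are hypothetical stand-ins, and the consensus threshold is an assumed heuristic rather than the paper's calibrated setting.

```python
from collections import Counter

def cross_validate(ve_lists, consensus_ratio=0.6):
    """Vote over visual-evidence (VE) items extracted by several models:
    keep items enough extractors agree on, and flag the sample as
    disputed when overall agreement is too low to trust."""
    n = len(ve_lists)
    votes = Counter(item for ve in ve_lists for item in set(ve))
    agreed = [item for item, c in votes.items() if c / n >= consensus_ratio]
    disputed = len(votes) > 0 and len(agreed) / len(votes) < consensus_ratio
    return agreed, disputed

def solve(samples, extractors, refine, solver):
    """Selective iteration: only disputed samples enter the refine loop,
    so the expensive path touches a small fraction of the data."""
    answers = []
    for s in samples:
        ve, disputed = cross_validate([extract(s) for extract in extractors])
        if disputed:
            ve = refine(s, ve)   # re-extract evidence for the disputed sample
        answers.append(solver(s, ve))
    return answers
```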

2) MARCH: Multi-Agent Reinforced Self-Check for LLM Hallucination

  • Introduces Solver–Proposer–Checker with the Checker blinded to reduce confirmation bias; trains via dual-trajectory PPO.
  • Large factuality gains reported: RAGTruth/FaithBench average 55.20% → ~75% (+~20).
  • Uses strict Zero-Tolerance Reward to enforce per-claim correctness (all claims must match).
  • Skepticism: verification is prioritized for numeric/quantitative claims; Proposer reward hacking (shrinking the claim set) is a known risk.
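
The zero-tolerance idea reduces to a one-liner; this sketch assumes the Checker's per-claim verdicts are already available and is not MARCH's actual reward code.

```python
def zero_tolerance_reward(claims, verdicts):
    """All-or-nothing per-claim reward: the trajectory earns 1.0 only if
    every atomized claim is verified as supported; one unsupported claim
    zeroes it. `verdicts` maps claim -> bool from a Checker that sees
    only the retrieved documents and the claims, never the Solver's
    draft answer (the information asymmetry that curbs confirmation bias)."""
    if not claims:          # guard the degenerate "no claims" case, which a
        return 0.0          # reward-hacking Proposer could otherwise exploit
    return 1.0 if all(verdicts[c] for c in claims) else 0.0
```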

3) AgentRAE: Remote Action Execution through Notification-based Visual Backdoors against Screenshots-based Mobile GUI Agents

  • Shows a practical trigger surface: native notification icons as covert backdoor triggers for screenshot-based agents.
  • Two-phase poisoning (contrastive trigger separation + balanced poison loss) achieves high ASR (>90% in many settings), scaling to 9 targets.
  • Evaluates defenses (fine-pruning, fine-tuning, NAD) and finds ASR remains high post-defense.
  • Skepticism: evaluations are offline on two open-source agents/datasets; online timing/interaction effects are not tested.

4) LED: A Benchmark for Evaluating Layout Error Detection in Document Analysis

  • Defines 8 structural layout error types and builds a synthetic injection benchmark with 3 hierarchical tasks (doc detect → type classify → element classify).
  • Finds Gemini 2.5 variants best and most prompt-stable; GPT models drop sharply on fine-grained tasks.
  • Provides prompt/input configuration comparisons (image+JSON best; boxes-only weakest).
  • Skepticism: synthetic + imbalanced error distribution (Missing dominates) and single-source injection modeling may limit generality.
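
LED's exact CV/NR formulas are not reproduced here, but the prompt-stability idea can be approximated with a plain coefficient of variation across prompt variants (P1/P2/P3):

```python
import statistics

def prompt_robustness(scores_by_prompt):
    """Mean score and coefficient of variation (CV) across prompt
    variants: a lower CV means a more prompt-stable model. Reporting
    the mean together with CV guards against single-prompt,
    single-setting leaderboard claims."""
    scores = list(scores_by_prompt.values())
    mean = statistics.fmean(scores)
    cv = statistics.pstdev(scores) / mean if mean else float("inf")
    return mean, cv
```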

5) EnterpriseLab: A Full-Stack Platform for developing and deploying agents in Enterprises

  • Integrates MCP tool environments, executable trajectory synthesis from schemas, and training (SFT/DPO/Agentic GRPO) in a closed loop.
  • Reports Qwen3-8B Agentic GRPO competitive with GPT-4o on EnterpriseArena execution accuracy (0.43 vs 0.45) and claims ~8–10× inference cost reduction.
  • Shows adaptation via incremental trajectories after schema/API changes.
  • Skepticism: scope is tool/API environments (not GUI); performance depends on base model capability and synthesis quality.

5) Practical next steps

  • Adopt “intermediate artifact logging” as a default: store VE lists / claim lists / tool-call plans and measure disagreement rates; use them to trigger selective re-tries (as in M$^3$-ACE).
  • Add an information-asymmetric verifier path in RAG: implement a Checker that only sees retrieved docs + atomized questions (not the draft answer) and track factuality deltas vs standard self-critique.
  • Run a contamination-sensitivity audit before trusting leaderboard deltas: replicate router–worker noisy rewrite tests on your key MCQ benchmarks and report “violation breadth” alongside accuracy.
  • For tool agents, treat permissions as first-class tokens + trajectories: prototype CoA-style “resource review → identity → decision” outputs and enforce that downstream answer/tool calls are conditioned on that trajectory.
  • Harden GUI agents against visual trigger surfaces: add notification-aware preprocessing (mask/crop notification regions) and evaluate against icon-trigger backdoor scenarios similar to AgentRAE.
  • If using MC Dropout for uncertainty, benchmark memory-heavy vs reasoning-heavy tasks separately: measure mean+std under stochastic inference; avoid enabling dropout blindly for specialized checkpoints.
  • For long-video agents, measure “evidence efficiency” not just accuracy: track frames used / visual tokens / number of observation turns; add stagnation detectors for static repetition and premature stopping.
  • Prefer executable oracles where possible: for code review or agent actions, convert evaluation into tests or environment-grounded success metrics rather than text similarity.
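
For the artifact-logging and selective re-try step above, a minimal disagreement metric over logged intermediate artifacts (a sketch; the 0.4 threshold is an assumed knob, not a value taken from any of the papers):

```python
def disagreement_rate(runs):
    """`runs` holds the artifact lists (VE items, atomized claims, or
    tool-call plans) from repeated or parallel extractions of one
    sample; returns the share of items not agreed on by every run."""
    sets = [set(r) for r in runs]
    union = set().union(*sets)
    inter = set.intersection(*sets)
    return 1 - len(inter) / len(union) if union else 0.0

def needs_retry(runs, threshold=0.4):
    """Selective re-try trigger: re-run only samples whose intermediate
    artifacts disagree beyond the threshold (the disputed-subset
    pattern, as in M$^3$-ACE's ~10% loop)."""
    return disagreement_rate(runs) > threshold
```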
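
And for hardening GUI agents, notification-aware preprocessing can start as simply as blanking the status-bar strip before the screenshot reaches the agent; the fixed `bar_height` is an assumption, since real deployments would mask per-platform notification regions rather than a uniform strip.

```python
def mask_notification_bar(screenshot, bar_height=80, fill=0):
    """Blank the top `bar_height` pixel rows of a row-major screenshot
    grid, removing the notification-icon surface that AgentRAE-style
    visual triggers rely on, before the image is passed to the agent."""
    return [
        [fill] * len(row) if y < bar_height else list(row)
        for y, row in enumerate(screenshot)
    ]
```

Pair this with the red-team evaluation in the same bullet: re-run the icon-trigger scenarios with and without masking and compare attack success rates.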

Generated from per-paper analyses; no external browsing.