Daily AI Paper Report (2026-03-29)
Published:
Chinese version: [中文]
Run stats
- Candidates: 1744
- Selected: 30
- Deepread completed: 30
- Window (UTC): 2026-03-27T00:00:00Z → 2026-03-28T00:00:00Z (weekend_backlog_unknown, expanded=0)
Selected papers
| arXiv ID | Title / Links | Categories | Score | Why | Tags |
|---|---|---|---|---|---|
| 2603.23951 | From AI Assistant to AI Scientist: Autonomous Discovery of LLM-RL Algorithms with LLM Agents | cs.CL | 95 | Closed-loop LLM agents discover improved LLM-RL algorithms; strong automation + eval/iteration framework. | LLM-agents, RLHF, policy-optimization, auto-research, evaluation, algorithm-discovery |
| 2603.23007 | AgentRAE: Remote Action Execution through Notification-based Visual Backdoors against Screenshots-based Mobile GUI Agents | cs.CR, cs.AI | 94 | Concrete backdoor for mobile GUI agents via notifications; high-impact agent security threat model. | agent-security, mobile-agents, backdoors, visual-triggers, remote-action-execution, red-teaming |
| 2603.22869 | Chain-of-Authorization: Internalizing Authorization into Large Language Models via Reasoning Trajectories | cs.AI | 92 | Internalizes fine-grained authorization in LLM reasoning; targets data leakage and access-boundary failures. | authorization, access-control, LLM-safety, data-leakage, reasoning-trajectories, security |
| 2603.24477 | Composer 2 Technical Report | cs.SE, cs.LG | 92 | Agentic SWE model + RL in real tool harness; likely strong frontier agent capability signal | agentic-coding, software-engineering, reinforcement-learning, tool-use, long-horizon, frontier-llm |
| 2603.24579 | MARCH: Multi-Agent Reinforced Self-Check for LLM Hallucination | cs.CL | 90 | Multi-agent asymmetry to reduce LLM-judge confirmation bias for RAG hallucination checking | hallucination, RAG, LLM-judge, multi-agent, verification, reliability |
| 2603.21636 | Silicon Bureaucracy and AI Test-Oriented Education: Contamination Sensitivity and Score Confidence in LLM Benchmarks | cs.AI, cs.CL | 90 | Audit framework for benchmark contamination sensitivity & score confidence; key for LLM eval integrity | LLM-evaluation, benchmarking, data-contamination, leakage, audit, measurement |
| 2603.24221 | Environment-Grounded Multi-Agent Workflow for Autonomous Penetration Testing | cs.RO, cs.AI | 90 | Environment-grounded multi-agent LLM pentesting for robots; concrete security workflow + memory graph. | agent-security, penetration-testing, cybersecurity, robotics, multi-agent, tool-use |
| 2603.23231 | PERMA: Benchmarking Personalized Memory Agents via Event-Driven Preference and Realistic Task Environments | cs.AI | 88 | Benchmark for personalized memory agents with evolving preferences; more realistic than pure retrieval tests. | agents, memory, personalization, evaluation, benchmarks, long-term-consistency |
| 2603.24058 | Mitigating Object Hallucinations in LVLMs via Attention Imbalance Rectification | cs.CV, cs.AI | 88 | Targets LVLM object hallucination via attention-imbalance rectification; reliability for high-stakes vision. | hallucinations, vision-language, reliability, attention, calibration, safety |
| 2603.21630 | EnterpriseLab: A Full-Stack Platform for developing and deploying agents in Enterprises | cs.AI | 86 | Full-stack closed-loop platform for enterprise agents: tools+data synthesis+training+eval in one. | agents, enterprise, tool-use, MCP, data-synthesis, evaluation, deployment |
| 2603.23129 | Polaris: A Gödel Agent Framework for Small Language Models through Experience-Abstracted Policy Repair | cs.LG | 86 | Gödel-style self-improving agent for small models via auditable policy patches; relevant to safe autonomy. | agents, self-improvement, policy-repair, small-models, auditing, reliability |
| 2603.22862 | The Evolution of Tool Use in LLM Agents: From Single-Tool Call to Multi-Tool Orchestration | cs.SE, cs.CL | 86 | Comprehensive review of multi-tool LLM agent orchestration incl. safety/cost/verifiability constraints | llm-agents, tool-use, orchestration, survey, safety, verification |
| 2603.08369 | M$^3$-ACE: Rectifying Visual Perception in Multimodal Math Reasoning via Multi-Agentic Context Engineering | cs.AI | 86 | Multi-agent context engineering to correct perception errors in multimodal math reasoning | multimodal, VLM, math-reasoning, multi-agent, perception, robustness |
| 2603.23448 | Code Review Agent Benchmark | cs.SE, cs.AI | 86 | New benchmark/dataset for code review agents; timely for agentic SE quality assurance. | agents, benchmark, code-review, software-engineering, evaluation, datasets |
| 2603.24481 | Multi-Agent Reasoning with Consistency Verification Improves Uncertainty Calibration in Medical MCQA | cs.AI, cs.CL, cs.LG | 86 | Multi-agent verification + weighted fusion improves uncertainty calibration for medical MCQA | uncertainty, calibration, verification, multi-agent, medical, reliability |
| 2603.19195 | How Auditory Knowledge in LLM Backbones Shapes Audio Language Models: A Holistic Evaluation | eess.AS, cs.CL, cs.SD | 86 | Holistic eval of LLM backbones' auditory knowledge + new benchmark (AKB-2000) for audio LMs. | audio-language-models, LLM-backbones, evaluation, benchmark, probing, multimodal |
| 2603.21475 | Unified-MAS: Universally Generating Domain-Specific Nodes for Empowering Automatic Multi-Agent Systems | cs.AI | 86 | Decouples agent node creation from orchestration; targets knowledge-intensive MAS generation bottleneck. | multi-agent, agent-architecture, orchestration, domain-experts, automation |
| 2603.24034 | From Oracle to Noisy Context: Mitigating Contextual Exposure Bias in Speech-LLMs | cs.CL, cs.AI | 86 | Mitigates contextual exposure bias in Speech-LLMs using noisy history + dropout + DPO on failures. | speech-LLM, robustness, DPO, distribution-shift, evaluation, alignment |
| 2603.23472 | Byzantine-Robust and Differentially Private Federated Optimization under Weaker Assumptions | cs.LG, cs.CR, math.OC | 84 | Unified DP + Byzantine-robust federated optimization with weaker assumptions and guarantees. | federated-learning, differential-privacy, byzantine-robustness, secure-ml, optimization |
| 2603.22651 | Benchmarking Multi-Agent LLM Architectures for Financial Document Processing: A Comparative Study of Orchestration Patterns, Cost-Accuracy Tradeoffs and Production Scaling Strategies | cs.AI, cs.CL, cs.LG | 84 | Large-scale benchmark of multi-agent orchestration patterns with cost/latency/accuracy tradeoffs. | multi-agent, orchestration, benchmark, evaluation, LLMs, cost-latency, document-IE |
| 2603.15080 | Open Biomedical Knowledge Graphs at Scale: Construction, Federation, and AI Agent Access with Samyama Graph Database | cs.DB, cs.AI, q-bio.QM | 84 | Open biomedical KGs + federation + explicit AI-agent access layer; reusable infra at scale | knowledge-graphs, agents, tool-use, data-infrastructure, biomedicine, RAG |
| 2603.22999 | PaperVoyager : Building Interactive Web with Visual Language Models | cs.CL | 84 | Benchmark + agent that turns papers into executable interactive web systems; strong tool-use/document agent angle. | agents, tool-use, document-understanding, benchmark, evaluation, web-synthesis |
| 2603.23983 | SafeFlow: Real-Time Text-Driven Humanoid Whole-Body Control via Physics-Guided Rectified Flow and Selective Safety Gating | cs.RO, cs.AI, eess.SY | 84 | Text-driven humanoid control with explicit safety gating and physics guidance; addresses OOD unsafe motions. | robot-safety, agents, humanoids, safety-gating, OOD-robustness, control |
| 2603.24558 | LensWalk: Agentic Video Understanding by Planning How You See in Videos | cs.CV, cs.AI | 83 | Agentic video understanding with reason-plan-observe control of perception; likely reusable framework. | agentic, video-understanding, planning, active-perception, VLM-tools, efficiency |
| 2603.17265 | LED: A Benchmark for Evaluating Layout Error Detection in Document Analysis | cs.CV, cs.CL | 82 | LED benchmark targets structural layout errors beyond IoU; reusable eval for doc/LMM systems. | benchmark, evaluation, multimodal, document-ai, hallucination, robustness |
| 2603.22918 | EVA: Efficient Reinforcement Learning for End-to-End Video Agent | cs.CV, cs.AI, cs.CL | 82 | RL-based planning-before-perception for long videos; efficiency gains for multimodal agents. | video-agents, reinforcement-learning, planning, multimodal, efficiency, long-context |
| 2603.21574 | Adaptive Robust Estimator for Multi-Agent Reinforcement Learning | cs.AI | 82 | Robust MARL for collaborative reasoning; tackles noisy/heavy-tailed rewards and structured critique loops | multi-agent, reinforcement-learning, robust-estimation, llm-reasoning, credit-assignment |
| 2603.17811 | Dropout Robustness and Cognitive Profiling of Transformer Models via Stochastic Inference | cs.LG, cs.AI | 82 | Systematic MC-dropout reliability study across 19 transformers; links variability to reasoning/memory | uncertainty, MC-dropout, reliability, transformers, stochastic-inference, evaluation |
| 2603.23406 | Beyond Preset Identities: How Agents Form Stances and Boundaries in Generative Societies | cs.AI, cs.CL, cs.HC | 82 | Measures stance formation/identity negotiation in generative multi-agent societies; new metrics. | multi-agent, social-simulation, evaluation, trust, persuasion, agent-behavior |
| 2603.24167 | Walma: Learning to See Memory Corruption in WebAssembly | cs.CR, cs.LG | 82 | ML-based WebAssembly memory attestation vs adversarial host; concrete security evaluation on CVEs | security, webassembly, memory-corruption, attestation, robustness, systems |
AI Paper Insight Brief
2026-03-29
0) Executive takeaways (read this first)
- “Perception is the bottleneck” is now measurable and fixable without retraining: multi-agent context engineering that cross-checks intermediate evidence (not just final answers) materially improves multimodal math accuracy (M$^3$-ACE).
- Benchmarks are shifting from “did you get the box/answer” to “did you detect the structural failure mode”: LED reframes document layout evaluation around error types (missing/merge/split/etc.), exposing that strong VLMs still struggle on fine-grained structural diagnosis.
- Inference-time stochasticity is not a free uncertainty win: MC Dropout often reduces accuracy (10/19 models) and disproportionately harms “memory” vs “reasoning,” so uncertainty methods must be architecture/task-aware.
- Agent safety is increasingly about system surfaces (tools, GUIs, permissions), not just text: notification-icon visual backdoors can hijack mobile GUI agents at high ASR (AgentRAE), while internalized authorization trajectories can enforce permission boundaries (Chain-of-Authorization).
- Closed-loop, environment-grounded training/evaluation is becoming the practical differentiator: EnterpriseLab (tool environments + executable synthesis + trajectory RL) and finance orchestration benchmarking show that architecture + cost controls dominate production viability.
- Claim-level, bias-resistant verification is emerging as a scalable anti-hallucination training signal: MARCH uses an information-asymmetric Checker (blinded to the Solver output) + strict per-claim reward to lift an 8B model’s RAG factuality by ~20 points on reported averages.
1) Key themes (clusters)
Theme: Multi-agent evidence/consensus as a robustness primitive
- Why it matters: Many failures persist because models commit early to wrong intermediate state (visual evidence, critiques, rewards). Cross-agent disagreement signals and structured reconciliation can selectively spend compute where it matters.
- Representative papers:
- M$^3$-ACE: Rectifying Visual Perception in Multimodal Math Reasoning via Multi-Agentic Context Engineering
- MARCH: Multi-Agent Reinforced Self-Check for LLM Hallucination
- Multi-Agent Reasoning with Consistency Verification Improves Uncertainty Calibration in Medical MCQA
- Common approach:
- Separate intermediate artifacts (VE lists; critique deltas; verification Qs) from final answers and make them first-class objects.
- Use disagreement/consensus (conflict ratios, majority votes, inconsistency scores) to gate extra iterations or weight fusion.
- Add structure/tools around agent interaction (Summary/Refine tools; staged answer–critique–rewrite; verification protocols).
- Open questions / failure modes:
- Heuristic thresholds and gating policies (e.g., conflict ratio > 0.2) may be brittle across domains/models.
- “Consistency ≠ correctness”: verification can reward coherent but wrong reasoning (noted in medical MCQA).
- Compute/latency overhead and scaling behavior under real-time constraints are often underreported.
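The disagreement-gated fusion pattern above can be sketched in a few lines. This is a minimal illustration, assuming categorical per-agent answers and a caller-supplied refinement step; `conflict_ratio`, `fuse_with_gating`, and `refine_fn` are hypothetical names, and the 0.2 threshold is the heuristic noted above, not a validated default.

```python
from collections import Counter

def conflict_ratio(answers):
    """Fraction of agent answers that disagree with the majority vote."""
    counts = Counter(answers)
    _, majority_count = counts.most_common(1)[0]
    return 1.0 - majority_count / len(answers)

def fuse_with_gating(answers, refine_fn, threshold=0.2, max_rounds=2):
    """Accept the majority answer when agents agree; otherwise spend extra
    compute on refinement rounds until consensus or the budget runs out."""
    for _ in range(max_rounds):
        if conflict_ratio(answers) <= threshold:
            break
        answers = refine_fn(answers)  # e.g. re-query agents with pooled evidence
    return Counter(answers).most_common(1)[0][0]
```

The point of the gate is that only high-conflict samples pay for extra rounds; low-conflict samples exit immediately.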
Theme: Evaluation is becoming diagnostic (error types, contamination sensitivity, executable oracles)
- Why it matters: Aggregate scores hide where systems fail (structural layout errors, contamination sensitivity, unverifiable code review). New benchmarks aim to expose failure modes with more actionable signals.
- Representative papers:
- LED: A Benchmark for Evaluating Layout Error Detection in Document Analysis
- Silicon Bureaucracy and AI Test-Oriented Education: Contamination Sensitivity and Score Confidence in LLM Benchmarks
- Code Review Agent Benchmark
- Common approach:
- Replace single metrics with typed error taxonomies and hierarchical tasks (LED T1–T3).
- Stress-test score confidence via controlled perturbation/aggregation (router–worker noisy rewrite audit).
- Use executable evaluation oracles (convert review comments → tests that must fail-before/pass-after).
- Open questions / failure modes:
- Synthetic construction and imbalance (LED Missing dominates) may skew conclusions.
- Contamination audits currently shown on small samples (n=100) and MCQ-only settings.
- Test-based oracles depend on environment reconstruction and a coding agent, conflating capabilities.
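The fail-before/pass-after criterion for review-derived tests can be sketched as a small validity check. The names `is_valid_oracle` and `double_test` are illustrative, not the benchmark's API; the buggy/fixed implementations are toy stand-ins.

```python
def is_valid_oracle(test_fn, buggy_impl, fixed_impl):
    """A review-derived test is a usable oracle only if it fails on the
    pre-fix implementation and passes on the post-fix one."""
    def passes(impl):
        try:
            test_fn(impl)
            return True
        except AssertionError:
            return False
    return (not passes(buggy_impl)) and passes(fixed_impl)

def double_test(impl):
    """Toy test derived from a review comment like 'should double, not add 2'."""
    assert impl(3) == 6
```

A test that passes on both versions (or fails on both) carries no signal about the review comment and is rejected.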
Theme: Tool/GUI agent security & governance is moving “inside the model”
- Why it matters: As agents act in real systems, the attack surface includes UI pixels, tool schemas, and permission boundaries. Defenses need causal enforcement, not just prompts.
- Representative papers:
- AgentRAE: Remote Action Execution through Notification-based Visual Backdoors against Screenshots-based Mobile GUI Agents
- Chain-of-Authorization: Internalizing Authorization into Large Language Models via Reasoning Trajectories
- The Evolution of Tool Use in LLM Agents: From Single-Tool Call to Multi-Tool Orchestration
- Common approach:
- Formalize system-level threat models (supply-chain poisoning; permission mismatch; transaction semantics).
- Enforce information barriers or causal steps (authorization trajectory before answering; Checker blinded to Solver).
- Evaluate against adaptive attacks/defenses (PAIR jailbreaks; pruning/finetuning defenses for backdoors).
- Open questions / failure modes:
- Backdoor evaluations are largely offline; online interactive agent dynamics remain under-tested.
- CoA requires fine-tuning + permission token engineering; real permission taxonomies are messy and dynamic.
- Tool orchestration safety needs transaction semantics and replayable audits, but unified standards remain open.
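The "authorize before acting" ordering can be illustrated with an external wrapper. Note the difference from the paper: CoA internalizes this ordering into the model via fine-tuned reasoning trajectories, whereas this sketch (hypothetical `authorize_then_act`, toy ACL) only shows the causal structure of resource review → identity → decision → action.

```python
def authorize_then_act(request, acl, act_fn):
    """Emit an explicit authorization trajectory and execute the action
    only on an ALLOW decision. `acl` maps user -> set of permitted
    resources; a real permission taxonomy would be far messier."""
    allowed = request["resource"] in acl.get(request["user"], set())
    trajectory = {
        "resource": request["resource"],
        "identity": request["user"],
        "decision": "ALLOW" if allowed else "DENY",
    }
    result = act_fn(request) if allowed else None
    return trajectory, result
```

The key property is that the action is causally downstream of the decision, so a DENY cannot be bypassed by prompt content.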
Theme: Planning-before-perception and adaptive observation for long-horizon video agents
- Why it matters: Long videos break fixed-context VLM pipelines; agents must decide what to look at and how densely to sample to control cost while preserving evidence.
- Representative papers:
- LensWalk: Agentic Video Understanding by Planning How You See in Videos
- EVA: Efficient Reinforcement Learning for End-to-End Video Agent
- Common approach:
- Iterative plan–observe loops with parameterized observation actions (time window, frames, resize; tool choice).
- Staged training or modular toolkits (SFT→KTO→GRPO; Scan/Focus/Stitch tools + timestamp anchors).
- Explicit efficiency targets (visual token budgets; fewer frames; avoid heavy preprocessing).
- Open questions / failure modes:
- Reward hacking and sampling pathologies remain (EVA mitigates but doesn’t eliminate).
- Planner stagnation (static repetition) and premature conclusions (LensWalk failure modes).
- Dependence on tool interfaces and observer quality; generalization to new tools/modalities is unclear.
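A parameterized observation action can be sketched as budget-constrained frame sampling. This assumes a fixed per-frame token cost; the function name and budgeting rule are illustrative, not either paper's planner.

```python
def plan_observation(start, end, fps, token_budget, tokens_per_frame):
    """Choose sampling timestamps for the window [start, end] (seconds)
    so the sampled frames fit a visual-token budget: dense sampling for
    short windows, sparse strides when the window is long."""
    total_frames = max(1, int((end - start) * fps))
    max_frames = max(1, token_budget // tokens_per_frame)
    stride = max(1, -(-total_frames // max_frames))  # ceil(total / max)
    return [start + i / fps for i in range(0, total_frames, stride)][:max_frames]
```

For example, a 10 s window at 2 fps with a 100-token budget and 25 tokens/frame yields four evenly spaced timestamps instead of all 20 frames.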
Theme: Training-time and decode-time robustness interventions (noise, attention, DP/Byzantine)
- Why it matters: Robustness failures often come from distribution shift (noisy context), attention pathologies (hallucination), or adversarial participants (federated learning). Lightweight interventions can be high leverage.
- Representative papers:
- From Oracle to Noisy Context: Mitigating Contextual Exposure Bias in Speech-LLMs
- Mitigating Object Hallucinations in LVLMs via Attention Imbalance Rectification
- Byzantine-Robust and Differentially Private Federated Optimization under Weaker Assumptions
- Common approach:
- Train on realistic noise (teacher ASR hypotheses) + regularize reliance (context dropout) + align with preferences (DPO).
- Decode-time head-specific attention interventions (AIR) to reduce hallucinations without retraining.
- Robust aggregation + clipping + momentum + error feedback with high-probability guarantees (Byz-Clip21-SGD2M).
- Open questions / failure modes:
- Hyperparameter sensitivity (AIR λ/β tradeoffs; CoA learning-rate sensitivity; DPO scaling γ).
- Teacher-bias in noise modeling (single ASR teacher) and limited scenario coverage (no overlapping speakers).
- Theory-to-practice gaps: constraints/hyperparameter restrictions and unproven variants (example-wise clipping).
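The aggregation side of the federated item can be sketched as norm clipping followed by a coordinate-wise median. This is a simplified stand-in for intuition only, not Byz-Clip21-SGD2M itself, which additionally uses momentum and error feedback.

```python
import math

def clip_to_norm(update, tau):
    """Scale an update so its L2 norm is at most tau."""
    norm = math.sqrt(sum(x * x for x in update))
    scale = min(1.0, tau / norm) if norm > 0 else 1.0
    return [x * scale for x in update]

def robust_aggregate(updates, tau):
    """Clip each client update, then take the coordinate-wise median
    (upper median for even client counts), so a minority of Byzantine
    clients cannot drag the aggregate arbitrarily far."""
    clipped = [clip_to_norm(u, tau) for u in updates]
    dim = len(clipped[0])
    mid = len(clipped) // 2
    return [sorted(u[i] for u in clipped)[mid] for i in range(dim)]
```

Clipping bounds the damage any single client can do; the median discards the extremes that survive clipping.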
2) Technical synthesis
- Intermediate-representation auditing is converging across modalities: VE lists (math vision), claim QA pairs (RAG factuality), verification questions (medical MCQA), and graph memories (pentesting) all serve as auditable state that can be cross-checked.
- Information asymmetry is a recurring anti-bias tool: MARCH blinds the Checker to the Solver output; CoA forces an explicit authorization trajectory before content; both aim to prevent “seeing the answer first” bias.
- Selective compute is the dominant systems pattern: M$^3$-ACE iterates only on ~10% disputed samples; finance orchestration shows hierarchical “knee” + caching/routing; safety gates in robotics execute only when stable/OOD-safe.
- Robust statistics are entering RL-for-LLMs: ARE replaces batch-mean normalization with median-of-block robust estimation; POISE discovers normalization/validity masking mechanisms for GRPO variants.
- Prompt/configuration sensitivity is now benchmarked explicitly: LED measures prompt robustness (CV/NR) across P1/P2/P3; dropout-at-inference shows architecture-dependent volatility; these suggest “one prompt/one setting” reporting is insufficient.
- Decoding-time interventions are gaining credibility: AIR reduces CHAIR hallucination metrics substantially while preserving/improving MM-Vet; this parallels other “training-free” fixes like M$^3$-ACE’s context engineering.
- Environment-grounded evaluation is becoming the gold standard for agents: EnterpriseLab executes trajectories against tool containers; pentesting workflow grounds memory in observed outputs; code review benchmark uses executable tests.
- Security threats are increasingly visual and supply-chain for agents: AgentRAE shows tiny notification icons can be robust triggers; defenses that assume text-only triggers or static prompts are incomplete.
- Calibration/uncertainty remains tricky without labels: MARC improves ECE via consistency verification, but the paper notes failure when consistency rewards wrong knowledge—highlighting the need for grounding beyond self-consistency.
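The median-of-block idea mentioned for ARE (replacing batch-mean normalization with a robust location estimate) can be sketched as follows; block count and naming are illustrative, not ARE's exact estimator.

```python
import statistics

def median_of_blocks(rewards, num_blocks=4):
    """Robust location estimate for heavy-tailed rewards: average within
    blocks, then take the median of block means. A single outlier can
    corrupt at most one block, so it cannot move the median far."""
    size = max(1, len(rewards) // num_blocks)
    block_means = [statistics.fmean(rewards[i:i + size])
                   for i in range(0, len(rewards), size)]
    return statistics.median(block_means)
```

With rewards `[1, 1, 1, 1, 1, 1, 1, 100]`, the plain mean is ~13.4, while the median-of-blocks stays at 1: the outlier inflates one block mean but not the median over blocks.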
3) Top 5 papers (with “why now”)
1) M$^3$-ACE: Rectifying Visual Perception in Multimodal Math Reasoning via Multi-Agentic Context Engineering
- Decouples visual evidence extraction from reasoning and uses multi-agent VE cross-validation with Summary/Refine tools.
- Reports strong gains on MathVision (e.g., Gemini-3 Pro 85.0% → 89.1%) and large jumps for weaker models (e.g., GPT-5 72.0% → 82.2%).
- Selective iteration: refine stage keeps high-consensus subset near 90% accuracy while only ~10% samples loop.
- Skepticism: depends on access to multiple strong multimodal models; heuristic thresholds and compute/latency trade-offs aren’t fully quantified.
2) MARCH: Multi-Agent Reinforced Self-Check for LLM Hallucination
- Introduces Solver–Proposer–Checker with the Checker blinded to reduce confirmation bias; trains via dual-trajectory PPO.
- Large factuality gains reported: RAGTruth/FaithBench average 55.20% → ~75% (+~20).
- Uses strict Zero-Tolerance Reward to enforce per-claim correctness (all claims must match).
- Skepticism: verification focus is prioritized for numeric/quantitative claims; proposer reward-hacking (shrinking claims) is a known risk.
3) AgentRAE: Remote Action Execution through Notification-based Visual Backdoors against Screenshots-based Mobile GUI Agents
- Shows a practical trigger surface: native notification icons as covert backdoor triggers for screenshot-based agents.
- Two-phase poisoning (contrastive trigger separation + balanced poison loss) achieves high ASR (>90% in many settings), scaling to 9 targets.
- Evaluates defenses (fine-pruning, fine-tuning, NAD) and finds ASR remains high post-defense.
- Skepticism: evaluations are offline on two open-source agents/datasets; online timing/interaction effects are not tested.
4) LED: A Benchmark for Evaluating Layout Error Detection in Document Analysis
- Defines 8 structural layout error types and builds a synthetic injection benchmark with 3 hierarchical tasks (doc detect → type classify → element classify).
- Finds Gemini 2.5 variants best and most prompt-stable; GPT models drop sharply on fine-grained tasks.
- Provides prompt/input configuration comparisons (image+JSON best; boxes-only weakest).
- Skepticism: synthetic + imbalanced error distribution (Missing dominates) and single-source injection modeling may limit generality.
5) EnterpriseLab: A Full-Stack Platform for developing and deploying agents in Enterprises
- Integrates MCP tool environments, executable trajectory synthesis from schemas, and training (SFT/DPO/Agentic GRPO) in a closed loop.
- Reports Qwen3-8B Agentic GRPO competitive with GPT-4o on EnterpriseArena execution accuracy (0.43 vs 0.45) and claims ~8–10× inference cost reduction.
- Shows adaptation via incremental trajectories after schema/API changes.
- Skepticism: scope is tool/API environments (not GUI); performance depends on base model capability and synthesis quality.
4) Practical next steps
- Adopt “intermediate artifact logging” as a default: store VE lists / claim lists / tool-call plans and measure disagreement rates; use them to trigger selective re-tries (as in M$^3$-ACE).
- Add an information-asymmetric verifier path in RAG: implement a Checker that only sees retrieved docs + atomized questions (not the draft answer) and track factuality deltas vs standard self-critique.
- Run a contamination-sensitivity audit before trusting leaderboard deltas: replicate router–worker noisy rewrite tests on your key MCQ benchmarks and report “violation breadth” alongside accuracy.
- For tool agents, treat permissions as first-class tokens + trajectories: prototype CoA-style “resource review → identity → decision” outputs and enforce that downstream answer/tool calls are conditioned on that trajectory.
- Harden GUI agents against visual trigger surfaces: add notification-aware preprocessing (mask/crop notification regions) and evaluate against icon-trigger backdoor scenarios similar to AgentRAE.
- If using MC Dropout for uncertainty, benchmark memory-heavy vs reasoning-heavy tasks separately: measure mean+std under stochastic inference; avoid enabling dropout blindly for specialized checkpoints.
- For long-video agents, measure “evidence efficiency” not just accuracy: track frames used / visual tokens / number of observation turns; add stagnation detectors for static repetition and premature stopping.
- Prefer executable oracles where possible: for code review or agent actions, convert evaluation into tests or environment-grounded success metrics rather than text similarity.
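The evidence-efficiency bookkeeping from the long-video bullet above can be sketched as a small per-episode tracker; field names and the stagnation rule are illustrative, not any paper's instrumentation.

```python
from dataclasses import dataclass, field

@dataclass
class EvidenceEfficiency:
    """Track cost-side metrics for a long-video agent episode,
    reported alongside (not instead of) accuracy."""
    frames_used: int = 0
    visual_tokens: int = 0
    observation_turns: int = 0
    plans: list = field(default_factory=list)

    def record(self, frames, tokens, plan):
        """Log one observation turn: frames sampled, tokens spent, plan issued."""
        self.frames_used += frames
        self.visual_tokens += tokens
        self.observation_turns += 1
        self.plans.append(plan)

    def stagnated(self, window=3):
        """Stagnation detector: the last `window` plans are identical,
        suggesting static repetition rather than progress."""
        tail = self.plans[-window:]
        return len(tail) == window and len(set(tail)) == 1
```

A stagnation flag can then trigger a forced plan change or early termination instead of burning the remaining frame budget on repeats.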
Generated from per-paper analyses; no external browsing.
