AI Paper Daily (2026-03-29)

Published:

English version: /paper-news/2026-03-29/

Run Statistics

  • Candidate papers: 1744
  • Selected papers: 30
  • Deep reads completed: 30
  • Time window (UTC): 2026-03-27T00:00:00Z → 2026-03-28T00:00:00Z (weekend_backlog_unknown, expanded=0)
Papers used for the summary
Each entry: arXiv ID — Title (PDF) / categories · score · selection rationale / tags

  • 2603.23951 — From AI Assistant to AI Scientist: Autonomous Discovery of LLM-RL Algorithms with LLM Agents (PDF)
    cs.CL · 95 · Closed-loop LLM agents discover improved LLM-RL algorithms; strong automation + eval/iteration framework.
    Tags: LLM-agents, RLHF, policy-optimization, auto-research, evaluation, algorithm-discovery
  • 2603.23007 — AgentRAE: Remote Action Execution through Notification-based Visual Backdoors against Screenshots-based Mobile GUI Agents (PDF)
    cs.CR, cs.AI · 94 · Concrete backdoor for mobile GUI agents via notifications; high-impact agent security threat model.
    Tags: agent-security, mobile-agents, backdoors, visual-triggers, remote-action-execution, red-teaming
  • 2603.22869 — Chain-of-Authorization: Internalizing Authorization into Large Language Models via Reasoning Trajectories (PDF)
    cs.AI · 92 · Internalizes fine-grained authorization in LLM reasoning; targets data leakage and access-boundary failures.
    Tags: authorization, access-control, LLM-safety, data-leakage, reasoning-trajectories, security
  • 2603.24477 — Composer 2 Technical Report (PDF)
    cs.SE, cs.LG · 92 · Agentic SWE model + RL in real tool harness; likely strong frontier agent capability signal.
    Tags: agentic-coding, software-engineering, reinforcement-learning, tool-use, long-horizon, frontier-llm
  • 2603.24579 — MARCH: Multi-Agent Reinforced Self-Check for LLM Hallucination (PDF)
    cs.CL · 90 · Multi-agent asymmetry to reduce LLM-judge confirmation bias for RAG hallucination checking.
    Tags: hallucination, RAG, LLM-judge, multi-agent, verification, reliability
  • 2603.21636 — Silicon Bureaucracy and AI Test-Oriented Education: Contamination Sensitivity and Score Confidence in LLM Benchmarks (PDF)
    cs.AI, cs.CL · 90 · Audit framework for benchmark contamination sensitivity & score confidence; key for LLM eval integrity.
    Tags: LLM-evaluation, benchmarking, data-contamination, leakage, audit, measurement
  • 2603.24221 — Environment-Grounded Multi-Agent Workflow for Autonomous Penetration Testing (PDF)
    cs.RO, cs.AI · 90 · Environment-grounded multi-agent LLM pentesting for robots; concrete security workflow + memory graph.
    Tags: agent-security, penetration-testing, cybersecurity, robotics, multi-agent, tool-use
  • 2603.23231 — PERMA: Benchmarking Personalized Memory Agents via Event-Driven Preference and Realistic Task Environments (PDF)
    cs.AI · 88 · Benchmark for personalized memory agents with evolving preferences; more realistic than pure retrieval tests.
    Tags: agents, memory, personalization, evaluation, benchmarks, long-term-consistency
  • 2603.24058 — Mitigating Object Hallucinations in LVLMs via Attention Imbalance Rectification (PDF)
    cs.CV, cs.AI · 88 · Targets LVLM object hallucination via attention-imbalance rectification; reliability for high-stakes vision.
    Tags: hallucinations, vision-language, reliability, attention, calibration, safety
  • 2603.21630 — EnterpriseLab: A Full-Stack Platform for developing and deploying agents in Enterprises (PDF)
    cs.AI · 86 · Full-stack closed-loop platform for enterprise agents: tools+data synthesis+training+eval in one.
    Tags: agents, enterprise, tool-use, MCP, data-synthesis, evaluation, deployment
  • 2603.23129 — Polaris: A Gödel Agent Framework for Small Language Models through Experience-Abstracted Policy Repair (PDF)
    cs.LG · 86 · Gödel-style self-improving agent for small models via auditable policy patches; relevant to safe autonomy.
    Tags: agents, self-improvement, policy-repair, small-models, auditing, reliability
  • 2603.22862 — The Evolution of Tool Use in LLM Agents: From Single-Tool Call to Multi-Tool Orchestration (PDF)
    cs.SE, cs.CL · 86 · Comprehensive review of multi-tool LLM agent orchestration incl. safety/cost/verifiability constraints.
    Tags: llm-agents, tool-use, orchestration, survey, safety, verification
  • 2603.08369 — M$^3$-ACE: Rectifying Visual Perception in Multimodal Math Reasoning via Multi-Agentic Context Engineering (PDF)
    cs.AI · 86 · Multi-agent context engineering to correct perception errors in multimodal math reasoning.
    Tags: multimodal, VLM, math-reasoning, multi-agent, perception, robustness
  • 2603.23448 — Code Review Agent Benchmark (PDF)
    cs.SE, cs.AI · 86 · New benchmark/dataset for code review agents; timely for agentic SE quality assurance.
    Tags: agents, benchmark, code-review, software-engineering, evaluation, datasets
  • 2603.24481 — Multi-Agent Reasoning with Consistency Verification Improves Uncertainty Calibration in Medical MCQA (PDF)
    cs.AI, cs.CL, cs.LG · 86 · Multi-agent verification + weighted fusion improves uncertainty calibration for medical MCQA.
    Tags: uncertainty, calibration, verification, multi-agent, medical, reliability
  • 2603.19195 — How Auditory Knowledge in LLM Backbones Shapes Audio Language Models: A Holistic Evaluation (PDF)
    eess.AS, cs.CL, cs.SD · 86 · Holistic eval of LLM backbones' auditory knowledge + new benchmark (AKB-2000) for audio LMs.
    Tags: audio-language-models, LLM-backbones, evaluation, benchmark, probing, multimodal
  • 2603.21475 — Unified-MAS: Universally Generating Domain-Specific Nodes for Empowering Automatic Multi-Agent Systems (PDF)
    cs.AI · 86 · Decouples agent node creation from orchestration; targets knowledge-intensive MAS generation bottleneck.
    Tags: multi-agent, agent-architecture, orchestration, domain-experts, automation
  • 2603.24034 — From Oracle to Noisy Context: Mitigating Contextual Exposure Bias in Speech-LLMs (PDF)
    cs.CL, cs.AI · 86 · Mitigates contextual exposure bias in Speech-LLMs using noisy history + dropout + DPO on failures.
    Tags: speech-LLM, robustness, DPO, distribution-shift, evaluation, alignment
  • 2603.23472 — Byzantine-Robust and Differentially Private Federated Optimization under Weaker Assumptions (PDF)
    cs.LG, cs.CR, math.OC · 84 · Unified DP + Byzantine-robust federated optimization with weaker assumptions and guarantees.
    Tags: federated-learning, differential-privacy, byzantine-robustness, secure-ml, optimization
  • 2603.22651 — Benchmarking Multi-Agent LLM Architectures for Financial Document Processing: A Comparative Study of Orchestration Patterns, Cost-Accuracy Tradeoffs and Production Scaling Strategies (PDF)
    cs.AI, cs.CL, cs.LG · 84 · Large-scale benchmark of multi-agent orchestration patterns with cost/latency/accuracy tradeoffs.
    Tags: multi-agent, orchestration, benchmark, evaluation, LLMs, cost-latency, document-IE
  • 2603.15080 — Open Biomedical Knowledge Graphs at Scale: Construction, Federation, and AI Agent Access with Samyama Graph Database (PDF)
    cs.DB, cs.AI, q-bio.QM · 84 · Open biomedical KGs + federation + explicit AI-agent access layer; reusable infra at scale.
    Tags: knowledge-graphs, agents, tool-use, data-infrastructure, biomedicine, RAG
  • 2603.22999 — PaperVoyager: Building Interactive Web with Visual Language Models (PDF)
    cs.CL · 84 · Benchmark + agent that turns papers into executable interactive web systems; strong tool-use/document agent angle.
    Tags: agents, tool-use, document-understanding, benchmark, evaluation, web-synthesis
  • 2603.23983 — SafeFlow: Real-Time Text-Driven Humanoid Whole-Body Control via Physics-Guided Rectified Flow and Selective Safety Gating (PDF)
    cs.RO, cs.AI, eess.SY · 84 · Text-driven humanoid control with explicit safety gating and physics guidance; addresses OOD unsafe motions.
    Tags: robot-safety, agents, humanoids, safety-gating, OOD-robustness, control
  • 2603.24558 — LensWalk: Agentic Video Understanding by Planning How You See in Videos (PDF)
    cs.CV, cs.AI · 83 · Agentic video understanding with reason-plan-observe control of perception; likely reusable framework.
    Tags: agentic, video-understanding, planning, active-perception, VLM-tools, efficiency
  • 2603.17265 — LED: A Benchmark for Evaluating Layout Error Detection in Document Analysis (PDF)
    cs.CV, cs.CL · 82 · LED benchmark targets structural layout errors beyond IoU; reusable eval for doc/LMM systems.
    Tags: benchmark, evaluation, multimodal, document-ai, hallucination, robustness
  • 2603.22918 — EVA: Efficient Reinforcement Learning for End-to-End Video Agent (PDF)
    cs.CV, cs.AI, cs.CL · 82 · RL-based planning-before-perception for long videos; efficiency gains for multimodal agents.
    Tags: video-agents, reinforcement-learning, planning, multimodal, efficiency, long-context
  • 2603.21574 — Adaptive Robust Estimator for Multi-Agent Reinforcement Learning (PDF)
    cs.AI · 82 · Robust MARL for collaborative reasoning; tackles noisy/heavy-tailed rewards and structured critique loops.
    Tags: multi-agent, reinforcement-learning, robust-estimation, llm-reasoning, credit-assignment
  • 2603.17811 — Dropout Robustness and Cognitive Profiling of Transformer Models via Stochastic Inference (PDF)
    cs.LG, cs.AI · 82 · Systematic MC-dropout reliability study across 19 transformers; links variability to reasoning/memory.
    Tags: uncertainty, MC-dropout, reliability, transformers, stochastic-inference, evaluation
  • 2603.23406 — Beyond Preset Identities: How Agents Form Stances and Boundaries in Generative Societies (PDF)
    cs.AI, cs.CL, cs.HC · 82 · Measures stance formation/identity negotiation in generative multi-agent societies; new metrics.
    Tags: multi-agent, social-simulation, evaluation, trust, persuasion, agent-behavior
  • 2603.24167 — Walma: Learning to See Memory Corruption in WebAssembly (PDF)
    cs.CR, cs.LG · 82 · ML-based WebAssembly memory attestation vs adversarial host; concrete security evaluation on CVEs.
    Tags: security, webassembly, memory-corruption, attestation, robustness, systems

AI Paper Insights Briefing

2026-03-29

0) Key Takeaways (read this first)

  • "Perception is the bottleneck" can now be quantified and fixed without retraining: multi-agent context engineering that cross-checks intermediate evidence (not just final answers) substantially improves multimodal math accuracy (M$^3$-ACE).
  • Benchmarks are shifting from "did you draw the right box / answer the question" to "did you recognize structural failure modes": LED reframes document-layout evaluation around error types (missing/merge/split, etc.), revealing that even strong VLMs struggle with fine-grained structural diagnosis.
  • Inference-time stochasticity is not a free uncertainty win: MC Dropout often lowers accuracy (10 of 19 models) and hurts "memory" tasks more than "reasoning" tasks, so uncertainty methods must be matched to architecture and task.
  • Agent security is increasingly about system surfaces (tools, GUIs, permissions), not just text: notification-icon visual backdoors can hijack mobile GUI agents at high ASR (AgentRAE), while internalized authorization trajectories can enforce permission boundaries (Chain-of-Authorization).
  • Closed-loop, environment-grounded training/evaluation is becoming the practical dividing line: EnterpriseLab (tool environments + executable synthesis + trajectory RL) and the financial orchestration benchmark show that architecture and cost control dominate production viability.
  • Claim-level, bias-resistant verification is emerging as a scalable anti-hallucination training signal: MARCH uses an information-asymmetric Checker (blind to the Solver's output) plus a strict per-claim reward, lifting an 8B model's RAG factuality by roughly 20 points on the reported averages.

2) Key Themes (clusters)

Theme: multi-agent evidence/consensus as a robustness primitive

Theme: evaluation is turning diagnostic (error types, contamination sensitivity, executable judges)

Theme: tool/GUI agent security and governance are moving "inside the model"

Theme: plan-before-perceive and adaptive observation for long-horizon video agents

Theme: train-time and decode-time robustness interventions (noise, attention, DP/Byzantine)

3) Technical Synthesis

  • Auditing intermediate representations is converging across modalities: VE lists (math vision), claim QA pairs (RAG factuality), verification questions (medical MCQA), and graph memory (penetration testing) all serve as auditable state that can be cross-checked.
  • Information asymmetry is a recurring anti-bias tool: MARCH blinds the Checker to the Solver's output; CoA forces an explicit authorization trajectory before content; both aim to avoid "saw the answer first" bias.
  • Selective compute is the dominant systems pattern: M$^3$-ACE iterates only on the ~10% of disputed samples; the financial orchestration study shows tiered "inflection points" plus caching/routing; robot safety gates execute only when motion is stable and OOD-safe.
  • Robust statistics are entering RL-for-LLMs: ARE replaces batch-mean normalization with blockwise-median robust estimation; POISE identifies the normalization/validity-masking mechanics of GRPO variants.
  • Prompt/configuration sensitivity is now benchmarked explicitly: LED measures prompt robustness (CV/NR) under P1/P2/P3; inference-time dropout shows architecture-dependent variability; together these suggest that "one prompt / one setting" reporting is insufficient.
  • Decode-time interventions are gaining credibility: AIR substantially lowers the CHAIR hallucination metrics while preserving or improving MM-Vet, echoing training-free fixes such as M$^3$-ACE's context engineering.
  • Environment-grounded evaluation is becoming the gold standard for agents: EnterpriseLab executes trajectories in tool containers; the pentesting workflow grounds memory in observed outputs; the code-review benchmark uses executable tests.
  • Agent security threats are increasingly visual and supply-chain shaped: AgentRAE shows tiny notification icons can be robust triggers; defenses that assume text-only triggers or static prompts are incomplete.
  • Calibration/uncertainty without labels remains hard: MARC improves ECE via consistency verification, but the paper notes that consistency can reward wrong knowledge, underscoring the need for grounded constraints beyond self-consistency.
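
The blockwise-median idea mentioned for ARE can be sketched in a few lines. This is a minimal illustration of robust reward normalization under assumptions, not ARE's actual estimator: the bucketing scheme, the MAD scale, and all names here are hypothetical.

```python
import statistics

def robust_normalize(rewards, n_buckets=4):
    """Median-of-buckets reward normalization (illustrative sketch):
    replace batch mean/std with robust location/scale estimates so a few
    heavy-tailed rewards cannot dominate the advantage signal."""
    # Split rewards into interleaved buckets and take each bucket's mean,
    # then use the median of bucket means as the robust center.
    buckets = [rewards[i::n_buckets] for i in range(n_buckets)]
    center = statistics.median(statistics.mean(b) for b in buckets)
    # Robust scale: median absolute deviation (MAD), floored to avoid /0.
    mad = statistics.median(abs(r - center) for r in rewards)
    scale = max(mad, 1e-8)
    return [(r - center) / scale for r in rewards]
```

With one extreme outlier in an otherwise tight batch (e.g. one reward of 100 among values near 1.0), the inliers stay near zero after normalization while the outlier is pushed far out, instead of the outlier dragging every other advantage estimate with it as batch-mean normalization would.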

4) Top 5 Papers (with "why now")

1) M$^3$-ACE: Rectifying Visual Perception in Multimodal Math Reasoning via Multi-Agentic Context Engineering

  • Decouples visual-evidence extraction from reasoning, with multi-agent VE cross-validation plus Summary/Refine tools.
  • Reports sizable gains on MathVision (e.g., Gemini-3 Pro 85.0% → 89.1%), with larger gains for weaker models (e.g., GPT-5 72.0% → 82.2%).
  • Selective iteration: the refine stage pushes the high-consensus subset to near 90% accuracy while only ~10% of samples enter the loop.
  • Caveats: relies on the availability of several strong multimodal models; heuristic thresholds and the compute/latency tradeoff are under-quantified.
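
The selective-iteration pattern above (only low-consensus samples enter the expensive refine loop) can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: the majority-vote consensus measure, the threshold, and all names are assumptions.

```python
from collections import Counter

def route_by_consensus(samples, answers_per_sample, threshold=0.8):
    """Consensus-gated selective refinement (sketch): each sample comes with
    answers from several independent extractors; samples whose answers agree
    above `threshold` are accepted with the majority answer, the rest are
    routed to an expensive refine stage."""
    accept, refine = [], []
    for sample, answers in zip(samples, answers_per_sample):
        top, count = Counter(answers).most_common(1)[0]
        if count / len(answers) >= threshold:
            accept.append((sample, top))  # high consensus: keep majority answer
        else:
            refine.append(sample)         # disputed: send to refine loop
    return accept, refine
```

The design point is that the gate is cheap (counting votes over already-produced artifacts), so the ~10%-of-samples figure reported above translates directly into bounded extra compute.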

2) MARCH: Multi-Agent Reinforced Self-Check for LLM Hallucination

  • Introduces a Solver–Proposer–Checker setup, blinding the Checker to reduce confirmation bias; trained with dual-trajectory PPO.
  • Reports large factuality gains: RAGTruth/FaithBench average 55.20% → ~75% (+~20).
  • Uses a strict Zero-Tolerance Reward enforcing per-claim correctness (every claim must match).
  • Caveats: verification skews toward numeric/quantitative claims; Proposer reward hacking (shrinking the claim set) is a known risk.
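
The blinded-Checker plus zero-tolerance combination can be sketched as below. This is an assumption-laden illustration, not MARCH's API: the `blind_check` name and the `verify` callback are hypothetical.

```python
def blind_check(claims, evidence_docs, verify):
    """Information-asymmetric claim checking with a zero-tolerance reward
    (sketch). The `verify(claim, evidence_docs)` callback sees only one
    atomized claim plus the retrieved evidence -- never the Solver's full
    draft -- to reduce confirmation bias."""
    verdicts = [verify(claim, evidence_docs) for claim in claims]
    # Zero-tolerance: reward 1 only if every claim is supported; a single
    # unsupported claim (or an empty claim list) yields 0.
    reward = 1.0 if claims and all(verdicts) else 0.0
    return reward, verdicts
```

A toy `verify` could be substring containment over retrieved text; in practice it would be an NLI model or LLM judge. Note the caveat above: a Proposer can game such a reward by emitting fewer claims, so claim coverage needs its own incentive.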

3) AgentRAE: Remote Action Execution through Notification-based Visual Backdoors against Screenshots-based Mobile GUI Agents

  • Demonstrates a practical trigger surface: native notification icons can serve as stealthy backdoor triggers for screenshot-based agents.
  • Two-stage poisoning (contrastive trigger separation + balanced poisoning loss) achieves high ASR (>90% across settings) and scales to 9 targets.
  • Evaluates defenses (fine-pruning, fine-tuning, NAD) and finds ASR remains high afterward.
  • Caveats: evaluation is offline, on two open-source agents/datasets; online timing/interaction effects are untested.
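
One cheap mitigation consistent with this threat model is to blank the notification region before the screenshot reaches the agent. The helper below is a hypothetical sketch (not from the paper); the row-major pixel representation and the fixed `bar_height` are assumptions, and a real pipeline would locate the status bar per device.

```python
def mask_notification_bar(screenshot_rows, bar_height=3, fill=0):
    """Defensive preprocessing sketch: overwrite the top notification/
    status-bar strip of a screenshot so icon-based triggers never reach the
    GUI agent. `screenshot_rows` is a row-major list of pixel rows;
    `bar_height` is the assumed strip height in rows."""
    return [
        [fill] * len(row) if y < bar_height else list(row)
        for y, row in enumerate(screenshot_rows)
    ]
```

Masking trades away legitimate notification content for trigger removal, so it should be paired with an evaluation under icon-trigger scenarios like the one this paper provides.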

4) LED: A Benchmark for Evaluating Layout Error Detection in Document Analysis

  • Defines 8 classes of structural layout errors and builds a synthetic-injection benchmark with 3 hierarchical tasks (document detection → type classification → element classification).
  • Finds the Gemini 2.5 family best and most prompt-stable; GPT models drop sharply on fine-grained tasks.
  • Provides prompt/input-configuration comparisons (image+JSON best; boxes-only weakest).
  • Caveats: synthetic data, an imbalanced error distribution (Missing dominates), and single-source injection modeling may limit generalization.
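
The prompt-stability comparison above suggests a simple reporting habit: never publish a single-prompt accuracy without its spread across prompt variants. The sketch below uses a plain coefficient of variation; the function name and this CV definition are assumptions, not necessarily LED's exact CV/NR metrics.

```python
import statistics

def prompt_robustness(acc_by_prompt):
    """Prompt-robustness summary (sketch): mean accuracy across prompt
    variants plus its coefficient of variation (CV = std / mean), so a
    one-prompt number is never reported alone."""
    mean = statistics.mean(acc_by_prompt)
    stdev = statistics.pstdev(acc_by_prompt)  # population std over variants
    cv = stdev / mean if mean else float("inf")
    return {"mean": mean, "cv": cv}
```

Two models with the same mean but very different CVs are not interchangeable in production, which is exactly the failure mode "one prompt / one setting" reporting hides.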

5) EnterpriseLab: A Full-Stack Platform for developing and deploying agents in Enterprises

  • Integrates an MCP tool environment, executable trajectory synthesis from schemas, and closed-loop training (SFT/DPO/Agentic GRPO).
  • Reports Qwen3-8B with Agentic GRPO approaching GPT-4o on EnterpriseArena execution accuracy (0.43 vs 0.45), with a claimed ~8–10× lower inference cost.
  • Demonstrates adaptation to schema/API changes via incremental trajectories.
  • Caveats: scope is mainly tool/API environments (not GUIs); performance depends on base-model capability and synthesis quality.

5) Practical Next Steps

  • Make intermediate-artifact logging the default: save VE lists / claim lists / tool-call plans and measure disagreement rates; use them to trigger selective retries (as in M$^3$-ACE).
  • Add an information-asymmetric verification path to RAG: implement a Checker that sees only the retrieved documents plus atomized questions (never the draft answer), and track the factuality delta versus standard self-critique.
  • Run a contamination-sensitivity audit before trusting leaderboard deltas: reproduce the router–worker noisy-paraphrase test on your key MCQ benchmarks and report "violation breadth" alongside accuracy.
  • For tool agents, treat permissions as first-class tokens and trajectories: prototype a CoA-style "resource review → identity → decision" output and condition downstream answers/tool calls on that trajectory.
  • Harden GUI agents against visual trigger surfaces: add notification-aware preprocessing (mask/crop the notification region) and evaluate under AgentRAE-style icon-trigger backdoor scenarios.
  • If using MC Dropout for uncertainty, benchmark memory-heavy and reasoning-heavy tasks separately: measure mean + standard deviation under stochastic inference, and don't blindly enable dropout on specialized checkpoints.
  • For long-video agents, measure evidence efficiency, not just accuracy: track frames used / visual tokens / observation turns, and add stall detectors for static repetition and premature stopping.
  • Prefer executable judges wherever possible: for code review or agent actions, shift evaluation to tests or environment-grounded success metrics rather than text similarity.
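
The first next step above needs a concrete disagreement measure over logged artifacts. A minimal sketch, assuming set-valued artifacts and a Jaccard-style definition (both are my choices, not any paper's metric):

```python
def disagreement_rate(artifact_sets):
    """Disagreement over intermediate artifacts (sketch): 1 - Jaccard overlap
    between the artifact sets (VE lists, claim lists, tool-call plans)
    produced by independent runs. A high rate can gate a selective retry."""
    if len(artifact_sets) < 2:
        return 0.0  # nothing to compare against
    union = set().union(*artifact_sets)
    if not union:
        return 0.0  # all runs produced empty artifacts: no disagreement
    inter = set(artifact_sets[0]).intersection(*artifact_sets[1:])
    return 1.0 - len(inter) / len(union)
```

Logging this per sample makes the routing decision auditable: the retry trigger is a number computed from saved state, not a judgment buried in a prompt.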

Generated from per-paper analysis; no external browsing.