AI Paper Daily (2026-03-29)
English version: /paper-news/2026-03-29/
Run statistics
- Candidate papers: 1744
- Selected papers: 30
- Deep reads completed: 30
- Time window (UTC): 2026-03-27T00:00:00Z → 2026-03-28T00:00:00Z (weekend_backlog_unknown, expanded=0)
Paper list used for the summary:
| arXiv ID | Title / Link | Categories | Score | Why selected | Tags |
|---|---|---|---|---|---|
| 2603.23951 | From AI Assistant to AI Scientist: Autonomous Discovery of LLM-RL Algorithms with LLM Agents | cs.CL | 95 | Closed-loop LLM agents discover improved LLM-RL algorithms; strong automation + eval/iteration framework. | LLM-agents, RLHF, policy-optimization, auto-research, evaluation, algorithm-discovery |
| 2603.23007 | AgentRAE: Remote Action Execution through Notification-based Visual Backdoors against Screenshots-based Mobile GUI Agents | cs.CR, cs.AI | 94 | Concrete backdoor for mobile GUI agents via notifications; high-impact agent security threat model. | agent-security, mobile-agents, backdoors, visual-triggers, remote-action-execution, red-teaming |
| 2603.22869 | Chain-of-Authorization: Internalizing Authorization into Large Language Models via Reasoning Trajectories | cs.AI | 92 | Internalizes fine-grained authorization in LLM reasoning; targets data leakage and access-boundary failures. | authorization, access-control, LLM-safety, data-leakage, reasoning-trajectories, security |
| 2603.24477 | Composer 2 Technical Report | cs.SE, cs.LG | 92 | Agentic SWE model + RL in real tool harness; likely strong frontier agent capability signal | agentic-coding, software-engineering, reinforcement-learning, tool-use, long-horizon, frontier-llm |
| 2603.24579 | MARCH: Multi-Agent Reinforced Self-Check for LLM Hallucination | cs.CL | 90 | Multi-agent asymmetry to reduce LLM-judge confirmation bias for RAG hallucination checking | hallucination, RAG, LLM-judge, multi-agent, verification, reliability |
| 2603.21636 | Silicon Bureaucracy and AI Test-Oriented Education: Contamination Sensitivity and Score Confidence in LLM Benchmarks | cs.AI, cs.CL | 90 | Audit framework for benchmark contamination sensitivity & score confidence; key for LLM eval integrity | LLM-evaluation, benchmarking, data-contamination, leakage, audit, measurement |
| 2603.24221 | Environment-Grounded Multi-Agent Workflow for Autonomous Penetration Testing | cs.RO, cs.AI | 90 | Environment-grounded multi-agent LLM pentesting for robots; concrete security workflow + memory graph. | agent-security, penetration-testing, cybersecurity, robotics, multi-agent, tool-use |
| 2603.23231 | PERMA: Benchmarking Personalized Memory Agents via Event-Driven Preference and Realistic Task Environments | cs.AI | 88 | Benchmark for personalized memory agents with evolving preferences; more realistic than pure retrieval tests. | agents, memory, personalization, evaluation, benchmarks, long-term-consistency |
| 2603.24058 | Mitigating Object Hallucinations in LVLMs via Attention Imbalance Rectification | cs.CV, cs.AI | 88 | Targets LVLM object hallucination via attention-imbalance rectification; reliability for high-stakes vision. | hallucinations, vision-language, reliability, attention, calibration, safety |
| 2603.21630 | EnterpriseLab: A Full-Stack Platform for developing and deploying agents in Enterprises | cs.AI | 86 | Full-stack closed-loop platform for enterprise agents: tools+data synthesis+training+eval in one. | agents, enterprise, tool-use, MCP, data-synthesis, evaluation, deployment |
| 2603.23129 | Polaris: A Gödel Agent Framework for Small Language Models through Experience-Abstracted Policy Repair | cs.LG | 86 | Gödel-style self-improving agent for small models via auditable policy patches; relevant to safe autonomy. | agents, self-improvement, policy-repair, small-models, auditing, reliability |
| 2603.22862 | The Evolution of Tool Use in LLM Agents: From Single-Tool Call to Multi-Tool Orchestration | cs.SE, cs.CL | 86 | Comprehensive review of multi-tool LLM agent orchestration incl. safety/cost/verifiability constraints | llm-agents, tool-use, orchestration, survey, safety, verification |
| 2603.08369 | M$^3$-ACE: Rectifying Visual Perception in Multimodal Math Reasoning via Multi-Agentic Context Engineering | cs.AI | 86 | Multi-agent context engineering to correct perception errors in multimodal math reasoning | multimodal, VLM, math-reasoning, multi-agent, perception, robustness |
| 2603.23448 | Code Review Agent Benchmark | cs.SE, cs.AI | 86 | New benchmark/dataset for code review agents; timely for agentic SE quality assurance. | agents, benchmark, code-review, software-engineering, evaluation, datasets |
| 2603.24481 | Multi-Agent Reasoning with Consistency Verification Improves Uncertainty Calibration in Medical MCQA | cs.AI, cs.CL, cs.LG | 86 | Multi-agent verification + weighted fusion improves uncertainty calibration for medical MCQA | uncertainty, calibration, verification, multi-agent, medical, reliability |
| 2603.19195 | How Auditory Knowledge in LLM Backbones Shapes Audio Language Models: A Holistic Evaluation | eess.AS, cs.CL, cs.SD | 86 | Holistic eval of LLM backbones' auditory knowledge + new benchmark (AKB-2000) for audio LMs. | audio-language-models, LLM-backbones, evaluation, benchmark, probing, multimodal |
| 2603.21475 | Unified-MAS: Universally Generating Domain-Specific Nodes for Empowering Automatic Multi-Agent Systems | cs.AI | 86 | Decouples agent node creation from orchestration; targets knowledge-intensive MAS generation bottleneck. | multi-agent, agent-architecture, orchestration, domain-experts, automation |
| 2603.24034 | From Oracle to Noisy Context: Mitigating Contextual Exposure Bias in Speech-LLMs | cs.CL, cs.AI | 86 | Mitigates contextual exposure bias in Speech-LLMs using noisy history + dropout + DPO on failures. | speech-LLM, robustness, DPO, distribution-shift, evaluation, alignment |
| 2603.23472 | Byzantine-Robust and Differentially Private Federated Optimization under Weaker Assumptions | cs.LG, cs.CR, math.OC | 84 | Unified DP + Byzantine-robust federated optimization with weaker assumptions and guarantees. | federated-learning, differential-privacy, byzantine-robustness, secure-ml, optimization |
| 2603.22651 | Benchmarking Multi-Agent LLM Architectures for Financial Document Processing: A Comparative Study of Orchestration Patterns, Cost-Accuracy Tradeoffs and Production Scaling Strategies | cs.AI, cs.CL, cs.LG | 84 | Large-scale benchmark of multi-agent orchestration patterns with cost/latency/accuracy tradeoffs. | multi-agent, orchestration, benchmark, evaluation, LLMs, cost-latency, document-IE |
| 2603.15080 | Open Biomedical Knowledge Graphs at Scale: Construction, Federation, and AI Agent Access with Samyama Graph Database | cs.DB, cs.AI, q-bio.QM | 84 | Open biomedical KGs + federation + explicit AI-agent access layer; reusable infra at scale | knowledge-graphs, agents, tool-use, data-infrastructure, biomedicine, RAG |
| 2603.22999 | PaperVoyager: Building Interactive Web with Visual Language Models | cs.CL | 84 | Benchmark + agent that turns papers into executable interactive web systems; strong tool-use/document agent angle. | agents, tool-use, document-understanding, benchmark, evaluation, web-synthesis |
| 2603.23983 | SafeFlow: Real-Time Text-Driven Humanoid Whole-Body Control via Physics-Guided Rectified Flow and Selective Safety Gating | cs.RO, cs.AI, eess.SY | 84 | Text-driven humanoid control with explicit safety gating and physics guidance; addresses OOD unsafe motions. | robot-safety, agents, humanoids, safety-gating, OOD-robustness, control |
| 2603.24558 | LensWalk: Agentic Video Understanding by Planning How You See in Videos | cs.CV, cs.AI | 83 | Agentic video understanding with reason-plan-observe control of perception; likely reusable framework. | agentic, video-understanding, planning, active-perception, VLM-tools, efficiency |
| 2603.17265 | LED: A Benchmark for Evaluating Layout Error Detection in Document Analysis | cs.CV, cs.CL | 82 | LED benchmark targets structural layout errors beyond IoU; reusable eval for doc/LMM systems. | benchmark, evaluation, multimodal, document-ai, hallucination, robustness |
| 2603.22918 | EVA: Efficient Reinforcement Learning for End-to-End Video Agent | cs.CV, cs.AI, cs.CL | 82 | RL-based planning-before-perception for long videos; efficiency gains for multimodal agents. | video-agents, reinforcement-learning, planning, multimodal, efficiency, long-context |
| 2603.21574 | Adaptive Robust Estimator for Multi-Agent Reinforcement Learning | cs.AI | 82 | Robust MARL for collaborative reasoning; tackles noisy/heavy-tailed rewards and structured critique loops | multi-agent, reinforcement-learning, robust-estimation, llm-reasoning, credit-assignment |
| 2603.17811 | Dropout Robustness and Cognitive Profiling of Transformer Models via Stochastic Inference | cs.LG, cs.AI | 82 | Systematic MC-dropout reliability study across 19 transformers; links variability to reasoning/memory | uncertainty, MC-dropout, reliability, transformers, stochastic-inference, evaluation |
| 2603.23406 | Beyond Preset Identities: How Agents Form Stances and Boundaries in Generative Societies | cs.AI, cs.CL, cs.HC | 82 | Measures stance formation/identity negotiation in generative multi-agent societies; new metrics. | multi-agent, social-simulation, evaluation, trust, persuasion, agent-behavior |
| 2603.24167 | Walma: Learning to See Memory Corruption in WebAssembly | cs.CR, cs.LG | 82 | ML-based WebAssembly memory attestation vs adversarial host; concrete security evaluation on CVEs | security, webassembly, memory-corruption, attestation, robustness, systems |
AI Paper Insight Briefing
2026-03-29
0) Key takeaways (read this first)
- "Perception is the bottleneck" can now be quantified and fixed without retraining: cross-checking intermediate evidence (not just final answers) via multi-agent context engineering substantially improves multimodal math accuracy (M$^3$-ACE).
- Benchmarks are shifting from "did you box/answer correctly" to "did you detect structural failure modes": LED reframes document layout evaluation around error types (missing, merge, split, etc.), revealing that strong VLMs still struggle with fine-grained structural diagnosis.
- Inference-time stochasticity is not a free uncertainty win: MC Dropout often lowers accuracy (in 10 of 19 models) and hurts "memory" tasks more than "reasoning" tasks, so uncertainty methods must be matched to architecture and task.
- Agent security is increasingly about system surfaces (tools, GUIs, permissions), not just text: a notification-icon visual backdoor can hijack mobile GUI agents at high ASR (AgentRAE), while internalized authorization trajectories can enforce permission boundaries (Chain-of-Authorization).
- Closed-loop, environment-grounded training and evaluation is becoming the practical dividing line: EnterpriseLab (tool environments + executable synthesis + trajectory RL) and the financial orchestration benchmark both show that architecture and cost control dominate production viability.
- Claim-level, bias-resistant verification is emerging as a scalable anti-hallucination training signal: MARCH uses an information-asymmetric Checker (blind to the Solver's output) plus a strict per-claim reward, raising an 8B model's RAG factuality by roughly 20 points on the reported average.
1) Key themes (clustered)
Theme: multi-agent evidence/consensus as a robustness primitive
- Why it matters: many failures come from models locking onto wrong intermediate state too early (visual evidence, critiques, rewards). Cross-agent disagreement signals and structured reconciliation let systems spend extra compute selectively where it matters (a minimal gating sketch follows this theme).
- Representative papers:
  - M$^3$-ACE: Rectifying Visual Perception in Multimodal Math Reasoning via Multi-Agentic Context Engineering
  - Adaptive Robust Estimator for Multi-Agent Reinforcement Learning
  - Multi-Agent Reasoning with Consistency Verification Improves Uncertainty Calibration in Medical MCQA
- Common methods:
  - Separate intermediate artifacts (VE lists; critique diffs; verification questions) from final answers and treat them as first-class objects.
  - Use disagreement/consensus (conflict ratios, majority voting, inconsistency scores) to trigger extra iterations or weighted fusion.
  - Add structure/tools to agent interaction (Summary/Refine tools; staged answer-critique-revise; verification protocols).
- Open questions / failure modes:
  - Heuristic thresholds and gating policies (e.g., conflict ratio > 0.2) may be brittle across domains and models.
  - Agreement ≠ correctness: verification can reward coherent but wrong reasoning (noted for medical MCQA).
  - Compute/latency overhead and scaling behavior under real-time constraints are often under-reported.
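A minimal sketch of the shared gating pattern: run several agents once, compute a conflict ratio over their extracted evidence, and spend refine iterations only when disagreement exceeds a threshold. All names (`conflict_ratio`, `refine`) and the draft format are illustrative assumptions, not any paper's actual interface; only the 0.2 threshold echoes the heuristic mentioned above.

```python
from collections import Counter

def conflict_ratio(evidence_lists):
    """Fraction of distinct evidence items NOT endorsed by a majority of agents."""
    n_agents = len(evidence_lists)
    counts = Counter(item for ev in evidence_lists for item in set(ev))
    if not counts:
        return 0.0
    majority = sum(1 for c in counts.values() if c > n_agents / 2)
    return 1.0 - majority / len(counts)

def solve_with_selective_refine(question, agents, refine, threshold=0.2, max_rounds=2):
    """Run all agents once; spend extra refine rounds only on high-conflict cases."""
    drafts = [agent(question) for agent in agents]  # each draft: (evidence_list, answer)
    for _ in range(max_rounds):
        if conflict_ratio([ev for ev, _ in drafts]) <= threshold:
            break  # high consensus: accept without extra compute
        drafts = [refine(question, d, drafts) for d in drafts]  # reconcile evidence across agents
    answers = [ans for _, ans in drafts]
    return Counter(answers).most_common(1)[0][0]  # majority vote on final answers
```

The key design choice is that gating runs on the intermediate evidence, not the final answers, so agreement on a wrong answer reached via conflicting evidence still triggers another round.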
Theme: evaluation is becoming diagnostic (error types, contamination sensitivity, executable judges)
- Why it matters: aggregate scores hide where systems fail (structural layout errors, contamination sensitivity, unverifiable code reviews). New benchmarks aim to expose failure modes with more actionable signals.
- Representative papers:
  - LED: A Benchmark for Evaluating Layout Error Detection in Document Analysis
  - Silicon Bureaucracy and AI Test-Oriented Education: Contamination Sensitivity and Score Confidence in LLM Benchmarks
  - Code Review Agent Benchmark
- Common methods:
  - Replace single metrics with typed error taxonomies and tiered tasks (LED T1–T3).
  - Stress-test score confidence via controlled perturbation/aggregation (router-worker noisy-paraphrase audits).
  - Use executable judges (turn review comments into tests that must fail before the fix and pass after it; a sketch of this protocol follows below).
- Open questions / failure modes:
  - Synthetic construction and imbalance (Missing dominates in LED) may skew conclusions.
  - Contamination audits are so far demonstrated mostly on small samples (n=100) and MCQ-only settings.
  - Test-based judging depends on environment reconstruction and coding agents, conflating capabilities.
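A minimal sketch of the fail-before/pass-after executable-judge protocol, assuming two repo snapshots (pre-fix and post-fix) and a judge-generated pytest file; all function names here are hypothetical, not the benchmark's actual harness.

```python
import pathlib
import subprocess

def run_pytest(repo_dir: str, test_code: str) -> bool:
    """Write the judge-generated test into the repo and return True iff pytest passes."""
    test_file = pathlib.Path(repo_dir) / "test_review_claim.py"
    test_file.write_text(test_code)
    try:
        result = subprocess.run(
            ["python", "-m", "pytest", str(test_file), "-q"],
            cwd=repo_dir, capture_output=True, timeout=120,
        )
        return result.returncode == 0
    finally:
        test_file.unlink(missing_ok=True)

def executable_judge(pre_fix_repo: str, post_fix_repo: str, test_code: str) -> bool:
    """A review comment counts as validated only if its test fails on the
    pre-fix snapshot AND passes on the post-fix snapshot."""
    fails_before = not run_pytest(pre_fix_repo, test_code)
    passes_after = run_pytest(post_fix_repo, test_code)
    return fails_before and passes_after
```

The two-sided check is what distinguishes this from text similarity: a vacuous test that always passes fails the "fails before" condition, and an irrelevant test fails the "passes after" condition.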
Theme: tool/GUI agent security and governance is moving "inside the model"
- Why it matters: once agents act in real systems, the attack surface includes UI pixels, tool schemas, and permission boundaries. Defenses need causal enforcement, not just prompts (a toy harness-level sketch of trajectory-gated tool calls follows this theme).
- Representative papers:
  - AgentRAE: Remote Action Execution through Notification-based Visual Backdoors against Screenshots-based Mobile GUI Agents
  - Chain-of-Authorization: Internalizing Authorization into Large Language Models via Reasoning Trajectories
  - The Evolution of Tool Use in LLM Agents: From Single-Tool Call to Multi-Tool Orchestration
- Common methods:
  - Formalize system-level threat models (supply-chain poisoning; permission mismatch; transactional semantics).
  - Enforce information barriers or causal steps (walk an authorization trajectory before answering; keep the Checker blind to the Solver).
  - Evaluate under adaptive attack/defense (PAIR jailbreaks; pruning/fine-tuning defenses against backdoors).
- Open questions / failure modes:
  - Backdoor evaluations are mostly offline; online interactive agent dynamics remain untested.
  - CoA requires fine-tuning plus permission-token engineering; real permission taxonomies are complex and dynamic.
  - Tool-orchestration security needs transactional semantics and replayable audits, but no unified standard has emerged.
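A toy sketch of the CoA-style ordering constraint enforced in the harness rather than inside the model: a "resource review → identity → decision" trajectory must be produced before any tool call is allowed. The paper internalizes this via fine-tuned reasoning trajectories; here the ACL, the naive substring resource matching, and every name are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class AuthTrajectory:
    resources: list   # resources the request would touch (resource review)
    identity: str     # caller identity / role
    decision: str     # "allow" or "deny", emitted before any content

def authorize(request: str, caller_role: str, acl: dict) -> AuthTrajectory:
    """Walk the 'resource review -> identity -> decision' steps BEFORE answering."""
    touched = [r for r in acl if r in request]  # naive resource extraction for the sketch
    allowed = all(caller_role in acl[r] for r in touched)
    return AuthTrajectory(touched, caller_role, "allow" if allowed else "deny")

def gated_tool_call(request, caller_role, acl, tool):
    traj = authorize(request, caller_role, acl)
    if traj.decision == "deny":
        return f"denied: {traj.resources} not permitted for role {traj.identity}"
    return tool(request)  # downstream call is conditioned on the trajectory

# Hypothetical ACL: 'salary_table' readable only by role 'hr'
acl = {"salary_table": {"hr"}, "wiki": {"hr", "engineer"}}
print(gated_tool_call("read salary_table", "engineer", acl, lambda r: "ok"))  # -> denied
```

The point of the ordering is causal: the decision exists before any content is generated, so an audit can reject outputs whose trajectory is missing or denying.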
Theme: plan-then-perceive and adaptive observation for long-horizon video agents
- Why it matters: long videos overwhelm fixed-context VLM pipelines; agents must decide what to look at and how densely to sample, to control cost while preserving evidence (a minimal plan-observe loop sketch follows this theme).
- Representative papers:
  - EVA: Efficient Reinforcement Learning for End-to-End Video Agent
  - LensWalk: Agentic Video Understanding by Planning How You See in Videos
- Common methods:
  - Iterative plan-observe loops with parameterized observation actions (time window, frame count, zoom; tool selection).
  - Staged training or modular toolkits (SFT→KTO→GRPO; Scan/Focus/Stitch tools plus timestamp anchors).
  - Explicit efficiency objectives (visual-token budgets; fewer frames; avoiding re-preprocessing).
- Open questions / failure modes:
  - Reward hacking and sampling pathologies persist (EVA mitigates but does not eliminate them).
  - Planner stalls (static repetition) and premature conclusions (LensWalk failure modes).
  - Dependence on tool interfaces and observer quality; generalization to new tools/modalities is unclear.
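A minimal sketch of the shared loop under stated assumptions: `planner`, `observer`, and `answerer` are hypothetical callables standing in for the papers' trained modules; the visual-token budget and the stall detector mirror the efficiency objectives and failure modes listed above.

```python
from dataclasses import dataclass

@dataclass
class Observation:
    start_s: float   # window start (seconds)
    end_s: float     # window end
    n_frames: int    # frames to sample in this window

def video_agent(question, planner, observer, answerer, budget_tokens=8192, max_steps=6):
    """Iterative plan-observe loop: the planner decides where and how densely to look,
    the observer returns captions for that window, and we stop on answer or budget."""
    history, spent = [], 0
    for _ in range(max_steps):
        action = planner(question, history)  # -> Observation or ("answer", text)
        if isinstance(action, tuple) and action[0] == "answer":
            return action[1]
        captions, cost = observer(action)    # cost = visual tokens actually consumed
        if spent + cost > budget_tokens:
            break                            # respect the visual-token budget
        spent += cost
        if history and captions == history[-1][1]:
            break                            # stall detector: repeated static observation
        history.append((action, captions))
    return answerer(question, history)       # forced answer at budget/step limit
```

Tracking `spent` alongside accuracy is what makes "evidence efficiency" measurable, as recommended in the next-steps section.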
Theme: train-time and decode-time robustness interventions (noise, attention, DP/Byzantine)
- Why it matters: robustness failures often come from distribution shift (noisy context), attention pathologies (hallucination), or adversarial participants (federated learning). Lightweight interventions can have high leverage (a loose sketch of the clipping-plus-momentum aggregation ingredients follows this theme).
- Representative papers:
  - From Oracle to Noisy Context: Mitigating Contextual Exposure Bias in Speech-LLMs
  - Mitigating Object Hallucinations in LVLMs via Attention Imbalance Rectification
  - Byzantine-Robust and Differentially Private Federated Optimization under Weaker Assumptions
- Common methods:
  - Train on realistic noise (teacher-ASR hypotheses) + regularize the dependence (context dropout) + preference alignment (DPO).
  - Apply head-specific attention interventions at decode time (AIR), reducing hallucination without retraining.
  - Combine robust aggregation + clipping + momentum + error feedback, with high-probability guarantees (Byz-Clip21-SGD2M).
- Open questions / failure modes:
  - Hyperparameter sensitivity (AIR's λ/β tradeoff; CoA learning-rate sensitivity; DPO scaling γ).
  - Teacher bias in noise modeling (single ASR teacher) and limited scenario coverage (no overlapping speakers).
  - Theory-practice gaps: constraint/hyperparameter restrictions and unproven variants (per-sample clipping).
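A loose numpy sketch of the clipping + Gaussian noise + server-momentum ingredients. To be clear about assumptions: this is NOT the paper's Byz-Clip21-SGD2M (which adds error feedback and carries high-probability guarantees); the clipped-mean rule, noise scaling, and every parameter here are illustrative.

```python
import numpy as np

def clipped_mean(updates: np.ndarray, tau: float) -> np.ndarray:
    """Clip each client update to norm tau, then average. Clipping bounds both
    Byzantine outliers and per-client sensitivity for the DP noise below."""
    norms = np.linalg.norm(updates, axis=1, keepdims=True)
    clipped = updates * np.minimum(1.0, tau / np.maximum(norms, 1e-12))
    return clipped.mean(axis=0)

def dp_byzantine_step(w, client_grads, tau=1.0, lr=0.1, sigma=0.5,
                      momentum=0.9, m=None, rng=None):
    """One server step: clipped-mean aggregation + Gaussian DP noise + server momentum."""
    rng = np.random.default_rng() if rng is None else rng
    grads = np.asarray(client_grads)
    agg = clipped_mean(grads, tau)
    agg = agg + rng.normal(0.0, sigma * tau / len(grads), size=agg.shape)  # DP noise
    m = agg if m is None else momentum * m + (1 - momentum) * agg          # smooths noise
    return w - lr * m, m
```

The intuition the sketch captures: clipping gives a single knob (tau) that simultaneously limits what a Byzantine client can inject and calibrates the DP noise, while momentum averages that noise out over steps.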
2) Technical synthesis
- Intermediate-representation auditing is converging across modalities: VE lists (visual math), claim QA pairs (RAG factuality), verification questions (medical MCQA), and graph memory (penetration testing) all serve as auditable state that can be cross-checked.
- Information asymmetry is a recurring de-biasing tool: MARCH keeps the Checker blind to the Solver's output; CoA forces an explicit authorization trajectory before content; both aim to avoid "saw the answer first" bias.
- Selective compute is the dominant systems pattern: M$^3$-ACE iterates only on the ~10% of contested samples; the financial orchestration study shows tiered "inflection points" plus caching/routing; the robot safety gate engages only when stability/OOD safety demands it.
- Robust statistics is entering RL-for-LLMs: ARE replaces batch-mean normalization with blockwise-median robust estimation; POISE probes the normalization/validity-masking mechanics of GRPO variants (a median-of-means flavored sketch follows this list).
- Prompt/configuration sensitivity is now explicitly benchmarked: LED measures prompt robustness (CV/NR) under P1/P2/P3; inference-time dropout shows architecture-dependent variability; together these argue that "one prompt, one setting" reporting is insufficient.
- Decode-time interventions are gaining credibility: AIR substantially reduces CHAIR hallucination metrics while holding or improving MM-Vet; this echoes other training-free fixes such as M$^3$-ACE's context engineering.
- Environment-grounded evaluation is becoming the gold standard for agents: EnterpriseLab executes trajectories in tool containers; the pentesting workflow grounds memory in observed outputs; the code review benchmark uses executable tests.
- Agent-security threats are increasingly visual and supply-chain shaped: AgentRAE shows a tiny notification icon can be a robust trigger; defenses that assume text-only triggers or static prompts are incomplete.
- Calibration/uncertainty without labels remains hard: MARC improves ECE via consistency verification, but the paper notes that consistency can reward wrong knowledge, underscoring the need for grounding beyond self-consistency.
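A median-of-means flavored sketch of the robust-normalization idea referenced above, under assumptions: it mirrors the spirit of replacing batch-mean/std baselines with blockwise medians for heavy-tailed rewards, not ARE's exact estimator.

```python
import numpy as np

def blockwise_median_normalize(rewards, n_blocks=4, rng=None):
    """Normalize rewards with median-of-block statistics instead of batch mean/std,
    so a few heavy-tailed or corrupted rewards cannot drag the baseline."""
    rng = np.random.default_rng() if rng is None else rng
    rewards = np.asarray(rewards, dtype=float)
    blocks = np.array_split(rng.permutation(rewards), n_blocks)
    center = np.median([b.mean() for b in blocks])         # median-of-means location
    scale = np.median([b.std() for b in blocks]) + 1e-8    # robust scale estimate
    return (rewards - center) / scale
```

With a plain batch mean, one reward of 10^6 in a batch of near-zero rewards shifts every advantage; with the median of four block means, at most one block is corrupted and the median ignores it.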
3) Top 5 papers (with "why now")
1) M$^3$-ACE: Rectifying Visual Perception in Multimodal Math Reasoning via Multi-Agentic Context Engineering
- Decouples visual-evidence extraction from reasoning and cross-validates multi-agent VE lists with Summary/Refine tools.
- Reports sizable gains on MathVision (e.g., Gemini-3 Pro 85.0% → 89.1%), with larger gains for weaker models (e.g., GPT-5 72.0% → 82.2%).
- Selective iteration: the refine stage pushes the high-consensus subset to near 90% accuracy while only ~10% of samples enter the loop.
- Caveats: depends on the availability of several strong multimodal models; heuristic thresholds and compute/latency tradeoffs are under-quantified.
2) MARCH: Multi-Agent Reinforced Self-Check for LLM Hallucination
- Introduces a Solver-Proposer-Checker setup with a blind Checker to reduce confirmation bias; trained via dual-trajectory PPO.
- Reports large factuality gains: RAGTruth/FaithBench average 55.20% → ~75% (+~20 points).
- Enforces per-claim correctness with a strict Zero-Tolerance Reward (every claim must match; a minimal sketch follows this entry).
- Caveats: verification skews toward numeric/quantitative claims; Proposer reward hacking (shrinking the claim set) is a known risk.
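A minimal sketch of an all-or-nothing claim reward, assuming a `verify` callable that stands in for the blind Checker; the `min_claims` guard against claim-set shrinking is an illustrative addition, not MARCH's actual mechanism.

```python
def zero_tolerance_reward(claims, verify, min_claims=1):
    """All-or-nothing: reward 1.0 only if EVERY extracted claim verifies against
    the retrieved evidence; a single failed claim zeroes the reward."""
    if len(claims) < min_claims:
        return 0.0  # crude guard against gaming the reward by emitting fewer claims
    return 1.0 if all(verify(claim) for claim in claims) else 0.0
```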
3) AgentRAE: Remote Action Execution through Notification-based Visual Backdoors against Screenshots-based Mobile GUI Agents
- Demonstrates a practical trigger surface: native notification icons can serve as stealthy backdoor triggers for screenshot-based agents.
- Two-stage poisoning (contrastive trigger separation + balanced poisoning loss) achieves high ASR (>90% across settings) and scales to 9 targets.
- Evaluates defenses (fine-pruning, fine-tuning, NAD) and finds ASR remains high after defense.
- Caveats: offline evaluation on two open-source agents/datasets; online timing/interaction effects are untested.
4) LED: A Benchmark for Evaluating Layout Error Detection in Document Analysis
- Defines 8 structural layout error types and builds a synthetic-injection benchmark with 3 tiered tasks (document detection → type classification → element classification).
- Finds the Gemini 2.5 family strongest and most prompt-stable; GPT models degrade sharply on fine-grained tasks.
- Provides prompt/input-configuration comparisons (image+JSON best; boxes-only weakest).
- Caveats: synthetic construction, imbalanced error distribution (Missing dominates), and single-source injection modeling may limit generalization.
5) EnterpriseLab: A Full-Stack Platform for developing and deploying agents in Enterprises
- Integrates an MCP tool environment, executable trajectory synthesis from schemas, and closed-loop training (SFT/DPO/Agentic GRPO).
- Reports Qwen3-8B with Agentic GRPO approaching GPT-4o on EnterpriseArena execution accuracy (0.43 vs 0.45), with a claimed ~8-10× inference-cost reduction.
- Demonstrates adaptation via incremental trajectories after schema/API changes.
- Caveats: scope is mostly tool/API environments (not GUIs); performance depends on base-model capability and synthesis quality.
4) Practical next steps
- Make intermediate-artifact logging the default: save VE lists / claim lists / tool-call plans and measure disagreement rates; use them to trigger selective retries (as in M$^3$-ACE).
- Add an information-asymmetric verification path to RAG: implement a Checker that sees only the retrieved documents plus atomized questions (never the draft answer), and track the factuality delta against standard self-critique.
- Run contamination-sensitivity audits before trusting leaderboard deltas: reproduce the router-worker noisy-paraphrase test on your key MCQ benchmarks and report "violation breadth" alongside accuracy.
- For tool agents, treat permissions as first-class tokens plus trajectories: prototype a CoA-style "resource review → identity → decision" output and force downstream answers/tool calls to condition on that trajectory.
- Harden GUI agents against visual trigger surfaces: add notification-aware preprocessing (mask/crop notification regions) and evaluate under AgentRAE-style icon-trigger backdoor scenarios.
- If you use MC Dropout for uncertainty, benchmark memory-heavy and reasoning-heavy tasks separately: measure mean plus standard deviation under stochastic inference, and never blindly enable dropout on specialized checkpoints (a small sketch follows this list).
- For long-video agents, measure "evidence efficiency", not just accuracy: track frames used / visual tokens / observation turns, and add stall detectors for static repetition and premature stopping.
- Prefer executable judges wherever possible: for code review or agent actions, convert evaluation into tests or environment-grounded success metrics rather than text similarity.
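A small PyTorch sketch of the mean-plus-std protocol for MC Dropout, under assumptions: it targets a classifier-style head (for LLM MCQA you would score answer options instead), and `model.train()` is used only to keep dropout active, which also affects BatchNorm layers if the model has any.

```python
import torch

@torch.no_grad()
def mc_dropout_eval(model, inputs, n_samples=20):
    """Evaluate with dropout LEFT ON: report the predictive mean and the spread
    across stochastic passes, instead of a single deterministic run."""
    model.train()  # keeps dropout active; note this also switches BatchNorm behavior
    probs = torch.stack([model(inputs).softmax(-1) for _ in range(n_samples)])
    mean_p = probs.mean(0)             # predictive mean over stochastic passes
    std_p = probs.std(0)               # disagreement across passes = uncertainty proxy
    return mean_p.argmax(-1), std_p.max(-1).values
```

Reporting accuracy from `mean_p` together with the per-example spread, split by memory-heavy versus reasoning-heavy tasks, is what surfaces the architecture- and task-dependent degradation the dropout study describes.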
Generated from per-paper analyses; no external browsing.
