AI Paper Daily (2026-03-29)
English version: /paper-news/2026-03-29/
Run statistics
- Candidate papers: 1744
- Selected papers: 30
- Deep reads completed: 30
- Time window (UTC): 2026-03-27T00:00:00Z → 2026-03-28T00:00:00Z (weekend_backlog_unknown, expanded=0)
Paper list used for the summary:
| arXiv ID | Title / Link | Categories | Score | Why selected | Tags |
|---|---|---|---|---|---|
| 2603.23951 | From AI Assistant to AI Scientist: Autonomous Discovery of LLM-RL Algorithms with LLM Agents | cs.CL | 95 | Closed-loop LLM agents discover improved LLM-RL algorithms; strong automation + eval/iteration framework. | LLM-agents, RLHF, policy-optimization, auto-research, evaluation, algorithm-discovery |
| 2603.23007 | AgentRAE: Remote Action Execution through Notification-based Visual Backdoors against Screenshots-based Mobile GUI Agents | cs.CR, cs.AI | 94 | Concrete backdoor for mobile GUI agents via notifications; high-impact agent security threat model. | agent-security, mobile-agents, backdoors, visual-triggers, remote-action-execution, red-teaming |
| 2603.22869 | Chain-of-Authorization: Internalizing Authorization into Large Language Models via Reasoning Trajectories | cs.AI | 92 | Internalizes fine-grained authorization in LLM reasoning; targets data leakage and access-boundary failures. | authorization, access-control, LLM-safety, data-leakage, reasoning-trajectories, security |
| 2603.24477 | Composer 2 Technical Report | cs.SE, cs.LG | 92 | Agentic SWE model + RL in real tool harness; likely strong frontier agent capability signal | agentic-coding, software-engineering, reinforcement-learning, tool-use, long-horizon, frontier-llm |
| 2603.24579 | MARCH: Multi-Agent Reinforced Self-Check for LLM Hallucination | cs.CL | 90 | Multi-agent asymmetry to reduce LLM-judge confirmation bias for RAG hallucination checking | hallucination, RAG, LLM-judge, multi-agent, verification, reliability |
| 2603.21636 | Silicon Bureaucracy and AI Test-Oriented Education: Contamination Sensitivity and Score Confidence in LLM Benchmarks | cs.AI, cs.CL | 90 | Audit framework for benchmark contamination sensitivity & score confidence; key for LLM eval integrity | LLM-evaluation, benchmarking, data-contamination, leakage, audit, measurement |
| 2603.24221 | Environment-Grounded Multi-Agent Workflow for Autonomous Penetration Testing | cs.RO, cs.AI | 90 | Environment-grounded multi-agent LLM pentesting for robots; concrete security workflow + memory graph. | agent-security, penetration-testing, cybersecurity, robotics, multi-agent, tool-use |
| 2603.23231 | PERMA: Benchmarking Personalized Memory Agents via Event-Driven Preference and Realistic Task Environments | cs.AI | 88 | Benchmark for personalized memory agents with evolving preferences; more realistic than pure retrieval tests. | agents, memory, personalization, evaluation, benchmarks, long-term-consistency |
| 2603.24058 | Mitigating Object Hallucinations in LVLMs via Attention Imbalance Rectification | cs.CV, cs.AI | 88 | Targets LVLM object hallucination via attention-imbalance rectification; reliability for high-stakes vision. | hallucinations, vision-language, reliability, attention, calibration, safety |
| 2603.21630 | EnterpriseLab: A Full-Stack Platform for developing and deploying agents in Enterprises | cs.AI | 86 | Full-stack closed-loop platform for enterprise agents: tools+data synthesis+training+eval in one. | agents, enterprise, tool-use, MCP, data-synthesis, evaluation, deployment |
| 2603.23129 | Polaris: A Gödel Agent Framework for Small Language Models through Experience-Abstracted Policy Repair | cs.LG | 86 | Gödel-style self-improving agent for small models via auditable policy patches; relevant to safe autonomy. | agents, self-improvement, policy-repair, small-models, auditing, reliability |
| 2603.22862 | The Evolution of Tool Use in LLM Agents: From Single-Tool Call to Multi-Tool Orchestration | cs.SE, cs.CL | 86 | Comprehensive review of multi-tool LLM agent orchestration incl. safety/cost/verifiability constraints | llm-agents, tool-use, orchestration, survey, safety, verification |
| 2603.08369 | M$^3$-ACE: Rectifying Visual Perception in Multimodal Math Reasoning via Multi-Agentic Context Engineering | cs.AI | 86 | Multi-agent context engineering to correct perception errors in multimodal math reasoning | multimodal, VLM, math-reasoning, multi-agent, perception, robustness |
| 2603.23448 | Code Review Agent Benchmark | cs.SE, cs.AI | 86 | New benchmark/dataset for code review agents; timely for agentic SE quality assurance. | agents, benchmark, code-review, software-engineering, evaluation, datasets |
| 2603.24481 | Multi-Agent Reasoning with Consistency Verification Improves Uncertainty Calibration in Medical MCQA | cs.AI, cs.CL, cs.LG | 86 | Multi-agent verification + weighted fusion improves uncertainty calibration for medical MCQA | uncertainty, calibration, verification, multi-agent, medical, reliability |
| 2603.19195 | How Auditory Knowledge in LLM Backbones Shapes Audio Language Models: A Holistic Evaluation | eess.AS, cs.CL, cs.SD | 86 | Holistic eval of LLM backbones' auditory knowledge + new benchmark (AKB-2000) for audio LMs. | audio-language-models, LLM-backbones, evaluation, benchmark, probing, multimodal |
| 2603.21475 | Unified-MAS: Universally Generating Domain-Specific Nodes for Empowering Automatic Multi-Agent Systems | cs.AI | 86 | Decouples agent node creation from orchestration; targets knowledge-intensive MAS generation bottleneck. | multi-agent, agent-architecture, orchestration, domain-experts, automation |
| 2603.24034 | From Oracle to Noisy Context: Mitigating Contextual Exposure Bias in Speech-LLMs | cs.CL, cs.AI | 86 | Mitigates contextual exposure bias in Speech-LLMs using noisy history + dropout + DPO on failures. | speech-LLM, robustness, DPO, distribution-shift, evaluation, alignment |
| 2603.23472 | Byzantine-Robust and Differentially Private Federated Optimization under Weaker Assumptions | cs.LG, cs.CR, math.OC | 84 | Unified DP + Byzantine-robust federated optimization with weaker assumptions and guarantees. | federated-learning, differential-privacy, byzantine-robustness, secure-ml, optimization |
| 2603.22651 | Benchmarking Multi-Agent LLM Architectures for Financial Document Processing: A Comparative Study of Orchestration Patterns, Cost-Accuracy Tradeoffs and Production Scaling Strategies | cs.AI, cs.CL, cs.LG | 84 | Large-scale benchmark of multi-agent orchestration patterns with cost/latency/accuracy tradeoffs. | multi-agent, orchestration, benchmark, evaluation, LLMs, cost-latency, document-IE |
| 2603.15080 | Open Biomedical Knowledge Graphs at Scale: Construction, Federation, and AI Agent Access with Samyama Graph Database | cs.DB, cs.AI, q-bio.QM | 84 | Open biomedical KGs + federation + explicit AI-agent access layer; reusable infra at scale | knowledge-graphs, agents, tool-use, data-infrastructure, biomedicine, RAG |
| 2603.22999 | PaperVoyager: Building Interactive Web with Visual Language Models | cs.CL | 84 | Benchmark + agent that turns papers into executable interactive web systems; strong tool-use/document agent angle. | agents, tool-use, document-understanding, benchmark, evaluation, web-synthesis |
| 2603.23983 | SafeFlow: Real-Time Text-Driven Humanoid Whole-Body Control via Physics-Guided Rectified Flow and Selective Safety Gating | cs.RO, cs.AI, eess.SY | 84 | Text-driven humanoid control with explicit safety gating and physics guidance; addresses OOD unsafe motions. | robot-safety, agents, humanoids, safety-gating, OOD-robustness, control |
| 2603.24558 | LensWalk: Agentic Video Understanding by Planning How You See in Videos | cs.CV, cs.AI | 83 | Agentic video understanding with reason-plan-observe control of perception; likely reusable framework. | agentic, video-understanding, planning, active-perception, VLM-tools, efficiency |
| 2603.17265 | LED: A Benchmark for Evaluating Layout Error Detection in Document Analysis | cs.CV, cs.CL | 82 | LED benchmark targets structural layout errors beyond IoU; reusable eval for doc/LMM systems. | benchmark, evaluation, multimodal, document-ai, hallucination, robustness |
| 2603.22918 | EVA: Efficient Reinforcement Learning for End-to-End Video Agent | cs.CV, cs.AI, cs.CL | 82 | RL-based planning-before-perception for long videos; efficiency gains for multimodal agents. | video-agents, reinforcement-learning, planning, multimodal, efficiency, long-context |
| 2603.21574 | Adaptive Robust Estimator for Multi-Agent Reinforcement Learning | cs.AI | 82 | Robust MARL for collaborative reasoning; tackles noisy/heavy-tailed rewards and structured critique loops | multi-agent, reinforcement-learning, robust-estimation, llm-reasoning, credit-assignment |
| 2603.17811 | Dropout Robustness and Cognitive Profiling of Transformer Models via Stochastic Inference | cs.LG, cs.AI | 82 | Systematic MC-dropout reliability study across 19 transformers; links variability to reasoning/memory | uncertainty, MC-dropout, reliability, transformers, stochastic-inference, evaluation |
| 2603.23406 | Beyond Preset Identities: How Agents Form Stances and Boundaries in Generative Societies | cs.AI, cs.CL, cs.HC | 82 | Measures stance formation/identity negotiation in generative multi-agent societies; new metrics. | multi-agent, social-simulation, evaluation, trust, persuasion, agent-behavior |
| 2603.24167 | Walma: Learning to See Memory Corruption in WebAssembly | cs.CR, cs.LG | 82 | ML-based WebAssembly memory attestation vs adversarial host; concrete security evaluation on CVEs | security, webassembly, memory-corruption, attestation, robustness, systems |
AI Paper Insight Briefing
2026-03-29
0) Key takeaways (read this first)
- "Perception is the bottleneck" can now be quantified and fixed without retraining: cross-checking intermediate evidence (not just final answers) via multi-agent context engineering substantially improves multimodal math accuracy (M$^3$-ACE).
- Benchmarks are shifting from "did you box/answer correctly" to "did you detect structural failure modes": LED reframes document layout evaluation around error types (missing, merge, split, etc.), revealing that strong VLMs still struggle with fine-grained structural diagnosis.
- Inference-time stochasticity is not a free uncertainty win: MC Dropout often lowers accuracy (in 10 of 19 models) and hurts "memory" tasks more than "reasoning" tasks, so uncertainty methods must be matched to architecture and task.
- Agent security is increasingly about system surfaces (tools, GUIs, permissions), not just text: a notification-icon visual backdoor can hijack mobile GUI agents at high ASR (AgentRAE), while internalized authorization trajectories can enforce permission boundaries (Chain-of-Authorization).
- Closed-loop, environment-grounded training and evaluation is becoming the practical dividing line: EnterpriseLab (tool environments + executable synthesis + trajectory RL) and the financial orchestration benchmark both show that architecture and cost control dominate production viability.
- Claim-level, bias-resistant verification is emerging as a scalable anti-hallucination training signal: MARCH uses an information-asymmetric Checker (blind to the Solver's output) plus a strict per-claim reward, raising an 8B model's RAG factuality by roughly 20 points on the reported average.
1) Key themes (clustered)
Theme: multi-agent evidence/consensus as a robustness primitive
- Why it matters: many failures come from models locking onto wrong intermediate state too early (visual evidence, critiques, rewards). Cross-agent disagreement signals and structured reconciliation let systems spend extra compute selectively where it matters (a minimal gating sketch follows this theme).
- Representative papers:
  - M$^3$-ACE: Rectifying Visual Perception in Multimodal Math Reasoning via Multi-Agentic Context Engineering
  - Adaptive Robust Estimator for Multi-Agent Reinforcement Learning
  - Multi-Agent Reasoning with Consistency Verification Improves Uncertainty Calibration in Medical MCQA
- Common methods:
  - Separate intermediate artifacts (VE lists; critique diffs; verification questions) from final answers and treat them as first-class objects.
  - Use disagreement/consensus (conflict ratios, majority voting, inconsistency scores) to trigger extra iterations or weighted fusion.
  - Add structure/tools to agent interaction (Summary/Refine tools; staged answer-critique-revise; verification protocols).
- Open questions / failure modes:
  - Heuristic thresholds and gating policies (e.g., conflict ratio > 0.2) may be brittle across domains and models.
  - Agreement ≠ correctness: verification can reward coherent but wrong reasoning (noted for medical MCQA).
  - Compute/latency overhead and scaling behavior under real-time constraints are often under-reported.
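A minimal sketch of the shared gating pattern: run several agents once, compute a conflict ratio over their extracted evidence, and spend refine iterations only when disagreement exceeds a threshold. All names (`conflict_ratio`, `refine`) and the draft format are illustrative assumptions, not any paper's actual interface; only the 0.2 threshold echoes the heuristic mentioned above.

```python
from collections import Counter

def conflict_ratio(evidence_lists):
    """Fraction of distinct evidence items NOT endorsed by a majority of agents."""
    n_agents = len(evidence_lists)
    counts = Counter(item for ev in evidence_lists for item in set(ev))
    if not counts:
        return 0.0
    majority = sum(1 for c in counts.values() if c > n_agents / 2)
    return 1.0 - majority / len(counts)

def solve_with_selective_refine(question, agents, refine, threshold=0.2, max_rounds=2):
    """Run all agents once; spend extra refine rounds only on high-conflict cases."""
    drafts = [agent(question) for agent in agents]  # each draft: (evidence_list, answer)
    for _ in range(max_rounds):
        if conflict_ratio([ev for ev, _ in drafts]) <= threshold:
            break  # high consensus: accept without extra compute
        drafts = [refine(question, d, drafts) for d in drafts]  # reconcile evidence across agents
    answers = [ans for _, ans in drafts]
    return Counter(answers).most_common(1)[0][0]  # majority vote on final answers
```

The key design choice is that gating runs on the intermediate evidence, not the final answers, so agreement on a wrong answer reached via conflicting evidence still triggers another round.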
Theme: evaluation is becoming diagnostic (error types, contamination sensitivity, executable judges)
- Why it matters: aggregate scores hide where systems fail (structural layout errors, contamination sensitivity, unverifiable code reviews). New benchmarks aim to expose failure modes with more actionable signals.
- Representative papers:
  - LED: A Benchmark for Evaluating Layout Error Detection in Document Analysis
  - Silicon Bureaucracy and AI Test-Oriented Education: Contamination Sensitivity and Score Confidence in LLM Benchmarks
  - Code Review Agent Benchmark
- Common methods:
  - Replace single metrics with typed error taxonomies and tiered tasks (LED T1–T3).
  - Stress-test score confidence via controlled perturbation/aggregation (router-worker noisy-paraphrase audits).
  - Use executable judges (turn review comments into tests that must fail before the fix and pass after it; a sketch of this protocol follows below).
- Open questions / failure modes:
  - Synthetic construction and imbalance (Missing dominates in LED) may skew conclusions.
  - Contamination audits are so far demonstrated mostly on small samples (n=100) and MCQ-only settings.
  - Test-based judging depends on environment reconstruction and coding agents, conflating capabilities.
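A minimal sketch of the fail-before/pass-after executable-judge protocol, assuming two repo snapshots (pre-fix and post-fix) and a judge-generated pytest file; all function names here are hypothetical, not the benchmark's actual harness.

```python
import pathlib
import subprocess

def run_pytest(repo_dir: str, test_code: str) -> bool:
    """Write the judge-generated test into the repo and return True iff pytest passes."""
    test_file = pathlib.Path(repo_dir) / "test_review_claim.py"
    test_file.write_text(test_code)
    try:
        result = subprocess.run(
            ["python", "-m", "pytest", str(test_file), "-q"],
            cwd=repo_dir, capture_output=True, timeout=120,
        )
        return result.returncode == 0
    finally:
        test_file.unlink(missing_ok=True)

def executable_judge(pre_fix_repo: str, post_fix_repo: str, test_code: str) -> bool:
    """A review comment counts as validated only if its test fails on the
    pre-fix snapshot AND passes on the post-fix snapshot."""
    fails_before = not run_pytest(pre_fix_repo, test_code)
    passes_after = run_pytest(post_fix_repo, test_code)
    return fails_before and passes_after
```

The two-sided check is what distinguishes this from text similarity: a vacuous test that always passes fails the "fails before" condition, and an irrelevant test fails the "passes after" condition.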
Theme: tool/GUI agent security and governance is moving "inside the model"
- Why it matters: once agents act in real systems, the attack surface includes UI pixels, tool schemas, and permission boundaries. Defenses need causal enforcement, not just prompts (a toy harness-level sketch of trajectory-gated tool calls follows this theme).
- Representative papers:
  - AgentRAE: Remote Action Execution through Notification-based Visual Backdoors against Screenshots-based Mobile GUI Agents
  - Chain-of-Authorization: Internalizing Authorization into Large Language Models via Reasoning Trajectories
  - The Evolution of Tool Use in LLM Agents: From Single-Tool Call to Multi-Tool Orchestration
- Common methods:
  - Formalize system-level threat models (supply-chain poisoning; permission mismatch; transactional semantics).
  - Enforce information barriers or causal steps (walk an authorization trajectory before answering; keep the Checker blind to the Solver).
  - Evaluate under adaptive attack/defense (PAIR jailbreaks; pruning/fine-tuning defenses against backdoors).
- Open questions / failure modes:
  - Backdoor evaluations are mostly offline; online interactive agent dynamics remain untested.
  - CoA requires fine-tuning plus permission-token engineering; real permission taxonomies are complex and dynamic.
  - Tool-orchestration security needs transactional semantics and replayable audits, but no unified standard has emerged.
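A toy sketch of the CoA-style ordering constraint enforced in the harness rather than inside the model: a "resource review → identity → decision" trajectory must be produced before any tool call is allowed. The paper internalizes this via fine-tuned reasoning trajectories; here the ACL, the naive substring resource matching, and every name are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class AuthTrajectory:
    resources: list   # resources the request would touch (resource review)
    identity: str     # caller identity / role
    decision: str     # "allow" or "deny", emitted before any content

def authorize(request: str, caller_role: str, acl: dict) -> AuthTrajectory:
    """Walk the 'resource review -> identity -> decision' steps BEFORE answering."""
    touched = [r for r in acl if r in request]  # naive resource extraction for the sketch
    allowed = all(caller_role in acl[r] for r in touched)
    return AuthTrajectory(touched, caller_role, "allow" if allowed else "deny")

def gated_tool_call(request, caller_role, acl, tool):
    traj = authorize(request, caller_role, acl)
    if traj.decision == "deny":
        return f"denied: {traj.resources} not permitted for role {traj.identity}"
    return tool(request)  # downstream call is conditioned on the trajectory

# Hypothetical ACL: 'salary_table' readable only by role 'hr'
acl = {"salary_table": {"hr"}, "wiki": {"hr", "engineer"}}
print(gated_tool_call("read salary_table", "engineer", acl, lambda r: "ok"))  # -> denied
```

The point of the ordering is causal: the decision exists before any content is generated, so an audit can reject outputs whose trajectory is missing or denying.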
Theme: plan-then-perceive and adaptive observation for long-horizon video agents
- Why it matters: long videos overwhelm fixed-context VLM pipelines; agents must decide what to look at and how densely to sample, to control cost while preserving evidence (a minimal plan-observe loop sketch follows this theme).
- Representative papers:
  - EVA: Efficient Reinforcement Learning for End-to-End Video Agent
  - LensWalk: Agentic Video Understanding by Planning How You See in Videos
- Common methods:
  - Iterative plan-observe loops with parameterized observation actions (time window, frame count, zoom; tool selection).
  - Staged training or modular toolkits (SFT→KTO→GRPO; Scan/Focus/Stitch tools plus timestamp anchors).
  - Explicit efficiency objectives (visual-token budgets; fewer frames; avoiding re-preprocessing).
- Open questions / failure modes:
  - Reward hacking and sampling pathologies persist (EVA mitigates but does not eliminate them).
  - Planner stalls (static repetition) and premature conclusions (LensWalk failure modes).
  - Dependence on tool interfaces and observer quality; generalization to new tools/modalities is unclear.
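A minimal sketch of the shared loop under stated assumptions: `planner`, `observer`, and `answerer` are hypothetical callables standing in for the papers' trained modules; the visual-token budget and the stall detector mirror the efficiency objectives and failure modes listed above.

```python
from dataclasses import dataclass

@dataclass
class Observation:
    start_s: float   # window start (seconds)
    end_s: float     # window end
    n_frames: int    # frames to sample in this window

def video_agent(question, planner, observer, answerer, budget_tokens=8192, max_steps=6):
    """Iterative plan-observe loop: the planner decides where and how densely to look,
    the observer returns captions for that window, and we stop on answer or budget."""
    history, spent = [], 0
    for _ in range(max_steps):
        action = planner(question, history)  # -> Observation or ("answer", text)
        if isinstance(action, tuple) and action[0] == "answer":
            return action[1]
        captions, cost = observer(action)    # cost = visual tokens actually consumed
        if spent + cost > budget_tokens:
            break                            # respect the visual-token budget
        spent += cost
        if history and captions == history[-1][1]:
            break                            # stall detector: repeated static observation
        history.append((action, captions))
    return answerer(question, history)       # forced answer at budget/step limit
```

Tracking `spent` alongside accuracy is what makes "evidence efficiency" measurable, as recommended in the next-steps section.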
Theme: train-time and decode-time robustness interventions (noise, attention, DP/Byzantine)
- Why it matters: robustness failures often come from distribution shift (noisy context), attention pathologies (hallucination), or adversarial participants (federated learning). Lightweight interventions can have high leverage (a loose sketch of the clipping-plus-momentum aggregation ingredients follows this theme).
- Representative papers:
  - From Oracle to Noisy Context: Mitigating Contextual Exposure Bias in Speech-LLMs
  - Mitigating Object Hallucinations in LVLMs via Attention Imbalance Rectification
  - Byzantine-Robust and Differentially Private Federated Optimization under Weaker Assumptions
- Common methods:
  - Train on realistic noise (teacher-ASR hypotheses) + regularize the dependence (context dropout) + preference alignment (DPO).
  - Apply head-specific attention interventions at decode time (AIR), reducing hallucination without retraining.
  - Combine robust aggregation + clipping + momentum + error feedback, with high-probability guarantees (Byz-Clip21-SGD2M).
- Open questions / failure modes:
  - Hyperparameter sensitivity (AIR's λ/β tradeoff; CoA learning-rate sensitivity; DPO scaling γ).
  - Teacher bias in noise modeling (single ASR teacher) and limited scenario coverage (no overlapping speakers).
  - Theory-practice gaps: constraint/hyperparameter restrictions and unproven variants (per-sample clipping).
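A loose numpy sketch of the clipping + Gaussian noise + server-momentum ingredients. To be clear about assumptions: this is NOT the paper's Byz-Clip21-SGD2M (which adds error feedback and carries high-probability guarantees); the clipped-mean rule, noise scaling, and every parameter here are illustrative.

```python
import numpy as np

def clipped_mean(updates: np.ndarray, tau: float) -> np.ndarray:
    """Clip each client update to norm tau, then average. Clipping bounds both
    Byzantine outliers and per-client sensitivity for the DP noise below."""
    norms = np.linalg.norm(updates, axis=1, keepdims=True)
    clipped = updates * np.minimum(1.0, tau / np.maximum(norms, 1e-12))
    return clipped.mean(axis=0)

def dp_byzantine_step(w, client_grads, tau=1.0, lr=0.1, sigma=0.5,
                      momentum=0.9, m=None, rng=None):
    """One server step: clipped-mean aggregation + Gaussian DP noise + server momentum."""
    rng = np.random.default_rng() if rng is None else rng
    grads = np.asarray(client_grads)
    agg = clipped_mean(grads, tau)
    agg = agg + rng.normal(0.0, sigma * tau / len(grads), size=agg.shape)  # DP noise
    m = agg if m is None else momentum * m + (1 - momentum) * agg          # smooths noise
    return w - lr * m, m
```

The intuition the sketch captures: clipping gives a single knob (tau) that simultaneously limits what a Byzantine client can inject and calibrates the DP noise, while momentum averages that noise out over steps.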
2) Technical synthesis
- Intermediate-representation auditing is converging across modalities: VE lists (visual math), claim QA pairs (RAG factuality), verification questions (medical MCQA), and graph memory (penetration testing) all serve as auditable state that can be cross-checked.
- Information asymmetry is a recurring de-biasing tool: MARCH keeps the Checker blind to the Solver's output; CoA forces an explicit authorization trajectory before content; both aim to avoid "saw the answer first" bias.
- Selective compute is the dominant systems pattern: M$^3$-ACE iterates only on the ~10% of contested samples; the financial orchestration study shows tiered "inflection points" plus caching/routing; the robot safety gate engages only when stability/OOD safety demands it.
- Robust statistics is entering RL-for-LLMs: ARE replaces batch-mean normalization with blockwise-median robust estimation; POISE probes the normalization/validity-masking mechanics of GRPO variants (a median-of-means flavored sketch follows this list).
- Prompt/configuration sensitivity is now explicitly benchmarked: LED measures prompt robustness (CV/NR) under P1/P2/P3; inference-time dropout shows architecture-dependent variability; together these argue that "one prompt, one setting" reporting is insufficient.
- Decode-time interventions are gaining credibility: AIR substantially reduces CHAIR hallucination metrics while holding or improving MM-Vet; this echoes other training-free fixes such as M$^3$-ACE's context engineering.
- Environment-grounded evaluation is becoming the gold standard for agents: EnterpriseLab executes trajectories in tool containers; the pentesting workflow grounds memory in observed outputs; the code review benchmark uses executable tests.
- Agent-security threats are increasingly visual and supply-chain shaped: AgentRAE shows a tiny notification icon can be a robust trigger; defenses that assume text-only triggers or static prompts are incomplete.
- Calibration/uncertainty without labels remains hard: MARC improves ECE via consistency verification, but the paper notes that consistency can reward wrong knowledge, underscoring the need for grounding beyond self-consistency.
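A median-of-means flavored sketch of the robust-normalization idea referenced above, under assumptions: it mirrors the spirit of replacing batch-mean/std baselines with blockwise medians for heavy-tailed rewards, not ARE's exact estimator.

```python
import numpy as np

def blockwise_median_normalize(rewards, n_blocks=4, rng=None):
    """Normalize rewards with median-of-block statistics instead of batch mean/std,
    so a few heavy-tailed or corrupted rewards cannot drag the baseline."""
    rng = np.random.default_rng() if rng is None else rng
    rewards = np.asarray(rewards, dtype=float)
    blocks = np.array_split(rng.permutation(rewards), n_blocks)
    center = np.median([b.mean() for b in blocks])         # median-of-means location
    scale = np.median([b.std() for b in blocks]) + 1e-8    # robust scale estimate
    return (rewards - center) / scale
```

With a plain batch mean, one reward of 10^6 in a batch of near-zero rewards shifts every advantage; with the median of four block means, at most one block is corrupted and the median ignores it.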
3) Top 5 papers (with "why now")
1) M$^3$-ACE: Rectifying Visual Perception in Multimodal Math Reasoning via Multi-Agentic Context Engineering
- Decouples visual-evidence extraction from reasoning and cross-validates multi-agent VE lists with Summary/Refine tools.
- Reports sizable gains on MathVision (e.g., Gemini-3 Pro 85.0% → 89.1%), with larger gains for weaker models (e.g., GPT-5 72.0% → 82.2%).
- Selective iteration: the refine stage pushes the high-consensus subset to near 90% accuracy while only ~10% of samples enter the loop.
- Caveats: depends on the availability of several strong multimodal models; heuristic thresholds and compute/latency tradeoffs are under-quantified.
2) MARCH: Multi-Agent Reinforced Self-Check for LLM Hallucination
- Introduces a Solver-Proposer-Checker setup with a blind Checker to reduce confirmation bias; trained via dual-trajectory PPO.
- Reports large factuality gains: RAGTruth/FaithBench average 55.20% → ~75% (+~20 points).
- Enforces per-claim correctness with a strict Zero-Tolerance Reward (every claim must match; a minimal sketch follows this entry).
- Caveats: verification skews toward numeric/quantitative claims; Proposer reward hacking (shrinking the claim set) is a known risk.
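A minimal sketch of an all-or-nothing claim reward, assuming a `verify` callable that stands in for the blind Checker; the `min_claims` guard against claim-set shrinking is an illustrative addition, not MARCH's actual mechanism.

```python
def zero_tolerance_reward(claims, verify, min_claims=1):
    """All-or-nothing: reward 1.0 only if EVERY extracted claim verifies against
    the retrieved evidence; a single failed claim zeroes the reward."""
    if len(claims) < min_claims:
        return 0.0  # crude guard against gaming the reward by emitting fewer claims
    return 1.0 if all(verify(claim) for claim in claims) else 0.0
```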
3) AgentRAE: Remote Action Execution through Notification-based Visual Backdoors against Screenshots-based Mobile GUI Agents
- Demonstrates a practical trigger surface: native notification icons can serve as stealthy backdoor triggers for screenshot-based agents.
- Two-stage poisoning (contrastive trigger separation + balanced poisoning loss) achieves high ASR (>90% across settings) and scales to 9 targets.
- Evaluates defenses (fine-pruning, fine-tuning, NAD) and finds ASR remains high after defense.
- Caveats: offline evaluation on two open-source agents/datasets; online timing/interaction effects are untested.
4) LED: A Benchmark for Evaluating Layout Error Detection in Document Analysis
- Defines 8 structural layout error types and builds a synthetic-injection benchmark with 3 tiered tasks (document detection → type classification → element classification).
- Finds the Gemini 2.5 family strongest and most prompt-stable; GPT models degrade sharply on fine-grained tasks.
- Provides prompt/input-configuration comparisons (image+JSON best; boxes-only weakest).
- Caveats: synthetic construction, imbalanced error distribution (Missing dominates), and single-source injection modeling may limit generalization.
5) EnterpriseLab: A Full-Stack Platform for developing and deploying agents in Enterprises
- Integrates an MCP tool environment, executable trajectory synthesis from schemas, and closed-loop training (SFT/DPO/Agentic GRPO).
- Reports Qwen3-8B with Agentic GRPO approaching GPT-4o on EnterpriseArena execution accuracy (0.43 vs 0.45), with a claimed ~8-10× inference-cost reduction.
- Demonstrates adaptation via incremental trajectories after schema/API changes.
- Caveats: scope is mostly tool/API environments (not GUIs); performance depends on base-model capability and synthesis quality.
4) Practical next steps
- Make intermediate-artifact logging the default: save VE lists / claim lists / tool-call plans and measure disagreement rates; use them to trigger selective retries (as in M$^3$-ACE).
- Add an information-asymmetric verification path to RAG: implement a Checker that sees only the retrieved documents plus atomized questions (never the draft answer), and track the factuality delta against standard self-critique.
- Run contamination-sensitivity audits before trusting leaderboard deltas: reproduce the router-worker noisy-paraphrase test on your key MCQ benchmarks and report "violation breadth" alongside accuracy.
- For tool agents, treat permissions as first-class tokens plus trajectories: prototype a CoA-style "resource review → identity → decision" output and force downstream answers/tool calls to condition on that trajectory.
- Harden GUI agents against visual trigger surfaces: add notification-aware preprocessing (mask/crop notification regions) and evaluate under AgentRAE-style icon-trigger backdoor scenarios.
- If you use MC Dropout for uncertainty, benchmark memory-heavy and reasoning-heavy tasks separately: measure mean plus standard deviation under stochastic inference, and never blindly enable dropout on specialized checkpoints (a small sketch follows this list).
- For long-video agents, measure "evidence efficiency", not just accuracy: track frames used / visual tokens / observation turns, and add stall detectors for static repetition and premature stopping.
- Prefer executable judges wherever possible: for code review or agent actions, convert evaluation into tests or environment-grounded success metrics rather than text similarity.
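A small PyTorch sketch of the mean-plus-std protocol for MC Dropout, under assumptions: it targets a classifier-style head (for LLM MCQA you would score answer options instead), and `model.train()` is used only to keep dropout active, which also affects BatchNorm layers if the model has any.

```python
import torch

@torch.no_grad()
def mc_dropout_eval(model, inputs, n_samples=20):
    """Evaluate with dropout LEFT ON: report the predictive mean and the spread
    across stochastic passes, instead of a single deterministic run."""
    model.train()  # keeps dropout active; note this also switches BatchNorm behavior
    probs = torch.stack([model(inputs).softmax(-1) for _ in range(n_samples)])
    mean_p = probs.mean(0)             # predictive mean over stochastic passes
    std_p = probs.std(0)               # disagreement across passes = uncertainty proxy
    return mean_p.argmax(-1), std_p.max(-1).values
```

Reporting accuracy from `mean_p` together with the per-example spread, split by memory-heavy versus reasoning-heavy tasks, is what surfaces the architecture- and task-dependent degradation the dropout study describes.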
Generated from per-paper analyses; no external browsing.
