AI Paper Daily (2026-03-09)

Published:

English version: /paper-news/2026-03-09/

Run statistics

  • Candidate papers: 1352
  • Selected papers: 30
  • Deep reads completed: 30
  • Time window (UTC): 2026-03-06T01:00:00Z → 2026-03-07T01:00:00Z (weekend_backlog_sat, expanded=0)
Papers used for the summary

arXiv ID | Title / Link | Categories | Score | Selection rationale | Tags
2602.22983 | Obscure but Effective: Classical Chinese Jailbreak Prompt Optimization via Bio-Inspired Search (PDF) | cs.AI, cs.CR | 93 | Automated classical-Chinese jailbreak optimization; highlights multilingual safety gaps. | jailbreaks, prompt-optimization, multilingual, red-teaming, llm-security
2603.00529 | CaptionFool: Universal Image Captioning Model Attacks (PDF) | cs.CV, cs.AI | 93 | Universal adversarial attack on captioners (94–96%); can induce offensive captions & evade filters | multimodal-security, adversarial-attacks, image-captioning, robustness, content-moderation, red-teaming
2603.05344 | Building AI Coding Agents for the Terminal: Scaffolding, Harness, Context Engineering, and Lessons Learned (PDF) | cs.AI | 92 | Open-source terminal coding agent with explicit safety controls + context management lessons. | coding-agents, tool-use, agent-architecture, safety-controls, context-engineering, open-source
2603.01712 | FT-Dojo: Towards Autonomous LLM Fine-Tuning with Language Agents (PDF) | cs.AI, cs.LG | 92 | Benchmark for end-to-end autonomous LLM fine-tuning with agents; realistic tooling+iteration loop | agents, auto-ML, fine-tuning, benchmark, evaluation, tool-use
2603.00436 | ROKA: Robust Knowledge Unlearning against Adversaries (PDF) | cs.LG, cs.AI | 90 | Unlearning can induce new attacks; proposes robust unlearning framework to mitigate. | machine-unlearning, privacy, security, backdoors, robustness
2603.03761 | AgentSelect: Benchmark for Narrative Query-to-Agent Recommendation (PDF) | cs.AI, cs.IR | 90 | Benchmark for query-conditioned agent configuration recommendation; fills key gap for agent ecosystems. | agents, benchmark, tool-selection, evaluation, recommendation
2603.01724 | GMP: A Benchmark for Content Moderation under Co-occurring Violations and Dynamic Rules (PDF) | cs.AI | 90 | Content moderation benchmark with co-occurring harms + dynamic policies; closer to real deployment | safety, content-moderation, evaluation, policy-following, robustness, benchmarks
2603.04948 | $\nabla$-Reasoner: LLM Reasoning via Test-Time Gradient Descent in Latent Space (PDF) | cs.LG | 90 | Test-time latent gradient descent for LLM reasoning; potentially strong inference-time scaling method | llm, reasoning, test-time-compute, decoding, optimization, reward-model
2603.01499 | Towards Privacy-Preserving LLM Inference via Collaborative Obfuscation (Technical Report) (PDF) | cs.CR, cs.AI | 90 | Practical privacy-preserving LLM inference proposal targeting accuracy, clusters, and infra compatibility | privacy, secure-inference, llm-serving, systems, confidential-computing
2603.00724 | RLAR: An Agentic Reward System for Multi-task Reinforcement Learning on Large Language Models (PDF) | cs.CL | 90 | Agentic, dynamic reward acquisition for RL alignment; tackles reward generalization & verifier synthesis. | alignment, RLHF, reward-models, agents, verifiers, tool-use
2603.05026 | RepoLaunch: Automating Build&Test Pipeline of Code Repositories on ANY Language and ANY Platform (PDF) | cs.SE, cs.LG, cs.MA | 90 | Agent that auto-builds/tests any repo; enables scalable SWE benchmarks & training data pipelines | coding-agents, SWE-benchmarking, automation, build-and-test, evaluation-pipeline, datasets
2603.01421 | SciDER: Scientific Data-centric End-to-end Researcher (PDF) | cs.AI, cs.CL | 90 | End-to-end LLM scientist that parses raw data, writes/executes code, with benchmarks and feedback loop | agents, scientific-discovery, tool-use, code-execution, memory, evaluation
2603.01104 | Egocentric Co-Pilot: Web-Native Smart-Glasses Agents for Assistive Egocentric AI (PDF) | cs.HC, cs.AI, cs.CV, cs.CY | 90 | Smart-glasses LLM agent w/ long-horizon video reasoning + web tools; real-world agent safety stakes | agents, tool-use, multimodal, long-horizon, context-compression, assistive-tech, deployment
2603.00634 | BLUFF: Benchmarking the Detection of False and Synthetic Content across 58 Low-Resource Languages (PDF) | cs.CL | 89 | Large benchmark for false/synthetic content across many low-resource languages. | benchmarks, misinformation, synthetic-text-detection, multilingual, evaluation
2603.01012 | FastCode: Fast and Cost-Efficient Code Understanding and Reasoning (PDF) | cs.SE, cs.AI | 88 | Repo-scale code reasoning with cost-aware structure scouting; strong relevance to agent efficiency. | code-reasoning, agents, context-efficiency, repository-mapping, software-engineering
2603.03906 | Measuring Privacy vs. Fidelity in Synthetic Social Media Datasets (PDF) | cs.CR | 88 | Measures privacy leakage in synthetic social media text via authorship attribution re-ID attacks | privacy, synthetic-data, LLMs, membership-inference, authorship-attribution, security
2603.01203 | How Well Does Agent Development Reflect Real-World Work? (PDF) | cs.AI | 88 | Maps 43 agent benchmarks to 1,016 occupations; finds big mismatch vs real labor/economic value | agents, evaluation, benchmarks, labor-market, task-distribution, deployment-relevance
2603.01501 | GAC: Stabilizing Asynchronous RL Training for LLMs via Gradient Alignment Control (PDF) | cs.LG, cs.AI | 88 | Stabilizes async RL for LLMs; identifies stale-aligned gradients and proposes control method | LLM-RL, asynchronous-RL, training-stability, policy-gradient, scaling
2603.04814 | Beyond the Context Window: A Cost-Performance Analysis of Fact-Based Memory vs. Long-Context LLMs for Persistent Agents (PDF) | cs.CL | 88 | Direct cost/accuracy tradeoff study: long-context vs fact-memory for persistent agents on 3 benchmarks | agents, memory, long-context, cost-modeling, evaluation, RAG
2603.01050 | MM-DeepResearch: A Simple and Effective Multimodal Agentic Search Baseline (PDF) | cs.CV, cs.AI | 88 | Multimodal deep-research agent baseline + synthetic search-intensive data/trajectories; reusable for agents eval. | agents, multimodal, tool-use, search, planning, datasets, benchmarks
2603.01241 | TARSE: Test-Time Adaptation via Retrieval of Skills and Experience for Reasoning Agents (PDF) | cs.IR, cs.AI | 88 | Test-time retrieval of skills + verified reasoning trajectories to improve clinical reasoning agents | reasoning-agents, test-time-adaptation, retrieval, skills, experience, healthcare
2603.04277 | VANGUARD: Vehicle-Anchored Ground Sample Distance Estimation for UAVs in GPS-Denied Environments (PDF) | cs.RO, cs.AI | 87 | Shows VLM spatial scale hallucinations; adds deterministic tool for metric grounding. | embodied-agents, hallucinations, tool-use, robot-safety, evaluation
2603.01167 | DEP: A Decentralized Large Language Model Evaluation Protocol (PDF) | cs.CL | 86 | Decentralized LLM evaluation protocol aiming at reproducibility and reducing benchmark leakage risk. | evaluation, benchmarking, reproducibility, data-leakage, protocols
2603.01455 | From Verbatim to Gist: Distilling Pyramidal Multimodal Memory via Semantic Information Bottleneck for Long-Horizon Video Agents (PDF) | cs.CV, cs.AI, cs.CL, cs.IR, cs.MM | 86 | Pyramidal multimodal memory for long-horizon video agents; distills verbatim→gist to fit context | agents, long-context, memory, multimodal, video-understanding, efficiency
2603.00883 | Knowledge without Wisdom: Measuring Misalignment between LLMs and Intended Impact (PDF) | cs.LG, cs.AI, cs.CY, stat.AP | 86 | Shows benchmark success can be negatively aligned with learning outcomes; ensembles worsen misalignment | alignment, evaluation, OOD-generalization, education, impact-misalignment, reliability
2603.04815 | EchoGuard: An Agentic Framework with Knowledge-Graph Memory for Detecting Manipulative Communication in Longitudinal Dialogue (PDF) | cs.AI | 86 | Agentic KG memory to detect manipulation over long dialogues; relevant to safety monitoring & oversight | agents, safety, long-context, memory, knowledge-graphs, monitoring, dialogue
2603.01563 | LFPO: Likelihood-Free Policy Optimization for Masked Diffusion Models (PDF) | cs.LG, cs.AI | 86 | RLVR-style alignment for diffusion LLMs via likelihood-free policy optimization; potentially reusable method | alignment, RL, diffusion-LLM, RLVR, optimization, post-training
2603.00590 | Fair in Mind, Fair in Action? A Synchronous Benchmark for Understanding and Generation in UMLLMs (PDF) | cs.AI | 86 | Fairness benchmark for multimodal LLMs covering understanding+generation with metric normalization framework. | fairness, evaluation, multimodal, benchmarks, bias, metrics
2603.05068 | Cyber Threat Intelligence for Artificial Intelligence Systems (PDF) | cs.CR, cs.AI | 86 | Systematizes AI-focused cyber threat intelligence: assets, IoCs, supply-chain phases, workflows | AI-security, threat-intelligence, supply-chain, IoC, risk-management, governance
2603.00856 | PARCER as an Operational Contract to Reduce Variance, Cost, and Risk in LLM Systems (PDF) | cs.SE, cs.AI | 86 | YAML operational contract to reduce variance/cost/risk + improve long-context reliability in LLM systems | LLM-systems, governance, reliability, prompting, long-context, auditability, cost-control

AI Paper Insights Brief

2026-03-09

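0) Executive takeaways (read this first)

  • "Non-standard language/style" has become a first-class jailbreak attack surface: classical/archaic-style prompts can achieve near-universal jailbreak success with very few queries, and can even transfer across models, which suggests that many defenses may be overfit to modern language patterns.
  • Robustness is shifting from "better models" to "better systems": several papers show significant gains from system-level interventions (dynamic reward tools in RLAR, gradient-geometry stabilization in GAC, structured codebase scouting in FastCode, an offline search engine in MM-DeepResearch), often improving quality while lowering cost.
  • Evaluation is becoming infrastructure, not just datasets: DEP proposes a leakage-resistant benchmark server; IRIS and BLUFF extend evaluation to multimodal fairness and long-tail multilingual misinformation; AgentSelect recasts evaluation artifacts as a recommendation benchmark for deployable agents.
  • Privacy/security threats are increasingly "second-order": indirect unlearning attacks can weaken other safety-critical classes; synthetic text can still leak author identity; privacy-preserving inference is moving toward deployable obfuscation schemes that are compatible with existing serving stacks.
  • In metric tasks, tool-augmented "anti-hallucination" is winning: VANGUARD shows that VLMs hallucinate severely on spatial scale, while a deterministic geometric tool significantly reduces error. This reinforces a pattern: for safety-critical quantitative measures, add verifiable tools rather than prompting harder.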

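2) Key themes (clusters)

Theme: Linguistic and stylistic jailbreak attack surfaces

Theme: Robust unlearning in adversarial settings

Theme: Agentic rewards and RL stability as scaling bottlenecks

Theme: Evaluation and governance infrastructure (fairness, leakage, representativeness)

Theme: Long-horizon agents: memory, context, and cost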

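3) Technical synthesis

  • Several works converge on structured intermediate representations as a lever for robustness: CC-BOS uses an 8-dimensional prompt-strategy vector; TARSE uses step-indexed LogicalChains plus skills; FastCode uses a multi-layer code graph; MM-Mem uses perceptual/episodic/schematic layers; EchoGuard uses episodic/semantic knowledge graphs (KGs).
  • Optimization is moving "into the loop": CC-BOS optimizes prompts in a black-box manner; ∇-Reasoner optimizes logits at test time; LFPO optimizes diffusion logits/velocity fields; GAC modifies gradients during training to prevent collapse.
  • Verification gating is becoming standard: RLAR's EvalTool verification, RepoLaunch's Verify Agent, PARCER's verification gates, and VANGUARD's confidence scores all encode "distrust the model by default".
  • Cost/latency is treated as a first-class metric (not an afterthought): RLAR reports significantly fewer tokens/GPU-hours than judge-based RLAIF; MM-DeepResearch quantifies online vs. offline cost/time; the memory vs. long-context work gives explicit break-even turn counts; FastCode targets context assembly in a single ingestion pass.
  • Cross-lingual and long-tail generalization is repeatedly shown to be weak: BLUFF quantifies large F1 drops on long-tail languages; CC-BOS demonstrates bypasses via archaic language; both imply that safety and detection tools must be evaluated beyond high-resource languages.
  • "Proxy alignment" failures are empirically visible: the classroom-transcript study shows that FM agreement, and even agreement with expert rubrics, can diverge from the intended impact (student learning gains), a warning against over-reliance on proxy metrics.
  • Asynchrony introduces a distinctive RL failure mode (stale but aligned gradients) that is more than just "off-policy": GAC targets gradient geometry rather than only distribution correction.
  • For metric quantities, tool augmentation beats end-to-end VLM reasoning: VANGUARD's deterministic GSD estimation outperforms VLM area estimation, reinforcing the design pattern for embodied safety.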
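
The last point can be made concrete with a small sketch. It is not VANGUARD's actual pipeline; it only illustrates the arithmetic behind an object-anchored ground sample distance (GSD): metres per pixel equals the known object length divided by its length in pixels, and a metric area equals the pixel area times GSD squared. The 4.5 m reference length, the 0.8 confidence threshold, and all function names are assumptions made for illustration.

```python
from typing import Optional

ASSUMED_CAR_LENGTH_M = 4.5  # hypothetical reference length, not taken from the paper


def estimate_gsd(vehicle_length_px: float,
                 vehicle_length_m: float = ASSUMED_CAR_LENGTH_M) -> float:
    """Ground sample distance (metres per pixel) from one anchor vehicle."""
    if vehicle_length_px <= 0:
        raise ValueError("vehicle_length_px must be positive")
    return vehicle_length_m / vehicle_length_px


def metric_area(area_px: float, gsd_m_per_px: float) -> float:
    """Convert a pixel area to square metres (area scales with GSD squared)."""
    return area_px * gsd_m_per_px ** 2


def gated_area(area_px: float, vehicle_length_px: float,
               detection_confidence: float,
               min_confidence: float = 0.8) -> Optional[float]:
    """Return an area only when the anchor detection is trusted, else None."""
    if detection_confidence < min_confidence:
        return None  # the caller should route this case to a fallback behaviour
    return metric_area(area_px, estimate_gsd(vehicle_length_px))
```

Returning `None` instead of a low-confidence number keeps the "distrust by default" decision explicit at the call site.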

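4) Top 5 papers (with "why now")

1) Obscure but Effective: Classical Chinese Jailbreak Prompt Optimization via Bio-Inspired Search

  • Shows that classical/archaic language is a major safety blind spot; reports 100% ASR on several frontier models in its setting.
  • Provides a structured 8-dimensional prompt-strategy space plus a black-box FOA optimizer, with very low reported query counts.
  • Demonstrates cross-model transferability and applicability to other classical languages (Latin, Sanskrit).
  • Caveats: results depend on the selected benchmark subset and on closed-source victim models; combined defenses/translation filtering can reduce ASR.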

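2) ROKA: Robust Knowledge Unlearning against Adversaries

  • Introduces indirect unlearning attacks: unlearning requests can be used to weaken other safety-critical classes.
  • Proposes Neural Healing / contribution redistribution, using targeted/untargeted randomized algorithms to preserve retained knowledge.
  • Evaluated on vision, multimodal, and LLM settings (including Llama 3.2 on MMLU), with better stability/balance than GA unlearning.
  • Caveats: exact redistribution is infeasible; effectiveness depends on how representative the same-class/retained data is.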
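
A minimal sketch of how a pipeline might check for the indirect-attack effect described above, assuming per-class accuracy is already measured before and after an unlearning request. This is an evaluation helper only, not ROKA's method; the class names, the 0.02 tolerance, and the function name are hypothetical.

```python
from typing import Dict, List


def degraded_classes(acc_before: Dict[str, float],
                     acc_after: Dict[str, float],
                     protected: List[str],
                     max_drop: float = 0.02) -> Dict[str, float]:
    """Return protected classes whose accuracy fell by more than max_drop."""
    drops: Dict[str, float] = {}
    for cls in protected:
        drop = acc_before[cls] - acc_after[cls]
        if drop > max_drop:
            drops[cls] = drop
    return drops


# Hypothetical numbers: the request was to forget the benign class "bird".
before = {"stop_sign": 0.97, "pedestrian": 0.95, "bird": 0.90}
after = {"stop_sign": 0.80, "pedestrian": 0.945, "bird": 0.05}
flagged = degraded_classes(before, after, protected=["stop_sign", "pedestrian"])
```

In this made-up example only `stop_sign` is flagged, which is the signature to look for: a benign forget request that damages a class nobody asked to forget.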

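3) RLAR: An Agentic Reward System for Multi-task Reinforcement Learning on Large Language Models

  • Makes reward modeling adaptive and tool-based (wrapping reward checkpoints; generating code verifiers) rather than static.
  • Reports strong multi-domain RL gains (e.g., the GSM8K improvement in Table 2) and large cost reductions relative to GPT-5-judge RLAIF.
  • Reward-routing accuracy on REWARDBENCH-V2 is high (90.44% on average).
  • Caveats: relies on web retrieval and repository documentation; the authors note vulnerability to "readme hacking".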
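
The idea of gating new reward tools behind a verification step can be sketched as follows. This is an illustration under stated assumptions, not RLAR's implementation: `RewardToolRegistry`, the 0.9 agreement threshold, the 0.5 pass cutoff, and the toy `exact_sum_verifier` are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# A reward tool maps (prompt, response) to a reward in [0, 1].
RewardTool = Callable[[str, str], float]


class RewardToolRegistry:
    """Admits a reward tool only if it agrees with a small labelled set."""

    def __init__(self, min_agreement: float = 0.9) -> None:
        self.min_agreement = min_agreement
        self.tools: Dict[str, RewardTool] = {}

    def admit(self, name: str, tool: RewardTool,
              gold: List[Tuple[str, str, bool]]) -> bool:
        """gold holds (prompt, response, is_correct) triples."""
        if not gold:
            return False  # nothing to verify against, so do not trust the tool
        hits = sum((tool(p, r) >= 0.5) == ok for p, r, ok in gold)
        if hits / len(gold) < self.min_agreement:
            return False
        self.tools[name] = tool
        return True


def exact_sum_verifier(prompt: str, response: str) -> float:
    """Toy deterministic verifier: prompt is 'a+b', response must be the sum."""
    try:
        a, b = prompt.split("+")
        return 1.0 if int(response.strip()) == int(a) + int(b) else 0.0
    except ValueError:
        return 0.0
```

A tool that always returns 1.0 would be rejected by this gate as soon as the labelled set contains enough incorrect responses, which is the point of checking before admission rather than after training.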

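4) GAC: Stabilizing Asynchronous RL Training for LLMs via Gradient Alignment Control

  • Identifies a concrete instability mechanism in asynchronous RL: persistently aligned consecutive gradients appear before collapse.
  • Provides low-overhead projection/skip control that largely closes the gap to synchronous GRPO under staleness (Table 1).
  • Backed by theory that links projection to bias reduction in the convergence bound.
  • Caveats: experiments are reported only on a single-machine 8-GPU setup; large-scale distributed behavior is not shown.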
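
A minimal sketch of monitoring consecutive-gradient alignment with a projection/skip control, assuming gradients are available as flat lists of floats. This is not GAC's actual rule: the thresholds `project_above` and `skip_above`, the function names, and the choice to project out the full component along the previous gradient are assumptions for illustration.

```python
import math
from typing import List, Optional, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; defined as 0.0 when either vector is zero."""
    nu, nv = math.sqrt(dot(u, u)), math.sqrt(dot(v, v))
    return 0.0 if nu == 0.0 or nv == 0.0 else dot(u, v) / (nu * nv)


def control_gradient(grad: Sequence[float],
                     prev_grad: Optional[Sequence[float]],
                     project_above: float = 0.5,
                     skip_above: float = 0.95) -> Optional[List[float]]:
    """Return the gradient to apply, or None to skip this update."""
    if prev_grad is None:
        return list(grad)
    c = cosine(grad, prev_grad)
    if c >= skip_above:
        return None  # nearly identical direction to the last step: skip
    if c >= project_above:
        # Remove the component of grad that lies along prev_grad.
        scale = dot(grad, prev_grad) / dot(prev_grad, prev_grad)
        return [g - scale * p for g, p in zip(grad, prev_grad)]
    return list(grad)
```

Logging the value of `cosine` per step is the monitoring half; a run of consistently high values is the pattern the paper associates with impending collapse.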

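5) BLUFF: Benchmarking the Detection of False and Synthetic Content across 58 Low-Resource Languages

  • Provides a large multilingual benchmark (201K samples, 78 languages) with controlled manipulations and author types.
  • Quantifies cross-lingual transfer degradation: up to a 25.3-point F1 drop on long-tail languages; zero-shot decoders are often near random on multi-class classification.
  • Provides an agentic generation pipeline (AXL-CoI) and a heavy-duty multilingual quality filter (mPURIFY), and reports retention statistics.
  • Caveats: gaps remain in geographic/syntactic coverage; decoder models are evaluated only zero-shot.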
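
The kind of head-to-tail degradation reported above can be measured with a short script. The sketch below assumes binary labels and a flat list of `(language, gold, predicted)` records; it is not BLUFF's evaluation code, and the language codes and sample records are made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def f1_per_language(records: Iterable[Tuple[str, int, int]]) -> Dict[str, float]:
    """Binary F1 per language from (language, gold, predicted) records."""
    counts = defaultdict(lambda: [0, 0, 0])  # tp, fp, fn
    for lang, gold, pred in records:
        c = counts[lang]
        if pred == 1 and gold == 1:
            c[0] += 1
        elif pred == 1 and gold == 0:
            c[1] += 1
        elif pred == 0 and gold == 1:
            c[2] += 1
    scores: Dict[str, float] = {}
    for lang, (tp, fp, fn) in counts.items():
        denom = 2 * tp + fp + fn
        scores[lang] = 2 * tp / denom if denom else 0.0
    return scores


def transfer_gap(scores: Dict[str, float],
                 head: List[str], tail: List[str]) -> float:
    """Mean head-language F1 minus mean tail-language F1, in F1 points."""
    head_mean = sum(scores[l] for l in head) / len(head)
    tail_mean = sum(scores[l] for l in tail) / len(tail)
    return 100.0 * (head_mean - tail_mean)


# Hypothetical records: "en" as a head language, "yo" as a tail language.
records = [("en", 1, 1), ("en", 0, 0), ("en", 1, 1),
           ("yo", 1, 0), ("yo", 1, 1), ("yo", 0, 1)]
scores = f1_per_language(records)
gap = transfer_gap(scores, head=["en"], tail=["yo"])
```

Reporting the gap per tail language, rather than only the pooled multilingual score, is what exposes the long-tail failure.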

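5) Practical next steps

  • Add classical/style-transfer red-teaming to safety evaluation suites (e.g., classical-Chinese-style compression/ambiguity), and measure transfer across models and across defenses.
  • For unlearning pipelines, explicitly test for indirect unlearning attacks: request forgetting of benign/unrelated classes and measure the degradation on safety-critical classes; track imbalance in the prediction distribution.
  • If you run large-scale RL post-training, monitor gradient cosine similarity over time in asynchronous settings; try GAC-style projection/skip control before chasing reward-model fixes.
  • Replace a monolithic reward model with a reward toolset: integrate code verifiers for deterministic tasks, and add a verification gate before admitting new reward tools (the RLAR pattern).
  • For persistent agents, compute your cost break-even point (turns × context length) under your provider's caching rules; decide when to switch from long context to memory, and measure the accuracy loss.
  • For embodied/metric tasks, prefer deterministic perception skills plus confidence gating (the VANGUARD pattern) over VLM-only numeric estimates; route uncertain cases to a fallback behavior.
  • For multilingual misinformation/synthetic-content detection, evaluate detectors in a head-language → long-tail-language transfer setting (BLUFF style), not only on multilingual in-distribution splits.
  • If you publish a benchmark, consider server-side evaluation (DEP style) to reduce leakage/contamination and to lower users' integration cost.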
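
The break-even calculation for persistent agents can be sketched with a deliberately simple cost model. Everything here is an assumption for illustration, not the paper's model: long context re-sends `context_tokens` every turn at a price scaled by `cached_price_factor`, while memory pays a one-time `memory_build_cost` and then sends only `memory_tokens` per turn. Real provider caching rules (minimum prefix length, cache expiry, write surcharges) would need to be added.

```python
import math
from typing import Optional


def break_even_turns(context_tokens: int,
                     memory_tokens: int,
                     price_per_token: float,
                     memory_build_cost: float,
                     cached_price_factor: float = 1.0) -> Optional[int]:
    """Smallest turn count at which memory is no more expensive than long context.

    Returns None when memory never pays off under this model.
    """
    long_per_turn = context_tokens * price_per_token * cached_price_factor
    memory_per_turn = memory_tokens * price_per_token
    saving_per_turn = long_per_turn - memory_per_turn
    if saving_per_turn <= 0:
        return None
    return max(1, math.ceil(memory_build_cost / saving_per_turn))


# Hypothetical inputs: 100k-token context, 2k retrieved tokens per turn,
# 1e-6 currency units per token, 0.5 units to build the memory store.
no_cache = break_even_turns(100_000, 2_000, 1e-6, 0.5)
with_cache = break_even_turns(100_000, 2_000, 1e-6, 0.5, cached_price_factor=0.1)
deep_cache = break_even_turns(100_000, 2_000, 1e-6, 0.5, cached_price_factor=0.01)
```

With these made-up inputs, aggressive caching pushes the break-even point out by roughly an order of magnitude and, at a steep enough discount, removes it entirely, which is why the provider's caching rules belong in the calculation.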

Generated from per-paper analyses; no external browsing.