AI Paper Daily (2026-03-09)

Published: March 09, 2026

English version: /paper-news/2026-03-09/

Run statistics

Candidate papers: 1352
Selected papers: 30
Deep reads completed: 30
Time window (UTC): 2026-03-06T01:00:00Z → 2026-03-07T01:00:00Z (weekend_backlog_sat, expanded=0)

Papers used for this summary

arXiv ID	Title / Link	Categories	Score	Why selected	Tags
`2602.22983`	Obscure but Effective: Classical Chinese Jailbreak Prompt Optimization via Bio-Inspired Search PDF	cs.AI, cs.CR	93	Automated classical-Chinese jailbreak optimization; highlights multilingual safety gaps.	jailbreaks, prompt-optimization, multilingual, red-teaming, llm-security
`2603.00529`	CaptionFool: Universal Image Captioning Model Attacks PDF	cs.CV, cs.AI	93	Universal adversarial attack on captioners (94–96%); can induce offensive captions & evade filters	multimodal-security, adversarial-attacks, image-captioning, robustness, content-moderation, red-teaming
`2603.05344`	Building AI Coding Agents for the Terminal: Scaffolding, Harness, Context Engineering, and Lessons Learned PDF	cs.AI	92	Open-source terminal coding agent with explicit safety controls + context management lessons.	coding-agents, tool-use, agent-architecture, safety-controls, context-engineering, open-source
`2603.01712`	FT-Dojo: Towards Autonomous LLM Fine-Tuning with Language Agents PDF	cs.AI, cs.LG	92	Benchmark for end-to-end autonomous LLM fine-tuning with agents; realistic tooling+iteration loop	agents, auto-ML, fine-tuning, benchmark, evaluation, tool-use
`2603.00436`	ROKA: Robust Knowledge Unlearning against Adversaries PDF	cs.LG, cs.AI	90	Unlearning can induce new attacks; proposes robust unlearning framework to mitigate.	machine-unlearning, privacy, security, backdoors, robustness
`2603.03761`	AgentSelect: Benchmark for Narrative Query-to-Agent Recommendation PDF	cs.AI, cs.IR	90	Benchmark for query-conditioned agent configuration recommendation; fills key gap for agent ecosystems.	agents, benchmark, tool-selection, evaluation, recommendation
`2603.01724`	GMP: A Benchmark for Content Moderation under Co-occurring Violations and Dynamic Rules PDF	cs.AI	90	Content moderation benchmark with co-occurring harms + dynamic policies; closer to real deployment	safety, content-moderation, evaluation, policy-following, robustness, benchmarks
`2603.04948`	$\nabla$-Reasoner: LLM Reasoning via Test-Time Gradient Descent in Latent Space PDF	cs.LG	90	Test-time latent gradient descent for LLM reasoning; potentially strong inference-time scaling method	llm, reasoning, test-time-compute, decoding, optimization, reward-model
`2603.01499`	Towards Privacy-Preserving LLM Inference via Collaborative Obfuscation (Technical Report) PDF	cs.CR, cs.AI	90	Practical privacy-preserving LLM inference proposal targeting accuracy, clusters, and infra compatibility	privacy, secure-inference, llm-serving, systems, confidential-computing
`2603.00724`	RLAR: An Agentic Reward System for Multi-task Reinforcement Learning on Large Language Models PDF	cs.CL	90	Agentic, dynamic reward acquisition for RL alignment; tackles reward generalization & verifier synthesis.	alignment, RLHF, reward-models, agents, verifiers, tool-use
`2603.05026`	RepoLaunch: Automating Build&Test Pipeline of Code Repositories on ANY Language and ANY Platform PDF	cs.SE, cs.LG, cs.MA	90	Agent that auto-builds/tests any repo; enables scalable SWE benchmarks & training data pipelines	coding-agents, SWE-benchmarking, automation, build-and-test, evaluation-pipeline, datasets
`2603.01421`	SciDER: Scientific Data-centric End-to-end Researcher PDF	cs.AI, cs.CL	90	End-to-end LLM scientist that parses raw data, writes/executes code, with benchmarks and feedback loop	agents, scientific-discovery, tool-use, code-execution, memory, evaluation
`2603.01104`	Egocentric Co-Pilot: Web-Native Smart-Glasses Agents for Assistive Egocentric AI PDF	cs.HC, cs.AI, cs.CV, cs.CY	90	Smart-glasses LLM agent w/ long-horizon video reasoning + web tools; real-world agent safety stakes	agents, tool-use, multimodal, long-horizon, context-compression, assistive-tech, deployment
`2603.00634`	BLUFF: Benchmarking the Detection of False and Synthetic Content across 58 Low-Resource Languages PDF	cs.CL	89	Large benchmark for false/synthetic content across many low-resource languages.	benchmarks, misinformation, synthetic-text-detection, multilingual, evaluation
`2603.01012`	FastCode: Fast and Cost-Efficient Code Understanding and Reasoning PDF	cs.SE, cs.AI	88	Repo-scale code reasoning with cost-aware structure scouting; strong relevance to agent efficiency.	code-reasoning, agents, context-efficiency, repository-mapping, software-engineering
`2603.03906`	Measuring Privacy vs. Fidelity in Synthetic Social Media Datasets PDF	cs.CR	88	Measures privacy leakage in synthetic social media text via authorship attribution re-ID attacks	privacy, synthetic-data, LLMs, membership-inference, authorship-attribution, security
`2603.01203`	How Well Does Agent Development Reflect Real-World Work? PDF	cs.AI	88	Maps 43 agent benchmarks to 1,016 occupations; finds big mismatch vs real labor/economic value	agents, evaluation, benchmarks, labor-market, task-distribution, deployment-relevance
`2603.01501`	GAC: Stabilizing Asynchronous RL Training for LLMs via Gradient Alignment Control PDF	cs.LG, cs.AI	88	Stabilizes async RL for LLMs; identifies stale-aligned gradients and proposes control method	LLM-RL, asynchronous-RL, training-stability, policy-gradient, scaling
`2603.04814`	Beyond the Context Window: A Cost-Performance Analysis of Fact-Based Memory vs. Long-Context LLMs for Persistent Agents PDF	cs.CL	88	Direct cost/accuracy tradeoff study: long-context vs fact-memory for persistent agents on 3 benchmarks	agents, memory, long-context, cost-modeling, evaluation, RAG
`2603.01050`	MM-DeepResearch: A Simple and Effective Multimodal Agentic Search Baseline PDF	cs.CV, cs.AI	88	Multimodal deep-research agent baseline + synthetic search-intensive data/trajectories; reusable for agents eval.	agents, multimodal, tool-use, search, planning, datasets, benchmarks
`2603.01241`	TARSE: Test-Time Adaptation via Retrieval of Skills and Experience for Reasoning Agents PDF	cs.IR, cs.AI	88	Test-time retrieval of skills + verified reasoning trajectories to improve clinical reasoning agents	reasoning-agents, test-time-adaptation, retrieval, skills, experience, healthcare
`2603.04277`	VANGUARD: Vehicle-Anchored Ground Sample Distance Estimation for UAVs in GPS-Denied Environments PDF	cs.RO, cs.AI	87	Shows VLM spatial scale hallucinations; adds deterministic tool for metric grounding.	embodied-agents, hallucinations, tool-use, robot-safety, evaluation
`2603.01167`	DEP: A Decentralized Large Language Model Evaluation Protocol PDF	cs.CL	86	Decentralized LLM evaluation protocol aiming at reproducibility and reducing benchmark leakage risk.	evaluation, benchmarking, reproducibility, data-leakage, protocols
`2603.01455`	From Verbatim to Gist: Distilling Pyramidal Multimodal Memory via Semantic Information Bottleneck for Long-Horizon Video Agents PDF	cs.CV, cs.AI, cs.CL, cs.IR, cs.MM	86	Pyramidal multimodal memory for long-horizon video agents; distills verbatim→gist to fit context	agents, long-context, memory, multimodal, video-understanding, efficiency
`2603.00883`	Knowledge without Wisdom: Measuring Misalignment between LLMs and Intended Impact PDF	cs.LG, cs.AI, cs.CY, stat.AP	86	Shows benchmark success can be negatively aligned with learning outcomes; ensembles worsen misalignment	alignment, evaluation, OOD-generalization, education, impact-misalignment, reliability
`2603.04815`	EchoGuard: An Agentic Framework with Knowledge-Graph Memory for Detecting Manipulative Communication in Longitudinal Dialogue PDF	cs.AI	86	Agentic KG memory to detect manipulation over long dialogues; relevant to safety monitoring & oversight	agents, safety, long-context, memory, knowledge-graphs, monitoring, dialogue
`2603.01563`	LFPO: Likelihood-Free Policy Optimization for Masked Diffusion Models PDF	cs.LG, cs.AI	86	RLVR-style alignment for diffusion LLMs via likelihood-free policy optimization; potentially reusable method	alignment, RL, diffusion-LLM, RLVR, optimization, post-training
`2603.00590`	Fair in Mind, Fair in Action? A Synchronous Benchmark for Understanding and Generation in UMLLMs PDF	cs.AI	86	Fairness benchmark for multimodal LLMs covering understanding+generation with metric normalization framework.	fairness, evaluation, multimodal, benchmarks, bias, metrics
`2603.05068`	Cyber Threat Intelligence for Artificial Intelligence Systems PDF	cs.CR, cs.AI	86	Systematizes AI-focused cyber threat intelligence: assets, IoCs, supply-chain phases, workflows	AI-security, threat-intelligence, supply-chain, IoC, risk-management, governance
`2603.00856`	PARCER as an Operational Contract to Reduce Variance, Cost, and Risk in LLM Systems PDF	cs.SE, cs.AI	86	YAML operational contract to reduce variance/cost/risk + improve long-context reliability in LLM systems	LLM-systems, governance, reliability, prompting, long-context, auditability, cost-control

AI Paper Insight Brief

2026-03-09

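0) Executive takeaways (read this first)

"Non-standard language and register" has become a first-class jailbreak attack surface: prompts in classical or archaic style can achieve near-universal jailbreak success with very few queries and can even transfer across models, which suggests that many defenses may be overfit to modern language patterns.
Robustness is shifting from "better models" to "better systems": several papers show sizable gains from system-level interventions, including dynamic reward tools (RLAR), gradient-geometry stabilization (GAC), structured repository scouting (FastCode), and offline search engines (MM-DeepResearch), often improving quality while lowering cost.
Evaluation is becoming infrastructure, not just datasets: DEP proposes leakage-resistant benchmark servers; IRIS and BLUFF extend evaluation to multimodal fairness and long-tail multilingual misinformation; AgentSelect recasts evaluation artifacts as a recommendation benchmark for deployable agents.
Privacy and security threats are increasingly "second-order": indirect unlearning attacks can weaken other safety-critical classes; synthetic text can still leak author identity; and privacy-preserving inference is moving toward deployable obfuscation schemes that are compatible with existing serving stacks.
On metric tasks, tool-augmented "anti-hallucination" is winning: VANGUARD shows that VLMs hallucinate badly on spatial scale, while a deterministic geometric tool substantially reduces error. This reinforces a pattern: for safety-critical quantitative measurements, add verifiable tools rather than prompting harder.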

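2) Key themes (clusters)

Theme: Linguistic and stylistic jailbreak attack surface

Why it matters: Safety layers that work for mainstream English or modern Chinese may fail under stylistic compression and ambiguity (classical Chinese, and even other classical languages), enabling efficient, highly transferable black-box jailbreaks.
Representative papers:
- Obscure but Effective: Classical Chinese Jailbreak Prompt Optimization via Bio-Inspired Search
- BLUFF: Benchmarking the Detection of False and Synthetic Content across 58 Low-Resource Languages
Common approaches:
- Treat the attack as search/optimization over discrete prompt strategies (a structured strategy space plus black-box optimization).
- Use a translation/normalization pipeline to score cross-lingual outputs consistently.
- Measure transferability across multiple frontier models to estimate real-world risk.
Open questions / failure modes:
- How to build defenses that generalize across classical registers without over-blocking benign historical text.
- Whether translation-based filtering can meaningfully reduce risk without introducing new bypasses or false positives.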

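Theme: Robust unlearning in adversarial settings

Why it matters: A "forget this class" request can be weaponized to weaken other classes (an indirect unlearning attack), turning unlearning from a privacy feature into a security vulnerability.
Representative papers:
- ROKA: Robust Knowledge Unlearning against Adversaries
- Measuring Privacy vs. Fidelity in Synthetic Social Media Datasets
Common approaches:
- Formalize collateral damage (knowledge contamination/destruction) and design retention/repair objectives in addition to forgetting.
- Evaluate through the lens of distribution shift and imbalance (balanced prediction distributions; attribution attacks on synthetic text).
Open questions / failure modes:
- Strong dependence on the quality of retained/sibling data: a biased or incomplete retain set leads to under-repair or over-repair.
- The privacy "win" from synthetic text is partial: attribution accuracy drops but remains non-trivial, and the choice of fidelity level changes the risk.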
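
A minimal sketch of how the collateral-damage check described above could be run, assuming you already have per-example predictions from the same model before and after an unlearning request. The function names, the `tolerance` threshold, and the toy labels are illustrative assumptions; none of this is ROKA's own code.

```python
from collections import defaultdict


def per_class_accuracy(labels, preds):
    """Return {class: accuracy} for parallel lists of true labels and predictions."""
    correct, total = defaultdict(int), defaultdict(int)
    for y, p in zip(labels, preds):
        total[y] += 1
        correct[y] += int(y == p)
    return {c: correct[c] / total[c] for c in total}


def collateral_damage(labels, preds_before, preds_after, forget_classes, tolerance=0.05):
    """Return retained classes whose accuracy fell by more than `tolerance`.

    `tolerance` is a hypothetical threshold; pick one that fits your task.
    """
    before = per_class_accuracy(labels, preds_before)
    after = per_class_accuracy(labels, preds_after)
    return {
        c: before[c] - after[c]
        for c in before
        if c not in forget_classes and before[c] - after[c] > tolerance
    }


# Toy data: the request targets "cat", but "stop_sign" degrades as a side effect.
labels = ["cat", "cat", "dog", "dog", "stop_sign", "stop_sign"]
preds_before = ["cat", "cat", "dog", "dog", "stop_sign", "stop_sign"]
preds_after = ["dog", "dog", "dog", "dog", "stop_sign", "dog"]

damaged = collateral_damage(labels, preds_before, preds_after, forget_classes={"cat"})
print(damaged)
```

A non-empty result for a safety-critical class is the signal to investigate; tracking the overall prediction distribution alongside it would catch the imbalance case as well.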

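Theme: Agentic rewards and RL stability become scaling bottlenecks

Why it matters: As RL post-training scales up, two bottlenecks dominate: (1) reward generalization and cost; (2) asynchronous instability. Either can produce brittle policies or training collapse.
Representative papers:
Common approaches:
- Replace static reward models with dynamic tool selection/synthesis (code verifiers, wrapped reward checkpoints) plus verification gating.
- Stabilize asynchronous RL by controlling gradient geometry (cosine-alignment projection/skip mechanisms).
- For diffusion language models, bypass intractable likelihoods via logit/velocity-field objectives and variance-reduced sampling.
Open questions / failure modes:
- Reward-tool synthesis introduces new attack surface (e.g., retrieval/README manipulation).
- Asynchronous stabilization is demonstrated only in limited settings; behavior at large multi-node scale remains unverified in the analysis provided.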
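
To make the "cosine-alignment projection/skip" idea concrete, here is a rough sketch that compares each new gradient with the previous one and either keeps it, projects out the shared direction, or skips the step. The thresholds `skip_above` and `project_above`, the function names, and the exact decision rule are assumptions for illustration only; the GAC paper's actual control rule may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def control_gradient(grad, prev_grad, skip_above=0.95, project_above=0.5):
    """Return (action, gradient_to_apply) based on alignment with the previous gradient.

    Thresholds are hypothetical values, not taken from the paper.
    """
    c = cosine(grad, prev_grad)
    if c >= skip_above:
        # Almost identical direction to the last step: drop this update.
        return "skip", [0.0] * len(grad)
    if c >= project_above:
        # Remove the component of grad that lies along prev_grad.
        scale = sum(a * b for a, b in zip(grad, prev_grad)) / sum(b * b for b in prev_grad)
        return "project", [a - scale * b for a, b in zip(grad, prev_grad)]
    return "keep", list(grad)


prev = [1.0, 0.0]
for g in ([2.0, 0.0], [1.0, 1.0], [0.0, 1.0]):
    print(g, control_gradient(g, prev))
```

In a real training loop the vectors would be flattened parameter gradients, and logging the cosine value over time gives the monitoring signal even if no control is applied.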

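Theme: Evaluation and governance infrastructure (fairness, leakage, representativeness)

Why it matters: Benchmarks increasingly drive deployment decisions directly; leakage and representativeness failures distort judgments of progress and misallocate investment.
Representative papers:
Common approaches:
- Move evaluation logic and answers server-side to reduce leakage and standardize pipelines (protocol + toolkit).
- Evaluate fairness synchronously: across tasks (generation + understanding) and across multiple notions of fairness.
- Map benchmarks onto an external taxonomy (O*NET) to quantify coverage skew and define autonomy and complexity.
- Convert heterogeneous evaluation artifacts into query-conditioned recommendation supervision aimed at deployable agents.
Open questions / failure modes:
- Protocol adoption is a coordination problem: its value depends on the number of servers/benchmarks that are packaged.
- Automatic annotators (e.g., demographic classifiers) can inject measurement bias; steerability metrics need stronger validation.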
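
The server-side evaluation idea in the list above can be illustrated with a toy scorer that keeps the answer key private and returns only aggregate results. The class name `EvalServer` and its interface are hypothetical; this is not the DEP protocol itself, only a sketch of the principle that gold answers never leave the server.

```python
class EvalServer:
    """Holds the answer key privately and exposes only aggregate scores."""

    def __init__(self, answer_key):
        # The key stays on the server; clients never receive it.
        self._answer_key = dict(answer_key)

    def score(self, predictions):
        """Score a {item_id: prediction} dict; unanswered items count as wrong."""
        n_items = len(self._answer_key)
        if n_items == 0:
            return {"n_items": 0, "accuracy": 0.0}
        correct = sum(
            1
            for item_id, gold in self._answer_key.items()
            if predictions.get(item_id) == gold
        )
        # Only aggregates are returned, so per-item gold labels are not exposed directly.
        return {"n_items": n_items, "accuracy": correct / n_items}


server = EvalServer({"q1": "B", "q2": "D", "q3": "A", "q4": "C"})
report = server.score({"q1": "B", "q2": "A", "q3": "A"})
print(report)
```

A production service would also need rate limiting, since repeated queries with single-item changes can still reveal individual answers through the aggregate.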

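Theme: Long-horizon agents: memory, context, and cost

Why it matters: Persistent agents face hard limits from context windows, cost, and long-horizon temporal reasoning; several papers propose structured memory and cost-aware context acquisition.
Representative papers:
Common approaches:
- Hierarchical memory (verbatim → gist), with uncertainty/entropy-gated retrieval to control compute.
- Offline corpora/search engines to enable low-cost RL and multi-turn tool learning.
- "Scouting-first" metadata/graph navigation to reduce repeated full-text ingestion by code agents.
- Explicit cost models (prompt caching) to compute the break-even turn count for memory vs. long context.
Open questions / failure modes:
- Flat fact extraction can lose temporal/coreference cues; on some benchmarks memory accuracy trails long context.
- Offline search introduces an offline/online gap; a stale corpus caps achievable performance.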
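
As a rough illustration of the break-even calculation mentioned above, the sketch below compares cumulative input cost for re-sending a long context every turn (with a cache discount after the first turn) against a one-time memory build plus a small retrieved context per turn. The cost model, parameter names, and all numbers are simplifying assumptions; they are not taken from the paper or from any provider's price list.

```python
def long_context_cost(turns, context_tokens, price_per_token, cached_discount):
    """Turn 1 pays full price; later turns pay the discounted cached rate."""
    if turns <= 0:
        return 0.0
    first = context_tokens * price_per_token
    later = (turns - 1) * context_tokens * price_per_token * cached_discount
    return first + later


def memory_cost(turns, build_tokens, retrieved_tokens, price_per_token):
    """One-time memory build plus a small retrieved context on every turn."""
    return (build_tokens + turns * retrieved_tokens) * price_per_token


def break_even_turn(context_tokens, build_tokens, retrieved_tokens,
                    price_per_token, cached_discount, max_turns=10_000):
    """First turn count at which memory is strictly cheaper, or None if never."""
    for n in range(1, max_turns + 1):
        mem = memory_cost(n, build_tokens, retrieved_tokens, price_per_token)
        ctx = long_context_cost(n, context_tokens, price_per_token, cached_discount)
        if mem < ctx:
            return n
    return None


# Hypothetical inputs: 100k-token history, 150k tokens to build memory,
# 2k retrieved tokens per turn, cached tokens billed at 10% of the base price.
turn = break_even_turn(
    context_tokens=100_000,
    build_tokens=150_000,
    retrieved_tokens=2_000,
    price_per_token=1e-6,
    cached_discount=0.1,
)
print(turn)  # 8 under these assumed numbers
```

The result is sensitive to the cache discount: with aggressive caching the long-context option can stay cheaper for many turns, so the accuracy gap should be weighed alongside the cost.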

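3) Technical synthesis

Several works converge on structured intermediate representations as a lever for robustness: CC-BOS uses an 8-dimensional prompt strategy vector; TARSE uses step-indexed LogicalChains plus skills; FastCode uses a multi-layer code graph; MM-Mem uses perceptual/episodic/schema layers; EchoGuard uses episodic/semantic knowledge graphs (KGs).
Optimization is going "closed-loop": CC-BOS optimizes prompts in a black-box manner; ∇-Reasoner optimizes logits at test time; LFPO optimizes diffusion logits/velocity fields; GAC modifies gradients during training to prevent collapse.
Verification gating is becoming standard: RLAR's EvalTool verification, RepoLaunch's Verify Agent, PARCER's verification gates, and VANGUARD's confidence scores all encode "distrust the model by default."
Cost and latency are treated as first-class metrics (not afterthoughts): RLAR reports substantially fewer tokens/GPU-hours than judge-based RLAIF; MM-DeepResearch quantifies online vs. offline cost/time; the memory vs. long-context work gives explicit break-even turn counts; FastCode targets context assembly with single-pass ingestion.
Cross-lingual and long-tail generalization is repeatedly shown to be weak: BLUFF quantifies large F1 drops on long-tail languages; CC-BOS demonstrates archaic-language bypasses; both imply that safety and detection tools must be evaluated beyond high-resource languages.
"Proxy alignment" failures are empirically visible: the classroom-transcript study shows that FM agreement, and even agreement with expert rubrics, can diverge from the intended impact (student learning gains), a warning against over-reliance on proxy metrics.
Asynchrony introduces a distinct RL failure mode (stale yet aligned gradients), not merely "off-policy" drift: GAC targets gradient geometry rather than only performing distribution correction.
For metric quantities, tool augmentation beats end-to-end VLM reasoning: VANGUARD's deterministic GSD estimation outperforms VLM area estimation, reinforcing a design pattern for embodied safety.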
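
The "deterministic tool plus confidence gate" pattern from the last point can be sketched as follows, assuming ground sample distance (GSD, metres per pixel) is derived from detected vehicles of roughly known length. The assumed vehicle length, the spread-based confidence rule, and the function names are illustrative assumptions; VANGUARD's actual estimator and confidence score are not specified here.

```python
from statistics import median

ASSUMED_VEHICLE_LENGTH_M = 4.5  # hypothetical prior for a typical passenger car


def estimate_gsd(vehicle_lengths_px, vehicle_length_m=ASSUMED_VEHICLE_LENGTH_M,
                 max_rel_spread=0.2):
    """Return (gsd_m_per_px, confident) from detected vehicle lengths in pixels.

    `confident` is False when detections disagree by more than `max_rel_spread`
    (a hypothetical gate), or when there are no usable detections.
    """
    gsds = [vehicle_length_m / px for px in vehicle_lengths_px if px > 0]
    if not gsds:
        return None, False
    gsd = median(gsds)
    rel_spread = (max(gsds) - min(gsds)) / gsd
    return gsd, rel_spread <= max_rel_spread


def area_m2(pixel_area, gsd):
    """Convert an area in pixels to square metres using the estimated GSD."""
    return pixel_area * gsd ** 2


gsd, confident = estimate_gsd([90, 90, 90])
if confident:
    print(round(area_m2(40_000, gsd), 3))
else:
    print("low confidence: route to fallback behavior")
```

The point of the gate is that a disagreeing set of detections, such as `[90, 45]`, yields `confident == False`, so the caller can fall back instead of trusting a numeric estimate.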

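4) Top 5 papers (with "why now")

1) Obscure but Effective: Classical Chinese Jailbreak Prompt Optimization via Bio-Inspired Search

Shows that classical/archaic language is a major safety blind spot; reports 100% ASR on multiple frontier models in its setting.
Provides a structured 8-dimensional prompt strategy space plus a black-box FOA optimizer, with very low reported query counts.
Demonstrates cross-model transferability and applicability to other classical languages (Latin, Sanskrit).
Caveats: results depend on the chosen benchmark subset and closed-source victim models; combined defenses/translation filtering can lower ASR.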

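2) ROKA: Robust Knowledge Unlearning against Adversaries

Introduces the indirect unlearning attack: unlearning requests can be used to weaken other safety-critical classes.
Proposes Neural Healing / contribution redistribution, using targeted/untargeted randomized algorithms to preserve retained knowledge.
Evaluated on vision, multimodal, and LLM settings (including Llama 3.2 on MMLU), with better stability/balance than GA unlearning.
Caveats: exact redistribution is infeasible; effectiveness depends on how representative the sibling/retain data is.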

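3) RLAR: An Agentic Reward System for Multi-task Reinforcement Learning on Large Language Models

Makes reward modeling adaptive and tool-based (wrapped reward checkpoints; generated code verifiers) rather than static.
Reports strong multi-domain RL gains (e.g., the GSM8K improvements in Table 2) and large cost reductions relative to RLAIF with a GPT-5 judge.
High reward-routing accuracy on REWARDBENCH-V2 (average precision 90.44%).
Caveats: relies on web retrieval and repository documentation; the authors note susceptibility to "readme hacking."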

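4) GAC: Stabilizing Asynchronous RL Training for LLMs via Gradient Alignment Control

Identifies a concrete instability mechanism in asynchronous RL: persistently aligned consecutive gradients appear before collapse.
Provides low-overhead projection/skip control that largely closes the gap to synchronous GRPO in the presence of staleness (Table 1).
Theoretically grounded, linking projection to bias reduction in the convergence bound.
Caveats: experiments are reported only on a single-machine 8-GPU setup; large-scale distributed behavior is not shown.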

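5) BLUFF: Benchmarking the Detection of False and Synthetic Content across 58 Low-Resource Languages

Provides a large multilingual benchmark (201K samples, 78 languages) with controlled manipulations and author types.
Quantifies cross-lingual transfer degradation: drops of up to 25.3 F1 points on long-tail languages; zero-shot decoders are often near chance on multi-class classification.
Provides an agentic generation pipeline (AXL-CoI) and a heavy multilingual quality filter (mPURIFY), with reported retention statistics.
Caveats: gaps remain in geographic/syntactic coverage; decoder models are evaluated zero-shot only.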

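5) Practical next steps

Add classical/stylistic-transfer red-teaming to safety evaluation suites (e.g., classical-Chinese-style compression/ambiguity), and measure transfer across models and across defenses.
For unlearning pipelines, explicitly test for indirect unlearning attacks: request forgetting of benign/unrelated classes and measure degradation on safety-critical classes; track prediction-distribution imbalance.
If you run large-scale RL post-training, monitor gradient cosine similarity over time in asynchronous settings; try GAC-style projection/skip control before chasing reward-model fixes.
Replace a monolithic reward model with a reward toolset: integrate code verifiers for deterministic tasks, and add verification gating before admitting new reward tools (the RLAR pattern).
For persistent agents, compute your cost break-even point (turns × context length) under your provider's caching rules; decide when to switch from long context to memory, and measure the accuracy loss.
For embodied/metric tasks, prefer deterministic perception skills plus confidence gating (the VANGUARD pattern) over VLM-only numeric estimates; route uncertain cases to fallback behavior.
For multilingual misinformation/synthetic-content detection, evaluate detectors in a high-resource → long-tail transfer setting (BLUFF-style) rather than only on multilingual in-distribution splits.
If you publish a benchmark, consider server-side evaluation (DEP-style) to reduce leakage/contamination and lower integration cost for users.

Generated from per-paper analyses; no external browsing.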

Di Tang
