Daily AI Paper Report (2026-03-09)


Run stats

  • Candidates: 1352
  • Selected: 30
  • Deepread completed: 30
  • Window (UTC): 2026-03-06T01:00:00Z → 2026-03-07T01:00:00Z (weekend_backlog_sat, expanded=0)
Selected papers

  • 2602.22983 · Obscure but Effective: Classical Chinese Jailbreak Prompt Optimization via Bio-Inspired Search (cs.AI, cs.CR · score 93)
    Why: Automated classical-Chinese jailbreak optimization; highlights multilingual safety gaps.
    Tags: jailbreaks, prompt-optimization, multilingual, red-teaming, llm-security
  • 2603.00529 · CaptionFool: Universal Image Captioning Model Attacks (cs.CV, cs.AI · score 93)
    Why: Universal adversarial attack on captioners (94–96%); can induce offensive captions & evade filters.
    Tags: multimodal-security, adversarial-attacks, image-captioning, robustness, content-moderation, red-teaming
  • 2603.05344 · Building AI Coding Agents for the Terminal: Scaffolding, Harness, Context Engineering, and Lessons Learned (cs.AI · score 92)
    Why: Open-source terminal coding agent with explicit safety controls + context management lessons.
    Tags: coding-agents, tool-use, agent-architecture, safety-controls, context-engineering, open-source
  • 2603.01712 · FT-Dojo: Towards Autonomous LLM Fine-Tuning with Language Agents (cs.AI, cs.LG · score 92)
    Why: Benchmark for end-to-end autonomous LLM fine-tuning with agents; realistic tooling + iteration loop.
    Tags: agents, auto-ML, fine-tuning, benchmark, evaluation, tool-use
  • 2603.00436 · ROKA: Robust Knowledge Unlearning against Adversaries (cs.LG, cs.AI · score 90)
    Why: Unlearning can induce new attacks; proposes robust unlearning framework to mitigate.
    Tags: machine-unlearning, privacy, security, backdoors, robustness
  • 2603.03761 · AgentSelect: Benchmark for Narrative Query-to-Agent Recommendation (cs.AI, cs.IR · score 90)
    Why: Benchmark for query-conditioned agent configuration recommendation; fills key gap for agent ecosystems.
    Tags: agents, benchmark, tool-selection, evaluation, recommendation
  • 2603.01724 · GMP: A Benchmark for Content Moderation under Co-occurring Violations and Dynamic Rules (cs.AI · score 90)
    Why: Content moderation benchmark with co-occurring harms + dynamic policies; closer to real deployment.
    Tags: safety, content-moderation, evaluation, policy-following, robustness, benchmarks
  • 2603.04948 · ∇-Reasoner: LLM Reasoning via Test-Time Gradient Descent in Latent Space (cs.LG · score 90)
    Why: Test-time latent gradient descent for LLM reasoning; potentially strong inference-time scaling method.
    Tags: llm, reasoning, test-time-compute, decoding, optimization, reward-model
  • 2603.01499 · Towards Privacy-Preserving LLM Inference via Collaborative Obfuscation (Technical Report) (cs.CR, cs.AI · score 90)
    Why: Practical privacy-preserving LLM inference proposal targeting accuracy, clusters, and infra compatibility.
    Tags: privacy, secure-inference, llm-serving, systems, confidential-computing
  • 2603.00724 · RLAR: An Agentic Reward System for Multi-task Reinforcement Learning on Large Language Models (cs.CL · score 90)
    Why: Agentic, dynamic reward acquisition for RL alignment; tackles reward generalization & verifier synthesis.
    Tags: alignment, RLHF, reward-models, agents, verifiers, tool-use
  • 2603.05026 · RepoLaunch: Automating Build&Test Pipeline of Code Repositories on ANY Language and ANY Platform (cs.SE, cs.LG, cs.MA · score 90)
    Why: Agent that auto-builds/tests any repo; enables scalable SWE benchmarks & training data pipelines.
    Tags: coding-agents, SWE-benchmarking, automation, build-and-test, evaluation-pipeline, datasets
  • 2603.01421 · SciDER: Scientific Data-centric End-to-end Researcher (cs.AI, cs.CL · score 90)
    Why: End-to-end LLM scientist that parses raw data, writes/executes code, with benchmarks and feedback loop.
    Tags: agents, scientific-discovery, tool-use, code-execution, memory, evaluation
  • 2603.01104 · Egocentric Co-Pilot: Web-Native Smart-Glasses Agents for Assistive Egocentric AI (cs.HC, cs.AI, cs.CV, cs.CY · score 90)
    Why: Smart-glasses LLM agent with long-horizon video reasoning + web tools; real-world agent safety stakes.
    Tags: agents, tool-use, multimodal, long-horizon, context-compression, assistive-tech, deployment
  • 2603.00634 · BLUFF: Benchmarking the Detection of False and Synthetic Content across 58 Low-Resource Languages (cs.CL · score 89)
    Why: Large benchmark for false/synthetic content across many low-resource languages.
    Tags: benchmarks, misinformation, synthetic-text-detection, multilingual, evaluation
  • 2603.01012 · FastCode: Fast and Cost-Efficient Code Understanding and Reasoning (cs.SE, cs.AI · score 88)
    Why: Repo-scale code reasoning with cost-aware structure scouting; strong relevance to agent efficiency.
    Tags: code-reasoning, agents, context-efficiency, repository-mapping, software-engineering
  • 2603.03906 · Measuring Privacy vs. Fidelity in Synthetic Social Media Datasets (cs.CR · score 88)
    Why: Measures privacy leakage in synthetic social media text via authorship attribution re-ID attacks.
    Tags: privacy, synthetic-data, LLMs, membership-inference, authorship-attribution, security
  • 2603.01203 · How Well Does Agent Development Reflect Real-World Work? (cs.AI · score 88)
    Why: Maps 43 agent benchmarks to 1,016 occupations; finds big mismatch vs real labor/economic value.
    Tags: agents, evaluation, benchmarks, labor-market, task-distribution, deployment-relevance
  • 2603.01501 · GAC: Stabilizing Asynchronous RL Training for LLMs via Gradient Alignment Control (cs.LG, cs.AI · score 88)
    Why: Stabilizes async RL for LLMs; identifies stale-aligned gradients and proposes a control method.
    Tags: LLM-RL, asynchronous-RL, training-stability, policy-gradient, scaling
  • 2603.04814 · Beyond the Context Window: A Cost-Performance Analysis of Fact-Based Memory vs. Long-Context LLMs for Persistent Agents (cs.CL · score 88)
    Why: Direct cost/accuracy tradeoff study: long-context vs fact-memory for persistent agents on 3 benchmarks.
    Tags: agents, memory, long-context, cost-modeling, evaluation, RAG
  • 2603.01050 · MM-DeepResearch: A Simple and Effective Multimodal Agentic Search Baseline (cs.CV, cs.AI · score 88)
    Why: Multimodal deep-research agent baseline + synthetic search-intensive data/trajectories; reusable for agent evaluation.
    Tags: agents, multimodal, tool-use, search, planning, datasets, benchmarks
  • 2603.01241 · TARSE: Test-Time Adaptation via Retrieval of Skills and Experience for Reasoning Agents (cs.IR, cs.AI · score 88)
    Why: Test-time retrieval of skills + verified reasoning trajectories to improve clinical reasoning agents.
    Tags: reasoning-agents, test-time-adaptation, retrieval, skills, experience, healthcare
  • 2603.04277 · VANGUARD: Vehicle-Anchored Ground Sample Distance Estimation for UAVs in GPS-Denied Environments (cs.RO, cs.AI · score 87)
    Why: Shows VLM spatial scale hallucinations; adds deterministic tool for metric grounding.
    Tags: embodied-agents, hallucinations, tool-use, robot-safety, evaluation
  • 2603.01167 · DEP: A Decentralized Large Language Model Evaluation Protocol (cs.CL · score 86)
    Why: Decentralized LLM evaluation protocol aiming at reproducibility and reducing benchmark leakage risk.
    Tags: evaluation, benchmarking, reproducibility, data-leakage, protocols
  • 2603.01455 · From Verbatim to Gist: Distilling Pyramidal Multimodal Memory via Semantic Information Bottleneck for Long-Horizon Video Agents (cs.CV, cs.AI, cs.CL, cs.IR, cs.MM · score 86)
    Why: Pyramidal multimodal memory for long-horizon video agents; distills verbatim→gist to fit context.
    Tags: agents, long-context, memory, multimodal, video-understanding, efficiency
  • 2603.00883 · Knowledge without Wisdom: Measuring Misalignment between LLMs and Intended Impact (cs.LG, cs.AI, cs.CY, stat.AP · score 86)
    Why: Shows benchmark success can be negatively aligned with learning outcomes; ensembles worsen misalignment.
    Tags: alignment, evaluation, OOD-generalization, education, impact-misalignment, reliability
  • 2603.04815 · EchoGuard: An Agentic Framework with Knowledge-Graph Memory for Detecting Manipulative Communication in Longitudinal Dialogue (cs.AI · score 86)
    Why: Agentic KG memory to detect manipulation over long dialogues; relevant to safety monitoring & oversight.
    Tags: agents, safety, long-context, memory, knowledge-graphs, monitoring, dialogue
  • 2603.01563 · LFPO: Likelihood-Free Policy Optimization for Masked Diffusion Models (cs.LG, cs.AI · score 86)
    Why: RLVR-style alignment for diffusion LLMs via likelihood-free policy optimization; potentially reusable method.
    Tags: alignment, RL, diffusion-LLM, RLVR, optimization, post-training
  • 2603.00590 · Fair in Mind, Fair in Action? A Synchronous Benchmark for Understanding and Generation in UMLLMs (cs.AI · score 86)
    Why: Fairness benchmark for multimodal LLMs covering understanding + generation, with a metric normalization framework.
    Tags: fairness, evaluation, multimodal, benchmarks, bias, metrics
  • 2603.05068 · Cyber Threat Intelligence for Artificial Intelligence Systems (cs.CR, cs.AI · score 86)
    Why: Systematizes AI-focused cyber threat intelligence: assets, IoCs, supply-chain phases, workflows.
    Tags: AI-security, threat-intelligence, supply-chain, IoC, risk-management, governance
  • 2603.00856 · PARCER as an Operational Contract to Reduce Variance, Cost, and Risk in LLM Systems (cs.SE, cs.AI · score 86)
    Why: YAML operational contract to reduce variance/cost/risk + improve long-context reliability in LLM systems.
    Tags: LLM-systems, governance, reliability, prompting, long-context, auditability, cost-control

AI Paper Insight Brief

2026-03-09

0) Executive takeaways (read this first)

  • “Non-standard language/style” is now a first-class jailbreak surface: classical/archaic language prompts can drive near-universal jailbreak success with very low query counts, and even transfer across models—suggesting many defenses are overfit to modern-language patterns.
  • Robustness is shifting from “better models” to “better systems”: multiple papers show large gains from system-level interventions—dynamic reward tooling (RLAR), gradient-geometry stabilization (GAC), structured codebase scouting (FastCode), and offline search engines (MM-DeepResearch)—often cutting cost while improving quality.
  • Evaluation is becoming infrastructure, not just datasets: DEP proposes leak-resistant benchmark servers; IRIS and BLUFF expand evaluation into multimodal fairness and long-tail multilingual disinformation; AgentSelect reframes evaluation artifacts into a recommendation benchmark for deployable agents.
  • Privacy/security threats are increasingly “second-order”: indirect unlearning attacks can degrade other security-critical classes; synthetic text can still leak author identity; and privacy-preserving inference is moving toward deployable obfuscation compatible with existing serving stacks.
  • Tool-augmented “anti-hallucination” is winning in metric domains: VANGUARD shows VLMs hallucinate spatial scale badly, while a deterministic geometric tool sharply reduces error—reinforcing a pattern: for safety-critical quantities, add verifiable tools rather than prompt harder.

2) Key themes (clusters)

Theme: Linguistic & stylistic jailbreak surfaces

  • Why it matters: Safety layers that work in mainstream English/modern Chinese can fail under stylistic compression/ambiguity (classical Chinese; even other classical languages), enabling efficient black-box jailbreaks with high transfer.
  • Representative papers: Obscure but Effective: Classical Chinese Jailbreak Prompt Optimization via Bio-Inspired Search (2602.22983)
  • Common approach:
    • Treat attacks as search/optimization over discrete prompt strategies (structured strategy spaces + black-box optimization); a minimal search sketch follows at the end of this theme.
    • Use translation/normalization pipelines to score cross-lingual outputs consistently.
    • Measure transferability across multiple frontier models to estimate real-world risk.
  • Open questions / failure modes:
    • How to build defenses that generalize across archaic styles without overblocking benign historical text.
    • Whether translation-based filtering meaningfully reduces risk without introducing new bypasses or false positives.
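
A minimal sketch of the common attack pattern above: greedy black-box search over a discrete prompt-strategy space. The strategy dimensions, the `score_attack` oracle, and the (1+1) mutation scheme here are illustrative placeholders, not the paper's FOA optimizer or its actual 8D strategy space.

```python
import random

# Hypothetical discrete strategy space; the paper's 8D space differs.
STRATEGY_SPACE = {
    "style":       ["classical_prose", "poetry", "chronicle"],
    "persona":     ["historian", "scribe", "none"],
    "framing":     ["allegory", "dialogue", "direct"],
    "obfuscation": ["archaic_lexicon", "rare_characters", "none"],
}

def mutate(strategy: dict) -> dict:
    """Resample one randomly chosen dimension (a black-box local move)."""
    child = dict(strategy)
    dim = random.choice(list(STRATEGY_SPACE))
    child[dim] = random.choice(STRATEGY_SPACE[dim])
    return child

def optimize(score_attack, budget: int = 50):
    """Greedy (1+1) search: keep a mutation iff the oracle scores it higher.

    score_attack(strategy) -> float is the expensive black-box oracle:
    render the strategy into a prompt, query the victim model, and judge
    the output (e.g., via a translation/normalization scoring pipeline).
    """
    best = {d: random.choice(v) for d, v in STRATEGY_SPACE.items()}
    best_score = score_attack(best)
    for _ in range(budget - 1):
        cand = mutate(best)
        s = score_attack(cand)
        if s > best_score:
            best, best_score = cand, s
    return best, best_score
```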

Theme: Robust unlearning under adversaries

  • Why it matters: “Forget this class” requests can be weaponized to degrade other classes (indirect unlearning attack), turning unlearning into a security vulnerability rather than a privacy feature.
  • Representative papers: ROKA (2603.00436); Measuring Privacy vs. Fidelity in Synthetic Social Media Datasets (2603.03906)
  • Common approach:
    • Formalize collateral damage (knowledge contamination/destruction) and design preservation/healing objectives alongside forgetting (a loss-shape sketch follows at the end of this theme).
    • Evaluate with distributional shift / imbalance lenses (balanced prediction distributions; attribution attacks on synthetic text).
  • Open questions / failure modes:
    • Dependence on retain/sibling data quality: biased or incomplete retain sets can cause under/over-healing.
    • Privacy “wins” from synthetic text are partial: attribution accuracy drops but remains non-trivial, and fidelity choices move the risk.
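
To make the "preservation/healing objectives alongside forgetting" idea concrete, here is a generic unlearning step that ascends the loss on the forget set while anchoring retained classes. This is only the baseline shape of the objective; ROKA's actual Neural Healing / contribution re-allocation is more involved.

```python
import torch.nn.functional as F

def unlearn_step(model, forget_batch, retain_batch, optimizer, lam=1.0):
    """One update combining forgetting with retain-set preservation."""
    fx, fy = forget_batch
    rx, ry = retain_batch
    # Ascend cross-entropy on the forget set (negated loss)...
    forget_loss = -F.cross_entropy(model(fx), fy)
    # ...while penalizing drift on retained (sibling) classes -- biased or
    # incomplete retain sets are exactly where under/over-healing bites.
    retain_loss = F.cross_entropy(model(rx), ry)
    loss = forget_loss + lam * retain_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return forget_loss.item(), retain_loss.item()
```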

Theme: Agentic reward & RL stability as scaling bottlenecks

  • Representative papers: RLAR (2603.00724), GAC (2603.01501), LFPO (2603.01563)

Theme: Evaluation & governance infrastructure (fairness, leakage, representativeness)

  • Representative papers: DEP (2603.01167), BLUFF (2603.00634), Fair in Mind, Fair in Action? (2603.00590), How Well Does Agent Development Reflect Real-World Work? (2603.01203)

Theme: Long-horizon agents: memory, context, and cost

  • Representative papers: Beyond the Context Window (2603.04814), From Verbatim to Gist (2603.01455), EchoGuard (2603.04815)

3) Technical synthesis

  • Several works converge on structured intermediate representations as the lever for robustness: CC-BOS uses an 8D prompt strategy vector; TARSE uses step-indexed LogicalChains + skills; FastCode uses multi-layer code graphs; MM-Mem uses sensory/episodic/schema layers; EchoGuard uses episodic/semantic KGs.
  • Optimization is moving “inside the loop”: CC-BOS optimizes prompts black-box; ∇-Reasoner optimizes logits at test time; LFPO optimizes diffusion logits/velocity fields; GAC modifies gradients during training to prevent collapse.
  • Verification gates are becoming standard: RLAR’s EvalTool verification, RepoLaunch’s Verify Agent, PARCER’s validation gates, and VANGUARD’s confidence score all encode “don’t trust the model by default.”
  • Cost/latency is treated as a first-class metric (not an afterthought): RLAR reports large token/GPU-hour reductions vs judge-based RLAIF; MM-DeepResearch quantifies online vs offline cost/time; memory-vs-long-context work gives explicit break-even turns; FastCode targets single-ingestion context assembly.
  • Cross-lingual and long-tail generalization is repeatedly shown to be weak: BLUFF quantifies large F1 drops for long-tail languages; CC-BOS shows archaic-language bypass; both imply safety and detection tooling must be evaluated beyond high-resource languages.
  • “Proxy alignment” failures are empirically visible: classroom transcript study shows FM agreement and even expert-rubric alignment can diverge from intended impact (student learning gains), warning against over-reliance on proxy metrics.
  • Asynchrony introduces a distinct RL failure mode (stale-aligned gradients) that is not just “off-policy”: GAC targets gradient geometry rather than only distribution correction.
  • Tool augmentation beats end-to-end VLM reasoning for metric quantities: VANGUARD's deterministic GSD estimation outperforms VLM area estimation, reinforcing a design pattern for embodied safety (a routing sketch follows below).
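
The last bullet suggests a simple routing pattern worth copying: trust a deterministic tool when its confidence clears a gate, otherwise fall back and flag the output as unverified. The function names and threshold below are assumptions for illustration, not VANGUARD's interface.

```python
def estimate_metric_quantity(image, deterministic_tool, vlm_fallback,
                             conf_threshold=0.7):
    """Prefer a verifiable tool for metric quantities; gate on confidence.

    deterministic_tool(image) -> (value, confidence) stands in for something
    like a vehicle-anchored GSD estimator; vlm_fallback(image) -> value is
    the untrusted end-to-end estimator.
    """
    value, confidence = deterministic_tool(image)
    if confidence >= conf_threshold:
        return {"value": value, "source": "tool", "confidence": confidence}
    # Low confidence: use the fallback but mark the answer as unverified
    # so downstream behaviors can route around it.
    return {"value": vlm_fallback(image), "source": "vlm_unverified",
            "confidence": confidence}
```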

4) Top 5 papers (with “why now”)

1) Obscure but Effective: Classical Chinese Jailbreak Prompt Optimization via Bio-Inspired Search

  • Shows classical/archaic language is a major safety blind spot; reports 100% ASR across several frontier models in their setting.
  • Provides a structured 8D prompt strategy space + black-box FOA optimizer with very low reported query counts.
  • Demonstrates cross-model transferability and applicability to other classical languages (Latin, Sanskrit).
  • Skepticism: results rely on selected benchmark subsets and closed-source victims; combined defenses/translation filtering can reduce ASR.

2) ROKA: Robust Knowledge Unlearning against Adversaries

  • Introduces the indirect unlearning attack: unlearning requests can be used to degrade other security-critical classes.
  • Proposes Neural Healing / contribution re-allocation with targeted/non-targeted stochastic algorithms to preserve retained knowledge.
  • Evaluated across vision, multimodal, and LLM settings (including Llama 3.2 on MMLU) with improved stability/balance vs gradient-ascent (GA) unlearning.
  • Skepticism: exact re-allocation is infeasible; effectiveness depends on sibling/retain data representativeness.

3) RLAR: An Agentic Reward System for Multi-task Reinforcement Learning on Large Language Models

  • Makes reward modeling adaptive and tool-based (wrap reward checkpoints; generate code verifiers) rather than static; a toy toolbox sketch follows this entry.
  • Reports strong multi-domain RL gains (e.g., GSM8K improvements in Table 2) and large cost reductions vs GPT-5 judge RLAIF.
  • Reward routing accuracy on REWARDBENCH-V2 is high (90.44% avg precision).
  • Skepticism: relies on web retrieval and repository documentation; vulnerable to “readme hacking” per authors.
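
A toy version of the reward-toolset idea, under assumptions: every name here is invented, and RLAR's actual routing and EvalTool verification are richer. The point is the shape: deterministic verifiers for checkable domains, plus a gate a candidate tool must pass before it is admitted.

```python
from typing import Callable

class RewardToolbox:
    """Route tasks to reward tools; admit new tools only via a gate."""

    def __init__(self):
        self.tools: dict[str, Callable[[str, str], float]] = {}

    def register(self, domain, tool, gate_cases, tol=0.0) -> bool:
        # Verification gate: the tool must reproduce known
        # (prompt, response, reward) triples before being trusted.
        if all(abs(tool(p, r) - want) <= tol for p, r, want in gate_cases):
            self.tools[domain] = tool
            return True
        return False

    def score(self, domain, prompt, response) -> float:
        return self.tools[domain](prompt, response)

# A deterministic verifier for exact-answer tasks (ground truth packed
# into the prompt here only to keep the example self-contained).
def math_verifier(prompt: str, response: str) -> float:
    answer = prompt.split("answer=")[-1].strip()
    return 1.0 if answer in response else 0.0

box = RewardToolbox()
assert box.register("math", math_verifier,
                    gate_cases=[("2+2 answer=4", "the answer is 4", 1.0),
                                ("2+2 answer=4", "it is 5", 0.0)])
```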

4) GAC: Stabilizing Asynchronous RL Training for LLMs via Gradient Alignment Control

  • Identifies a concrete instability mechanism in async RL: persistently aligned consecutive gradients preceding collapse.
  • Provides a low-overhead projection/skip control that largely closes the gap to synchronized GRPO under staleness (Table 1); a rough sketch of the control loop follows this entry.
  • Backed by theory linking projection to bias reduction in convergence bounds.
  • Skepticism: experiments reported on a single-machine 8-GPU setup; large-scale distributed behavior not shown.
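
A rough reconstruction of the control idea, not the paper's exact rule: track cosine similarity between consecutive gradients, and once alignment persists past a patience window, project out the shared component (skipping the update entirely is the other option the bullet mentions). Names and thresholds are assumptions.

```python
import torch

class GradientAlignmentControl:
    """Intervene when consecutive gradients stay suspiciously aligned."""

    def __init__(self, cos_threshold=0.9, patience=3):
        self.prev = None
        self.streak = 0
        self.cos_threshold = cos_threshold
        self.patience = patience

    def apply(self, grad: torch.Tensor) -> torch.Tensor:
        if self.prev is not None:
            cos = torch.nn.functional.cosine_similarity(
                grad.flatten(), self.prev.flatten(), dim=0)
            self.streak = self.streak + 1 if cos > self.cos_threshold else 0
            if self.streak >= self.patience:
                # Persistently aligned: remove the component along the
                # previous gradient direction (a cheap projection).
                u = self.prev.flatten() / (self.prev.norm() + 1e-8)
                flat = grad.flatten()
                grad = (flat - (flat @ u) * u).view_as(grad)
        self.prev = grad.detach().clone()
        return grad
```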

5) BLUFF: Benchmarking the Detection of False and Synthetic Content across 58 Low-Resource Languages

  • Delivers a large multilingual benchmark (201K samples, 78 languages) with controlled manipulations and authorship types.
  • Quantifies cross-lingual transfer degradation of up to 25.3 F1 points for long-tail languages; decoder zero-shot is often near random on multiclass (a transfer-split sketch follows this entry).
  • Provides an agentic generation pipeline (AXL-CoI) and a heavy multilingual quality filter (mPURIFY) with reported retention stats.
  • Skepticism: geographic/syntactic coverage gaps remain; decoder models only evaluated zero-shot.
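
To run a BLUFF-style big-head→long-tail transfer evaluation on your own data, the split itself is trivial; the field name and head size below are assumptions, not part of the benchmark.

```python
from collections import Counter

def head_tail_split(examples, lang_key="lang", head_k=10):
    """Train on the head_k highest-resource languages, test on the rest.

    examples is any list of dicts carrying a language field; report
    f1(in-domain dev) - f1(tail test) as the transfer gap.
    """
    counts = Counter(ex[lang_key] for ex in examples)
    head = {lang for lang, _ in counts.most_common(head_k)}
    train = [ex for ex in examples if ex[lang_key] in head]
    tail_test = [ex for ex in examples if ex[lang_key] not in head]
    return train, tail_test
```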

5) Practical next steps

  • Add archaic/style-shifted red-teaming to your safety eval suite (e.g., classical Chinese–style compression/ambiguity) and measure transfer across models and defenses.
  • For unlearning pipelines, explicitly test indirect unlearning attacks: request forgetting of benign/unrelated classes and measure degradation on security-critical classes; track prediction-distribution imbalance.
  • If doing RL post-training at scale, instrument gradient cosine similarity over time in async setups; trial GAC-style projection/skip controls before chasing reward-model fixes.
  • Replace monolithic reward models with a reward toolset: integrate code verifiers for deterministic tasks and add a verification gate before admitting new reward tools (RLAR pattern).
  • For persistent agents, compute your cost break-even (turn count × context length) using your provider's caching rules; decide when to switch from long-context to memory, and measure the accuracy hit (a toy break-even calculator follows this list).
  • For embodied/metric tasks, prefer deterministic perception skills + confidence gating (VANGUARD pattern) over VLM-only numeric estimation; route uncertain cases to fallback behaviors.
  • For multilingual disinformation/synthetic detection, evaluate detectors in big-head→long-tail transfer settings (BLUFF-style) rather than only multilingual in-domain splits.
  • If you publish benchmarks, consider server-side evaluation (DEP-style) to reduce leakage/contamination and lower integration cost for users.
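
For the break-even step above, a toy calculator under loudly stated assumptions: the long-context agent replays its whole history each turn with previously seen tokens billed at a cache discount, while the memory agent sends a roughly constant retrieval window per turn. Substitute your provider's actual caching rules; all parameters here are hypothetical.

```python
def breakeven_turns(tokens_per_turn: int, memory_overhead_tokens: int,
                    price_in: float, cached_discount: float = 0.1,
                    max_turns: int = 10_000) -> int:
    """First turn at which cumulative memory cost undercuts long context."""
    lc_cost = mem_cost = 0.0
    history = 0
    for turn in range(1, max_turns + 1):
        # Long context: new tokens at full price, replayed history discounted.
        lc_cost += (tokens_per_turn + history * cached_discount) * price_in
        history += tokens_per_turn
        # Memory: this turn's tokens plus a fixed retrieved-facts window.
        mem_cost += (tokens_per_turn + memory_overhead_tokens) * price_in
        if mem_cost < lc_cost:
            return turn
    return -1  # no crossover within max_turns

# e.g. breakeven_turns(800, 2_000, price_in=3e-6)  # $3 / 1M input tokens -> 52
```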

Generated from per-paper analyses; no external browsing.