Daily AI Paper Report (2026-03-09)
Published:
Chinese version: [中文]
Run stats
- Candidates: 1352
- Selected: 30
- Deepread completed: 30
- Window (UTC): 2026-03-06T01:00:00Z → 2026-03-07T01:00:00Z (weekend_backlog_sat, expanded=0)
Selected papers
| arXiv ID | Title / Links | Categories | Score | Why | Tags |
|---|---|---|---|---|---|
| 2602.22983 | Obscure but Effective: Classical Chinese Jailbreak Prompt Optimization via Bio-Inspired Search | cs.AI, cs.CR | 93 | Automated classical-Chinese jailbreak optimization; highlights multilingual safety gaps. | jailbreaks, prompt-optimization, multilingual, red-teaming, llm-security |
| 2603.00529 | CaptionFool: Universal Image Captioning Model Attacks | cs.CV, cs.AI | 93 | Universal adversarial attack on captioners (94–96%); can induce offensive captions & evade filters | multimodal-security, adversarial-attacks, image-captioning, robustness, content-moderation, red-teaming |
| 2603.05344 | Building AI Coding Agents for the Terminal: Scaffolding, Harness, Context Engineering, and Lessons Learned | cs.AI | 92 | Open-source terminal coding agent with explicit safety controls + context management lessons. | coding-agents, tool-use, agent-architecture, safety-controls, context-engineering, open-source |
| 2603.01712 | FT-Dojo: Towards Autonomous LLM Fine-Tuning with Language Agents | cs.AI, cs.LG | 92 | Benchmark for end-to-end autonomous LLM fine-tuning with agents; realistic tooling+iteration loop | agents, auto-ML, fine-tuning, benchmark, evaluation, tool-use |
| 2603.00436 | ROKA: Robust Knowledge Unlearning against Adversaries | cs.LG, cs.AI | 90 | Unlearning can induce new attacks; proposes robust unlearning framework to mitigate. | machine-unlearning, privacy, security, backdoors, robustness |
| 2603.03761 | AgentSelect: Benchmark for Narrative Query-to-Agent Recommendation | cs.AI, cs.IR | 90 | Benchmark for query-conditioned agent configuration recommendation; fills key gap for agent ecosystems. | agents, benchmark, tool-selection, evaluation, recommendation |
| 2603.01724 | GMP: A Benchmark for Content Moderation under Co-occurring Violations and Dynamic Rules | cs.AI | 90 | Content moderation benchmark with co-occurring harms + dynamic policies; closer to real deployment | safety, content-moderation, evaluation, policy-following, robustness, benchmarks |
| 2603.04948 | $\nabla$-Reasoner: LLM Reasoning via Test-Time Gradient Descent in Latent Space | cs.LG | 90 | Test-time latent gradient descent for LLM reasoning; potentially strong inference-time scaling method | llm, reasoning, test-time-compute, decoding, optimization, reward-model |
| 2603.01499 | Towards Privacy-Preserving LLM Inference via Collaborative Obfuscation (Technical Report) | cs.CR, cs.AI | 90 | Practical privacy-preserving LLM inference proposal targeting accuracy, clusters, and infra compatibility | privacy, secure-inference, llm-serving, systems, confidential-computing |
| 2603.00724 | RLAR: An Agentic Reward System for Multi-task Reinforcement Learning on Large Language Models | cs.CL | 90 | Agentic, dynamic reward acquisition for RL alignment; tackles reward generalization & verifier synthesis. | alignment, RLHF, reward-models, agents, verifiers, tool-use |
| 2603.05026 | RepoLaunch: Automating Build&Test Pipeline of Code Repositories on ANY Language and ANY Platform | cs.SE, cs.LG, cs.MA | 90 | Agent that auto-builds/tests any repo; enables scalable SWE benchmarks & training data pipelines | coding-agents, SWE-benchmarking, automation, build-and-test, evaluation-pipeline, datasets |
| 2603.01421 | SciDER: Scientific Data-centric End-to-end Researcher | cs.AI, cs.CL | 90 | End-to-end LLM scientist that parses raw data, writes/executes code, with benchmarks and feedback loop | agents, scientific-discovery, tool-use, code-execution, memory, evaluation |
| 2603.01104 | Egocentric Co-Pilot: Web-Native Smart-Glasses Agents for Assistive Egocentric AI | cs.HC, cs.AI, cs.CV, cs.CY | 90 | Smart-glasses LLM agent with long-horizon video reasoning + web tools; real-world agent safety stakes | agents, tool-use, multimodal, long-horizon, context-compression, assistive-tech, deployment |
| 2603.00634 | BLUFF: Benchmarking the Detection of False and Synthetic Content across 58 Low-Resource Languages | cs.CL | 89 | Large benchmark for false/synthetic content across many low-resource languages. | benchmarks, misinformation, synthetic-text-detection, multilingual, evaluation |
| 2603.01012 | FastCode: Fast and Cost-Efficient Code Understanding and Reasoning | cs.SE, cs.AI | 88 | Repo-scale code reasoning with cost-aware structure scouting; strong relevance to agent efficiency. | code-reasoning, agents, context-efficiency, repository-mapping, software-engineering |
| 2603.03906 | Measuring Privacy vs. Fidelity in Synthetic Social Media Datasets | cs.CR | 88 | Measures privacy leakage in synthetic social media text via authorship attribution re-ID attacks | privacy, synthetic-data, LLMs, membership-inference, authorship-attribution, security |
| 2603.01203 | How Well Does Agent Development Reflect Real-World Work? | cs.AI | 88 | Maps 43 agent benchmarks to 1,016 occupations; finds big mismatch vs real labor/economic value | agents, evaluation, benchmarks, labor-market, task-distribution, deployment-relevance |
| 2603.01501 | GAC: Stabilizing Asynchronous RL Training for LLMs via Gradient Alignment Control | cs.LG, cs.AI | 88 | Stabilizes async RL for LLMs; identifies stale-aligned gradients and proposes control method | LLM-RL, asynchronous-RL, training-stability, policy-gradient, scaling |
| 2603.04814 | Beyond the Context Window: A Cost-Performance Analysis of Fact-Based Memory vs. Long-Context LLMs for Persistent Agents | cs.CL | 88 | Direct cost/accuracy tradeoff study: long-context vs fact-memory for persistent agents on 3 benchmarks | agents, memory, long-context, cost-modeling, evaluation, RAG |
| 2603.01050 | MM-DeepResearch: A Simple and Effective Multimodal Agentic Search Baseline | cs.CV, cs.AI | 88 | Multimodal deep-research agent baseline + synthetic search-intensive data/trajectories; reusable for agents eval. | agents, multimodal, tool-use, search, planning, datasets, benchmarks |
| 2603.01241 | TARSE: Test-Time Adaptation via Retrieval of Skills and Experience for Reasoning Agents | cs.IR, cs.AI | 88 | Test-time retrieval of skills + verified reasoning trajectories to improve clinical reasoning agents | reasoning-agents, test-time-adaptation, retrieval, skills, experience, healthcare |
| 2603.04277 | VANGUARD: Vehicle-Anchored Ground Sample Distance Estimation for UAVs in GPS-Denied Environments | cs.RO, cs.AI | 87 | Shows VLM spatial scale hallucinations; adds deterministic tool for metric grounding. | embodied-agents, hallucinations, tool-use, robot-safety, evaluation |
| 2603.01167 | DEP: A Decentralized Large Language Model Evaluation Protocol | cs.CL | 86 | Decentralized LLM evaluation protocol aiming at reproducibility and reducing benchmark leakage risk. | evaluation, benchmarking, reproducibility, data-leakage, protocols |
| 2603.01455 | From Verbatim to Gist: Distilling Pyramidal Multimodal Memory via Semantic Information Bottleneck for Long-Horizon Video Agents | cs.CV, cs.AI, cs.CL, cs.IR, cs.MM | 86 | Pyramidal multimodal memory for long-horizon video agents; distills verbatim→gist to fit context | agents, long-context, memory, multimodal, video-understanding, efficiency |
| 2603.00883 | Knowledge without Wisdom: Measuring Misalignment between LLMs and Intended Impact | cs.LG, cs.AI, cs.CY, stat.AP | 86 | Shows benchmark success can be negatively aligned with learning outcomes; ensembles worsen misalignment | alignment, evaluation, OOD-generalization, education, impact-misalignment, reliability |
| 2603.04815 | EchoGuard: An Agentic Framework with Knowledge-Graph Memory for Detecting Manipulative Communication in Longitudinal Dialogue | cs.AI | 86 | Agentic KG memory to detect manipulation over long dialogues; relevant to safety monitoring & oversight | agents, safety, long-context, memory, knowledge-graphs, monitoring, dialogue |
| 2603.01563 | LFPO: Likelihood-Free Policy Optimization for Masked Diffusion Models | cs.LG, cs.AI | 86 | RLVR-style alignment for diffusion LLMs via likelihood-free policy optimization; potentially reusable method | alignment, RL, diffusion-LLM, RLVR, optimization, post-training |
| 2603.00590 | Fair in Mind, Fair in Action? A Synchronous Benchmark for Understanding and Generation in UMLLMs | cs.AI | 86 | Fairness benchmark for multimodal LLMs covering understanding+generation with metric normalization framework. | fairness, evaluation, multimodal, benchmarks, bias, metrics |
| 2603.05068 | Cyber Threat Intelligence for Artificial Intelligence Systems | cs.CR, cs.AI | 86 | Systematizes AI-focused cyber threat intelligence: assets, IoCs, supply-chain phases, workflows | AI-security, threat-intelligence, supply-chain, IoC, risk-management, governance |
| 2603.00856 | PARCER as an Operational Contract to Reduce Variance, Cost, and Risk in LLM Systems | cs.SE, cs.AI | 86 | YAML operational contract to reduce variance/cost/risk + improve long-context reliability in LLM systems | LLM-systems, governance, reliability, prompting, long-context, auditability, cost-control |
AI Paper Insight Brief
2026-03-09
0) Executive takeaways (read this first)
- “Non-standard language/style” is now a first-class jailbreak surface: classical/archaic language prompts can drive near-universal jailbreak success with very low query counts, and even transfer across models—suggesting many defenses are overfit to modern-language patterns.
- Robustness is shifting from “better models” to “better systems”: multiple papers show large gains from system-level interventions—dynamic reward tooling (RLAR), gradient-geometry stabilization (GAC), structured codebase scouting (FastCode), and offline search engines (MM-DeepResearch)—often cutting cost while improving quality.
- Evaluation is becoming infrastructure, not just datasets: DEP proposes leak-resistant benchmark servers; IRIS and BLUFF expand evaluation into multimodal fairness and long-tail multilingual disinformation; AgentSelect reframes evaluation artifacts into a recommendation benchmark for deployable agents.
- Privacy/security threats are increasingly “second-order”: indirect unlearning attacks can degrade other security-critical classes; synthetic text can still leak author identity; and privacy-preserving inference is moving toward deployable obfuscation compatible with existing serving stacks.
- Tool-augmented “anti-hallucination” is winning in metric domains: VANGUARD shows VLMs hallucinate spatial scale badly, while a deterministic geometric tool sharply reduces error—reinforcing a pattern: for safety-critical quantities, add verifiable tools rather than prompt harder.
1) Key themes (clusters)
Theme: Linguistic & stylistic jailbreak surfaces
- Why it matters: Safety layers that work in mainstream English/modern Chinese can fail under stylistic compression/ambiguity (classical Chinese; even other classical languages), enabling efficient black-box jailbreaks with high transfer.
- Representative papers:
  - Obscure but Effective: Classical Chinese Jailbreak Prompt Optimization via Bio-Inspired Search
- Common approach (a minimal search sketch follows this theme):
- Treat attacks as search/optimization over discrete prompt strategies (structured strategy spaces + black-box optimization).
- Use translation/normalization pipelines to score cross-lingual outputs consistently.
- Measure transferability across multiple frontier models to estimate real-world risk.
- Open questions / failure modes:
- How to build defenses that generalize across archaic styles without overblocking benign historical text.
- Whether translation-based filtering meaningfully reduces risk without introducing new bypasses or false positives.
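To make the search framing concrete, here is a minimal hill-climbing sketch over a discrete prompt-strategy space. Everything named here (the strategy dimensions, `render_prompt`, the `judge` stub) is a hypothetical placeholder: the paper's 8D strategy space and bio-inspired FOA optimizer are more elaborate, but the black-box query loop has this shape.

```python
import random

# Each dimension enumerates one stylistic choice (dimensions are illustrative,
# not the paper's actual 8D strategy space).
STRATEGY_SPACE = {
    "register": ["classical", "archaic", "modern"],
    "framing": ["fable", "edict", "commentary"],
    "persona": ["scholar", "official", "none"],
    "obfuscation": ["allusion", "euphemism", "direct"],
}

def render_prompt(intent: str, strategy: dict) -> str:
    # Placeholder: a real system would rewrite `intent` under the strategy.
    return f"[{'/'.join(strategy.values())}] {intent}"

def judge(prompt: str) -> float:
    # Placeholder black-box score (e.g., a grader model rating the victim's
    # reply); random here so the sketch runs standalone.
    return random.random()

def mutate(strategy: dict) -> dict:
    child = dict(strategy)
    dim = random.choice(list(STRATEGY_SPACE))
    child[dim] = random.choice(STRATEGY_SPACE[dim])
    return child

def search(intent: str, budget: int = 32) -> tuple[dict, float]:
    # `budget` counts judge queries; low-query attacks keep this small.
    best = {d: random.choice(v) for d, v in STRATEGY_SPACE.items()}
    best_score = judge(render_prompt(intent, best))
    for _ in range(budget - 1):
        cand = mutate(best)
        score = judge(render_prompt(intent, cand))
        if score > best_score:  # greedy hill climbing stands in for FOA
            best, best_score = cand, score
    return best, best_score

if __name__ == "__main__":
    print(search("test intent", budget=16))
```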
Theme: Robust unlearning under adversaries
- Why it matters: “Forget this class” requests can be weaponized to degrade other classes (indirect unlearning attack), turning unlearning into a security vulnerability rather than a privacy feature.
- Representative papers:
  - ROKA: Robust Knowledge Unlearning against Adversaries
  - Measuring Privacy vs. Fidelity in Synthetic Social Media Datasets
- Common approach (a minimal unlearning-objective sketch follows this theme):
- Formalize collateral damage (knowledge contamination/destruction) and design preservation/healing objectives alongside forgetting.
- Evaluate with distributional shift / imbalance lenses (balanced prediction distributions; attribution attacks on synthetic text).
- Open questions / failure modes:
- Dependence on retain/sibling data quality: biased or incomplete retain sets can cause under/over-healing.
- Privacy “wins” from synthetic text are partial: attribution accuracy drops but remains non-trivial, and fidelity choices move the risk.
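A minimal sketch of the preservation-aware idea: gradient ascent on the forget set combined with a weighted retain term, followed by the kind of collateral-damage check the indirect attack motivates. This is a generic formulation, not ROKA's Neural Healing; the toy model, data, and `lam` weight are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

model = torch.nn.Linear(16, 4)  # toy classifier standing in for the target
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

x_forget, y_forget = torch.randn(8, 16), torch.randint(0, 4, (8,))
x_retain, y_retain = torch.randn(32, 16), torch.randint(0, 4, (32,))

lam = 1.0  # preservation weight; too small invites the collateral damage above

for step in range(100):
    opt.zero_grad()
    forget_loss = -F.cross_entropy(model(x_forget), y_forget)  # ascend (forget)
    retain_loss = F.cross_entropy(model(x_retain), y_retain)   # preserve
    (forget_loss + lam * retain_loss).backward()
    opt.step()

# Indirect-attack red-team check: after unlearning, verify accuracy on *other*
# classes (here, the retain split) has not silently collapsed.
with torch.no_grad():
    acc = (model(x_retain).argmax(-1) == y_retain).float().mean().item()
print(f"retain accuracy after unlearning: {acc:.2f}")
```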
Theme: Agentic reward & RL stability as scaling bottlenecks
- Why it matters: As RL post-training scales, two bottlenecks dominate: (1) reward generalization/cost and (2) asynchronous instability. Both can cause brittle policies or training collapse.
- Representative papers:
  - RLAR: An Agentic Reward System for Multi-task Reinforcement Learning on Large Language Models
  - GAC: Stabilizing Asynchronous RL Training for LLMs via Gradient Alignment Control
  - LFPO: Likelihood-Free Policy Optimization for Masked Diffusion Models
- Common approach (a gradient-alignment sketch follows this theme):
- Replace static reward models with dynamic tool selection/synthesis (code verifiers, wrapped reward checkpoints) plus verification gates.
- Stabilize async RL by controlling gradient geometry (cosine-alignment projection/skip regimes).
- For diffusion LMs, avoid intractable likelihoods via logit/velocity-field objectives and variance-reduction sampling.
- Open questions / failure modes:
- Reward-tool synthesis introduces new attack surfaces (e.g., retrieval/README manipulation).
- Async stabilization shown on limited setups; behavior at large multi-node scale remains untested in the provided analyses.
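A minimal sketch of gradient-alignment control: monitor the cosine similarity of consecutive gradients and project or skip when alignment persists. The two thresholds, the projection rule, and the toy loop are assumptions for illustration; GAC's actual control law and its theory live in the paper.

```python
import torch

def align_control(grad: torch.Tensor, prev: torch.Tensor | None,
                  skip_thresh: float = 0.99, proj_thresh: float = 0.9):
    """Return a (possibly modified) gradient, or None to skip the update."""
    if prev is None:
        return grad
    cos = torch.nn.functional.cosine_similarity(
        grad.flatten(), prev.flatten(), dim=0)
    if cos > skip_thresh:
        return None  # persistently aligned (stale): drop this update
    if cos > proj_thresh:
        # Project out the component along the previous gradient direction.
        unit = prev.flatten() / (prev.norm() + 1e-8)
        grad = grad - (grad.flatten() @ unit) * unit.view_as(grad)
    return grad

prev = None
for step in range(5):
    grad = torch.randn(4, 4)  # stand-in for an async policy gradient
    out = align_control(grad, prev)
    prev = grad  # alignment is tracked on raw, pre-control gradients
    print(step, "skipped" if out is None else "applied")
```

The same `cosine_similarity` call, logged over training steps, is the cheap instrumentation suggested in the next-steps section.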
Theme: Evaluation & governance infrastructure (fairness, leakage, representativeness)
- Why it matters: Benchmarks increasingly drive deployment decisions; leakage and representativeness failures can mislead progress claims and misallocate effort.
- Representative papers:
  - DEP: A Decentralized Large Language Model Evaluation Protocol
  - Fair in Mind, Fair in Action? A Synchronous Benchmark for Understanding and Generation in UMLLMs
  - How Well Does Agent Development Reflect Real-World Work?
  - AgentSelect: Benchmark for Narrative Query-to-Agent Recommendation
- Common approach (a server-side scoring sketch follows this theme):
- Move evaluation logic/answers server-side to reduce leakage and standardize pipelines (protocols + toolkits).
- Evaluate fairness synchronously across tasks (generation + understanding) and across multiple fairness philosophies.
- Map benchmarks to external taxonomies (O*NET) to quantify coverage skew and define autonomy vs complexity.
- Convert heterogeneous evaluation artifacts into query-conditioned recommendation supervision for deployable agents.
- Open questions / failure modes:
- Protocol adoption is a coordination problem: value depends on the number of packaged servers/benchmarks.
- Automated annotators (e.g., demographic classifiers) can inject measurement bias; steerability metrics need stronger validation.
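A minimal sketch of the server-side pattern: gold answers live only inside the evaluation server, and clients receive aggregate metrics rather than per-item labels. This illustrates the leak-resistance idea only; `EvalServer` is a hypothetical stand-in, and DEP's actual protocol, transport, and packaging are not shown.

```python
class EvalServer:
    """Holds the hidden test set; never ships gold labels to clients."""

    def __init__(self, gold: dict[str, str]):
        self._gold = gold  # private: only aggregates cross this boundary

    def score(self, predictions: dict[str, str]) -> dict[str, float]:
        hits = sum(self._gold.get(k) == v for k, v in predictions.items())
        n = max(len(self._gold), 1)
        return {"accuracy": hits / n, "coverage": len(predictions) / n}

server = EvalServer({"q1": "A", "q2": "C", "q3": "B"})  # hidden answers
print(server.score({"q1": "A", "q2": "B", "q3": "B"}))  # aggregate only
```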
Theme: Long-horizon agents: memory, context, and cost
- Why it matters: Persistent agents hit hard limits from context windows, cost, and long-horizon temporal reasoning; multiple papers propose structured memory and cost-aware context acquisition.
- Representative papers:
- From Verbatim to Gist: Distilling Pyramidal Multimodal Memory via Semantic Information Bottleneck for Long-Horizon Video Agents
- Beyond the Context Window: A Cost-Performance Analysis of Fact-Based Memory vs. Long-Context LLMs for Persistent Agents
- FastCode: Fast and Cost-Efficient Code Understanding and Reasoning
- MM-DeepResearch: A Simple and Effective Multimodal Agentic Search Baseline
- Common approach (a break-even cost sketch follows this theme):
- Hierarchical memory (verbatim→gist) with uncertainty/entropy-gated retrieval to control compute.
- Offline corpora/search engines to enable cheap RL and multi-turn tool learning.
- “Scouting-first” metadata/graph navigation to reduce repeated full-text ingestion in code agents.
- Explicit cost models (prompt caching) to compute break-even turns for memory vs long-context.
- Open questions / failure modes:
- Flat fact extraction can lose temporal/coreference cues; memory accuracy lags long-context on some benchmarks.
- Offline search introduces an offline/online gap; corpus staleness can cap performance.
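A back-of-envelope sketch of the break-even computation: compare the cumulative input-token cost of re-sending a growing history (with a cache discount) against the flat per-turn cost of sending retrieved facts. All prices, token counts, and the caching rule are made-up placeholders; substitute your provider's actual rates and rules.

```python
def long_context_cost(turns: int, history_per_turn: int, price_in: float,
                      cache_discount: float = 0.5) -> float:
    # Each turn re-sends the growing history; previously seen prefix tokens
    # get the cache discount, new tokens pay full price (a simplification).
    total = 0.0
    for t in range(1, turns + 1):
        cached = (t - 1) * history_per_turn
        total += cached * price_in * cache_discount
        total += history_per_turn * price_in
    return total

def memory_cost(turns: int, facts_tokens: int, query_tokens: int,
                price_in: float) -> float:
    # Each turn sends only retrieved facts plus the query.
    return turns * (facts_tokens + query_tokens) * price_in

PRICE_IN = 2.5e-6  # $/input token (placeholder)
for turns in (10, 50, 200):
    lc = long_context_cost(turns, 1_000, PRICE_IN)
    mem = memory_cost(turns, 800, 200, PRICE_IN)
    print(f"{turns:>4} turns: long-context ${lc:.3f} vs memory ${mem:.3f}")
```

Long-context wins at small turn counts; the memory path's flat per-turn cost takes over as history grows, which is the break-even the cost-performance paper quantifies with real provider rules.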
2) Technical synthesis
- Several works converge on structured intermediate representations as the lever for robustness: CC-BOS uses an 8D prompt strategy vector; TARSE uses step-indexed LogicalChains + skills; FastCode uses multi-layer code graphs; MM-Mem uses sensory/episodic/schema layers; EchoGuard uses episodic/semantic KGs.
- Optimization is moving “inside the loop”: CC-BOS optimizes prompts black-box; ∇-Reasoner optimizes logits at test time; LFPO optimizes diffusion logits/velocity fields; GAC modifies gradients during training to prevent collapse.
- Verification gates are becoming standard: RLAR’s EvalTool verification, RepoLaunch’s Verify Agent, PARCER’s validation gates, and VANGUARD’s confidence score all encode “don’t trust the model by default” (a generic gate is sketched after this list).
- Cost/latency is treated as a first-class metric (not an afterthought): RLAR reports large token/GPU-hour reductions vs judge-based RLAIF; MM-DeepResearch quantifies online vs offline cost/time; memory-vs-long-context work gives explicit break-even turns; FastCode targets single-ingestion context assembly.
- Cross-lingual and long-tail generalization is repeatedly shown to be weak: BLUFF quantifies large F1 drops for long-tail languages; CC-BOS shows archaic-language bypass; both imply safety and detection tooling must be evaluated beyond high-resource languages.
- “Proxy alignment” failures are empirically visible: the classroom-transcript study (Knowledge without Wisdom) shows that foundation-model agreement and even expert-rubric alignment can diverge from intended impact (student learning gains), warning against over-reliance on proxy metrics.
- Asynchrony introduces a distinct RL failure mode (stale-aligned gradients) that is not just “off-policy”: GAC targets gradient geometry rather than only distribution correction.
- Tool augmentation beats end-to-end VLM reasoning for metric quantities: VANGUARD’s deterministic GSD estimation outperforms VLM area estimation, reinforcing a design pattern for embodied safety.
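A generic sketch of the verification-gate pattern from the list above: a candidate artifact (here, model-generated code for a reward verifier) is admitted only if independent checks pass. The `GateResult` type, the checks, and the held-out probe are illustrative assumptions, not any single paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class GateResult:
    admitted: bool
    reason: str

def verification_gate(candidate: str,
                      checks: list[Callable[[str], bool]]) -> GateResult:
    # Run every independent check; reject on the first failure.
    for check in checks:
        if not check(candidate):
            return GateResult(False, f"failed {check.__name__}")
    return GateResult(True, "all checks passed")

def compiles(src: str) -> bool:
    try:
        compile(src, "<candidate>", "exec")
        return True
    except SyntaxError:
        return False

def passes_probe(src: str) -> bool:
    scope: dict = {}
    exec(src, scope)  # sandbox this in any real system!
    return scope["double"](2) == 4  # held-out case the model never sees

print(verification_gate("def double(x): return 2 * x",
                        [compiles, passes_probe]))
```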
3) Top 5 papers (with “why now”)
1) Obscure but Effective: Classical Chinese Jailbreak Prompt Optimization via Bio-Inspired Search
- Shows classical/archaic language is a major safety blind spot; reports a 100% attack success rate (ASR) across several frontier models in their setting.
- Provides a structured 8D prompt strategy space + black-box FOA optimizer with very low reported query counts.
- Demonstrates cross-model transferability and applicability to other classical languages (Latin, Sanskrit).
- Skepticism: results rely on selected benchmark subsets and closed-source victims; combined defenses/translation filtering can reduce ASR.
2) ROKA: Robust Knowledge Unlearning against Adversaries
- Introduces the indirect unlearning attack: unlearning requests can be used to degrade other security-critical classes.
- Proposes Neural Healing / contribution re-allocation with targeted/non-targeted stochastic algorithms to preserve retained knowledge.
- Evaluated across vision, multimodal, and LLM settings (including Llama 3.2 on MMLU) with improved stability/balance vs gradient-ascent (GA) unlearning.
- Skepticism: exact re-allocation is infeasible; effectiveness depends on sibling/retain data representativeness.
3) RLAR: An Agentic Reward System for Multi-task Reinforcement Learning on Large Language Models
- Makes reward modeling adaptive and tool-based (wrap reward checkpoints; generate code verifiers) rather than static.
- Reports strong multi-domain RL gains (e.g., GSM8K improvements in Table 2) and large cost reductions vs GPT-5 judge RLAIF.
- Reward routing accuracy on REWARDBENCH-V2 is high (90.44% avg precision).
- Skepticism: relies on web retrieval and repository documentation; vulnerable to “readme hacking” per authors.
4) GAC: Stabilizing Asynchronous RL Training for LLMs via Gradient Alignment Control
- Identifies a concrete instability mechanism in async RL: persistently aligned consecutive gradients preceding collapse.
- Provides a low-overhead projection/skip control that largely closes the gap to synchronized GRPO under staleness (Table 1).
- Backed by theory linking projection to bias reduction in convergence bounds.
- Skepticism: experiments reported on a single-machine 8-GPU setup; large-scale distributed behavior not shown.
5) BLUFF: Benchmarking the Detection of False and Synthetic Content across 58 Low-Resource Languages
- Delivers a large multilingual benchmark (201K samples, 78 languages) with controlled manipulations and authorship types.
- Quantifies cross-lingual transfer degradation of up to 25.3 F1 points for long-tail languages; decoder-only models evaluated zero-shot are often near random on the multiclass task.
- Provides an agentic generation pipeline (AXL-CoI) and a heavy multilingual quality filter (mPURIFY) with reported retention stats.
- Skepticism: geographic/syntactic coverage gaps remain; decoder models only evaluated zero-shot.
4) Practical next steps
- Add archaic/style-shifted red-teaming to your safety eval suite (e.g., classical Chinese–style compression/ambiguity) and measure transfer across models and defenses (a minimal harness is sketched after this list).
- For unlearning pipelines, explicitly test indirect unlearning attacks: request forgetting of benign/unrelated classes and measure degradation on security-critical classes; track prediction-distribution imbalance.
- If doing RL post-training at scale, instrument gradient cosine similarity over time in async setups; trial GAC-style projection/skip controls before chasing reward-model fixes.
- Replace monolithic reward models with a reward toolset: integrate code verifiers for deterministic tasks and add a verification gate before admitting new reward tools (RLAR pattern).
- For persistent agents, compute your cost break-even (turn count × context length) using your provider’s caching rules; decide when to switch from long-context to memory, and measure the accuracy hit.
- For embodied/metric tasks, prefer deterministic perception skills + confidence gating (VANGUARD pattern) over VLM-only numeric estimation; route uncertain cases to fallback behaviors.
- For multilingual disinformation/synthetic detection, evaluate detectors in big-head→long-tail transfer settings (BLUFF-style) rather than only multilingual in-domain splits.
- If you publish benchmarks, consider server-side evaluation (DEP-style) to reduce leakage/contamination and lower integration cost for users.
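A minimal harness sketch for the first step above: run style-shifted red-team prompts across several models, then tabulate per-model attack success rate (ASR) and pairwise transfer. `call_model` and `is_jailbroken` are stubs to be replaced with real clients and a real grader; in practice, cache replies rather than re-querying.

```python
import itertools
import random

def call_model(model: str, prompt: str) -> str:
    return f"[{model} reply to: {prompt}]"  # stub client

def is_jailbroken(reply: str) -> bool:
    return random.random() < 0.1  # stub grader

MODELS = ["model_a", "model_b", "model_c"]  # your deployment targets
PROMPTS = ["style-shifted probe 1", "style-shifted probe 2"]  # red-team set

# Which prompts succeed on which model.
success = {m: {p for p in PROMPTS if is_jailbroken(call_model(m, p))}
           for m in MODELS}
asr = {m: len(success[m]) / len(PROMPTS) for m in MODELS}

# Transfer: of prompts succeeding on source s, the fraction that also
# succeed on target t.
transfer = {(s, t): len(success[s] & success[t]) / max(len(success[s]), 1)
            for s, t in itertools.permutations(MODELS, 2)}

print("ASR per model:", asr)
print("pairwise transfer:", transfer)
```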
Generated from per-paper analyses; no external browsing.
