Daily AI Paper Report (2026-03-09)
Published:
Chinese version: [中文]
Run stats
- Candidates: 1352
- Selected: 30
- Deepread completed: 30
- Window (UTC): 2026-03-06T01:00:00Z → 2026-03-07T01:00:00Z (weekend_backlog_sat, expanded=0)
Selected papers
| arXiv ID | Title / Links | Categories | Score | Why | Tags |
|---|---|---|---|---|---|
| 2602.22983 | Obscure but Effective: Classical Chinese Jailbreak Prompt Optimization via Bio-Inspired Search | cs.AI, cs.CR | 93 | Automated classical-Chinese jailbreak optimization; highlights multilingual safety gaps. | jailbreaks, prompt-optimization, multilingual, red-teaming, llm-security |
| 2603.00529 | CaptionFool: Universal Image Captioning Model Attacks | cs.CV, cs.AI | 93 | Universal adversarial attack on captioners (94–96%); can induce offensive captions & evade filters | multimodal-security, adversarial-attacks, image-captioning, robustness, content-moderation, red-teaming |
| 2603.05344 | Building AI Coding Agents for the Terminal: Scaffolding, Harness, Context Engineering, and Lessons Learned | cs.AI | 92 | Open-source terminal coding agent with explicit safety controls + context management lessons. | coding-agents, tool-use, agent-architecture, safety-controls, context-engineering, open-source |
| 2603.01712 | FT-Dojo: Towards Autonomous LLM Fine-Tuning with Language Agents | cs.AI, cs.LG | 92 | Benchmark for end-to-end autonomous LLM fine-tuning with agents; realistic tooling+iteration loop | agents, auto-ML, fine-tuning, benchmark, evaluation, tool-use |
| 2603.00436 | ROKA: Robust Knowledge Unlearning against Adversaries | cs.LG, cs.AI | 90 | Unlearning can induce new attacks; proposes robust unlearning framework to mitigate. | machine-unlearning, privacy, security, backdoors, robustness |
| 2603.03761 | AgentSelect: Benchmark for Narrative Query-to-Agent Recommendation | cs.AI, cs.IR | 90 | Benchmark for query-conditioned agent configuration recommendation; fills key gap for agent ecosystems. | agents, benchmark, tool-selection, evaluation, recommendation |
| 2603.01724 | GMP: A Benchmark for Content Moderation under Co-occurring Violations and Dynamic Rules | cs.AI | 90 | Content moderation benchmark with co-occurring harms + dynamic policies; closer to real deployment | safety, content-moderation, evaluation, policy-following, robustness, benchmarks |
| 2603.04948 | $\nabla$-Reasoner: LLM Reasoning via Test-Time Gradient Descent in Latent Space | cs.LG | 90 | Test-time latent gradient descent for LLM reasoning; potentially strong inference-time scaling method | llm, reasoning, test-time-compute, decoding, optimization, reward-model |
| 2603.01499 | Towards Privacy-Preserving LLM Inference via Collaborative Obfuscation (Technical Report) | cs.CR, cs.AI | 90 | Practical privacy-preserving LLM inference proposal targeting accuracy, clusters, and infra compatibility | privacy, secure-inference, llm-serving, systems, confidential-computing |
| 2603.00724 | RLAR: An Agentic Reward System for Multi-task Reinforcement Learning on Large Language Models | cs.CL | 90 | Agentic, dynamic reward acquisition for RL alignment; tackles reward generalization & verifier synthesis. | alignment, RLHF, reward-models, agents, verifiers, tool-use |
| 2603.05026 | RepoLaunch: Automating Build&Test Pipeline of Code Repositories on ANY Language and ANY Platform | cs.SE, cs.LG, cs.MA | 90 | Agent that auto-builds/tests any repo; enables scalable SWE benchmarks & training data pipelines | coding-agents, SWE-benchmarking, automation, build-and-test, evaluation-pipeline, datasets |
| 2603.01421 | SciDER: Scientific Data-centric End-to-end Researcher | cs.AI, cs.CL | 90 | End-to-end LLM scientist that parses raw data, writes/executes code, with benchmarks and feedback loop | agents, scientific-discovery, tool-use, code-execution, memory, evaluation |
| 2603.01104 | Egocentric Co-Pilot: Web-Native Smart-Glasses Agents for Assistive Egocentric AI | cs.HC, cs.AI, cs.CV, cs.CY | 90 | Smart-glasses LLM agent with long-horizon video reasoning + web tools; real-world agent safety stakes | agents, tool-use, multimodal, long-horizon, context-compression, assistive-tech, deployment |
| 2603.00634 | BLUFF: Benchmarking the Detection of False and Synthetic Content across 58 Low-Resource Languages | cs.CL | 89 | Large benchmark for false/synthetic content across many low-resource languages. | benchmarks, misinformation, synthetic-text-detection, multilingual, evaluation |
| 2603.01012 | FastCode: Fast and Cost-Efficient Code Understanding and Reasoning | cs.SE, cs.AI | 88 | Repo-scale code reasoning with cost-aware structure scouting; strong relevance to agent efficiency. | code-reasoning, agents, context-efficiency, repository-mapping, software-engineering |
| 2603.03906 | Measuring Privacy vs. Fidelity in Synthetic Social Media Datasets | cs.CR | 88 | Measures privacy leakage in synthetic social media text via authorship attribution re-ID attacks | privacy, synthetic-data, LLMs, membership-inference, authorship-attribution, security |
| 2603.01203 | How Well Does Agent Development Reflect Real-World Work? | cs.AI | 88 | Maps 43 agent benchmarks to 1,016 occupations; finds big mismatch vs real labor/economic value | agents, evaluation, benchmarks, labor-market, task-distribution, deployment-relevance |
| 2603.01501 | GAC: Stabilizing Asynchronous RL Training for LLMs via Gradient Alignment Control | cs.LG, cs.AI | 88 | Stabilizes async RL for LLMs; identifies stale-aligned gradients and proposes control method | LLM-RL, asynchronous-RL, training-stability, policy-gradient, scaling |
| 2603.04814 | Beyond the Context Window: A Cost-Performance Analysis of Fact-Based Memory vs. Long-Context LLMs for Persistent Agents | cs.CL | 88 | Direct cost/accuracy tradeoff study: long-context vs fact-memory for persistent agents on 3 benchmarks | agents, memory, long-context, cost-modeling, evaluation, RAG |
| 2603.01050 | MM-DeepResearch: A Simple and Effective Multimodal Agentic Search Baseline | cs.CV, cs.AI | 88 | Multimodal deep-research agent baseline + synthetic search-intensive data/trajectories; reusable for agents eval. | agents, multimodal, tool-use, search, planning, datasets, benchmarks |
| 2603.01241 | TARSE: Test-Time Adaptation via Retrieval of Skills and Experience for Reasoning Agents | cs.IR, cs.AI | 88 | Test-time retrieval of skills + verified reasoning trajectories to improve clinical reasoning agents | reasoning-agents, test-time-adaptation, retrieval, skills, experience, healthcare |
| 2603.04277 | VANGUARD: Vehicle-Anchored Ground Sample Distance Estimation for UAVs in GPS-Denied Environments | cs.RO, cs.AI | 87 | Shows VLM spatial scale hallucinations; adds deterministic tool for metric grounding. | embodied-agents, hallucinations, tool-use, robot-safety, evaluation |
| 2603.01167 | DEP: A Decentralized Large Language Model Evaluation Protocol | cs.CL | 86 | Decentralized LLM evaluation protocol aiming at reproducibility and reducing benchmark leakage risk. | evaluation, benchmarking, reproducibility, data-leakage, protocols |
| 2603.01455 | From Verbatim to Gist: Distilling Pyramidal Multimodal Memory via Semantic Information Bottleneck for Long-Horizon Video Agents | cs.CV, cs.AI, cs.CL, cs.IR, cs.MM | 86 | Pyramidal multimodal memory for long-horizon video agents; distills verbatim→gist to fit context | agents, long-context, memory, multimodal, video-understanding, efficiency |
| 2603.00883 | Knowledge without Wisdom: Measuring Misalignment between LLMs and Intended Impact | cs.LG, cs.AI, cs.CY, stat.AP | 86 | Shows benchmark success can be negatively aligned with learning outcomes; ensembles worsen misalignment | alignment, evaluation, OOD-generalization, education, impact-misalignment, reliability |
| 2603.04815 | EchoGuard: An Agentic Framework with Knowledge-Graph Memory for Detecting Manipulative Communication in Longitudinal Dialogue | cs.AI | 86 | Agentic KG memory to detect manipulation over long dialogues; relevant to safety monitoring & oversight | agents, safety, long-context, memory, knowledge-graphs, monitoring, dialogue |
| 2603.01563 | LFPO: Likelihood-Free Policy Optimization for Masked Diffusion Models | cs.LG, cs.AI | 86 | RLVR-style alignment for diffusion LLMs via likelihood-free policy optimization; potentially reusable method | alignment, RL, diffusion-LLM, RLVR, optimization, post-training |
| 2603.00590 | Fair in Mind, Fair in Action? A Synchronous Benchmark for Understanding and Generation in UMLLMs | cs.AI | 86 | Fairness benchmark for multimodal LLMs covering understanding+generation with metric normalization framework. | fairness, evaluation, multimodal, benchmarks, bias, metrics |
| 2603.05068 | Cyber Threat Intelligence for Artificial Intelligence Systems | cs.CR, cs.AI | 86 | Systematizes AI-focused cyber threat intelligence: assets, IoCs, supply-chain phases, workflows | AI-security, threat-intelligence, supply-chain, IoC, risk-management, governance |
| 2603.00856 | PARCER as an Operational Contract to Reduce Variance, Cost, and Risk in LLM Systems | cs.SE, cs.AI | 86 | YAML operational contract to reduce variance/cost/risk + improve long-context reliability in LLM systems | LLM-systems, governance, reliability, prompting, long-context, auditability, cost-control |
AI Paper Insight Brief
2026-03-09
0) Executive takeaways (read this first)
- “Non-standard language/style” is now a first-class jailbreak surface: classical/archaic language prompts can drive near-universal jailbreak success with very low query counts, and even transfer across models—suggesting many defenses are overfit to modern-language patterns.
- Robustness is shifting from “better models” to “better systems”: multiple papers show large gains from system-level interventions—dynamic reward tooling (RLAR), gradient-geometry stabilization (GAC), structured codebase scouting (FastCode), and offline search engines (MM-DeepResearch)—often cutting cost while improving quality.
- Evaluation is becoming infrastructure, not just datasets: DEP proposes leak-resistant benchmark servers; IRIS and BLUFF expand evaluation into multimodal fairness and long-tail multilingual disinformation; AgentSelect reframes evaluation artifacts into a recommendation benchmark for deployable agents.
- Privacy/security threats are increasingly “second-order”: indirect unlearning attacks can degrade other security-critical classes; synthetic text can still leak author identity; and privacy-preserving inference is moving toward deployable obfuscation compatible with existing serving stacks.
- Tool-augmented “anti-hallucination” is winning in metric domains: VANGUARD shows VLMs hallucinate spatial scale badly, while a deterministic geometric tool sharply reduces error—reinforcing a pattern: for safety-critical quantities, add verifiable tools rather than prompt harder.
1) Key themes (clusters)
Theme: Linguistic & stylistic jailbreak surfaces
- Why it matters: Safety layers that work in mainstream English/modern Chinese can fail under stylistic compression/ambiguity (classical Chinese; even other classical languages), enabling efficient black-box jailbreaks with high transfer.
- Representative papers:
  - Obscure but Effective: Classical Chinese Jailbreak Prompt Optimization via Bio-Inspired Search
- Common approach (a minimal search sketch follows this theme):
- Treat attacks as search/optimization over discrete prompt strategies (structured strategy spaces + black-box optimization).
- Use translation/normalization pipelines to score cross-lingual outputs consistently.
- Measure transferability across multiple frontier models to estimate real-world risk.
- Open questions / failure modes:
- How to build defenses that generalize across archaic styles without overblocking benign historical text.
- Whether translation-based filtering meaningfully reduces risk without introducing new bypasses or false positives.
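To make the search framing concrete, here is a minimal hill-climbing sketch over a discrete prompt-strategy space. Everything named here (the strategy dimensions, `render_prompt`, the `judge` stub) is a hypothetical placeholder: the paper's 8D strategy space and bio-inspired FOA optimizer are more elaborate, but the black-box query loop has this shape.

```python
import random

# Each dimension enumerates one stylistic choice (dimensions are illustrative,
# not the paper's actual 8D strategy space).
STRATEGY_SPACE = {
    "register": ["classical", "archaic", "modern"],
    "framing": ["fable", "edict", "commentary"],
    "persona": ["scholar", "official", "none"],
    "obfuscation": ["allusion", "euphemism", "direct"],
}

def render_prompt(intent: str, strategy: dict) -> str:
    # Placeholder: a real system would rewrite `intent` under the strategy.
    return f"[{'/'.join(strategy.values())}] {intent}"

def judge(prompt: str) -> float:
    # Placeholder black-box score (e.g., a grader model rating the victim's
    # reply); random here so the sketch runs standalone.
    return random.random()

def mutate(strategy: dict) -> dict:
    child = dict(strategy)
    dim = random.choice(list(STRATEGY_SPACE))
    child[dim] = random.choice(STRATEGY_SPACE[dim])
    return child

def search(intent: str, budget: int = 32) -> tuple[dict, float]:
    # `budget` counts judge queries; low-query attacks keep this small.
    best = {d: random.choice(v) for d, v in STRATEGY_SPACE.items()}
    best_score = judge(render_prompt(intent, best))
    for _ in range(budget - 1):
        cand = mutate(best)
        score = judge(render_prompt(intent, cand))
        if score > best_score:  # greedy hill climbing stands in for FOA
            best, best_score = cand, score
    return best, best_score

if __name__ == "__main__":
    print(search("test intent", budget=16))
```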
Theme: Robust unlearning under adversaries
- Why it matters: “Forget this class” requests can be weaponized to degrade other classes (indirect unlearning attack), turning unlearning into a security vulnerability rather than a privacy feature.
- Representative papers:
  - ROKA: Robust Knowledge Unlearning against Adversaries
  - Measuring Privacy vs. Fidelity in Synthetic Social Media Datasets
- Common approach (a minimal unlearning-objective sketch follows this theme):
- Formalize collateral damage (knowledge contamination/destruction) and design preservation/healing objectives alongside forgetting.
- Evaluate with distributional shift / imbalance lenses (balanced prediction distributions; attribution attacks on synthetic text).
- Open questions / failure modes:
- Dependence on retain/sibling data quality: biased or incomplete retain sets can cause under/over-healing.
- Privacy “wins” from synthetic text are partial: attribution accuracy drops but remains non-trivial, and fidelity choices move the risk.
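A minimal sketch of the preservation-aware idea: gradient ascent on the forget set combined with a weighted retain term, followed by the kind of collateral-damage check the indirect attack motivates. This is a generic formulation, not ROKA's Neural Healing; the toy model, data, and `lam` weight are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

model = torch.nn.Linear(16, 4)  # toy classifier standing in for the target
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

x_forget, y_forget = torch.randn(8, 16), torch.randint(0, 4, (8,))
x_retain, y_retain = torch.randn(32, 16), torch.randint(0, 4, (32,))

lam = 1.0  # preservation weight; too small invites the collateral damage above

for step in range(100):
    opt.zero_grad()
    forget_loss = -F.cross_entropy(model(x_forget), y_forget)  # ascend (forget)
    retain_loss = F.cross_entropy(model(x_retain), y_retain)   # preserve
    (forget_loss + lam * retain_loss).backward()
    opt.step()

# Indirect-attack red-team check: after unlearning, verify accuracy on *other*
# classes (here, the retain split) has not silently collapsed.
with torch.no_grad():
    acc = (model(x_retain).argmax(-1) == y_retain).float().mean().item()
print(f"retain accuracy after unlearning: {acc:.2f}")
```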
Theme: Agentic reward & RL stability as scaling bottlenecks
- Why it matters: As RL post-training scales, two bottlenecks dominate: (1) reward generalization/cost and (2) asynchronous instability. Both can cause brittle policies or training collapse.
- Representative papers:
  - RLAR: An Agentic Reward System for Multi-task Reinforcement Learning on Large Language Models
  - GAC: Stabilizing Asynchronous RL Training for LLMs via Gradient Alignment Control
  - LFPO: Likelihood-Free Policy Optimization for Masked Diffusion Models
- Common approach (a gradient-alignment sketch follows this theme):
- Replace static reward models with dynamic tool selection/synthesis (code verifiers, wrapped reward checkpoints) plus verification gates.
- Stabilize async RL by controlling gradient geometry (cosine-alignment projection/skip regimes).
- For diffusion LMs, avoid intractable likelihoods via logit/velocity-field objectives and variance-reduction sampling.
- Open questions / failure modes:
- Reward-tool synthesis introduces new attack surfaces (e.g., retrieval/README manipulation).
- Async stabilization shown on limited setups; behavior at large multi-node scale remains untested in the provided analyses.
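A minimal sketch of gradient-alignment control: monitor the cosine similarity of consecutive gradients and project or skip when alignment persists. The two thresholds, the projection rule, and the toy loop are assumptions for illustration; GAC's actual control law and its theory live in the paper.

```python
import torch

def align_control(grad: torch.Tensor, prev: torch.Tensor | None,
                  skip_thresh: float = 0.99, proj_thresh: float = 0.9):
    """Return a (possibly modified) gradient, or None to skip the update."""
    if prev is None:
        return grad
    cos = torch.nn.functional.cosine_similarity(
        grad.flatten(), prev.flatten(), dim=0)
    if cos > skip_thresh:
        return None  # persistently aligned (stale): drop this update
    if cos > proj_thresh:
        # Project out the component along the previous gradient direction.
        unit = prev.flatten() / (prev.norm() + 1e-8)
        grad = grad - (grad.flatten() @ unit) * unit.view_as(grad)
    return grad

prev = None
for step in range(5):
    grad = torch.randn(4, 4)  # stand-in for an async policy gradient
    out = align_control(grad, prev)
    prev = grad  # alignment is tracked on raw, pre-control gradients
    print(step, "skipped" if out is None else "applied")
```

The same `cosine_similarity` call, logged over training steps, is the cheap instrumentation suggested in the next-steps section.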
Theme: Evaluation & governance infrastructure (fairness, leakage, representativeness)
- Why it matters: Benchmarks increasingly drive deployment decisions; leakage and representativeness failures can mislead progress claims and misallocate effort.
- Representative papers:
  - DEP: A Decentralized Large Language Model Evaluation Protocol
  - Fair in Mind, Fair in Action? A Synchronous Benchmark for Understanding and Generation in UMLLMs
  - How Well Does Agent Development Reflect Real-World Work?
  - AgentSelect: Benchmark for Narrative Query-to-Agent Recommendation
- Common approach (a server-side scoring sketch follows this theme):
- Move evaluation logic/answers server-side to reduce leakage and standardize pipelines (protocols + toolkits).
- Evaluate fairness synchronously across tasks (generation + understanding) and across multiple fairness philosophies.
- Map benchmarks to external taxonomies (O*NET) to quantify coverage skew and define autonomy vs complexity.
- Convert heterogeneous evaluation artifacts into query-conditioned recommendation supervision for deployable agents.
- Open questions / failure modes:
- Protocol adoption is a coordination problem: value depends on the number of packaged servers/benchmarks.
- Automated annotators (e.g., demographic classifiers) can inject measurement bias; steerability metrics need stronger validation.
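A minimal sketch of the server-side pattern: gold answers live only inside the evaluation server, and clients receive aggregate metrics rather than per-item labels. This illustrates the leak-resistance idea only; `EvalServer` is a hypothetical stand-in, and DEP's actual protocol, transport, and packaging are not shown.

```python
class EvalServer:
    """Holds the hidden test set; never ships gold labels to clients."""

    def __init__(self, gold: dict[str, str]):
        self._gold = gold  # private: only aggregates cross this boundary

    def score(self, predictions: dict[str, str]) -> dict[str, float]:
        hits = sum(self._gold.get(k) == v for k, v in predictions.items())
        n = max(len(self._gold), 1)
        return {"accuracy": hits / n, "coverage": len(predictions) / n}

server = EvalServer({"q1": "A", "q2": "C", "q3": "B"})  # hidden answers
print(server.score({"q1": "A", "q2": "B", "q3": "B"}))  # aggregate only
```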
Theme: Long-horizon agents: memory, context, and cost
- Why it matters: Persistent agents hit hard limits from context windows, cost, and long-horizon temporal reasoning; multiple papers propose structured memory and cost-aware context acquisition.
- Representative papers:
- From Verbatim to Gist: Distilling Pyramidal Multimodal Memory via Semantic Information Bottleneck for Long-Horizon Video Agents
- Beyond the Context Window: A Cost-Performance Analysis of Fact-Based Memory vs. Long-Context LLMs for Persistent Agents
- FastCode: Fast and Cost-Efficient Code Understanding and Reasoning
- MM-DeepResearch: A Simple and Effective Multimodal Agentic Search Baseline
- Common approach (a break-even cost sketch follows this theme):
- Hierarchical memory (verbatim→gist) with uncertainty/entropy-gated retrieval to control compute.
- Offline corpora/search engines to enable cheap RL and multi-turn tool learning.
- “Scouting-first” metadata/graph navigation to reduce repeated full-text ingestion in code agents.
- Explicit cost models (prompt caching) to compute break-even turns for memory vs long-context.
- Open questions / failure modes:
- Flat fact extraction can lose temporal/coreference cues; memory accuracy lags long-context on some benchmarks.
- Offline search introduces an offline/online gap; corpus staleness can cap performance.
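A back-of-envelope sketch of the break-even computation: compare the cumulative input-token cost of re-sending a growing history (with a cache discount) against the flat per-turn cost of sending retrieved facts. All prices, token counts, and the caching rule are made-up placeholders; substitute your provider's actual rates and rules.

```python
def long_context_cost(turns: int, history_per_turn: int, price_in: float,
                      cache_discount: float = 0.5) -> float:
    # Each turn re-sends the growing history; previously seen prefix tokens
    # get the cache discount, new tokens pay full price (a simplification).
    total = 0.0
    for t in range(1, turns + 1):
        cached = (t - 1) * history_per_turn
        total += cached * price_in * cache_discount
        total += history_per_turn * price_in
    return total

def memory_cost(turns: int, facts_tokens: int, query_tokens: int,
                price_in: float) -> float:
    # Each turn sends only retrieved facts plus the query.
    return turns * (facts_tokens + query_tokens) * price_in

PRICE_IN = 2.5e-6  # $/input token (placeholder)
for turns in (10, 50, 200):
    lc = long_context_cost(turns, 1_000, PRICE_IN)
    mem = memory_cost(turns, 800, 200, PRICE_IN)
    print(f"{turns:>4} turns: long-context ${lc:.3f} vs memory ${mem:.3f}")
```

Long-context wins at small turn counts; the memory path's flat per-turn cost takes over as history grows, which is the break-even the cost-performance paper quantifies with real provider rules.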
2) Technical synthesis
- Several works converge on structured intermediate representations as the lever for robustness: CC-BOS uses an 8D prompt strategy vector; TARSE uses step-indexed LogicalChains + skills; FastCode uses multi-layer code graphs; MM-Mem uses sensory/episodic/schema layers; EchoGuard uses episodic/semantic KGs.
- Optimization is moving “inside the loop”: CC-BOS optimizes prompts black-box; ∇-Reasoner optimizes logits at test time; LFPO optimizes diffusion logits/velocity fields; GAC modifies gradients during training to prevent collapse.
- Verification gates are becoming standard: RLAR’s EvalTool verification, RepoLaunch’s Verify Agent, PARCER’s validation gates, and VANGUARD’s confidence score all encode “don’t trust the model by default” (a generic gate is sketched after this list).
- Cost/latency is treated as a first-class metric (not an afterthought): RLAR reports large token/GPU-hour reductions vs judge-based RLAIF; MM-DeepResearch quantifies online vs offline cost/time; memory-vs-long-context work gives explicit break-even turns; FastCode targets single-ingestion context assembly.
- Cross-lingual and long-tail generalization is repeatedly shown to be weak: BLUFF quantifies large F1 drops for long-tail languages; CC-BOS shows archaic-language bypass; both imply safety and detection tooling must be evaluated beyond high-resource languages.
- “Proxy alignment” failures are empirically visible: the classroom-transcript study (Knowledge without Wisdom) shows that foundation-model agreement and even expert-rubric alignment can diverge from intended impact (student learning gains), warning against over-reliance on proxy metrics.
- Asynchrony introduces a distinct RL failure mode (stale-aligned gradients) that is not just “off-policy”: GAC targets gradient geometry rather than only distribution correction.
- Tool augmentation beats end-to-end VLM reasoning for metric quantities: VANGUARD’s deterministic GSD estimation outperforms VLM area estimation, reinforcing a design pattern for embodied safety.
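A generic sketch of the verification-gate pattern from the list above: a candidate artifact (here, model-generated code for a reward verifier) is admitted only if independent checks pass. The `GateResult` type, the checks, and the held-out probe are illustrative assumptions, not any single paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class GateResult:
    admitted: bool
    reason: str

def verification_gate(candidate: str,
                      checks: list[Callable[[str], bool]]) -> GateResult:
    # Run every independent check; reject on the first failure.
    for check in checks:
        if not check(candidate):
            return GateResult(False, f"failed {check.__name__}")
    return GateResult(True, "all checks passed")

def compiles(src: str) -> bool:
    try:
        compile(src, "<candidate>", "exec")
        return True
    except SyntaxError:
        return False

def passes_probe(src: str) -> bool:
    scope: dict = {}
    exec(src, scope)  # sandbox this in any real system!
    return scope["double"](2) == 4  # held-out case the model never sees

print(verification_gate("def double(x): return 2 * x",
                        [compiles, passes_probe]))
```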
3) Top 5 papers (with “why now”)
1) Obscure but Effective: Classical Chinese Jailbreak Prompt Optimization via Bio-Inspired Search
- Shows classical/archaic language is a major safety blind spot; reports a 100% attack success rate (ASR) across several frontier models in their setting.
- Provides a structured 8D prompt strategy space + black-box FOA optimizer with very low reported query counts.
- Demonstrates cross-model transferability and applicability to other classical languages (Latin, Sanskrit).
- Skepticism: results rely on selected benchmark subsets and closed-source victims; combined defenses/translation filtering can reduce ASR.
2) ROKA: Robust Knowledge Unlearning against Adversaries
- Introduces the indirect unlearning attack: unlearning requests can be used to degrade other security-critical classes.
- Proposes Neural Healing / contribution re-allocation with targeted/non-targeted stochastic algorithms to preserve retained knowledge.
- Evaluated across vision, multimodal, and LLM settings (including Llama 3.2 on MMLU) with improved stability/balance vs gradient-ascent (GA) unlearning.
- Skepticism: exact re-allocation is infeasible; effectiveness depends on sibling/retain data representativeness.
3) RLAR: An Agentic Reward System for Multi-task Reinforcement Learning on Large Language Models
- Makes reward modeling adaptive and tool-based (wrap reward checkpoints; generate code verifiers) rather than static.
- Reports strong multi-domain RL gains (e.g., GSM8K improvements in Table 2) and large cost reductions vs GPT-5 judge RLAIF.
- Reward routing accuracy on REWARDBENCH-V2 is high (90.44% avg precision).
- Skepticism: relies on web retrieval and repository documentation; vulnerable to “readme hacking” per authors.
4) GAC: Stabilizing Asynchronous RL Training for LLMs via Gradient Alignment Control
- Identifies a concrete instability mechanism in async RL: persistently aligned consecutive gradients preceding collapse.
- Provides a low-overhead projection/skip control that largely closes the gap to synchronized GRPO under staleness (Table 1).
- Backed by theory linking projection to bias reduction in convergence bounds.
- Skepticism: experiments reported on a single-machine 8-GPU setup; large-scale distributed behavior not shown.
5) BLUFF: Benchmarking the Detection of False and Synthetic Content across 58 Low-Resource Languages
- Delivers a large multilingual benchmark (201K samples, 78 languages) with controlled manipulations and authorship types.
- Quantifies cross-lingual transfer degradation of up to 25.3 F1 points for long-tail languages; decoder-only models evaluated zero-shot are often near random on the multiclass task.
- Provides an agentic generation pipeline (AXL-CoI) and a heavy multilingual quality filter (mPURIFY) with reported retention stats.
- Skepticism: geographic/syntactic coverage gaps remain; decoder models only evaluated zero-shot.
4) Practical next steps
- Add archaic/style-shifted red-teaming to your safety eval suite (e.g., classical Chinese–style compression/ambiguity) and measure transfer across models and defenses (a minimal harness is sketched after this list).
- For unlearning pipelines, explicitly test indirect unlearning attacks: request forgetting of benign/unrelated classes and measure degradation on security-critical classes; track prediction-distribution imbalance.
- If doing RL post-training at scale, instrument gradient cosine similarity over time in async setups; trial GAC-style projection/skip controls before chasing reward-model fixes.
- Replace monolithic reward models with a reward toolset: integrate code verifiers for deterministic tasks and add a verification gate before admitting new reward tools (RLAR pattern).
- For persistent agents, compute your cost break-even (turn count × context length) using your provider’s caching rules; decide when to switch from long-context to memory, and measure the accuracy hit.
- For embodied/metric tasks, prefer deterministic perception skills + confidence gating (VANGUARD pattern) over VLM-only numeric estimation; route uncertain cases to fallback behaviors.
- For multilingual disinformation/synthetic detection, evaluate detectors in big-head→long-tail transfer settings (BLUFF-style) rather than only multilingual in-domain splits.
- If you publish benchmarks, consider server-side evaluation (DEP-style) to reduce leakage/contamination and lower integration cost for users.
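A minimal harness sketch for the first step above: run style-shifted red-team prompts across several models, then tabulate per-model attack success rate (ASR) and pairwise transfer. `call_model` and `is_jailbroken` are stubs to be replaced with real clients and a real grader; in practice, cache replies rather than re-querying.

```python
import itertools
import random

def call_model(model: str, prompt: str) -> str:
    return f"[{model} reply to: {prompt}]"  # stub client

def is_jailbroken(reply: str) -> bool:
    return random.random() < 0.1  # stub grader

MODELS = ["model_a", "model_b", "model_c"]  # your deployment targets
PROMPTS = ["style-shifted probe 1", "style-shifted probe 2"]  # red-team set

# Which prompts succeed on which model.
success = {m: {p for p in PROMPTS if is_jailbroken(call_model(m, p))}
           for m in MODELS}
asr = {m: len(success[m]) / len(PROMPTS) for m in MODELS}

# Transfer: of prompts succeeding on source s, the fraction that also
# succeed on target t.
transfer = {(s, t): len(success[s] & success[t]) / max(len(success[s]), 1)
            for s, t in itertools.permutations(MODELS, 2)}

print("ASR per model:", asr)
print("pairwise transfer:", transfer)
```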
Generated from per-paper analyses; no external browsing.
