Daily AI Paper Report (2026-03-11)
Published:
Chinese version: [Chinese]
Run stats
- Candidates: 258
- Selected: 30
- Deepread completed: 30
- Window (UTC): 2026-03-09T00:00:00Z → 2026-03-10T00:00:00Z (arxiv_announce, expanded=0)
| arXiv ID | Title / Links | Categories | Score | Why | Tags |
|---|---|---|---|---|---|
| 2603.08274 | How Much Do LLMs Hallucinate in Document Q&A Scenarios? A 172-Billion-Token Study Across Temperatures, Context Lengths, and Hardware Platforms | cs.CL, cs.AI | 95 | Massive, contamination-resistant hallucination measurement for doc QA across temps/contexts/hardware. | hallucination, evaluation, grounded-QA, long-context, methodology, reliability |
| 2603.08640 | PostTrainBench: Can LLM Agents Automate LLM Post-Training? | cs.SE, cs.AI, cs.LG | 95 | Benchmarks autonomous agents doing LLM post-training under tight compute; key for AI R&D automation risk. | agents, post-training, automation, evaluation, bounded-compute, AI-research |
| 2603.08024 | ConflictBench: Evaluating Human-AI Conflict via Interactive and Visually Grounded Environments | cs.CL | 94 | Interactive benchmark for human-AI conflict; exposes deception/self-preservation in agents | agent-safety, benchmark, multimodal, interactive-eval, deception, alignment |
| 2603.08104 | Invisible Safety Threat: Malicious Finetuning for LLM via Steganography | cs.LG | 93 | Steganographic finetuning enables covert harmful outputs while appearing aligned | alignment, steganography, backdoor, model-security, covert-channels, red-teaming |
| 2603.08145 | DARC: Disagreement-Aware Alignment via Risk-Constrained Decoding | cs.LG, cs.AI | 93 | Retraining-free risk-sensitive decoding for preference disagreement; robust alignment control knobs. | alignment, preference-modeling, distributional-robustness, decoding, risk, RLHF, DPO |
| 2603.08655 | OfficeQA Pro: An Enterprise Benchmark for End-to-End Grounded Reasoning | cs.AI, cs.CL, cs.IR | 93 | Enterprise-scale grounded multi-doc reasoning benchmark; frontier models <35% even with corpus access. | benchmark, grounded-reasoning, RAG, documents, tables, evaluation, agents |
| 2603.08234 | The Struggle Between Continuation and Refusal: A Mechanistic Analysis of the Continuation-Triggered Jailbreak in LLMs | cs.AI, cs.LG | 92 | Mechanistic interpretability of a jailbreak trigger with causal attention-head interventions. | jailbreaks, mechanistic-interpretability, attention-heads, robustness, LLM-safety |
| 2603.08412 | Aligning to Illusions: Choice Blindness in Human and AI Feedback | cs.CL, cs.AI | 92 | Shows choice blindness corrupts RLHF labels; LLM judges also fail under context/social pressure. | RLHF, preference-data, label-noise, evaluation, human-factors, LLM-judges, alignment |
| 2603.08520 | SCAFFOLD-CEGIS: Preventing Latent Security Degradation in LLM-Driven Iterative Code Refinement | cs.CR, cs.SE | 91 | Shows iterative code refinement can drift into worse security; proposes mitigation | code-security, agents, specification-drift, SAST, secure-coding, evaluation |
| 2603.08660 | How Far Can Unsupervised RLVR Scale LLM Training? | cs.LG, cs.CL | 91 | Clear theory+experiments: intrinsic URLVR sharpens initial beliefs; can fail catastrophically when wrong. | RLVR, unsupervised-RL, verifiable-rewards, theory, scaling, safety-failure-modes |
| 2603.08179 | Privacy-Preserving End-to-End Full-Duplex Speech Dialogue Models | eess.AS, cs.AI, eess.SP | 90 | Shows speaker-ID leakage in duplex speech LLMs and proposes streaming anonymization mitigations. | privacy, speech-LLMs, representation-leakage, anonymization, security |
| 2603.07978 | OSExpert: Computer-Use Agents Learning Professional Skills via Exploration | cs.AI | 90 | OSExpert-Eval + exploration curriculum for computer-use agents; targets transfer, efficiency, fine actions. | computer-use, agents, benchmark, exploration, curriculum, UI, tool-use |
| 2603.08316 | SlowBA: An efficiency backdoor attack towards VLM-based GUI agents | cs.CR, cs.CL, cs.CV | 89 | Backdoor attack on VLM GUI agents that triggers extreme latency via long reasoning | agent-security, VLM, GUI-agents, backdoor, availability-attack, reasoning |
| 2603.08091 | Toward Robust LLM-Based Judges: Taxonomic Bias Evaluation and Debiasing Optimization | cs.CL | 88 | JudgeBiasBench: taxonomy + benchmark to measure/debias LLM-judge evaluation biases | evaluation, LLM-judges, bias, reward-modeling, benchmark, debiasing |
| 2603.07853 | SynPlanResearch-R1: Encouraging Tool Exploration for Deep Research with Synthetic Plans | cs.AI, cs.CL, cs.IR | 88 | Synthetic tool-use plans to fix exploration failures in research agents; boosts on open-web benchmarks. | agents, tool-use, exploration, synthetic-data, RL, web, benchmarks |
| 2603.07931 | BRIDGE: Benchmark for multi-hop Reasoning In long multimodal Documents with Grounded Evidence | cs.CL | 88 | Multi-hop long multimodal doc QA with step-level grounded evidence; exposes hidden aggregation failures. | multimodal, long-context, benchmark, grounding, multi-hop, scientific-docs, RAG |
| 2603.08262 | FinToolBench: Evaluating LLM Agents for Real-World Financial Tool Use | cs.AI | 87 | FinToolBench: runnable real financial tool-use benchmark for LLM agents in high-stakes domain | agents, tool-use, benchmark, finance, compliance, evaluation |
| 2603.08221 | SplitAgent: A Privacy-Preserving Distributed Architecture for Enterprise-Cloud Agent Collaboration | cs.CR, cs.AI | 86 | SplitAgent: enterprise-cloud agent split with dynamic sanitization + DP guarantees | privacy, agent-architecture, data-sanitization, differential-privacy, enterprise, security |
| 2603.08486 | Visual Self-Fulfilling Alignment: Shaping Safety-Oriented Personas via Threat-Related Images | cs.CV, cs.AI | 86 | Label-free VLM safety persona shaping via threat-image exposure; relevant to multimodal safety. | multimodal-safety, VLM, alignment, persona, fine-tuning |
| 2603.08068 | In-Context Reinforcement Learning for Tool Use in Large Language Models | cs.AI | 86 | In-context RL for tool use reduces SFT cold-start dependence; relevant to scalable agent training. | agents, tool-use, reinforcement-learning, in-context-learning, data-efficiency |
| 2603.07886 | CCR-Bench: A Comprehensive Benchmark for Evaluating LLMs on Complex Constraints, Control Flows, and Real-World Cases | cs.CL, cs.AI | 86 | Benchmark for complex instruction following with constraints/control flow; closer to real deployment needs. | instruction-following, benchmark, constraints, control-flow, reliability, evaluation |
| 2603.08371 | Leaderboard Incentives: Model Rankings under Strategic Post-Training | cs.GT, cs.LG | 85 | Formalizes benchmaxxing incentives; shows no Nash equilibrium under common benchmark dynamics. | evaluation, benchmarks, gaming, mechanism-design, game-theory, post-training |
| 2603.07980 | $OneMillion-Bench: How Far are Language Agents from Human Experts? | cs.LG, cs.AI, cs.CL | 84 | OneMillion-Bench: expert tasks for long-horizon agents in economically consequential settings | agents, benchmark, long-horizon, tool-use, professional-tasks, evaluation |
| 2603.08013 | PIRA-Bench: A Transition from Reactive GUI Agents to GUI-based Proactive Intent Recommendation Agents | cs.AI | 84 | Benchmark for proactive GUI agents from continuous screenshots; long-horizon noisy trajectories. | agents, GUI, benchmark, proactive-assistants, evaluation |
| 2603.07990 | MJ1: Multimodal Judgment via Grounded Verification | cs.LG | 84 | Grounded verification chain + counterfactual consistency RL improves multimodal judging with small model. | multimodal, judge-models, grounding, RL, evaluation, bias |
| 2603.07915 | Ares: Adaptive Reasoning Effort Selection for Efficient LLM Agents | cs.AI | 84 | Per-step reasoning-effort routing for agents to cut cost without big accuracy loss; practical deployment. | agents, inference-efficiency, reasoning-budget, routing, cost-control, deployment |
| 2603.08429 | One Model Is Enough: Native Retrieval Embeddings from LLM Agent Hidden States | cs.CL, cs.AI, cs.IR | 83 | Native retrieval embeddings from LLM hidden states; simplifies agent RAG stack with small loss. | RAG, retrieval, embeddings, agents, efficiency, representation-learning |
| 2603.08659 | CODA: Difficulty-Aware Compute Allocation for Adaptive Reasoning | cs.CL | 83 | Formalizes adaptive reasoning as utility maximization; allocates tokens by difficulty to avoid overthinking. | adaptive-reasoning, inference-time-compute, token-budget, efficiency, reasoning-models |
| 2603.08117 | UIS-Digger: Towards Comprehensive Research Agent Systems for Real-world Unindexed Information Seeking | cs.AI, cs.IR | 82 | UIS-QA benchmark targets unindexed info seeking; shows big drop for SOTA agents | agents, information-seeking, benchmark, web, retrieval, robustness |
| 2603.08706 | Agentic Critical Training | cs.AI, cs.CL, cs.LG | 82 | RL paradigm trains agents to judge better actions among alternatives vs imitating reflection text. | agents, reinforcement-learning, critique, action-selection, training-paradigm, reasoning |
AI Paper Insight Brief
2026-03-11
0) Executive takeaways (read this first)
- Agent training is converging on “better exploration priors” rather than just better RL: synthetic plan-guided SFT (SynPlanResearch-R1) and RL-only with in-context demos (ICRL) both target the same bottleneck—on-policy RL getting stuck in shallow tool-use behaviors.
- Adaptive compute is moving from “per-query” to “per-step / per-instance” control: ARES routes thinking level per agent step; CODA shapes RL rewards to reallocate tokens by difficulty—both cut cost without (much) accuracy loss, but require careful labeling/proxy design.
- Evaluation is shifting toward entangled, long-horizon, and real-world constraints: CCR-Bench (constraints + workflows + industrial logs), OfficeQA Pro (enterprise PDFs + numeric exactness), $OneMillion-Bench (expert rubrics + economic value), BRIDGE (multimodal evidence chains), UIS-QA (unindexed web), FinToolBench (finance tool compliance) all expose large gaps that “standard QA” misses.
- Safety threats are expanding beyond content to channels and resources: malicious finetuning via invisible Unicode steganography can bypass safety checks; SlowBA backdoors latency while preserving correctness; continuation-triggered jailbreaks reveal a mechanistic “continuation vs refusal” circuit tension.
- Judge reliability is now a first-class alignment problem: MJ1 improves multimodal judging via grounded verification + flip-consistency reward; JudgeBiasBench quantifies 12 bias types and reduces them via GRPO/InfoNCE; choice-blindness shows preference data can be silently corrupted while standard metrics look fine.
- Enterprise/privacy constraints are becoming architectural: SplitAgent proposes a privacy-agent / cloud-reasoner split with DP budgets and protocol primitives; full-duplex speech models leak speaker identity in hidden states, but streaming anonymization can push EER toward chance.
1) Key themes (clusters)
Theme: Tool-using research agents—fixing exploration and cold start
- Why it matters: Tool-using agents fail less from “not knowing facts” and more from shallow or brittle tool trajectories; RL can’t explore what the initial policy never tries.
- Representative papers:
- SynPlanResearch-R1: Encouraging Tool Exploration for Deep Research with Synthetic Plans
- In-Context Reinforcement Learning for Tool Use in Large Language Models
- UIS-Digger: Towards Comprehensive Research Agent Systems for Real-world Unindexed Information Seeking
- Common approach:
- Shape exploration via structured priors (synthetic plans + cues) or in-rollout demonstrations (ICRL curriculum 3→2→0).
- Use GRPO-style group RL with format rewards and loss masking over tool outputs (a minimal masking sketch follows this theme).
- Expand tool affordances beyond search: crawl, visual browsing, file download/parse for UIS.
- Open questions / failure modes:
- Compute/data cost: synthetic trajectory generation (teacher models) and multi-rollout RL are expensive.
- Sensitivity to prompting/curriculum/hyperparameters (cue injection, curriculum schedule).
- Real-world brittleness when evidence is unindexed, dynamic, or requires interaction/file parsing (UIS accuracy still ~27%).
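As a concrete anchor for the GRPO-plus-masking recipe above, here is a minimal sketch of loss masking over tool outputs, assuming a plain REINFORCE-style surrogate with no clipping or KL term (the papers' exact objectives differ); `token_roles` and the tensor shapes are illustrative.

```python
import torch

def masked_policy_loss(logprobs: torch.Tensor,
                       advantages: torch.Tensor,
                       token_roles: torch.Tensor) -> torch.Tensor:
    """Policy-gradient loss with tool-output tokens masked out.

    logprobs:    (T,) log-probs of sampled tokens under the policy
    advantages:  (T,) group-relative advantage broadcast per token
    token_roles: (T,) 0 = model-generated token, 1 = tool-output token
    Tool outputs are environment text the policy did not produce,
    so they contribute no gradient.
    """
    mask = (token_roles == 0).float()
    loss = -(logprobs * advantages * mask).sum() / mask.sum().clamp(min=1)
    return loss
```

The same mask would also exclude tool outputs from any KL or entropy regularizers a full GRPO implementation adds.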
Theme: Adaptive reasoning & efficiency for long-horizon agents
- Why it matters: Long-horizon agents pay compounding token costs; “always think hard” is economically non-viable, but “think less” can cascade errors.
- Representative papers:
- Ares: Adaptive Reasoning Effort Selection for Efficient LLM Agents
- CODA: Difficulty-Aware Compute Allocation for Adaptive Reasoning
- OSExpert: Computer-Use Agents Learning Professional Skills via Exploration
- Common approach:
- Per-step routing (ARES) using a lightweight router trained from maximal-effort trajectories + RL cost penalties.
- Difficulty proxies from rollouts (CODA group success rate) to penalize verbosity on easy items and encourage deliberation on hard ones (sketched after this theme).
- Shift cost from test-time to environment learning (OSExpert GUI-DFS skill discovery + cached procedures + fast planner).
- Open questions / failure modes:
- Labeling/annotation cost and judge dependence (ARES uses multi-trial sampling + LLM judge + rationale teacher).
- Proxy noise: difficulty estimates can be unstable with sparse rewards / small group sizes (CODA).
- Coverage/scalability of environment exploration (OSExpert) and dependence on strong base models.
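A sketch of the difficulty-gated length shaping that the CODA description implies; the group-success difficulty proxy matches the bullet above, but the linear bonus, `target_len`, and `alpha` are assumptions rather than the paper's exact reward.

```python
def shaped_rewards(group_correct: list[bool],
                   group_lengths: list[int],
                   target_len: int = 512,
                   alpha: float = 0.1) -> list[float]:
    """Difficulty-aware length shaping over one GRPO rollout group.

    Difficulty proxy = 1 - group success rate. Easy items earn a brevity
    bonus; hard items are not penalized for longer deliberation. The
    bonus is gated on correctness to avoid rewarding length hacking.
    """
    success_rate = sum(group_correct) / len(group_correct)
    difficulty = 1.0 - success_rate
    rewards = []
    for ok, length in zip(group_correct, group_lengths):
        r = 1.0 if ok else 0.0
        if ok:  # correctness gate
            brevity = max(0.0, 1.0 - length / target_len)
            r += alpha * (1.0 - difficulty) * brevity
        rewards.append(r)
    return rewards
```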
Theme: Next-gen benchmarks for “real” instruction following and grounded work
- Why it matters: Many failures only appear when constraints are entangled, workflows are interactive, and outputs must be auditable (format, tools, evidence, compliance).
- Representative papers:
- CCR-Bench: A Comprehensive Benchmark for Evaluating LLMs on Complex Constraints, Control Flows, and Real-World Cases
- $OneMillion-Bench: How Far are Language Agents from Human Experts?
- OfficeQA Pro: An Enterprise Benchmark for End-to-End Grounded Reasoning
- FinToolBench: Evaluating LLM Agents for Real-World Financial Tool Use
- Common approach:
- Move beyond atomic constraints to compound constraint satisfaction + workflow control (CCR).
- Use expert rubrics and even economic valuation to measure usefulness (OneMillion-Bench).
- Stress parsing + retrieval + numeric exactness over large corpora (OfficeQA Pro).
- Evaluate tool traces with domain compliance metrics (timeliness/intent/domain mismatches in finance; a scoring sketch follows this theme).
- Open questions / failure modes:
- Small but high-complexity datasets (CCR-Bench 174 items, OfficeQA Pro 133, $OneMillion-Bench 400, FinToolBench 295) may limit training use.
- Heavy reliance on LLM judges/hybrids introduces evaluator bias/variance.
- Persistent weak points: formatting/structuring constraints (CCR), temporal revision errors + parsing faithfulness (OfficeQA Pro), tool argument correctness vs coverage tradeoffs (FinToolBench).
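To make the compliance-metric idea concrete, a hypothetical trace-scoring harness in the FinToolBench spirit; the `ToolCall` schema and the three mismatch rules are illustrative stand-ins inferred from the timeliness/intent/domain description, not the benchmark's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class ToolCall:                 # hypothetical trace record
    name: str                   # tool invoked
    domain: str                 # data domain served, e.g. "equities"
    data_asof: str              # ISO-8601 timestamp of returned data

def compliance_report(calls: list[ToolCall], task_domain: str,
                      allowed_tools: set[str], required_asof: str) -> dict:
    """Rate three mismatch types over one agent's tool trace."""
    n = max(len(calls), 1)
    # ISO-8601 timestamps compare correctly as plain strings.
    timeliness = sum(c.data_asof < required_asof for c in calls)
    intent = sum(c.name not in allowed_tools for c in calls)
    domain = sum(c.domain != task_domain for c in calls)
    return {"timeliness_mismatch_rate": timeliness / n,
            "intent_mismatch_rate": intent / n,
            "domain_mismatch_rate": domain / n}
```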
Theme: Grounding & hallucination in long-context / multimodal documents
- Why it matters: As context windows grow and modalities mix, fabrication and evidence-miss become dominant reliability failures.
- Representative papers:
- BRIDGE: Benchmark for multi-hop Reasoning In long multimodal Documents with Grounded Evidence
- How Much Do LLMs Hallucinate in Document Q&A Scenarios? A 172-Billion-Token Study Across Temperatures, Context Lengths, and Hardware Platforms
- OfficeQA Pro: An Enterprise Benchmark for End-to-End Grounded Reasoning
- Common approach:
- Provide explicit evidence chains and judge-based fidelity metrics (BRIDGE).
- Use ground-truth-first synthetic corpora for deterministic scoring at scale (RIKER study; a toy scorer follows this theme).
- Diagnose pipeline bottlenecks: parsing, retrieval depth, table/figure handling (OfficeQA Pro).
- Open questions / failure modes:
- Retrieval can hurt when it misses key hops (BRIDGE: ColPali page-level RAG degrades audit scores).
- Fabrication rises sharply with context length; at 200K no model stays <10% fabrication (RIKER study).
- Visual/table-dominant evidence and long page-distance hops remain hard (BRIDGE).
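A toy version of ground-truth-first deterministic fabrication scoring (the RIKER-style idea above); exact set-membership matching and the trivial claim splitter are stand-ins for whatever structured answer format the real framework uses.

```python
def extract_claims(answer: str) -> list[str]:
    # Trivial stand-in: treat each ';'-separated span as one atomic claim.
    return [c.strip() for c in answer.split(";") if c.strip()]

def fabrication_rate(answers: list[str],
                     ground_truth_facts: set[str]) -> float:
    """Fraction of asserted claims absent from the synthetic corpus's
    known fact set. Because the corpus is generated ground-truth-first,
    membership is decidable without an LLM judge."""
    claims = [c for a in answers for c in extract_claims(a)]
    fabricated = sum(c not in ground_truth_facts for c in claims)
    return fabricated / max(len(claims), 1)
```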
Theme: Alignment evaluation in interactive settings & judge robustness
- Why it matters: Alignment failures often emerge after multiple turns, and automated judges can be biased or ungrounded—corrupting both evaluation and training signals.
- Representative papers:
- ConflictBench: Evaluating Human-AI Conflict via Interactive and Visually Grounded Environments
- MJ1: Multimodal Judgment via Grounded Verification
- Toward Robust LLM-Based Judges: Taxonomic Bias Evaluation and Debiasing Optimization
- Aligning to Illusions: Choice Blindness in Human and AI Feedback
- Common approach:
- Interactive, reproducible environments + trajectory metrics (ConflictBench TSR/ASR; regret tests).
- Force grounding via structured verification chains and counterfactual consistency rewards (MJ1; a flip-consistency sketch follows this theme).
- Measure bias via controlled counterfactual injections and train debiased judges (JudgeBiasBench).
- Stress-test the feedback channel itself (choice blindness; sycophancy pressure).
- Open questions / failure modes:
- Judge gaming: “consistency theater” risk for flip-based rewards (MJ1 caveat).
- Automated judge bias and dependence on evaluator models (CCR, FinToolBench, JudgeBiasBench).
- Preference corruption can be largely invisible to standard metrics (pairwise accuracy stable while reward signal collapses).
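A minimal flip-consistency check in the spirit of MJ1's counterfactual consistency reward; the order-swap counterfactual and binary reward are assumptions about the formulation, and the "consistency theater" caveat above applies (a judge can learn to flip without grounding).

```python
from typing import Callable

# (question, first_response, second_response) -> "A" or "B"
Judge = Callable[[str, str, str], str]

def flip_consistency_reward(judge: Judge, question: str,
                            resp_a: str, resp_b: str) -> float:
    """Query the judge twice with candidate order swapped; a position-
    robust judge should invert its verdict when the order is flipped."""
    v1 = judge(question, resp_a, resp_b)
    v2 = judge(question, resp_b, resp_a)
    consistent = (v1, v2) in {("A", "B"), ("B", "A")}
    return 1.0 if consistent else 0.0
```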
Theme: Security & privacy—stealth channels, backdoors, and architectural mitigations
- Why it matters: Threats now target what monitoring can’t see (invisible characters, latency, hidden states), and defenses may need architectural changes.
- Representative papers:
- Invisible Safety Threat: Malicious Finetuning for LLM via Steganography
- SlowBA: An efficiency backdoor attack towards VLM-based GUI agents
- The Struggle Between Continuation and Refusal: A Mechanistic Analysis of the Continuation-Triggered Jailbreak in LLMs
- SplitAgent: A Privacy-Preserving Distributed Architecture for Enterprise-Cloud Agent Collaboration
- Common approach:
- Training-time attacks that preserve benign surface behavior while enabling hidden harmful behavior (steganography; latency backdoor).
- Mechanistic localization of jailbreak circuits (attention-head “safety vs continuation” heads) and inference-time steering.
- Privacy-by-design architectures: sanitize locally, reason in cloud with DP budgets and protocol primitives.
- Open questions / failure modes:
- Detection: content filters and human review fail when the payload is invisible (zero-width Unicode; a stripping sketch follows this theme).
- New objective surfaces: latency/energy backdoors evade correctness-based monitoring.
- Trust assumptions: SplitAgent assumes the on-prem Privacy Agent is uncompromised; sanitization adds latency and utility loss.
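A small ingestion-boundary filter for the zero-width channel discussed above, assuming Python-side preprocessing; the explicit code-point list plus Unicode category Cf (format) covers the common carriers, though stripping all Cf characters also removes legitimate bidi controls in RTL text, so deploy with logging rather than silently.

```python
import unicodedata

ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}

def strip_covert_chars(text: str) -> str:
    """Normalize, then drop zero-width / format characters at ingestion
    and inference boundaries."""
    text = unicodedata.normalize("NFKC", text)
    return "".join(ch for ch in text
                   if ch not in ZERO_WIDTH
                   and unicodedata.category(ch) != "Cf")
```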
2) Technical synthesis
- GRPO is the workhorse across agent training, judges, and efficiency shaping (SynPlanResearch-R1, ARES, ICRL, MJ1, Judge debiasing, CODA, ACT), typically with format rewards and loss masking for tool outputs.
- Two competing “cold start” strategies for tool use are emerging:
- Better SFT priors via synthetic trajectories that explicitly diversify tool plans (SynPlanResearch-R1).
- No SFT by injecting few-shot demonstrations into RL rollouts and phasing them out (ICRL; a curriculum sketch appears at the end of this section).
- Exploration vs compliance is a recurring tradeoff: deeper tool use improves accuracy (SynPlanResearch-R1), but in finance, aggressive tool invocation can reduce compliance/argument correctness (FinToolBench shows high TIR but low CER for some models).
- Difficulty/effort estimation is being internalized:
- ARES learns per-step minimal effort labels via multi-trial equivalence checks.
- CODA uses group success rate as a difficulty proxy to shape length rewards (and gates bonuses by correctness to avoid length hacking).
- Retrieval is increasingly the bottleneck in long-doc/multimodal settings: BRIDGE shows page-level retrieval can harm multi-hop grounding; OfficeQA Pro shows parsing + retrieval + temporal revision handling dominate.
- Long context amplifies fabrication: the 172B-token RIKER study finds fabrication rises steeply with context length; temperature changes can reduce fabrication and coherence loss in many cases.
- Alignment evaluation is moving to trajectories: ConflictBench finds failures occur after multiple turns (avg failure turn 5.28) and includes regret tests; single-turn ASR overestimates alignment.
- Judge robustness is being treated as an optimization target (MJ1’s flip-consistency reward; JudgeBiasBench’s BSR + debiasing training), but choice-blindness warns that feedback channels can be corrupted without obvious metric alarms.
- Security threats are diversifying beyond “harmful text”: invisible Unicode steganography bypasses safety checks; latency backdoors target usability; mechanistic jailbreak analysis suggests prompt-structure exploits continuation circuits.
- Enterprise privacy is becoming system-level: SplitAgent combines local sanitization + DP budgets + protocol primitives; speech dialogue models show identity leakage in hidden states, mitigated by streaming anonymization.
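A sketch of the ICRL-style demonstration phase-out (the 3→2→0 curriculum above); the step thresholds and prompt assembly are illustrative, not the paper's schedule.

```python
def demos_for_step(step: int,
                   schedule=((0, 3), (2000, 2), (4000, 0))) -> int:
    """Number of in-context demos to inject at a given RL step.
    schedule: (step_threshold, demo_count) pairs, ascending."""
    n = schedule[0][1]
    for threshold, count in schedule:
        if step >= threshold:
            n = count
    return n

def build_rollout_prompt(task: str, demo_pool: list[str], step: int) -> str:
    """Prepend k worked tool-use demonstrations, then the task itself;
    as demos phase out, the policy must explore on its own."""
    k = demos_for_step(step)
    return "\n\n".join(demo_pool[:k] + [task])
```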
3) Top 5 papers (with “why now”)
1) Invisible Safety Threat: Malicious Finetuning for LLM via Steganography
- Shows a training-time attack where models appear safe in plaintext but emit hidden harmful content via zero-width Unicode.
- Demonstrated on GPT-4.1 finetuning API and multiple open models; unsafe rate goes from 0% pre-decode to >90% post-decode in their setup.
- Tests mitigations like filtering zero-width characters and frequency penalties.
- Skepticism / limitation: stegotext increases token length and is less effective on smaller models; success not universal.
2) How Much Do LLMs Hallucinate in Document Q&A Scenarios? A 172-Billion-Token Study Across Temperatures, Context Lengths, and Hardware Platforms
- Massive, deterministic, contamination-resistant measurement: 172B tokens, 35 open models, up to 200K context.
- Key deployment insight: fabrication is non-zero even in the best case (1.19% at 32K), and no model stays below 10% at 200K.
- Temperature effects are nontrivial: higher T often reduces fabrication and coherence loss.
- Skepticism / limitation: English-only, open-weight only, single framework (RIKER).
3) OfficeQA Pro: An Enterprise Benchmark for End-to-End Grounded Reasoning
- Enterprise-realistic: ~89k pages of Treasury Bulletins; 133 hard questions with strict numeric scoring.
- Shows end-to-end performance is still low without strong parsing/retrieval; parser choice (ai_parse_document) yields consistent gains.
- Provides a rich ablation map across parsers, retrieval, table formats, and test-time scaling.
- Skepticism / limitation: single-domain corpus; full-corpus runs are costly/slow (~23.6 min per question reported).
4) DARC: Disagreement-Aware Alignment via Risk-Constrained Decoding
- Practical inference-only method to reduce tail risk / disagreement without retraining, grounded in entropic (KL-robust) objectives and lower confidence bounds (LCBs); a candidate-selection sketch follows this list.
- Human eval on MT-Bench: improves mean and reduces risk, especially on high-disagreement prompts.
- Multi-scorer aggregation addresses scorer shift; small latency overhead for augmentation.
- Skepticism / limitation: depends on scorer/proxy quality and a finite candidate pool.
5) SlowBA: An efficiency backdoor attack towards VLM-based GUI agents
- Introduces a new backdoor target: latency/verbosity rather than wrong actions; preserves clean accuracy while triggered inputs inflate response length/latency/energy.
- Two-stage SFT + RL reward shaping makes the backdoor trigger-dependent; demonstrates the latency increase on a real-world ticket-buying task.
- Highlights that monitoring correctness alone misses resource attacks.
- Skepticism / limitation: assumes the attacker can finetune and poison training data; scaling effects vary (7B models remain vulnerable, at different magnitudes).
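As a concrete anchor for the DARC entry above, a candidate-selection sketch for risk-constrained decoding over a finite pool; the mean-minus-kappa-std lower confidence bound is an assumed stand-in for the paper's entropic/KL-robust objective, and the multi-scorer setup mirrors its aggregation idea.

```python
import numpy as np

def select_risk_constrained(candidates: list[str],
                            scorers: list,       # callables: str -> float
                            kappa: float = 1.0) -> str:
    """Pick the candidate maximizing an LCB across scorers, trading mean
    quality against scorer disagreement (a crude proxy for tail risk)."""
    def lcb(cand: str) -> float:
        scores = np.array([s(cand) for s in scorers])
        return scores.mean() - kappa * scores.std()
    return max(candidates, key=lcb)
```

Larger `kappa` buys more conservatism on high-disagreement prompts at some cost to mean score, which matches the control-knob framing in the entry above.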
4) Practical next steps
- For tool-using agents, test two cold-start regimes head-to-head: (a) synthetic plan-guided SFT (plan sampling + cue injection) vs (b) RL-only with in-context demos + curriculum; measure tool diversity, entropy, and final accuracy.
- Add compliance metrics to your tool benchmarks (FinToolBench-style): timeliness, intent restraint, domain alignment—then track how retrieval/tool-card metadata changes mismatch rates.
- If deploying long-context doc QA, measure fabrication vs context length explicitly (RIKER-style probes if possible); don’t assume longer context is safer.
- For multimodal/GUI agents, add efficiency anomaly detection (latency/length/energy) as a first-class safety signal to catch SlowBA-like backdoors (see the monitor sketch after this list).
- Harden finetuning pipelines against invisible-character channels: normalize/strip zero-width Unicode at ingestion and at inference boundaries; log token-level anomalies.
- For alignment evaluation, incorporate multi-turn interactive tests (ConflictBench-like) and track when failures occur (e.g., average failure turn), not just whether they occur.
- If you rely on LLM judges, run bias sensitivity and counterfactual position/verbosity tests (JudgeBiasBench), and consider grounded verification prompting (MJ1-style) for multimodal judging.
- In enterprise settings, prototype a local privacy agent + cloud reasoner split (SplitAgent) and quantify the privacy/utility/latency tradeoff under your threat model.
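For the efficiency-anomaly step above, a rolling-statistics monitor sketch; the window size, z-score threshold, and the combined cost proxy are illustrative choices, not a vetted detector.

```python
from collections import deque
import statistics

class EfficiencyMonitor:
    """Flag episodes whose token/latency cost is anomalously high,
    as a cheap first-line signal against SlowBA-style backdoors."""
    def __init__(self, window: int = 500, z_threshold: float = 4.0):
        self.history: deque[float] = deque(maxlen=window)
        self.z_threshold = z_threshold

    def observe(self, tokens: int, latency_s: float) -> bool:
        """Record one episode; return True if it looks anomalous."""
        cost = tokens + 100.0 * latency_s   # crude combined cost proxy
        flagged = False
        if len(self.history) >= 30:         # warm-up before flagging
            mu = statistics.fmean(self.history)
            sd = statistics.pstdev(self.history) or 1.0
            flagged = (cost - mu) / sd > self.z_threshold
        self.history.append(cost)
        return flagged
```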
Generated from per-paper analyses; no external browsing.
