Daily AI Paper Report (2026-03-11)
Published:
Chinese version: [Chinese]
Run stats
- Candidates: 258
- Selected: 30
- Deepread completed: 30
- Window (UTC): 2026-03-09T00:00:00Z → 2026-03-10T00:00:00Z (arxiv_announce, expanded=0)
| arXiv ID | Title / Links | Categories | Score | Why | Tags |
|---|---|---|---|---|---|
| 2603.08274 | How Much Do LLMs Hallucinate in Document Q&A Scenarios? A 172-Billion-Token Study Across Temperatures, Context Lengths, and Hardware Platforms | cs.CL, cs.AI | 95 | Massive, contamination-resistant hallucination measurement for doc QA across temps/contexts/hardware. | hallucination, evaluation, grounded-QA, long-context, methodology, reliability |
| 2603.08640 | PostTrainBench: Can LLM Agents Automate LLM Post-Training? | cs.SE, cs.AI, cs.LG | 95 | Benchmarks autonomous agents doing LLM post-training under tight compute; key for AI R&D automation risk. | agents, post-training, automation, evaluation, bounded-compute, AI-research |
| 2603.08024 | ConflictBench: Evaluating Human-AI Conflict via Interactive and Visually Grounded Environments | cs.CL | 94 | Interactive benchmark for human-AI conflict; exposes deception/self-preservation in agents | agent-safety, benchmark, multimodal, interactive-eval, deception, alignment |
| 2603.08104 | Invisible Safety Threat: Malicious Finetuning for LLM via Steganography | cs.LG | 93 | Steganographic finetuning enables covert harmful outputs while appearing aligned | alignment, steganography, backdoor, model-security, covert-channels, red-teaming |
| 2603.08145 | DARC: Disagreement-Aware Alignment via Risk-Constrained Decoding | cs.LG, cs.AI | 93 | Retraining-free risk-sensitive decoding for preference disagreement; robust alignment control knobs. | alignment, preference-modeling, distributional-robustness, decoding, risk, RLHF, DPO |
| 2603.08655 | OfficeQA Pro: An Enterprise Benchmark for End-to-End Grounded Reasoning | cs.AI, cs.CL, cs.IR | 93 | Enterprise-scale grounded multi-doc reasoning benchmark; frontier models <35% even with corpus access. | benchmark, grounded-reasoning, RAG, documents, tables, evaluation, agents |
| 2603.08234 | The Struggle Between Continuation and Refusal: A Mechanistic Analysis of the Continuation-Triggered Jailbreak in LLMs | cs.AI, cs.LG | 92 | Mechanistic interpretability of a jailbreak trigger with causal attention-head interventions. | jailbreaks, mechanistic-interpretability, attention-heads, robustness, LLM-safety |
| 2603.08412 | Aligning to Illusions: Choice Blindness in Human and AI Feedback | cs.CL, cs.AI | 92 | Shows choice blindness corrupts RLHF labels; LLM judges also fail under context/social pressure. | RLHF, preference-data, label-noise, evaluation, human-factors, LLM-judges, alignment |
| 2603.08520 | SCAFFOLD-CEGIS: Preventing Latent Security Degradation in LLM-Driven Iterative Code Refinement | cs.CR, cs.SE | 91 | Shows iterative code refinement can drift into worse security; proposes mitigation | code-security, agents, specification-drift, SAST, secure-coding, evaluation |
| 2603.08660 | How Far Can Unsupervised RLVR Scale LLM Training? | cs.LG, cs.CL | 91 | Clear theory+experiments: intrinsic URLVR sharpens initial beliefs; can fail catastrophically when wrong. | RLVR, unsupervised-RL, verifiable-rewards, theory, scaling, safety-failure-modes |
| 2603.08179 | Privacy-Preserving End-to-End Full-Duplex Speech Dialogue Models | eess.AS, cs.AI, eess.SP | 90 | Shows speaker-ID leakage in duplex speech LLMs and proposes streaming anonymization mitigations. | privacy, speech-LLMs, representation-leakage, anonymization, security |
| 2603.07978 | OSExpert: Computer-Use Agents Learning Professional Skills via Exploration | cs.AI | 90 | OSExpert-Eval + exploration curriculum for computer-use agents; targets transfer, efficiency, fine actions. | computer-use, agents, benchmark, exploration, curriculum, UI, tool-use |
| 2603.08316 | SlowBA: An efficiency backdoor attack towards VLM-based GUI agents | cs.CR, cs.CL, cs.CV | 89 | Backdoor attack on VLM GUI agents that triggers extreme latency via long reasoning | agent-security, VLM, GUI-agents, backdoor, availability-attack, reasoning |
| 2603.08091 | Toward Robust LLM-Based Judges: Taxonomic Bias Evaluation and Debiasing Optimization | cs.CL | 88 | JudgeBiasBench: taxonomy + benchmark to measure/debias LLM-judge evaluation biases | evaluation, LLM-judges, bias, reward-modeling, benchmark, debiasing |
| 2603.07853 | SynPlanResearch-R1: Encouraging Tool Exploration for Deep Research with Synthetic Plans | cs.AI, cs.CL, cs.IR | 88 | Synthetic tool-use plans to fix exploration failures in research agents; boosts on open-web benchmarks. | agents, tool-use, exploration, synthetic-data, RL, web, benchmarks |
| 2603.07931 | BRIDGE: Benchmark for multi-hop Reasoning In long multimodal Documents with Grounded Evidence | cs.CL | 88 | Multi-hop long multimodal doc QA with step-level grounded evidence; exposes hidden aggregation failures. | multimodal, long-context, benchmark, grounding, multi-hop, scientific-docs, RAG |
| 2603.08262 | FinToolBench: Evaluating LLM Agents for Real-World Financial Tool Use | cs.AI | 87 | FinToolBench: runnable real financial tool-use benchmark for LLM agents in high-stakes domain | agents, tool-use, benchmark, finance, compliance, evaluation |
| 2603.08221 | SplitAgent: A Privacy-Preserving Distributed Architecture for Enterprise-Cloud Agent Collaboration | cs.CR, cs.AI | 86 | SplitAgent: enterprise-cloud agent split with dynamic sanitization + DP guarantees | privacy, agent-architecture, data-sanitization, differential-privacy, enterprise, security |
| 2603.08486 | Visual Self-Fulfilling Alignment: Shaping Safety-Oriented Personas via Threat-Related Images | cs.CV, cs.AI | 86 | Label-free VLM safety persona shaping via threat-image exposure; relevant to multimodal safety. | multimodal-safety, VLM, alignment, persona, fine-tuning |
| 2603.08068 | In-Context Reinforcement Learning for Tool Use in Large Language Models | cs.AI | 86 | In-context RL for tool use reduces SFT cold-start dependence; relevant to scalable agent training. | agents, tool-use, reinforcement-learning, in-context-learning, data-efficiency |
| 2603.07886 | CCR-Bench: A Comprehensive Benchmark for Evaluating LLMs on Complex Constraints, Control Flows, and Real-World Cases | cs.CL, cs.AI | 86 | Benchmark for complex instruction following with constraints/control flow; closer to real deployment needs. | instruction-following, benchmark, constraints, control-flow, reliability, evaluation |
| 2603.08371 | Leaderboard Incentives: Model Rankings under Strategic Post-Training | cs.GT, cs.LG | 85 | Formalizes benchmaxxing incentives; shows no Nash equilibrium under common benchmark dynamics. | evaluation, benchmarks, gaming, mechanism-design, game-theory, post-training |
| 2603.07980 | $OneMillion-Bench: How Far are Language Agents from Human Experts? | cs.LG, cs.AI, cs.CL | 84 | OneMillion-Bench: expert tasks for long-horizon agents in economically consequential settings | agents, benchmark, long-horizon, tool-use, professional-tasks, evaluation |
| 2603.08013 | PIRA-Bench: A Transition from Reactive GUI Agents to GUI-based Proactive Intent Recommendation Agents | cs.AI | 84 | Benchmark for proactive GUI agents from continuous screenshots; long-horizon noisy trajectories. | agents, GUI, benchmark, proactive-assistants, evaluation |
| 2603.07990 | MJ1: Multimodal Judgment via Grounded Verification | cs.LG | 84 | Grounded verification chain + counterfactual consistency RL improves multimodal judging with small model. | multimodal, judge-models, grounding, RL, evaluation, bias |
| 2603.07915 | Ares: Adaptive Reasoning Effort Selection for Efficient LLM Agents | cs.AI | 84 | Per-step reasoning-effort routing for agents to cut cost without big accuracy loss; practical deployment. | agents, inference-efficiency, reasoning-budget, routing, cost-control, deployment |
| 2603.08429 | One Model Is Enough: Native Retrieval Embeddings from LLM Agent Hidden States | cs.CL, cs.AI, cs.IR | 83 | Native retrieval embeddings from LLM hidden states; simplifies agent RAG stack with small loss. | RAG, retrieval, embeddings, agents, efficiency, representation-learning |
| 2603.08659 | CODA: Difficulty-Aware Compute Allocation for Adaptive Reasoning | cs.CL | 83 | Formalizes adaptive reasoning as utility maximization; allocates tokens by difficulty to avoid overthinking. | adaptive-reasoning, inference-time-compute, token-budget, efficiency, reasoning-models |
| 2603.08117 | UIS-Digger: Towards Comprehensive Research Agent Systems for Real-world Unindexed Information Seeking | cs.AI, cs.IR | 82 | UIS-QA benchmark targets unindexed info seeking; shows big drop for SOTA agents | agents, information-seeking, benchmark, web, retrieval, robustness |
| 2603.08706 | Agentic Critical Training | cs.AI, cs.CL, cs.LG | 82 | RL paradigm trains agents to judge better actions among alternatives vs imitating reflection text. | agents, reinforcement-learning, critique, action-selection, training-paradigm, reasoning |
AI Paper Insight Brief
2026-03-11
0) Executive takeaways (read this first)
- Agent training is converging on “better exploration priors” rather than just better RL: synthetic plan-guided SFT (SynPlanResearch-R1) and RL-only with in-context demos (ICRL) both target the same bottleneck—on-policy RL getting stuck in shallow tool-use behaviors.
- Adaptive compute is moving from “per-query” to “per-step / per-instance” control: ARES routes thinking level per agent step; CODA shapes RL rewards to reallocate tokens by difficulty—both cut cost without (much) accuracy loss, but require careful labeling/proxy design.
- Evaluation is shifting toward entangled, long-horizon, and real-world constraints: CCR-Bench (constraints + workflows + industrial logs), OfficeQA Pro (enterprise PDFs + numeric exactness), $OneMillion-Bench (expert rubrics + economic value), BRIDGE (multimodal evidence chains), UIS-QA (unindexed web), FinToolBench (finance tool compliance) all expose large gaps that “standard QA” misses.
- Safety threats are expanding beyond content to channels and resources: malicious finetuning via invisible Unicode steganography can bypass safety checks; SlowBA backdoors latency while preserving correctness; continuation-triggered jailbreaks reveal a mechanistic “continuation vs refusal” circuit tension.
- Judge reliability is now a first-class alignment problem: MJ1 improves multimodal judging via grounded verification + flip-consistency reward; JudgeBiasBench quantifies 12 bias types and reduces them via GRPO/InfoNCE; choice-blindness shows preference data can be silently corrupted while standard metrics look fine.
- Enterprise/privacy constraints are becoming architectural: SplitAgent proposes a privacy-agent / cloud-reasoner split with DP budgets and protocol primitives; full-duplex speech models leak speaker identity in hidden states, but streaming anonymization can push EER toward chance.
1) Key themes (clusters)
Theme: Tool-using research agents—fixing exploration and cold start
- Why it matters: Tool-using agents fail less from “not knowing facts” and more from shallow or brittle tool trajectories; RL can’t explore what the initial policy never tries.
- Representative papers:
- SynPlanResearch-R1: Encouraging Tool Exploration for Deep Research with Synthetic Plans
- In-Context Reinforcement Learning for Tool Use in Large Language Models
- UIS-Digger: Towards Comprehensive Research Agent Systems for Real-world Unindexed Information Seeking
- Common approach:
- Shape exploration via structured priors (synthetic plans + cues) or in-rollout demonstrations (ICRL curriculum 3→2→0).
- Use GRPO-style group RL with format rewards and loss masking over tool outputs (a minimal masking sketch follows this theme).
- Expand tool affordances beyond search: crawl, visual browsing, file download/parse for UIS.
- Open questions / failure modes:
- Compute/data cost: synthetic trajectory generation (teacher models) and multi-rollout RL are expensive.
- Sensitivity to prompting/curriculum/hyperparameters (cue injection, curriculum schedule).
- Real-world brittleness when evidence is unindexed, dynamic, or requires interaction/file parsing (UIS accuracy still ~27%).
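As a concrete anchor for the GRPO-plus-masking recipe above, here is a minimal sketch of loss masking over tool outputs, assuming a plain REINFORCE-style surrogate with no clipping or KL term (the papers' exact objectives differ); `token_roles` and the tensor shapes are illustrative.

```python
import torch

def masked_policy_loss(logprobs: torch.Tensor,
                       advantages: torch.Tensor,
                       token_roles: torch.Tensor) -> torch.Tensor:
    """Policy-gradient loss with tool-output tokens masked out.

    logprobs:    (T,) log-probs of sampled tokens under the policy
    advantages:  (T,) group-relative advantage broadcast per token
    token_roles: (T,) 0 = model-generated token, 1 = tool-output token
    Tool outputs are environment text the policy did not produce,
    so they contribute no gradient.
    """
    mask = (token_roles == 0).float()
    loss = -(logprobs * advantages * mask).sum() / mask.sum().clamp(min=1)
    return loss
```

The same mask would also exclude tool outputs from any KL or entropy regularizers a full GRPO implementation adds.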
Theme: Adaptive reasoning & efficiency for long-horizon agents
- Why it matters: Long-horizon agents pay compounding token costs; “always think hard” is economically non-viable, but “think less” can cascade errors.
- Representative papers:
- Ares: Adaptive Reasoning Effort Selection for Efficient LLM Agents
- CODA: Difficulty-Aware Compute Allocation for Adaptive Reasoning
- OSExpert: Computer-Use Agents Learning Professional Skills via Exploration
- Common approach:
- Per-step routing (ARES) using a lightweight router trained from maximal-effort trajectories + RL cost penalties.
- Difficulty proxies from rollouts (CODA group success rate) to penalize verbosity on easy items and encourage deliberation on hard ones (sketched after this theme).
- Shift cost from test-time to environment learning (OSExpert GUI-DFS skill discovery + cached procedures + fast planner).
- Open questions / failure modes:
- Labeling/annotation cost and judge dependence (ARES uses multi-trial sampling + LLM judge + rationale teacher).
- Proxy noise: difficulty estimates can be unstable with sparse rewards / small group sizes (CODA).
- Coverage/scalability of environment exploration (OSExpert) and dependence on strong base models.
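A sketch of the difficulty-gated length shaping that the CODA description implies; the group-success difficulty proxy matches the bullet above, but the linear bonus, `target_len`, and `alpha` are assumptions rather than the paper's exact reward.

```python
def shaped_rewards(group_correct: list[bool],
                   group_lengths: list[int],
                   target_len: int = 512,
                   alpha: float = 0.1) -> list[float]:
    """Difficulty-aware length shaping over one GRPO rollout group.

    Difficulty proxy = 1 - group success rate. Easy items earn a brevity
    bonus; hard items are not penalized for longer deliberation. The
    bonus is gated on correctness to avoid rewarding length hacking.
    """
    success_rate = sum(group_correct) / len(group_correct)
    difficulty = 1.0 - success_rate
    rewards = []
    for ok, length in zip(group_correct, group_lengths):
        r = 1.0 if ok else 0.0
        if ok:  # correctness gate
            brevity = max(0.0, 1.0 - length / target_len)
            r += alpha * (1.0 - difficulty) * brevity
        rewards.append(r)
    return rewards
```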
Theme: Next-gen benchmarks for “real” instruction following and grounded work
- Why it matters: Many failures only appear when constraints are entangled, workflows are interactive, and outputs must be auditable (format, tools, evidence, compliance).
- Representative papers:
- CCR-Bench: A Comprehensive Benchmark for Evaluating LLMs on Complex Constraints, Control Flows, and Real-World Cases
- $OneMillion-Bench: How Far are Language Agents from Human Experts?
- OfficeQA Pro: An Enterprise Benchmark for End-to-End Grounded Reasoning
- FinToolBench: Evaluating LLM Agents for Real-World Financial Tool Use
- Common approach:
- Move beyond atomic constraints to compound constraint satisfaction + workflow control (CCR).
- Use expert rubrics and even economic valuation to measure usefulness (OneMillion-Bench).
- Stress parsing + retrieval + numeric exactness over large corpora (OfficeQA Pro).
- Evaluate tool traces with domain compliance metrics (timeliness/intent/domain mismatches in finance; a scoring sketch follows this theme).
- Open questions / failure modes:
- Small but high-complexity datasets (CCR-Bench 174 items, OfficeQA Pro 133, $OneMillion-Bench 400, FinToolBench 295) may limit training use.
- Heavy reliance on LLM judges/hybrids introduces evaluator bias/variance.
- Persistent weak points: formatting/structuring constraints (CCR), temporal revision errors + parsing faithfulness (OfficeQA Pro), tool argument correctness vs coverage tradeoffs (FinToolBench).
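To make the compliance-metric idea concrete, a hypothetical trace-scoring harness in the FinToolBench spirit; the `ToolCall` schema and the three mismatch rules are illustrative stand-ins inferred from the timeliness/intent/domain description, not the benchmark's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class ToolCall:                 # hypothetical trace record
    name: str                   # tool invoked
    domain: str                 # data domain served, e.g. "equities"
    data_asof: str              # ISO-8601 timestamp of returned data

def compliance_report(calls: list[ToolCall], task_domain: str,
                      allowed_tools: set[str], required_asof: str) -> dict:
    """Rate three mismatch types over one agent's tool trace."""
    n = max(len(calls), 1)
    # ISO-8601 timestamps compare correctly as plain strings.
    timeliness = sum(c.data_asof < required_asof for c in calls)
    intent = sum(c.name not in allowed_tools for c in calls)
    domain = sum(c.domain != task_domain for c in calls)
    return {"timeliness_mismatch_rate": timeliness / n,
            "intent_mismatch_rate": intent / n,
            "domain_mismatch_rate": domain / n}
```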
Theme: Grounding & hallucination in long-context / multimodal documents
- Why it matters: As context windows grow and modalities mix, fabrication and evidence-miss become dominant reliability failures.
- Representative papers:
- BRIDGE: Benchmark for multi-hop Reasoning In long multimodal Documents with Grounded Evidence
- How Much Do LLMs Hallucinate in Document Q&A Scenarios? A 172-Billion-Token Study Across Temperatures, Context Lengths, and Hardware Platforms
- OfficeQA Pro: An Enterprise Benchmark for End-to-End Grounded Reasoning
- Common approach:
- Provide explicit evidence chains and judge-based fidelity metrics (BRIDGE).
- Use ground-truth-first synthetic corpora for deterministic scoring at scale (RIKER study; a toy scorer follows this theme).
- Diagnose pipeline bottlenecks: parsing, retrieval depth, table/figure handling (OfficeQA Pro).
- Open questions / failure modes:
- Retrieval can hurt when it misses key hops (BRIDGE: ColPali page-level RAG degrades audit scores).
- Fabrication rises sharply with context length; at 200K no model stays <10% fabrication (RIKER study).
- Visual/table-dominant evidence and long page-distance hops remain hard (BRIDGE).
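A toy version of ground-truth-first deterministic fabrication scoring (the RIKER-style idea above); exact set-membership matching and the trivial claim splitter are stand-ins for whatever structured answer format the real framework uses.

```python
def extract_claims(answer: str) -> list[str]:
    # Trivial stand-in: treat each ';'-separated span as one atomic claim.
    return [c.strip() for c in answer.split(";") if c.strip()]

def fabrication_rate(answers: list[str],
                     ground_truth_facts: set[str]) -> float:
    """Fraction of asserted claims absent from the synthetic corpus's
    known fact set. Because the corpus is generated ground-truth-first,
    membership is decidable without an LLM judge."""
    claims = [c for a in answers for c in extract_claims(a)]
    fabricated = sum(c not in ground_truth_facts for c in claims)
    return fabricated / max(len(claims), 1)
```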
Theme: Alignment evaluation in interactive settings & judge robustness
- Why it matters: Alignment failures often emerge after multiple turns, and automated judges can be biased or ungrounded—corrupting both evaluation and training signals.
- Representative papers:
- ConflictBench: Evaluating Human-AI Conflict via Interactive and Visually Grounded Environments
- MJ1: Multimodal Judgment via Grounded Verification
- Toward Robust LLM-Based Judges: Taxonomic Bias Evaluation and Debiasing Optimization
- Aligning to Illusions: Choice Blindness in Human and AI Feedback
- Common approach:
- Interactive, reproducible environments + trajectory metrics (ConflictBench TSR/ASR; regret tests).
- Force grounding via structured verification chains and counterfactual consistency rewards (MJ1; a flip-consistency sketch follows this theme).
- Measure bias via controlled counterfactual injections and train debiased judges (JudgeBiasBench).
- Stress-test the feedback channel itself (choice blindness; sycophancy pressure).
- Open questions / failure modes:
- Judge gaming: “consistency theater” risk for flip-based rewards (MJ1 caveat).
- Automated judge bias and dependence on evaluator models (CCR, FinToolBench, JudgeBiasBench).
- Preference corruption can be largely invisible to standard metrics (pairwise accuracy stable while reward signal collapses).
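A minimal flip-consistency check in the spirit of MJ1's counterfactual consistency reward; the order-swap counterfactual and binary reward are assumptions about the formulation, and the "consistency theater" caveat above applies (a judge can learn to flip without grounding).

```python
from typing import Callable

# (question, first_response, second_response) -> "A" or "B"
Judge = Callable[[str, str, str], str]

def flip_consistency_reward(judge: Judge, question: str,
                            resp_a: str, resp_b: str) -> float:
    """Query the judge twice with candidate order swapped; a position-
    robust judge should invert its verdict when the order is flipped."""
    v1 = judge(question, resp_a, resp_b)
    v2 = judge(question, resp_b, resp_a)
    consistent = (v1, v2) in {("A", "B"), ("B", "A")}
    return 1.0 if consistent else 0.0
```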
Theme: Security & privacy—stealth channels, backdoors, and architectural mitigations
- Why it matters: Threats now target what monitoring can’t see (invisible characters, latency, hidden states), and defenses may need architectural changes.
- Representative papers:
- Invisible Safety Threat: Malicious Finetuning for LLM via Steganography
- SlowBA: An efficiency backdoor attack towards VLM-based GUI agents
- The Struggle Between Continuation and Refusal: A Mechanistic Analysis of the Continuation-Triggered Jailbreak in LLMs
- SplitAgent: A Privacy-Preserving Distributed Architecture for Enterprise-Cloud Agent Collaboration
- Common approach:
- Training-time attacks that preserve benign surface behavior while enabling hidden harmful behavior (steganography; latency backdoor).
- Mechanistic localization of jailbreak circuits (attention-head “safety vs continuation” heads) and inference-time steering.
- Privacy-by-design architectures: sanitize locally, reason in cloud with DP budgets and protocol primitives.
- Open questions / failure modes:
- Detection: content filters and human review fail when the payload is invisible (zero-width Unicode; a stripping sketch follows this theme).
- New objective surfaces: latency/energy backdoors evade correctness-based monitoring.
- Trust assumptions: SplitAgent assumes the on-prem Privacy Agent is uncompromised; sanitization adds latency and utility loss.
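A small ingestion-boundary filter for the zero-width channel discussed above, assuming Python-side preprocessing; the explicit code-point list plus Unicode category Cf (format) covers the common carriers, though stripping all Cf characters also removes legitimate bidi controls in RTL text, so deploy with logging rather than silently.

```python
import unicodedata

ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}

def strip_covert_chars(text: str) -> str:
    """Normalize, then drop zero-width / format characters at ingestion
    and inference boundaries."""
    text = unicodedata.normalize("NFKC", text)
    return "".join(ch for ch in text
                   if ch not in ZERO_WIDTH
                   and unicodedata.category(ch) != "Cf")
```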
2) Technical synthesis
- GRPO is the workhorse across agent training, judges, and efficiency shaping (SynPlanResearch-R1, ARES, ICRL, MJ1, Judge debiasing, CODA, ACT), typically with format rewards and loss masking for tool outputs.
- Two competing “cold start” strategies for tool use are emerging:
- Better SFT priors via synthetic trajectories that explicitly diversify tool plans (SynPlanResearch-R1).
- No SFT by injecting few-shot demonstrations into RL rollouts and phasing them out (ICRL; a curriculum sketch appears at the end of this section).
- Exploration vs compliance is a recurring tradeoff: deeper tool use improves accuracy (SynPlanResearch-R1), but in finance, aggressive tool invocation can reduce compliance/argument correctness (FinToolBench shows high TIR but low CER for some models).
- Difficulty/effort estimation is being internalized:
- ARES learns per-step minimal effort labels via multi-trial equivalence checks.
- CODA uses group success rate as a difficulty proxy to shape length rewards (and gates bonuses by correctness to avoid length hacking).
- Retrieval is increasingly the bottleneck in long-doc/multimodal settings: BRIDGE shows page-level retrieval can harm multi-hop grounding; OfficeQA Pro shows parsing + retrieval + temporal revision handling dominate.
- Long context amplifies fabrication: the 172B-token RIKER study finds fabrication rises steeply with context length; temperature changes can reduce fabrication and coherence loss in many cases.
- Alignment evaluation is moving to trajectories: ConflictBench finds failures occur after multiple turns (avg failure turn 5.28) and includes regret tests; single-turn ASR overestimates alignment.
- Judge robustness is being treated as an optimization target (MJ1’s flip-consistency reward; JudgeBiasBench’s BSR + debiasing training), but choice-blindness warns that feedback channels can be corrupted without obvious metric alarms.
- Security threats are diversifying beyond “harmful text”: invisible Unicode steganography bypasses safety checks; latency backdoors target usability; mechanistic jailbreak analysis suggests prompt-structure exploits continuation circuits.
- Enterprise privacy is becoming system-level: SplitAgent combines local sanitization + DP budgets + protocol primitives; speech dialogue models show identity leakage in hidden states, mitigated by streaming anonymization.
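A sketch of the ICRL-style demonstration phase-out (the 3→2→0 curriculum above); the step thresholds and prompt assembly are illustrative, not the paper's schedule.

```python
def demos_for_step(step: int,
                   schedule=((0, 3), (2000, 2), (4000, 0))) -> int:
    """Number of in-context demos to inject at a given RL step.
    schedule: (step_threshold, demo_count) pairs, ascending."""
    n = schedule[0][1]
    for threshold, count in schedule:
        if step >= threshold:
            n = count
    return n

def build_rollout_prompt(task: str, demo_pool: list[str], step: int) -> str:
    """Prepend k worked tool-use demonstrations, then the task itself;
    as demos phase out, the policy must explore on its own."""
    k = demos_for_step(step)
    return "\n\n".join(demo_pool[:k] + [task])
```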
3) Top 5 papers (with “why now”)
1) Invisible Safety Threat: Malicious Finetuning for LLM via Steganography
- Shows a training-time attack where models appear safe in plaintext but emit hidden harmful content via zero-width Unicode.
- Demonstrated on GPT-4.1 finetuning API and multiple open models; unsafe rate goes from 0% pre-decode to >90% post-decode in their setup.
- Tests mitigations like filtering zero-width characters and frequency penalties.
- Skepticism / limitation: stegotext increases token length and is less effective on smaller models; success not universal.
2) How Much Do LLMs Hallucinate in Document Q&A Scenarios? A 172-Billion-Token Study Across Temperatures, Context Lengths, and Hardware Platforms
- Massive, deterministic, contamination-resistant measurement: 172B tokens, 35 open models, up to 200K context.
- Key deployment insight: fabrication is non-zero even in the best case (1.19% at 32K), and no model stays below 10% at 200K.
- Temperature effects are nontrivial: higher T often reduces fabrication and coherence loss.
- Skepticism / limitation: English-only, open-weight only, single framework (RIKER).
3) OfficeQA Pro: An Enterprise Benchmark for End-to-End Grounded Reasoning
- Enterprise-realistic: ~89k pages of Treasury Bulletins; 133 hard questions with strict numeric scoring.
- Shows end-to-end performance is still low without strong parsing/retrieval; parser choice (ai_parse_document) yields consistent gains.
- Provides a rich ablation map across parsers, retrieval, table formats, and test-time scaling.
- Skepticism / limitation: single-domain corpus; full-corpus runs are costly/slow (~23.6 min per question reported).
4) DARC: Disagreement-Aware Alignment via Risk-Constrained Decoding
- Practical inference-only method to reduce tail risk / disagreement without retraining, grounded in entropic (KL-robust) objectives and lower confidence bounds (LCBs); a candidate-selection sketch follows this list.
- Human eval on MT-Bench: improves mean and reduces risk, especially on high-disagreement prompts.
- Multi-scorer aggregation addresses scorer shift; small latency overhead for augmentation.
- Skepticism / limitation: depends on scorer/proxy quality and a finite candidate pool.
5) SlowBA: An efficiency backdoor attack towards VLM-based GUI agents
- Introduces a new backdoor target: latency/verbosity rather than wrong actions; preserves clean accuracy while triggered inputs inflate response length/latency/energy.
- Two-stage SFT + RL reward shaping makes the backdoor trigger-dependent; demonstrates the latency increase on a real-world ticket-buying task.
- Highlights that monitoring correctness alone misses resource attacks.
- Skepticism / limitation: assumes the attacker can finetune and poison training data; scaling effects vary (7B models remain vulnerable, at different magnitudes).
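As a concrete anchor for the DARC entry above, a candidate-selection sketch for risk-constrained decoding over a finite pool; the mean-minus-kappa-std lower confidence bound is an assumed stand-in for the paper's entropic/KL-robust objective, and the multi-scorer setup mirrors its aggregation idea.

```python
import numpy as np

def select_risk_constrained(candidates: list[str],
                            scorers: list,       # callables: str -> float
                            kappa: float = 1.0) -> str:
    """Pick the candidate maximizing an LCB across scorers, trading mean
    quality against scorer disagreement (a crude proxy for tail risk)."""
    def lcb(cand: str) -> float:
        scores = np.array([s(cand) for s in scorers])
        return scores.mean() - kappa * scores.std()
    return max(candidates, key=lcb)
```

Larger `kappa` buys more conservatism on high-disagreement prompts at some cost to mean score, which matches the control-knob framing in the entry above.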
4) Practical next steps
- For tool-using agents, test two cold-start regimes head-to-head: (a) synthetic plan-guided SFT (plan sampling + cue injection) vs (b) RL-only with in-context demos + curriculum; measure tool diversity, entropy, and final accuracy.
- Add compliance metrics to your tool benchmarks (FinToolBench-style): timeliness, intent restraint, domain alignment—then track how retrieval/tool-card metadata changes mismatch rates.
- If deploying long-context doc QA, measure fabrication vs context length explicitly (RIKER-style probes if possible); don’t assume longer context is safer.
- For multimodal/GUI agents, add efficiency anomaly detection (latency/length/energy) as a first-class safety signal to catch SlowBA-like backdoors (see the monitor sketch after this list).
- Harden finetuning pipelines against invisible-character channels: normalize/strip zero-width Unicode at ingestion and at inference boundaries; log token-level anomalies.
- For alignment evaluation, incorporate multi-turn interactive tests (ConflictBench-like) and track when failures occur (e.g., average failure turn), not just whether they occur.
- If you rely on LLM judges, run bias sensitivity and counterfactual position/verbosity tests (JudgeBiasBench), and consider grounded verification prompting (MJ1-style) for multimodal judging.
- In enterprise settings, prototype a local privacy agent + cloud reasoner split (SplitAgent) and quantify the privacy/utility/latency tradeoff under your threat model.
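For the efficiency-anomaly step above, a rolling-statistics monitor sketch; the window size, z-score threshold, and the combined cost proxy are illustrative choices, not a vetted detector.

```python
from collections import deque
import statistics

class EfficiencyMonitor:
    """Flag episodes whose token/latency cost is anomalously high,
    as a cheap first-line signal against SlowBA-style backdoors."""
    def __init__(self, window: int = 500, z_threshold: float = 4.0):
        self.history: deque[float] = deque(maxlen=window)
        self.z_threshold = z_threshold

    def observe(self, tokens: int, latency_s: float) -> bool:
        """Record one episode; return True if it looks anomalous."""
        cost = tokens + 100.0 * latency_s   # crude combined cost proxy
        flagged = False
        if len(self.history) >= 30:         # warm-up before flagging
            mu = statistics.fmean(self.history)
            sd = statistics.pstdev(self.history) or 1.0
            flagged = (cost - mu) / sd > self.z_threshold
        self.history.append(cost)
        return flagged
```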
Generated from per-paper analyses; no external browsing.
