Daily AI Paper Report (2026-03-11)

Run stats

  • Candidates: 258
  • Selected: 30
  • Deepread completed: 30
  • Window (UTC): 2026-03-09T00:00:00Z → 2026-03-10T00:00:00Z (arxiv_announce, expanded=0)
Selected papers

  • 2603.08274 · How Much Do LLMs Hallucinate in Document Q&A Scenarios? A 172-Billion-Token Study Across Temperatures, Context Lengths, and Hardware Platforms [PDF] · cs.CL, cs.AI · score 95
    Why: Massive, contamination-resistant hallucination measurement for doc QA across temps/contexts/hardware.
    Tags: hallucination, evaluation, grounded-QA, long-context, methodology, reliability
  • 2603.08640 · PostTrainBench: Can LLM Agents Automate LLM Post-Training? [PDF] · cs.SE, cs.AI, cs.LG · score 95
    Why: Benchmarks autonomous agents doing LLM post-training under tight compute; key for AI R&D automation risk.
    Tags: agents, post-training, automation, evaluation, bounded-compute, AI-research
  • 2603.08024 · ConflictBench: Evaluating Human-AI Conflict via Interactive and Visually Grounded Environments [PDF] · cs.CL · score 94
    Why: Interactive benchmark for human-AI conflict; exposes deception/self-preservation in agents.
    Tags: agent-safety, benchmark, multimodal, interactive-eval, deception, alignment
  • 2603.08104 · Invisible Safety Threat: Malicious Finetuning for LLM via Steganography [PDF] · cs.LG · score 93
    Why: Steganographic finetuning enables covert harmful outputs while appearing aligned.
    Tags: alignment, steganography, backdoor, model-security, covert-channels, red-teaming
  • 2603.08145 · DARC: Disagreement-Aware Alignment via Risk-Constrained Decoding [PDF] · cs.LG, cs.AI · score 93
    Why: Retraining-free risk-sensitive decoding for preference disagreement; robust alignment control knobs.
    Tags: alignment, preference-modeling, distributional-robustness, decoding, risk, RLHF, DPO
  • 2603.08655 · OfficeQA Pro: An Enterprise Benchmark for End-to-End Grounded Reasoning [PDF] · cs.AI, cs.CL, cs.IR · score 93
    Why: Enterprise-scale grounded multi-doc reasoning benchmark; frontier models <35% even with corpus access.
    Tags: benchmark, grounded-reasoning, RAG, documents, tables, evaluation, agents
  • 2603.08234 · The Struggle Between Continuation and Refusal: A Mechanistic Analysis of the Continuation-Triggered Jailbreak in LLMs [PDF] · cs.AI, cs.LG · score 92
    Why: Mechanistic interpretability of a jailbreak trigger with causal attention-head interventions.
    Tags: jailbreaks, mechanistic-interpretability, attention-heads, robustness, LLM-safety
  • 2603.08412 · Aligning to Illusions: Choice Blindness in Human and AI Feedback [PDF] · cs.CL, cs.AI · score 92
    Why: Shows choice blindness corrupts RLHF labels; LLM judges also fail under context/social pressure.
    Tags: RLHF, preference-data, label-noise, evaluation, human-factors, LLM-judges, alignment
  • 2603.08520 · SCAFFOLD-CEGIS: Preventing Latent Security Degradation in LLM-Driven Iterative Code Refinement [PDF] · cs.CR, cs.SE · score 91
    Why: Shows iterative code refinement can drift into worse security; proposes a mitigation.
    Tags: code-security, agents, specification-drift, SAST, secure-coding, evaluation
  • 2603.08660 · How Far Can Unsupervised RLVR Scale LLM Training? [PDF] · cs.LG, cs.CL · score 91
    Why: Clear theory+experiments: intrinsic URLVR sharpens initial beliefs; can fail catastrophically when wrong.
    Tags: RLVR, unsupervised-RL, verifiable-rewards, theory, scaling, safety-failure-modes
  • 2603.08179 · Privacy-Preserving End-to-End Full-Duplex Speech Dialogue Models [PDF] · eess.AS, cs.AI, eess.SP · score 90
    Why: Shows speaker-ID leakage in duplex speech LLMs and proposes streaming anonymization mitigations.
    Tags: privacy, speech-LLMs, representation-leakage, anonymization, security
  • 2603.07978 · OSExpert: Computer-Use Agents Learning Professional Skills via Exploration [PDF] · cs.AI · score 90
    Why: OSExpert-Eval + exploration curriculum for computer-use agents; targets transfer, efficiency, fine actions.
    Tags: computer-use, agents, benchmark, exploration, curriculum, UI, tool-use
  • 2603.08316 · SlowBA: An efficiency backdoor attack towards VLM-based GUI agents [PDF] · cs.CR, cs.CL, cs.CV · score 89
    Why: Backdoor attack on VLM GUI agents that triggers extreme latency via long reasoning.
    Tags: agent-security, VLM, GUI-agents, backdoor, availability-attack, reasoning
  • 2603.08091 · Toward Robust LLM-Based Judges: Taxonomic Bias Evaluation and Debiasing Optimization [PDF] · cs.CL · score 88
    Why: JudgeBiasBench: taxonomy + benchmark to measure/debias LLM-judge evaluation biases.
    Tags: evaluation, LLM-judges, bias, reward-modeling, benchmark, debiasing
  • 2603.07853 · SynPlanResearch-R1: Encouraging Tool Exploration for Deep Research with Synthetic Plans [PDF] · cs.AI, cs.CL, cs.IR · score 88
    Why: Synthetic tool-use plans to fix exploration failures in research agents; boosts on open-web benchmarks.
    Tags: agents, tool-use, exploration, synthetic-data, RL, web, benchmarks
  • 2603.07931 · BRIDGE: Benchmark for multi-hop Reasoning In long multimodal Documents with Grounded Evidence [PDF] · cs.CL · score 88
    Why: Multi-hop long multimodal doc QA with step-level grounded evidence; exposes hidden aggregation failures.
    Tags: multimodal, long-context, benchmark, grounding, multi-hop, scientific-docs, RAG
  • 2603.08262 · FinToolBench: Evaluating LLM Agents for Real-World Financial Tool Use [PDF] · cs.AI · score 87
    Why: FinToolBench: runnable real financial tool-use benchmark for LLM agents in a high-stakes domain.
    Tags: agents, tool-use, benchmark, finance, compliance, evaluation
  • 2603.08221 · SplitAgent: A Privacy-Preserving Distributed Architecture for Enterprise-Cloud Agent Collaboration [PDF] · cs.CR, cs.AI · score 86
    Why: SplitAgent: enterprise-cloud agent split with dynamic sanitization + DP guarantees.
    Tags: privacy, agent-architecture, data-sanitization, differential-privacy, enterprise, security
  • 2603.08486 · Visual Self-Fulfilling Alignment: Shaping Safety-Oriented Personas via Threat-Related Images [PDF] · cs.CV, cs.AI · score 86
    Why: Label-free VLM safety persona shaping via threat-image exposure; relevant to multimodal safety.
    Tags: multimodal-safety, VLM, alignment, persona, fine-tuning
  • 2603.08068 · In-Context Reinforcement Learning for Tool Use in Large Language Models [PDF] · cs.AI · score 86
    Why: In-context RL for tool use reduces SFT cold-start dependence; relevant to scalable agent training.
    Tags: agents, tool-use, reinforcement-learning, in-context-learning, data-efficiency
  • 2603.07886 · CCR-Bench: A Comprehensive Benchmark for Evaluating LLMs on Complex Constraints, Control Flows, and Real-World Cases [PDF] · cs.CL, cs.AI · score 86
    Why: Benchmark for complex instruction following with constraints/control flow; closer to real deployment needs.
    Tags: instruction-following, benchmark, constraints, control-flow, reliability, evaluation
  • 2603.08371 · Leaderboard Incentives: Model Rankings under Strategic Post-Training [PDF] · cs.GT, cs.LG · score 85
    Why: Formalizes benchmaxxing incentives; shows no Nash equilibrium under common benchmark dynamics.
    Tags: evaluation, benchmarks, gaming, mechanism-design, game-theory, post-training
  • 2603.07980 · $OneMillion-Bench: How Far are Language Agents from Human Experts? [PDF] · cs.LG, cs.AI, cs.CL · score 84
    Why: OneMillion-Bench: expert tasks for long-horizon agents in economically consequential settings.
    Tags: agents, benchmark, long-horizon, tool-use, professional-tasks, evaluation
  • 2603.08013 · PIRA-Bench: A Transition from Reactive GUI Agents to GUI-based Proactive Intent Recommendation Agents [PDF] · cs.AI · score 84
    Why: Benchmark for proactive GUI agents from continuous screenshots; long-horizon noisy trajectories.
    Tags: agents, GUI, benchmark, proactive-assistants, evaluation
  • 2603.07990 · MJ1: Multimodal Judgment via Grounded Verification [PDF] · cs.LG · score 84
    Why: Grounded verification chain + counterfactual consistency RL improves multimodal judging with a small model.
    Tags: multimodal, judge-models, grounding, RL, evaluation, bias
  • 2603.07915 · Ares: Adaptive Reasoning Effort Selection for Efficient LLM Agents [PDF] · cs.AI · score 84
    Why: Per-step reasoning-effort routing for agents to cut cost without big accuracy loss; practical deployment.
    Tags: agents, inference-efficiency, reasoning-budget, routing, cost-control, deployment
  • 2603.08429 · One Model Is Enough: Native Retrieval Embeddings from LLM Agent Hidden States [PDF] · cs.CL, cs.AI, cs.IR · score 83
    Why: Native retrieval embeddings from LLM hidden states; simplifies agent RAG stack with small loss.
    Tags: RAG, retrieval, embeddings, agents, efficiency, representation-learning
  • 2603.08659 · CODA: Difficulty-Aware Compute Allocation for Adaptive Reasoning [PDF] · cs.CL · score 83
    Why: Formalizes adaptive reasoning as utility maximization; allocates tokens by difficulty to avoid overthinking.
    Tags: adaptive-reasoning, inference-time-compute, token-budget, efficiency, reasoning-models
  • 2603.08117 · UIS-Digger: Towards Comprehensive Research Agent Systems for Real-world Unindexed Information Seeking [PDF] · cs.AI, cs.IR · score 82
    Why: UIS-QA benchmark targets unindexed info seeking; shows big drop for SOTA agents.
    Tags: agents, information-seeking, benchmark, web, retrieval, robustness
  • 2603.08706 · Agentic Critical Training [PDF] · cs.AI, cs.CL, cs.LG · score 82
    Why: RL paradigm trains agents to judge better actions among alternatives vs imitating reflection text.
    Tags: agents, reinforcement-learning, critique, action-selection, training-paradigm, reasoning

AI Paper Insight Brief

2026-03-11

0) Executive takeaways (read this first)

  • Agent training is converging on “better exploration priors” rather than just better RL: synthetic plan-guided SFT (SynPlanResearch-R1) and RL-only with in-context demos (ICRL) both target the same bottleneck—on-policy RL getting stuck in shallow tool-use behaviors.
  • Adaptive compute is moving from “per-query” to “per-step / per-instance” control: ARES routes thinking level per agent step; CODA shapes RL rewards to reallocate tokens by difficulty—both cut cost without (much) accuracy loss, but require careful labeling/proxy design.
  • Evaluation is shifting toward entangled, long-horizon, and real-world constraints: CCR-Bench (constraints + workflows + industrial logs), OfficeQA Pro (enterprise PDFs + numeric exactness), $OneMillion-Bench (expert rubrics + economic value), BRIDGE (multimodal evidence chains), UIS-QA (unindexed web), FinToolBench (finance tool compliance) all expose large gaps that “standard QA” misses.
  • Safety threats are expanding beyond content to channels and resources: malicious finetuning via invisible Unicode steganography can bypass safety checks; SlowBA backdoors latency while preserving correctness; continuation-triggered jailbreaks reveal a mechanistic “continuation vs refusal” circuit tension.
  • Judge reliability is now a first-class alignment problem: MJ1 improves multimodal judging via grounded verification + flip-consistency reward; JudgeBiasBench quantifies 12 bias types and reduces them via GRPO/InfoNCE; choice-blindness shows preference data can be silently corrupted while standard metrics look fine.
  • Enterprise/privacy constraints are becoming architectural: SplitAgent proposes a privacy-agent / cloud-reasoner split with DP budgets and protocol primitives; full-duplex speech models leak speaker identity in hidden states, but streaming anonymization can push EER toward chance.

1) Key themes (clusters)

Theme: Tool-using research agents—fixing exploration and cold start

Theme: Adaptive reasoning & efficiency for long-horizon agents

  • Why it matters: Long-horizon agents pay compounding token costs; “always think hard” is economically non-viable, but thinking too little lets errors cascade.
  • Representative papers: Ares (2603.07915), CODA (2603.08659), OSExpert (2603.07978).
  • Common approach:
    • Per-step routing (ARES) using a lightweight router trained from maximal-effort trajectories + RL cost penalties.
    • Difficulty proxies from rollouts (CODA group success rate) to penalize verbosity on easy items and encourage deliberation on hard ones (see the reward-shaping sketch after this list).
    • Shift cost from test-time to environment learning (OSExpert GUI-DFS skill discovery + cached procedures + fast planner).
  • Open questions / failure modes:
    • Labeling/annotation cost and judge dependence (ARES uses multi-trial sampling + LLM judge + rationale teacher).
    • Proxy noise: difficulty estimates can be unstable with sparse rewards / small group sizes (CODA).
    • Coverage/scalability of environment exploration (OSExpert) and dependence on strong base models.
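
A minimal sketch of the difficulty-gated length shaping described above, in the spirit of CODA: difficulty comes from the group success rate over rollouts, and the length bonus is gated by correctness so the policy cannot farm reward through verbosity. The `budget`/`alpha` knobs and the linear interpolation are illustrative assumptions, not the paper's published formula.

```python
from statistics import mean

def difficulty(group_correct: list[bool]) -> float:
    """Difficulty proxy from rollouts: 1 - group success rate (CODA-style)."""
    return 1.0 - mean(1.0 if c else 0.0 for c in group_correct)

def shaped_reward(correct: bool, n_tokens: int, diff: float,
                  budget: int = 4096, alpha: float = 0.5) -> float:
    """Correctness reward plus a correctness-gated length bonus.

    Easy items (low diff) earn a brevity bonus; hard items (high diff) are
    length-neutral, leaving room to deliberate. budget/alpha are assumed knobs.
    """
    if not correct:
        return 0.0                               # gate: no bonus for wrong answers
    brevity = max(0.0, 1.0 - n_tokens / budget)  # in [0, 1]; higher = shorter
    return 1.0 + alpha * (1.0 - diff) * brevity  # scale the bonus by easiness
```

The sparse-reward caveat above applies here too: with small group sizes, `difficulty` is a noisy estimate, so the shaping term can swing between updates.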

Theme: Next-gen benchmarks for “real” instruction following and grounded work

Theme: Grounding & hallucination in long-context / multimodal documents

Theme: Alignment evaluation in interactive settings & judge robustness

Theme: Security & privacy—stealth channels, backdoors, and architectural mitigations

2) Technical synthesis

  • GRPO is the workhorse across agent training, judges, and efficiency shaping (SynPlanResearch-R1, ARES, ICRL, MJ1, Judge debiasing, CODA, ACT), typically with format rewards and loss masking for tool outputs (a minimal masking sketch follows this list).
  • Two competing “cold start” strategies for tool use are emerging:
    • Better SFT priors via synthetic trajectories that explicitly diversify tool plans (SynPlanResearch-R1).
    • No SFT by injecting few-shot demonstrations into RL rollouts and phasing them out (ICRL).
  • Exploration vs compliance is a recurring tradeoff: deeper tool use improves accuracy (SynPlanResearch-R1), but in finance, aggressive tool invocation can reduce compliance/argument correctness (FinToolBench shows high TIR but low CER for some models).
  • Difficulty/effort estimation is being internalized:
    • ARES learns per-step minimal effort labels via multi-trial equivalence checks.
    • CODA uses group success rate as a difficulty proxy to shape length rewards (and gates bonuses by correctness to avoid length hacking).
  • Retrieval is increasingly the bottleneck in long-doc/multimodal settings: BRIDGE shows page-level retrieval can harm multi-hop grounding; OfficeQA Pro shows parsing + retrieval + temporal revision handling dominate.
  • Long context amplifies fabrication: the 172B-token RIKER study finds fabrication rises steeply with context length; temperature changes can reduce fabrication and coherence loss in many cases.
  • Alignment evaluation is moving to trajectories: ConflictBench finds failures occur after multiple turns (avg failure turn 5.28) and includes regret tests; single-turn ASR overestimates alignment.
  • Judge robustness is being treated as an optimization target (MJ1’s flip-consistency reward; JudgeBiasBench’s BSR + debiasing training), but choice-blindness warns that feedback channels can be corrupted without obvious metric alarms.
  • Security threats are diversifying beyond “harmful text”: invisible Unicode steganography bypasses safety checks; latency backdoors target usability; mechanistic jailbreak analysis suggests prompt-structure exploits continuation circuits.
  • Enterprise privacy is becoming system-level: SplitAgent combines local sanitization + DP budgets + protocol primitives; speech dialogue models show identity leakage in hidden states, mitigated by streaming anonymization.
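
As a concrete anchor for the GRPO + loss-masking pattern above, a minimal sketch with assumed tensor shapes; it omits the PPO-style ratio clipping and KL penalty that full GRPO adds.

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantages: z-score rewards within each rollout group.

    rewards: (n_groups, group_size) scalar rewards per sampled completion.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + 1e-6)

def masked_pg_loss(logprobs: torch.Tensor, advantages: torch.Tensor,
                   gen_mask: torch.Tensor) -> torch.Tensor:
    """Policy-gradient surrogate with environment tokens masked out.

    logprobs:   (batch, seq) token log-probs of the sampled completions.
    advantages: (batch,) one advantage per completion, broadcast over tokens.
    gen_mask:   (batch, seq) 1 for model-generated tokens, 0 for tool outputs
                and observations injected by the environment, so the policy is
                never trained on text it did not produce.
    """
    per_token = -(logprobs * advantages.unsqueeze(1)) * gen_mask
    return per_token.sum() / gen_mask.sum().clamp(min=1)
```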

3) Top 5 papers (with “why now”)

1) Invisible Safety Threat: Malicious Finetuning for LLM via Steganography

  • Shows a training-time attack where models appear safe in plaintext but emit hidden harmful content via zero-width Unicode (the channel class is sketched below).
  • Demonstrated on GPT-4.1 finetuning API and multiple open models; unsafe rate goes from 0% pre-decode to >90% post-decode in their setup.
  • Tests mitigations like filtering zero-width characters and frequency penalties.
  • Skepticism / limitation: stegotext increases token length and is less effective on smaller models; success not universal.
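
To make the channel concrete, here is a generic zero-width encoding sketch; it illustrates the attack class only, not the paper's actual scheme, trigger, or training setup.

```python
ZW0, ZW1 = "\u200b", "\u200c"   # zero-width space / non-joiner as bit carriers

def zw_encode(cover: str, secret: str) -> str:
    """Append a hidden payload as invisible zero-width characters."""
    bits = "".join(f"{b:08b}" for b in secret.encode("utf-8"))
    return cover + "".join(ZW1 if bit == "1" else ZW0 for bit in bits)

def zw_decode(text: str) -> str:
    """Recover the payload by reading only the zero-width characters."""
    bits = "".join("1" if ch == ZW1 else "0" for ch in text if ch in (ZW0, ZW1))
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return data.decode("utf-8", errors="ignore")

stego = zw_encode("This answer looks harmless.", "hidden")
print(len(stego))                      # far longer than the visible text suggests
assert zw_decode(stego) == "hidden"    # payload survives copy/paste
```

The length inflation visible in `len(stego)` is exactly the token-overhead limitation noted above.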

2) How Much Do LLMs Hallucinate in Document Q&A Scenarios? A 172-Billion-Token Study…

  • Massive, deterministic, contamination-resistant measurement: 172B tokens, 35 open models, up to 200K context.
  • Key deployment insight: fabrication is non-zero even best-case (best 1.19% at 32K) and no model <10% at 200K.
  • Temperature effects are nontrivial: higher T often reduces fabrication and coherence loss.
  • Skepticism / limitation: English-only, open-weight only, single framework (RIKER).

3) OfficeQA Pro: An Enterprise Benchmark for End-to-End Grounded Reasoning

  • Enterprise-realistic: ~89k pages of Treasury Bulletins; 133 hard questions with strict numeric scoring (a scorer sketch follows below).
  • Shows end-to-end performance is still low without strong parsing/retrieval; parser choice (ai_parse_document) yields consistent gains.
  • Provides a rich ablation map across parsers, retrieval, table formats, and test-time scaling.
  • Skepticism / limitation: single-domain corpus; full-corpus runs are costly/slow (~23.6 min per question reported).
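
A minimal sketch of the kind of strict numeric scorer such a benchmark implies; the regex, last-number heuristic, and tolerance are assumptions, not OfficeQA Pro's actual grader.

```python
import re

_NUM = re.compile(r"-?\d[\d,]*(?:\.\d+)?")

def extract_number(text: str) -> float | None:
    """Pull the last numeric literal from an answer, ignoring $ and , noise."""
    matches = _NUM.findall(text.replace("$", ""))
    return float(matches[-1].replace(",", "")) if matches else None

def numeric_exact(pred: str, gold: float, rel_tol: float = 1e-4) -> bool:
    """Credit an answer only if its parsed value matches gold (near-)exactly."""
    value = extract_number(pred)
    return value is not None and abs(value - gold) <= rel_tol * max(1.0, abs(gold))

print(numeric_exact("Total outlays were $1,234.5 million", 1234.5))  # True
print(numeric_exact("Roughly $1,200 million", 1234.5))               # False
```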

4) DARC: Disagreement-Aware Alignment via Risk-Constrained Decoding

  • Practical inference-only method to reduce tail risk / disagreement without retraining, grounded in entropic (KL-robust) objectives and lower confidence bounds (LCBs); a selection sketch follows below.
  • Human eval on MT-Bench: improves mean and reduces risk, especially on high-disagreement prompts.
  • Multi-scorer aggregation addresses scorer shift; small latency overhead for augmentation.
  • Skepticism / limitation: depends on scorer/proxy quality and a finite candidate pool.
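
A generic sketch of risk-sensitive selection over a finite candidate pool, in the spirit of (but much simpler than) DARC's formulation: aggregate several possibly disagreeing scorers and rank by a lower confidence bound, so high-variance responses are penalized.

```python
import statistics
from typing import Callable

def lcb_select(candidates: list[str],
               scorers: list[Callable[[str], float]],
               kappa: float = 1.0) -> str:
    """Pick the candidate with the best mean - kappa * std aggregate score.

    Each scorer is a proxy reward model; disagreement among scorers inflates
    the std term, so contentious responses are down-ranked, and kappa trades
    mean quality against tail risk. This LCB heuristic stands in for DARC's
    entropic/KL-robust objective, which is not reproduced here.
    """
    def lcb(text: str) -> float:
        scores = [score(text) for score in scorers]
        return statistics.mean(scores) - kappa * statistics.pstdev(scores)
    return max(candidates, key=lcb)
```

As the limitation above notes, this inherits scorer quality and candidate-pool coverage as hard dependencies.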

5) SlowBA: An efficiency backdoor attack towards VLM-based GUI agents

  • Introduces a new backdoor target: latency/verbosity rather than wrong actions; preserves clean accuracy while triggered inputs inflate response length/latency/energy.
  • Two-stage SFT + RL reward shaping makes the backdoor trigger-dependent; includes a real-world ticket-buying latency increase.
  • Highlights that monitoring correctness alone misses resource attacks (a monitoring sketch follows below).
  • Skepticism / limitation: assumes attacker can finetune and poison training; scaling effects vary (7B still vulnerable but different magnitudes).
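
A minimal monitoring sketch for the defense this implies: track latency and output length against a rolling baseline, independently of correctness. The window size, warm-up count, and 4-sigma threshold are illustrative defaults, not values from the paper.

```python
from collections import deque
from statistics import mean, pstdev

class ResourceAnomalyMonitor:
    """Flag requests whose latency or output length far exceeds recent norms."""

    def __init__(self, window: int = 500, sigmas: float = 4.0):
        self.latencies: deque[float] = deque(maxlen=window)
        self.lengths: deque[int] = deque(maxlen=window)
        self.sigmas = sigmas

    def _outlier(self, history: deque, value: float) -> bool:
        if len(history) < 30:                 # not enough baseline yet
            return False
        mu, sd = mean(history), pstdev(history)
        return sd > 0 and (value - mu) / sd > self.sigmas

    def observe(self, latency_s: float, n_tokens: int) -> bool:
        """Return True if this request looks like a resource anomaly."""
        flagged = (self._outlier(self.latencies, latency_s)
                   or self._outlier(self.lengths, n_tokens))
        self.latencies.append(latency_s)
        self.lengths.append(n_tokens)
        return flagged
```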

4) Practical next steps

  • For tool-using agents, test two cold-start regimes head-to-head: (a) synthetic plan-guided SFT (plan sampling + cue injection) vs (b) RL-only with in-context demos + curriculum; measure tool diversity, entropy, and final accuracy.
  • Add compliance metrics to your tool benchmarks (FinToolBench-style): timeliness, intent restraint, domain alignment—then track how retrieval/tool-card metadata changes mismatch rates.
  • If deploying long-context doc QA, measure fabrication vs context length explicitly (RIKER-style probes if possible); don’t assume longer context is safer.
  • For multimodal/GUI agents, add efficiency anomaly detection (latency/length/energy) as a first-class safety signal to catch SlowBA-like backdoors.
  • Harden finetuning pipelines against invisible-character channels: normalize/strip zero-width Unicode at ingestion and at inference boundaries; log token-level anomalies (a minimal sanitizer sketch follows this list).
  • For alignment evaluation, incorporate multi-turn interactive tests (ConflictBench-like) and track when failures occur (e.g., average failure turn), not just whether.
  • If you rely on LLM judges, run bias sensitivity and counterfactual position/verbosity tests (JudgeBiasBench), and consider grounded verification prompting (MJ1-style) for multimodal judging.
  • In enterprise settings, prototype a local privacy agent + cloud reasoner split (SplitAgent) and quantify the privacy/utility/latency tradeoff under your threat model.
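
A minimal sanitizer sketch for the zero-width hardening item above. Dropping the whole Cf (format) category is deliberately aggressive and will break legitimate uses such as emoji ZWJ sequences; whitelist what your traffic needs.

```python
import unicodedata

# Common invisible code points used as covert carriers (illustrative, not exhaustive).
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff", "\u180e"}

def sanitize(text: str) -> tuple[str, int]:
    """NFKC-normalize, drop zero-width/format characters, and report the count
    of removals so ingestion can log anomalies instead of failing silently."""
    normalized = unicodedata.normalize("NFKC", text)
    kept, dropped = [], 0
    for ch in normalized:
        if ch in ZERO_WIDTH or unicodedata.category(ch) == "Cf":
            dropped += 1
        else:
            kept.append(ch)
    return "".join(kept), dropped

clean, n_removed = sanitize("looks safe\u200b\u200c\u200b")
print(repr(clean), n_removed)   # 'looks safe' 3
```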

Generated from per-paper analyses; no external browsing.