Daily AI Paper Report (2026-03-07)
Published:
Chinese version: [中文]
Run stats
- Candidates: 257
- Selected: 30
- Deepread completed: 30
- Window (UTC): 2026-03-05T01:00:00Z → 2026-03-06T01:00:00Z (arxiv_announce, expanded=0)
Selected papers
| arXiv ID | Title / Links | Categories | Score | Why | Tags |
|---|---|---|---|---|---|
| 2603.04915 | EVMbench: Evaluating AI Agents on Smart Contract Security | cs.LG, cs.AI, cs.CR | 95 | Agent eval for detecting/patching/exploiting smart-contract vulns in realistic EVM setting | agent-evaluation, cybersecurity, smart-contracts, red-teaming, benchmark, tool-use |
| 2603.04904 | Alignment Backfire: Language-Dependent Reversal of Safety Interventions Across 16 Languages in LLM Multi-Agent Systems | cs.AI, cs.CL | 95 | Preregistered 16-language evidence of alignment backfire in multi-agent LLM systems | agent-safety, multi-agent, multilingual, alignment, robustness, evaluation |
| 2603.05028 | Survive at All Costs: Exploring LLM's Risky Behaviors under Survival Pressure | cs.AI, cs.CL | 95 | Benchmark + case study on shutdown/survival pressure causing risky agent behavior | agent-safety, shutdown, deception, benchmark, risk-seeking, evaluation |
| 2603.04902 | AgentSCOPE: Evaluating Contextual Privacy Across Agentic Workflows | cs.CR, cs.AI | 94 | Benchmark + CI-based framework to trace privacy leaks across multi-tool agent workflows | agents, privacy, benchmark, contextual-integrity, tool-use, evaluation |
| 2603.04851 | Why Is RLHF Alignment Shallow? A Gradient Analysis | cs.LG, cs.CL | 93 | Theory: RLHF gradients vanish past harm horizon, explaining shallow safety alignment limits | alignment, RLHF, theory, gradients, safety-training, mechanistic |
| 2603.04837 | Design Behaviour Codes (DBCs): A Taxonomy-Driven Layered Governance Benchmark for Large Language Models | cs.AI | 92 | Layered, auditable system-prompt governance benchmark across broad risk taxonomy + red-teaming | governance, system-prompts, red-teaming, safety-eval, controls, risk-taxonomy |
| 2603.05399 | Judge Reliability Harness: Stress Testing the Reliability of LLM Judges | cs.AI | 91 | Open-source harness to stress-test LLM judges; key for reliable safety/agent evaluations | evaluation, llm-judges, reliability, robustness, tooling, benchmarks |
| 2603.04857 | FireBench: Evaluating Instruction Following in Enterprise and API-Driven LLM Applications | cs.CL, cs.SE | 91 | Enterprise/API instruction-following benchmark; format/constraint adherence for real apps | instruction-following, benchmark, reliability, agents, evaluation, enterprise |
| 2603.04751 | Evaluating the Search Agent in a Parallel World | cs.AI | 90 | Addresses hard, non-stationary evaluation of web search agents (obsolescence, attribution) | agents, evaluation, web-search, benchmarks, attribution, nonstationarity |
| 2603.05293 | Knowledge Divergence and the Value of Debate for Scalable Oversight | cs.LG, cs.CL | 90 | Formalizes when debate beats RLAIF via representation-subspace knowledge divergence | scalable-oversight, debate, RLAIF, theory, representations, alignment |
| 2603.04992 | ThaiSafetyBench: Assessing Language Model Safety in Thai Cultural Contexts | cs.CL | 89 | Thai safety benchmark with culturally grounded malicious prompts; evaluates 24 LLMs | safety, multilingual, benchmark, jailbreaks, thai, misuse |
| 2603.04738 | IF-RewardBench: Benchmarking Judge Models for Instruction-Following Evaluation | cs.CL | 89 | Meta-eval for instruction-following judges using preference graphs beyond pairwise setups | LLM-judge, reward-models, instruction-following, benchmark, preference-graphs, eval |
| 2603.05031 | AegisUI: Behavioral Anomaly Detection for Structured User Interface Protocols in AI Agent Systems | cs.AI | 88 | Targets UI payload behavioral mismatch attacks in agent systems; dataset + anomaly benchmarks | agent-security, ui-attacks, prompt-injection, anomaly-detection, benchmark |
| 2603.05044 | WebFactory: Automated Compression of Foundational Language Intelligence into Grounded Web Agents | cs.AI | 88 | Automated closed-loop RL pipeline to train grounded web agents without unsafe live web data | web-agents, reinforcement-learning, environment-synthesis, grounding, automation, agent-training |
| 2603.05035 | Good-Enough LLM Obfuscation (GELO) | cs.CR, cs.LG | 88 | Lightweight prompt-privacy vs KV-cache/hidden-state leakage on shared accelerators | privacy, inference-security, TEEs, KV-cache, deployment, systems |
| 2603.05295 | WebChain: A Large-Scale Human-Annotated Dataset of Real-World Web Interaction Traces | cs.AI, cs.CV | 87 | Large human web-interaction trace dataset enabling scalable web agents + reproducible eval | web-agents, dataset, trajectories, multimodal, grounding, training |
| 2603.04861 | Causally Robust Reward Learning from Reason-Augmented Preference Feedback | cs.AI, cs.LG, cs.RO | 87 | Uses rationale-augmented preferences to reduce causal confusion in reward learning | alignment, reward-learning, preferences, causal-robustness, rationales, RLHF |
| 2603.05485 | Towards Provably Unbiased LLM Judges via Bias-Bounded Evaluation | cs.AI | 87 | Proposes formal bias-bounded framework aiming for provably less biased LLM-judge rewards | LLM-judge, bias, formal-guarantees, reward, alignment, evaluation |
| 2603.04949 | TimeWarp: Evaluating Web Agents by Revisiting the Past | cs.AI, cs.CL, cs.CV, cs.LG | 86 | Evaluates web agents under UI drift across eras; highlights brittleness + proposes fix | web-agents, robustness, benchmark, distribution-shift, generalization |
| 2603.04828 | From Unfamiliar to Familiar: Detecting Pre-training Data via Gradient Deviations in Large Language Models | cs.CL | 86 | Detects pretraining data via gradient deviations; useful for contamination/copyright audits | data-contamination, membership-inference, pretraining, auditing, gradients |
| 2603.04968 | When Weak LLMs Speak with Confidence, Preference Alignment Gets Stronger | cs.CL, cs.AI | 86 | Uses weak-LLM confidence to weight preferences; reduces human labels while improving alignment | alignment, preference-optimization, DPO, weak-supervision, confidence, RLHF |
| 2603.05218 | KARL: Knowledge Agents via Reinforcement Learning | cs.AI, cs.LG | 84 | RL-trained enterprise search agents + new eval suite; relevant to agentic RAG reliability | agents, search, rl, enterprise, rag, benchmark |
| 2603.04900 | EvoTool: Self-Evolving Tool-Use Policy Optimization in LLM Agents via Blame-Aware Mutation and Diversity-Aware Selection | cs.AI | 84 | Evolutionary, blame-aware optimization of modular tool-use policies for long-horizon agents | agents, tool-use, policy-optimization, credit-assignment, evolutionary-methods, modular-agents |
| 2603.04974 | VRM: Teaching Reward Models to Understand Authentic Human Preferences | cs.CL | 84 | Variational reward modeling to better capture authentic preferences and reduce reward hacking | reward-modeling, alignment, preferences, RLHF, robustness |
| 2603.05308 | Med-V1: Small Language Models for Zero-shot and Scalable Biomedical Evidence Attribution | cs.CL, cs.AI | 83 | 3B biomedical evidence attribution models for scalable claim verification/hallucination checks | factuality, verification, biomedicine, small-language-models, hallucinations, synthetic-data |
| 2603.04737 | Interactive Benchmarks | cs.AI, cs.CL, cs.LG | 83 | Interactive evaluation paradigm (proofs/games) to test active info acquisition under budgets | evaluation, interactive-benchmarks, reasoning, agents, games, robustness |
| 2603.05290 | X-RAY: Mapping LLM Reasoning Capability via Formalized and Calibrated Probes | cs.AI | 82 | Formalized calibrated probes to map reasoning structure; useful for capability auditing | reasoning, evaluation, formal-methods, calibration, capability-mapping |
| 2603.04918 | BandPO: Bridging Trust Regions and Ratio Clipping via Probability-Aware Bounds for LLM Reinforcement Learning | cs.LG, cs.AI | 82 | Probability-aware PPO clipping to avoid entropy collapse and preserve tail strategies in RL | RL, PPO, LLM-RL, optimization, stability, exploration |
| 2603.05294 | STRUCTUREDAGENT: Planning with AND/OR Trees for Long-Horizon Web Tasks | cs.AI | 82 | AND/OR-tree planning + structured memory for long-horizon web tasks; agent capability jump | agents, planning, web-agents, long-horizon, structured-memory, search |
| 2603.04859 | Osmosis Distillation: Model Hijacking with the Fewest Samples | cs.CR, cs.LG | 81 | Shows model hijacking via few poisoned distilled samples; important ML supply-chain risk | security, data-poisoning, model-hijacking, dataset-distillation, transfer-learning |
AI Paper Insight Brief
2026-03-07
0) Executive takeaways (read this first)
- Evaluation is shifting from static scores to process-aware, interaction-first measurement: multiple new benchmarks explicitly grade how agents gather information, plan, and interact (interactive proofs/games; parallel-world search; multi-version web UIs), not just final answers.
- LLM judges are now a first-class reliability target: two complementary directions emerge—better judge benchmarks (IF-RewardBench) and judge stress-testing / provable debiasing (JRH; bias-bounded evaluation with calibrated noise).
- Agent safety risk is increasingly “in the pipeline,” not at the output: AgentSCOPE finds intermediate-stage privacy violations are pervasive (PVR ≈ 82–94%) even when output leak rates look moderate (≈24–40%).
- Prompt/prefix alignment can backfire in multilingual multi-agent settings: increasing alignment strength can increase internal dissociation across 15/16 languages and can reverse safety effects in some language/model combinations (Japanese backfire observed for Llama 3.3 70B).
- Optimization and training recipes are targeting known RLHF/PO failure modes: theory explains why RLHF is “shallow” (zero gradient beyond a harm horizon), while BandPO proposes probability-aware clipping to prevent tail-token suppression and entropy collapse.
- Security threats are expanding beyond prompts to the ML supply chain and infrastructure: distilled-dataset hijacking (OD), pretraining membership detection via gradients (GDS), smart-contract exploit agents (EVMbench), and GPU-memory prompt leakage mitigations (GELO).
2) Key themes (clusters)
Theme: Interactive, process-aware evaluation for agents
- Why it matters: Static benchmarks saturate and hide key competencies like active information acquisition, decomposition, and long-horizon strategy—capabilities central to real deployments.
- Representative papers:
- Interactive Benchmarks
- Evaluating the Search Agent in a Parallel World
- TimeWarp: Evaluating Web Agents by Revisiting the Past
- Common approach:
- Replace one-shot QA with multi-turn, budgeted interaction (queries/actions under constraints); a sketch of this loop follows this theme's bullets.
- Add stage-wise diagnostics (e.g., fact coverage / hit rate; planning-tree states; turn budgets).
- Use controlled environments to reduce drift/irreproducibility (parallel-world SERPs; containerized multi-version sites).
- Open questions / failure modes:
- Sensitivity to evaluator/judge choice (e.g., fixed judges in interactive proofs).
- Whether controlled environments transfer to live-web idiosyncrasies and real search engine behavior.
- Dataset breadth: several interactive suites are still relatively small in instance count for some tasks.
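To make the budgeted-interaction pattern above concrete, here is a minimal sketch of a turn-budgeted episode with stage-wise fact coverage. The `agent_step`, `run_query`, and `gold_facts` interfaces are hypothetical placeholders for illustration, not any of these benchmarks' actual APIs.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class EpisodeLog:
    """Stage-wise diagnostics for one budgeted episode."""
    turns_used: int = 0
    facts_covered: set = field(default_factory=set)
    final_answer: Optional[str] = None

def run_budgeted_episode(agent_step, run_query, gold_facts, max_turns=8):
    """Run a turn-budgeted interaction and record per-turn fact coverage.

    agent_step(history) -> ("query", text) or ("answer", text)   # hypothetical agent API
    run_query(text)     -> observation string                     # hypothetical environment API
    gold_facts          -> {fact_id: keyword} used for a crude coverage check
    """
    history, log = [], EpisodeLog()
    for _ in range(max_turns):
        action, text = agent_step(history)
        log.turns_used += 1
        if action == "answer":
            log.final_answer = text
            break
        observation = run_query(text)
        history.append((text, observation))
        # Stage-wise diagnostic: which gold facts has the agent actually retrieved so far?
        for fact_id, keyword in gold_facts.items():
            if keyword.lower() in observation.lower():
                log.facts_covered.add(fact_id)
    coverage = len(log.facts_covered) / max(len(gold_facts), 1)
    return log, coverage  # grade the process (coverage, turns used) alongside the final answer
```

The design point is that coverage and turns-used are first-class outputs of the harness, so failures can be attributed to information acquisition versus synthesis rather than only to the final answer.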
Theme: Judge models—benchmarking, stress-testing, and certifying bias
- Why it matters: LLM-as-judge is now infrastructure for alignment and benchmarking; brittle or biased judges can mis-rank models and mis-train reward signals.
- Representative papers:
- Judge Reliability Harness: Stress Testing the Reliability of LLM Judges
- IF-RewardBench: Benchmarking Judge Models for Instruction-Following Evaluation
- Towards Provably Unbiased LLM Judges via Bias-Bounded Evaluation
- Common approach:
- Move beyond pairwise/BoN to listwise ranking with preference graphs (Pareto-dominance + human verification).
- Generate targeted perturbations (format/paraphrase/verbosity/stochasticity; agentic transcript edits) to measure robustness; a sketch follows this theme's bullets.
- Provide formal(ish) guarantees by estimating sensitivity and injecting calibrated noise to bound bias impact.
- Open questions / failure modes:
- Coverage: guarantees are local to chosen perturbation generators; unmeasured biases remain.
- Judge brittleness to formatting is repeatedly highlighted; canonicalization defenses are still immature.
- Cost/scale: stress tests in JRH used small subsets due to review cost; scaling remains open.
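A minimal sketch of the perturbation-style stress test flagged above: score the same response under semantics-preserving format edits and repeated sampling, then report the spread. `judge_score` is a hypothetical callable wrapping whatever judge configuration is being audited; the perturbations are illustrative, not JRH's actual suite.

```python
import statistics

def format_perturbations(response: str):
    """Semantics-preserving rewrites; a robust judge should score these alike."""
    yield response                                                    # original
    yield response.replace("\n", " ")                                 # flatten line breaks
    yield "## Answer\n" + response                                    # add a markdown header
    yield "\n".join(f"- {line}" for line in response.splitlines() if line)  # bulletize
    yield response + "\n\nThank you for reviewing."                   # polite padding

def judge_format_sensitivity(judge_score, prompt: str, response: str, repeats: int = 3):
    """Return per-variant mean scores and the overall spread (max - min).

    judge_score(prompt, response) -> float is a hypothetical wrapper around the
    exact judge configuration (model + rubric + judging prompt) under audit.
    """
    means = []
    for variant in format_perturbations(response):
        samples = [judge_score(prompt, variant) for _ in range(repeats)]  # stochastic stability
        means.append(statistics.mean(samples))
    return means, max(means) - min(means)
```

A nonzero spread across content-identical variants is a direct, cheap measure of the formatting brittleness these papers repeatedly flag.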
Theme: Privacy & security in agentic and deployment pipelines
- Why it matters: As agents touch personal data, tools, and execution environments, risk shifts to intermediate flows, supply chains, and infrastructure side channels.
- Representative papers:
- AgentSCOPE: Evaluating Contextual Privacy Across Agentic Workflows
- EVMbench: Evaluating AI Agents on Smart Contract Security
- Good-Enough LLM Obfuscation (GELO)
- Osmosis Distillation: Model Hijacking with the Fewest Samples
- Common approach:
- Evaluate end-to-end workflows with programmatic grading (on-chain state deltas; per-edge privacy checks); a per-edge checking sketch follows this theme's bullets.
- Model threats at new boundaries: tool queries/responses, distilled dataset reuse, accelerator memory reads.
- Provide hardened harnesses (e.g., RPC proxy to prevent simulator-only cheating in EVMbench).
- Open questions / failure modes:
- AgentSCOPE currently centers on a single persona; broader scenario diversity is needed.
- GELO is obfuscation (no cryptographic proof) and excludes many side channels.
- Distilled-dataset hijacking defenses are not yet mature; OD evades STRIP and resists DPSGD unless utility collapses.
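A minimal sketch of the per-edge privacy checking flagged above: log every boundary crossing in the workflow and evaluate it against a contextual policy, so intermediate leaks are caught even when the final output looks clean. The edge names and allowlist policy are illustrative assumptions, not AgentSCOPE's actual Privacy Flow Graph schema.

```python
from dataclasses import dataclass

@dataclass
class FlowEvent:
    edge: str    # "user->agent", "agent->tool", "tool->agent", or "agent->recipient"
    fields: set  # data fields crossing this boundary

# Illustrative contextual policy: which fields each edge is allowed to carry.
ALLOWED = {
    "user->agent":      {"name", "email", "ssn", "goal"},
    "agent->tool":      {"goal"},                 # tools should not see raw identifiers
    "tool->agent":      {"goal", "result"},
    "agent->recipient": {"name", "result"},
}

def score_trace(trace):
    """Return per-edge violations plus pipeline-level and output-level flags."""
    violations = []
    for event in trace:
        extra = event.fields - ALLOWED.get(event.edge, set())
        if extra:
            violations.append((event.edge, extra))
    pipeline_violation = bool(violations)                                    # PVR-style signal
    output_leak = any(edge == "agent->recipient" for edge, _ in violations)  # LR-style signal
    return violations, pipeline_violation, output_leak

# Toy trace: the SSN reaches a tool but never the final recipient.
trace = [
    FlowEvent("user->agent", {"name", "ssn", "goal"}),
    FlowEvent("agent->tool", {"goal", "ssn"}),
    FlowEvent("agent->recipient", {"name", "result"}),
]
print(score_trace(trace))  # violation on agent->tool; no output-level leak
```

In the toy trace, the SSN reaches a tool but never the final recipient, so an output-only check reports no leak while the per-edge check does; this is the PVR-vs-LR gap in miniature.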
Theme: Alignment objectives under stress—depth, multilinguality, and self-preservation
- Why it matters: Prompt-level alignment and standard RLHF objectives can produce shallow or even counterproductive safety behavior, especially in multi-agent and multilingual contexts.
- Representative papers:
- Why Is RLHF Alignment Shallow? A Gradient Analysis
- Alignment Backfire: Language-Dependent Reversal of Safety Interventions Across 16 Languages in LLM Multi-Agent Systems
- Survive at All Costs: Exploring LLM's Risky Behaviors under Survival Pressure
- Common approach:
- Theoretical decomposition of sequence-level objectives into per-token gradient signal (harm horizon → zero gradient later); a reconstruction of this step follows this theme's bullets.
- Multi-agent simulations varying alignment ratio and measuring group-level indices (CPI/DI).
- Elicit superficial vs inner thoughts and correlate risky behavior with a learned/persona direction; mitigate via activation steering.
- Open questions / failure modes:
- “Inner thought” elicitation is not a definitive window into latent cognition.
- Multilingual effects may be confounded by English-only alignment prefixes and translation artifacts.
- Theoretical results depend on assumptions (fixed harm function, prompt conditioning, equilibrium vs training dynamics).
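A minimal reconstruction of the "zero gradient past the harm horizon" step flagged above, under simplifying assumptions (a plain policy-gradient objective, reward fully determined by the first $h$ tokens, on-policy sampling); the paper's actual RLHF objective and assumptions may be richer.

$$
\nabla_\theta J \;=\; \mathbb{E}_{y\sim\pi_\theta}\Big[\,R(y)\sum_{t}\nabla_\theta \log \pi_\theta(y_t \mid y_{<t})\Big],
\qquad R(y) = R(y_{\le h}).
$$

For any position $t > h$, the prefix $y_{<t}$ already determines $R$, and the score function has zero conditional mean, so that term vanishes:

$$
\mathbb{E}\big[R(y_{\le h})\,\nabla_\theta \log \pi_\theta(y_t \mid y_{<t})\big]
= \mathbb{E}_{y_{<t}}\Big[R(y_{\le h})\,\mathbb{E}_{y_t \sim \pi_\theta(\cdot\mid y_{<t})}\big[\nabla_\theta \log \pi_\theta(y_t \mid y_{<t})\big]\Big] = 0 .
$$

Under these assumptions, tokens after position $h$ receive no learning signal from the sequence-level reward, which is the shallowness mechanism summarized above.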
Theme: Better post-training signals and optimizers (preference learning, reward modeling, RL stability)
- Why it matters: Alignment and agent training are bottlenecked by noisy preferences, spurious correlations, and unstable RL updates—especially in long-tail token spaces.
- Representative papers:
- BandPO: Bridging Trust Regions and Ratio Clipping via Probability-Aware Bounds for LLM Reinforcement Learning
- When Weak LLMs Speak with Confidence, Preference Alignment Gets Stronger
- VRM: Teaching Reward Models to Understand Authentic Human Preferences
- Causally Robust Reward Learning from Reason-Augmented Preference Feedback
- Common approach:
- Make updates tail-aware (probability-aware clipping bounds) to avoid suppressing rare-but-good actions.
- Use confidence weighting from weak annotators to reduce label noise and annotation cost; a sketch follows this theme's bullets.
- Add structure/latents to reward models (objective weights + semantic features) to reduce spurious cues.
- Inject causal signal via rationales and geometric decomposition to reduce confounder reliance.
- Open questions / failure modes:
- BandPO adds numerical overhead (root-finding for KL bounds) and is tested mainly on math reasoning.
- Weak-annotator methods degrade in online/iterative settings due to distribution shift.
- VRM evaluations rely heavily on LLM-judged datasets/benchmarks (potential circularity).
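A minimal sketch of the confidence-weighting idea flagged above: a DPO-style pairwise loss where each preference pair is scaled by the weak annotator's confidence. The weighting scheme (normalizing by the batch mean) is an illustrative assumption, not the paper's published formulation.

```python
import torch
import torch.nn.functional as F

def confidence_weighted_dpo_loss(
    logp_chosen, logp_rejected,          # policy log-probs of chosen / rejected responses
    ref_logp_chosen, ref_logp_rejected,  # frozen reference-model log-probs
    confidence,                          # weak-annotator confidence in [0, 1], one per pair
    beta: float = 0.1,
):
    """DPO-style pairwise loss with per-example confidence weights (illustrative only)."""
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    per_pair = -F.logsigmoid(margin)          # standard DPO per-pair loss
    weights = confidence / confidence.mean()  # down-weight low-confidence weak labels
    return (weights * per_pair).mean()

# Toy usage with random log-probs for four preference pairs.
n = 4
loss = confidence_weighted_dpo_loss(
    torch.randn(n), torch.randn(n), torch.randn(n), torch.randn(n),
    confidence=torch.tensor([0.9, 0.6, 0.95, 0.5]),
)
```

The same weighting pattern can be applied to other pairwise preference losses; the essential idea is simply that low-confidence weak labels contribute less gradient.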
3) Technical synthesis
- Multiple papers converge on a single meta-point: “final-answer accuracy” is an insufficient statistic; new suites measure interaction policies (queries, stopping, coverage), workflow edges (privacy flows), and robustness axes (UI versions, formatting perturbations).
- Budgeting is becoming the common currency: Interactive Benchmarks uses turn/token budgets; MPW (the parallel-world search benchmark) penalizes compound queries and rewards atomic coverage; BandPO reframes PPO clipping as a trust-region budget allocated per action probability.
- Attribution is moving earlier in the pipeline: MPW’s Fact Coverage Rate and Hit Rate, AgentSCOPE’s Violation Origin Rate, and EvoTool’s blame attribution all aim to localize failure causes rather than treating episodes as monoliths.
- Judge reliability is being treated like model reliability: IF-RewardBench (listwise preference graphs), JRH (perturbation suites), and A-BB, the bias-bounded evaluation framework (sensitivity estimation + calibrated noise), form a stack: measure → stress → certify.
- Alignment depth and “where gradients go” are now explicit concerns: the RLHF gradient analysis explains why late-token behavior may remain unaligned; BandPO addresses a parallel phenomenon in RL updates where tail tokens get clipped away.
- Controlled counterfactual environments are a recurring design pattern: MPW’s parallel world and TimeWarp’s multi-version sites both create reproducible distribution shifts that are hard to get from the live web.
- Security evaluation is increasingly programmatic and end-to-end: EVMbench grades exploits by on-chain state changes; AegisUI grades protocol payload anomalies; GELO measures recoverability under ICA-style attacks (a balance-delta grading sketch follows this list).
- Training recipes increasingly mix synthetic generation + filtering + RL: WebFactory uses LLM executor + deterministic replay filtering + RL; KARL uses agentic synthesis + off-policy RL; Med-V1 uses large synthetic verification corpora + SFT+GRPO.
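To illustrate the "grade by on-chain state" style flagged above, a minimal sketch of checking an exploit attempt by balance deltas against a local fork via web3.py. The RPC endpoint, addresses, and the `run_agent` callable are assumptions about a generic local-chain harness, not EVMbench's actual grader (which additionally uses an anti-cheat RPC proxy).

```python
from web3 import Web3

def grade_exploit(rpc_url: str, attacker: str, victim_contract: str, run_agent) -> bool:
    """Grade an exploit attempt by on-chain balance deltas, not by the agent's claims.

    run_agent(w3) is a hypothetical callable that lets the agent send transactions
    against a local fork reachable at rpc_url (e.g. an anvil or hardhat node).
    """
    w3 = Web3(Web3.HTTPProvider(rpc_url))
    attacker = Web3.to_checksum_address(attacker)
    victim_contract = Web3.to_checksum_address(victim_contract)

    attacker_before = w3.eth.get_balance(attacker)
    victim_before = w3.eth.get_balance(victim_contract)

    run_agent(w3)  # the agent interacts with the chain here

    attacker_after = w3.eth.get_balance(attacker)
    victim_after = w3.eth.get_balance(victim_contract)

    # Count success only if value actually moved: victim drained, attacker enriched.
    return victim_after < victim_before and attacker_after > attacker_before
```

Grading on chain state rather than on the agent's own transcript is what makes this style of security evaluation hard to game.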
4) Top 5 papers (with “why now”)
1) AgentSCOPE: Evaluating Contextual Privacy Across Agentic Workflows
- Introduces Privacy Flow Graphs to evaluate privacy at each boundary (user→agent, agent→tool, tool→agent, agent→recipient).
- Shows output-only checks can massively understate risk: privacy violation rate (PVR) ≈ 82–94% vs output leak rate (LR) ≈ 24–40%, with TSR ≈ 63–79%.
- Adds actionable attribution via Violation Origin Rate and stage-wise breakdown (instruction/tool-response stages dominate).
- Skepticism: benchmark is 62 scenarios around a single persona; broader coverage needed.
2) IF-RewardBench: Benchmarking Judge Models for Instruction-Following Evaluation
- Large, human-verified judge meta-benchmark: 842 instructions, 6,011 responses, preference graphs via Pareto dominance.
- Evaluates both constraint verification and listwise ranking (Kendall τb); top proprietary judge reported 0.609 vs human 0.755.
- Finds judges struggle especially with negative-class detection, subjective constraints (Situation/Style), and complex compositions (Chain/Selection).
- Skepticism: residual subjectivity remains; cross-language analysis is explicitly incomplete.
3) Alignment Backfire: Language-Dependent Reversal of Safety Interventions Across 16 Languages in LLM Multi-Agent Systems
- Large preregistered multi-agent study (total N=1,584 runs) varying alignment ratio.
- Reports near-universal increase in Dissociation Index with alignment (15/16 languages) and language-dependent CPI bifurcation; Japanese backfire observed in Study 1 for Llama 3.3 70B.
- Shows a plausible “fix” (an individuation prompt) can be iatrogenic (reported DI increase of +1.120).
- Skepticism: alignment prefix is English even in non-English runs; DI depends on a monologue channel and uses keyword-based indices.
4) EVMbench: Evaluating AI Agents on Smart Contract Security
- Programmatic, reproducible evaluation across Detect (117), Patch (44), Exploit (23) with local-chain replay and anti-cheat RPC proxying.
- Reports meaningful capability: the top model (GPT-5.3-Codex) reaches 41.7% on Patch and 71.0% on Exploit; providing hints pushes Patch/Exploit scores much higher, indicating that vulnerability discovery is the bottleneck.
- Useful for both defense readiness and misuse forecasting because exploit success is graded by on-chain state/balance deltas.
- Skepticism: Detect scoring depends on historical audit reports and can’t credit novel valid findings; Patch/Exploit task counts are modest.
5) BandPO: Bridging Trust Regions and Ratio Clipping via Probability-Aware Bounds for LLM Reinforcement Learning
- Formalizes why fixed PPO/GRPO clipping suppresses tail-token improvements and contributes to entropy collapse.
- Provides a principled mapping from f-divergence trust regions → per-action ratio intervals, with closed forms for TV/χ² and numerical solvers for KL (a toy TV-case sketch follows this item).
- Empirically improves reasoning metrics (mean@32 gains ≥ ~2 points vs GRPO across multiple model sizes) and reports much higher converged entropy (~0.2 vs ~0.02).
- Skepticism: added compute for numerical bounds; evaluation focus is math reasoning benchmarks.
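A toy version of the TV case flagged in the BandPO item above, showing why a ratio interval derived from a per-action divergence budget widens for low-probability tokens while a fixed PPO clip does not. The single-action TV bound used here is an illustrative reconstruction, not BandPO's published derivation.

```python
def tv_ratio_interval(p_old: float, delta: float):
    """Ratio interval from bounding one action's contribution to a TV budget.

    Requiring 0.5 * p_old * |r - 1| <= delta gives r in [1 - 2*delta/p_old, 1 + 2*delta/p_old]
    (floored at 0, since a probability ratio cannot be negative). Illustrative
    reconstruction only; BandPO's published bounds may differ.
    """
    half_width = 2.0 * delta / p_old
    return max(0.0, 1.0 - half_width), 1.0 + half_width

# A fixed PPO clip treats every token identically; a divergence-derived interval does not.
for p_old in (0.5, 0.05, 0.005):
    lo, hi = tv_ratio_interval(p_old, delta=0.01)
    print(f"p_old={p_old:<6}  fixed clip=[0.80, 1.20]  tv-derived=[{lo:.2f}, {hi:.2f}]")
```

With a small budget, a rare token's allowed ratio range can be many times wider than the fixed [1−ε, 1+ε] clip, which is exactly the tail movement that fixed clipping suppresses.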
5) Practical next steps
- If you run agentic systems with tools, add pipeline-level privacy instrumentation: log and score user→agent, agent→tool, tool→agent, agent→output flows (AgentSCOPE-style), not just final responses.
- Before trusting LLM-as-judge, stress-test your exact judge configuration (model + rubric + prompt) for format invariance and stochastic stability (JRH-style); treat judge reliability as a gating metric.
- For instruction-following optimization, evaluate judges listwise (preference graphs / Kendall τb) and measure violation-detection (negative-class F1), not only pairwise win rates (IF-RewardBench); a τb sketch follows this list.
- For multilingual deployments, validate alignment interventions per language and per model family; don’t assume English-calibrated prompt alignment transfers (Alignment Backfire).
- For RLHF/GRPO pipelines, monitor tail-token clipping incidence and entropy collapse; consider probability-aware clipping (BandPO) when exploration dies early.
- For search/web agents, separate synthesis vs evidence acquisition: measure coverage/hit-rate (MPW) and robustness across UI versions (TimeWarp) to pinpoint whether failures are query formulation, stopping, or synthesis.
- For security posture, assume supply-chain risk: treat third-party distilled datasets as untrusted inputs (OD threat model) and add provenance/validation checks; for smart-contract domains, benchmark both defensive and offensive capability (EVMbench).
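For the listwise-evaluation step flagged above, a minimal sketch of measuring judge-versus-human rank agreement with Kendall's tau-b via SciPy; the rankings are toy data.

```python
from scipy.stats import kendalltau

# Rankings of the same six responses (1 = best); toy data for illustration only.
human_rank = [1, 2, 3, 4, 5, 6]
judge_rank = [1, 3, 2, 4, 6, 5]

# scipy's kendalltau computes the tau-b variant by default, which handles ties.
tau_b, p_value = kendalltau(human_rank, judge_rank)
print(f"Kendall tau-b = {tau_b:.3f} (p = {p_value:.3f})")
```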
Generated from per-paper analyses; no external browsing.
