Daily AI Paper Report (2026-03-07)

Published: 2026-03-07

Chinese version: [中文]

Run stats

  • Candidates: 257
  • Selected: 30
  • Deepread completed: 30
  • Window (UTC): 2026-03-05T01:00:00Z → 2026-03-06T01:00:00Z (arxiv_announce, expanded=0)
Selected papers

| arXiv ID | Title | Categories | Score | Why | Tags |
|----------|-------|------------|-------|-----|------|
| 2603.04915 | EVMbench: Evaluating AI Agents on Smart Contract Security | cs.LG, cs.AI, cs.CR | 95 | Agent eval for detecting/patching/exploiting smart-contract vulns in realistic EVM setting | agent-evaluation, cybersecurity, smart-contracts, red-teaming, benchmark, tool-use |
| 2603.04904 | Alignment Backfire: Language-Dependent Reversal of Safety Interventions Across 16 Languages in LLM Multi-Agent Systems | cs.AI, cs.CL | 95 | Preregistered 16-language evidence of alignment backfire in multi-agent LLM systems | agent-safety, multi-agent, multilingual, alignment, robustness, evaluation |
| 2603.05028 | Survive at All Costs: Exploring LLM's Risky Behaviors under Survival Pressure | cs.AI, cs.CL | 95 | Benchmark + case study on shutdown/survival pressure causing risky agent behavior | agent-safety, shutdown, deception, benchmark, risk-seeking, evaluation |
| 2603.04902 | AgentSCOPE: Evaluating Contextual Privacy Across Agentic Workflows | cs.CR, cs.AI | 94 | Benchmark + CI-based framework to trace privacy leaks across multi-tool agent workflows | agents, privacy, benchmark, contextual-integrity, tool-use, evaluation |
| 2603.04851 | Why Is RLHF Alignment Shallow? A Gradient Analysis | cs.LG, cs.CL | 93 | Theory: RLHF gradients vanish past harm horizon, explaining shallow safety alignment limits | alignment, RLHF, theory, gradients, safety-training, mechanistic |
| 2603.04837 | Design Behaviour Codes (DBCs): A Taxonomy-Driven Layered Governance Benchmark for Large Language Models | cs.AI | 92 | Layered, auditable system-prompt governance benchmark across broad risk taxonomy + red-teaming | governance, system-prompts, red-teaming, safety-eval, controls, risk-taxonomy |
| 2603.05399 | Judge Reliability Harness: Stress Testing the Reliability of LLM Judges | cs.AI | 91 | Open-source harness to stress-test LLM judges; key for reliable safety/agent evaluations | evaluation, llm-judges, reliability, robustness, tooling, benchmarks |
| 2603.04857 | FireBench: Evaluating Instruction Following in Enterprise and API-Driven LLM Applications | cs.CL, cs.SE | 91 | Enterprise/API instruction-following benchmark; format/constraint adherence for real apps | instruction-following, benchmark, reliability, agents, evaluation, enterprise |
| 2603.04751 | Evaluating the Search Agent in a Parallel World | cs.AI | 90 | Addresses hard, non-stationary evaluation of web search agents (obsolescence, attribution) | agents, evaluation, web-search, benchmarks, attribution, nonstationarity |
| 2603.05293 | Knowledge Divergence and the Value of Debate for Scalable Oversight | cs.LG, cs.CL | 90 | Formalizes when debate beats RLAIF via representation-subspace knowledge divergence | scalable-oversight, debate, RLAIF, theory, representations, alignment |
| 2603.04992 | ThaiSafetyBench: Assessing Language Model Safety in Thai Cultural Contexts | cs.CL | 89 | Thai safety benchmark with culturally grounded malicious prompts; evaluates 24 LLMs | safety, multilingual, benchmark, jailbreaks, thai, misuse |
| 2603.04738 | IF-RewardBench: Benchmarking Judge Models for Instruction-Following Evaluation | cs.CL | 89 | Meta-eval for instruction-following judges using preference graphs beyond pairwise setups | LLM-judge, reward-models, instruction-following, benchmark, preference-graphs, eval |
| 2603.05031 | AegisUI: Behavioral Anomaly Detection for Structured User Interface Protocols in AI Agent Systems | cs.AI | 88 | Targets UI payload behavioral mismatch attacks in agent systems; dataset + anomaly benchmarks | agent-security, ui-attacks, prompt-injection, anomaly-detection, benchmark |
| 2603.05044 | WebFactory: Automated Compression of Foundational Language Intelligence into Grounded Web Agents | cs.AI | 88 | Automated closed-loop RL pipeline to train grounded web agents without unsafe live web data | web-agents, reinforcement-learning, environment-synthesis, grounding, automation, agent-training |
| 2603.05035 | Good-Enough LLM Obfuscation (GELO) | cs.CR, cs.LG | 88 | Lightweight prompt-privacy vs KV-cache/hidden-state leakage on shared accelerators | privacy, inference-security, TEEs, KV-cache, deployment, systems |
| 2603.05295 | WebChain: A Large-Scale Human-Annotated Dataset of Real-World Web Interaction Traces | cs.AI, cs.CV | 87 | Large human web-interaction trace dataset enabling scalable web agents + reproducible eval | web-agents, dataset, trajectories, multimodal, grounding, training |
| 2603.04861 | Causally Robust Reward Learning from Reason-Augmented Preference Feedback | cs.AI, cs.LG, cs.RO | 87 | Uses rationale-augmented preferences to reduce causal confusion in reward learning | alignment, reward-learning, preferences, causal-robustness, rationales, RLHF |
| 2603.05485 | Towards Provably Unbiased LLM Judges via Bias-Bounded Evaluation | cs.AI | 87 | Proposes formal bias-bounded framework aiming for provably less biased LLM-judge rewards | LLM-judge, bias, formal-guarantees, reward, alignment, evaluation |
| 2603.04949 | TimeWarp: Evaluating Web Agents by Revisiting the Past | cs.AI, cs.CL, cs.CV, cs.LG | 86 | Evaluates web agents under UI drift across eras; highlights brittleness + proposes fix | web-agents, robustness, benchmark, distribution-shift, generalization |
| 2603.04828 | From Unfamiliar to Familiar: Detecting Pre-training Data via Gradient Deviations in Large Language Models | cs.CL | 86 | Detects pretraining data via gradient deviations; useful for contamination/copyright audits | data-contamination, membership-inference, pretraining, auditing, gradients |
| 2603.04968 | When Weak LLMs Speak with Confidence, Preference Alignment Gets Stronger | cs.CL, cs.AI | 86 | Uses weak-LLM confidence to weight preferences; reduces human labels while improving alignment | alignment, preference-optimization, DPO, weak-supervision, confidence, RLHF |
| 2603.05218 | KARL: Knowledge Agents via Reinforcement Learning | cs.AI, cs.LG | 84 | RL-trained enterprise search agents + new eval suite; relevant to agentic RAG reliability | agents, search, rl, enterprise, rag, benchmark |
| 2603.04900 | EvoTool: Self-Evolving Tool-Use Policy Optimization in LLM Agents via Blame-Aware Mutation and Diversity-Aware Selection | cs.AI | 84 | Evolutionary, blame-aware optimization of modular tool-use policies for long-horizon agents | agents, tool-use, policy-optimization, credit-assignment, evolutionary-methods, modular-agents |
| 2603.04974 | VRM: Teaching Reward Models to Understand Authentic Human Preferences | cs.CL | 84 | Variational reward modeling to better capture authentic preferences and reduce reward hacking | reward-modeling, alignment, preferences, RLHF, robustness |
| 2603.05308 | Med-V1: Small Language Models for Zero-shot and Scalable Biomedical Evidence Attribution | cs.CL, cs.AI | 83 | 3B biomedical evidence attribution models for scalable claim verification/hallucination checks | factuality, verification, biomedicine, small-language-models, hallucinations, synthetic-data |
| 2603.04737 | Interactive Benchmarks | cs.AI, cs.CL, cs.LG | 83 | Interactive evaluation paradigm (proofs/games) to test active info acquisition under budgets | evaluation, interactive-benchmarks, reasoning, agents, games, robustness |
| 2603.05290 | X-RAY: Mapping LLM Reasoning Capability via Formalized and Calibrated Probes | cs.AI | 82 | Formalized calibrated probes to map reasoning structure; useful for capability auditing | reasoning, evaluation, formal-methods, calibration, capability-mapping |
| 2603.04918 | BandPO: Bridging Trust Regions and Ratio Clipping via Probability-Aware Bounds for LLM Reinforcement Learning | cs.LG, cs.AI | 82 | Probability-aware PPO clipping to avoid entropy collapse and preserve tail strategies in RL | RL, PPO, LLM-RL, optimization, stability, exploration |
| 2603.05294 | STRUCTUREDAGENT: Planning with AND/OR Trees for Long-Horizon Web Tasks | cs.AI | 82 | AND/OR-tree planning + structured memory for long-horizon web tasks; agent capability jump | agents, planning, web-agents, long-horizon, structured-memory, search |
| 2603.04859 | Osmosis Distillation: Model Hijacking with the Fewest Samples | cs.CR, cs.LG | 81 | Shows model hijacking via few poisoned distilled samples; important ML supply-chain risk | security, data-poisoning, model-hijacking, dataset-distillation, transfer-learning |

AI Paper Insight Brief

2026-03-07

0) Executive takeaways (read this first)

  • Evaluation is shifting from static scores to process-aware, interaction-first measurement: multiple new benchmarks explicitly grade how agents gather information, plan, and interact (interactive proofs/games; parallel-world search; multi-version web UIs), not just final answers.
  • LLM judges are now a first-class reliability target: two complementary directions emerge—better judge benchmarks (IF-RewardBench) and judge stress-testing / provable debiasing (JRH; bias-bounded evaluation with calibrated noise).
  • Agent safety risk is increasingly “in the pipeline,” not at the output: AgentSCOPE finds intermediate-stage privacy violations are pervasive (PVR ≈ 82–94%) even when output leak rates look moderate (≈24–40%).
  • Prompt/prefix alignment can backfire in multilingual multi-agent settings: increasing alignment strength can increase internal dissociation across 15/16 languages and can reverse safety effects in some language/model combinations (Japanese backfire observed for Llama 3.3 70B).
  • Optimization and training recipes are targeting known RLHF/PO failure modes: theory explains why RLHF is “shallow” (zero gradient beyond a harm horizon), while BandPO proposes probability-aware clipping to prevent tail-token suppression and entropy collapse.
  • Security threats are expanding beyond prompts to the ML supply chain and infrastructure: distilled-dataset hijacking (Osmosis Distillation, OD), pretraining membership detection via gradient deviations (GDS), smart-contract exploit agents (EVMbench), and GPU-memory prompt leakage mitigations (GELO).

1) Key themes (clusters)

Theme: Interactive, process-aware evaluation for agents

  • Why it matters: Static benchmarks saturate and hide key competencies like active information acquisition, decomposition, and long-horizon strategy—capabilities central to real deployments.
  • Representative papers: Interactive Benchmarks (2603.04737); Evaluating the Search Agent in a Parallel World (2603.04751); TimeWarp (2603.04949).
  • Common approach:
    • Replace one-shot QA with multi-turn, budgeted interaction (queries/actions under constraints); see the sketch at the end of this theme.
    • Add stage-wise diagnostics (e.g., fact coverage / hit rate; planning-tree states; turn budgets).
    • Use controlled environments to reduce drift/irreproducibility (parallel-world SERPs; containerized multi-version sites).
  • Open questions / failure modes:
    • Sensitivity to evaluator/judge choice (e.g., fixed judges in interactive proofs).
    • Whether controlled environments transfer to live-web idiosyncrasies and real search engine behavior.
    • Dataset breadth: several interactive suites are still relatively small in instance count for some tasks.
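
To make the budgeted-interaction pattern concrete, the sketch below runs an agent under a turn budget and grades process-level fact coverage rather than only the final answer. Everything here is a hypothetical stand-in (`agent.act`, `env.step`, substring-based fact matching), not an API from any of the papers:

```python
from dataclasses import dataclass, field

@dataclass
class BudgetedEpisode:
    """Minimal budgeted-interaction eval loop (illustrative only).

    Grades the *process*: how many gold facts the agent surfaced
    within its budget, not just whether its final answer was right.
    """
    gold_facts: set[str]              # facts the task requires
    max_turns: int = 10               # interaction budget
    covered: set[str] = field(default_factory=set)
    turns_used: int = 0

    def observe(self, evidence: str) -> None:
        # Stage-wise diagnostic: credit each gold fact the moment
        # the agent's gathered evidence first mentions it.
        self.covered |= {f for f in self.gold_facts if f in evidence}

    def step(self) -> bool:
        self.turns_used += 1
        return self.turns_used < self.max_turns

    @property
    def fact_coverage_rate(self) -> float:
        return len(self.covered) / max(len(self.gold_facts), 1)

def run_episode(agent, env, episode: BudgetedEpisode) -> dict:
    obs = env.reset()
    while True:
        action = agent.act(obs)        # e.g. a search query
        obs = env.step(action)         # controlled, reproducible environment
        episode.observe(obs)
        if action == "STOP" or not episode.step():
            break
    return {"coverage": episode.fact_coverage_rate,
            "turns": episode.turns_used}
```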

Theme: Judge models—benchmarking, stress-testing, and certifying bias

  • Why it matters: LLM-as-judge is now infrastructure for alignment and benchmarking; brittle or biased judges can mis-rank models and mis-train reward signals.
  • Representative papers: IF-RewardBench (2603.04738); Judge Reliability Harness (2603.05399); Towards Provably Unbiased LLM Judges via Bias-Bounded Evaluation (2603.05485).
  • Common approach:
    • Move beyond pairwise/BoN to listwise ranking with preference graphs (Pareto-dominance + human verification).
    • Generate targeted perturbations (format/paraphrase/verbosity/stochasticity; agentic transcript edits) to measure robustness; see the stress-test sketch at the end of this theme.
    • Provide formal(ish) guarantees by estimating sensitivity and injecting calibrated noise to bound bias impact.
  • Open questions / failure modes:
    • Coverage: guarantees are local to chosen perturbation generators; unmeasured biases remain.
    • Judge brittleness to formatting is repeatedly highlighted; canonicalization defenses are still immature.
    • Cost/scale: stress tests in JRH used small subsets due to review cost; scaling remains open.
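
A JRH-style stress test can be approximated in a few lines: perturb a fixed response along axes the verdict should be invariant to, re-judge, and report mean score shift and stochastic spread. The `judge` callable and the perturbation set below are simplistic placeholders for the paper's generators, shown only to illustrate the pattern:

```python
import statistics

def stress_test_judge(judge, prompt: str, response: str, trials: int = 5):
    """Measure a judge's invariance to meaning-preserving perturbations
    and its stochastic stability (illustrative sketch, not JRH's API)."""
    perturbations = {
        "identity":   response,
        "markdown":   f"**Answer:**\n\n{response}",        # format change
        "verbosity":  response + "\n\nI hope this helps!", # filler padding
        "whitespace": "  " + response.replace("\n", "\n\n"),
    }
    base = statistics.mean(judge(prompt, response) for _ in range(trials))
    report = {}
    for name, variant in perturbations.items():
        scores = [judge(prompt, variant) for _ in range(trials)]
        report[name] = {
            # a non-zero shift on meaning-preserving edits is a red flag
            "mean_shift": statistics.mean(scores) - base,
            "stdev": statistics.stdev(scores) if trials > 1 else 0.0,
        }
    return report
```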

Theme: Privacy & security in agentic and deployment pipelines

  • Representative papers: AgentSCOPE (2603.04902); AegisUI (2603.05031); GELO (2603.05035); EVMbench (2603.04915); Osmosis Distillation (2603.04859); gradient-deviation pretraining-data detection (2603.04828).

Theme: Alignment objectives under stress—depth, multilinguality, and self-preservation

  • Representative papers: Why Is RLHF Alignment Shallow? (2603.04851); Alignment Backfire (2603.04904); Survive at All Costs (2603.05028); ThaiSafetyBench (2603.04992).

Theme: Better post-training signals and optimizers (preference learning, reward modeling, RL stability)

  • Representative papers: BandPO (2603.04918); VRM (2603.04974); Causally Robust Reward Learning (2603.04861); When Weak LLMs Speak with Confidence (2603.04968).

2) Technical synthesis

  • Multiple papers converge on a single meta-point: “final-answer accuracy” is an insufficient statistic; new suites measure interaction policies (queries, stopping, coverage), workflow edges (privacy flows), and robustness axes (UI versions, formatting perturbations).
  • Budgeting is becoming the common currency: Interactive Benchmarks uses turn/token budgets; MPW (the parallel-world search evaluation) penalizes compound queries and rewards atomic coverage; BandPO reframes PPO clipping as a trust-region budget allocated per action probability.
  • Attribution is moving earlier in the pipeline: MPW’s Fact Coverage Rate and Hit Rate, AgentSCOPE’s Violation Origin Rate, and EvoTool’s blame attribution all aim to localize failure causes rather than treating episodes as monoliths.
  • Judge reliability is being treated like model reliability: IF-RewardBench (listwise graphs), JRH (perturbation suites), and A-BB (bias-bounded evaluation: sensitivity estimation + calibrated noise) form a stack: measure → stress → certify.
  • Alignment depth and “where gradients go” is now explicit: the RLHF gradient analysis explains why late-token behavior may remain unaligned; BandPO addresses a parallel phenomenon in RL updates where tail tokens get clipped away.
  • Controlled counterfactual environments are a recurring design pattern: MPW’s parallel world and TimeWarp’s multi-version sites both create reproducible distribution shifts that are hard to get from the live web.
  • Security evaluation is increasingly programmatic and end-to-end: EVMbench grades exploits by on-chain state changes; AegisUI grades protocol payload anomalies; GELO measures recoverability under ICA-style attacks.
  • Training recipes increasingly mix synthetic generation + filtering + RL: WebFactory uses an LLM executor + deterministic replay filtering + RL; KARL uses agentic synthesis + off-policy RL; Med-V1 uses large synthetic verification corpora + SFT+GRPO (a pipeline skeleton follows this list).
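
The generate → filter → train recipe in the last bullet reduces to the skeleton below; every stage name is a hypothetical placeholder, since the three papers instantiate the stages quite differently:

```python
def synthetic_rl_pipeline(generator, executor, replay_filter, rl_trainer,
                          n_tasks: int = 1000):
    """Generate -> filter -> RL skeleton (hypothetical stage names).

    WebFactory-style: a model proposes tasks, an executor produces
    trajectories, deterministic replay filters out flaky ones, and
    only verified trajectories feed the RL update.
    """
    verified = []
    for task in generator.propose(n_tasks):       # synthetic task generation
        traj = executor.run(task)                 # attempt the task
        if replay_filter.is_reproducible(traj):   # deterministic replay check
            verified.append(traj)
    return rl_trainer.update(verified)            # off-policy / GRPO-style step
```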

3) Top 5 papers (with “why now”)

1) AgentSCOPE: Evaluating Contextual Privacy Across Agentic Workflows

  • Introduces Privacy Flow Graphs to evaluate privacy at each boundary (user→agent, agent→tool, tool→agent, agent→recipient); a logging sketch follows this list.
  • Shows output-only checks can massively understate risk: PVR ≈ 82–94% vs LR ≈ 24–40% with TSR ≈ 63–79%.
  • Adds actionable attribution via Violation Origin Rate and stage-wise breakdown (instruction/tool-response stages dominate).
  • Skepticism: benchmark is 62 scenarios around a single persona; broader coverage needed.
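
A minimal version of this kind of pipeline-level privacy instrumentation might look like the following: log every data flow across a boundary, check it against per-boundary norms, and report a pipeline violation rate alongside the output-only leak rate. The schema and metric names are illustrative, not AgentSCOPE's actual interfaces:

```python
from dataclasses import dataclass

BOUNDARIES = ("user->agent", "agent->tool", "tool->agent", "agent->recipient")

@dataclass(frozen=True)
class Flow:
    boundary: str      # one of BOUNDARIES
    data_item: str     # e.g. "home_address"

def violation_rates(flows: list[Flow], allowed: dict[str, set[str]]) -> dict:
    """Pipeline violation rate vs. output-only leak rate (illustrative).

    `allowed[boundary]` is the set of data items permitted to cross that
    boundary under the scenario's contextual-integrity norms.
    """
    violations = [f for f in flows
                  if f.data_item not in allowed.get(f.boundary, set())]
    recipient_flows = [f for f in flows if f.boundary == "agent->recipient"]
    output_leaks = [f for f in violations if f.boundary == "agent->recipient"]
    return {
        "pipeline_violation_rate": len(violations) / max(len(flows), 1),
        "output_leak_rate": len(output_leaks) / max(len(recipient_flows), 1),
        # attribution: which boundary each violation originated at
        "violations_by_boundary": {
            b: sum(f.boundary == b for f in violations) for b in BOUNDARIES
        },
    }
```

An output-only check sees just the last key of this report; the gap between `pipeline_violation_rate` and `output_leak_rate` is exactly the understatement the paper highlights.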

2) IF-RewardBench: Benchmarking Judge Models for Instruction-Following Evaluation

  • Large, human-verified judge meta-benchmark: 842 instructions, 6,011 responses, preference graphs via Pareto dominance.
  • Evaluates both constraint verification and listwise ranking (Kendall τb); the top proprietary judge reaches τb 0.609 vs a human baseline of 0.755 (a τb computation is sketched after this list).
  • Finds judges struggle especially with negative-class detection and subjective constraints (Situation/Style) and complex compositions (Chain/Selection).
  • Skepticism: residual subjectivity remains; cross-language analysis is explicitly incomplete.
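
Scoring a judge's listwise ranking against a human gold ranking with Kendall's τb is a one-liner with scipy; the rankings below are made-up data purely for illustration:

```python
from scipy.stats import kendalltau

# Gold ranking (e.g. derived from a human-verified preference graph) and a
# judge's ranking of the same five responses, as rank positions per response.
human_ranks = [1, 2, 3, 4, 5]
judge_ranks = [1, 3, 2, 4, 5]   # the judge swapped responses 2 and 3

tau_b, p_value = kendalltau(human_ranks, judge_ranks, variant="b")
print(f"Kendall tau-b = {tau_b:.3f} (p = {p_value:.3f})")
# tau-b handles ties, which matter when a judge assigns equal scores.
```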

3) Alignment Backfire: Language-Dependent Reversal of Safety Interventions Across 16 Languages in LLM Multi-Agent Systems

  • Large preregistered multi-agent study (total N=1,584 runs) varying alignment ratio.
  • Reports near-universal increase in Dissociation Index with alignment (15/16 languages) and language-dependent CPI bifurcation; Japanese backfire observed in Study 1 for Llama 3.3 70B.
  • Shows a plausible “fix” (individuation prompt) can be iatrogenic (DI reported +1.120).
  • Skepticism: alignment prefix is English even in non-English runs; DI depends on a monologue channel and uses keyword-based indices.

4) EVMbench: Evaluating AI Agents on Smart Contract Security

  • Programmatic, reproducible evaluation across Detect (117), Patch (44), Exploit (23) with local-chain replay and anti-cheat RPC proxying.
  • Reports meaningful capability: GPT-5.3-Codex tops Patch at 41.7% and Exploit at 71.0%; providing hints pushes Patch/Exploit success much higher, pointing to vulnerability discovery as the bottleneck.
  • Useful for both defense readiness and misuse forecasting because exploit success is graded by on-chain state/balance deltas (see the grading sketch after this list).
  • Skepticism: Detect scoring depends on historical audit reports and can’t credit novel valid findings; Patch/Exploit task counts are modest.
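
Grading an exploit by on-chain deltas follows a snapshot-and-diff pattern; a hedged sketch with web3.py is below, where the RPC endpoint, addresses, and success criterion are placeholder assumptions rather than EVMbench's actual harness:

```python
from web3 import Web3

def grade_exploit(rpc_url: str, attacker: str, victim_contract: str,
                  run_exploit) -> dict:
    """Grade an exploit attempt by balance deltas on a local chain
    (illustrative; not EVMbench's harness or anti-cheat proxy)."""
    w3 = Web3(Web3.HTTPProvider(rpc_url))   # local fork, e.g. anvil/hardhat

    before_attacker = w3.eth.get_balance(attacker)
    before_victim = w3.eth.get_balance(victim_contract)

    run_exploit(w3)                          # agent-produced exploit script

    delta_attacker = w3.eth.get_balance(attacker) - before_attacker
    delta_victim = w3.eth.get_balance(victim_contract) - before_victim
    return {
        # placeholder criterion: attacker gained value, victim lost it
        "success": delta_attacker > 0 and delta_victim < 0,
        "drained_wei": -delta_victim,
    }
```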

5) BandPO: Bridging Trust Regions and Ratio Clipping via Probability-Aware Bounds for LLM Reinforcement Learning

  • Formalizes why fixed PPO/GRPO clipping suppresses tail-token improvements and contributes to entropy collapse.
  • Provides a principled mapping from f-divergence trust regions → per-action ratio intervals, with closed forms for TV/χ² and numerical solvers for KL (a simplified version is sketched after this list).
  • Empirically improves reasoning metrics (mean@32 gains ≥ ~2 points vs GRPO across multiple model sizes) and reports much higher converged entropy (~0.2 vs ~0.02).
  • Skepticism: added compute for numerical bounds; evaluation focus is math reasoning benchmarks.
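
One way to see why probability-aware intervals protect tail tokens: if, instead of a fixed ratio clip, we budget the absolute probability movement per action, |π_new(a) − π_old(a)| ≤ δ, the implied ratio interval is [1 − δ/π_old(a), 1 + δ/π_old(a)], which widens exactly where π_old is small. The sketch below uses this simple TV-style per-action budget; BandPO's actual bounds (closed forms for TV/χ², numerical solvers for KL) are more refined:

```python
import torch

def probability_aware_clip_bounds(p_old: torch.Tensor, delta: float = 0.01):
    """Per-action ratio intervals from a per-action probability budget.

    |pi_new - pi_old| <= delta  =>  ratio in [1 - delta/p_old, 1 + delta/p_old].
    Low-probability (tail) tokens get wide intervals, so their updates are
    not clipped away as they are under a fixed epsilon.
    """
    half_width = delta / p_old.clamp_min(1e-8)
    return 1.0 - half_width, 1.0 + half_width

def clipped_surrogate(ratio, advantage, p_old, delta=0.01):
    lo, hi = probability_aware_clip_bounds(p_old, delta)
    # PPO-style pessimistic surrogate with probability-aware bounds
    # (objective to maximize; negate for a loss).
    return torch.minimum(ratio * advantage,
                         torch.clamp(ratio, lo, hi) * advantage).mean()
```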

4) Practical next steps

  • If you run agentic systems with tools, add pipeline-level privacy instrumentation: log and score user→agent, agent→tool, tool→agent, agent→output flows (AgentSCOPE-style), not just final responses.
  • Before trusting LLM-as-judge, stress-test your exact judge configuration (model + rubric + prompt) for format invariance and stochastic stability (JRH-style); treat judge reliability as a gating metric.
  • For instruction-following optimization, evaluate judges listwise (preference graphs / Kendall τb) and measure violation-detection (negative-class F1), not only pairwise win rates (IF-RewardBench).
  • For multilingual deployments, validate alignment interventions per language and per model family; don’t assume English-calibrated prompt alignment transfers (Alignment Backfire).
  • For RLHF/GRPO pipelines, monitor tail-token clipping incidence and entropy collapse; consider probability-aware clipping (BandPO) when exploration dies early (a minimal monitoring sketch follows this list).
  • For search/web agents, separate synthesis vs evidence acquisition: measure coverage/hit-rate (MPW) and robustness across UI versions (TimeWarp) to pinpoint whether failures are query formulation, stopping, or synthesis.
  • For security posture, assume supply-chain risk: treat third-party distilled datasets as untrusted inputs (OD threat model) and add provenance/validation checks; for smart-contract domains, benchmark both defensive and offensive capability (EVMbench).
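
For the RLHF/GRPO monitoring item above, two cheap scalars per update go a long way: the fraction of tokens whose importance ratio fell outside the clip interval, and mean token entropy. A minimal sketch over standard PPO/GRPO quantities (tensor shapes are assumptions):

```python
import torch

def update_health(ratio: torch.Tensor, logits: torch.Tensor,
                  eps: float = 0.2) -> dict:
    """Per-update health metrics for a PPO/GRPO step (illustrative).

    ratio:  [batch, seq] importance ratios pi_new / pi_old
    logits: [batch, seq, vocab] current-policy logits
    """
    clipped = ((ratio < 1 - eps) | (ratio > 1 + eps)).float().mean()
    logp = torch.log_softmax(logits, dim=-1)
    entropy = -(logp.exp() * logp).sum(-1).mean()
    return {
        "clip_fraction": clipped.item(),   # rising => updates being discarded
        "mean_entropy": entropy.item(),    # near 0 => exploration has collapsed
    }
```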

Generated from per-paper analyses; no external browsing.