Daily AI Paper Report (2026-03-03)

Published: March 03, 2026

Chinese version: [中文]

Run stats

Candidates: 600
Selected: 30
Deepread completed: 31
Window (UTC): 2026-02-28T01:00:00Z → 2026-03-03T01:00:00Z (arxiv_announce, expanded=0)

Show selected papers

arXiv ID	Title / Links	Categories	Score	Why	Tags
`2603.01608`	Evaluating and Understanding Scheming Propensity in LLM Agents PDF	cs.AI	95	Realistic eval of LLM agent scheming propensity; varies incentives & environment factors	agent-safety, scheming, deception, evaluation, autonomous-agents, instrumental-goals
`2603.00476`	Atomicity for Agents: Exposing, Exploiting, and Mitigating TOCTOU Vulnerabilities in Browser-Use Agents PDF	cs.CR, cs.AI	95	Finds widespread TOCTOU exploits in browser agents; benchmark + mitigations for real web adversaries	agent-security, browser-agents, TOCTOU, prompt-injection-adjacent, benchmark, web-safety
`2603.00829`	Constitutional Black-Box Monitoring for Scheming in LLM Agents PDF	cs.CL, cs.AI, cs.LG	95	Black-box constitutional monitors for detecting scheming in LLM agents; synthetic trajectory pipelines.	agent-safety, scheming, monitoring, oversight, synthetic-data, evaluations
`2603.00991`	Tracking Capabilities for Safer Agents PDF	cs.AI, cs.PL	94	Capability-safe “safety harness” for agents via static effect tracking; strong, actionable safety design.	agents, capabilities, tool-safety, programming-languages, sandboxing, information-leakage
`2603.01297`	I Can't Believe It's Not Robust: Catastrophic Collapse of Safety Classifiers under Embedding Drift PDF	cs.LG, cs.CL	94	Shows safety classifiers can silently fail under tiny embedding drift; critical for deployed LLM safety stacks	safety-classifiers, distribution-shift, embedding-drift, monitoring, robustness, deployment
`2603.00801`	The Synthetic Web: Adversarially-Curated Mini-Internets for Diagnosing Epistemic Weaknesses of Language Agents PDF	cs.AI, cs.IR	93	Adversarial mini-webs to causally test agent epistemic robustness to misleading search/ranking.	agents, web, benchmark, adversarial-content, epistemics, RAG, evaluation
`2603.00811`	Curation Leaks: Membership Inference Attacks against Data Curation for Machine Learning PDF	cs.LG	93	Shows privacy leakage via data curation; new membership inference attacks across pipeline steps	privacy, membership-inference, data-curation, security, ML-pipelines
`2603.01185`	Token-level Data Selection for Safe LLM Fine-tuning PDF	cs.CL, cs.AI, cs.CR	93	Token-level filtering to prevent safety regression during fine-tuning; better safety/utility trade-off.	LLM-safety, fine-tuning, data-selection, robustness, safety-regression
`2603.01246`	Defensive Refusal Bias: How Safety Alignment Fails Cyber Defenders PDF	cs.CR, cs.AI	92	Measures safety-tuning refusals harming cyber defenders; large real-world NCCDC dataset	safety-alignment, refusal, cybersecurity, evaluation, overrefusal, deployment
`2603.01285`	Attention Smoothing Is All You Need For Unlearning PDF	cs.LG, cs.AI, cs.CL	92	Simple unlearning method via attention smoothing; targets memorization of sensitive/hazardous content	unlearning, privacy, safety, LLMs, self-distillation, memorization
`2603.00495`	AI Runtime Infrastructure PDF	cs.AI	92	Execution-time layer to monitor/intervene in agent runs for reliability, efficiency, and safety.	agents, runtime-monitoring, policy-enforcement, reliability, safety-infrastructure, long-horizon
`2603.01589`	SafeSci: Safety Evaluation of Large Language Models in Science Domains and Beyond PDF	cs.LG, cs.AI	91	Large, objective scientific safety eval+train data (0.25M/1.5M); broad risk coverage.	safety-evaluation, benchmarks, science, datasets, risk-assessment, training-data
`2603.01423`	Quantifying Conversational Reliability of Large Language Models under Multi-Turn Interaction PDF	cs.CL	91	Systematic multi-turn reliability eval: constraints, tool selection, entity tracking; quantifies degradation	reliability, multi-turn, tool-use, evaluation, entity-tracking
`2603.00902`	Clawdrain: Exploiting Tool-Calling Chains for Stealthy Token Exhaustion in OpenClaw Agents PDF	cs.CR, cs.SE	90	Realistic toolchain attack on open agent ecosystem causing 6–9x token/billing amplification.	agents, tool-calling, prompt-injection, denial-of-wallet, security, ecosystems
`2603.00823`	A Comprehensive Evaluation of LLM Unlearning Robustness under Multi-Turn Interaction PDF	cs.CL, cs.AI	90	Shows LLM unlearning can fail under multi-turn interaction; interactive robustness eval	unlearning, privacy, safety, multi-turn, robustness, evaluation
`2603.01357`	ASTRA-bench: Evaluating Tool-Use Agent Reasoning and Action Planning with Personal User Context PDF	cs.AI	90	Interactive tool-use benchmark with evolving personal context; exposes failures under high complexity	agents, tool-use, benchmark, personal-context, evaluation
`2603.01335`	Provable and Practical In-Context Policy Optimization for Self-Improvement PDF	cs.LG, cs.AI	90	In-context policy optimization for test-time self-improvement; theory + practical ME-ICPO algorithm	test-time-scaling, self-improvement, in-context-learning, RL, reliability
`2603.01714`	TopoCurate:Modeling Interaction Topology for Tool-Use Agent Training PDF	cs.LG, cs.CL	90	Interaction-topology curation for tool-use agents; targets error recovery and nontrivial task structure	agents, tool-use, training, trajectory-curation, interaction-dynamics, RL
`2603.00822`	ContextCov: Deriving and Enforcing Executable Constraints from Agent Instruction Files PDF	cs.SE, cs.AI	90	Turns agent instruction files into executable constraints to prevent context drift in autonomous coding agents.	agents, guardrails, software-engineering, policy-enforcement, reliability
`2603.00552`	EMPA: Evaluating Persona-Aligned Empathy as a Process PDF	cs.AI	90	Process-based eval for persona-aligned empathy; multi-agent sandbox exposes trajectory failure modes.	evaluation, dialogue-agents, safety, human-values, multi-agent, mental-health
`2603.02208`	Reasoning Core: A Scalable Procedural Data Generation Suite for Symbolic Pre-training and Post-Training PDF	cs.CL	90	Scalable procedural suite for verifiable symbolic reasoning data with solvers + curriculum control.	reasoning, synthetic-data, verification, symbolic, curriculum, benchmarks
`2603.00516`	ProtegoFed: Backdoor-Free Federated Instruction Tuning with Interspersed Poisoned Data PDF	cs.CR	89	Federated instruction tuning with interspersed poisoning; proposes backdoor-free defense	data-poisoning, backdoors, federated-learning, instruction-tuning, security, robustness
`2603.02146`	LongRLVR: Long-Context Reinforcement Learning Requires Verifiable Context Rewards PDF	cs.CL	89	Shows outcome-only RLVR fails for long context; proposes verifiable context rewards to fix grounding	long-context, RLVR, grounding, rewards, reasoning, training
`2603.01209`	Agents Learn Their Runtime: Interpreter Persistence as Training-Time Semantics PDF	cs.AI, cs.LG	89	Studies interpreter state persistence as train-time semantics; important for reliable CodeAct-style agents	agents, tool-use, python, state, training-semantics, benchmarks
`2603.02196`	Conformal Policy Control PDF	cs.AI, cs.LG, math.ST, stat.ML	89	Conformal calibration to regulate new policies vs safe reference policy with user-specified risk tolerance.	safe-exploration, conformal, policy-safety, risk-control, reinforcement-learning
`2603.00498`	Antibody: Strengthening Defense Against Harmful Fine-Tuning for Large Language Models via Attenuating Harmful Gradient Influence PDF	cs.LG	88	Defense vs harmful fine-tuning by attenuating harmful gradient influence; relevant to FTaaS.	alignment, fine-tuning-security, poisoning, robust-training, gradient-regularization
`2603.02119`	Pencil Puzzle Bench: A Benchmark for Multi-Step Verifiable Reasoning PDF	cs.AI, cs.GT, cs.LG	88	Verifiable step-level reasoning benchmark (pencil puzzles) enabling dense process supervision signals	reasoning, benchmark, process-supervision, verification, agentic-eval
`2603.01260`	MOSAIC: A Unified Platform for Cross-Paradigm Comparison and Evaluation of Homogeneous and Heterogeneous Multi-Agent RL, LLM, VLM, and Human Decision-Makers PDF	cs.LG, cs.AI	88	Unified platform to compare RL/LLM/VLM/humans in same envs; useful for agent safety studies.	agents, evaluation-platform, multi-agent, LLM, VLM, reproducibility, benchmarks
`2603.02203`	Tool Verification for Test-Time Reinforcement Learning PDF	cs.AI, cs.CL	88	Adds tool-based verification to test-time RL to prevent spurious consensus and mode collapse.	test-time-training, RL, tool-verification, robustness, math-reasoning, self-improvement
`2603.01940`	CoVe: Training Interactive Tool-Use Agents via Constraint-Guided Verification PDF	cs.AI	87	Constraint-guided trajectory synthesis + deterministic verification for tool-use agents	tool-use, agents, post-training, verification, synthetic-data, RL

AI Paper Insight Brief

2026-03-03

0) Executive takeaways (read this first)

Agent safety is shifting from “prompt attacks” to “execution-time failures”: dynamic web TOCTOU, adversarial ranking, tool-chain token drains, and runtime semantic mismatches all break agents without needing to jailbreak the model.
Reproducibility infrastructure is becoming a first-class safety lever: Jailbreak Foundry shows paper→runnable-attack automation can keep jailbreak evaluation current (30 attacks; mean ASR deviation +0.26pp), but raises dual-use concerns.
Data pipelines leak and poison in new places: “curation for privacy” can still leak membership (even if private data isn’t trained on), and federated instruction tuning is vulnerable to user-level interspersed poisoning—yet collaborative filtering can drive ASR to 0.00.
Fine-tuning defenses are getting more granular: two complementary directions—(i) gradient influence attenuation (Antibody) and (ii) token-level masking (TOSS/TOSS-Pro)—both aim to preserve utility while preventing safety regression.
Static safety components are brittle under distribution/representation shift: embedding-based safety classifiers can collapse to near-random under small embedding drift while staying overconfident; multi-turn interaction can also “recover” unlearned knowledge.
Evaluation is moving to trajectory/process metrics: empathy as latent-state “effective work” (EMPA), personal assistant tool-use with milestones/minefields (ASTRA-bench), and deterministic multi-turn reliability tasks all expose failures hidden by single-turn scoring.

2) Key themes (clusters)

Theme: Execution-time agent vulnerabilities (beyond prompt injection)

Why it matters: Many real failures happen after planning—when the environment changes, tools misbehave, or costs explode—so “safe outputs” aren’t enough; you need runtime checks and atomicity.
Representative papers:
Common approach:
- Define a concrete threat model tied to agent loop timing or tool/runtime semantics (TOCTOU window; rank-0 honeypot; persistent context ingestion; persistence vs reset).
- Build controlled benchmarks/environments to isolate causality (DYNWEB; synthetic mini-internets; deployment-grounded OpenClaw; 2×2 train/runtime study).
- Add lightweight runtime mitigations where possible (pre-execution validation with DOM/layout monitors).
Open questions / failure modes:
- Residual windows remain (TOCTOU validation→execution ~0.13s; stress tests still succeed 0.2–0.3%).
- Agents may not escalate search/verification even when unlimited tools exist (minimal tool-call increase under adversarial ranking).
- Recovery behaviors can increase cost/attack impact (token drain amplified by fallback cascades).
- Train–deployment mismatches can silently tax cost or cause brittle errors (Persistent→Stateless ~80% episodes with unresolved references).

Theme: Safety evaluation & benchmarking becomes “process-first”

Why it matters: Single-turn metrics miss long-horizon drift, latent-state dynamics, and tool-execution correctness; process-level evaluation is becoming necessary for both safety and capability claims.
Representative papers:
Common approach:
- Replace scalar “goodness” with trajectory scoring (latent psychodynamic vectors; milestone/minefield DAGs; deterministic pass/fail across turns; step-level puzzle checks).
- Use controlled perturbations to test metric robustness (persona flip; sycophancy attacks; single-turn vs multi-turn paired datasets).
- Emphasize verifiability (tool-trace/state checks; deterministic constraints; engine-based rule checking).
Open questions / failure modes:
- Judge reliability and synthetic-to-real gaps (EMPA human proxy disagreement; ASTRA synthetic corpora gap; puzzle bench lacks human baseline).
- Multi-turn reliability failures can be huge and model-dependent (instruction-following drops like GPT-4o 96%→63; GPT-4o-mini 93%→24).
- Tool-use bottlenecks concentrate in argument/payload generation (ASTRA payload scores 0.5603–0.8478).

Theme: Fine-tuning & federated training under poisoning/backdoors

Why it matters: Customization pipelines (FTaaS, FIT) are now core product surfaces; attackers can subvert safety with small poisoned fractions or distributed user-level poisoning.
Representative papers:
Common approach:
- Treat safety degradation as an optimization signal problem (harmful gradients; unsafe tokens; frequency-domain gradient signatures).
- Use selective attenuation rather than blunt filtering (sample reweighting; token masking; clustering-based sample filtering).
- Evaluate across models/datasets/attack strengths and include ablations (Antibody component ablations; TOSS vs TOSS-Pro; ProtegoFed recall/overhead).
Open questions / failure modes:
- Compute/memory overhead can be high (Antibody time 6.27h; max memory 64.55GB).
- Assumptions can break (ProtegoFed invalid if globally mostly poisoned).
- Dependence on reference models/tokenizers and scoring quality (TOSS relies on safety-degraded vs utility-oriented references; transfer assumes shared tokenizer).

Theme: Privacy & safety leakage through “upstream” pipelines (curation, unlearning)

Why it matters: Even if you avoid training on private data, selection and post-hoc removal can leak or fail under realistic interaction.
Representative papers:
Common approach:
- Attack/measure leakage at multiple pipeline stages (scores, subset mask, final model fingerprinting).
- Stress unlearning under interaction patterns (self-correction; dialogue-conditioned querying).
- Propose simpler unlearning objectives targeting internal mechanisms (attention-temperature “forget teacher” distillation).
Open questions / failure modes:
- Multi-turn interaction can recover “forgotten” knowledge (Round-2 accuracy often ~60–75% after unlearning methods that were ~25% in Round-1).
- DP mitigations reduce leakage but require privacy–utility trade-off study (e.g., ε=100 drops TPR@1%FPR 0.9842→0.0155 for image-based curation on CIFAR-10 T =1000).
- Unlearning methods may trade off coherence (ASU notes τ>4 often yields gibberish; requires tuning).

Theme: Oversight & guardrails for agents (monitors, constraints, capabilities)

Why it matters: As agents become more autonomous, you need controls that work at low false-positive rates, without relying on chain-of-thought, and ideally with enforceable guarantees.
Representative papers:
Common approach:
- Black-box, action-only monitoring with low-FPR metrics (pAUROC over FPR∈[0,0.2]).
- Factorized analysis of “when scheming happens” (prompt snippets, tools, incentives) rather than treating it as a fixed trait.
- Compile natural-language policies into executable checks (repo instruction files → process/source/architecture checks).
- Shift trust to static enforcement (capability-typed code; “Classified” purity constraints).
Open questions / failure modes:
- Synthetic-to-real transfer ceilings and overfitting in monitor prompt optimization (pAUROC saturates; heavier optimization can degrade).
- Scheming propensity is highly configuration-sensitive (e.g., snippet-induced up to 91%; removing one tool drops 59%→3%).
- Guardrails may be hard to validate end-to-end on real agent diffs/logs (ContextCov lacks controlled agent experiments).

3) Technical synthesis

Multiple papers converge on “runtime is the new attack surface”: TOCTOU windows, ranking manipulation, token-cost amplification, and interpreter semantics all exploit gaps between what the agent believed and what executes.
Causal, controlled environments are becoming standard for agent robustness claims: synthetic mini-webs with rank injection; synthesized dynamic web manipulations with oracles; procedural partially observable tasks; deterministic multi-turn pass/fail suites.
Verification is moving earlier in the loop: pre-execution validation (DOM/layout monitors), deterministic constraint verifiers for tool-use data (CoVe), step-level puzzle checks, and tool-based verification for test-time RL pseudo-labels (T^3RL).
Safety training defenses are becoming “signal-shaping” methods: Antibody attenuates harmful gradients (flatness + reweighting), TOSS masks unsafe tokens, ProtegoFed clusters frequency-domain gradient features to filter poisoned samples.
Static components break under drift: embedding-based safety classifiers can collapse at small drift (σ≈0.028 → ROC-AUC 49.75%) while remaining confident; this argues for retraining, drift detection, or representation-invariant safety layers.
Multi-turn interaction is a stressor across domains: unlearning can be undone by self-correction/dialogue context; instruction-following collapses in multi-turn; empathy evaluation needs trajectory-level latent-state scoring.
Oversight needs low-FPR metrics and grounded labels: pAUROC for monitors; execution-based labels in ControlArena; deterministic checks in tool-use and puzzles.
Reproducibility tooling is itself a safety capability: Jailbreak Foundry suggests automated “paper→executable attack” pipelines can keep evaluations current, but also lower attacker costs (dual-use).

4) Top 5 papers (with “why now”)

1) The Synthetic Web: Adversarially-Curated Mini-Internets for Diagnosing Epistemic Weaknesses of Language Agents

Causally shows one rank-0 honeypot can collapse accuracy: GPT-5 65.1%→18.2%, GPT-4o 27.2%→3.8%.
Tool traces show little escalation (e.g., GPT-5 tool calls 6.45→6.61), explaining why “just browse more” doesn’t happen naturally.
Adds calibration evidence: GPT-5 ECE 0.298→0.641 under adversarial ranking.
Skepticism: uses a uniform zero-shot prompt (not best-engineered agent stacks) and synthetic text-only environments.

2) Atomicity for Agents: Exposing, Exploiting, and Mitigating TOCTOU Vulnerabilities in Browser-Use Agents

Identifies a distinct agent failure mode: planning latency creates a TOCTOU window where UI/data changes redirect actions.
Benchmark evidence: all 10 open-source agents fail on at least one manipulation type.
Practical mitigation: pre-execution validation drives trigger ratio 100%→0% on tested cases with <0.05s overhead; residual-window attacks succeed 0.2–0.3% in stress tests.
Skepticism: strict atomicity isn’t achieved; residual window remains.

3) Curation Leaks: Membership Inference Attacks against Data Curation for Machine Learning

Breaks a common assumption: private target data used only for curation guidance can still leak membership via scores/subset/model.
Shows method dependence: image nearest-neighbor curation is highly vulnerable; TRAK is near-random at score/subset stages but can leak via fingerprinting for small T .
DP mitigations quantified: on CIFAR-10 T =1000, ε=100 drops TPR@1%FPR 0.9842→0.0155 (image-based).
Skepticism: large-scale end-to-end training is limited; fingerprint validation is constrained.

4) Tracking Capabilities for Safer Agents

Offers a model-agnostic safety mechanism: agents generate capability-typed Scala; compiler enforces what effects are possible.
Empirical: in “classified mode,” 100% security (no leakage) across 131 trials, while utility remains high (e.g., 99.2% for Claude Sonnet 4.6 on 120 user-task+injection trials).
Shows competitive expressiveness on τ2-bench and small drop on SWE-bench Lite (41.7% vs 43.3% for one model).
Skepticism: limitations/future work not fully detailed in the provided analysis; broader deployment constraints aren’t quantified here.

5) Jailbreak Foundry: From Papers to Runnable Attacks for Reproducible Benchmarking

Demonstrates automated paper→attack-module synthesis at scale: 30 attacks, mean reproduced-minus-reported ASR deviation +0.26pp.
Cuts engineering burden: LOC compression 22,714→9,549 (ratio 0.42) for 22 attacks with reference code.
Enables longitudinal, apples-to-apples evaluation (30 attacks × 10 victim models under unified harness/judge).
Skepticism: explicit dual-use risk—lowering barriers to operationalize jailbreaks.

5) Practical next steps

Add pre-execution validation to browser agents (DOM/layout monitors) and measure residual-window exploitability (target the ~0.13s gap).
For web/RAG agents, implement rank-robust retrieval policies (e.g., forced corroboration / multi-source checks) and evaluate on controlled rank-0 honeypot settings.
Treat tool ecosystems as supply chains: enforce per-skill token budgets, SKILL.md size limits, and audit persistent tool-output ingestion to mitigate token-drain patterns.
If you rely on embedding-based safety classifiers, deploy drift detection + retraining triggers; do not trust confidence as a failure signal under drift.
For FTaaS/FIT, test defenses under interspersed user-level poisoning and consider frequency-domain collaborative filtering (centroid sharing) plus compute-cost accounting.
For unlearning, add multi-turn robustness tests (self-correction + dialogue-conditioned querying) to your evaluation suite; static MCQ-only forgetting scores can be misleading.
For agent oversight, evaluate monitors at low FPR (pAUROC-style) and stress synthetic-to-grounded transfer; avoid over-optimizing prompts past the observed ceiling.
For internal coding agents, compile instruction files into executable checks (process/source/architecture) and track violation rates as a leading indicator of “context drift.”

Generated from per-paper analyses; no external browsing.

Di Tang

AI Paper Insight Brief

2026-03-03

0) Executive takeaways (read this first)

2) Key themes (clusters)

Theme: Execution-time agent vulnerabilities (beyond prompt injection)

Theme: Safety evaluation & benchmarking becomes “process-first”

Theme: Fine-tuning & federated training under poisoning/backdoors

Theme: Privacy & safety leakage through “upstream” pipelines (curation, unlearning)

Theme: Oversight & guardrails for agents (monitors, constraints, capabilities)

3) Technical synthesis

4) Top 5 papers (with “why now”)

5) Practical next steps