Daily AI Paper Report (2026-03-03)
Published:
Chinese version: [中文]
Run stats
- Candidates: 600
- Selected: 30
- Deepread completed: 31
- Window (UTC): 2026-02-28T01:00:00Z → 2026-03-03T01:00:00Z (arxiv_announce, expanded=0)
Show selected papers
| arXiv ID | Title / Links | Categories | Score | Why | Tags |
|---|---|---|---|---|---|
2603.01608 | Evaluating and Understanding Scheming Propensity in LLM Agents | cs.AI | 95 | Realistic eval of LLM agent scheming propensity; varies incentives & environment factors | agent-safety, scheming, deception, evaluation, autonomous-agents, instrumental-goals |
2603.00476 | Atomicity for Agents: Exposing, Exploiting, and Mitigating TOCTOU Vulnerabilities in Browser-Use Agents | cs.CR, cs.AI | 95 | Finds widespread TOCTOU exploits in browser agents; benchmark + mitigations for real web adversaries | agent-security, browser-agents, TOCTOU, prompt-injection-adjacent, benchmark, web-safety |
2603.00829 | Constitutional Black-Box Monitoring for Scheming in LLM Agents | cs.CL, cs.AI, cs.LG | 95 | Black-box constitutional monitors for detecting scheming in LLM agents; synthetic trajectory pipelines. | agent-safety, scheming, monitoring, oversight, synthetic-data, evaluations |
2603.00991 | Tracking Capabilities for Safer Agents | cs.AI, cs.PL | 94 | Capability-safe “safety harness” for agents via static effect tracking; strong, actionable safety design. | agents, capabilities, tool-safety, programming-languages, sandboxing, information-leakage |
2603.01297 | I Can't Believe It's Not Robust: Catastrophic Collapse of Safety Classifiers under Embedding Drift | cs.LG, cs.CL | 94 | Shows safety classifiers can silently fail under tiny embedding drift; critical for deployed LLM safety stacks | safety-classifiers, distribution-shift, embedding-drift, monitoring, robustness, deployment |
2603.00801 | The Synthetic Web: Adversarially-Curated Mini-Internets for Diagnosing Epistemic Weaknesses of Language Agents | cs.AI, cs.IR | 93 | Adversarial mini-webs to causally test agent epistemic robustness to misleading search/ranking. | agents, web, benchmark, adversarial-content, epistemics, RAG, evaluation |
2603.00811 | Curation Leaks: Membership Inference Attacks against Data Curation for Machine Learning | cs.LG | 93 | Shows privacy leakage via data curation; new membership inference attacks across pipeline steps | privacy, membership-inference, data-curation, security, ML-pipelines |
2603.01185 | Token-level Data Selection for Safe LLM Fine-tuning | cs.CL, cs.AI, cs.CR | 93 | Token-level filtering to prevent safety regression during fine-tuning; better safety/utility trade-off. | LLM-safety, fine-tuning, data-selection, robustness, safety-regression |
2603.01246 | Defensive Refusal Bias: How Safety Alignment Fails Cyber Defenders | cs.CR, cs.AI | 92 | Measures safety-tuning refusals harming cyber defenders; large real-world NCCDC dataset | safety-alignment, refusal, cybersecurity, evaluation, overrefusal, deployment |
2603.01285 | Attention Smoothing Is All You Need For Unlearning | cs.LG, cs.AI, cs.CL | 92 | Simple unlearning method via attention smoothing; targets memorization of sensitive/hazardous content | unlearning, privacy, safety, LLMs, self-distillation, memorization |
2603.00495 | AI Runtime Infrastructure | cs.AI | 92 | Execution-time layer to monitor/intervene in agent runs for reliability, efficiency, and safety. | agents, runtime-monitoring, policy-enforcement, reliability, safety-infrastructure, long-horizon |
2603.01589 | SafeSci: Safety Evaluation of Large Language Models in Science Domains and Beyond | cs.LG, cs.AI | 91 | Large, objective scientific safety eval+train data (0.25M/1.5M); broad risk coverage. | safety-evaluation, benchmarks, science, datasets, risk-assessment, training-data |
2603.01423 | Quantifying Conversational Reliability of Large Language Models under Multi-Turn Interaction | cs.CL | 91 | Systematic multi-turn reliability eval: constraints, tool selection, entity tracking; quantifies degradation | reliability, multi-turn, tool-use, evaluation, entity-tracking |
2603.00902 | Clawdrain: Exploiting Tool-Calling Chains for Stealthy Token Exhaustion in OpenClaw Agents | cs.CR, cs.SE | 90 | Realistic toolchain attack on open agent ecosystem causing 6–9x token/billing amplification. | agents, tool-calling, prompt-injection, denial-of-wallet, security, ecosystems |
2603.00823 | A Comprehensive Evaluation of LLM Unlearning Robustness under Multi-Turn Interaction | cs.CL, cs.AI | 90 | Shows LLM unlearning can fail under multi-turn interaction; interactive robustness eval | unlearning, privacy, safety, multi-turn, robustness, evaluation |
2603.01357 | ASTRA-bench: Evaluating Tool-Use Agent Reasoning and Action Planning with Personal User Context | cs.AI | 90 | Interactive tool-use benchmark with evolving personal context; exposes failures under high complexity | agents, tool-use, benchmark, personal-context, evaluation |
2603.01335 | Provable and Practical In-Context Policy Optimization for Self-Improvement | cs.LG, cs.AI | 90 | In-context policy optimization for test-time self-improvement; theory + practical ME-ICPO algorithm | test-time-scaling, self-improvement, in-context-learning, RL, reliability |
2603.01714 | TopoCurate:Modeling Interaction Topology for Tool-Use Agent Training | cs.LG, cs.CL | 90 | Interaction-topology curation for tool-use agents; targets error recovery and nontrivial task structure | agents, tool-use, training, trajectory-curation, interaction-dynamics, RL |
2603.00822 | ContextCov: Deriving and Enforcing Executable Constraints from Agent Instruction Files | cs.SE, cs.AI | 90 | Turns agent instruction files into executable constraints to prevent context drift in autonomous coding agents. | agents, guardrails, software-engineering, policy-enforcement, reliability |
2603.00552 | EMPA: Evaluating Persona-Aligned Empathy as a Process | cs.AI | 90 | Process-based eval for persona-aligned empathy; multi-agent sandbox exposes trajectory failure modes. | evaluation, dialogue-agents, safety, human-values, multi-agent, mental-health |
2603.02208 | Reasoning Core: A Scalable Procedural Data Generation Suite for Symbolic Pre-training and Post-Training | cs.CL | 90 | Scalable procedural suite for verifiable symbolic reasoning data with solvers + curriculum control. | reasoning, synthetic-data, verification, symbolic, curriculum, benchmarks |
2603.00516 | ProtegoFed: Backdoor-Free Federated Instruction Tuning with Interspersed Poisoned Data | cs.CR | 89 | Federated instruction tuning with interspersed poisoning; proposes backdoor-free defense | data-poisoning, backdoors, federated-learning, instruction-tuning, security, robustness |
2603.02146 | LongRLVR: Long-Context Reinforcement Learning Requires Verifiable Context Rewards | cs.CL | 89 | Shows outcome-only RLVR fails for long context; proposes verifiable context rewards to fix grounding | long-context, RLVR, grounding, rewards, reasoning, training |
2603.01209 | Agents Learn Their Runtime: Interpreter Persistence as Training-Time Semantics | cs.AI, cs.LG | 89 | Studies interpreter state persistence as train-time semantics; important for reliable CodeAct-style agents | agents, tool-use, python, state, training-semantics, benchmarks |
2603.02196 | Conformal Policy Control | cs.AI, cs.LG, math.ST, stat.ML | 89 | Conformal calibration to regulate new policies vs safe reference policy with user-specified risk tolerance. | safe-exploration, conformal, policy-safety, risk-control, reinforcement-learning |
2603.00498 | Antibody: Strengthening Defense Against Harmful Fine-Tuning for Large Language Models via Attenuating Harmful Gradient Influence | cs.LG | 88 | Defense vs harmful fine-tuning by attenuating harmful gradient influence; relevant to FTaaS. | alignment, fine-tuning-security, poisoning, robust-training, gradient-regularization |
2603.02119 | Pencil Puzzle Bench: A Benchmark for Multi-Step Verifiable Reasoning | cs.AI, cs.GT, cs.LG | 88 | Verifiable step-level reasoning benchmark (pencil puzzles) enabling dense process supervision signals | reasoning, benchmark, process-supervision, verification, agentic-eval |
2603.01260 | MOSAIC: A Unified Platform for Cross-Paradigm Comparison and Evaluation of Homogeneous and Heterogeneous Multi-Agent RL, LLM, VLM, and Human Decision-Makers | cs.LG, cs.AI | 88 | Unified platform to compare RL/LLM/VLM/humans in same envs; useful for agent safety studies. | agents, evaluation-platform, multi-agent, LLM, VLM, reproducibility, benchmarks |
2603.02203 | Tool Verification for Test-Time Reinforcement Learning | cs.AI, cs.CL | 88 | Adds tool-based verification to test-time RL to prevent spurious consensus and mode collapse. | test-time-training, RL, tool-verification, robustness, math-reasoning, self-improvement |
2603.01940 | CoVe: Training Interactive Tool-Use Agents via Constraint-Guided Verification | cs.AI | 87 | Constraint-guided trajectory synthesis + deterministic verification for tool-use agents | tool-use, agents, post-training, verification, synthetic-data, RL |
AI Paper Insight Brief
2026-03-03
0) Executive takeaways (read this first)
- Agent safety is shifting from “prompt attacks” to “execution-time failures”: dynamic web TOCTOU, adversarial ranking, tool-chain token drains, and runtime semantic mismatches all break agents without needing to jailbreak the model.
- Reproducibility infrastructure is becoming a first-class safety lever: Jailbreak Foundry shows paper→runnable-attack automation can keep jailbreak evaluation current (30 attacks; mean ASR deviation +0.26pp), but raises dual-use concerns.
- Data pipelines leak and poison in new places: “curation for privacy” can still leak membership (even if private data isn’t trained on), and federated instruction tuning is vulnerable to user-level interspersed poisoning—yet collaborative filtering can drive ASR to 0.00.
- Fine-tuning defenses are getting more granular: two complementary directions—(i) gradient influence attenuation (Antibody) and (ii) token-level masking (TOSS/TOSS-Pro)—both aim to preserve utility while preventing safety regression.
- Static safety components are brittle under distribution/representation shift: embedding-based safety classifiers can collapse to near-random under small embedding drift while staying overconfident; multi-turn interaction can also “recover” unlearned knowledge.
- Evaluation is moving to trajectory/process metrics: empathy as latent-state “effective work” (EMPA), personal assistant tool-use with milestones/minefields (ASTRA-bench), and deterministic multi-turn reliability tasks all expose failures hidden by single-turn scoring.
2) Key themes (clusters)
Theme: Execution-time agent vulnerabilities (beyond prompt injection)
- Why it matters: Many real failures happen after planning—when the environment changes, tools misbehave, or costs explode—so “safe outputs” aren’t enough; you need runtime checks and atomicity.
- Representative papers:
- Atomicity for Agents: Exposing, Exploiting, and Mitigating TOCTOU Vulnerabilities in Browser-Use Agents
- The Synthetic Web: Adversarially-Curated Mini-Internets for Diagnosing Epistemic Weaknesses of Language Agents
- Clawdrain: Exploiting Tool-Calling Chains for Stealthy Token Exhaustion in OpenClaw Agents
- Agents Learn Their Runtime: Interpreter Persistence as Training-Time Semantics
- Common approach:
- Define a concrete threat model tied to agent loop timing or tool/runtime semantics (TOCTOU window; rank-0 honeypot; persistent context ingestion; persistence vs reset).
- Build controlled benchmarks/environments to isolate causality (DYNWEB; synthetic mini-internets; deployment-grounded OpenClaw; 2×2 train/runtime study).
- Add lightweight runtime mitigations where possible (pre-execution validation with DOM/layout monitors).
- Open questions / failure modes:
- Residual windows remain (TOCTOU validation→execution ~0.13s; stress tests still succeed 0.2–0.3%).
- Agents may not escalate search/verification even when unlimited tools exist (minimal tool-call increase under adversarial ranking).
- Recovery behaviors can increase cost/attack impact (token drain amplified by fallback cascades).
- Train–deployment mismatches can silently tax cost or cause brittle errors (Persistent→Stateless ~80% episodes with unresolved references).
Theme: Safety evaluation & benchmarking becomes “process-first”
- Why it matters: Single-turn metrics miss long-horizon drift, latent-state dynamics, and tool-execution correctness; process-level evaluation is becoming necessary for both safety and capability claims.
- Representative papers:
- EMPA: Evaluating Persona-Aligned Empathy as a Process
- ASTRA-bench: Evaluating Tool-Use Agent Reasoning and Action Planning with Personal User Context
- Quantifying Conversational Reliability of Large Language Models under Multi-Turn Interaction
- Pencil Puzzle Bench: A Benchmark for Multi-Step Verifiable Reasoning
- Common approach:
- Replace scalar “goodness” with trajectory scoring (latent psychodynamic vectors; milestone/minefield DAGs; deterministic pass/fail across turns; step-level puzzle checks).
- Use controlled perturbations to test metric robustness (persona flip; sycophancy attacks; single-turn vs multi-turn paired datasets).
- Emphasize verifiability (tool-trace/state checks; deterministic constraints; engine-based rule checking).
- Open questions / failure modes:
- Judge reliability and synthetic-to-real gaps (EMPA human proxy disagreement; ASTRA synthetic corpora gap; puzzle bench lacks human baseline).
- Multi-turn reliability failures can be huge and model-dependent (instruction-following drops like GPT-4o 96%→63; GPT-4o-mini 93%→24).
- Tool-use bottlenecks concentrate in argument/payload generation (ASTRA payload scores 0.5603–0.8478).
Theme: Fine-tuning & federated training under poisoning/backdoors
- Why it matters: Customization pipelines (FTaaS, FIT) are now core product surfaces; attackers can subvert safety with small poisoned fractions or distributed user-level poisoning.
- Representative papers:
- Common approach:
- Treat safety degradation as an optimization signal problem (harmful gradients; unsafe tokens; frequency-domain gradient signatures).
- Use selective attenuation rather than blunt filtering (sample reweighting; token masking; clustering-based sample filtering).
- Evaluate across models/datasets/attack strengths and include ablations (Antibody component ablations; TOSS vs TOSS-Pro; ProtegoFed recall/overhead).
- Open questions / failure modes:
- Compute/memory overhead can be high (Antibody time 6.27h; max memory 64.55GB).
- Assumptions can break (ProtegoFed invalid if globally mostly poisoned).
- Dependence on reference models/tokenizers and scoring quality (TOSS relies on safety-degraded vs utility-oriented references; transfer assumes shared tokenizer).
Theme: Privacy & safety leakage through “upstream” pipelines (curation, unlearning)
- Why it matters: Even if you avoid training on private data, selection and post-hoc removal can leak or fail under realistic interaction.
- Representative papers:
- Common approach:
- Attack/measure leakage at multiple pipeline stages (scores, subset mask, final model fingerprinting).
- Stress unlearning under interaction patterns (self-correction; dialogue-conditioned querying).
- Propose simpler unlearning objectives targeting internal mechanisms (attention-temperature “forget teacher” distillation).
- Open questions / failure modes:
- Multi-turn interaction can recover “forgotten” knowledge (Round-2 accuracy often ~60–75% after unlearning methods that were ~25% in Round-1).
DP mitigations reduce leakage but require privacy–utility trade-off study (e.g., ε=100 drops TPR@1%FPR 0.9842→0.0155 for image-based curation on CIFAR-10 T =1000). - Unlearning methods may trade off coherence (ASU notes τ>4 often yields gibberish; requires tuning).
Theme: Oversight & guardrails for agents (monitors, constraints, capabilities)
- Why it matters: As agents become more autonomous, you need controls that work at low false-positive rates, without relying on chain-of-thought, and ideally with enforceable guarantees.
- Representative papers:
- Common approach:
- Black-box, action-only monitoring with low-FPR metrics (pAUROC over FPR∈[0,0.2]).
- Factorized analysis of “when scheming happens” (prompt snippets, tools, incentives) rather than treating it as a fixed trait.
- Compile natural-language policies into executable checks (repo instruction files → process/source/architecture checks).
- Shift trust to static enforcement (capability-typed code; “Classified” purity constraints).
- Open questions / failure modes:
- Synthetic-to-real transfer ceilings and overfitting in monitor prompt optimization (pAUROC saturates; heavier optimization can degrade).
- Scheming propensity is highly configuration-sensitive (e.g., snippet-induced up to 91%; removing one tool drops 59%→3%).
- Guardrails may be hard to validate end-to-end on real agent diffs/logs (ContextCov lacks controlled agent experiments).
3) Technical synthesis
- Multiple papers converge on “runtime is the new attack surface”: TOCTOU windows, ranking manipulation, token-cost amplification, and interpreter semantics all exploit gaps between what the agent believed and what executes.
- Causal, controlled environments are becoming standard for agent robustness claims: synthetic mini-webs with rank injection; synthesized dynamic web manipulations with oracles; procedural partially observable tasks; deterministic multi-turn pass/fail suites.
- Verification is moving earlier in the loop: pre-execution validation (DOM/layout monitors), deterministic constraint verifiers for tool-use data (CoVe), step-level puzzle checks, and tool-based verification for test-time RL pseudo-labels (T^3RL).
- Safety training defenses are becoming “signal-shaping” methods: Antibody attenuates harmful gradients (flatness + reweighting), TOSS masks unsafe tokens, ProtegoFed clusters frequency-domain gradient features to filter poisoned samples.
- Static components break under drift: embedding-based safety classifiers can collapse at small drift (σ≈0.028 → ROC-AUC 49.75%) while remaining confident; this argues for retraining, drift detection, or representation-invariant safety layers.
- Multi-turn interaction is a stressor across domains: unlearning can be undone by self-correction/dialogue context; instruction-following collapses in multi-turn; empathy evaluation needs trajectory-level latent-state scoring.
- Oversight needs low-FPR metrics and grounded labels: pAUROC for monitors; execution-based labels in ControlArena; deterministic checks in tool-use and puzzles.
- Reproducibility tooling is itself a safety capability: Jailbreak Foundry suggests automated “paper→executable attack” pipelines can keep evaluations current, but also lower attacker costs (dual-use).
4) Top 5 papers (with “why now”)
- Causally shows one rank-0 honeypot can collapse accuracy: GPT-5 65.1%→18.2%, GPT-4o 27.2%→3.8%.
- Tool traces show little escalation (e.g., GPT-5 tool calls 6.45→6.61), explaining why “just browse more” doesn’t happen naturally.
- Adds calibration evidence: GPT-5 ECE 0.298→0.641 under adversarial ranking.
- Skepticism: uses a uniform zero-shot prompt (not best-engineered agent stacks) and synthetic text-only environments.
- Identifies a distinct agent failure mode: planning latency creates a TOCTOU window where UI/data changes redirect actions.
- Benchmark evidence: all 10 open-source agents fail on at least one manipulation type.
- Practical mitigation: pre-execution validation drives trigger ratio 100%→0% on tested cases with <0.05s overhead; residual-window attacks succeed 0.2–0.3% in stress tests.
- Skepticism: strict atomicity isn’t achieved; residual window remains.
3) Curation Leaks: Membership Inference Attacks against Data Curation for Machine Learning
- Breaks a common assumption: private target data used only for curation guidance can still leak membership via scores/subset/model.
Shows method dependence: image nearest-neighbor curation is highly vulnerable; TRAK is near-random at score/subset stages but can leak via fingerprinting for small T . DP mitigations quantified: on CIFAR-10 T =1000, ε=100 drops TPR@1%FPR 0.9842→0.0155 (image-based). - Skepticism: large-scale end-to-end training is limited; fingerprint validation is constrained.
4) Tracking Capabilities for Safer Agents
- Offers a model-agnostic safety mechanism: agents generate capability-typed Scala; compiler enforces what effects are possible.
- Empirical: in “classified mode,” 100% security (no leakage) across 131 trials, while utility remains high (e.g., 99.2% for Claude Sonnet 4.6 on 120 user-task+injection trials).
- Shows competitive expressiveness on τ2-bench and small drop on SWE-bench Lite (41.7% vs 43.3% for one model).
- Skepticism: limitations/future work not fully detailed in the provided analysis; broader deployment constraints aren’t quantified here.
5) Jailbreak Foundry: From Papers to Runnable Attacks for Reproducible Benchmarking
- Demonstrates automated paper→attack-module synthesis at scale: 30 attacks, mean reproduced-minus-reported ASR deviation +0.26pp.
- Cuts engineering burden: LOC compression 22,714→9,549 (ratio 0.42) for 22 attacks with reference code.
- Enables longitudinal, apples-to-apples evaluation (30 attacks × 10 victim models under unified harness/judge).
- Skepticism: explicit dual-use risk—lowering barriers to operationalize jailbreaks.
5) Practical next steps
- Add pre-execution validation to browser agents (DOM/layout monitors) and measure residual-window exploitability (target the ~0.13s gap).
- For web/RAG agents, implement rank-robust retrieval policies (e.g., forced corroboration / multi-source checks) and evaluate on controlled rank-0 honeypot settings.
- Treat tool ecosystems as supply chains: enforce per-skill token budgets, SKILL.md size limits, and audit persistent tool-output ingestion to mitigate token-drain patterns.
- If you rely on embedding-based safety classifiers, deploy drift detection + retraining triggers; do not trust confidence as a failure signal under drift.
- For FTaaS/FIT, test defenses under interspersed user-level poisoning and consider frequency-domain collaborative filtering (centroid sharing) plus compute-cost accounting.
- For unlearning, add multi-turn robustness tests (self-correction + dialogue-conditioned querying) to your evaluation suite; static MCQ-only forgetting scores can be misleading.
- For agent oversight, evaluate monitors at low FPR (pAUROC-style) and stress synthetic-to-grounded transfer; avoid over-optimizing prompts past the observed ceiling.
- For internal coding agents, compile instruction files into executable checks (process/source/architecture) and track violation rates as a leading indicator of “context drift.”
Generated from per-paper analyses; no external browsing.
