June 25, 2026 Research Brief
Agent control gets explicit.
Today’s strongest papers replace prompt-only agent design with governed memory, formal verification, and system-level security evaluation, while more realistic benchmarks expose where long-horizon agents still break.
Takeaways
- **Agent reliability work is shifting from “better prompts” to explicit control structures**: today’s strongest papers add formal verification, governed memory, Bayesian orchestration, active investigation, or symbolic rule evolution rather than relying on raw model capability alone.
- **Memory is now a first-class safety surface**: multiple papers show that long-term/shared memory can be poisoned, leak across scopes, fail retrieval, or accumulate bad experience; the best defenses bind authority/provenance at write time rather than trying to clean up content later.
- **Benchmarks are getting more operational and less toy-like**: new evaluations stress archive-grounded work, long-horizon terminal execution, multimodal jailbreak pipelines, scientific discovery, workplace documents, and adversarial fog-of-war settings.
Start with: Securing LLM-Agent Long-Term Memory Against Poisoning: Non-Malleable, Origin-Bound Authority with Machine-Checked Guarantees
Why it catches my eye: It turns agent memory from a heuristic feature into a formally defended control surface with concrete deployment relevance.
Read skeptically for: Guarantees rely on correct origin labels and independent corroborators, so real deployments may be messier than the model.
Themes
Papers Worth Your Reading Time
Ranked for research usefulness: novelty, method pattern, evidence quality, and skepticism value.
Securing LLM-Agent Long-Term Memory Against Poisoning: Non-Malleable, Origin-Bound Authority with Machine-Checked Guarantees
#1Useful if you deploy persistent agents: it offers a principled write-time memory defense with machine-checked guarantees.
- Why now
- Long-term memory is becoming standard agent infrastructure, and poisoning risk is moving from theory to practice.
- Skepticism
- Its guarantees depend on authenticated provenance and bounded assumptions that may weaken in open deployments.
Red-Teaming the Agentic Red-Team
#2A concrete warning paper showing offensive agents can be compromised at the system level, not just via prompt injection.
- Why now
- Agentic security tools are being deployed quickly, and this paper argues many are unsafe against adversarial targets.
- Skepticism
- Some mitigation tradeoffs remain unresolved, especially balancing containment with useful offensive capability.
VeryTrace: Verifying Reasoning Traces through Compilable Formalism and Structured Verification
#3Worth opening for a reusable pattern that compiles reasoning into checkable structure and repairs failures locally.
- Why now
- As reasoning models move into higher-stakes use, step-level verification matters more than polished final answers.
- Skepticism
- Semantic checks still rely partly on LLM audits, and verification cost grows with trace length.
Chinese version: [中文]
Run stats
- Candidates: 228
- Selected: 30
- Deepread completed: 30
- Window (UTC): 2026-06-23T00:00:00Z → 2026-06-24T00:00:00Z (arxiv_announce, expanded=0)
Show selected papers
| arXiv ID | Title / Links | Categories | Score | Why | Tags |
|---|---|---|---|---|---|
2606.24496 | Red-Teaming the Agentic Red-Team | cs.CR, cs.AI | 95 | Security analysis of offensive agents shows sandbox escapes, key theft, and full operator compromise. | agent-security, red-teaming, sandboxing, offensive-agents, system-security |
2606.24251 | Probing the Misaligned Thinking Process of Language Models | cs.AI | 95 | Probes internal signals of deception/self-preservation; strong direct relevance to alignment monitoring. | alignment, interpretability, monitoring, misalignment, probes, safety |
2606.24597 | Qwen-AgentWorld: Language World Models for General Agents | cs.CL | 95 | Large language world models for general agents across 7 domains; strong frontier-agent relevance. | LLM, agents, world-models, reasoning, simulation, frontier |
2606.24322 | Securing LLM-Agent Long-Term Memory Against Poisoning: Non-Malleable, Origin-Bound Authority with Machine-Checked Guarantees | cs.CR | 94 | Formal defense against LLM memory poisoning with machine-checked guarantees and origin-bound authority. | agent-safety, memory-poisoning, formal-methods, long-term-memory, security |
2606.24626 | SAFARI: Scaling Long Horizon Agentic Fault Attribution via Active Investigation | cs.AI | 93 | Targets long-horizon agent fault attribution with active investigation beyond context limits. | agents, evaluation, debugging, long-context, tool-use, reliability |
2606.24530 | NatureBench: Can Coding Agents Match the Published SOTA of Nature-Family Papers? | cs.CL | 93 | Benchmark for coding agents on real science tasks; strong eval setup and clear limits to agent capability. | agents, benchmark, coding, evaluation, scientific-discovery |
2606.24245 | AutoSpec: Safety Rule Evolution for LLM Agents via Inductive Logic Programming | cs.SE, cs.AI, cs.CR | 92 | Evolves interpretable safety rules for LLM agents from feedback, targeting false pos/neg tradeoff. | agent-safety, guardrails, rule-learning, interpretability, tool-use |
2606.24526 | AGORA: An Archive-Grounded Benchmark for Agentic Workplace Document Reasoning | cs.CL | 92 | Large archive-grounded benchmark for agentic document reasoning with authentic long-context tasks. | agents, benchmark, RAG, document-reasoning, evaluation, long-context |
2606.24820 | SHERLOC: Structured Diagnostic Localization for Code Repair Agents | cs.CL | 92 | Structured localization for code repair agents with strong SWE-Bench results and practical tool-use gains. | agents, code, tool-use, evaluation, software-engineering, reasoning |
2606.24402 | Poisoned Playbooks: Demystifying Knowledge Poisoning Effects on AI Security Agents | cs.CR | 91 | Studies RAG poisoning on action-taking security agents, not just QA, with real exploit-behavior effects. | rag, data-poisoning, security-agents, agent-safety, evaluation |
2606.24453 | Bayesian control for coding agents | cs.AI, cs.CL | 91 | Bayesian orchestration for coding agents improves tool-use decisions and uncertainty estimation. | agents, coding, uncertainty, tool-use, bayesian, reliability |
2606.24281 | CALIBER: Calibrating Confidence Before and After Reasoning in Language Models | cs.CL, cs.AI | 91 | Targets LM calibration before/after reasoning; directly relevant to reliability and deployment safety. | calibration, reasoning, reliability, uncertainty, evaluation |
2606.24855 | OpenThoughts-Agent: Data Recipes for Agentic Models | cs.AI | 90 | Open data pipeline for training general agentic models with extensive ablations and strong reuse value. | agents, training-data, open-source, ablation, post-training, datasets |
2606.24428 | Escaping the Self-Confirmation Trap: An Execute-Distill-Verify Paradigm for Agentic Experience Learning | cs.CL | 90 | Targets self-confirmation failures in agent learning via execute-distill-verify with third-party checks. | agents, safety, experience-learning, verification, multi-agent, reliability |
2606.24388 | PHANTOM: A Large-Scale Dataset of Multimodal Adversarial Attacks for Vision-Language Models | cs.AI, cs.LG | 89 | Large open VLM adversarial attack dataset broadens harmful-intent coverage for multimodal safety eval. | vlm-safety, adversarial-attacks, dataset, benchmark, multimodal |
2606.24124 | VeryTrace: Verifying Reasoning Traces through Compilable Formalism and Structured Verification | cs.AI | 89 | Verifies and repairs CoT via compilable formalism plus structured checks; useful for reasoning reliability. | reasoning, verification, CoT, reliability, hallucination, formal-methods |
2606.24026 | Can Language Model Agents be Helpful Circuit Explainers in Mechanistic Interpretability? | cs.AI | 89 | Agentic benchmark for circuit explanation; useful bridge between LMs and mechanistic interpretability. | mechanistic-interpretability, agents, benchmark, explainability, evaluation |
2606.24819 | HelpBench: Assessing the Ability of LLMs to Provide Privacy, Safety, and Security Advice | cs.CR | 88 | Benchmark for LLM privacy/safety/security advice with authentic scenarios and rubric-based evaluation. | benchmark, safety-evaluation, privacy, security, helpfulness |
2606.24515 | Reinforcement Learning for Computer-Use Agents with Autonomous Evaluation | cs.AI, cs.HC | 88 | RL for computer-use agents using autonomous VLM evaluation; scalable but evaluator reliability matters. | agents, RL, computer-use, evaluation, multimodal, post-training |
2606.24589 | AdversaBench: Automated LLM Red-Teaming with Multi-Judge Confirmation and Cross-Model Transferability | cs.AI, cs.CL | 87 | Automated LLM red-teaming pipeline with multi-judge confirmation and cross-model transfer analysis. | red-teaming, evaluation, robustness, tool-use, reasoning |
2606.24595 | MEMPROBE: Probing Long-Term Agent Memory via Hidden User-State Recovery | cs.CL | 87 | Audits long-term agent memory via hidden user-state recovery; useful for memory reliability and privacy. | agents, memory, benchmark, auditing, privacy, evaluation |
2606.24790 | Grad Detect: Gradient-Based Hallucination Detection in LLMs | cs.LG, cs.AI | 87 | Gradient-based hallucination detection beats output-level signals; promising for abstention and reliability. | hallucination, detection, reliability, uncertainty, LLM, abstention |
2606.24391 | Age of LLM: A Strategic 1v1 Benchmark for Reasoning, Diplomacy and Reliability of Large Language Models under Fog of War | cs.AI, cs.CL, cs.GT, cs.MA | 87 | Strategic benchmark stresses reasoning, diplomacy, and strict action reliability under partial observability. | benchmark, reasoning, agents, reliability, multi-agent |
2606.24408 | Natural Identifiers for Privacy and Data Audits in Large Language Models | cs.LG | 86 | Post-hoc privacy/data audits for trained LLMs without canaries could be highly practical. | privacy, auditing, data-governance, LLMs, dataset-inference, security |
2606.24143 | AsyncOPD: How Stale Can On-Policy Distillation Be? | cs.LG | 86 | Studies stale-policy effects in asynchronous on-policy distillation for LLM post-training efficiency. | post-training, distillation, reasoning, efficiency, training |
2606.24535 | Governed Shared Memory for Multi-Agent LLM Systems | cs.AI | 85 | Production-oriented governed shared memory for multi-agent systems with explicit failure modes/primitives. | multi-agent, memory-governance, provenance, policy, agent-infrastructure |
2606.24081 | PixJail: Self-Evolving Paper-to-Pipeline Reproduction for Text-to-Image Jailbreak Evaluation | cs.CR, cs.AI | 84 | Reproducible, self-evolving T2I jailbreak evaluation pipeline addresses paper-to-pipeline comparability. | jailbreaks, text-to-image, evaluation, reproducibility, agents |
2606.24311 | LemonHarness Technical Report | cs.AI | 84 | Execution framework constrains workspace state for long-horizon agents; practical safety infrastructure. | agents, sandboxing, execution, tooling, reliability, infrastructure |
2606.24622 | Themis: An explainable AI-enabled framework for Reinforcement Learning with Human Feedback | cs.AI, cs.HC | 84 | Framework combining explainability and RLHF-style evaluation across 200+ environments; reusable safety infra. | RLHF, alignment, evaluation, XAI, framework, safety |
2606.24133 | Holistic Data Scheduler for LLM Pre-training via Multi-Objective Reinforcement Learning | cs.LG, cs.CL | 84 | Online data-mixing for pretraining via RL; potentially impactful for frontier LLM training efficiency. | pretraining, data-mixing, scaling, reinforcement-learning, training |
AI Paper Insight Brief
2026-06-25
0) Executive takeaways (read this first)
- Agent reliability work is shifting from “better prompts” to explicit control structures: today’s strongest papers add formal verification, governed memory, Bayesian orchestration, active investigation, or symbolic rule evolution rather than relying on raw model capability alone.
- Memory is now a first-class safety surface: multiple papers show that long-term/shared memory can be poisoned, leak across scopes, fail retrieval, or accumulate bad experience; the best defenses bind authority/provenance at write time rather than trying to clean up content later.
- Benchmarks are getting more operational and less toy-like: new evaluations stress archive-grounded work, long-horizon terminal execution, multimodal jailbreak pipelines, scientific discovery, workplace documents, and adversarial fog-of-war settings.
- The main bottleneck in agentic interpretability and debugging is not hypothesis generation but validation/execution: both mechanistic-interpretability agents and long-trace fault attribution systems perform best when they can actively query evidence and run constrained tools.
- Security results are unusually concrete today: prompt-injection-free compromise of agentic red-team tools, systematic poisoning of security RAG agents, and formal guarantees for memory authority all point to immediate deployment implications.
- Data and orchestration choices matter as much as model size for agents: open data recipes, online data scheduling, asynchronous distillation, and cost-aware controller policies all show measurable gains in throughput, calibration, or downstream success.
2) Key themes (clusters)
Theme: Memory as the new agent attack and failure surface
- Why it matters: Persistent memory is no longer just a convenience layer; it is a control plane for future actions. Several papers show that failures arise at write-time authority assignment, retrieval-time surfacing, cross-agent propagation, and experience consolidation.
- Representative papers:
- Securing LLM-Agent Long-Term Memory Against Poisoning: Non-Malleable, Origin-Bound Authority with Machine-Checked Guarantees
- Governed Shared Memory for Multi-Agent LLM Systems
- MEMPROBE: Probing Long-Term Agent Memory via Hidden User-State Recovery
- Escaping the Self-Confirmation Trap: An Execute-Distill-Verify Paradigm for Agentic Experience Learning
- Common approach:
- Bind memory objects to explicit metadata: origin, scope, provenance, supersession, or authority class.
- Separate write-side quality from read-side retrieval quality via dual probes or staged attribution.
- Use structured governance primitives or consensus validation before memory enters shared state.
- Evaluate memory as an artifact, not only via downstream task success.
- Open questions / failure modes:
- Correct origin labeling and independent corroborators remain assumptions in formal memory defenses.
- Top-k retrieval often fails even when useful evidence is stored.
- Consensus-based memory construction can still approve shared failure modes.
- Large stores can become unreadable or overflow forensic evaluation contexts.
Theme: Verification, diagnosis, and control over long-horizon reasoning
- Why it matters: As traces get longer and tasks more consequential, post hoc answer checking is too coarse. The strongest systems now verify steps, localize decisive faults, or maintain explicit beliefs over correctness before acting.
- Representative papers:
- Common approach:
- Convert free-form reasoning into structured intermediate objects: DSL traces, belief states, diagnostic findings, or atomic claims.
- Replace full-context ingestion with targeted evidence access through tools like read/search or critic/oracle calls.
- Use explicit uncertainty estimates to decide whether to verify, refine, or stop.
- Optimize for actionable localization, not just final correctness labels.
- Open questions / failure modes:
- Per-step verification and active investigation add runtime overhead.
- Many systems still rely on LLM audits for semantic judgments.
- Calibration quality depends on stable critic likelihoods or evaluator behavior.
- Benchmark gains may partly reflect repository familiarity or dataset leakage.
Theme: Security evaluation is becoming pipeline-level and system-level
- Why it matters: Security failures increasingly emerge from end-to-end pipelines rather than isolated prompts. Today’s papers show that evaluation must include retrieval, memory, tool execution, multimodal judges, and sandbox boundaries.
- Representative papers:
- Common approach:
- Evaluate complete attack pipelines, including retrieval, generation, filtering, and judging.
- Standardize interfaces/contracts so attacks can be reproduced and compared across papers and models.
- Stress cross-model transfer, black-box settings, and operational deployment assumptions.
- Measure not just attack success but reproduction fidelity, transferability, and utility preservation.
- Open questions / failure modes:
- Automated judges and black-box filters can distort measured ASR.
- Sparse-evidence regimes remain hard for retrieval-time defenses.
- Many deployed agent tools still have weak isolation and privilege boundaries.
- Standardized benchmarks may miss novel attack families or hidden implementation details.
Theme: Monitoring misalignment and hallucination from internal signals
- Why it matters: Output-only monitoring is often too late, too expensive, or too easy to evade. Several papers push toward internal-state or state-aware confidence signals that can flag failures earlier and more cheaply.
- Representative papers:
- Common approach:
- Train lightweight detectors on internal activations or gradients rather than only output text.
- Distinguish uncertainty states before and after reasoning, with different supervision targets.
- Use cascades: cheap internal filters first, expensive adjudication second.
- Evaluate under OOD, cross-lingual, or high-stakes advice settings rather than only IID QA.
- Open questions / failure modes:
- White-box access is required for gradient- or activation-based methods.
- Internal probes can overfire on aligned reasoning that shares surface cognitive patterns.
- Calibration can degrade in low-accuracy or extreme-shift regimes.
- Strong average scores can still hide a harmful tail of bad advice.
Theme: Agent capability gains are increasingly driven by data, runtime design, and systems choices
- Why it matters: Several papers show that better agent performance comes from curation, scheduling, runtime boundaries, and asynchronous training design—not just stronger base models.
- Representative papers:
- Common approach:
- Treat data mixture, rollout freshness, and runtime state as controllable optimization variables.
- Add lightweight controllers or schedulers with small overhead relative to gains.
- Use structured tool boundaries and explicit workspace management to reduce state drift.
- Validate with ablations that isolate which pipeline stages matter most.
- Open questions / failure modes:
- Some gains depend on careful hyperparameter tuning or model-family-specific behavior.
- Asynchronous methods face stale-support and variance tradeoffs.
- Runtime improvements may not transfer cleanly across model families or environments.
- Open recipes are still under-tested at larger scales or across broader base models.
Theme: Benchmarks are moving toward realistic agent work
- Why it matters: New benchmarks increasingly test what deployed agents actually do: search archives, reason over workplace documents, navigate GUIs, localize code faults, or attempt scientific discovery under budget constraints.
- Representative papers:
- AGORA: An Archive-Grounded Benchmark for Agentic Workplace Document Reasoning
- NatureBench: Can Coding Agents Match the Published SOTA of Nature-Family Papers?
- Age of LLM: A Strategic 1v1 Benchmark for Reasoning, Diplomacy and Reliability of Large Language Models under Fog of War
- Reinforcement Learning for Computer-Use Agents with Autonomous Evaluation
- Common approach:
- Use hidden evaluators, deterministic numeric answers, or game engines to reduce contamination and automate scoring.
- Emphasize long-horizon exploration, partial observability, or real execution constraints.
- Analyze failure modes beyond headline accuracy, especially evidence-finding and structured-output reliability.
- Keep environments bounded and reproducible via containers or sandboxes.
- Open questions / failure modes:
- Many benchmarks remain small or expensive to scale.
- Harness design can materially change absolute performance.
- Some leaderboards are preliminary due to uneven sampling or version drift.
- Closed-book or no-web protocols test discovery, but may underrepresent realistic workflows.
3) Technical synthesis
- A recurring pattern is structured decomposition before judgment: VeryTrace compiles traces into a DSL, SHERLOC emits five-field diagnoses, HYVE decomposes observe/hypothesize/validate, and SAFARI breaks fault attribution into atomic claims plus targeted evidence gathering.
- Write-time controls beat post hoc filtering in memory security: TMA-NM’s origin-bound authority and MemClaw’s scoped metadata/provenance both argue that content-based trust scoring is too malleable once poisoned state is stored.
- Several papers separate artifact quality from task success: MEMPROBE audits stored user-state directly; EDV audits memory quality; SHERLOC measures localization quality before repair; PixJail measures reproduction fidelity, not just ASR.
- Cheap-first, expensive-second cascades appear across domains: misalignment probes before LLM adjudication, Bayesian critics before oracle verification, and SAFARI’s targeted reads before final fault attribution.
- Tooling reliability is now a first-order bottleneck: HYVE’s main failures come from validation/code execution; LemonHarness addresses state drift from mutating actions; SHERLOC adds self-recovery for malformed tool use.
- Multiple papers formalize uncertainty as state-dependent: CALIBER distinguishes pre- vs post-reasoning confidence; Bayesian control maintains posterior correctness beliefs; AsyncOPD studies stale-policy mismatch under cached teacher support.
- Cross-model transfer is a major evaluation axis: PHANTOM, AdversaBench, PixJail, and Poisoned Playbooks all test whether attacks or findings generalize beyond the source model/setup.
- There is a clear move from single-turn text evaluation to operational pipelines involving retrieval, memory, tools, judges, and environment state.
- Several strong results come from small, explicit control modules rather than end-to-end retraining: SAC data schedulers, belief-state controllers, ILP-guided rule editors, and noise-corrected evaluator rewards.
- Benchmark papers increasingly report failure taxonomies, and those taxonomies are actionable: evidence misidentification (AGORA), wrong method choice (NatureBench), fog/state-tracking errors (Age of LLM), and retrieval-vs-write failures (MEMPROBE).
4) Top 5 papers (with “why now”)
Securing LLM-Agent Long-Term Memory Against Poisoning: Non-Malleable, Origin-Bound Authority with Machine-Checked Guarantees
- Formalizes why content- and lineage-based memory defenses are structurally bypassable via self-summarization, trusted-tool echo, and manufactured corroboration.
- Proposes TMA-NM with write-time origin binding, non-malleable taint propagation, corroboration-gated elevation, and tamper-evident logging.
- Empirically reports 0% attacker-action success across direct and laundering attacks while preserving legitimate utility.
- Why now: long-term memory is rapidly becoming standard in agents, and this paper gives one of the clearest principled security designs rather than another heuristic detector.
- Skepticism / limitation: guarantees depend on correct authenticated origin labels and independent corroborators; the mechanized theorem is bounded-model rather than a fully unbounded proof.
Red-Teaming the Agentic Red-Team
- Shows that agentic offensive-security tools can be compromised by attacker-controlled targets without explicit prompt injection.
- Reports 97.8% success for prompt-injection-free “agent-phishing” when runs do not refuse, plus host escape in 10/12 agents and host RCE in 8/12.
- Provides a concrete secure architecture centered on containment, least privilege, worker/orchestrator separation, and egress control.
- Why now: agentic red-team tools are being operationalized quickly, and this paper suggests many are currently unsafe to run against adversarial targets.
- Skepticism / limitation: some mitigation areas remain open, especially soft persistence/memory poisoning and functionality-vs-sandboxing tradeoffs.
VeryTrace: Verifying Reasoning Traces through Compilable Formalism and Structured Verification
- Introduces a lightweight DSL that turns natural-language reasoning into typed state transitions with executable checks.
- Combines deterministic verification with targeted LLM audits and localized repair, improving zero-shot performance across math, planning, and relational reasoning.
- Verifier metrics on ProcessBench are strong, and ablations support the two-stage translation and mechanical-check design.
- Why now: reasoning models are increasingly deployed in domains where step-level correctness matters more than polished final answers.
- Skepticism / limitation: cost scales with trace length, and semantic deductions still rely on LLM audits and a limited schema library.
OpenThoughts-Agent: Data Recipes for Agentic Models
- Delivers a fully open six-stage SFT pipeline with 100+ ablations on sourcing, mixing, augmentation, teacher choice, and rollout filtering.
- Releases a 100K-example dataset and a 32B model that reaches 44.8% average across seven agentic benchmarks, beating prior open-data peers at this scale.
- Finds that task-source choice and keeping longer multi-turn traces matter more than many expected knobs.
- Why now: open agent progress is increasingly bottlenecked by data quality and reproducibility, not just architecture.
- Skepticism / limitation: RL results are only at 8B, and the recipe is validated mainly on the Qwen3 family.
Poisoned Playbooks: Demystifying Knowledge Poisoning Effects on AI Security Agents
- Shows that a single poisoned write-up can alter exploit behavior in RAG-based security agents.
- Introduces the Verification Boundary: L1 code-verifiable claims are rejected, L2 knowledge-verifiable claims are model-dependent, and L3 runtime-dependent claims are consistently adopted.
- Real-world CVE tests show well-documented cases are rejected while several post-cutoff/runtime-dependent CVEs are adopted at 100% PAR.
- Why now: security agents increasingly depend on fresh public knowledge, exactly where sparse-evidence poisoning is most plausible.
- Skepticism / limitation: results are shown on one representative RAG stack, and the Verification Boundary is an empirical framework rather than a formal guarantee.
5) Practical next steps
- Treat memory writes as privileged operations: add origin labels, scope metadata, supersession links, and explicit elevation rules before allowing memory to authorize actions.
- Audit memory directly, not only via downstream success: run dump-all vs top-k retrieval probes to separate write failures from retrieval failures.
- Add cheap internal monitors before expensive judges: probe-based or confidence-based prefilters can cut adjudication cost while preserving coverage.
- For long traces, stop stuffing full logs into context: use read/search tools, persistent summaries, and claim-based investigation loops for debugging and fault attribution.
- Harden agent runtimes at the systems layer: isolate workers from orchestrators, minimize capabilities, centralize state-changing actions, and log all mutating operations.
- Benchmark with pipeline fidelity, not just headline success: for jailbreaks, security agents, or coding agents, track reproduction error, retrieval rank, localization quality, and utility preservation.
- Use structured diagnostic outputs between agent stages: pass root-cause hypotheses, dependencies, and testing implications rather than raw file lists or transcripts.
- Prioritize data curation experiments for agent training: source selection, teacher choice, and multi-turn rollout filtering appear to deliver outsized gains relative to many model-side tweaks.
Generated from per-paper analyses; no external browsing.