June 21, 2026 Research Brief
Evaluation turns lifecycle-aware.
Today’s papers push AI assessment into realistic workflows while exposing brittle safety, grounding, and training assumptions that cleaner benchmarks often miss.
Takeaways
- The strongest pattern today is a shift from static evaluation to **deployment-realistic, lifecycle-aware testing**: papers benchmark agents in legal workflows, physician assistance, scientific instrument control, travel booking, multimodal memory, and reproducibility audits rather than isolated QA.
- Several papers argue that **surface success is misleading**: chest-radiography VLMs can answer correctly without using images, text-only truthfulness fixes often vanish under stricter controls, and psychometric bias probes do not cleanly predict realistic downstream behavior.
- For agent training, the most actionable advances are **stability and data-efficiency mechanisms**: CGTR stabilizes self on-policy distillation by gating teacher refreshes; Q-Evolve improves sparse-reward agents with in-distribution critic learning; RODS synthesizes boundary-targeted multi-turn data online.
Start with: ReproRepo: Scaling Reproducibility Audits with GitHub Repository Issues
Why it catches my eye: It offers a reusable, real-artifact benchmark for agent evaluation that measures whether systems can handle messy reproducibility failures at scale.
Read skeptically for: GitHub issues are noisy proxies, and static audits still miss execution-only failures.
Papers Worth Your Reading Time
Ranked for research usefulness: novelty, method pattern, evidence quality, and skepticism value.
ReproRepo: Scaling Reproducibility Audits with GitHub Repository Issues
#1 A strong first read for anyone evaluating agents, because it replaces boutique tasks with scalable, real repository issues.
- Why now
- Agent evaluation is shifting toward realistic, refreshable workflows rather than static benchmark slices.
- Skepticism
- Issue reports are incomplete and noisy, so benchmark success may overestimate true debugging ability.
A Controlled Study of Decoding-Time Truthfulness Methods on Instruction-Tuned LLMs
#2 Useful as a companion paper because it tests whether popular lightweight reliability fixes survive stricter controls.
- Why now
- Many teams still hope inference-time truthfulness patches can substitute for deeper system changes.
- Skepticism
- Results are limited to two model families and three benchmarks.
Vision-language models for chest radiography do not always need the image
#3 It is a sharp causal audit showing multimodal success can come from shortcuts rather than intended grounding.
- Why now
- Medical and multimodal deployments increasingly assume image use without testing whether the model actually relies on images.
- Skepticism
- The finding is domain-specific, and transfer to other VLM settings is not guaranteed.
Run stats
- Candidates: 3477
- Selected: 30
- Deepread completed: 30
- Window (UTC): 2026-06-19T00:00:00Z → 2026-06-20T00:00:00Z (weekend_backlog_unknown, expanded=0)
Selected papers
| arXiv ID | Title / Links | Categories | Score | Why | Tags |
|---|---|---|---|---|---|
| 2606.16562 | MIRAGE: Auditing Anti-Muslim Bias in Frontier LLMs Across Reasoning, Agentic, and Time-Coupled Conditions | cs.LG | 95 | Bias benchmark for frontier LLMs in reasoning and agentic settings; strong safety relevance. | llm-safety, bias, agent-evaluation, reasoning, benchmark |
| 2606.16808 | Adaptive and Explicit safe: Triggering Latent Safety Awareness in Large Reasoning Models | cs.AI | 94 | Targets jailbreak robustness in reasoning models via adaptive safety triggering and preference tuning. | llm-safety, jailbreaks, reasoning-models, alignment, dpo, sft |
| 2606.16751 | Automated jailbreak attack targeting multiple defense strategies | cs.CR, cs.AI | 93 | Automated black-box jailbreak framework across defenses; highly relevant for LLM safety eval. | llm-safety, jailbreaks, red-teaming, adversarial-prompts, evaluation |
| 2606.19047 | RODS: Reward-Driven Online Data Synthesis for Multi-Turn Tool-Use Agents | cs.AI | 93 | Online data synthesis for multi-turn tool-use RL; strong agent-training relevance and concrete mechanism. | llm-agents, tool-use, reinforcement-learning, data-synthesis, post-training |
| 2606.16127 | AuAu: A Benchmark for Auditing Authoritarian Alignment in Large Language Models | cs.CL, cs.AI, cs.LG | 92 | Audits authoritarian tendencies in LLMs with psychometrics, vignettes, and realistic prompts. | alignment, benchmark, auditing, political-bias, llm-evaluation |
| 2606.19149 | OpenAnt: LLM-Powered Vulnerability Discovery Through Code Decomposition, Adversarial Verification, and Dynamic Testing | cs.CR, cs.LG | 91 | LLM-based vulnerability discovery with decomposition, verification, and dynamic testing; strong security relevance. | security, llm-agents, vulnerability-discovery, code-analysis, verification |
| 2606.16898 | Semantic Flip: Synthetic OOD Generation for Robust Refusal in Embodied Question Answering and Spatial Localization | cs.CV, cs.AI | 91 | Targets robust refusal for embodied VLMs on unanswerable queries via synthetic OOD generation. | embodied-agents, refusal, ood, vlm-safety, reliability |
| 2606.16988 | Agent trajectories as programs: fingerprinting and programming coding-agent behavior | cs.SE, cs.LG | 90 | Procedural fingerprinting for coding agents; useful for auditing, monitoring, and agent behavior analysis. | agents, auditing, behavioral-analysis, coding-agents, evaluation |
| 2606.18613 | Are LLMs Ready to Assist Physicians? PhysAssistBench for Interactive Doctor-Patient-EHR Assistance | cs.CL, cs.AI | 90 | Interactive benchmark for doctor-patient-EHR agents; grounded tool-use evaluation with realistic scenarios. | benchmark, llm-agents, tool-use, evaluation, medical-ai |
| 2606.17710 | Vision-language models for chest radiography do not always need the image | cs.CV, cs.AI, cs.CL, cs.LG | 90 | Causal audit shows medical VLMs may ignore images; strong reliability and evaluation contribution. | vlm-evaluation, causal-audit, multimodal, reliability, medical-ai |
| 2606.18728 | LegalWorld: A Life-Cycle Interactive Environment for Legal Agents | cs.CL | 90 | Lifecycle legal-agent environment with causal state, memory, and benchmark for long-horizon evaluation. | agents, benchmark, evaluation, legal, long-horizon |
| 2606.12160 | A Controlled Study of Decoding-Time Truthfulness Methods on Instruction-Tuned LLMs | cs.CL | 89 | Hallucination detection from internal logits; strong truthfulness/reliability relevance. | LLM, truthfulness, hallucination, decoding, reliability, evaluation |
| 2606.11182 | EEVEE: Towards Test-time Prompt Learning in the Real World for Self-Improving Agents | cs.LG, cs.AI | 89 | Test-time prompt learning for real-world agent streams; strong agent relevance and practical adaptation. | agents, test-time learning, prompting, adaptation, evaluation |
| 2606.16802 | LabOSBench: Benchmarking Computer Use Agents for Scientific Instrument Control | cs.AI | 89 | Safe, realistic benchmark for computer-use agents in scientific instrument control; high agent eval value. | agents, benchmark, computer-use, multimodal, evaluation, safety |
| 2606.16801 | The Art of Mixology: Mixup-based Obfuscation for Privacy-Preserving Split Learning in Large Language Models | cs.CL | 89 | LLM split-learning privacy method with concrete obfuscation design and attack/utility tradeoff focus. | LLM, privacy, split-learning, security, training |
| 2606.13100 | LEDGER: A Long-Context Benchmark of Corporate Annual Reports for Grounded Financial Retrieval and Extraction | cs.CL | 89 | Long-context grounded retrieval/extraction benchmark with full reports, tables, figures, and KPI labels. | benchmark, long-context, retrieval, grounding, finance, evaluation |
| 2606.11176 | Data Journalist Agent: Transforming Data into Verifiable Multimodal Stories | cs.CV, cs.CL, cs.CY, cs.HC | 89 | Multi-agent data journalism with evidence grounding and verifiable claims; strong agent reliability angle. | agents, verification, grounding, multimodal, evaluation |
| 2606.18557 | DeFAb: A Verifiable Benchmark for Defeasible Abduction in Foundation Models | cs.AI, cs.LG, cs.LO | 89 | Verifiable reasoning benchmark exposing major FM gaps on defeasible abduction and rendering robustness. | reasoning, benchmark, evaluation, robustness, logic |
| 2606.17449 | MODE-RAG: Manifold Outlier Diagnosis and Energy-based Retrieval-Augmented Generation Evaluation | cs.CL, cs.AI, cs.CV, cs.LG, cs.MM | 88 | Targets multimodal RAG hallucinations with dynamic multi-agent intervention and evaluation. | multimodal-rag, hallucination, agents, evaluation, reliability |
| 2606.18190 | Multi-Source Cybersecurity Logs: An ATT&CK-Labeled Dataset and SLM Evaluation | cs.CR, cs.LG | 88 | ATT&CK-labeled multi-source cyber log dataset fills a key gap; strong security evaluation utility. | cybersecurity, dataset, evaluation, ATT&CK, logs |
| 2606.03532 | When Should the Teacher Move? Temporal Coupling and Stability in Self On-Policy Distillation | cs.LG, cs.AI | 88 | Studies stability in self on-policy distillation for Qwen3-8B; useful for reliable LLM post-training. | llm-training, distillation, stability, post-training, reasoning |
| 2606.05711 | Beyond tokens: a unified framework for latent communication in LLM-based multi-agent systems | cs.CL | 88 | Unified view of latent communication for LLM multi-agent systems; relevant to agent design and oversight. | llm-agents, multi-agent, latent-communication, survey/framework |
| 2606.18237 | ReproRepo: Scaling Reproducibility Audits with GitHub Repository Issues | cs.CL, cs.AI, cs.LG | 88 | Scalable reproducibility audit framework for LLM agents using real GitHub issues and paper-repo pairs. | agents, evaluation, reproducibility, benchmark, tool-use |
| 2606.16952 | Phantoms and Disclosures: a Causal Framework for Auditing Synthetic Data | cs.LG, cs.AI, stat.AP, stat.ME, stat.ML | 87 | Audits synthetic-data privacy leakage with causal framing and statistical tests. | privacy, synthetic-data, auditing, memorization, causal, evaluation |
| 2602.12430 | Agent Skills for Large Language Models: Architecture, Acquisition, Security, and the Path Forward | cs.MA, cs.AI | 87 | Timely survey on LLM agent skills, MCP integration, and security risks; high reuse for agent safety. | agents, survey, MCP, security, skills |
| 2606.18142 | Your AI Travel Agent Would Book You a Bullfight: An Agentic Benchmark for Implicit Animal Welfare in Frontier AI Models | cs.AI, cs.CL, cs.CY | 87 | Agentic benchmark for implicit welfare preferences in tool-using frontier models; novel deployment eval. | agent-safety, benchmark, tool-use, evaluation, ai-ethics |
| 2606.07367 | Self-evolving LLM agents with in-distribution Optimization | cs.LG | 87 | Self-evolving LLM agent RL with process rewards and in-distribution optimization for long-horizon tasks. | llm-agents, reinforcement-learning, process-reward, long-horizon, training |
| 2606.16316 | RL-Index: Reinforcement Learning for Retrieval Index Reasoning | cs.IR, cs.AI, cs.LG | 87 | Agentic retrieval shifts reasoning to indexing time; promising for RAG quality and latency. | RAG, retrieval, agents, reinforcement-learning, indexing |
| 2606.05008 | M$^3$Eval: Multi-Modal Memory Evaluation through Cognitively-Grounded Video Tasks | cs.CV, cs.AI, cs.CL | 87 | Cognitively grounded benchmark for multimodal memory in long-video models; exposes retention failures. | multimodal, memory, benchmark, video, evaluation, reliability |
| 2606.03954 | VLESA: Vision-Language Embodied Safety Agent for Human Activity Monitoring | cs.CV, cs.LG, cs.RO | 87 | Embodied safety agent with real-time intervention and goal-conditioned safety filtering for risky actions. | embodied-safety, vision-language, agents, intervention, robotics |
AI Paper Insight Brief
2026-06-21
0) Executive takeaways (read this first)
- The strongest pattern today is a shift from static evaluation to deployment-realistic, lifecycle-aware testing: papers benchmark agents in legal workflows, physician assistance, scientific instrument control, travel booking, multimodal memory, and reproducibility audits rather than isolated QA.
- Several papers argue that surface success is misleading: chest-radiography VLMs can answer correctly without using images, text-only truthfulness fixes often vanish under stricter controls, and psychometric bias probes do not cleanly predict realistic downstream behavior.
- For agent training, the most actionable advances are stability and data-efficiency mechanisms: CGTR stabilizes self on-policy distillation by gating teacher refreshes; Q-Evolve improves sparse-reward agents with in-distribution critic learning; RODS synthesizes boundary-targeted multi-turn data online.
- Security work is converging on artifact- and workflow-level attack surfaces, not just prompts: agent skills introduce package-level vulnerabilities, UniAttack shows strong single-turn jailbreak transfer across defenses, synthetic data audits need to separate true disclosures from “phantom” matches, and split learning still leaks without obfuscation.
- A recurring design principle is structured intermediate verification: explicit safety tags, provenance bindings, verifier-backed reasoning tasks, constrained decoding, dynamic retrieval filtering, and exploit reproduction all outperform or outlast purely prompt-based control.
- For practitioners, the near-term implication is to invest less in one-shot prompt patches and more in gated pipelines, provenance, verifier-backed evaluation, and long-horizon failure analysis.
2) Key themes (clusters)
Theme: Realistic agent evaluation is replacing toy benchmarks
- Why it matters: Many papers show that single-turn or isolated-task benchmarks overstate readiness. The new wave of benchmarks measures whether systems remain reliable when they must coordinate tools, memory, roles, and long-horizon state.
- Representative papers:
- LabOSBench: Benchmarking Computer Use Agents for Scientific Instrument Control
- Are LLMs Ready to Assist Physicians? PhysAssistBench for Interactive Doctor-Patient-EHR Assistance
- LegalWorld: A Life-Cycle Interactive Environment for Legal Agents
- ReproRepo: Scaling Reproducibility Audits with GitHub Repository Issues
- Common approach:
- Build environments from real artifacts or workflows: MIMIC-IV admissions, paired legal judgments, GitHub issues, browser-based instrument simulators.
- Separate local competence from long-horizon reliability with turn-, subtask-, and session-level metrics.
- Use executable tools or stateful interfaces rather than free-form text-only evaluation.
- Validate with human ratings, hidden issues, or real-world outcome proxies.
- Open questions / failure modes:
- Static or simulated environments may miss hardware latency, real user behavior, or execution-only failures.
- Session-level success remains much lower than turn-level success, showing severe error accumulation.
- LLM-as-judge remains common, leaving open questions about evaluator bias.
- Domain coverage is still narrow in several benchmarks despite improved realism.
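The gap between turn-level and session-level success can be illustrated with a small sketch. It assumes a hypothetical log format (one list of boolean turn outcomes per session) and defines a session as passed only when every turn succeeds; the benchmarks above may define their session metrics differently, and the numbers are made up for illustration.

```python
from typing import Sequence


def turn_accuracy(sessions: Sequence[Sequence[bool]]) -> float:
    """Fraction of individual turns that succeeded, pooled over all sessions."""
    turns = [ok for session in sessions for ok in session]
    return sum(turns) / len(turns)


def session_pass_rate(sessions: Sequence[Sequence[bool]]) -> float:
    """Fraction of sessions in which every turn succeeded (assumed definition)."""
    return sum(all(session) for session in sessions) / len(sessions)


# Hypothetical logs: each inner list is one session, each bool one turn outcome.
logs = [
    [True, True, True, True],
    [True, True, False, True],
    [True, False, True, True],
    [True, True, True, False],
]

print(f"turn-level accuracy: {turn_accuracy(logs):.2f}")    # 0.81
print(f"session pass rate:   {session_pass_rate(logs):.2f}")  # 0.25
```

A single failed turn per session leaves per-turn accuracy high while most sessions fail, which is the error-accumulation effect these benchmarks are designed to expose.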
Theme: Safety and bias failures intensify in agentic, multimodal, and time-coupled settings
- Why it matters: Multiple papers show that harms become more visible when models act, reason step-by-step, or consume changing context. Safety measured on direct completions often underestimates deployment risk.
- Representative papers:
- MIRAGE: Auditing Anti-Muslim Bias in Frontier LLMs Across Reasoning, Agentic, and Time-Coupled Conditions
- Your AI Travel Agent Would Book You a Bullfight: An Agentic Benchmark for Implicit Animal Welfare in Frontier AI Models
- AuAu: A Benchmark for Auditing Authoritarian Alignment in Large Language Models
- VLESA: Vision-Language Embodied Safety Agent for Human Activity Monitoring
- Common approach:
- Evaluate matched-pair or role-conditioned scenarios rather than generic prompts.
- Compare direct completion against CoT, agentic action, or retrieval-conditioned settings.
- Measure intervention timing, decision asymmetry, or realistic downstream behavior rather than only stated beliefs.
- Test whether simple prompt mitigations transfer across deployment modes.
- Open questions / failure modes:
- Prompt-based mitigations often help direct outputs but transfer poorly to CoT or agentic settings.
- Bias and safety judgments still rely heavily on automated judges or synthetic scenarios.
- Goal inference errors can cascade into wrong safety decisions in embodied settings.
- Some reported draft results still depend on placeholder or limited-confidence estimates.
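A matched-pair audit across deployment modes can be sketched as follows. The record format, the condition names, and the decisions are illustrative assumptions, not the protocol of any paper listed above; the point is only that asymmetry is computed per condition so that a mitigation tested on direct completion can be checked against CoT and agentic runs.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# One record: (condition, decision for variant A, decision for variant B),
# where the two variants differ only in the audited attribute.
Record = Tuple[str, str, str]


def asymmetry_by_condition(records: Iterable[Record]) -> Dict[str, float]:
    """Share of matched pairs per condition whose two decisions differ."""
    flipped: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for condition, decision_a, decision_b in records:
        total[condition] += 1
        flipped[condition] += decision_a != decision_b
    return {condition: flipped[condition] / total[condition] for condition in total}


# Made-up audit results for illustration only.
records = [
    ("direct", "approve", "approve"),
    ("direct", "deny", "deny"),
    ("cot", "approve", "deny"),
    ("cot", "approve", "approve"),
    ("agentic", "approve", "deny"),
    ("agentic", "deny", "approve"),
]

print(asymmetry_by_condition(records))
# {'direct': 0.0, 'cot': 0.5, 'agentic': 1.0}
```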
Theme: Training-time stability and adaptive curricula are becoming first-class concerns
- Why it matters: As agent training moves toward on-policy RL, self-distillation, and sparse-reward environments, instability is no longer a minor implementation detail. Several papers identify concrete failure modes and propose adaptive control loops.
- Representative papers:
- When Should the Teacher Move? Temporal Coupling and Stability in Self On-Policy Distillation
- Self-evolving LLM agents with in-distribution Optimization
- RODS: Reward-Driven Online Data Synthesis for Multi-Turn Tool-Use Agents
- EEVEE: Towards Test-time Prompt Learning in the Real World for Self-Improving Agents
- Common approach:
- Use adaptive gating or routing instead of fixed schedules or shared prompts.
- Focus learning on informative boundary regions: reward variance, in-distribution critic estimates, or task-specialized prompt slots.
- Combine offline seeds with online self-generated data rather than relying on static corpora.
- Validate with ablations that isolate the effect of refresh gates, replay policies, or routing.
- Open questions / failure modes:
- Most evidence is from moderate-scale models and limited task families.
- Cross-iteration distribution shift remains a problem even when each update is locally stable.
- Online synthesis and co-evolution add compute and stochasticity.
- Learned adaptation still often depends on labels, deterministic simulators, or structured feedback.
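The idea of adaptive gating instead of a fixed refresh schedule can be sketched as below. This is not CGTR's actual criterion, which the brief does not specify; the class name, both thresholds, and the use of the 95th-percentile response length as the "length tail" are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from statistics import mean, quantiles
from typing import Sequence


@dataclass
class RefreshGate:
    """Hypothetical gate: refresh the teacher only when the student is healthy."""

    min_reward_gain: float   # assumed: required gain in mean reward since last refresh
    max_p95_length: float    # assumed: cap on the 95th-percentile response length
    reward_at_last_refresh: float = float("-inf")

    def should_refresh(self, rewards: Sequence[float], lengths: Sequence[int]) -> bool:
        reward_now = mean(rewards)
        # Last of the 19 cut points for n=20 is the 95th percentile.
        p95_length = quantiles(lengths, n=20, method="inclusive")[-1]
        ok = (
            reward_now - self.reward_at_last_refresh >= self.min_reward_gain
            and p95_length <= self.max_p95_length
        )
        if ok:
            self.reward_at_last_refresh = reward_now
        return ok


gate = RefreshGate(min_reward_gain=0.05, max_p95_length=512)
print(gate.should_refresh([0.5, 0.5], [100, 120, 130, 150]))    # first check passes
print(gate.should_refresh([0.52, 0.52], [100, 120, 130, 150]))  # gain too small
print(gate.should_refresh([0.6, 0.6], [100, 120, 130, 2000]))   # length tail too long
```

Logging each gate decision alongside the reward and length statistics gives a record of why a refresh did or did not happen, which is useful when diagnosing a later collapse.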
Theme: Verifiers, provenance, and structured audits beat naive trust
- Why it matters: A strong cross-paper trend is replacing “trust the model” with explicit evidence channels: provenance links, exact verifiers, exploit reproduction, or causal controls. This is especially important for truthfulness, reproducibility, and security.
- Representative papers:
- A Controlled Study of Decoding-Time Truthfulness Methods on Instruction-Tuned LLMs
- Data Journalist Agent: Transforming Data into Verifiable Multimodal Stories
- DeFAb: A Verifiable Benchmark for Defeasible Abduction in Foundation Models
- OpenAnt: LLM-Powered Vulnerability Discovery Through Code Decomposition, Adversarial Verification, and Dynamic Testing
- Common approach:
- Add exact or executable checks: code reruns, polynomial-time verifiers, Dockerized exploit tests, paired bootstrap controls.
- Bind outputs to evidence sources such as code lines, URLs, or structured derivations.
- Stress-test claims under multiple judges, seeds, or rendering modalities.
- Distinguish semantic success from exact localization or exact-match correctness.
- Open questions / failure modes:
- Verifier-backed tasks can be narrow or formalized in ways that miss broader real-world ambiguity.
- LLM judges remain a bottleneck in many pipelines despite stronger controls.
- Exact-match performance often lags semantic-match performance, limiting operational usefulness.
- Some methods improve fidelity at the cost of latency or conservatism.
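Paired bootstrap controls, mentioned above, can be sketched in a few lines. This is a generic percentile bootstrap over per-item score differences, assuming both methods were scored on the same items; the function name and defaults are illustrative, and the papers may use a different variant.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap_ci(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile CI for mean(b - a), resampling items with replacement."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired scores must have the same length")
    rng = random.Random(seed)
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Made-up per-item correctness for a baseline (a) and an intervention (b).
baseline = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
patched = [1, 0, 1, 1, 1, 1, 0, 1, 0, 1]
low, high = paired_bootstrap_ci(baseline, patched)
print(f"95% CI for the mean difference: [{low:.2f}, {high:.2f}]")
```

If the interval includes zero, the apparent gain is not distinguishable from resampling noise on this item set.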
Theme: Security is moving from prompt attacks to system surfaces
- Why it matters: The attack surface is broadening from jailbreak prompts to skills, synthetic data releases, split-learning activations, and repository-scale code analysis. Security evaluation is becoming more systems-oriented.
- Representative papers:
- Agent Skills for Large Language Models: Architecture, Acquisition, Security, and the Path Forward
- Automated jailbreak attack targeting multiple defense strategies
- Phantoms and Disclosures: a Causal Framework for Auditing Synthetic Data
- The Art of Mixology: Mixup-based Obfuscation for Privacy-Preserving Split Learning in Large Language Models
- Common approach:
- Treat artifacts and intermediate states as attack vectors: skills, activations, synthetic outputs, fused prompts.
- Use black-box or model-agnostic audits that do not require internal access.
- Quantify practical exploitability with confirmation pipelines, not just theoretical risk.
- Propose governance layers such as trust tiers, verification gates, or held-out control comparisons.
- Open questions / failure modes:
- Many defenses are heuristic rather than formally private or robust.
- Single-turn jailbreak testing may miss multi-turn vulnerabilities.
- Security findings can depend on auxiliary models, judges, or chosen feature extractors.
- Governance proposals are often not yet validated in production.
3) Technical synthesis
- Several papers replace static thresholds with state-aware gating: CGTR refreshes teachers only after reward and length-tail conditions; MODE-RAG routes only high-VFE cases to heavy intervention; Safe Trigger activates `<safe>` mainly on risky inputs.
- Distribution control is a common motif: Q-Evolve constrains policy improvement to the critic’s support, EEVEE isolates prompt specialization by routing, and RODS keeps training near the capability boundary instead of over-sampling solved tasks.
- A notable evaluation pattern is paired or counterfactual testing: MIRAGE uses matched Muslim/non-Muslim prompts, TAC uses controlled scenario variants, chest-radiography auditing swaps same-label images, and synthetic-data auditing compares train vs holdout disclosures.
- Many systems now use small structured modules on top of frozen or large backbones rather than full retraining: Semantic Flip’s MLP abstention head, VLESA’s Q-filter, MIXGUARD’s calibration model, and provenance/verification layers in Data2Story.
- Exact or executable verification is increasingly used as a training or evaluation primitive: DeFAb’s polynomial-time verifier, OpenAnt’s exploit containers, ReproRepo’s hidden issue recovery, and Data2Story’s code-based claim checks.
- Across multimodal work, the main failure is not raw perception but mis-grounded integration: M3Eval finds interference and temporal confusion; MODE-RAG targets retrieval-visual mismatch; chest-radiography VLMs often rely on priors instead of images.
- Several papers show that prompt-only fixes are brittle: truthfulness gains disappear under controls, welfare prompts help unevenly, and bias mitigations transfer poorly from direct completion to CoT/agentic settings.
- Security and retrieval papers increasingly quantify cost-adjusted performance: UniAttack reports low query/token cost, OpenAnt reports pipeline cost savings from reachability filtering, and RL-Index shifts reasoning cost offline for large latency wins.
- Benchmarks are moving toward lifecycle metrics: Pass@Session, end-to-end workflow success, paper-any issue recovery, and long-horizon collapse detection reveal failures hidden by per-turn or per-step averages.
- A recurring empirical lesson is that semantic-match success exceeds exact-match success: seen in reproducibility audits, ATT&CK technique identification, and several retrieval/extraction settings, implying localization and formatting remain weak links.
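The notion of keeping training near the capability boundary can be made concrete with a simple weighting rule. The sketch below is an assumption for illustration, not the RODS algorithm: it scores each task by the Bernoulli variance of its rollout success rate, so tasks that always succeed or always fail get zero weight.

```python
from typing import Dict, Sequence


def boundary_weights(outcomes: Dict[str, Sequence[int]]) -> Dict[str, float]:
    """Weight each task by p * (1 - p), where p is its empirical success rate."""
    weights = {}
    for task, results in outcomes.items():
        p = sum(results) / len(results)
        weights[task] = p * (1 - p)
    return weights


# Hypothetical rollout outcomes (1 = success, 0 = failure) for four tasks.
rollouts = {
    "solved": [1, 1, 1, 1],
    "mostly_solved": [1, 1, 1, 0],
    "boundary": [1, 0, 1, 0],
    "out_of_reach": [0, 0, 0, 0],
}

print(boundary_weights(rollouts))
# {'solved': 0.0, 'mostly_solved': 0.1875, 'boundary': 0.25, 'out_of_reach': 0.0}
```

These weights could feed a sampler such as `random.choices(tasks, weights=...)`, so that new data generation or replay concentrates on tasks the current policy solves only some of the time.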
4) Top 5 papers (with “why now”)
- When Should the Teacher Move? Temporal Coupling and Stability in Self On-Policy Distillation
- Identifies teacher-update scheduling as a core stability variable in self-distillation, not a minor training detail.
- Shows fixed hard refresh can cause catastrophic “state-oblivious collapse,” while CGTR avoids collapse and achieves the best final scores across four tasks.
- Useful now because more post-training pipelines rely on self-generated supervision and on-policy updates.
- Skeptical about: evidence is from one model family at moderate scale, so universality is still unproven.
- Self-evolving LLM agents with in-distribution Optimization
- Combines weighted IQL, GAE-derived process rewards, and behavior-proximal PPO to improve sparse-reward agents without backtracking or manual labels.
- Beats strong baselines across AlfWorld, WebShop, and ScienceWorld, with notable sample-efficiency gains.
- Useful now because agent RL is bottlenecked by sparse rewards and brittle process supervision.
- Skeptical about: retrospective rewards depend on structured textual feedback, and cross-iteration drift is not fully solved.
- A Controlled Study of Decoding-Time Truthfulness Methods on Instruction-Tuned LLMs
- Provides a six-control evaluation framework and shows many token-level truthfulness gains shrink or reverse on instruction-tuned models.
- Finds simple decoding baselines and deliberative prompting often outperform more elaborate token-level interventions.
- Useful now because many teams still consider lightweight inference-time truthfulness patches for deployment.
- Skeptical about: scope is limited to two model families and three benchmarks, so small real effects elsewhere may remain.
- ReproRepo: Scaling Reproducibility Audits with GitHub Repository Issues
- Reframes reproducibility evaluation around real GitHub issues, yielding a much larger and more realistic benchmark than hand-curated setups.
- Shows static no-execution agents can recover semantically related blockers for most papers while keeping false positives low.
- Useful now because agent evaluation needs scalable, continuously refreshable real-world tasks rather than boutique benchmarks.
- Skeptical about: GitHub issues are noisy and incomplete, and static audits miss execution-only failures.
- Agent Skills for Large Language Models: Architecture, Acquisition, Security, and the Path Forward
- Synthesizes the emerging “agent skills” abstraction and highlights a concrete new security surface around community-contributed skill packages.
- Pulls together benchmark progress, acquisition methods, and security evidence including a reported 26.1% vulnerability rate in community skills.
- Useful now because skills/MCP-style packaging is becoming a practical standard for agents.
- Skeptical about: the governance framework is a proposal rather than an empirically validated deployment system.
5) Practical next steps
- Add state-aware gates to any self-distillation or self-training loop; log teacher refresh events, reward deltas, and sequence-length tails to detect collapse precursors.
- Evaluate agent systems with session-level or workflow-level metrics, not just per-turn accuracy; track error accumulation explicitly.
- For safety and bias, run matched-pair audits across direct, CoT, agentic, and retrieval-conditioned settings before trusting a mitigation.
- Prefer verifier-backed or provenance-backed outputs where possible: claim-to-code links, executable checks, structured evidence manifests, or exact reward functions.
- If building tool-use agents, test boundary-focused data generation or replay selection rather than scaling static corpora indiscriminately.
- For multimodal systems, add causal grounding checks such as swaps, occlusions, or retrieval perturbations to verify the model is using the intended modality.
- Treat skills, prompts, synthetic outputs, and intermediate activations as security surfaces; add trust tiers, sandboxing, and held-out control audits.
- Benchmark prompt-based fixes against simple baselines and strict controls before shipping; several papers suggest the apparent gains are often evaluation artifacts.
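A causal grounding check of the kind suggested for multimodal systems can be sketched as an answer-invariance test. Everything here is a stand-in: the model is any callable taking an image and a question, images are represented as strings, and the perturbation is a hypothetical swap; a real audit would plug in an actual VLM and image swaps or occlusions.

```python
from typing import Any, Callable, Sequence, Tuple

Model = Callable[[Any, str], str]


def answer_invariance_rate(
    model: Model,
    examples: Sequence[Tuple[Any, str]],
    perturb: Callable[[Any], Any],
) -> float:
    """Fraction of examples whose answer is unchanged after perturbing the image."""
    unchanged = sum(
        model(image, question) == model(perturb(image), question)
        for image, question in examples
    )
    return unchanged / len(examples)


# Toy stand-ins: an "image" is just a string naming its content.
def swap(image: str) -> str:
    return "effusion" if image == "normal" else "normal"


def text_only_model(image: str, question: str) -> str:
    return "no finding"  # ignores the image entirely


def grounded_model(image: str, question: str) -> str:
    return image  # answer depends on the image


examples = [("normal", "Any abnormality?"), ("effusion", "Any abnormality?")]
print(answer_invariance_rate(text_only_model, examples, swap))  # 1.0
print(answer_invariance_rate(grounded_model, examples, swap))   # 0.0
```

An invariance rate near 1.0 under a perturbation that should change the answer suggests the model is answering from priors rather than from the image.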
Generated from per-paper analyses; no external browsing.