June 23, 2026 Research Brief
Evaluation becomes infrastructure.
Today’s papers argue that progress claims increasingly hinge on benchmark repair, process-level verification, and deployment-interface audits, while agent gains come more from structured scaffolds than larger models alone.
Takeaways
- Benchmark and evaluation quality is a first-order bottleneck: multiple papers show that noisy labels, structural shortcuts, selective archives, and task-misaligned metrics can dominate apparent model progress more than new reasoning tricks.
- Inference-time control is getting more targeted and mechanistic: today’s strongest interventions are not generic “self-reflection,” but selective latent-space edits, step-wise alignment, calibrated reflection triggers, and prioritized human review.
- Agent reliability is increasingly being improved through structure around the model rather than larger models alone: memory systems, deterministic tools, skill libraries, verification backends, and protocol discipline repeatedly deliver large gains.
Start with: Fixing FOLIO and MALLS: Verified Annotations and an LLM-assisted Framework to Focus Human Relabeling
Why it catches my eye: Open this first because it shows benchmark noise can outweigh model gains and offers a reusable repair pipeline.
Read skeptically for: The audit covers curated subsets and a limited set of model families, so broader benchmark effects remain uncertain.
Papers Worth Your Reading Time
Ranked for research usefulness: novelty, method pattern, evidence quality, and skepticism value.
Fixing FOLIO and MALLS: Verified Annotations and an LLM-assisted Framework to Focus Human Relabeling
#1 Useful if you evaluate reasoning models: it shows label errors can materially change conclusions and provides a practical relabeling workflow.
- Why now: Reasoning progress is hard to trust if benchmark noise is larger than claimed gains.
- Skepticism: Results are strongest on curated subsets and may not fully predict behavior on broader benchmark ecosystems.
AI Scientists Are Only as Good as Their Evidence: A Stratified Ablation of Proprietary Data and Reasoning Skills in Drug-Asset Valuation
#2 A strong companion paper because it cleanly separates reasoning scaffolds from evidence access in a real decision workflow.
- Why now: Many agent papers claim scientific reasoning gains without isolating whether data access is the real driver.
- Skepticism: Small benchmark size and possible gold-set circularity limit how far the conclusions generalize.
Hallucinations as Orthogonal Noise: Inference-Time Manifold Alignment via Dynamic Contextual Orthogonalization
#3 Worth reading for a mechanistic, training-free inference-time method that targets hallucination without generic self-reflection.
- Why now: Inference-time reliability work is shifting toward selective latent interventions rather than broad decoding heuristics.
- Skepticism: The method depends on its representation assumptions and on having a reliable context anchor.
Run stats
- Candidates: 3675
- Selected: 30
- Deepread completed: 30
- Window (UTC): 2026-06-19T00:00:00Z → 2026-06-20T00:00:00Z (weekend_backlog_sun, expanded=0)
Selected papers
| arXiv ID | Title / Links | Categories | Score | Why | Tags |
|---|---|---|---|---|---|
| 2606.16121 | Invisible Manipulation Channels in AI-Assisted Financial Advisory: Implications for Market Integrity and Regulatory Design | cs.CR | 93 | Shows stealthy inference-time manipulation of LLM outputs that evades output-based audits. | llm-security, manipulation, auditing, finance, watermarking |
| 2606.17815 | Beyond Native Success: Auditing Deployment-Interface Exposure of CLIP Backdoors | cs.CR, cs.CL | 90 | Audits CLIP backdoors across deployment interfaces; strong security eval framework reuse value. | backdoors, CLIP, security, evaluation, multimodal |
| 2606.12830 | Perceive, Interact, Reason: Building Tool-Augmented Visual Agents for Spatial Reasoning | cs.CV, cs.AI | 90 | Tool-augmented visual agent for spatial reasoning; strong agentic capability with reusable training setup. | agents, multimodal, tool-use, spatial-reasoning, VLM |
| 2606.02837 | Fixing FOLIO and MALLS: Verified Annotations and an LLM-assisted Framework to Focus Human Relabeling | cs.CL, cs.AI | 90 | Audits major reasoning benchmarks; many label errors found, with corrected releases and relabeling framework. | benchmark, reasoning, data-quality, evaluation, neurosymbolic |
| 2606.17029 | DEEPRUBRIC: Evidence-Tree Rubric Supervision for Efficient Reinforcement Learning of Deep Research Agents | cs.CL | 90 | Rubric supervision for RL deep-research agents; strong agent quality/eval relevance. | agents, RL, evaluation, deep-research, rubrics |
| 2606.10799 | Evaluating Research-Level Math Proofs via Strict Step-Level Verification | cs.AI | 89 | Step-level proof verification targets hallucination and context poisoning in LLM evaluation. | LLM-evaluation, verification, reasoning, hallucination, math |
| 2606.16774 | OpenClaw-Skill: Collective Skill Tree Search for Agentic Large Language Models | cs.AI, cs.CL | 89 | Skill-tree search for agentic LLMs; reusable tool-use skills with broad downstream relevance. | llm-agents, tool-use, skill-learning, tree-search, generalization |
| 2606.17005 | Bayesian Inference and Decision Audits for Public Archives of Frontier AI Evaluations | cs.AI, stat.ME | 89 | Framework for auditing frontier AI eval archives under missingness and benchmark drift. | evaluation, frontier-models, bayesian-inference, auditing, benchmarks |
| 2606.12983 | Structured Testbench Generation for LLM-Driven HDL Design and Verification-Oriented Data Curation | cs.AI | 89 | Structured verification for LLM-driven HDL; strong speed/coverage gains and reusable workflow. | LLM, verification, evaluation, code-generation, hardware |
| 2606.03327 | CAPER: Clause-Aligned Process Supervision for Text-to-SQL | cs.DB, cs.CL | 89 | Clause-level process supervision for Text-to-SQL with concrete gains; reusable PRM idea. | LLM, process-supervision, reward-modeling, text-to-sql, reliability |
| 2606.09556 | AI Scientists Are Only as Good as Their Evidence: A Stratified Ablation of Proprietary Data and Reasoning Skills in Drug-Asset Valuation | cs.AI | 88 | Careful ablation of evidence access vs reasoning in AI scientist agents; high agent reliability relevance. | agents, evaluation, evidence, reasoning, reliability |
| 2606.03603 | World Models Meet Language Models: On the Complementarity of Concrete and Abstract Reasoning | cs.CV, cs.CL | 88 | Combines world models with MLLMs and adds benchmarks for controlled concrete vs abstract reasoning. | multimodal, reasoning, world-models, benchmarks, MLLM |
| 2606.19135 | A Technical Taxonomy of LLM Agent Communication Protocols | cs.MA, cs.AI, cs.NI | 88 | Useful taxonomy of LLM multi-agent protocols; strong reuse value for agent interoperability/safety. | llm-agents, multi-agent, protocols, taxonomy, infrastructure |
| 2606.05872 | Entropy-Based Evaluation of AI Agents: A Lightweight Framework for Measuring Behavioral Patterns | cs.AI, cs.CV | 88 | Lightweight agent-behavior metrics beyond success/cost; useful for auditing tool use and robustness. | agents, evaluation, safety, tool-use, robustness |
| 2606.03159 | NVIDIA OmniDreams: Real-Time Generative World Model for Closed-Loop Autonomous Vehicle Simulation | cs.CV, cs.AI, cs.RO | 88 | Real-time action-conditioned world model for closed-loop AV simulation; strong safety evaluation relevance. | world-models, autonomous-driving, simulation, safety-evaluation, video-generation |
| 2606.12411 | Context-Driven Incremental Compression for Multi-Turn Dialogue Generation | cs.CL, cs.LG | 88 | Long-dialogue context compression with revisable memory; strong efficiency/reliability relevance for agents. | llm, agents, long-context, memory, efficiency, dialogue |
| 2606.03606 | Testing LLM Arithmetic Reasoning Generalization with Automatic Numeric-Remapping Attacks | cs.CR, cs.AI | 87 | Automatic numeric-remapping attacks expose brittle arithmetic generalization in LLM reasoning. | LLM-evaluation, reasoning, robustness, adversarial, benchmark |
| 2606.06787 | AdMem: Advanced Memory for Task-solving Agents | cs.AI | 87 | Unified semantic/episodic/procedural memory for long-horizon agents; strong practical agent relevance. | llm-agents, memory, long-horizon, multi-agent, retrieval |
| 2606.11906 | When Does Language Matter? Multilingual Instructions Reveal Step-wise Language Sensitivity in Vision-Language-Action Models | cs.CL | 87 | Systematic multilingual robustness eval for VLA models; reveals step-wise failure modes and intervention. | robustness, multilingual, robotics, VLA, evaluation |
| 2606.17727 | LongWebBench: Evaluating Structural and Functional Webpage Generation in Long-Horizon Settings | cs.AI | 87 | Long-horizon webpage generation benchmark with structural and functional agent-based eval. | benchmark, evaluation, web-agents, vlm, long-horizon |
| 2606.12854 | Small LLMs for Biomedical Claim Verification: Cost-Effective Fine-Tuning, Structural Dataset Shortcuts, and Cross-Domain Generalization | cs.CL, q-bio.QM | 87 | Small LLM claim verification beats larger models; exposes dataset shortcut and tests cross-domain generalization. | LLM, factuality, evaluation, biomedical, small-models |
| 2606.17871 | StepGuard: Guarding Web Navigation via Single-Step Calibration | cs.AI | 87 | Web agent robustness via step calibration and selective reflection; practical agent reliability. | web-agents, calibration, reflection, RL, reliability |
| 2606.03399 | Selective Token-Level Cryptographic Redaction for Privacy-Preserving Clinical Deployment of Large Language Models | cs.CL, cs.CR | 86 | Token-level cryptographic redaction for clinical LLM use targets practical privacy-preserving deployment. | privacy, LLMs, clinical, security, deployment |
| 2606.05525 | SciVisAgentSkills: Design and Evaluation of Agent Skills for Scientific Data Analysis and Visualization | cs.AI, cs.HC | 86 | Reusable agent skills plus benchmark for scientific workflows; strong agent evaluation value. | agents, benchmark, tool-use, scientific-workflows, evaluation |
| 2606.04381 | From Symbolic to Geometric: Enabling Spatial Reasoning in Large Language Models | cs.LG, cs.AI | 86 | Adds spatial modality to LLMs for geometric reasoning; notable frontier capability advance if claims hold. | llm, multimodal, reasoning, spatial, architecture |
| 2606.17986 | ShellGames: Speculative LLM-Driven SSH Deception | cs.CR | 85 | LLM-driven SSH deception studies persistent-state, hallucination, and subversion limits in agents. | agents, security, LLM, cyber, deception |
| 2606.03022 | Hallucinations as Orthogonal Noise: Inference-Time Manifold Alignment via Dynamic Contextual Orthogonalization | cs.CL, cs.AI | 85 | Inference-time method for LLM hallucination reduction via representation geometry; reliability-focused. | LLMs, hallucination, inference-time, representation, reliability |
| 2606.16175 | PAL-Bench: Evidence-Grounded Profile Reconstruction from Longitudinal Personal Albums | cs.AI | 85 | Evidence-grounded multimodal benchmark with citation/provenance; useful for reliability and privacy-aware eval. | benchmark, multimodal, evidence-grounding, evaluation, provenance |
| 2606.07237 | When Large Language Models Fail in Healthcare: Evaluating Sensitivity to Prompt Variations | cs.CL, cs.AI, cs.LG | 85 | Healthcare LLM prompt sensitivity study highlights reliability risks under natural and adversarial variation. | LLM-safety, robustness, healthcare, evaluation, adversarial |
| 2606.17642 | FinAcumen: Financial Multimodal Reasoning via Self-Evolving Experience Memory Harness | cs.AI | 85 | Agent memory for multimodal financial reasoning targets reliability, tool use, and hallucination reduction. | llm-agents, memory, multimodal, tool-use, reliability |
AI Paper Insight Brief
2026-06-23
0) Executive takeaways (read this first)
- Benchmark and evaluation quality is a first-order bottleneck: multiple papers show that noisy labels, structural shortcuts, selective archives, and task-misaligned metrics can dominate apparent model progress more than new reasoning tricks.
- Inference-time control is getting more targeted and mechanistic: today’s strongest interventions are not generic “self-reflection,” but selective latent-space edits, step-wise alignment, calibrated reflection triggers, and prioritized human review.
- Agent reliability is increasingly being improved through structure around the model rather than larger models alone: memory systems, deterministic tools, skill libraries, verification backends, and protocol discipline repeatedly deliver large gains.
- Evidence access remains a hard ceiling in knowledge-intensive domains: better scaffolds help calibration, but proprietary or grounded evidence sources still determine factual coverage and decision utility in domains like drug valuation and finance.
- Security work is shifting down-stack: several papers show that risks live in deployment interfaces and infrastructure layers (sampling, checkpoint reuse, shell interaction, privacy preprocessing), not just in model outputs.
- Long-horizon settings expose compounding failure modes: multilingual robot control, web navigation, long webpages, dialogue compression, and world-model use all show that small local errors cascade unless corrected at the right step.
2) Key themes (clusters)
Theme: Evaluation is the product
- Why it matters: Several papers argue that current benchmarks and public records systematically misstate capability or safety because the evaluation substrate itself is flawed. The practical implication is that teams should treat benchmark curation, archive design, and verifier quality as core infrastructure, not housekeeping.
- Representative papers:
  - Fixing FOLIO and MALLS: Verified Annotations and an LLM-assisted Framework to Focus Human Relabeling
  - Bayesian Inference and Decision Audits for Public Archives of Frontier AI Evaluations
  - Small LLMs for Biomedical Claim Verification: Cost-Effective Fine-Tuning, Structural Dataset Shortcuts, and Cross-Domain Generalization
  - Testing LLM Arithmetic Reasoning Generalization with Automatic Numeric-Remapping Attacks
- Common approach:
  - Audit hidden assumptions in datasets or archives rather than only comparing model scores.
  - Recompute labels or outcomes under controlled perturbations to isolate true reasoning robustness.
  - Separate terminal metrics from process-level or coverage-aware metrics.
  - Use structured verification pipelines (human+LLM, symbolic checks, rolling-origin backtests) to localize where evaluation fails.
- Open questions / failure modes:
  - How often do benchmark gains disappear after annotation repair or shortcut removal?
  - Can audit pipelines scale beyond curated subsets without introducing new judge bias?
  - How should public leaderboards expose uncertainty, selection effects, and benchmark revisions?
  - Conservative attack-generation or audit filters may understate true failure rates.
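To make the human+LLM review pattern concrete, here is a minimal sketch of prioritized relabeling. It is an illustration under stated assumptions, not the FOLIO/MALLS pipeline: the `Item` fields, the `suspicion` score (a hypothetical LLM-estimated probability that a label is wrong), and the assumption that a human review always fixes a wrong label are all invented for this example.

```python
from dataclasses import dataclass


@dataclass
class Item:
    item_id: str
    suspicion: float      # hypothetical LLM-estimated probability the label is wrong
    label_is_wrong: bool  # ground truth, known only inside this toy simulation


def review_until_target(items, target_accuracy):
    """Review items in descending suspicion order until dataset accuracy hits the target.

    Assumes a human review always repairs a wrong label.
    Returns (fraction_reviewed, final_accuracy).
    """
    total = len(items)
    wrong = sum(item.label_is_wrong for item in items)
    reviewed = 0
    for item in sorted(items, key=lambda it: it.suspicion, reverse=True):
        if (total - wrong) / total >= target_accuracy:
            break
        reviewed += 1
        if item.label_is_wrong:
            wrong -= 1
    return reviewed / total, (total - wrong) / total


# Toy dataset: 10 items, 4 with wrong labels, imperfect suspicion scores.
items = [
    Item("a", 0.95, True), Item("b", 0.90, True), Item("c", 0.80, False),
    Item("d", 0.70, True), Item("e", 0.60, False), Item("f", 0.30, True),
    Item("g", 0.20, False), Item("h", 0.15, False), Item("i", 0.10, False),
    Item("j", 0.05, False),
]
fraction_reviewed, accuracy = review_until_target(items, target_accuracy=0.9)
print(fraction_reviewed, accuracy)
```

In this toy run the target is reached after reviewing 4 of 10 items, even though one wrong label with a low suspicion score is never seen. That residual error is the kind of effect the judge-bias question above is about.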
Theme: Selective intervention beats always-on correction
- Why it matters: A recurring pattern is that reliability improves when interventions are applied only at the right layer, step, or uncertainty regime. This reduces collateral damage compared with global steering, forced simulation, or uniform alignment.
- Representative papers:
  - Hallucinations as Orthogonal Noise: Inference-Time Manifold Alignment via Dynamic Contextual Orthogonalization
  - When Does Language Matter? Multilingual Instructions Reveal Step-wise Language Sensitivity in Vision-Language-Action Models
  - StepGuard: Guarding Web Navigation via Single-Step Calibration
  - World Models Meet Language Models: On the Complementarity of Concrete and Abstract Reasoning
- Common approach:
  - Identify high-risk steps using internal signals such as gradient ratios, confidence, or rollout quality.
  - Apply correction only to critical steps, outlier heads, or uncertain decisions.
  - Keep the base model mostly frozen and intervene at inference time.
  - Evaluate not just final success but acceptance/rejection behavior, call rates, and degradation under corrupted inputs.
- Open questions / failure modes:
  - Internal confidence or geometric proxies may fail when context anchors are weak or misaligned.
  - Step-wise methods may overfit to current architectures or simulators.
  - Selective reflection and alignment still depend on good thresholds and retrieval references.
  - Simulation quality remains a bottleneck when world-model outputs are plausible but task-wrong.
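The "intervene only on uncertain steps" pattern can be sketched in a few lines. This is a generic illustration, not the method of any paper listed above: the `(action, confidence)` step format, the `reflect` callback, and the threshold `tau` are hypothetical names chosen for the example.

```python
from typing import Callable, List, Tuple


def gated_steps(
    steps: List[Tuple[str, float]],
    reflect: Callable[[str], str],
    tau: float,
) -> Tuple[List[str], float]:
    """Apply `reflect` only to steps whose confidence is below `tau`.

    Returns the final actions and the call rate (fraction of steps
    where the intervention fired), so the gate itself can be audited.
    """
    actions = []
    calls = 0
    for action, confidence in steps:
        if confidence < tau:
            action = reflect(action)  # intervention fires on low-confidence steps only
            calls += 1
        actions.append(action)
    call_rate = calls / len(steps) if steps else 0.0
    return actions, call_rate


# Toy trajectory: only the middle step is uncertain.
steps = [("click:login", 0.92), ("type:query", 0.41), ("click:submit", 0.88)]
actions, call_rate = gated_steps(steps, reflect=lambda a: a + "#revised", tau=0.5)
print(actions, call_rate)
```

Returning the call rate alongside the actions reflects the evaluation point above: a gate that fires on nearly every step is an always-on correction in disguise.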
Theme: Agent scaffolding is becoming the main lever
- Why it matters: Many of the largest practical gains come from adding memory, skills, tools, verifiers, or structured RL objectives around a fixed or modest backbone. This suggests frontier agent progress may be bottlenecked more by systems design than by raw model scale in many domains.
- Representative papers:
  - AdMem: Advanced Memory for Task-solving Agents
  - OpenClaw-Skill: Collective Skill Tree Search for Agentic Large Language Models
  - DEEPRUBRIC: Evidence-Tree Rubric Supervision for Efficient Reinforcement Learning of Deep Research Agents
  - SciVisAgentSkills: Design and Evaluation of Agent Skills for Scientific Data Analysis and Visualization
- Common approach:
  - Externalize reusable procedural knowledge into skills, memory entries, or rubric trees.
  - Use reward shaping or group-relative RL to improve long-horizon credit assignment.
  - Separate semantic, episodic, and procedural state rather than relying on context alone.
  - Measure gains in realistic multi-step environments rather than static QA.
- Open questions / failure modes:
  - These systems often add substantial prompt, rollout, or orchestration cost.
  - Skill and memory quality may be highly harness-dependent.
  - Transfer beyond the evaluated model families and environments is still thin.
  - Long-term memory can burn in slowly or inject stale/irrelevant guidance.
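For readers unfamiliar with the "group-relative" idea mentioned above, here is a minimal sketch of the common normalization: each rollout's reward is scored relative to the other rollouts sampled for the same prompt. This is the generic recipe, assumed for illustration; the brief does not confirm which exact variant any of these papers uses. The function name and the `eps` constant are choices made for this example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts for the same prompt.

    advantage_i = (r_i - mean(r)) / (std(r) + eps)
    `eps` avoids division by zero when all rollouts get the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts of the same task, scored by a (hypothetical) rubric in [0, 1].
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Because advantages are centered within the group, a rollout is reinforced only for being better than its siblings, which is one way dense rubric scores can sharpen long-horizon credit assignment.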
Theme: Grounded evidence and deterministic tooling as anti-hallucination infrastructure
- Why it matters: In high-stakes domains, the strongest pattern is not “better prompting” but constraining the model with deterministic tools, evidence provenance, and explicit verification. This is especially visible in finance, math, hardware verification, and research agents.
- Representative papers:
  - AI Scientists Are Only as Good as Their Evidence: A Stratified Ablation of Proprietary Data and Reasoning Skills in Drug-Asset Valuation
  - FinAcumen: Financial Multimodal Reasoning via Self-Evolving Experience Memory Harness
  - Evaluating Research-Level Math Proofs via Strict Step-Level Verification
  - Structured Testbench Generation for LLM-Driven HDL Design and Verification-Oriented Data Curation
- Common approach:
  - Replace free-form generation with evidence-cited scorecards, theorem ledgers, deterministic engines, or golden-reference comparisons.
  - Decompose tasks into auditable local steps rather than monolithic judgments.
  - Use external tools for arithmetic, retrieval, simulation, or compilation.
  - Track completeness or provenance explicitly, not just answer quality.
- Open questions / failure modes:
  - Better reasoning cannot compensate for missing proprietary or long-tail evidence.
  - Deterministic backends may not exist for open-ended or weakly specified tasks.
  - Strict verification can become over-conservative and reject valid outputs.
  - Tooling quality and corpus coverage become new single points of failure.
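As a small illustration of "use external tools for arithmetic", the sketch below evaluates a model-proposed expression with a deterministic evaluator instead of trusting the model's own arithmetic. The function name `evaluate_arithmetic` and the whitelist of operators are assumptions for this example; none of the papers above is claimed to use this code.

```python
import ast
import operator

# Only these operators are allowed; anything else is rejected.
_BINARY = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def evaluate_arithmetic(expression: str) -> float:
    """Deterministically evaluate +, -, *, / over numeric literals.

    Raises ValueError for names, calls, or any other syntax, so the tool
    cannot be used to run arbitrary code.
    """
    def _eval(node):
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY:
            return _UNARY[type(node.op)](_eval(node.operand))
        raise ValueError("unsupported expression")

    return _eval(ast.parse(expression, mode="eval").body)


print(evaluate_arithmetic("(120 - 45) * 2 / 5"))
```

The strict whitelist also shows the over-conservatism trade-off noted above: a valid expression using an unsupported operator such as `**` is rejected rather than evaluated.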
Theme: Security and privacy risks are interface-dependent
- Why it matters: Several papers show that safety assumptions break when models are deployed through real interfaces: sampling layers, checkpoint reuse paths, cloud clinical pipelines, or interactive shells. Auditing only final text outputs misses meaningful attack surfaces.
- Representative papers:
  - Invisible Manipulation Channels in AI-Assisted Financial Advisory: Implications for Market Integrity and Regulatory Design
  - Beyond Native Success: Auditing Deployment-Interface Exposure of CLIP Backdoors
  - Selective Token-Level Cryptographic Redaction for Privacy-Preserving Clinical Deployment of Large Language Models
  - ShellGames: Speculative LLM-Driven SSH Deception
- Common approach:
  - Model the deployment interface explicitly: sampling, retrieval/reranking, shell I/O, or token-level preprocessing.
  - Evaluate attacks and defenses under realistic operational constraints rather than toy output checks.
  - Use structured threat models and interface-specific metrics.
  - Pair system design with deployment guidance or regulatory implications.
- Open questions / failure modes:
  - Output-based audits and watermarking can miss lower-layer manipulation.
  - Privacy-preserving preprocessing still leaks structure and depends on key management.
  - Interface-specific exposure may vary sharply with candidate pools, prompts, or reuse patterns.
  - Stateful deception systems still face realism gaps in filesystem and long-session behavior.
3) Technical synthesis
- Multiple papers replace coarse terminal rewards with semantically aligned intermediate units: clause-level SQL rewards, step-level proof verification, step-wise VLA sensitivity, and single-step web calibration all attack credit assignment directly.
- Retrieval is increasingly selective rather than unconditional: C-DIC (Context-Driven Incremental Compression) retrieves thread-specific latent slots, FinAcumen gates memory by similarity threshold, PF-OPSD selectively calls simulation, and multilingual VLA alignment only edits critical steps.
- Several works use “frozen backbone + external structure” as the dominant recipe: FinAcumen, HERALD, DCO (Dynamic Contextual Orthogonalization), STG (Structured Testbench Generation), and SciVis skills all improve behavior without retraining the core model heavily.
- Verification pipelines often combine symbolic or deterministic components with LLM judgment: Z3 equivalence in NL→FOL, Verilator/Icarus in HDL, theorem ledgers in proof checking, and browser/DOM execution in webpage evaluation.
- Robustness diagnostics are moving from aggregate accuracy to conditional or stratified views: attacked-only arithmetic accuracy, hard-target PIR in PAL-Bench, page/task/step success in LongWebBench, and informed decision quality (informed-DQ) in drug valuation.
- Several papers expose asymmetry as a key signal of shortcut learning: HealthVer→SciFact transfers well while SciFact→HealthVer collapses; some CLIP backdoors transfer only through specific deployment interfaces; multilingual VLA failures concentrate in navigation primitives.
- Human effort is being optimized rather than removed: FOLIO/MALLS uses LLM-assisted prioritization for relabeling, while archive adjudication and PAL-Bench formalize what should remain evaluator-controlled.
- Cost/latency is treated as a first-class metric in practical systems papers: OmniDreams reports real-time FPS, STG reports runtime/energy, HERALD reports preprocessing overhead, ShellGames reports latency reduction, and DEEPRUBRIC reports RL GPU-hours.
- Evidence completeness repeatedly appears as a hidden variable behind “reasoning” performance: proprietary corpus access in drug valuation, deterministic data panels in finance, and public/private evidence contracts in PAL-Bench all show that missing evidence caps utility.
- Many methods rely on thresholded control knobs (τ, K, confidence triggers, critical-step cutoffs, retrieval depth), suggesting a broad need for calibration studies rather than one-off benchmark wins.
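The call for calibration studies on thresholded knobs can be illustrated with a small sweep. The sketch below assumes hypothetical per-item records of `(confidence, base_correct, corrected_correct)` and a gate that intervenes when confidence falls below `tau`; the data are made up to show the shape of the analysis, not taken from any paper.

```python
def sweep_threshold(records, taus):
    """For each tau, report how often the intervention fires and the resulting accuracy.

    records: list of (confidence, base_correct, corrected_correct), where
    corrected_correct is the outcome if the intervention is applied.
    """
    rows = []
    for tau in taus:
        fired = [conf < tau for conf, _, _ in records]
        outcomes = [
            corrected if f else base
            for (_, base, corrected), f in zip(records, fired)
        ]
        rows.append({
            "tau": tau,
            "trigger_rate": sum(fired) / len(records),
            "accuracy": sum(outcomes) / len(records),
        })
    return rows


# Toy data: the intervention rescues two low-confidence errors
# but breaks one high-confidence correct answer.
records = [
    (0.95, True, True),
    (0.80, True, False),
    (0.55, False, True),
    (0.30, False, True),
    (0.20, False, False),
]
for row in sweep_threshold(records, taus=[0.0, 0.6, 0.9]):
    print(row)
```

In this toy data accuracy rises from 0.4 to 0.8 at `tau=0.6` and falls back to 0.6 at `tau=0.9`, because the wider gate starts rewriting a step that was already correct. A single reported operating point would hide that curve.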
4) Top 5 papers (with “why now”)
- Fixing FOLIO and MALLS: Verified Annotations and an LLM-assisted Framework to Focus Human Relabeling
  - Finds major annotation error rates in widely used NL→FOL benchmarks: 38.9% incorrect formalizations in FOLIO validation and 36% in sampled MALLS test.
  - Shows benchmark repair materially changes measured model quality, with re-evaluation gains of +9 to +22 points.
  - Introduces a practical human+LLM review pipeline that reaches 90% dataset accuracy after reviewing only ~24% of FOLIO and ~13% of MALLS in the best setting.
  - Why useful now: if you rely on formal reasoning benchmarks, this is a direct warning that benchmark noise may be larger than your model improvement.
  - Skeptical about: scope is limited to curated subsets and three LLM families.
- Hallucinations as Orthogonal Noise: Inference-Time Manifold Alignment via Dynamic Contextual Orthogonalization
  - Proposes a mechanistic latent-space intervention that suppresses orthogonal attention-head components relative to a context anchor.
  - Reports gains on faithfulness, factuality, and some reasoning settings while avoiding regressions seen with static steering methods.
  - Single-pass and training-free, with complexity linear in selected layers/heads/model width.
  - Why useful now: this is a concrete alternative to generic decoding hacks and fits the current push toward mechanistic inference-time control.
  - Skeptical about: depends on the linear representation framing and on having a meaningful context anchor.
- AI Scientists Are Only as Good as Their Evidence: A Stratified Ablation of Proprietary Data and Reasoning Skills in Drug-Asset Valuation
  - Cleanly separates gains from reasoning scaffolds versus proprietary evidence access in a real scientific decision task.
  - Shows factual recall jumps from 0.38 to 0.96 when adding proprietary data, while informed decision quality rises from 2.57 to 7.43.
  - Demonstrates that better scaffolds improve calibration/objectivity modestly but do not close the evidence gap.
  - Why useful now: timely for anyone building “AI scientist” systems and trying to interpret whether progress comes from reasoning or data access.
  - Skeptical about: gold-set circularity and small benchmark size limit how broadly to generalize.
- Structured Testbench Generation for LLM-Driven HDL Design and Verification-Oriented Data Curation
  - Replaces stochastic LLM testbench generation with deterministic, structure-aware verification tailored to combinational, sequential, and FSM-heavy designs.
  - Reports 720× faster testbench generation, higher coverage, 100% compilation on a large curation task, and major runtime/energy savings.
  - Also improves downstream search loops by reducing mean node counts 14–47% across four backbones.
  - Why useful now: a strong example of how deterministic verifiers can unlock scalable data curation and test-time search for code/design agents.
  - Skeptical about: strongest results are in known-reference settings and benchmark-scale RTL.
- Invisible Manipulation Channels in AI-Assisted Financial Advisory: Implications for Market Integrity and Regulatory Design
  - Identifies a sampling-layer attack that biases financial recommendations while preserving watermark integrity and evading six black-box detectors.
  - Provides a KL-based detectability argument and empirical amplification of directional keywords by ~1.8–1.9×.
  - Shows PRNG/CSPRNG defenses fail in the stated threat model, while QRNG+TEE blocks the attack in experiments.
  - Why useful now: highlights that compliance schemes focused on output text or watermark presence may miss infrastructure-level manipulation.
  - Skeptical about: experiments use 7B models and limited prompt sets, so deployment-scale prevalence remains to be tested.
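To give intuition for why a sampling-layer bias on directional keywords can be hard to detect, the sketch below adds a small logit offset to a chosen token and measures both the probability amplification and the KL divergence from the unbiased distribution. This is a generic toy model, not the paper's attack or its detectability argument; the five-token vocabulary, the logits, and the offset `delta` are invented for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def kl_divergence(p, q):
    """KL(p || q) in nats; assumes all entries are strictly positive."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def bias_effect(logits, target_indices, delta):
    """Add `delta` to the target tokens' logits.

    Returns (amplification of target probability mass, KL(biased || base)).
    """
    base = softmax(logits)
    biased = softmax(
        [x + delta if i in target_indices else x for i, x in enumerate(logits)]
    )
    base_mass = sum(base[i] for i in target_indices)
    biased_mass = sum(biased[i] for i in target_indices)
    return biased_mass / base_mass, kl_divergence(biased, base)


# Hypothetical next-token logits; index 3 stands in for a directional keyword.
logits = [2.0, 1.0, 0.5, 0.0, -1.0]
amplification, kl = bias_effect(logits, target_indices={3}, delta=0.6)
print(f"amplification={amplification:.2f}x, KL={kl:.4f} nats")
```

When the targeted token starts with low probability, its mass can grow noticeably while the per-token divergence stays small, which is one intuition for why audits that only sample output text may need many observations to notice the shift.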
5) Practical next steps
- Audit your core benchmarks for annotation noise, structural shortcuts, and conditional evaluation artifacts before claiming model gains; prioritize datasets where small benchmark changes could flip conclusions.
- Add process-level diagnostics to agent evals: per-step accuracy, intervention trigger rates, retrieval hit quality, evidence completeness, and failure localization should sit beside final success.
- Prefer selective inference-time controls over always-on reflection or global steering; measure whether interventions help specifically on high-risk steps without harming clean cases.
- For high-stakes domains, separate reasoning quality from evidence access in your experiments; report coverage-aware metrics, not just polished final answers.
- Build deterministic tool backends where possible for arithmetic, retrieval, verification, simulation, or browser execution, and force provenance/citation checks at the interface boundary.
- Stress-test deployment interfaces directly: sampling layers, checkpoint reuse paths, shell or browser interaction loops, and privacy preprocessing pipelines need their own threat models and audits.
- If you run long-horizon agents, invest in external memory/skills/rubrics rather than only larger backbones; then benchmark cost, latency, and stale-memory failure modes explicitly.
- For multilingual or multimodal embodied systems, log step-wise sensitivity hotspots and primitive-level failure concentrations; use them to target alignment or fine-tuning budget.
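As a starting point for the process-level diagnostics recommended above, here is a minimal sketch that reports per-step accuracy and failure localization next to final success. It assumes each trajectory is a list of booleans marking whether a step was correct, and that a run succeeds only if every step is correct; both are simplifications chosen for the example.

```python
from collections import defaultdict


def step_diagnostics(trajectories):
    """Summarize step-level behavior across agent runs.

    trajectories: list of runs, each a list of bools (step correct or not).
    """
    per_step = defaultdict(lambda: [0, 0])  # step index -> [correct, total]
    first_failures = defaultdict(int)       # step index -> runs that first failed here
    successes = 0
    for run in trajectories:
        for i, ok in enumerate(run):
            per_step[i][0] += int(ok)
            per_step[i][1] += 1
        if all(run):
            successes += 1
        else:
            first_failures[run.index(False)] += 1
    return {
        "final_success": successes / len(trajectories),
        "per_step_accuracy": {i: c / t for i, (c, t) in sorted(per_step.items())},
        "first_failure_counts": dict(first_failures),
    }


# Toy runs: most failures originate at step 1, not at the final step.
runs = [
    [True, True, True],
    [True, False, True],
    [True, False, False],
    [True, True, False],
]
report = step_diagnostics(runs)
print(report)
```

Final success alone would report 0.25 here; the first-failure counts add that two of the three failed runs broke at step 1, which is where a targeted fix would pay off first.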
Generated from per-paper analyses; no external browsing.