June 22, 2026 Research Brief
Evaluation goes process-first.
Today’s strongest papers replace outcome-only scoring with verifiable process checks, while agent training and inference methods add finer-grained feedback for safer, more reliable systems.
Takeaways
- Process-level evaluation is becoming the dominant pattern across safety-critical domains: chemistry, health agents, fraud detection, clinical VQA, and academic search all show that final-answer accuracy alone hides important failure modes.
- Several papers attack the same core bottleneck from different angles: **credit assignment and dense feedback** for agents/LLMs. SHARP improves multi-agent RL with per-agent Shapley credit; VIMPO derives token-level advantages without a learned critic; SafeSpec adds step-level safety verification inside speculative decoding.
- Robustness results are increasingly about **distributional or structural stress tests**, not average accuracy: NOTA perturbations break clinical uncertainty estimates, URL masking exposes fraud-detection shortcut reliance, matched candlestick interventions reveal trend shortcuts, and negation flips remote-sensing MLLM behavior.
Start with: From Answers to States: Verifiable Process-Level Evaluation of Chemical Reasoning in Large Language Models
Why it catches my eye: It offers a reusable template for auditing reasoning through verifier-checked intermediate states instead of trusting final answers.
Read skeptically for: Its verification scope is narrow, centered on rule-checkable chemistry states rather than broader scientific reasoning.
Papers Worth Your Reading Time
Ranked for research usefulness: novelty, method pattern, evidence quality, and skepticism value.
From Answers to States: Verifiable Process-Level Evaluation of Chemical Reasoning in Large Language Models
#1 Useful beyond chemistry because it shows how to turn hidden reasoning into auditable intermediate states.
- Why now: Scientific and high-stakes copilots need evidence that reasoning is valid, not just plausible.
- Skepticism: The benchmark covers structured, verifier-friendly chemistry tasks more than open scientific reasoning.
SafeSpec: Fast and Safe LLM via Dynamic Reflective Sampling
#2 A strong companion paper because it embeds safety verification directly into a production-relevant decoding stack.
- Why now: Speculative decoding is becoming standard, so safety methods that fit inference pipelines matter immediately.
- Skepticism: Attack-heavy settings may erase speed gains, and robustness depends on the trained safety head.
Who Deserves the Reward? SHARP: Shapley Credit-based Optimization for Multi-Agent System
#3 Worth opening for its concrete answer to a central agent bottleneck: assigning useful credit across collaborating roles.
- Why now: Multi-agent systems are scaling faster than stable training methods for planner-worker coordination.
- Skepticism: Shapley-style counterfactual credit is compute-heavy and may still misattribute contributions.
Run stats
- Candidates: 3705
- Selected: 30
- Deepread completed: 30
- Window (UTC): 2026-06-19T00:00:00Z → 2026-06-20T00:00:00Z (weekend_backlog_sat, expanded=0)
Selected papers
| arXiv ID | Title / Links | Categories | Score | Why | Tags |
|---|---|---|---|---|---|
| 2606.18129 | Towards Understanding and Measuring COGNITIVE ATROPHY in LLM Behaviour | cs.HC, cs.AI | 93 | Clinically grounded benchmark for longitudinal mental-health LLM harms beyond static safety scores. | llm-safety, evaluation, mental-health, benchmark, reliability |
| 2606.20527 | StylisticBias: A Few Human Visual Cues Drive Most Social Biases in MLLMs | cs.CL, cs.CV | 93 | Controlled benchmark isolates visual cues driving social bias in MLLMs; strong safety relevance. | MLLMs, bias, benchmark, evaluation, safety |
| 2606.19755 | SafeSpec: Fast and Safe LLM via Dynamic Reflective Sampling | cs.CR, cs.AI | 92 | Safety-aware speculative decoding with rollback/reflective sampling; strong LLM safety+efficiency fit. | llm-safety, speculative-decoding, inference, guardrails, efficiency |
| 2606.19868 | A Systematic Evaluation of Black-Box Uncertainty Estimation Methods for Large Language Models | cs.AI | 91 | Systematic black-box LLM uncertainty eval; directly useful for reliability and hallucination control. | llm-reliability, uncertainty, evaluation, hallucination, black-box |
| 2606.18062 | Security and Privacy Prompts in the Wild: What Users Ask LLMs and How LLMs Respond | cs.CL, cs.AI, cs.CR, cs.HC | 91 | Large in-the-wild study of security/privacy prompts and LLM responses; directly useful for safety auditing. | llm-safety, security, privacy, wildchat, user-study, evaluation |
| 2606.20008 | VIMPO: Value-Implicit Policy Optimization for LLMs | cs.LG | 91 | Critic-free RL for LLMs with policy-implied value function; likely useful for reasoning post-training. | LLMs, RL, reasoning, post-training, optimization |
| 2606.19826 | Heterogeneous LLM Debate Under Adversarial Peers: Honest Gains, Replacement Costs, and Resilience | cs.CR, cs.MA | 91 | Directly studies adversarial influence in multi-LLM debate with concrete resilience metrics. | llm-agents, adversarial-robustness, multi-agent, evaluation, safety |
| 2606.03308 | The Security Budget of Code LLMs: An Information-Theoretic Capacity-Security Bound | cs.CR | 91 | Info-theoretic security-capacity bound for code LLMs; strong relevance to prompt robustness. | code-llm, security, information-theory, prompt-robustness, theory |
| 2606.18051 | Compositional Skill Routing for LLM Agents: Decompose, Retrieve, and Compose | cs.CL | 91 | Agent skill composition benchmark/framework over real MCP skills; strong relevance to tool-using LLM agents. | llm-agents, tool-use, planning, benchmark, retrieval, mcp |
| 2606.20546 | Predictability as a Fine-Grained Measure for Privacy | cs.LG | 90 | New privacy framework beyond DP with formal comparisons; potentially important for ML privacy evaluation. | privacy, differential-privacy, theory, evaluation |
| 2606.19893 | MetaResearcher: Scaling Deep Research via Self-Reflective Reinforcement Learning in Adversarial Virtual Environments | cs.AI | 89 | Trains research agents in adversarial evolving worlds; directly targets credibility and misinformation handling. | agents, agent-safety, reinforcement-learning, evaluation, misinformation |
| 2606.20235 | ScholarQuest: A Taxonomy-Guided Benchmark for Agentic Academic Paper Search in Open Literature Environments | cs.IR, cs.AI | 89 | Benchmark for agentic paper search in open environments; strong agent evaluation and reproducibility value. | agents, benchmark, evaluation, search, tool-use |
| 2606.16659 | FraudSMSWalker: Benchmarking Agentic Large Language Models for SMS-to-Webpage Fraud Detection | cs.CL | 89 | Agentic fraud benchmark tests cross-channel SMS-to-web reasoning without easy URL shortcut cues. | agents, security, benchmark, fraud-detection, evaluation, multimodal |
| 2606.12835 | The Internet of Agentic AI: Communication, Coordination, and Collective Intelligence at Scale | cs.MA, cs.AI, cs.CY, cs.NI | 89 | Broad agent ecosystem architecture with security, coordination, and multi-agent risk relevance. | agents, multi-agent, security, coordination, systems |
| 2606.20177 | Evaluating and Enhancing Negation Comprehension in Remote Sensing MLLMs | cs.CV, cs.AI | 89 | Benchmark exposes negation failures in remote-sensing MLLMs and proposes enhancement method. | MLLMs, evaluation, negation, robustness, benchmark |
| 2606.03808 | PURGE: Projected Unlearning via Retain-Guided Erasure | cs.LG, cs.AI, cs.CR | 89 | Machine unlearning method with retain-guided erasure; relevant to privacy, deletion, and model safety. | unlearning, privacy, safety, representation-erasure, continual-learning |
| 2602.08335 | Who Deserves the Reward? SHARP: Shapley Credit-based Optimization for Multi-Agent System | cs.AI | 89 | Multi-agent LLM optimization with Shapley credit assignment; strong agent-training relevance. | multi-agent, LLM, reinforcement-learning, credit-assignment, agents |
| 2606.17861 | GameCraft-Bench: Can Agents Build Playable Games End-to-End in a Real Game Engine? | cs.CL | 89 | Real-engine benchmark for end-to-end coding agents with interactive verification; high reuse for agent eval. | agents, coding-agents, benchmark, evaluation, interactive, game-engine |
| 2606.05901 | Reducing Hallucinations in Complex Question Answering using Simple Graph-based Retrieval-Augmented Generation (long version) | cs.CL, cs.AI | 88 | Agentic graph-RAG for complex QA targets hallucination reduction in a practical LLM deployment setting. | LLM, RAG, hallucination, agents, graph-retrieval, QA |
| 2606.19881 | REDACT: A Systematically Controlled Multilingual Benchmark for Personal Information Detection | cs.CL | 88 | Controlled multilingual PII detection benchmark with rich metadata; high privacy/safety evaluation utility. | privacy, pii, benchmark, multilingual, evaluation |
| 2606.03036 | TriEval: A Resource-Efficient Pipeline for LLM Bias, Toxicity, and Truthfulness Assessment | cs.AI | 88 | Resource-efficient multi-axis LLM safety eval for bias, toxicity, and truthfulness. | llm-evaluation, safety, bias, toxicity, truthfulness, benchmarking |
| 2606.19245 | TxBench-PP: Analyzing AI Agent Performance on Small-Molecule Preclinical Pharmacology | cs.AI, cs.LG | 88 | Verifiable benchmark for AI agents on realistic drug-discovery decisions; high reuse value. | agents, benchmark, evaluation, scientific-ai, reliability |
| 2606.19857 | Large Language Models Do Not Always Need Readable Language | cs.CL, cs.AI | 88 | Probes non-readable model-to-model language, relevant to hidden channels and agent oversight. | llms, communication, agent-safety, interpretability, evaluation |
| 2606.03660 | From Answers to States: Verifiable Process-Level Evaluation of Chemical Reasoning in Large Language Models | cs.AI | 88 | Verifiable process-level benchmark for LLM chemical reasoning; auditable evaluation beyond final answers. | evaluation, reasoning, verifiable, benchmark, process-supervision |
| 2606.10403 | KCSAT-ML: Probing Reasoning Models with Nationwide-Cohort Human Difficulty | cs.CL | 88 | Reasoning benchmark with human difficulty labels; useful for diagnosing test-time scaling and robustness. | reasoning, benchmark, evaluation, human-difficulty, test-time-scaling, vlm |
| 2606.11698 | T2S: A Rehearsal-Based Approach for Extraction-Resistant Model Watermarking | cs.CR, cs.AI | 88 | Targets extraction-resistant model watermarking with simulated theft; strong AI security relevance. | ai-security, watermarking, model-extraction, ip-protection |
| 2606.18203 | RubricsTree: Scalable and Evolving Open-Ended Evaluation of Personal Health Agents across Health Memory and Medical Skills | cs.CL, cs.AI | 87 | Scalable rubric-based evaluation for personal health agents with expert-aligned, verifiable criteria. | agents, evaluation, health, rubrics, llm-judge, benchmark |
| 2606.17423 | Martingale Doppelgänger-Eval: An Identification Framework for Auditing Candlestick Understanding in Vision-Language Models | q-fin.CP, stat.ML | 87 | Identification-focused benchmark audits whether VLMs use evidence vs trend shortcuts. | VLMs, auditing, benchmark, shortcut-learning, evaluation |
| 2605.18160 | Vision Inference Former: Sustaining Visual Consistency in Multimodal Large Language Models | cs.CV, cs.AI | 87 | Targets long-generation visual consistency in MLLMs, a key frontier multimodal reliability issue. | multimodal, MLLM, visual-reasoning, long-context, reliability |
| 2606.16583 | Uncertainty Is Not a Safety Net for Clinical VQA, but Can It Anticipate Model Failure? | cs.CL | 87 | Directly studies whether uncertainty helps safe clinical VQA deployment; strong reliability signal. | safety, uncertainty, calibration, vlm, clinical-ai, evaluation |
AI Paper Insight Brief
2026-06-22
0) Executive takeaways (read this first)
- Process-level evaluation is becoming the dominant pattern across safety-critical domains: chemistry, health agents, fraud detection, clinical VQA, and academic search all show that final-answer accuracy alone hides important failure modes.
- Several papers attack the same core bottleneck from different angles: credit assignment and dense feedback for agents/LLMs. SHARP improves multi-agent RL with per-agent Shapley credit; VIMPO derives token-level advantages without a learned critic; SafeSpec adds step-level safety verification inside speculative decoding.
- Robustness results are increasingly about distributional or structural stress tests, not average accuracy: NOTA perturbations break clinical uncertainty estimates, URL masking exposes fraud-detection shortcut reliance, matched candlestick interventions reveal trend shortcuts, and negation flips remote-sensing MLLM behavior.
- Lightweight architectural or systems changes still matter: VIF improves multimodal grounding with only ~1.04× inference time and 1.05× memory, while graph-backed RAG and skill-routing pipelines show practical gains without full retraining.
- Benchmarks are shifting toward realistic agent environments with verifiable artifacts: Godot game generation, preclinical pharmacology decisions, paper search over open literature, and SMS-to-web fraud chains all show current agents remain far from reliable autonomy.
- Privacy/security work is broadening beyond classic DP: unlearning (PURGE), extraction-resistant watermarking (T2S), multilingual PII detection (REDACT), and predictability-based privacy all emphasize more deployment-relevant threat models and diagnostics.
2) Key themes (clusters)
Theme: Process-level evaluation replaces outcome-only scoring
- Why it matters: Multiple papers show that correct final outputs can coexist with invalid reasoning, unsupported evidence use, or harmful interaction dynamics. This is especially important in domains where auditability matters more than raw accuracy.
- Representative papers:
- From Answers to States: Verifiable Process-Level Evaluation of Chemical Reasoning in Large Language Models
- Towards Understanding and Measuring COGNITIVE ATROPHY in LLM Behaviour
- FraudSMSWalker: Benchmarking Agentic Large Language Models for SMS-to-Webpage Fraud Detection
- RubricsTree: Scalable and Evolving Open-Ended Evaluation of Personal Health Agents across Health Memory and Medical Skills
- Common approach:
- Decompose evaluation into layered signals: final correctness, structural adherence, and verifier-checked intermediate behavior
- Use deterministic or rubric-based checks instead of relying only on free-form LLM judges
- Audit whether model decisions are supported by observed evidence, not just whether they look plausible
- Localize failures to specific steps, spans, or behavioral attributes
- Open questions / failure modes:
- Human/expert annotation remains expensive in clinically grounded settings
- Verified traces may still reflect benchmark-state agreement rather than unique human reasoning
- LLM-judge components remain in the loop for some audits, creating residual subjectivity
- Extending these methods to open-ended, long-horizon, or multimodal workflows remains hard
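The layered scoring described in this theme can be sketched in a few lines. This is a minimal illustration, not any paper's benchmark code: `LayeredScore`, `evaluate_trace`, the step schema (`operands`, `state`), and the toy arithmetic verifier are all hypothetical stand-ins for a domain-specific deterministic checker.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Sequence, Set


@dataclass
class LayeredScore:
    final_correct: bool                 # outcome-level signal
    structure_ok: bool                  # every step has the required fields
    step_validity: float                # fraction of steps the verifier accepts
    first_invalid_step: Optional[int]   # localizes the earliest failure


def evaluate_trace(
    steps: Sequence[Dict],
    final_answer,
    gold_answer,
    required_keys: Set[str],
    verify_step: Callable[[Dict], bool],
) -> LayeredScore:
    # A malformed step counts as invalid and is never passed to the verifier.
    well_formed = [required_keys.issubset(step.keys()) for step in steps]
    verdicts = [ok and verify_step(step) for ok, step in zip(well_formed, steps)]
    first_invalid = next((i for i, v in enumerate(verdicts) if not v), None)
    return LayeredScore(
        final_correct=(final_answer == gold_answer),
        structure_ok=all(well_formed),
        step_validity=sum(verdicts) / len(verdicts) if verdicts else 0.0,
        first_invalid_step=first_invalid,
    )


# Toy deterministic verifier (assumption): a step is valid if its stated
# state equals the sum of its operands.
def toy_verifier(step: Dict) -> bool:
    return sum(step["operands"]) == step["state"]


trace = [
    {"operands": (2, 3), "state": 5},
    {"operands": (5, 4), "state": 10},  # invalid intermediate state
]
score = evaluate_trace(trace, final_answer=9, gold_answer=9,
                       required_keys={"operands", "state"},
                       verify_step=toy_verifier)
print(score)
```

In the toy trace the final answer matches the gold answer while the second step fails verification, which is the kind of gap that outcome-only scoring would hide.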
Theme: Better credit assignment for RL and multi-agent systems
- Why it matters: A recurring bottleneck is that sparse, trajectory-level rewards are too coarse for long-horizon reasoning and multi-agent coordination. New work is trying to recover dense, actionable learning signals without paying full critic-training costs.
- Representative papers:
- Common approach:
- Replace broadcast rewards with finer-grained per-agent or per-token signals
- Use counterfactual or policy-implied structure to infer contribution without a standard learned critic
- Add process rewards for efficiency, reflection, or tool quality rather than only final correctness
- Normalize rewards within groups to reduce variance and stabilize updates
- Open questions / failure modes:
- Counterfactual credit estimation adds substantial compute overhead
- Approximate credit signals may still misattribute planner vs worker contributions
- Most evidence is still concentrated in math/tool-use settings rather than broad agent tasks
- Some proposals remain design frameworks without completed empirical validation
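To make the per-agent credit idea concrete, here is a minimal sketch of exact Shapley credit over a small set of agents. It is not SHARP's estimator (the brief describes that only as Shapley-style and approximate); the agent names and the coalition value table are invented for illustration, and exact enumeration is factorial in the number of agents, which is one way to see where the compute overhead comes from.

```python
from itertools import permutations
from typing import Callable, Dict, FrozenSet, Sequence


def shapley_credit(
    agents: Sequence[str],
    value: Callable[[FrozenSet[str]], float],
) -> Dict[str, float]:
    """Average each agent's marginal contribution over all join orders."""
    credit = {agent: 0.0 for agent in agents}
    orders = list(permutations(agents))
    for order in orders:
        coalition: FrozenSet[str] = frozenset()
        for agent in order:
            with_agent = coalition | {agent}
            credit[agent] += value(with_agent) - value(coalition)
            coalition = with_agent
    return {agent: total / len(orders) for agent, total in credit.items()}


# Hypothetical task reward for each subset of agents (assumption).
# The calculator never changes the outcome, so it should earn zero credit.
REWARD = {
    frozenset(): 0.0,
    frozenset({"planner"}): 0.2,
    frozenset({"searcher"}): 0.0,
    frozenset({"calculator"}): 0.0,
    frozenset({"planner", "searcher"}): 0.8,
    frozenset({"planner", "calculator"}): 0.2,
    frozenset({"searcher", "calculator"}): 0.0,
    frozenset({"planner", "searcher", "calculator"}): 0.8,
}

credit = shapley_credit(["planner", "searcher", "calculator"], REWARD.__getitem__)
print(credit)
```

A broadcast reward would give all three agents the same 0.8; the decomposition instead separates the planner, the searcher, and a redundant calculator, and the credits sum to the full-team reward.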
Theme: Shortcut reliance is the main robustness story
- Why it matters: Many systems look competent until shortcut channels are removed or counterfactually perturbed. The strongest papers here do not just report lower accuracy; they identify what spurious cue the model is using instead of the intended evidence.
- Representative papers:
- Martingale Doppelgänger-Eval: An Identification Framework for Auditing Candlestick Understanding in Vision-Language Models
- FraudSMSWalker: Benchmarking Agentic Large Language Models for SMS-to-Webpage Fraud Detection
- Uncertainty Is Not a Safety Net for Clinical VQA, but Can It Anticipate Model Failure?
- Evaluating and Enhancing Negation Comprehension in Remote Sensing MLLMs
- Common approach:
- Remove shortcut features explicitly (URLs, trend-label coupling, correct answer options)
- Use matched interventions or perturbations to isolate causal sensitivity to intended evidence
- Measure not just accuracy but calibration, evidence support, or revision behavior under stress
- Build domain-specific stress tests rather than relying on generic robustness suites
- Open questions / failure modes:
- Some benchmarks are intentionally controlled and may not fully reflect natural traffic
- Stress tests can reveal failure but not automatically provide a mitigation path
- Robustness often varies sharply by modality, task subtype, or model family
- Shortcut removal can shift operating points in undesirable ways, e.g. false-positive spikes
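A shortcut-masking stress test of the kind described above can be sketched as follows. This is an illustrative harness under stated assumptions, not FraudSMSWalker's pipeline: the URL regex, the `[URL]` placeholder, the toy classifier, and the four example messages are all made up.

```python
import re
from typing import Callable, List, Tuple

# Simple URL pattern (assumption): good enough for this toy data only.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")


def mask_urls(text: str, placeholder: str = "[URL]") -> str:
    return URL_PATTERN.sub(placeholder, text)


def accuracy(classifier: Callable[[str], str],
             examples: List[Tuple[str, str]]) -> float:
    return sum(classifier(text) == label for text, label in examples) / len(examples)


def shortcut_gap(classifier: Callable[[str], str],
                 examples: List[Tuple[str, str]]) -> float:
    """Accuracy lost when the suspected shortcut channel is masked."""
    masked = [(mask_urls(text), label) for text, label in examples]
    return accuracy(classifier, examples) - accuracy(classifier, masked)


# Toy classifier that relies entirely on a URL cue (assumption).
def url_cue_classifier(text: str) -> str:
    return "fraud" if "bit.ly" in text else "benign"


examples = [
    ("Your parcel is held, pay at http://bit.ly/x1", "fraud"),
    ("Lunch at noon? See www.cafe.example/menu", "benign"),
    ("Verify your account now at http://bit.ly/q9", "fraud"),
    ("Meeting moved to 3pm", "benign"),
]

gap = shortcut_gap(url_cue_classifier, examples)
print(gap)
```

A large gap suggests the classifier was reading the masked channel rather than the message content. Reporting per-class error rates alongside the gap would also show whether masking shifts the operating point, for example toward more false positives.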
Theme: Lightweight inference-time fixes are gaining traction
- Why it matters: Several papers show that meaningful robustness or grounding gains can come from small modules or decoding-time interventions, which is attractive for production systems that cannot afford full retraining.
- Representative papers:
- Common approach:
- Insert lightweight modules or heads into existing inference pipelines
- Trigger extra computation only when a risk signal is detected
- Preserve base-model utility through teacher regularization, rollback, or additive fusion
- Emphasize low overhead and compatibility with deployed backbones
- Open questions / failure modes:
- Safety-triggered modes can erase speed gains under attack
- Small modules may not scale cleanly to video or longer multimodal contexts
- Test-time adaptation can overfit if unlabeled adaptation sets are too large
- Detector calibration remains a central source of false positives and over-refusal
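The "check cheaply, pay extra only when flagged" pattern can be sketched as a small decoding loop. This is a simplified illustration and not SafeSpec's algorithm: `guarded_decode`, the token-level `risk_score`, the `slow_path` fallback, and the threshold are hypothetical, and real systems operate on model logits rather than word lists.

```python
from typing import Callable, List, Sequence, Tuple


def guarded_decode(
    draft_chunks: Sequence[List[str]],
    risk_score: Callable[[List[str]], float],
    slow_path: Callable[[List[str]], List[str]],
    threshold: float = 0.5,
) -> Tuple[List[str], int]:
    """Accept drafted chunks; on a risk signal, roll back and regenerate."""
    accepted: List[str] = []
    rollbacks = 0
    for chunk in draft_chunks:
        candidate = accepted + chunk
        if risk_score(candidate) > threshold:
            # Roll back: discard the flagged chunk and spend extra compute
            # on a careful continuation from the last safe prefix.
            rollbacks += 1
            accepted = accepted + slow_path(accepted)
        else:
            accepted = candidate
    return accepted, rollbacks


# Toy risk detector and fallback (assumptions).
def toy_risk(tokens: List[str]) -> float:
    return 1.0 if "exploit" in tokens else 0.0


def toy_slow_path(prefix: List[str]) -> List[str]:
    return ["is", "a", "safer", "alternative"]


tokens, rollbacks = guarded_decode(
    [["Sure,", "here"], ["is", "the", "exploit"]],
    risk_score=toy_risk,
    slow_path=toy_slow_path,
)
print(" ".join(tokens), rollbacks)
```

Because the slow path runs once per flagged chunk, throughput in this sketch degrades in proportion to how often the detector fires, which mirrors the failure mode listed above for attack-heavy traffic.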
Theme: Agent benchmarks are getting more realistic—and current agents still struggle
- Why it matters: The benchmark frontier is moving from toy tasks to environments with real artifacts, tool use, and hidden failure modes. Across domains, current agents are still far from dependable.
- Representative papers:
- GameCraft-Bench: Can Agents Build Playable Games End-to-End in a Real Game Engine?
- TxBench-PP: Analyzing AI Agent Performance on Small-Molecule Preclinical Pharmacology
- ScholarQuest: A Taxonomy-Guided Benchmark for Agentic Academic Paper Search in Open Literature Environments
- Compositional Skill Routing for LLM Agents: Decompose, Retrieve, and Compose
- Common approach:
- Evaluate complete workflows rather than isolated answers
- Use shared backends, deterministic graders, or replay-based verification for reproducibility
- Measure efficiency and process behavior alongside end metrics
- Diagnose bottlenecks such as decomposition granularity, off-target exploration, or harness effects
- Open questions / failure modes:
- Absolute performance remains low in realistic settings
- Harness and toolchain choices can materially change results
- Some benchmarks still rely on multimodal or LLM judges for parts of scoring
- Synthetic or curated queries may not fully capture real user distributions
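One simple way to check whether harness choice is moving results is to tabulate pass rates per model-harness cell and look at the spread within each model. The sketch below assumes a flat log of `(model, harness, passed)` records; the model and harness names and the outcomes are placeholders, not data from any benchmark in this brief.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Run = Tuple[str, str, bool]  # (model, harness, passed)


def pass_rates(runs: Iterable[Run]) -> Dict[Tuple[str, str], float]:
    totals = defaultdict(lambda: [0, 0])  # cell -> [passes, attempts]
    for model, harness, passed in runs:
        cell = totals[(model, harness)]
        cell[0] += int(passed)
        cell[1] += 1
    return {cell: p / n for cell, (p, n) in totals.items()}


def harness_spread(rates: Dict[Tuple[str, str], float]) -> Dict[str, float]:
    """Max minus min pass rate across harnesses, per model."""
    by_model = defaultdict(list)
    for (model, _harness), rate in rates.items():
        by_model[model].append(rate)
    return {model: max(r) - min(r) for model, r in by_model.items()}


# Placeholder run log (assumption).
runs = [
    ("model-a", "harness-x", True), ("model-a", "harness-x", True),
    ("model-a", "harness-y", True), ("model-a", "harness-y", False),
    ("model-b", "harness-x", False), ("model-b", "harness-x", True),
    ("model-b", "harness-y", True), ("model-b", "harness-y", False),
]

rates = pass_rates(runs)
spread = harness_spread(rates)
print(rates)
print(spread)
```

A model whose spread is large relative to the gap between models is a sign that leaderboard differences may reflect the harness as much as the model.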
Theme: Privacy and security evaluation is becoming more deployment-specific
- Why it matters: Rather than treating privacy/security as a single scalar property, new work models concrete threats: extraction, unlearning, PII detection under multilingual variation, and partial-compromise attackers.
- Representative papers:
- Common approach:
- Evaluate privacy with attacker-relevant metrics such as membership-inference attack (MIA) AUROC, watermark survival, or query-specific leakage
- Use structured perturbation axes to expose where detectors fail
- Trade exact guarantees for more realistic threat modeling when appropriate
- Combine theory with practical mechanisms or benchmark infrastructure
- Open questions / failure modes:
- Many methods remain limited to small models, single seeds, or asymptotic analysis
- Synthetic benchmarks still need stronger real-world correlation studies
- Some guarantees are first-order or partial rather than end-to-end formal privacy guarantees
- Compute overhead remains significant for rehearsal, simulation, or adaptive noise design
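As an example of an attacker-relevant metric, the sketch below computes a membership-inference AUROC directly from attack scores. It uses the standard pairwise definition (the probability that a randomly chosen training member scores higher than a randomly chosen non-member, with ties counted as half); the score values are invented, and how the attack produces scores is left out of scope.

```python
from typing import Sequence


def mia_auroc(member_scores: Sequence[float],
              nonmember_scores: Sequence[float]) -> float:
    """Pairwise AUROC: 0.5 means the attacker cannot tell members apart."""
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))


# Hypothetical attack scores (assumption): higher means "more likely a member".
members = [0.9, 0.8, 0.4]
nonmembers = [0.5, 0.3, 0.2]

auc = mia_auroc(members, nonmembers)
print(round(auc, 3))
```

For an unlearning or privacy mechanism, the quantity of interest is usually how close this number moves toward 0.5 on the targeted data while utility on retained data is preserved.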
3) Technical synthesis
- A common design pattern is decomposition before scoring: SHARP decomposes rewards by agent and tool call; RubricsTree decomposes health responses into Boolean leaves; ChemCoTBench-V2 decomposes reasoning into verifier-checkable states; SkillWeaver decomposes user requests into atomic subtasks.
- Several papers replace opaque end metrics with counterfactual or interventional tests: SHARP uses trajectory masking, Doppelgänger-Eval uses matched evidence edits, FraudSMSWalker masks URLs, and clinical VQA uses NOTA perturbations.
- Group-relative normalization appears in RL settings as a variance-control mechanism: SHARP uses group-relative advantages; VIMPO uses group estimates to anchor policy-implied values.
- There is a strong move toward hybrid evaluation stacks: deterministic graders where possible, LLM judges where necessary, and human audits for calibration. Few papers rely on any single evaluator.
- Multiple works show that calibration degrades exactly where capability is weakest: clinical uncertainty estimation (UE) is least useful on low-accuracy modalities; fraud agents are least grounded on hard benign cases; remote-sensing (RS) negation failures are worst on state-level reasoning.
- Inference-time adaptation is increasingly modular: VIF adds a two-layer visual module, SafeSpec adds a safety head plus rollback, NeFo updates LoRA adapters at test time.
- Several benchmarks expose that tool or environment design is part of the model result: TxBench-PP shows harness effects; ScholarQuest shows expansion strategy matters; GameCraft-Bench requires replay traces, not just code artifacts.
- Security papers increasingly argue that single scalar metrics are misleading: pass@1 cannot certify prompt hardening, toxicity refusal can hide truthfulness issues, and aggregate PII F1 hides high-sensitivity misses.
- Many of the strongest empirical papers use stress tests that preserve superficial task format while changing latent semantics: remove correct options, negate queries, preserve trend while changing candlestick evidence, or reveal/hide URLs.
- Across domains, the most actionable gains come from small structural changes plus better diagnostics, not necessarily larger models.
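The group-relative normalization mentioned in the synthesis can be illustrated with a common form: standardize each sampled response's reward against the other responses drawn for the same prompt. This is a generic sketch; the brief does not specify the exact formulas used by SHARP or VIMPO, and the `eps` constant and example rewards are assumptions.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float],
                              eps: float = 1e-8) -> List[float]:
    """Center and scale rewards within one group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Rewards for four responses sampled for the same prompt (assumption).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)

# A group with identical rewards carries no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
print(flat)
```

Because the baseline is the group mean rather than a learned value estimate, no critic network is needed, and the scaling keeps update magnitudes comparable across prompts with very different reward ranges.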
4) Top 5 papers (with “why now”)
1. Who Deserves the Reward? SHARP: Shapley Credit-based Optimization for Multi-Agent System
- Introduces a practical reward decomposition for tool-integrated multi-agent LLM training: broadcast accuracy, Shapley-style marginal credit, and tool-process reward.
- Shows sizable gains across MuSiQue, GAIA-text, WebWalkerQA, FRAMES, and DocMath-Eval, with reported average improvements of 23.66% over single-agent baselines and 14.05% over other multi-agent methods.
- Especially relevant now because multi-agent/tool-using systems are scaling faster than our ability to train them stably; this directly targets the coordination bottleneck.
- Useful if you are training planner-worker systems and need per-role learning signals rather than monolithic rewards.
- Skeptical take: counterfactual Shapley estimation is expensive and approximate, and useful subagents still end up in the minority.
2. SafeSpec: Fast and Safe LLM via Dynamic Reflective Sampling
- Integrates a lightweight safety head into speculative decoding so safety checks and quality verification happen in the same target-model pass.
- Adds rollback-and-reflect recovery instead of only refusing, preserving benign-workload speedups while reducing jailbreak success.
- Why now: speculative decoding is becoming standard in production inference, and most safety methods do not fit cleanly into that stack.
- Reported results are strong on two model families, including ~2.06× benign speedup on Qwen3-32B with an average attack success rate (ASR) around 0.07.
- Skeptical take: under attack, Safety Mode triggers frequently and throughput drops sharply; generalization depends on the trained safety head.
3. From Answers to States: Verifiable Process-Level Evaluation of Chemical Reasoning in Large Language Models
- Builds a 5,620-sample benchmark with deterministic chemistry-state verification across 18 tasks.
- Shows a striking gap between template adherence and actual chemically valid reasoning, making it a clean example of why process evaluation matters.
- Why now: chemistry and scientific copilots are moving into higher-stakes workflows where plausible-but-invalid reasoning is unacceptable.
- Useful beyond chemistry as a template for structured intermediate-state verification in other scientific domains.
- Skeptical take: verification is limited to rule-verifiable 2D chemistry tasks and benchmark-state agreement, not full scientific reasoning breadth.
4. RubricsTree: Scalable and Evolving Open-Ended Evaluation of Personal Health Agents across Health Memory and Medical Skills
- Proposes a hierarchical rubric DAG with 100+ atomic Boolean checks and adaptive routing, aiming to make open-ended health-agent evaluation both scalable and clinically aligned.
- Achieves much stronger expert alignment than a principle-based baseline (ICC3 0.876 vs 0.291; κ 0.787 vs 0.431) and detects context corruption reliably.
- Why now: health agents are one of the clearest cases where open-ended LLM evaluation must be both scalable and auditable.
- Also notable because the evaluator is useful downstream as prompt guidance, feedback, and RL reward.
- Skeptical take: taxonomy transfer and routing coverage remain open risks, especially for rare but safety-critical rubrics.
5. TxBench-PP: Analyzing AI Agent Performance on Small-Molecule Preclinical Pharmacology
- Provides a realistic, deterministically graded benchmark for preclinical pharmacology decisions with 4,800 trajectories across 16 model-harness configurations.
- Finds no system is close to reliable autonomy; the best setup reaches 59.3% pass rate, and method/calibration errors dominate failures.
- Why now: biotech and scientific-agent claims are accelerating, but this paper shows current systems still fail on local, decision-relevant scientific judgment.
- Particularly useful because it separates model quality from harness effects and gives a concrete failure taxonomy.
- Skeptical take: scope is intentionally narrow and local; results do not yet generalize to broader discovery or clinical workflows.
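The rubric structure described for RubricsTree (atomic Boolean checks organized hierarchically) can be illustrated with a much simpler tree. This sketch omits the paper's DAG structure and adaptive routing entirely; `RubricNode`, the string-matching leaf checks, and the example response are hypothetical, and real leaves would be judged by experts or a model rather than by keyword tests.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class RubricNode:
    name: str
    check: Optional[Callable[[str], bool]] = None   # set on leaves only
    children: List["RubricNode"] = field(default_factory=list)


def rubric_score(node: RubricNode, response: str) -> float:
    """Leaves return 0 or 1; inner nodes average their children."""
    if not node.children:
        return 1.0 if node.check(response) else 0.0
    return sum(rubric_score(c, response) for c in node.children) / len(node.children)


# Toy rubric (assumption): keyword checks stand in for real judgments.
rubric = RubricNode("health_answer", children=[
    RubricNode("safety", children=[
        RubricNode("refers_to_clinician", check=lambda r: "doctor" in r.lower()),
        RubricNode("avoids_dosage", check=lambda r: " mg" not in r.lower()),
    ]),
    RubricNode("memory", children=[
        RubricNode("uses_known_allergy", check=lambda r: "penicillin" in r.lower()),
    ]),
])

response = "Given your penicillin allergy, ask your doctor before taking 500 mg."
overall = rubric_score(rubric, response)
safety = rubric_score(rubric.children[0], response)
print(overall, safety)
```

Keeping leaves Boolean is what makes failures easy to localize: the same traversal that yields the overall score also shows which named check failed.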
5) Practical next steps
- Add process-level metrics to your eval stack wherever possible: evidence support, intermediate-state validity, revision quality, or rubric-leaf pass rates—not just final accuracy.
- For multi-agent or tool-using systems, test credit decomposition explicitly: compare broadcast rewards against per-agent/per-tool rewards and measure harmful or redundant subagent rates.
- Stress-test for shortcut dependence by masking likely leakage channels: URLs, answer options, metadata, trend cues, or retrieval provenance.
- If you deploy multimodal systems, try lightweight inference modules before full retraining: dynamic visual reinjection, safety heads, or test-time LoRA adaptation can yield favorable cost/benefit.
- Evaluate uncertainty methods under counterfactual failure conditions, not just standard calibration curves; ask whether uncertainty rises when the task becomes unanswerable or evidence is removed.
- For RAG/agent systems, measure process efficiency and grounding together: tool calls, expansion depth, candidate-set size, evidence support, and recall efficiency.
- In safety-critical domains, prefer deterministic or structured verifiers over pure LLM-as-judge whenever the domain admits symbolic checks.
- For privacy/security, report threat-specific metrics alongside aggregate utility: MIA AUROC, watermark survival after extraction, high-sensitivity PII recall, or leakage under partial-compromise assumptions.
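The uncertainty check suggested above can be prototyped as a before/after comparison: remove the correct option, add a "None of the above" choice, and see whether predictive entropy rises. This is a minimal sketch, not the clinical VQA paper's protocol; `predict_probs` is a hypothetical interface returning one probability per option, and the toy model is deliberately overconfident.

```python
from math import log
from typing import Callable, List, Sequence


def entropy(probs: Sequence[float]) -> float:
    return -sum(p * log(p) for p in probs if p > 0)


def nota_uncertainty_shift(
    predict_probs: Callable[[str, List[str]], List[float]],
    question: str,
    options: List[str],
    correct_index: int,
) -> float:
    """Entropy after removing the correct option minus entropy before."""
    before = entropy(predict_probs(question, options))
    perturbed = (options[:correct_index] + options[correct_index + 1:]
                 + ["None of the above"])
    after = entropy(predict_probs(question, perturbed))
    return after - before


# Toy model (assumption): always puts 0.9 on the first option it sees.
def overconfident_model(question: str, options: List[str]) -> List[float]:
    rest = 0.1 / (len(options) - 1)
    return [0.9] + [rest] * (len(options) - 1)


shift = nota_uncertainty_shift(
    overconfident_model,
    question="Which finding is present?",
    options=["effusion", "fracture", "nodule", "normal"],
    correct_index=0,
)
print(shift)
```

A shift near zero, as with this toy model, means the uncertainty signal did not react to the question becoming unanswerable among the original options, so it would not have served as a safety net in that case.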
Generated from per-paper analyses; no external browsing.