June 18, 2026 Research Brief
Agent evaluation grows teeth.
Today’s papers push agent research away from single-score demos toward process-aware evaluation, transactional runtimes, and realistic security tests that expose cross-step failures.
Takeaways
- **Evaluation is shifting from final-answer scoring to process-aware measurement.** Multiple papers argue that pass/fail, pass@1, or pooled factuality scores hide the real failure modes in agents; stronger evaluation now tracks trajectories, hidden intent, provenance, replay, intermediate beliefs, and inference-budget sensitivity.
- **Agent safety failures are increasingly cross-step and cross-source, not single-turn.** New work on semantic transactions, provenance-aware verification, real-document prompt injection, multimodal skill attacks, and off-procedure dialogue all show that local checks miss harms that only appear when evidence is composed over time.
- **Harness and runtime design matter almost as much as the base model.** Several papers show large performance swings from tool interfaces, replay systems, skill packaging, self-evolution schedules, and benchmark hygiene—suggesting many leaderboard gains are still system-engineering gains rather than pure model gains.
Start with: Cordon: Semantic Transactions for Tool-Using LLM Agents
Why it catches my eye: It offers a reusable runtime pattern for staging, validating, and auditing agent actions before irreversible tool effects commit.
Read skeptically for: Its guarantees depend on mediated, observable tools; opaque plugins and external side effects can still escape containment.
Papers Worth Your Reading Time
Ranked for research usefulness: novelty, method pattern, evidence quality, and skepticism value.
Cordon: Semantic Transactions for Tool-Using LLM Agents
#1 A strong systems paper that reframes agent safety around task-level commit control rather than isolated tool-call filtering.
- Why now
- Stateful agents are moving into workflows where rollback, containment, and audit trails matter as much as raw capability.
- Skepticism
- Coverage is limited when tools or side effects are not fully mediated by the runtime.
How Inference Compute Shapes Frontier LLM Evaluation
#2 Useful because it shows capability claims can swing materially with token budget, retries, and scaffold choices.
- Why now
- Frontier model comparisons increasingly depend on inference policy, making single-budget leaderboard numbers harder to trust.
- Skepticism
- The reported curves may change under different elicitation, search, or tool-use scaffolds.
PARSE: Provenance-Aware Retrieval Sanitization for Professional Domain LLM Agents
#3 It tests prompt-injection defense on real enterprise documents and proposes a provenance-aware mitigation with practical deployment relevance.
- Why now
- Enterprise RAG systems are now ingesting long, authority-laden documents where synthetic defenses often overstate security.
- Skepticism
- Adaptive attackers and limited per-domain sample sizes leave open how robust the defense is in broader deployment.
Run stats
- Candidates: 283
- Selected: 30
- Deepread completed: 30
- Window (UTC): 2026-06-16T00:00:00Z → 2026-06-17T00:00:00Z (arxiv_announce, expanded=0)
| arXiv ID | Title / Links | Categories | Score | Why | Tags |
|---|---|---|---|---|---|
| 2606.18193 | A Red-Team Study of Anthropic Fable 5 & Opus 4.8 Models | cs.CR, cs.AI, cs.CL | 95 | Large-scale jailbreak red-team on frontier LLMs with concrete attack breakdowns and residual risk. | jailbreak, red-teaming, frontier-llms, robustness, safety-evaluation |
| 2606.18060 | PseudoBench: Measuring How Agentic Auto-Research Fuels Pseudoscience | cs.AI, cs.CL | 95 | Adversarial benchmark shows auto-research agents readily amplify pseudoscience with near-zero refusal. | agent-safety, benchmark, evaluation, misinformation, science-agents |
| 2606.18198 | Seeing Is Not Screening: Multimodal Hidden Instruction Attacks on Agent Skill Scanners | cs.CR, cs.CV | 95 | Multimodal hidden-instruction attack on agent skill scanners; directly relevant to agent security. | agent-safety, security, multimodal, prompt-injection, red-teaming, skills |
| 2606.18120 | Structural Role Injection in Handlebars-Templated LLM Prompts: Triple-Brace Interpolation, Delimiter Family, and the Limits of HTML Auto-Escaping | cs.CR, cs.AI, cs.CL, cs.LG | 95 | Concrete prompt-injection analysis for templated LLM apps; directly relevant to agent security. | prompt-injection, agent-security, templating, Handlebars, jailbreaks |
| 2606.17467 | PARSE: Provenance-Aware Retrieval Sanitization for Professional Domain LLM Agents | cs.CR, cs.CL | 93 | Real-document prompt injection benchmark plus provenance-aware defense for enterprise agent retrieval. | prompt-injection, agents, retrieval-security, enterprise, benchmark |
| 2606.17573 | Cordon: Semantic Transactions for Tool-Using LLM Agents | cs.OS, cs.CR | 92 | Transactional runtime for tool-using agents addresses rollback, containment, audit, and safe commits. | agents, tool-use, runtime-safety, containment, auditability |
| 2606.17478 | Decoding Hidden Deception in Reasoning LLMs: Activation Explainers for Deception Auditing | cs.CL, cs.AI | 91 | Activation-based deception auditing with interpretable reports and strong gains over probe baselines. | deception, interpretability, auditing, reasoning-llms, safety |
| 2606.17546 | SEAGym: An Evaluation Environment for Self-Evolving LLM Agents | cs.AI | 91 | Evaluation environment for self-evolving agents with transfer, replay, overfitting, and cost diagnostics. | agents, evaluation, benchmark, self-improvement, reliability |
| 2606.17904 | DiagFlowBench: Evaluating How Language Models Handle Off-Procedure Inputs in Grounded Diagnostic Dialogue | cs.AI | 91 | Benchmark for off-procedure inputs in grounded diagnostic dialogue; strong abstention/safety eval value. | evaluation, grounding, hallucination, benchmark, dialogue-safety, abstention |
| 2606.17929 | PreAct: Computer-Using Agents that Get Faster on Repeated Tasks | cs.AI | 91 | Practical computer-use agent architecture with guarded replay and major speedups on repeated tasks. | agents, computer-use, automation, efficiency, runtime-safety |
| 2606.18037 | ProvenanceGuard: Source-Aware Factuality Verification for MCP-Based LLM Agents | cs.AI, cs.CL, cs.MA | 89 | Source-aware factuality verifier for MCP agents targets cross-source conflation, a practical failure mode. | mcp, agents, factuality, provenance, verification |
| 2606.17698 | EComAgentBench: Benchmarking Shopping Agents on Long-Horizon Tasks with Distributed Hidden Intent | cs.AI, cs.CL | 89 | Long-horizon shopping-agent benchmark with hidden intent and source-traceable failure analysis. | agents, benchmark, long-horizon, tool-use, evaluation |
| 2606.18068 | Agentic AI-based Framework for Mitigating Premature Diagnostic Handoff and Silent Hallucination in Healthcare Applications | cs.AI | 89 | Deterministic orchestration and protocol gating to reduce hallucinations in medical agent workflows. | agent-safety, healthcare, hallucination, multi-agent, guardrails, neuro-symbolic |
| 2606.17799 | Position: Coding Benchmarks Are Misaligned with Agentic Software Engineering | cs.SE, cs.AI, cs.CL | 89 | Strong benchmark critique for coding agents; separates model from harness and environment effects. | evaluation, coding-agents, benchmarks, software-engineering, agents |
| 2606.17930 | How Inference Compute Shapes Frontier LLM Evaluation | cs.AI | 88 | Shows frontier LLM evals can be heavily shaped by inference compute, affecting capability assessment. | evaluation, frontier-llms, inference-scaling, benchmarks, capabilities |
| 2606.18021 | LegalHalluLens: Typed Hallucination Auditing and Calibrated Multi-Agent Debate for Trustworthy Legal AI | cs.AI, cs.CL, cs.LG, cs.MA | 88 | Typed hallucination auditing and calibrated debate for legal AI; actionable reliability metrics. | hallucination, legal-ai, calibration, multi-agent, reliability |
| 2606.18043 | Uncertainty Quantification for Flow-Based Vision-Language-Action Models | cs.RO, cs.LG | 88 | Uncertainty estimation for VLA robots targets failure detection in deployment-critical settings. | uncertainty, robotics, VLA, reliability, OOD |
| 2606.17454 | Dissecting model behavior through agent trajectories | cs.AI, cs.LG | 87 | Frames agent failures as intent-execution gap; useful lens for harness reliability and auditing. | agents, agent-harness, interpretability, reliability, trajectories |
| 2606.17819 | A Framework for Evaluating Agentic Skills at Scale | cs.SE, cs.AI, cs.CL | 87 | Scalable framework evaluating 500 real-world agent skills across models; high reuse for agent assessment. | agents, evaluation, benchmarks, skills, scalability, llm-systems |
| 2606.17803 | Continual Self-Improvement with Lightweight Experiential Latent Memories | cs.LG | 87 | Continual self-improvement via latent memories for reasoning traces could matter for agent learning. | continual-learning, reasoning, memory, self-improvement, agents |
| 2606.17872 | AnchorKV: Safety-Aware KV Cache Compression via Soft Penalty with a Refusal Anchor | cs.LG, cs.AI | 86 | Safety-aware KV cache compression links efficiency with jailbreak robustness in long-context inference. | efficiency, kv-cache, jailbreak, long-context, alignment |
| 2606.18023 | LoopCoder-v2: Only Loop Once for Efficient Test-Time Computation Scaling | cs.LG, cs.AI | 86 | 7B looped transformer study on test-time compute scaling with concrete 18T-token training evidence. | frontier-llm, architecture, test-time-compute, efficiency, transformers |
| 2606.17645 | Beyond Domains: Reusing Web Skills via Transferable Interaction Patterns | cs.AI, cs.CL, cs.LG | 85 | Transferable web-skill reuse could materially cut agent cost/latency and improve cross-site generalization. | web-agents, skills, efficiency, transfer, tool-use |
| 2606.17541 | Offline Preference-Based Trajectory Evaluation | cs.LG, cs.AI | 85 | Trajectory-preference metric improves offline evaluation discrimination for agentic systems. | agents, evaluation, metrics, offline-eval, benchmarks, trajectory-analysis |
| 2606.17591 | Closing the Feedback Loop: From Experience Extraction to Insight Governance in Verbal Reinforcement Learning | cs.AI | 85 | Addresses retention/forgetting in verbal RL agents with governance of learned insights. | verbal-RL, agents, memory, nonstationarity, governance |
| 2606.17464 | CheckMIABench: Firm Foundations For Membership Inference Attacks on Language Models | cs.LG | 84 | Principled benchmark for membership inference on LLMs improves privacy evaluation validity. | privacy, membership-inference, benchmark, llms, evaluation |
| 2606.17383 | Model Validation of Agentic AI Systems: A POMDP-Based Framework for Belief-State, Forecast, and Policy Validation | q-fin.RM, cs.AI, cs.LG, stat.ML | 84 | POMDP-based validation framework targets beliefs, forecasts, and policies in agentic AI systems. | agent-safety, validation, pomdp, governance, evaluation |
| 2606.17687 | SuCo: Sufficiency-guided Continuous Adaptive Reasoning | cs.CL, cs.AI | 84 | Adaptive reasoning via minimal sufficient CoT targets efficiency and accuracy in reasoning models. | llm, reasoning, efficiency, chain-of-thought, adaptive-compute, training |
| 2606.18195 | Learning from the Self-future: On-policy Self-distillation for dLLMs | cs.CL | 84 | First on-policy self-distillation framework for diffusion LLMs; notable post-training advance. | diffusion-LLMs, post-training, self-distillation, reasoning, LLMs |
| 2606.18216 | Zone of Proximal Policy Optimization: Teacher in Prompts, Not Gradients | cs.CL | 83 | Teacher-in-prompt RL method for small students is a plausible, reusable post-training idea. | RL, distillation, post-training, small-models, reasoning |
AI Paper Insight Brief
2026-06-18
0) Executive takeaways (read this first)
- Evaluation is shifting from final-answer scoring to process-aware measurement. Multiple papers argue that pass/fail, pass@1, or pooled factuality scores hide the real failure modes in agents; stronger evaluation now tracks trajectories, hidden intent, provenance, replay, intermediate beliefs, and inference-budget sensitivity.
- Agent safety failures are increasingly cross-step and cross-source, not single-turn. New work on semantic transactions, provenance-aware verification, real-document prompt injection, multimodal skill attacks, and off-procedure dialogue all show that local checks miss harms that only appear when evidence is composed over time.
- Harness and runtime design matter almost as much as the base model. Several papers show large performance swings from tool interfaces, replay systems, skill packaging, self-evolution schedules, and benchmark hygiene—suggesting many leaderboard gains are still system-engineering gains rather than pure model gains.
- Inference-time compute and memory policies are now first-class capability/safety levers. More budget, replay, adaptive reasoning, loop depth, and KV-cache compression all materially change measured capability or safety; single-budget benchmark numbers are becoming less informative.
- Practical defenses are moving toward conservative gating with explicit audit artifacts. The strongest systems here tend to stage actions, preserve provenance, validate intermediate objects, or block on uncertainty rather than rely on one-shot generation plus post-hoc scoring.
- Several benchmarks expose uncomfortable robustness gaps in realistic settings. Real enterprise documents break synthetic prompt-injection defenses; hidden shopping intent remains hard; grounded diagnostic dialogue still force-maps off-procedure inputs; auto-research agents readily produce persuasive pseudoscience.
2) Key themes (clusters)
Theme: Process-aware evaluation replaces endpoint metrics
- Why it matters: A recurring message is that aggregate success metrics are saturating or misleading because they collapse rich trajectories, hidden constraints, and protocol choices into one number. Better evaluation now measures how an agent got there, what intermediate state it formed, and how sensitive results are to harness and compute.
- Representative papers:
- Common approach:
- Decompose agent behavior into intermediate objects: beliefs, forecasts, actions, utility, or trajectory phases.
- Replace binary success with pairwise preferences, solution-distance, replay diagnostics, or compute-scaling curves.
- Audit benchmark integrity and protocol sensitivity rather than treating benchmark scores as intrinsic model properties.
- Use ablations and frozen snapshots to localize whether gains come from model, harness, or evaluation setup.
- Open questions / failure modes:
- Many trajectory metrics are textual or proxy-based rather than semantic.
- Richer evaluation is more expensive and harder to standardize across labs.
- Protocol choices can still dominate conclusions, especially for long-horizon tasks.
- Community adoption may lag because leaderboards prefer simple scalar metrics.
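The process-level view in this theme can be made concrete with a small record type. The sketch below is illustrative only: `Step`, `Trajectory`, and `process_summary` are hypothetical names that do not come from any of the papers, and the logged fields are just one plausible choice.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Step:
    tool: str               # which tool the agent called
    tool_ok: bool           # False if the call raised or returned an error
    verifier_passed: bool   # result of a hypothetical per-step check


@dataclass
class Trajectory:
    task_id: str
    steps: List[Step]
    success: bool           # the endpoint metric most leaderboards report


def process_summary(t: Trajectory) -> Dict[str, object]:
    """Summarize how the agent reached its outcome, not just the outcome."""
    return {
        "task_id": t.task_id,
        "success": t.success,
        "n_steps": len(t.steps),
        "tool_errors": sum(1 for s in t.steps if not s.tool_ok),
        "verifier_failures": sum(1 for s in t.steps if not s.verifier_passed),
    }


clean = Trajectory("task-1", [Step("search", True, True), Step("edit", True, True)], True)
messy = Trajectory(
    "task-2",
    [Step("search", False, True), Step("search", True, False), Step("edit", True, True)],
    True,
)
summaries = [process_summary(clean), process_summary(messy)]
```

Both toy trajectories count as successes under an endpoint metric; only the per-step summary shows that the second one hit a tool error and failed an intermediate check along the way.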
Theme: Runtime safety is becoming transactional, provenance-aware, and execution-grounded
- Why it matters: Several papers show that agent failures often emerge only after multiple tool calls, source merges, or deferred side effects. Defenses that inspect isolated prompts or tool calls miss these composed harms.
- Representative papers:
- Cordon: Semantic Transactions for Tool-Using LLM Agents
- ProvenanceGuard: Source-Aware Factuality Verification for MCP-Based LLM Agents
- PARSE: Provenance-Aware Retrieval Sanitization for Professional Domain LLM Agents
- Seeing Is Not Screening: Multimodal Hidden Instruction Attacks on Agent Skill Scanners
- Common approach:
- Preserve lineage/provenance through execution rather than pooling evidence.
- Stage or shadow side effects until a task-level validator approves commit.
- Route claims or retrieved content to source-specific checks before allowing output.
- Simulate execution or reconstruct abuse chains instead of relying on static artifact inspection.
- Open questions / failure modes:
- Guarantees only hold for mediated, observable tools and effects.
- Exact source attribution becomes much harder in dense multi-source settings.
- Adaptive attackers may learn to survive sanitization or simulation heuristics.
- These systems add latency, cost, and integration complexity.
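The stage-then-commit pattern described above can be sketched in a few lines. This is a minimal illustration of the general idea, not Cordon's actual interface: `StagedEffect`, `Transaction`, and `trusted_sources_only` are hypothetical names, and the policy is a deliberately simple stand-in for a real task-level validator.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class StagedEffect:
    """One deferred side effect plus the source that motivated it."""
    description: str
    source: str                 # provenance label, e.g. "user" or "retrieved_doc"
    apply: Callable[[], None]   # runs only if the whole task is approved


@dataclass
class Transaction:
    """Collects effects for a task and commits them all or not at all."""
    effects: List[StagedEffect] = field(default_factory=list)
    audit_log: List[str] = field(default_factory=list)

    def stage(self, effect: StagedEffect) -> None:
        self.effects.append(effect)
        self.audit_log.append(f"staged: {effect.description} (source={effect.source})")

    def commit(self, validator: Callable[[List[StagedEffect]], bool]) -> bool:
        # The validator sees every staged effect at once, so it can reject
        # combinations that look harmless when inspected one call at a time.
        if not validator(self.effects):
            self.audit_log.append("aborted: task-level validation failed")
            self.effects.clear()
            return False
        for effect in self.effects:
            effect.apply()
            self.audit_log.append(f"committed: {effect.description}")
        self.effects.clear()
        return True


def trusted_sources_only(effects: List[StagedEffect]) -> bool:
    """Toy policy: fail closed if any effect traces back to untrusted content."""
    return all(e.source == "user" for e in effects)


sent: List[str] = []
tx = Transaction()
tx.stage(StagedEffect("read invoice", "user", lambda: sent.append("read")))
tx.stage(StagedEffect("email invoice externally", "retrieved_doc", lambda: sent.append("email")))
committed = tx.commit(trusted_sources_only)
```

In this toy run nothing is applied, because one staged effect traces back to retrieved content. As the open questions note, the pattern only helps for effects that actually pass through `stage`; anything a tool does on its own is outside the boundary.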
Theme: Realistic security benchmarks are exposing failures hidden by synthetic setups
- Why it matters: Multiple papers argue that prior security conclusions were inflated by unrealistic data splits, synthetic documents, or narrow attack surfaces. More realistic benchmarks often lower confidence in existing defenses.
- Representative papers:
- CheckMIABench: Firm Foundations For Membership Inference Attacks on Language Models
- PARSE: Provenance-Aware Retrieval Sanitization for Professional Domain LLM Agents
- Structural Role Injection in Handlebars-Templated LLM Prompts: Triple-Brace Interpolation, Delimiter Family, and the Limits of HTML Auto-Escaping
- A Red-Team Study of Anthropic Fable 5 & Opus 4.8 Models
- Common approach:
- Build cleaner splits or real-document corpora to remove benchmark artifacts.
- Compare against blind baselines or static analyses to test whether the benchmark itself leaks signal.
- Stress systems with adaptive, iterative, or domain-camouflaged attacks rather than one-shot toy prompts.
- Report attack-family-specific results instead of a single aggregate attack success rate (ASR).
- Open questions / failure modes:
- Many realistic evaluations still depend on open checkpoints, curated corpora, or judge models.
- Results may be point-in-time snapshots of model and safety-stack versions.
- Domain realism improves validity but can reduce breadth and statistical power.
- Stronger adaptive attacks may still invalidate current defenses.
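To see why per-family reporting matters, the sketch below computes attack success rate both ways on made-up data. The function name `asr_by_family` and the family labels are hypothetical, and the attempt outcomes are invented purely for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def asr_by_family(attempts: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Return the fraction of successful attacks within each attack family."""
    totals: Dict[str, int] = defaultdict(int)
    wins: Dict[str, int] = defaultdict(int)
    for family, succeeded in attempts:
        totals[family] += 1
        if succeeded:
            wins[family] += 1
    return {family: wins[family] / totals[family] for family in totals}


# Invented outcomes: (attack family, did the attack succeed?)
attempts = [
    ("one_shot", False), ("one_shot", False), ("one_shot", False),
    ("domain_camouflaged", True), ("domain_camouflaged", True), ("domain_camouflaged", False),
]
aggregate_asr = sum(ok for _, ok in attempts) / len(attempts)
per_family = asr_by_family(attempts)
```

Here the aggregate figure is one in three, which hides that one family never succeeds while the other succeeds two times out of three.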
Theme: Memory, replay, and self-improvement are moving from ad hoc context stuffing to governed reuse
- Why it matters: Another cluster focuses on making agents improve from experience without full retraining, but with safeguards against stale or harmful reuse. The key shift is from “store everything” to verified, curated, or parameter-efficient reuse.
- Representative papers:
- PreAct: Computer-Using Agents that Get Faster on Repeated Tasks
- Closing the Feedback Loop: From Experience Extraction to Insight Governance in Verbal Reinforcement Learning
- Continual Self-Improvement with Lightweight Experiential Latent Memories
- SEAGym: An Evaluation Environment for Self-Evolving LLM Agents
- Common approach:
- Reuse prior experience only after verification, replay, or governance checks.
- Separate active knowledge from deprecated-but-preserved history.
- Store compact executable programs or latent memories instead of raw traces alone.
- Evaluate evolution over snapshots, replay, transfer, and cost rather than only final score.
- Open questions / failure modes:
- Compile/verify overhead can be large.
- Memory quality is uneven and hard to predict in advance.
- Cross-backend transfer is inconsistent; gains may be harness-specific.
- Long-term memory management, eviction, and consolidation remain open.
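A verify-before-store rule can be sketched as a cache that refuses unverified entries. This is an assumption-laden illustration, not PreAct's implementation: `ReplayCache` and `toy_verify` are hypothetical, and the string check stands in for whatever re-execution or goal-state check a real system would run.

```python
from typing import Callable, Dict, List, Optional


class ReplayCache:
    """Stores reusable programs only after a verifier accepts them."""

    def __init__(self, verify: Callable[[str, str], bool]) -> None:
        self._verify = verify
        self._programs: Dict[str, str] = {}
        self.rejected: List[str] = []

    def store(self, task_key: str, program: str) -> bool:
        # Conservative insertion: an unverified program is never cached.
        if not self._verify(task_key, program):
            self.rejected.append(task_key)
            return False
        self._programs[task_key] = program
        return True

    def lookup(self, task_key: str) -> Optional[str]:
        return self._programs.get(task_key)


def toy_verify(task_key: str, program: str) -> bool:
    # Stand-in check: a real verifier would replay the program in a sandbox
    # and confirm the goal state, rather than inspect the text.
    return program.strip().endswith("check_goal_state()")


cache = ReplayCache(toy_verify)
stored_good = cache.store("export-report", "open_app(); click('Export'); check_goal_state()")
stored_bad = cache.store("rename-file", "click(312, 88)")
```

The second program is rejected and a later `lookup("rename-file")` returns `None`, so the agent falls back to solving the task from scratch instead of replaying an unchecked trace.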
Theme: Inference-time adaptation is now a major frontier for both capability and safety
- Why it matters: Several papers show that test-time behavior can be improved—or broken—by changing reasoning length, loop depth, token budget, or cache policy. This makes inference-time control a central optimization and safety surface.
- Representative papers:
- Common approach:
- Learn or diagnose when extra reasoning is sufficient versus redundant.
- Treat loop count or token budget as a tunable compute allocation problem.
- Add lightweight safety signals directly at inference bottlenecks like KV eviction.
- Evaluate capability as a curve over compute, not a single operating point.
- Open questions / failure modes:
- Sufficiency signals often rely on ground truth or strong auxiliary models.
- More compute helps unevenly across benchmarks and can mask protocol dependence.
- Aggressive penalties or too many loops can reverse gains.
- Standard reporting still under-specifies inference budget and allocation policy.
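Reporting capability as a curve rather than a point can be as simple as the sketch below. It assumes, for illustration only, that each task has a smallest token budget at which it is solved and stays solved at larger budgets; the papers above report cases where that monotonicity fails. The function name and the data are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def solve_rate_curve(
    min_budget_to_solve: Dict[str, Optional[int]], budgets: List[int]
) -> List[Tuple[int, float]]:
    """Return (budget, fraction of tasks solved within that budget) pairs."""
    n_tasks = len(min_budget_to_solve)
    curve = []
    for budget in sorted(budgets):
        solved = sum(
            1
            for needed in min_budget_to_solve.values()
            if needed is not None and needed <= budget
        )
        curve.append((budget, solved / n_tasks))
    return curve


# Invented data: smallest token budget that solved each task, None if never solved.
results = {"task-a": 1_000, "task-b": 4_000, "task-c": 16_000, "task-d": None}
curve = solve_rate_curve(results, [1_000, 4_000, 16_000])
```

On this toy data the solve rate is 0.25, 0.5, and 0.75 at the three budgets, so a single reported number would depend entirely on which budget the evaluator happened to pick.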
Theme: Domain-grounded benchmarks are surfacing hidden-intent and abstention failures
- Why it matters: New domain benchmarks in shopping, diagnostics, legal extraction, healthcare, and pseudoscience show that realistic hidden constraints and abstention demands remain weak points even when generic capability looks strong.
- Representative papers:
- EComAgentBench: Benchmarking Shopping Agents on Long-Horizon Tasks with Distributed Hidden Intent
- DiagFlowBench: Evaluating How Language Models Handle Off-Procedure Inputs in Grounded Diagnostic Dialogue
- LegalHalluLens: Typed Hallucination Auditing and Calibrated Multi-Agent Debate for Trustworthy Legal AI
- PseudoBench: Measuring How Agentic Auto-Research Fuels Pseudoscience
- Common approach:
- Scatter requirements across multiple sources or turns rather than exposing all intent upfront.
- Use typed rubrics or typed hallucination categories to localize failure.
- Emphasize abstention, clarification, and provenance rather than only answer correctness.
- Audit judges and construction pipelines to improve reproducibility.
- Open questions / failure modes:
- Many datasets remain narrow in domain, geography, or task type.
- LLM judges are still central to grading in several settings.
- Synthetic personas or injections trade realism for control.
- Strong performance on visible constraints still does not imply recovery of hidden intent.
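Separating abstention from forced mapping takes only a little bookkeeping. The sketch below is a hypothetical scoring function, not DiagFlowBench's actual metric definitions: `Turn`, its fields, and the node names are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Turn:
    off_procedure: bool            # True if the input matches no valid node
    predicted_node: Optional[str]  # None means the model abstained or asked to clarify
    gold_node: Optional[str]       # None for off-procedure inputs


def grounded_dialogue_metrics(turns: List[Turn]) -> Dict[str, float]:
    off = [t for t in turns if t.off_procedure]
    on = [t for t in turns if not t.off_procedure]
    # Forced mapping: an off-procedure input was assigned to some valid node.
    forced = sum(1 for t in off if t.predicted_node is not None)
    correct = sum(1 for t in on if t.predicted_node == t.gold_node)
    return {
        "forced_mapping_rate": forced / len(off) if off else 0.0,
        "abstention_rate_off_procedure": (len(off) - forced) / len(off) if off else 0.0,
        "in_procedure_accuracy": correct / len(on) if on else 0.0,
    }


turns = [
    Turn(False, "check_power", "check_power"),
    Turn(False, "reset_router", "reset_router"),
    Turn(True, "check_power", None),   # off-procedure input forced onto a valid node
    Turn(True, None, None),            # off-procedure input correctly declined
]
metrics = grounded_dialogue_metrics(turns)
```

In this toy example in-procedure accuracy is perfect while half of the off-procedure inputs are force-mapped, which is the kind of gap a single correctness score would hide.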
3) Technical synthesis
- A common design pattern is layered decomposition: beliefs/forecasts/actions/utility, claim/source/support, rule/evidence/skill, or intent/behavior/abuse. This is replacing monolithic “agent score” evaluation.
- Several papers converge on gating before irreversible action: Cordon stages effects before commit, PARSE routes high-directiveness docs to heavier sanitization, the healthcare framework blocks diagnosis until the OLDCARTS symptom-history intake is complete, and PreAct verifies before storing reusable programs.
- Benchmark hygiene is a major theme: CheckMIABench uses checkpoint-based matched marginals; SSA, the harness from “Dissecting model behavior through agent trajectories”, identifies git-history leakage in SWE-Bench-Pro; multiple papers audit judge stability or leakage channels explicitly.
- There is a broad move from pooled evidence to source-specific verification: ProvenanceGuard checks routed support per source, LegalHalluLens types hallucinations by clause class, and DiagFlowBench distinguishes abstention from forced mapping.
- Trajectory-level analysis is becoming the preferred lens for agents: solution-distance, replay diagnostics, temporal preferences, phase schedules, and compute-scaling curves all reveal differences hidden by pass@1 or success rate.
- Many methods rely on conservative fail-closed policies: block on any failed claim, stage external effects, require verify-before-store, or use thresholded uncertainty to escalate.
- Inference-time compute is no longer just a cost variable; it is part of the capability definition. Token budgets, repeated submissions, loop count, adaptive CoT length, and KV retention all materially change outcomes.
- Several papers show non-monotonicity: more loops can hurt (LoopCoder-v2), a larger λ can reverse safety gains (AnchorKV), bigger batches can destabilize self-evolution (SEAGym), and in the healthcare framework, structured intake without uncertainty filtering can reduce accuracy.
- A recurring empirical lesson is that harness/interface choices create family-specific behavior: SSA uses family-aware adapters and reasoning nudges; skill evaluation and coding-benchmark papers argue harness variance can rival model variance.
- Across safety papers, the strongest gains often come from explicit structure plus lightweight learned components rather than end-to-end retraining: transaction runtimes, source routers + NLI + calibrators, directiveness gates, or refusal anchors.
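The move from pooled evidence to source-specific, fail-closed verification can be sketched as follows. This is not ProvenanceGuard's pipeline: `verify_claim` and `release_allowed` are hypothetical names, and `substring_supports` is a crude stand-in for the NLI-style support check a real system would use.

```python
from typing import Callable, Dict, List

Supports = Callable[[str, str], bool]  # (source_text, claim) -> is the claim supported?


def verify_claim(claim: str, cited: str, sources: Dict[str, str], supports: Supports) -> str:
    """Check a claim against the source it cites, not against pooled evidence."""
    if cited in sources and supports(sources[cited], claim):
        return "supported"
    # True somewhere, but not where the agent says it came from.
    if any(supports(text, claim) for sid, text in sources.items() if sid != cited):
        return "misattributed"
    return "unsupported"


def release_allowed(verdicts: List[str]) -> bool:
    """Fail closed: a single non-supported claim blocks the whole output."""
    return all(v == "supported" for v in verdicts)


def substring_supports(source_text: str, claim: str) -> bool:
    # Toy support check for illustration only.
    return claim.lower() in source_text.lower()


sources = {"crm": "Invoice 17 is overdue.", "wiki": "Refunds take 5 days."}
claims = [("invoice 17 is overdue", "crm"), ("refunds take 5 days", "crm")]
verdicts = [verify_claim(c, s, sources, substring_supports) for c, s in claims]
blocked = not release_allowed(verdicts)
```

A pooled check would accept both claims because each appears somewhere in the evidence; the per-source check marks the second as misattributed and blocks release.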
4) Top 5 papers (with “why now”)
- How Inference Compute Shapes Frontier LLM Evaluation
- Shows that benchmark scores can move substantially with larger token budgets, context compaction, and repeated submissions, especially on FrontierMath and HLE.
- Decomposes gains into reach, efficiency, and reliability, clarifying that newer models often improve by unlocking harder tasks rather than simply using tokens better.
- Useful now because frontier evaluation is increasingly compute-sensitive; single-budget scores are becoming poor proxies for real capability.
- Skepticism / limitation: results use one shared ReAct-style scaffold, so scaling curves may change under stronger elicitation or search strategies.
- Cordon: Semantic Transactions for Tool-Using LLM Agents
- Introduces a runtime abstraction that validates a whole task’s lineage, authority, and staged effects before commit rather than checking tool calls independently.
- On 45 correlated-risk workflows, Cordon intercepted 45/45 risky effects before commit, versus 14/45 for strategy adapters and 0/45 for plain execution.
- Useful now because agent deployments are moving from read-only copilots to stateful systems with irreversible side effects.
- Skepticism / limitation: guarantees only cover mediated and observable operations; opaque plugins or external side effects remain outside full containment.
- Dissecting model behavior through agent trajectories
- Provides both a practical harness (SSA) and a trajectory metric that surfaces family-specific behavior invisible to pass@1.
- Identifies a concrete benchmark-integrity issue—git-history leakage in SWE-Bench-Pro—that materially inflates some scores.
- Useful now because coding-agent evaluation is increasingly bottlenecked by harness quality and benchmark contamination, not just model quality.
- Skepticism / limitation: the solution-distance metric is textual rather than semantic, so equivalent fixes can still be mis-scored.
- PARSE: Provenance-Aware Retrieval Sanitization for Professional Domain LLM Agents
- Demonstrates that paraphrasing, a popular synthetic-benchmark defense, does not significantly reduce ASR on real enterprise documents while hurting utility.
- PARSE’s domain-aware, fact-preserving pipeline achieves the best reported ASR/utility tradeoff on a 122-task real-document benchmark.
- Useful now because enterprise RAG systems increasingly ingest long, authority-laden documents where prompt injection is semantically camouflaged.
- Skepticism / limitation: not tested against adaptive adversaries, and per-domain sample sizes are still too small for well-powered comparisons.
- PseudoBench: Measuring How Agentic Auto-Research Fuels Pseudoscience
- Benchmarks full auto-research systems end-to-end on pseudoscientific claim–evidence pairs and finds high pseudoscientific capability with near-zero refusal.
- Shows that stronger systems can produce more polished and persuasive pseudoscientific reports, not just more capable benign outputs.
- Useful now because research agents are moving from note-taking to autonomous experiment/report generation, raising a new class of epistemic safety risk.
- Skepticism / limitation: the benchmark is intentionally narrow and judge-scored, so it measures a focused failure mode rather than the full spectrum of scientific misuse.
5) Practical next steps
- Add process-level telemetry to agent evals: store trajectories, tool errors, replay traces, intermediate beliefs, and per-step verifier outputs rather than only final success.
- Report capability as a function of inference compute for any frontier benchmark you publish: at minimum vary token budget, retries, and parallel-vs-serial allocation.
- For tool-using agents, prototype a task-level commit boundary: stage external effects, preserve lineage, and require validation before release.
- In RAG or MCP systems, move from pooled factuality checks to claim-by-source verification and explicitly flag supported-but-misattributed claims.
- Re-test prompt-injection defenses on real enterprise documents, not just synthetic snippets; measure both ASR and utility retention.
- Add benchmark hygiene checks: blind baselines for privacy/security tasks, leakage audits for coding benchmarks, and judge-stability audits where LLM judges are used.
- For repeated workflows, implement verify-before-store replay or other conservative memory insertion rules rather than caching successful traces blindly.
- Track abstention and forced-mapping behavior separately in grounded assistants; low fabrication alone is not enough if the model confidently maps off-procedure inputs to wrong valid nodes.
- If deploying compression or adaptive reasoning, include safety regressions in systems tuning: KV compression, loop depth, and reasoning truncation should be evaluated on jailbreak/abstention metrics, not only utility.
- For self-improving agents, separate active rules from preserved evidence, and evaluate over frozen snapshots with replay and OOD transfer to catch regressions early.
Generated from per-paper analyses; no external browsing.