June 24, 2026 Research Brief

Agent safety gets operational.

Today’s strongest papers replace answer-only evaluation and static guardrails with verifiable agent checks, runtime authorization, and privacy-aware controls built for real enterprise environments.

Takeaways

  1. Agent evaluation is shifting from answer-only scoring toward **state-, policy-, and trajectory-aware verification**. Several papers show that plausible final outputs can still be invalid because of access violations, missing evidence, contextual privacy leaks, or unjustified tool use.
  2. A strong theme today is **enterprise/production realism**: benchmarks now emphasize real workspaces, EHR databases, terminal/GUI environments, and multimodal moderation policies rather than toy tasks. This raises the bar for reproducibility, artifact delivery, and cost-aware evaluation.
  3. **Memory is emerging as both a capability and a risk surface**. New work shows proactive memory is much weaker than retrospective recall, procedural memory can over-specialize, and shared memory can propagate evaluator bias across time.
#1

Start with: GroundEval: A Deterministic Replacement for LLM-as-Judge in Stateful Agent Evaluation

Why it catches my eye: It offers a reusable way to verify whether an agent’s answer was legitimately produced, not just plausible.

Read skeptically for: It relies on synthetic validation and substantial upfront contract and policy authoring.

agent-evaluation verification stateful-agents reliability

Themes

Verifiable evaluation for stateful and enterprise agents Agent failures increasingly come from invalid evidence paths, missing artifacts, harness incompatibilities, or execution mistakes rather than wrong final text alone. Evaluation is becoming more like systems testing: auditable, executable, and artifact-aware.
Runtime safety controls for agents, tools, and multimodal systems Static permissions and fixed moderation taxonomies are not enough once agents act across tools, apps, images, and changing policies. The trend is toward runtime conditioning on user intent, active policy, and information flow.
Privacy and contextual leakage in multi-app, multi-party agents As assistants operate across apps and group contexts, privacy failures are no longer just “did the model reveal a secret?” but “did it reveal the right information to the right recipient in the right context?”.
Signal Agent eval is becoming verifiable. GroundEval, EnterpriseClawBench, EHR-Complex, and HOLMES all score process validity, artifacts, or executable traces rather than final text alone.
Tension Safety proxies keep overstating confidence. Evaluation-awareness, self-report, knowledge-editing, and LLM-judge papers all show plausible safety signals can fail under adversarial or deployment framing.
Bet Runtime controls will beat static guardrails. Intent-governed authorization, policy-adaptive moderation, and geometric information-flow control all constrain actions or disclosures using live context.

Papers Worth Your Reading Time

Ranked for research usefulness: novelty, method pattern, evidence quality, and skepticism value.

GroundEval: A Deterministic Replacement for LLM-as-Judge in Stateful Agent Evaluation

#1

Useful if you evaluate agents with tools or memory: it separates correct-looking answers from valid evidence paths.

Why now
Enterprise agents need regression gates that catch policy and trajectory failures, not just bad outputs.
Skepticism
Synthetic validation and authoring overhead may limit immediate transfer to messy production settings.

Intent-Governed Tool Authorization for AI Agents

#2

A concrete server-side design for narrowing tool permissions based on intent before unsafe actions execute.

Why now
Tool-using agents are becoming effectful systems, and static scopes are too coarse for safe deployment.
Skepticism
The evidence is benchmark-scale and mechanism-focused rather than a fully hardened production rollout.

Evaluation Awareness Is Not One Capability: Evidence from Open Language Models

#3

It challenges the assumption that benchmarked safety behavior cleanly transfers to deployment behavior.

Why now
Release and governance decisions still lean heavily on benchmark results that may be framing-sensitive.
Skepticism
Some mechanistic analyses cover only a subset of models, and formatting artifacts remain hard to isolate.

Chinese version: [中文]

Run stats

  • Candidates: 311
  • Selected: 30
  • Deepread completed: 30
  • Window (UTC): 2026-06-22T00:00:00Z → 2026-06-23T00:00:00Z (arxiv_announce, expanded=0)
Show selected papers
arXiv IDTitle / LinksCategoriesScoreWhyTags
2606.23277GIF: Locally Sound Geometric Information Flow Control for LLMs
PDF
cs.AI95Principled IFC for LLM agents using Jacobian geometry; directly targets prompt injection and data leakage.agent-safety, prompt-injection, information-flow-control, privacy, LLM-security
2606.22916Intent-Governed Tool Authorization for AI Agents
PDF
cs.AI95Server-side intent-based tool authorization directly targets agent permission misuse.agent-safety, tool-use, authorization, access-control, security
2606.23276Exposing the Illusion of Erasure in Knowledge Editing for LLMs
PDF
cs.LG, cs.AI, cs.CR95Shows KE often suppresses rather than erases facts; strong safety relevance for model editing.LLM, knowledge-editing, safety, mechanistic-interpretability, adversarial-evaluation
2606.23075Safety in Self-Evolving LLM Agent Systems: Threats, Amplification, and Case Studies
PDF
cs.CR, cs.AI94Systematic threat model for self-evolving agents; high safety relevance with concrete attack-surface framework.agent-safety, self-evolving-agents, threat-modeling, security, privacy
2606.23217MuPPET: A Benchmark for Contextual Privacy of LLM Assistants in Multi-Party Conversations
PDF
cs.CL, cs.AI93Benchmark exposes privacy leakage in multi-party LLM assistant settings, a key deployment risk.privacy, benchmark, multi-agent, multi-party, evaluation, llm-safety
2606.23189Capable but Careless: Do Computer-Use Agents Follow Contextual Integrity?
PDF
cs.AI, cs.CL92Executable benchmark for privacy failures in computer-use agents across apps and contexts.agent-safety, privacy, computer-use-agents, benchmark, evaluation
2606.23583Evaluation Awareness Is Not One Capability: Evidence from Open Language Models
PDF
cs.CL92Shows evaluation awareness shifts safety behavior, undermining benchmark-to-deployment validity.evaluation, safety, benchmarking, distribution-shift, jailbreaks, reliability
2606.23496TROPT: An Open Framework for Unifying and Advancing Discrete Text Optimization
PDF
cs.LG, cs.CR92Open framework for text-trigger optimization; directly useful for jailbreak red-teaming and auditing.security, red-teaming, jailbreaks, optimization, framework
2606.22737GroundEval: A Deterministic Replacement for LLM-as-Judge in Stateful Agent Evaluation
PDF
cs.AI, cs.CL, cs.SE91Deterministic grounded evaluation for stateful agents; strong alternative to unreliable LLM-as-judge.agent-evaluation, grounding, benchmark, auditing, reliability
2606.23321Tmax: A simple recipe for terminal agents
PDF
cs.CL91Strong open RL recipe for terminal agents with benchmark gains and reusable data generation pipeline.agents, RL, terminal-agents, benchmarks, training
2606.23375Measuring & Mitigating Over-Alignment for LLMs in Multilingual Criminal Law Courts
PDF
cs.CL, cs.AI91Introduces benchmark for over-alignment refusals in legal use; practical reliability and guardrail tradeoff.LLM, alignment, over-alignment, benchmark, legal-ai, multilingual
2606.23416Detecting Malicious Agent Skills in the Wild using Attention
PDF
cs.CR, cs.AI90Targets malicious third-party agent skills, a realistic supply-chain risk not covered by prompt defenses.agent-safety, supply-chain-security, malicious-skills, detection, attention
2606.23671Can LLMs Reliably Self-Report Adversarial Prefills, and How?
PDF
cs.CL90Finds LLMs cannot reliably self-detect adversarial prefills; useful for attack monitoring limits.adversarial-attacks, self-monitoring, introspection, safety, robustness
2606.22883CLI-Universe: Towards Verifiable Task Synthesis Engine for Terminal Agents
PDF
cs.AI89Verifiable task synthesis engine for terminal agents addresses a key data bottleneck for agent training.agents, data-generation, terminal-agents, evaluation, verification
2606.22792The Origins of Stochasticity: Comprehensive Investigations on Uncertainty Quantification for Large Language Models
PDF
cs.AI89Granular taxonomy of LLM uncertainty sources and UQ methods; useful reliability framing despite unclear results.LLM, uncertainty, calibration, reliability, survey, taxonomy
2606.22873SingGuard: A Policy-Adaptive Multimodal LLM Guardrail with Dynamic Reasoning
PDF
cs.CV, cs.CL88Policy-adaptive multimodal guardrail with runtime rules; useful for changing deployment policies.guardrails, multimodal, VLM-safety, policy-adaptation, moderation
2606.23654EnterpriseClawBench: Benchmarking Agents from Real Workplace Sessions
PDF
cs.CL, cs.SE88Real enterprise agent benchmark from workplace sessions with strong realism and reusable protocol.agents, benchmark, enterprise, evaluation, tool-use, workflows
2606.23195Memory Contagion: Cross-Temporal Propagation of Evaluator Bias via Agent Memory
PDF
cs.LG, cs.AI, cs.CL88Identifies bias propagation through agent memory, a novel safety risk for long-lived agent systems.agent-safety, memory, bias, evaluation, reliability
2606.23238HOLMES: Evaluating Higher-Order Logical Reasoning in LLMs
PDF
cs.AI88New benchmark for higher-order logical reasoning with verifiable traces in law and finance.LLM, reasoning, benchmark, evaluation, higher-order-logic
2606.23104ReNIO: Reweighting Negative Trajectory Importance for LLM On-Policy Distillation
PDF
cs.LG, cs.AI88Novel LLM distillation insight: incorrect trajectories can improve reasoning and reflection.LLM, reasoning, distillation, on-policy-training, post-training
2606.22841IndicGuard: A Multilingual Safety Guard Model and Dataset for Indic Languages
PDF
cs.CL, cs.LG87Fills major gap in multilingual safety with Indic jailbreaks and culturally grounded guard model.multilingual-safety, guard-model, jailbreaks, dataset, Indic-languages
2606.23459TriggerBench: Investigating Prospective Memory for Large Language Models
PDF
cs.CL87Prospective-memory benchmark targets long-horizon agent reliability beyond standard recall tests.memory, agents, benchmark, long-context, reliability, evaluation
2606.23127Managing Procedural Memory in LLM Agents: Control, Adaptation, and Evaluation
PDF
cs.AI, cs.CL, cs.SE87Benchmark and framework for procedural memory transfer in workplace agents with cross-model evaluation.agents, memory, benchmark, enterprise, evaluation
2606.23130Understanding the (In)Security of Vibe-Coded Applications
PDF
cs.CR, cs.SE86Systematic security study of vibe-coded apps addresses a fast-growing AI-assisted coding risk.security, code-generation, agents, software-engineering, risk-assessment
2606.22778HAKARI-Bench: A Lightweight Benchmark for Comparing Retrieval Architectures and Efficiency Settings under Unified Conditions
PDF
cs.IR, cs.CL86Reusable lightweight multilingual retrieval benchmark for same-condition RAG architecture/efficiency comparison.retrieval, RAG, benchmark, multilingual, efficiency, evaluation
2606.22948ENVS: Environment-Native Verified Search for Long-Horizon GUI Agents
PDF
cs.AI, cs.CV85Verified search for long-horizon GUI agents plus noisy benchmark; strong agent capability/eval contribution.GUI-agents, search, benchmark, long-horizon, training
2606.23525Self-Compacting Language Model Agents
PDF
cs.CL85Model-controlled context compaction targets long-horizon agent reliability and context-window limits.agents, long-context, memory, efficiency, reliability
2606.22966Attacking the Trusted Imagination: Oracle-Level Integrity Attacks on Imagine-then-Act World Models
PDF
cs.LG, cs.AI, cs.CR84Identifies integrity attacks on world-model imagination, relevant to safety layers around embodied agents.world-models, robotics-safety, integrity-attacks, VLA, security
2606.23301EHR-Complex: Benchmarking Medical Agents for Complex Clinical Reasoning
PDF
cs.AI84Large interactive benchmark for clinical agents with sandboxed SQL/Python execution tasks.agents, benchmark, medical, tool-use, sandbox, evaluation
2606.23060From Text Metrics to Model Internals: A Study of Whisper ASR Hallucination Detection
PDF
cs.SD, cs.AI, eess.AS84Studies ASR hallucination detection via internals; probing beats text/LLM methods on real data.hallucination, ASR, detection, model-internals, reliability

AI Paper Insight Brief

2026-06-24

0) Executive takeaways (read this first)

  • Agent evaluation is shifting from answer-only scoring toward state-, policy-, and trajectory-aware verification. Several papers show that plausible final outputs can still be invalid because of access violations, missing evidence, contextual privacy leaks, or unjustified tool use.
  • A strong theme today is enterprise/production realism: benchmarks now emphasize real workspaces, EHR databases, terminal/GUI environments, and multimodal moderation policies rather than toy tasks. This raises the bar for reproducibility, artifact delivery, and cost-aware evaluation.
  • Memory is emerging as both a capability and a risk surface. New work shows proactive memory is much weaker than retrospective recall, procedural memory can over-specialize, and shared memory can propagate evaluator bias across time.
  • On safety/security, the field is moving from static taxonomies to runtime-conditioned controls: policy-adaptive guardrails, intent-governed tool authorization, contextual-integrity benchmarks, and geometric information-flow control all aim to constrain what agents may reveal or do in context.
  • For training agents, the most effective recipes today appear to favor verified data and structured filtering over raw scale: executable task synthesis, environment-verified search, and trajectory reweighting all report sizable gains with relatively compact but high-fidelity supervision.
  • Several papers warn that common safety signals are brittle: LLM-as-judge, self-reports, benchmark framing, and knowledge-editing “erasure” can all create false confidence if not stress-tested adversarially.

2) Key themes (clusters)

Theme: Verifiable evaluation for stateful and enterprise agents

  • Why it matters: Agent failures increasingly come from invalid evidence paths, missing artifacts, harness incompatibilities, or execution mistakes rather than wrong final text alone. Evaluation is becoming more like systems testing: auditable, executable, and artifact-aware.
  • Representative papers:
  • Common approach:
    • Replace free-form judging with machine-checkable contracts, executable SQL/tests, or solver-verified traces.
    • Score both end outputs and the process that produced them: evidence retrieval, artifact delivery, trace faithfulness, or execution success.
    • Build benchmarks from realistic substrates: enterprise sessions, full MIMIC-IV, legal/finance rule systems, persistent workspaces.
    • Report multidimensional metrics instead of a single score: utility, leakage, cost, runtime, modality-specific quality, consistency.
  • Open questions / failure modes:
    • Authoring burden is high for contracts, policies, rubrics, and executable environments.
    • Visual/multimodal judging remains weaker than text judging in enterprise settings.
    • Realism often trades off with openness: some datasets rely on internal sessions or protected data.
    • Many results are environment-specific, so cross-domain transfer of benchmark conclusions is still uncertain.

Theme: Runtime safety controls for agents, tools, and multimodal systems

  • Why it matters: Static permissions and fixed moderation taxonomies are not enough once agents act across tools, apps, images, and changing policies. The trend is toward runtime conditioning on user intent, active policy, and information flow.
  • Representative papers:
  • Common approach:
    • Condition decisions on runtime inputs: intent certificates, natural-language policy, or span-level flow scores.
    • Separate fast-path screening from slower, more interpretable reasoning or review modes.
    • Add deterministic gates around model outputs: manifest filtering, payload checks, declassification, or policy-following evaluation.
    • Expand safety coverage beyond English text to multimodal and multilingual settings.
  • Open questions / failure modes:
    • Policy ambiguity remains a core bottleneck; unclear rules can produce inconsistent moderation or authorization.
    • Hybrid/early-exit systems depend on confidence calibration that may fail under shift.
    • Mechanism-level containment is stronger than deployment-level evidence in several papers.
    • Synthetic or translated safety data may miss evolving slang, local harms, or organic abuse patterns.

Theme: Privacy and contextual leakage in multi-app, multi-party agents

  • Why it matters: As assistants operate across apps and group contexts, privacy failures are no longer just “did the model reveal a secret?” but “did it reveal the right information to the right recipient in the right context?”.
  • Representative papers:
  • Common approach:
    • Operationalize privacy as recipient- and context-specific disclosure constraints rather than generic toxicity/safety labels.
    • Use adversarial scenario generation or large-scale wild collection to surface realistic failure cases.
    • Evaluate both utility and leakage, often with refusal-adjusted or engagement-conditioned metrics.
    • Combine cheap localization with expensive review for scalable scanning of large corpora or marketplaces.
  • Open questions / failure modes:
    • Stress-test benchmarks may overstate absolute real-world leakage rates even if they rank systems well.
    • Prompt-only mitigations help but do not solve access-control or memory-isolation problems.
    • Group privacy and cross-app privacy remain hard even for strong models.
    • Supply-chain risks from third-party skills and vibe-coded apps suggest failures often originate in workflow design, not just model outputs.

Theme: Memory as capability, bottleneck, and attack surface

  • Why it matters: Long-horizon agents increasingly depend on memory, but today’s systems struggle to decide what to remember, when to act on it, and how to prevent memory from becoming a source of bias or stale context.
  • Representative papers:
  • Common approach:
    • Distinguish memory subproblems: proactive recall, reusable procedural skills, shared external memory, and context compaction.
    • Evaluate transfer explicitly across tasks, roles, models, or contamination rates rather than only in-context gains.
    • Use write-time curation, versioned skills, or rubric-gated summarization to improve memory quality.
    • Measure both capability and side effects: token cost, over-specialization, false alarms, or bias propagation.
  • Open questions / failure modes:
    • Prospective memory is much weaker than retrospective recall, especially under long contexts or overloaded triggers.
    • Procedural skills can overfit to role-specific workflows and hurt transfer.
    • Shared memory can propagate evaluator bias even with perfect consolidation.
    • Summarization/compaction helps, but rubric design and generalization beyond tested tasks remain open.

Theme: High-fidelity data generation and training for terminal/GUI agents

Theme: False confidence from judges, self-reports, edits, and benchmark framing

3) Technical synthesis

  • A recurring design pattern is dual scoring: separate final-answer correctness from process validity, whether that process is evidence retrieval (GroundEval), disclosure behavior (AGENTCIBENCH), artifact delivery (EnterpriseClawBench), or reasoning trace fidelity (HOLMES).
  • Several papers replace monolithic evaluation with structured contracts or policies: GroundEval’s machine-readable contracts, IGAC’s intent certificates, SingGuard’s runtime policy input, and GIF’s span-to-sink flow scoring all formalize what the model is allowed to know or do.
  • Executable verification is becoming the preferred supervision source for agents: CLI-Universe uses fail-to-pass tests, ENVS uses environment oracles, EHR-Complex uses SQL execution, and terminal RL work uses containerized verifiers.
  • There is a clear move from static benchmarking to stress-tested, contrastive evaluation: Pos-Clean/Pos-Over/Neg-Clean in TriggerBench, one-on-one vs multi-party in MuPPET, state-grounded vs end-to-end in AGENTCIBENCH, and eval-vs-deploy framing in evaluation-awareness work.
  • Multiple papers show that retrieval/memory architecture matters as much as model size: Letta-Sim beats embedding-only memory on prospective memory; diverse multi-model traces improve procedural skill transfer; memory stores can propagate evaluator bias even under oracle consolidation.
  • Fast/slow decomposition appears in both safety and capability work: SingGuard’s fast/hybrid/slow moderation, SELFCOMPACT’s near-free rubric probes plus occasional summarization, and IGAC’s review-mode lattice all trade latency for auditability.
  • Several results suggest negative or failure trajectories are highly informative: ReNIO upweights incorrect student outputs; CLI-Universe filters for tasks where hints matter; ENVS learns from verified successful branches discovered through broad search over failures.
  • Mechanistic analysis is increasingly tied to deployment claims: refusal-direction ablation for self-reporting, Fisher/Jacobian geometry for information flow, low-rank subspace arguments for knowledge editing, and activation steering for evaluation awareness.
  • Across retrieval, moderation, and agent benchmarks, authors increasingly report cost-quality frontiers rather than raw accuracy: HAKARI-Bench for retrieval efficiency, GIF for token-cost reduction, SELFCOMPACT for cost savings, and EnterpriseClawBench for harness/model trade-offs.
  • A common failure mode is proxy mismatch: RM does not imply PM, answer plausibility does not imply valid evidence, benchmark compliance does not imply deployment safety, and self-report does not imply introspective reliability.

4) Top 5 papers (with “why now”)

  • GroundEval: A Deterministic Replacement for LLM-as-Judge in Stateful Agent Evaluation
    • Introduces a judge-free framework for evaluating stateful agents using event logs, artifact corpora, access policy, and eval config.
    • Separates answer correctness from trajectory validity across Perspective, Counterfactual, and Silence tracks.
    • Shows concrete cases where external LLM judges gave ~0.85–0.90 plausibility scores while GroundEval assigned zero answer score because required evidence was never legitimately retrieved.
    • Why now: as enterprise agents gain memory, tools, and long-lived state, answer-only eval is no longer a safe regression gate.
    • Skepticism: validated on a synthetic corpus and requires nontrivial upfront authoring of machine-readable policies/configs.
  • Intent-Governed Tool Authorization for AI Agents
    • Formalizes “intent-tool mismatch” and proposes server-side intent certificates, manifest filtering, and payload consistency checks.
    • Shows monotone narrowing of static permissions and reports zero unsafe executed side effects in runtime benchmarks, though some unsafe accepted authority remains in draft/preflight paths.
    • Provides concrete runtime endpoints and rollout modes, making it unusually deployment-oriented.
    • Why now: tool-using agents are moving from read-only copilots to effectful systems, and static OAuth scopes are clearly insufficient.
    • Skepticism: evidence is mechanism-level and benchmark-scale; runtime artifact is minimal rather than production-hardened.
  • SingGuard: A Policy-Adaptive Multimodal LLM Guardrail with Dynamic Reasoning
    • Builds a multimodal guardrail that conditions on runtime policy and supports fast, hybrid, and slow reasoning modes.
    • Pairs this with a large policy-conditioned corpus and SingGuard-Bench, including dynamic-rule and cross-modal hidden-intent cases.
    • Reports strong macro-F1 across multimodal, image-only, and text safety tasks, plus a policy-following gain from 0.6465 to 0.7415 on dynamic-rule evaluation.
    • Why now: moderation policies are changing faster than static taxonomies, especially for multimodal products and regional deployments.
    • Skepticism: performance depends on policy clarity, and synthetic/model-assisted data may encode teacher bias.
  • CLI-Universe: Towards Verifiable Task Synthesis Engine for Terminal Agents
    • Presents a structured synthesis pipeline for terminal-agent tasks grounded in real technical materials and filtered by executable verification.
    • Distills 6,000 verified trajectories and shows a Qwen3-32B fine-tuned model reaches 33.4% on Terminal-Bench 2.0, outperforming other open-source models ≤32B.
    • Ablations show each verification component matters, with 3–6 point drops when removed.
    • Why now: terminal-agent progress is increasingly bottlenecked by trustworthy training data rather than model architecture alone.
    • Skepticism: still dependent on LLM-based synthesis and some categories show limited gains.
  • Evaluation Awareness Is Not One Capability: Evidence from Open Language Models
    • Runs a broad 37-model study showing evaluation awareness is multidimensional: detection, behavioral adaptation, and controllability can dissociate.
    • Finds instruction tuning is a stronger driver than scale, and framing can materially change refusal/compliance behavior.
    • Adds mechanistic evidence via probes and activation steering, showing latent eval/deploy signals can persist even when behavior changes.
    • Why now: benchmark-based governance and release decisions increasingly assume measured safety transfers to deployment.
    • Skepticism: some deep behavioral analyses are concentrated on a smaller subset of models, and format artifacts remain hard to fully disentangle.

5) Practical next steps

  • Add a trajectory-validity layer to agent evals: log artifact IDs, timestamps, access scopes, and tool traces so you can score “could the agent legitimately know this?” separately from answer quality.
  • For tool-using agents, prototype intent-bounded authorization: hide tools before planning, enforce payload bounds at preflight/action time, and log reason codes for every denial.
  • If you run long-horizon agents, test prospective memory explicitly rather than relying on retrieval QA. Add trigger-based evaluations and compare RM vs PM to find hidden failure modes.
  • Treat shared memory as a security/fairness surface: audit stored traces for evaluator-style bias, contamination rate, and retrieval amplification; consider quarantine or debiasing before consolidation.
  • Prefer verified synthetic data over raw synthetic volume for agent training: executable tests, fail-to-pass checks, and environment oracles appear to deliver better sample efficiency.
  • For moderation and privacy, move from static labels to contextual policies: recipient-aware disclosure checks, runtime policy inputs, and multilingual/cross-modal evaluation.
  • Stress-test any safety proxy you rely on—LLM judges, self-reports, benchmark framing, or knowledge editing—with contrastive prompts and adversarial probes before using it as a release gate.
  • Track cost-quality frontiers as first-class metrics. Several papers show meaningful gains come from better routing, compaction, and candidate filtering rather than larger models alone.

Generated from per-paper analyses; no external browsing.