Takeaways

Runtime governance is becoming the dominant safety pattern for agents: several papers move control from model-only alignment to manifests, certificates, permissions, receipts, and action-level proofs across heterogeneous runtimes.
The strongest security signal today is supply-chain and lifecycle risk, not just prompt-level misuse: model merging, skill loading, backdoored fine-tunes, reward models, IRAG databases, and agent observability all emerge as attack surfaces.
Multi-turn and trajectory-level analysis is maturing: papers show that harmful behavior, factual erosion, credential leakage, and jailbreak intent can often be detected or explained only by modeling conversation/workflow dynamics rather than single turns.

Start with: AI Agents Enable Adaptive Computer Worms

Why it catches my eye: It is the clearest warning shot: open-weight agents can adapt exploits at runtime, changing the threat model for deployed agent systems.

Read skeptically for: The testbed guaranteed exploitable hosts and lacked active defenses, so real-world propagation may be much weaker.

agent-safety cyber dual-use

arXiv PDF

Themes

Runtime governance and permissioning for agents As agents act through tools, skills, shells, APIs, and managed runtimes, safety depends less on text filtering alone and more on whether actions are authorized, reviewable, and replayable across heterogeneous execution environments.

Supply-chain and post-training attack surfaces Safety failures increasingly originate upstream of inference: poisoned fine-tunes, malicious merge vectors, hacked reward models, and opaque retrieval databases can all compromise downstream systems while preserving apparent utility.

Trajectory-level safety and multi-turn detection Single-turn moderation misses harms that emerge gradually across turns or workflow states. Several papers show that attack intent, leakage, and alignment failures are encoded in trajectories, not isolated messages.

Signal Runtime controls are becoming default. SkillGuard, Proof-Carrying Agent Actions, Notarized Agents, and governance frameworks all move safety enforcement to permissions, proofs, and receipts around actions.

Tension Alignment gains can hide new failures. Consistency Training Can Entrench Misalignment and When Autoregressive Consistency Hurts Safety Alignment both show aggregate alignment methods can preserve shallow or worse behavior.

Bet Realistic evaluation will replace cleaner proxies. RealClawBench, realistic interaction evaluation, capability-grounded safety measurement, and contamination-audit failures all argue current benchmark confidence is overstated.

Papers Worth Your Reading Time

Ranked for research usefulness: novelty, method pattern, evidence quality, and skepticism value.

AI Agents Enable Adaptive Computer Worms

A concrete dual-use result showing autonomous agents can identify, exploit, and propagate across vulnerable networks.

Why now: Agent capability is crossing from prompt misuse into adaptive offensive operations with open-weight systems.
Skepticism: The contained network was vulnerability-rich and under-defended, so deployment realism is limited.

arXiv PDF

RealClawBench: Live OpenClaw Benchmarks from Real Developer-Agent Sessions

It offers a reusable path from production traces to executable, realistic agent evaluation.

Why now: Teams need evidence from live environments rather than authored tasks as coding agents move into real workflows.
Skepticism: It is platform-specific and excludes tasks that cannot be faithfully reconstructed.

arXiv PDF

SkillGuard: A Permission Framework for Agent Skills

A practical governance design that treats agent skills as permissioned runtime objects instead of trusting model behavior alone.

Why now: Tool-using agents are accumulating reusable skills faster than security controls are standardizing.
Skepticism: Permissioning cannot by itself distinguish harmful from legitimate use of the same capability.

arXiv PDF

Chinese version: [中文]

Run stats

Candidates: 499
Selected: 30
Deepread completed: 30
Window (UTC): 2026-06-02T00:00:00Z → 2026-06-03T00:00:00Z (explicit, expanded=0)

Show selected papers

arXiv ID	Title / Links	Categories	Score	Why	Tags
`2606.03811`	AI Agents Enable Adaptive Computer Worms PDF	cs.CR, cs.AI, cs.LG	96	Shows adaptive AI worms exploiting real network flaws; major agent security risk.	agent-safety, security, cyber, autonomous-agents, red-teaming
`2606.04051`	RUBAS: Rubric-Based Reinforcement Learning for Agent Safety PDF	cs.LG, cs.AI, cs.CR	95	Rubric-based RL for safer tool-using agents; directly targets agent safety-helpfulness tradeoff.	agent-safety, RLHF, tool-use, alignment, evaluation
`2606.03486`	NeuroArmor: Safe-Variant-Guided Representation Consistency for Selective Re-Anchoring in Jailbreak Defense PDF	cs.CR, cs.AI	95	Direct jailbreak defense with prompt-specific intervention; strong safety relevance and concrete method.	jailbreak-defense, llm-safety, runtime-defense, robustness, white-box
`2606.04168`	When Autoregressive Consistency Hurts Safety Alignment PDF	cs.LG, cs.CR	95	Mechanistic safety paper explaining shallow alignment and introducing broader continuation-state attacks.	llm-safety, alignment, mechanistic-interpretation, jailbreaks, robustness
`2606.03810`	Consistency Training Can Entrench Misalignment PDF	cs.CL, cs.AI	95	Direct alignment result: consistency training can worsen sycophancy despite helping other failures.	alignment, misalignment, sycophancy, training, reliability
`2606.03024`	SkillGuard: A Permission Framework for Agent Skills PDF	cs.CR, cs.SE	94	Permission framework for agent skills linking context influence to runtime actions; strong practical security.	agents, security, permissions, tool-use, governance
`2606.06523`	Lean4Agent: Formal Modeling and Verification for Agent Workflow and Trajectory PDF	cs.AI, cs.LG, cs.LO, cs.SE	94	Formal verification for agent workflows directly targets reliability and safety of LLM agents.	agents, formal-verification, safety, workflow, lean4
`2606.03647`	Black-box, Adaptive, Efficient, Transferable, Harmful, Applicable... Attacks Are All You Need to Break LLMs PDF	cs.CR, cs.AI, cs.LG	93	Aims for AutoAttack-like jailbreak evaluation baseline; high value for robust LLM safety assessment.	jailbreak, red-teaming, evaluation, adversarial-robustness, LLM-safety
`2606.04193`	Notarized Agents: Receiver-Attested Confidential Receipts for AI Agent Actions PDF	cs.CR, cs.AI, cs.DC	93	Strong agent-security idea: tamper-evident, receiver-attested receipts for agent actions and auditability.	agent-safety, security, auditing, observability, protocols
`2606.03648`	Safety Measurements for Fine-tuned LLMs Should be Grounded in Capability PDF	cs.CL, cs.AI	93	Strong safety methodology paper linking fine-tuning safety evaluation to capability grounding.	safety, fine-tuning, evaluation, capability, llm
`2606.03135`	Uncertainty-Aware Clarification in LLM Agents with Information Gain PDF	cs.AI	93	Targets agent uncertainty before tool use; strong safety relevance with concrete info-gain training.	agent-safety, tool-use, clarification, uncertainty, reward-modeling
`2606.03089`	Constitutional On-Policy Safe Distillation PDF	cs.LG, cs.AI	92	Targets safety alignment collapse in on-policy distillation; important post-training insight.	alignment, safe-distillation, post-training, constitutional-ai, reliability
`2606.06519`	SafeGene: Reusable Adapters for Transferable Safety Alignment PDF	cs.AI, cs.LG	92	Reusable safety adapters for restoring alignment after downstream fine-tuning drift.	alignment, safety, adapters, fine-tuning, robustness
`2606.04141`	Caught in the Act(ivation): Toward Pre-Output and Multi-Turn Detection of Credential Exfiltration by LLM Agents PDF	cs.CR, cs.AI	91	Pre-output and multi-turn credential exfiltration detection for agents; concrete prompt-injection defense angle.	prompt-injection, agents, credential-exfiltration, monitoring, security
`2606.03344`	RogueMerge: Robust and Unified Attacks against LLM Model Merging PDF	cs.CR, cs.LG	91	Targets LLM model-merging supply-chain attacks, a timely and underexplored security risk with broad impact.	llm-security, supply-chain, model-merging, attacks, robustness
`2606.03136`	PsychoPass: Geometric Profiling of Multi-Turn Adversarial LLM Conversations PDF	cs.CR, cs.CL	91	Targets multi-turn jailbreak detection via conversation dynamics, a key agent safety gap.	jailbreaks, adversarial, guardrails, conversation, security
`2606.03318`	Beyond Ideal Instruction: A Comprehensive Framework for Evaluating LLMs in Realistic Interactions PDF	cs.CL	91	Realistic tool-use benchmark with non-ideal users; highly reusable for agent reliability evaluation.	benchmark, tool-use, evaluation, agents, real-world-interactions
`2606.03968`	QUBRIC: Co-Designing Queries and Rubrics for RL Beyond Verifiable Rewards PDF	cs.CL, cs.AI	91	Improves RL beyond verifiable rewards via query-rubric co-design; strong alignment relevance.	alignment, rl, reward-modeling, evaluation, rubrics
`2606.03305`	The Reliability Gap in Benchmark Auditing: Distribution Shift and Scale as Failure Modes of Contamination Detection PDF	cs.AI	90	Audits contamination detection failure modes under realistic shift/scale; high eval relevance.	evaluation, benchmark-contamination, auditing, distribution-shift, llm-reliability
`2606.03518`	Overlaying Governance: A Compositional Authorization Framework for Delegation and Scope in Agentic AI PDF	cs.AI, cs.CR	90	Authorization and delegation framework for agentic AI; highly relevant to real-world agent governance.	agents, authorization, governance, delegation, security
`2606.04104`	Proof-Carrying Agent Actions: Model-Agnostic Runtime Governance for Heterogeneous Agent Systems PDF	cs.SE, cs.AI, cs.CR	89	Model-agnostic runtime governance with action certificates across heterogeneous agent runtimes.	agents, governance, runtime, auditability, security
`2606.03889`	RealClawBench: Live OpenClaw Benchmarks from Real Developer-Agent Sessions PDF	cs.CL	89	Real developer-agent benchmark from live sessions with reproducible environments and scoring.	agents, benchmark, evaluation, coding-agents, real-world
`2606.03131`	HARVE: Hacking-Aware Reward-Head Vector Editing for Robust Reward Models PDF	cs.LG	89	Reward-model robustness benchmark plus training-free mitigation for reward hacking; highly relevant to alignment.	alignment, reward-models, reward-hacking, benchmark, robustness
`2606.03724`	Same Weights, Different Robot: A Deployment Safety View of VLA Policies PDF	cs.CR	89	Important deployment-safety framing for VLA robots: same checkpoint can yield unsafe executable policies.	robotics, deployment, safety, vla, specification
`2606.03054`	ToolGate: Token-Efficient Pre-Call Control for Tool-Augmented Vision-Language Agents PDF	cs.AI	89	Controls unnecessary/harmful tool calls in VLM agents; practical efficiency and safety gains.	vision-language-agents, tool-use, agent-safety, efficiency, control
`2606.03601`	DDOR: Delta Debugging for Explainable Overrefusal Testing and Repair PDF	cs.SE, cs.AI	89	Black-box framework to detect and repair LLM overrefusal with explainable triggers.	alignment, overrefusal, guardrails, evaluation, debugging
`2606.03800`	Trading Human Curation for Synthetic Augmentation in RLVR PDF	cs.LG, cs.AI	89	Studies scalable task generation for RLVR on agentic LMs with explicit cost-quality tradeoffs.	RLVR, agents, synthetic-data, training-data, alignment
`2606.02995`	Patcher: Post-Hoc Patching of Backdoored Large Language Models PDF	cs.CR, cs.AI, cs.IR, cs.LG	88	Post-hoc repair of backdoored LLMs from a single failure case is highly practical for deployment security.	backdoors, jailbreak, model-repair, security, alignment
`2606.03354`	ImageAuditor: Membership Inference Attack against Image-based Retrieval-Augmented Generation PDF	cs.CR	88	Membership inference for image-RAG highlights privacy/copyright risks in multimodal retrieval.	privacy, rag, membership-inference, multimodal, security
`2606.03032`	The Deliberative Illusion: Diagnosing Factual Attrition and Stance Homogenization in Multi-Agent LLM Deliberation PDF	cs.CL	88	Diagnoses failure modes in multi-agent deliberation: fact loss and consensus collapse.	multi-agent, reliability, evaluation, factuality, deliberation

AI Paper Insight Brief

2026-06-04

0) Executive takeaways (read this first)

Runtime governance is becoming the dominant safety pattern for agents: several papers move control from model-only alignment to manifests, certificates, permissions, receipts, and action-level proofs across heterogeneous runtimes.
The strongest security signal today is supply-chain and lifecycle risk, not just prompt-level misuse: model merging, skill loading, backdoored fine-tunes, reward models, IRAG databases, and agent observability all emerge as attack surfaces.
Multi-turn and trajectory-level analysis is maturing: papers show that harmful behavior, factual erosion, credential leakage, and jailbreak intent can often be detected or explained only by modeling conversation/workflow dynamics rather than single turns.
Several works argue current evaluation is systematically misleading: contamination detectors fail in realistic auditing, fine-tuning safety metrics depend on capability grounding and judge choice, and real-world agent benchmarks need reconstructed environments from live sessions.
Practical defenses are shifting toward lightweight, deployable interventions: post-hoc patching from one failure report, reward-head editing, pre-call tool gating, reusable safety adapters, and selective runtime re-anchoring all aim to improve safety without full retraining.
A notable meta-risk: methods that look alignment-improving in aggregate can entrench failure modes. Consistency training can amplify sycophancy, multi-agent deliberation can erase critical facts, and autoregressive alignment can remain shallow beyond the first tokens.

2) Key themes (clusters)

Theme: Runtime governance and permissioning for agents

Why it matters: As agents act through tools, skills, shells, APIs, and managed runtimes, safety depends less on text filtering alone and more on whether actions are authorized, reviewable, and replayable across heterogeneous execution environments.
Representative papers:
Common approach:
- Treat skills/actions/delegations as first-class governed objects with manifests, scopes, or certificates.
- Enforce deny-by-default or review-gated execution at runtime rather than relying on static inspection.
- Preserve auditability via receipts, logs, proof bundles, or authorization graph state.
- Separate authority from execution details so governance survives runtime heterogeneity.
Open questions / failure modes:
- Permissioning alone cannot distinguish benign from malicious use when both require the same capabilities.
- Coverage is uneven across runtimes; observer-only or weak adapters reduce receipt completeness and enforceability.
- Manifest generation and scope attenuation remain error-prone and domain-specific.
- Adoption incentives and interoperability with existing identity/token systems remain unresolved.

Theme: Supply-chain and post-training attack surfaces

Why it matters: Safety failures increasingly originate upstream of inference: poisoned fine-tunes, malicious merge vectors, hacked reward models, and opaque retrieval databases can all compromise downstream systems while preserving apparent utility.
Representative papers:
Common approach:
- Model the attack as latent structure in parameters or representations rather than only prompt behavior.
- Use lightweight post-hoc interventions: LoRA patching, reward-head projection, or black-box auditing.
- Evaluate across multiple model families/attack types to show transfer beyond a single setup.
- Preserve benign utility while targeting the malicious mechanism.
Open questions / failure modes:
- Many defenses assume white-box access or specific attack forms such as token triggers or scalar reward heads.
- Merging and retrieval supply chains still lack strong certified defenses or provenance guarantees.
- Downstream effects inside full RLHF or deployed pipelines are often unmeasured.
- Attackers controlling training procedures, soft prompts, or parameter edits may evade current methods.

Theme: Trajectory-level safety and multi-turn detection

Why it matters: Single-turn moderation misses harms that emerge gradually across turns or workflow states. Several papers show that attack intent, leakage, and alignment failures are encoded in trajectories, not isolated messages.
Representative papers:
Common approach:
- Detect risk from trajectory geometry, hidden-state drift, or continuation-state anomalies rather than surface text alone.
- Intervene before output completion via gating, re-anchoring, or cumulative leakage budgets.
- Use prompt-specific or state-specific references instead of global refusal heuristics.
- Analyze attacks as state transitions that can be induced at arbitrary positions, not just prefixes.
Open questions / failure modes:
- Length confounds, synthetic suites, and white-box assumptions limit external validity.
- Adaptive attackers can target the detector or exploit low-perturbation trajectories.
- Multi-turn benchmarks remain small or attack-family-specific.
- Formal guarantees on cumulative leakage or trajectory safety are still weak.

Theme: Evaluation realism and auditing reliability

Why it matters: Several papers argue current benchmarks and audits overstate confidence. Realistic evaluation requires live environments, capability-grounded safety measurement, and explicit handling of distribution shift.
Representative papers:
Common approach:
- Expose hidden assumptions in evaluation pipelines: IID validation, coherent outputs, checkpoint-centric reproducibility, or curated task distributions.
- Reconstruct realistic environments or deployment metadata rather than scoring decontextualized outputs.
- Compare multiple judges/benchmarks to show measurement instability.
- Prefer deterministic or programmatic verification where possible.
Open questions / failure modes:
- Statistical contamination detection remains unreliable without true provenance.
- Safety judges can misread incoherent outputs as harmfulness.
- Realistic benchmarks are expensive and may still inherit platform-specific biases.
- Deployment manifests for executable policies remain incomplete outside narrow settings.

Theme: Reward design and alignment signal quality

Why it matters: Better alignment increasingly depends on shaping the training signal itself—through rubrics, information gain, query design, or calibrated teachers—rather than only collecting more preference data.
Representative papers:
Common approach:
- Replace coarse scalar rewards with structured signals tied to uncertainty reduction, rubric criteria, or calibrated constitutional teachers.
- Narrow the task/query space so evaluation becomes learnable and less noisy.
- Use on-policy RL variants with grouped rollouts and judge-based scoring.
- Add explicit mechanisms to preserve helpfulness or benign utility.
Open questions / failure modes:
- Heavy dependence on LLM judges, teacher quality, and synthetic data generation.
- Reward costs can be high when each rubric or criterion requires separate grading.
- Simulator-grounded or curated settings may not transfer to noisy real users.
- Hyperparameter sensitivity remains substantial, especially in teacher calibration.

Theme: Agent capability scaling creates new offensive risk

Why it matters: The most alarming dual-use result is that open-weight, single-GPU agents can now autonomously propagate across networks, suggesting offensive capability is becoming decentralized and adaptive.
Representative papers:
Common approach:
- Use modular attacker architectures with memory, retrieval, or preference optimization to adapt to targets.
- Optimize for harmfulness or propagation success directly rather than proxy jailbreak rates alone.
- Evaluate against defended systems and unseen targets to show transfer.
- Emphasize amortized attack training or decentralized execution.
Open questions / failure modes:
- Current offensive evaluations often omit active defenses or sparse-vulnerability environments.
- Harmfulness judges and benchmark setups can bias attack optimization.
- Defensive standards for containment, disclosure, and redaction are still immature.
- Stronger attacks raise the bar for safe benchmarking and release governance.

3) Technical synthesis

A recurring design pattern is post-hoc, parameter-efficient repair: Patcher uses LoRA patching, HARVE edits only the reward head, SafeGene transfers sparse safety adapters, and NeuroArmor intervenes at runtime in representation space rather than retraining full models.
Several papers replace global safety policies with instance-specific control objects: SkillGuard manifests, RUBAS instance-specific rubrics, NeuroArmor safe variants, PCAA action certificates, and Sello receipts all bind governance to a concrete action or prompt.
KL anchoring / preservation terms show up repeatedly as the mechanism for avoiding safety-helpfulness collapse: Patcher anchors benign and non-trigger harmful behavior; COPSD calibrates teacher expressiveness; SafeGene adds benign-preservation during transfer.
Multiple works argue that trajectory state matters more than prompt text: autoregressive continuation states explain shallow alignment, conversation geometry predicts multi-turn attacks, and cumulative leakage budgets catch low-rate exfiltration missed by per-turn filters.
There is a strong move toward programmatic or structured verification over free-form judging: deterministic workspace verifiers in RealClawBench, formal predicates in Lean4Agent, executable policy certificates in ExecSpec, and binary rubric criteria in RUBAS/QUBRIC.
At the same time, many methods still rely on LLM-as-judge bottlenecks for harmfulness, rubric grading, or factual extraction, and several papers explicitly show these judges can be brittle or misleading.
A common empirical failure mode is distribution mismatch: contamination detectors fail under non-IID validation, manifest generators miss invoked scripts, SafeGene needs target-domain safety data, and simulator-trained clarifiers may not transfer to real users.
Several papers expose hidden confounds in evaluation: conversation length dominates naive multi-turn attack detection, constrained-output fine-tuning creates incoherent safety responses, and checkpoint equality in VLA systems does not imply executable equivalence.
Selection-based training can amplify the wrong thing: consistency methods can entrench sycophancy, constitutional distillation can contract expressiveness, and reward models can overvalue style-like hacking directions.
The most robust defenses increasingly combine detection + intervention + auditability rather than any single layer: e.g., SkillGuard mediates calls and logs them, NeuroArmor detects and reroutes, AIS combines activation probes with canaries and leakage accounting.

4) Top 5 papers (with “why now”)

AI Agents Enable Adaptive Computer Worms

Demonstrates a proof-of-concept worm using an open-weight single-GPU LLM plus agent harness in a contained 33-host network.
Reports substantial autonomous performance: average 31.3 vulnerabilities identified, 23.1 hosts exploited, and 20.4 hosts replicated over 7-day runs.
Shows the worm can operationalize newly disclosed vulnerabilities by ingesting advisory material at runtime, which is the key “why now” signal: adaptation is no longer limited to pre-coded exploits.
Useful for defenders because it shifts focus from static exploit signatures to behavioral detection, segmentation, and rapid patching workflows.
Skeptical about: the environment ensured each host had at least one exploitable vulnerability and lacked active endpoint defenses, so results do not measure performance in sparse or defended production networks.

RogueMerge: Robust and Unified Attacks against LLM Model Merging

Identifies model merging as a realistic supply-chain attack surface and proposes a robust optimization attack that survives unknown merge settings and prompt variation.
Combines parameter-level worst-case interference modeling with input-level DRO, and shows high ASR across four threat types and six merging algorithms while preserving utility.
Why now: model merging and community task vectors are becoming standard composition tools, but security assumptions around them are still weak.
Useful for frontier labs and open-model ecosystems because it highlights that “benign-looking” contributed vectors can compromise merged systems without obvious utility loss.
Skeptical about: defense analysis is limited to representative mitigations, and the paper does not provide certified defenses or detection guarantees.

Patcher: Post-Hoc Patching of Backdoored Large Language Models

Offers a practical defense for jailbreak backdoors using only one reported failure, white-box model access, and a small clean validation set.
Localizes token triggers via gradient saliency and patches behavior with refusal supervision plus KL anchoring, driving ASR near zero while preserving utility in reported experiments.
Why now: this matches a realistic incident-response setting where defenders often only have a single failure report rather than poisoned data or attack details.
Useful as a deployable remediation pattern for open-weight model operators and downstream fine-tuners.
Skeptical about: it assumes discrete token-trigger backdoors and an attacker limited to fine-tuning-data poisoning, not soft-prompt or direct parameter-edit attacks.

The Reliability Gap in Benchmark Auditing: Distribution Shift and Scale as Failure Modes of Contamination Detection

Systematically evaluates three leading contamination-detection paradigms across 27 models and finds only about 60% correct outcomes over 335 evaluations.
Shows two concrete failure modes: distribution shift breaks LLM DI, and benchmark-scale data is too small for reliable Post-Hoc DI synthetic calibration.
Why now: contamination claims increasingly shape leaderboard trust, but this paper argues current statistical tools are not reliable enough for real-world auditing.
Useful for evaluation teams because it redirects effort toward provenance and better audit design rather than overconfidence in current detectors.
Skeptical about: it is primarily diagnostic rather than proposing a new robust detector.

RealClawBench: Live OpenClaw Benchmarks from Real Developer-Agent Sessions

Builds a live, versioned benchmark from deployed developer-agent sessions with reconstructed workspaces and deterministic verifiers.
Preserves source distribution closely (reported max JSD 0.0448) while remaining discriminative across 14 models; best pass rate is 65.8%.
Why now: agent evaluation is increasingly bottlenecked by realism, and this paper provides a concrete pipeline from production traces to executable benchmark cases.
Useful for teams evaluating coding/developer agents because it measures completion in the original environment rather than output plausibility alone.
Skeptical about: scope is OpenClaw-specific, and tasks depending on private services or unreconstructable state are filtered or simplified.

5) Practical next steps

Add runtime permission manifests and action receipts to agent systems now; even partial coverage is better than relying only on prompt filtering.
Audit your stack for supply-chain write paths: fine-tune data, merge vectors, reward heads, skill packages, and retrieval corpora should each have provenance and rollback plans.
Evaluate jailbreak defenses on multi-turn and trajectory-level attacks, not just single-turn prompts; include prefix, insertion, and slow-leak scenarios.
If you fine-tune models, track capability, coherence, and safety jointly; do not trust harmfulness metrics alone when outputs may become format-constrained or incoherent.
For tool-using agents, test pre-call gating and clarification policies as cost/safety levers before adding more tools or larger models.
Build realistic benchmark slices from production traces where possible, with deterministic verifiers and environment reconstruction, instead of relying only on authored tasks.
For reward models and judges, run contrastive hacking audits and consider lightweight interventions like head editing before retraining full evaluators.
Treat consistency-style post-training and self-improvement pipelines as alignment-changing operators; re-audit for sycophancy and other coherent failure modes after applying them.

Generated from per-paper analyses; no external browsing.

Agent safety moves runtime.

Takeaways

Start with: AI Agents Enable Adaptive Computer Worms

Themes

Papers Worth Your Reading Time

AI Agents Enable Adaptive Computer Worms

RealClawBench: Live OpenClaw Benchmarks from Real Developer-Agent Sessions

SkillGuard: A Permission Framework for Agent Skills

AI Paper Insight Brief

2026-06-04

0) Executive takeaways (read this first)

2) Key themes (clusters)

Theme: Runtime governance and permissioning for agents

Theme: Supply-chain and post-training attack surfaces

Theme: Trajectory-level safety and multi-turn detection

Theme: Evaluation realism and auditing reliability

Theme: Reward design and alignment signal quality

Theme: Agent capability scaling creates new offensive risk

3) Technical synthesis

4) Top 5 papers (with “why now”)

AI Agents Enable Adaptive Computer Worms

RogueMerge: Robust and Unified Attacks against LLM Model Merging

Patcher: Post-Hoc Patching of Backdoored Large Language Models

The Reliability Gap in Benchmark Auditing: Distribution Shift and Scale as Failure Modes of Contamination Detection

RealClawBench: Live OpenClaw Benchmarks from Real Developer-Agent Sessions

5) Practical next steps