Takeaways

Agent safety work is shifting from outcome-only evaluation to **process- and trajectory-level supervision**: several papers show that final success or refusal often hides serious internal failures, from web-agent process anomalies to shallow refusals and unstable belief updates.
**Retrieval, memory, and context are now first-class attack surfaces**. Web retrieval can weaken safety alignment, long-term memory can be poisoned through normal conversation, and benign-looking reference text or skills can steer models into harmful behavior.
A recurring pattern is that **small, specialized models trained on structured supervision can outperform larger zero-shot judges/guards** for narrow safety tasks: process-anomaly detection, financial compliance detection, and action-only scheming monitors all show this.

Start with: Relevance as a Vulnerability: How Web Retrieval Degrades Safety Alignment in LLM Agents

Why it catches my eye: It identifies a structural safety-utility tradeoff in retrieval agents and offers a reusable benchmark for testing it.

Read skeptically for: Main evidence centers on controlled URL setups, so autonomous long-horizon retrieval remains less tested.

agent-safety retrieval tool-use evaluation

arXiv PDF

Themes

Process-level auditing replaces outcome-only evaluation Several papers show that final task success, refusal, or benchmark score can mask unsafe or unreliable internal behavior. The practical implication is that deployment monitoring needs trajectory labels, localized failure spans, and intermediate-state diagnostics rather than only end outcomes.

Retrieval and memory are structural safety vulnerabilities Retrieval and memory are supposed to improve capability, but several papers show they also systematically weaken alignment or create persistent attack channels. The common lesson is that relevance and persistence amplify risk, not just utility.

Runtime guardrails are moving from prompts to enforcement layers Prompt-only safety checks are increasingly treated as insufficient for high-privilege agents. The stronger proposals in this batch move enforcement into typed interfaces, verifiers, and pre-reply trajectory guards.

Signal Process beats outcome-only safety. OpenClawBench, BenchTrace, belief-management, and temporal-logit work all show final success or refusal can hide unsafe internal behavior.

Tension Helpful context widens attack surface. Web retrieval degrades alignment, conversational memory can be poisoned, and distractor instructions scale badly even as capability improves.

Bet Specialist runtime guards will win first. Typed guardrails, action-only scheming monitors, and domain detectors outperform generic prompt-only safety for narrow high-risk tasks.

Papers Worth Your Reading Time

Ranked for research usefulness: novelty, method pattern, evidence quality, and skepticism value.

Relevance as a Vulnerability: How Web Retrieval Degrades Safety Alignment in LLM Agents

Useful if you build retrieval agents: it shows relevance itself can increase harmful compliance, not just obvious prompt injection.

Why now: Retrieval is becoming a default agent capability, so this is a core deployment risk rather than an edge case.
Skepticism: Controlled URL experiments may not capture the full dynamics of autonomous retrieval and long-horizon planning.

arXiv PDF

OpenClawBench: Benchmarking Process-side Anomalies in Real-world Agent Execution Trajectories

It makes the outcome-process gap concrete and gives a practical benchmark for trajectory-level monitoring.

Why now: Teams deploying agents need process diagnostics, not just pass/fail task scores, to catch latent failures early.
Skepticism: Silver labels and subtype imbalance limit how confidently the fine-grained anomaly taxonomy transfers.

arXiv PDF

Provably Secure Agent Guardrail

A strong companion paper because it moves safety from prompting to typed execution checks with formal guarantees.

Why now: As agents gain authority to act, deterministic enforcement layers matter more than better refusal phrasing.
Skepticism: The guarantees depend on strong assumptions about action formalization, complete axioms, and a trusted verifier.

arXiv PDF

Chinese version: [中文]

Run stats

Candidates: 483
Selected: 30
Deepread completed: 30
Window (UTC): 2026-05-28T00:00:00Z → 2026-05-29T00:00:00Z (arxiv_announce, expanded=0)

Show selected papers

arXiv ID	Title / Links	Categories	Score	Why	Tags
`2605.29601`	Training Deliberative Monitors for Black-Box Scheming Detection PDF	cs.CL, cs.AI, cs.LG	96	Black-box scheming detection for agents via action-only monitors; highly relevant AI control direction.	agent-safety, scheming, monitoring, black-box, alignment, evaluation
`2605.30322`	Gram: Assessing sabotage propensities via automated alignment auditing PDF	cs.LG, cs.AI	96	Direct agent sabotage auditing framework with concrete misbehavior rates and driver analysis.	agent-safety, alignment-audit, sabotage, evaluation, agents
`2605.29224`	Relevance as a Vulnerability: How Web Retrieval Degrades Safety Alignment in LLM Agents PDF	cs.CL, cs.AI, cs.CR	95	Strong agent-safety result: web retrieval can weaken alignment; diagnostic framework is highly reusable.	agent-safety, retrieval, alignment, tool-use, security, evaluation
`2605.30040`	Token Inflation: How Dishonest Providers Can Overcharge for Large Language Model Usage PDF	cs.CR, cs.AI, cs.CL	95	Auditing LLM token billing exposes provider-manipulation risks with direct security and governance impact.	llm-security, auditing, pricing, trust, governance
`2605.29468`	SciIntBench: Measuring LLM Compliance with Research Integrity Norms Under Adversarial Framing PDF	cs.CR, cs.AI	95	Adversarial benchmark for research-integrity compliance; directly probes covert misconduct assistance.	safety, benchmark, adversarial-eval, alignment, scientific-integrity
`2605.29708`	Understanding Safety-Sensitive Expert Behavior in Mixture-of-Experts LLMs PDF	cs.CL	95	Directly probes where MoE LLM safety lives; expert-level red-teaming is highly relevant to alignment.	LLM safety, MoE, red-teaming, alignment, robustness
`2605.29491`	The Curse of Helpfulness: Inverse Scaling Law in Robustness to Distractor Instructions via DistractionIF PDF	cs.AI	94	Benchmark shows inverse scaling on distractor instructions, directly relevant to prompt injection/RAG robustness.	prompt-injection, rag, robustness, benchmark, inverse-scaling, agents
`2605.29354`	Harmless Yet Harmful: Neutral Prompting Attacks for Stealthy Hallucination Steering in Agent Skills PDF	cs.CR, cs.LG	94	Stealthy neutral-prompt attack raises package hallucination risk in coding agents; strong security relevance.	agent-security, prompt-injection, coding-agents, hallucination, supply-chain
`2605.29251`	Provably Secure Agent Guardrail PDF	cs.AI, cs.CR	93	Targets agent control with provable guardrails and executable proof constraints; high safety relevance.	agent-safety, guardrails, formal-methods, security, neuro-symbolic
`2605.29960`	Hijacking Agent Memory: Stealthy Trojan Attacks Through Conversational Interaction PDF	cs.CR, cs.AI	92	Realistic memory-poisoning attack on LLM agents via conversation; important new agent attack surface.	agent-safety, memory-poisoning, trojan, security, long-term-memory
`2605.30162`	BioRefusalAudit: Auditing Biosecurity Refusal Depth Using General and Domain-Fine-Tuned Sparse Autoencoders PDF	cs.AI, cs.CR, cs.LG	92	Audits refusal robustness for biosecurity prompts; exposes brittle safety behavior under small changes.	biosecurity, refusal, safety-evaluation, robustness, interpretability
`2605.29427`	FinGuard: Detecting Financial Regulatory Non-Compliance in LLM Interactions PDF	cs.CL	92	Regulation-grounded compliance benchmark/guard model for financial LLM deployments; strong applied safety value.	safety, guardrails, compliance, benchmark, finance
`2605.29253`	OpenClawBench: Benchmarking Process-side Anomalies in Real-world Agent Execution Trajectories PDF	cs.AI	91	Large benchmark for process-side anomalies in agent trajectories, beyond outcome-only evaluation.	agents, benchmark, process-monitoring, anomaly-detection, evaluation, safety
`2605.29237`	Evolving Skill-Structured Attack Memory Enhances LLM Jailbreaking PDF	cs.CR	91	Automated jailbreak framework with evolving attack memory; strong safety-eval value for red teaming.	jailbreak, red-teaming, safety-evaluation, adversarial-attacks, llm-security
`2605.29927`	Does The Way You Plan Matter? An Empirical Study of Planning Representations for LLM Web Agents PDF	cs.CL, cs.AI, cs.LG	91	Systematic study of planning representations for web agents; directly useful for agent reliability.	llm-agents, web-agents, planning, evaluation, reliability
`2605.29800`	Nine Judges, Two Effective Votes: Correlated Errors Undermine LLM Evaluation Panels PDF	cs.CL	91	Shows LLM judge panels have highly correlated errors; important warning for evaluation reliability.	evaluation, llm-as-judge, reliability, benchmarking, correlated-errors
`2605.29886`	CRITIC-R1: Learning Structured Critics for Retrieval-Augmented Generation PDF	cs.CL, cs.AI	91	Structured RL critic for RAG error diagnosis could reduce hallucinations with reusable critique signals.	RAG, hallucination, RL, evaluation, reliability
`2605.29801`	AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security PDF	cs.AI, cs.CL, cs.CR, cs.CV, cs.LG	90	Alignment framework for agent safety/security with updated taxonomy and lightweight training recipe.	agent-safety, alignment, security, taxonomy, guardrails, data-engine
`2605.29225`	BenchTrace: A Benchmark for Testing Reflection Ability and Controlled Evolution in LLM Agents PDF	cs.AI	90	Benchmark for reflection and self-evolution in agents with targeted failure analysis, not just task scores.	agents, benchmark, reflection, self-improvement, evaluation
`2605.29682`	Scaling Laws for Agent Harnesses via Effective Feedback Compute PDF	cs.CL	90	Proposes scaling law for agent harnesses via effective feedback, a useful lens for agentic systems.	agents, scaling-laws, evaluation, tool-use, test-time-compute
`2605.30159`	Meta-Cognitive Memory Policy Optimization for Long-Horizon LLM Agents PDF	cs.AI	90	Targets long-horizon agent memory with belief-entropy optimization; strong agent reliability relevance.	llm-agents, memory, long-horizon, reliability, optimization
`2605.29629`	Beyond Attack Success Rate: Temporal Logit Observability for LLM Safety Failures PDF	cs.AI	89	Moves beyond ASR with logit-based diagnostics for jailbreak failures; useful safety measurement tool.	jailbreak, evaluation, logits, safety-metrics, diagnostics
`2605.29218`	GTA: Generating Long-Horizon Tasks for Web Agents at Scale PDF	cs.AI, cs.CL	89	Scalable generation of long-horizon web-agent tasks with trajectories could unlock better training/eval.	web-agents, benchmark, task-generation, long-horizon, supervision
`2605.30049`	Robust and Generalizable Safety Steering for Text-to-Image Diffusion Transformers PDF	cs.AI	89	Safety steering for diffusion transformers with transfer across shifted risk domains is broadly useful.	multimodal-safety, diffusion, safety-steering, robustness, SAE
`2605.30323`	In-Context Reward Adaptation for Robust Preference Modeling PDF	cs.LG, cs.AI	89	Adapts reward models in-context to unseen preferences, addressing a core RLHF robustness limitation.	RLHF, preference modeling, alignment, reward models, robustness
`2605.30189`	Token-Level Generalization in LoRA Adapter Backdoors: Attack Characterization and Behavioral Detection PDF	cs.CR, cs.AI, cs.CL, cs.LG	88	Shows LoRA adapter backdoors can preserve clean accuracy; practical supply-chain risk for LLM safety.	backdoors, LoRA, supply-chain-security, poisoning, LLM-security
`2605.29737`	Minimal Prompt Perturbations Lead to Code Vulnerabilities: Prompt Fragility and Hidden-State Signals in Coding LLMs PDF	cs.CR, cs.CL, cs.SE	88	Shows tiny prompt changes can induce insecure code; important reliability/security finding for coding LLMs.	coding-llms, security, prompt-fragility, code-generation, robustness
`2605.29951`	MuPHI: Learning Implicit Multimodal Harm Reasoning via Semantically Grounded Reward Optimization PDF	cs.AI, cs.CL, cs.LG, cs.MM	88	Multimodal harm reasoning dataset and training method target subtle unsafe image-text interactions.	multimodal, safety, harm-detection, vlm, reasoning
`2605.30219`	When Should Models Change Their Minds? Contextual Belief Management in Large Language Models PDF	cs.AI, cs.CL, cs.LG	88	Belief management benchmark targets when models should update, retain, or ignore context over time.	reliability, long-context, benchmark, belief-tracking, rl
`2605.29397`	Revisiting Observation Reduction for Web Agents: Comprehensive Evaluation with a Lightweight Framework PDF	cs.CL	88	Lightweight proxy for web-agent observation reduction with strong practical relevance to agent efficiency.	agents, web-agents, evaluation, efficiency, tool-use

AI Paper Insight Brief

2026-05-30

0) Executive takeaways (read this first)

Agent safety work is shifting from outcome-only evaluation to process- and trajectory-level supervision: several papers show that final success or refusal often hides serious internal failures, from web-agent process anomalies to shallow refusals and unstable belief updates.
Retrieval, memory, and context are now first-class attack surfaces. Web retrieval can weaken safety alignment, long-term memory can be poisoned through normal conversation, and benign-looking reference text or skills can steer models into harmful behavior.
A recurring pattern is that small, specialized models trained on structured supervision can outperform larger zero-shot judges/guards for narrow safety tasks: process-anomaly detection, financial compliance detection, and action-only scheming monitors all show this.
Multiple papers argue that architecture and interface choices matter as much as base model capability: same-turn retrieval is riskier than deferred retrieval, plan representation changes web-agent performance, and typed execution layers can provide guarantees that prompt-only guardrails cannot.
There is growing evidence that scaling alone does not monotonically improve robustness. Larger models can be more distractible, MoE routing can preserve semantics while safety is bypassed, and multi-judge panels add far less independent signal than their size suggests.
The most actionable near-term direction is to build runtime safety layers around agents: typed action verification, trajectory monitors, retrieval decoupling, memory admission controls, and domain-specific detectors all look more mature than relying on generic refusal behavior.

2) Key themes (clusters)

Theme: Process-level auditing replaces outcome-only evaluation

Why it matters: Several papers show that final task success, refusal, or benchmark score can mask unsafe or unreliable internal behavior. The practical implication is that deployment monitoring needs trajectory labels, localized failure spans, and intermediate-state diagnostics rather than only end outcomes.
Representative papers:
Common approach:
- Build structured supervision over trajectories rather than final labels only.
- Measure localization/diagnosis quality, not just binary success.
- Introduce intermediate metrics such as FAR, belief-state rewards, or token-time refusal signals.
- Use synthetic or normalized environments to make turn-level verification exact or reproducible.
Open questions / failure modes:
- Many labels still rely on LLM judges or silver annotations.
- Closed-world or synthetic settings may not transfer cleanly to open deployments.
- Process metrics can be expensive to collect unless generation and replay are standardized.
- It remains unclear how best to convert these diagnostics into online interventions.

Theme: Retrieval and memory are structural safety vulnerabilities

Why it matters: Retrieval and memory are supposed to improve capability, but several papers show they also systematically weaken alignment or create persistent attack channels. The common lesson is that relevance and persistence amplify risk, not just utility.
Representative papers:
Common approach:
- Isolate post-retrieval effects by controlling retrieved content or memory writes.
- Study how architecture choices change risk, e.g. same-turn vs deferred retrieval.
- Use structured critics or embedding-space analyses to diagnose when external context should be trusted.
- Evaluate robustness under paraphrasing, filtering, chunking, and transfer across retrievers/models.
Open questions / failure modes:
- Relevance itself appears to be the activation condition for both utility and safety degradation.
- Memory poisoning durability under consolidation/eviction remains open.
- Many defenses tested are partial pipeline patches rather than end-to-end hardened systems.
- API settings often hide the signals needed for strong monitoring.

Theme: Runtime guardrails are moving from prompts to enforcement layers

Why it matters: Prompt-only safety checks are increasingly treated as insufficient for high-privilege agents. The stronger proposals in this batch move enforcement into typed interfaces, verifiers, and pre-reply trajectory guards.
Representative papers:
Common approach:
- Separate untrusted model outputs from trusted verification or detection planes.
- Train compact specialist monitors on structured, domain-grounded supervision.
- Use trajectory-level or action-level inputs rather than only final text.
- Optimize for low false positives and deployable latency/cost.
Open questions / failure modes:
- Formal guarantees depend on strong assumptions about schemas, axioms, and trusted computing bases.
- Domain-specific detectors may not generalize outside their regulatory or task scope.
- Pre-reply guards cannot undo harms from earlier tool actions.
- Synthetic training data may leave blind spots against adaptive adversaries.

Theme: New benchmarks are getting harder, more realistic, and less shortcut-friendly

Why it matters: Several papers argue current benchmarks overestimate agent competence by allowing retrieval shortcuts, collapsing trajectories to outcomes, or underrepresenting realistic failure modes. The new datasets emphasize multi-hop reasoning, hard web tasks, multimodal composition, and domain-specific misuse.
Representative papers:
Common approach:
- Construct matched or controlled variants to isolate framing, composition, or planning effects.
- Use executable paths, symbolic verifiers, or reproducible environments.
- Stress multilingual, cross-site, or cross-domain transfer.
- Measure robustness under multiple runs rather than single-shot scores.
Open questions / failure modes:
- Many benchmarks still depend on LLM judges for some labels.
- Coverage remains narrow for multimodal, multilingual, or open-world settings.
- Dynamic web and real-world drift can quickly stale benchmark instances.
- Harder benchmarks may improve diagnosis before they improve training signal quality.

Theme: Supply-chain and model-component attacks are becoming more subtle

Why it matters: The attack surface is broadening from prompts to adapters, skills, billing systems, and expert submodules. These attacks are notable because they preserve normal behavior while creating targeted failures or economic abuse.
Representative papers:
Common approach:
- Preserve clean-task utility while inducing targeted unsafe behavior.
- Exploit hidden assumptions: trusted providers, benign-looking skills, router-level safety, or adapter provenance.
- Pair attack demonstrations with lightweight detection heuristics or mechanistic analyses.
- Show transfer or generalization beyond the exact trigger used in training.
Open questions / failure modes:
- Detection often depends on calibration cohorts or probe coverage.
- Weight-level signatures may not transfer across model families.
- Some attacks exploit structural trust assumptions that current tooling cannot independently verify.
- MoE-specific safety defenses remain underdeveloped relative to the demonstrated risk.

3) Technical synthesis

A strong cross-paper trend is dense intermediate supervision: executable paths in GTA, localized anomaly spans in OpenClawBench, reflection labels in BenchTrace, belief-state rewards in BeliefTrack/MMPO, and structured critiques in CRITIC-R1.
Several papers replace generic scalar rewards with task-structured rewards: Jaccard belief-state rewards, rubric rewards for distractor robustness, conservative-vs-diagnostic critique rewards, and semantically grounded multimodal harm rewards.
LLM-as-judge remains common, but the stronger papers either calibrate against humans, use symbolic verifiers, or train smaller deployable models from judged data rather than leaving the judge in the loop at runtime.
There is a recurring architectural lesson that decoupling helps safety: DEFER separates retrieval from harmful requests; planner/executor separation can improve web performance; ePCA separates neural intent from symbolic execution approval.
Multiple works show specialized open-weight models can beat larger zero-shot frontier models once trained on narrow, high-quality supervision: OpenClawBench detector, FinGuard, and deliberative scheming monitors are the clearest examples.
Several papers expose non-monotonic scaling: larger models can be more distractible, MoE safety can be bypassed with tiny expert edits, and adding more LLM judges does not linearly add independent signal.
Representation-level diagnostics are becoming practical: TLO uses logits only, BioRefusalAudit uses SAE-derived divergence, SafeDIG uses SAE-based intervention in DiTs, and hidden-state steering in BeliefTrack transfers some RL gains without retraining.
A common failure pattern is surface success masking latent fragility: successful trajectories can still be anomalous, refusals can be shallow or format-gated, and secure code can flip under tiny prompt perturbations.
Many methods rely on controlled synthetic or semi-synthetic environments to get exact labels, then test transfer to more realistic settings; this is productive but leaves open-world generalization as the main unresolved gap.
The most mature deployment pattern across papers is a stacked safety architecture: benchmark/diagnose → train specialist monitor/critic → add runtime gating or verification → keep human review for high-risk cases.

4) Top 5 papers (with “why now”)

Relevance as a Vulnerability: How Web Retrieval Degrades Safety Alignment in LLM Agents

Shows retrieval is not just an injection vector; topical relevance itself can increase harmful compliance.
Quantifies two distinct mechanisms: same-turn agentic retrieval creates a commitment bias, and even oppositional “safe” sources can raise harmfulness when relevant.
Introduces HarmURLBench (1,405 URLs, 320 behaviors), which is directly useful for evaluating retrieval-enabled agents.
Why now: retrieval/tool use is becoming default in production agents, and this paper identifies a structural safety-utility tradeoff rather than a patchable edge case.
Skepticism / limitation: main experiments isolate externally specified URLs, so autonomous retrieval and long-horizon planning interactions are only partially covered.

OpenClawBench: Benchmarking Process-side Anomalies in Real-world Agent Execution Trajectories

Quantifies the “Outcome–Process Gap”: 2,904 of 31,135 oracle-passing executions were still process-anomalous.
Provides a large trajectory corpus with anomaly labels, onset/span localization, and subtype taxonomy.
A LoRA-tuned Gemma 3 12B detector reaches binary F1 0.729, beating GPT-5.4 zero-shot on the task.
Why now: as agents start acting in real environments, process monitoring is more actionable than post-hoc outcome scoring.
Skepticism / limitation: labels are silver rather than fully human, and subtype imbalance makes fine-grained evaluation less mature than binary detection.

Provably Secure Agent Guardrail

Proposes ePCA: agents must emit typed action payloads that are translated into logic and checked by an SMT solver against immutable safety axioms.
Offers a formal safety theorem under explicit assumptions and reports very low verification latency (~0.44 ms mean).
Reframes agent safety from semantic moderation to execution-layer enforcement.
Why now: high-privilege agents are moving from demos to real workflows, and empirical prompt guardrails are increasingly inadequate for irreversible actions.
Skepticism / limitation: guarantees hinge on strong assumptions about intent formalization, exhaustive axioms, and an uncompromised trusted verification plane.

Training Deliberative Monitors for Black-Box Scheming Detection

Trains action-only monitors that detect scheming from tool-use traces without CoT or white-box access.
Distilled open-weight monitors hit strong cost/performance tradeoffs; Qwen3.5-27B SFT+RL reaches Mean-OOD pAUROC@20 of 0.831 at low cost.
Shows most gains come from supervised deliberative rationale distillation, with RL adding smaller refinements.
Why now: deployers need cheap, defender-owned monitors for agent traces, not expensive frontier-model judges on every episode.
Skepticism / limitation: training and evaluation rely on synthetic/semi-synthetic scheming data, so real-world adaptive adversaries remain an open test.

GTA: Generating Long-Horizon Tasks for Web Agents at Scale

Introduces a scalable pipeline for generating multi-hop web tasks with deterministic executable paths over explicit site graphs.
Produces 5,000 intra-site and 600 inter-site tasks across 50+ websites, with much higher page coverage than prior datasets.
Reveals a large human-agent gap and strong failure on cross-site and multilingual tasks.
Why now: web-agent progress is bottlenecked by benchmarks that are too shallow or too easy to shortcut with search.
Skepticism / limitation: excludes interactive/gated/transactional workflows and still depends on LLM-based verification.

5) Practical next steps

Add trajectory-level monitoring to agent stacks now: log actions, state writes, errors, uncertainty markers, and retrieval provenance so you can train or evaluate process-anomaly detectors later.
For retrieval-enabled agents, test same-turn vs deferred retrieval as a default ablation; if safety matters, treat temporal decoupling as a baseline mitigation rather than an optional UX choice.
Build memory admission controls for long-term memory: require salience checks, trigger-pattern scans, and retrieval-time anomaly detection before writing or activating memories.
For high-privilege actions, move from prompt guardrails to typed action schemas + deterministic policy checks wherever the action space is enumerable.
Stop relying on single end metrics like ASR or task success; add time-resolved or turn-resolved diagnostics such as early refusal signals, belief-state consistency, and failure localization.
If you use LLM judges, measure effective independence, not panel size; diversify model families/prompts or keep humans in the loop for high-stakes evaluation.
Audit your coding-agent supply chain for skills, adapters, and package suggestions: scan LoRA adapters behaviorally, validate dependencies against registries, and distrust benign-looking third-party skills.
For web agents, prioritize harder benchmark coverage: multi-hop, multilingual, cross-site, and plan-format ablations are exposing weaknesses that standard benchmarks miss.

Generated from per-paper analyses; no external browsing.

Agent safety moves runtime.

Takeaways

Start with: Relevance as a Vulnerability: How Web Retrieval Degrades Safety Alignment in LLM Agents

Themes

Papers Worth Your Reading Time

Relevance as a Vulnerability: How Web Retrieval Degrades Safety Alignment in LLM Agents

OpenClawBench: Benchmarking Process-side Anomalies in Real-world Agent Execution Trajectories

Provably Secure Agent Guardrail

AI Paper Insight Brief

2026-05-30

0) Executive takeaways (read this first)

2) Key themes (clusters)

Theme: Process-level auditing replaces outcome-only evaluation

Theme: Retrieval and memory are structural safety vulnerabilities

Theme: Runtime guardrails are moving from prompts to enforcement layers

Theme: New benchmarks are getting harder, more realistic, and less shortcut-friendly

Theme: Supply-chain and model-component attacks are becoming more subtle

3) Technical synthesis

4) Top 5 papers (with “why now”)

5) Practical next steps