June 5, 2026 Research Brief

Agent safety turns stateful.

Today’s strongest papers show agent risk and evaluation moving from single prompts and final answers toward persistent state, process tracing, and structured control surfaces.

Takeaways

  1. **Persistent state is now the main agent security boundary.** Multiple papers show that memory, files, tool descriptions, and other stored context can be poisoned or misdescribed, with cross-session attacks succeeding at non-trivial rates and existing prompt-injection defenses missing weak-signal variants.
  2. **Process-level evaluation is replacing endpoint-only evaluation.** New benchmarks increasingly score planning, long-horizon iteration, clinical process, companionship, cyber workflows, and autonomous agent development by tracing intermediate decisions rather than just final answers.
  3. **A recurring design pattern is “structure over prose.”** Stronger results come from explicit structure: provenance graphs, typed invariants, compile-time memory control, machine-readable API recovery hints, constraint-level verification, and trajectory-aware alignment.

Start with (#1): From Untrusted Input to Trusted Memory: A Systematic Study of Memory Poisoning Attacks in LLM Agents

Why it catches my eye: It gives the clearest reusable map of persistent agent compromise, with a taxonomy, benchmark, and stage-wise metrics for memory poisoning.

Read skeptically for: Results use one base model and partially emulated pipelines, so transfer to deployed agent stacks is not fully settled.

Tags: agent-safety, memory, security, benchmark

Themes

**Persistent-state attacks and agent security.** Agent risk is shifting from single-session prompt injection to attacks that persist in memory, files, and tool ecosystems. Once malicious state is stored, later sessions can reactivate it without the attacker being present.
**Process-level evaluation for agents in the wild.** Benchmarks are moving from “did the model get the answer?” to “did the system act correctly over time?” This is crucial for planning, medicine, cyber, companionship, and long-horizon engineering, where intermediate mistakes dominate real-world failure.
**Structured control, provenance, and auditable agent architectures.** Several papers argue that safer agents need explicit structure around memory, execution, and policy enforcement, not just better prompting. Provenance, invariants, and compile-time control are emerging as core systems primitives.
Signal: Persistent state is the new boundary. Memory poisoning, cross-session prompt injection, and MCP description-code mismatch all show that stored artifacts can reactivate attacks after the original session ends.
Tension: Evaluation is richer than control. Benchmarks now trace planning, cyber workflows, clinical encounters, and long-horizon iteration, but many defenses still focus on prompt-layer filtering.
Bet: Structure will beat prose defenses. Ontology-grounded agents, permissioned memory, provenance tracing, and structured API recovery all point toward typed control surfaces over natural-language instructions.

Papers Worth Your Reading Time

Ranked for research usefulness: novelty, method pattern, evidence quality, and skepticism value.

#1 From Untrusted Input to Trusted Memory: A Systematic Study of Memory Poisoning Attacks in LLM Agents

Useful if you build agents with memory: it decomposes persistent compromise into concrete write and activation paths you can actually audit.

Why now: Persistent memory is becoming a default agent feature, making stored-state compromise a practical deployment risk.
Skepticism: One-model evaluation and partially emulated pipelines may understate architecture-specific behavior.

#2 Provably Auditable and Safe LLM Agents from Human-Authored Ontologies

A strong companion read because it offers a concrete alternative architecture: auditable agents grounded in ontologies, invariants, and append-only logs.

Why now: As persistent-state attacks rise, formal control and auditability are becoming more relevant than prompt-only safeguards.
Skepticism: The guarantees depend on correct ontology and invariant design, with limited large-scale deployment evidence so far.

#3 What If Prompt Injection Never Left? Exploring Cross-Session Stored Prompt Injection in Agentic Systems

It sharpens the threat model by showing how prompt injection persists through working memory and shared artifacts across sessions.

Why now: Agent stacks are standardizing shared files and memory, so cross-session persistence is becoming a baseline security assumption.
Skepticism: The benchmark is intentionally early and may miss newer persistence channels in fast-changing agent frameworks.


Run stats

  • Candidates: 307
  • Selected: 30
  • Deepread completed: 30
  • Window (UTC): 2026-06-03T00:00:00Z → 2026-06-04T00:00:00Z (arxiv_announce, expanded=0)

Selected papers

| arXiv ID | Title | Categories | Score | Why | Tags |
|---|---|---|---|---|---|
| 2606.04329 | From Untrusted Input to Trusted Memory: A Systematic Study of Memory Poisoning Attacks in LLM Agents | cs.CR, cs.AI | 96 | Systematic memory-poisoning study with taxonomy, benchmark, and strong agent-safety relevance. | agent-safety, prompt-injection, memory, benchmark, security |
| 2606.04425 | What If Prompt Injection Never Left? Exploring Cross-Session Stored Prompt Injection in Agentic Systems | cs.CR, cs.AI | 95 | Introduces cross-session stored prompt injection, a realistic persistent threat in agentic systems. | agent-safety, prompt-injection, persistent-state, security, agents |
| 2606.04769 | Description-Code Inconsistency in Real-world MCP Servers: Measurement, Detection, and Security Implications | cs.CR, cs.AI, cs.SE | 95 | Studies MCP tool description/code mismatch, with taxonomy and detection for agent tool-use security. | agents, tool-use, MCP, security, prompt-injection, auditing |
| 2606.04903 | Provably Auditable and Safe LLM Agents from Human-Authored Ontologies | cs.LO, cs.AI, cs.MA, cs.PL | 95 | Auditable, ontology-grounded LLM agents with formal safety/correctness claims and append-only logs. | agent-safety, auditing, formal-methods, ontologies, governance |
| 2606.04929 | Sequential Data Poisoning in LLM Post-Training | cs.LG, cs.CR | 94 | Introduces sequential poisoning across SFT and preference stages; strong relevance to LLM post-training security. | LLM, data-poisoning, post-training, RLHF, DPO, security |
| 2606.04778 | Inference-Time Vulnerability Beyond Shallow Safety: Alignment Along Generation Trajectories | cs.AI, cs.CL, cs.LG | 93 | Shows inference-time safety can fail mid-generation and proposes trajectory-level alignment. | alignment, robustness, jailbreaks, inference-time, llm-safety |
| 2606.04483 | Off-Distribution Voices: Fanfiction Subgenres as Universal Vernacular Jailbreaks for Aligned LLMs | cs.CL | 93 | Natural-language jailbreak family with high ASR across models; important robustness failure mode. | jailbreaks, alignment, robustness, red-teaming, safety-eval |
| 2606.04867 | AICompanionBench: Benchmarking LLMs-as-Judges for AI Companion Safety | cs.AI | 92 | Public benchmark for AI companion safety; evaluates LLM judges on real harmful conversations. | safety, benchmark, llm-as-judge, harm-detection, evaluation |
| 2606.04455 | The Meta-Agent Challenge: Are Current Agents Capable of Autonomous Agent Development? | cs.AI, cs.CL | 91 | New benchmark for autonomous agent development with anti-reward-hacking safeguards. | agents, benchmark, evaluation, reward-hacking, autonomy |
| 2606.04460 | CyberGym-E2E: Scalable Real-World Benchmark for AI Agents' End-to-End Cybersecurity Capabilities | cs.CR, cs.AI, cs.LG | 91 | Large-scale end-to-end cyber benchmark for AI agents across vuln discovery, exploit, and patching. | agents, cybersecurity, benchmark, evaluation, red-teaming |
| 2606.05152 | Reinforcement Learning from Rich Feedback with Distributional DAgger | cs.LG, cs.AI, cs.CL | 90 | General RL recipe for rich feedback beyond binary rewards; relevant to reasoning and agent training. | rlhf, reasoning, training, imitation-learning, credit-assignment |
| 2606.04628 | RAMPART: Registry-based Agentic Memory with Priority-Aware Runtime Transformation | cs.CL, cs.MA | 90 | Agent memory architecture with permissions/provenance; directly relevant to safer agent runtime design. | agents, memory, safety, permissions, provenance, runtime |
| 2606.04874 | Agent Planning Benchmark: A Diagnostic Framework for Planning Capabilities in LLM Agents | cs.CL | 89 | Large diagnostic planning benchmark exposing long-horizon, tool-noise, and refusal weaknesses. | agents, planning, benchmark, evaluation, multimodal |
| 2606.05080 | AutoLab: Can Frontier Models Solve Long-Horizon Auto Research and Engineering Tasks? | cs.AI, cs.LG | 89 | Benchmark for ultra long-horizon closed-loop optimization, a key gap in agent capability evaluation. | agents, benchmark, long-horizon, evaluation, autonomy |
| 2606.04486 | Global Sketch-Based Watermarking for Diffusion Language Models | cs.CR, cs.CL, cs.LG, stat.ML | 88 | Watermarking for diffusion LMs via global sketches; notable for provenance and misuse detection. | watermarking, diffusion-language-models, security, provenance, generation |
| 2606.04435 | Cascading Hallucination in Agentic RAG: The CHARM Framework for Detection and Mitigation | cs.AI, cs.CL, cs.CR, cs.IR | 87 | Targets cascading hallucination in agentic RAG with taxonomy and mitigation framework. | rag, hallucination, agents, reliability, grounding |
| 2606.05122 | Self-Evaluation Is Already There: Eliciting Latent Judge Calibration in Base LLMs with Minimal Data | cs.CL | 87 | Elicits latent self-evaluation/judge calibration in base LLMs with minimal data; useful for reliability. | LLM, self-evaluation, calibration, judging, reliability |
| 2606.04507 | Self-Evolving Deep Research via Joint Generation and Evaluation | cs.CL, cs.AI | 87 | Targets deep-research agents with joint generator-evaluator training for open-ended tasks. | agents, evaluation, self-improvement, deep-research, post-training |
| 2606.04915 | Caliper: Probing Lexical Anchors versus Causal Structure in LLMs | cs.CL, cs.IR | 87 | Controlled benchmark shows causal reasoning often relies on lexical anchors, not structure. | evaluation, reasoning, causal-reasoning, robustness, benchmarks |
| 2606.04889 | GRAIL: Gradient-Reweighted Advantages for Reinforcement Learning with Verifiable Rewards | cs.CL | 86 | Token-wise advantage reweighting for verifiable-reward RL could improve reasoning post-training efficiency. | llm-training, rlvr, reasoning, post-training, optimization |
| 2606.04660 | LifeSide: Benchmarking Agents as Lifelong Digital Companions | cs.CL | 85 | Benchmark for lifelong companion agents covering memory, privacy, and emotional adaptation across sessions. | agents, benchmark, memory, privacy, evaluation |
| 2606.04816 | Beyond Objective Equivalence: Constraint Injection for LLM-Based Optimization Modeling on Vehicle Routing Problems | cs.AI, cs.LG | 85 | Improves reliability of LLM-generated optimization code by exposing omitted or spurious constraints. | reliability, code-generation, verification, optimization, evaluation |
| 2606.04990 | From Agent Traces to Trust: Evidence Tracing and Execution Provenance in LLM Agents | cs.CR, cs.AI | 84 | Useful survey on evidence tracing and provenance for auditing complex LLM agent executions. | agents, auditing, provenance, survey, trust |
| 2606.04928 | Data Attribution in Large Language Models via Bidirectional Gradient Optimization | cs.LG, cs.CL | 84 | LLM training-data attribution supports provenance, accountability, and governance of model outputs. | data-attribution, governance, interpretability, provenance, llms |
| 2606.05158 | Streaming Communication in Multi-Agent Reasoning | cs.CL, cs.AI, cs.MA | 84 | Streaming multi-agent communication cuts latency and may improve reasoning quality in agent pipelines. | multi-agent, reasoning, latency, systems, agents |
| 2606.05112 | Evaluating Large Language Models in Dynamic Clinical Decision-Making with Standardized Patient Cases | cs.CL | 83 | Interactive standardized-patient benchmark for dynamic clinical agent evaluation over full encounters. | agents, benchmark, clinical, evaluation, interactive |
| 2606.05025 | Invariant Gradient Alignment for Robust Reasoning Distillation | cs.LG, cs.AI | 83 | Addresses shortcut learning in reasoning distillation with cross-domain invariant gradient alignment. | reasoning, distillation, ood-robustness, generalization, training |
| 2606.04751 | FALSIFYBENCH: Evaluating Inductive Reasoning in LLMs with Rule Discovery Games | cs.AI | 83 | Benchmark for hypothesis-driven inductive reasoning in LLM agents; useful for scientific-agent eval. | evaluation, agents, inductive-reasoning, benchmark, scientific-reasoning |
| 2606.05037 | Self-Reflective APIs: Structure Beats Verbosity for AI Agent Recovery | cs.SE, cs.AI | 82 | Concrete agent-recovery result: structured API feedback beats verbose errors in pilot studies. | agents, tool-use, reliability, apis, evaluation |
| 2606.05030 | Imbuing Large Language Models with Bidirectional Logic for Robust Chain Repair | cs.CL, cs.SC | 82 | Goal-conditioned infilling to repair faulty CoT targets error snowballing in decoder-only LLMs. | reasoning, chain-of-thought, robustness, training, inference |

AI Paper Insight Brief

2026-06-05

0) Executive takeaways (read this first)

  • Persistent state is now the main agent security boundary. Multiple papers show that memory, files, tool descriptions, and other stored context can be poisoned or misdescribed, with cross-session attacks succeeding at non-trivial rates and existing prompt-injection defenses missing weak-signal variants.
  • Process-level evaluation is replacing endpoint-only evaluation. New benchmarks increasingly score planning, long-horizon iteration, clinical process, companionship, cyber workflows, and autonomous agent development by tracing intermediate decisions rather than just final answers.
  • A recurring design pattern is “structure over prose.” Stronger results come from explicit structure: provenance graphs, typed invariants, compile-time memory control, machine-readable API recovery hints, constraint-level verification, and trajectory-aware alignment.
  • Inference-time robustness remains shallow in many models. Safety-aligned models can still be redirected mid-generation; jailbreaks exploit under-covered natural registers; and lexical cues still dominate purported causal reasoning.
  • Training signals are getting more granular. Several papers improve learning by moving beyond scalar outcome rewards: token-level gradient reweighting, future-aware distillation, evaluator–solver co-evolution, and trajectory-pair preference optimization.
  • Frontier capability is increasingly bottlenecked by persistence, time-awareness, and iteration discipline—not just raw model quality. In long-horizon engineering and meta-agent settings, many failures come from premature stopping, budget misuse, brittle iteration, or exploit-seeking behavior.

2) Key themes (clusters)

Theme: Persistent-state attacks and agent security

Theme: Process-level evaluation for agents in the wild

Theme: Structured control, provenance, and auditable agent architectures

Theme: Robustness failures beyond obvious jailbreaks

Theme: Better training signals for reasoning and open-ended agents

3) Technical synthesis

  • Stage-wise decomposition is becoming standard: write/incorporation/activation for stored prompt injection, ASR/RSR for memory poisoning, S1–S4 for cyber tasks, and plan-grade/error taxonomies for planning.
  • Architectural choices dominate security outcomes: HERMES was much more vulnerable than OpenClaw in memory poisoning, and direct-loading channels were more exploitable than conditional-loading channels for stored prompt injection.
  • Weak-signal attacks are the recurring blind spot: policy-conformant memory poisoning, contextual-disguise SPI payloads, vernacular jailbreaks, and description-code mismatches all exploit semantics that evade surface detectors.
  • Many papers split capabilities that a single end-to-end score would conflate: CyberGym-E2E shows patch generation is much easier than end-to-end vulnerability discovery; APB separates planning from execution; AutoLab shows first-attempt quality is less predictive than iterative refinement.
  • Trajectory matters more than static state: CHARM monitors cross-stage drift, trajectory alignment trains on injected continuations, DistIL adds future-credit terms, and TRI repairs only broken spans between verified anchors.
  • Verification is shifting from scalar outcomes to structured constraints: VRP constraint injection checks omitted/spurious constraints; MedSP1000 scores rubric completion; Agentic Redux enforces invariants; self-reflective APIs return literal repair actions.
  • Prompting alone is often insufficient: several papers replace or augment prompt defenses with provenance, memory hardening, compile-time context control, or deterministic verifiers.
  • Benchmarks increasingly include anti-hacking design: MAC uses dual containers and auditing, AutoLab uses sealed verifiers and immutable files, CyberGym-E2E validates post-patch functionality, and self-reflective APIs explicitly audit leakage.
  • Model scale helps, but not uniformly: larger models often perform better on planning, companion safety judging, and long-horizon tasks, but reasoning add-ons or specialized medical models do not always beat stronger general models.
  • Theoretical work is converging with systems work: DistIL, IGA, TRI, watermarking, and Agentic Redux all pair formal guarantees with practical mechanisms, though empirical validation remains uneven.
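The "structured constraints" and "literal repair actions" points above can be made concrete with a small sketch. This is not the interface from the self-reflective APIs paper; `RepairHint`, `validate_transfer`, `apply_hints`, the error codes, and the toy transfer request are all hypothetical names chosen for illustration. The idea shown is only that an API returns a typed, machine-readable hint the agent can apply deterministically, rather than a prose error it must interpret.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class RepairHint:
    """Hypothetical machine-readable recovery hint attached to an API error."""
    code: str                    # stable error code the agent can branch on
    field_name: str              # which request field was rejected
    action: str                  # literal repair action: "set" or "ask_user"
    suggested_value: Any = None  # value to apply when action == "set"


def validate_transfer(request: dict) -> list[RepairHint]:
    """Toy validator (assumed schema): returns hints instead of a prose message."""
    hints = []
    if request.get("currency") not in {"USD", "EUR"}:
        hints.append(RepairHint("INVALID_ENUM", "currency", "set", "USD"))
    amount = request.get("amount")
    if not isinstance(amount, (int, float)) or amount <= 0:
        # No safe default exists for an amount, so the hint escalates.
        hints.append(RepairHint("OUT_OF_RANGE", "amount", "ask_user"))
    return hints


def apply_hints(request: dict, hints: list[RepairHint]):
    """Apply every deterministic hint; return (patched request, unresolved hints)."""
    patched = dict(request)  # never mutate the caller's request
    unresolved = []
    for hint in hints:
        if hint.action == "set":
            patched[hint.field_name] = hint.suggested_value
        else:
            unresolved.append(hint)
    return patched, unresolved
```

A caller would retry with the patched request only when `unresolved` is empty, and otherwise hand the remaining hints to a human or a higher-level policy. The design choice worth noting is that the agent never parses free text to decide what to do next.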

4) Top 5 papers (with “why now”)

1. From Untrusted Input to Trusted Memory: A Systematic Study of Memory Poisoning Attacks in LLM Agents

  • Identifies four memory write channels and nine structural vulnerabilities, giving a concrete map of where persistent compromise happens.
  • Introduces MPBench with 3,240 adversarial cases and explicit ASR/RSR metrics for persistent write and cross-session influence.
  • Shows large vulnerability differences across agent designs: HERMES averages 66.67% ASR / 64.70% RSR vs OpenClaw 34.25% / 17.40%.
  • Why now: persistent memory is moving from optional feature to core agent substrate, and this paper makes clear that current write paths are a major unprotected boundary.
  • Skeptical about: evaluation uses one base model and some benchmark deliveries emulate rather than fully exercise deployed retrieval/tool pipelines.

2. What If Prompt Injection Never Left? Exploring Cross-Session Stored Prompt Injection in Agentic Systems

  • Formalizes stored prompt injection as a system-level threat spanning injection source, persistence channel, and incorporation mechanism.
  • SPI-Benchmark’s stage-wise metrics show meaningful end-to-end exploitability across models, with overall E2E-ASR from 32.1% to 42.0%.
  • Finds fact manipulation is especially effective, while direct-loading channels like working memory and AGENTS.md are riskier than conditional archival memory.
  • Why now: many agent stacks are standardizing persistent artifacts and shared state, making stored prompt injection a likely default threat model.
  • Skeptical about: benchmark scope is intentionally initial and may miss emerging persistence mechanisms in rapidly changing agent architectures.
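To illustrate what stage-wise metrics buy over a single end-to-end number, here is a minimal sketch that computes conditional success rates per stage from trial records. The `Trial` record, the three stage names, and the conditional-rate definitions are assumptions made for this example; the paper's exact metric definitions may differ.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """One attack attempt (hypothetical record layout)."""
    written: bool       # payload was persisted to a store
    incorporated: bool  # payload was loaded into a later session's context
    activated: bool     # the later session acted on the payload


def stage_rates(trials):
    """Return per-stage conditional rates plus the end-to-end rate."""
    def rate(hits, total):
        return hits / total if total else 0.0

    written = [t for t in trials if t.written]
    incorporated = [t for t in written if t.incorporated]
    activated = [t for t in incorporated if t.activated]
    return {
        "write": rate(len(written), len(trials)),
        "incorporation_given_write": rate(len(incorporated), len(written)),
        "activation_given_incorporation": rate(len(activated), len(incorporated)),
        "end_to_end": rate(len(activated), len(trials)),
    }
```

Because each stage is conditioned on the previous one, the three stage rates multiply to the end-to-end rate, which makes it easy to see which stage a defense actually moved.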

3. AutoLab: Can Frontier Models Solve Long-Horizon Auto Research and Engineering Tasks?

  • Provides a hack-resistant, multi-hour benchmark of 36 closed-loop optimization tasks across systems, puzzles, model development, and CUDA.
  • Large-scale evaluation across 17 models shows claude-opus-4.6 leads (Avg@3 0.68, Dominance 0.93), but many failures come from poor persistence and time-awareness rather than inability to code.
  • Manual analysis of 302 zero-score rollouts surfaces concrete behavioral bottlenecks like premature stopping and budget exhaustion.
  • Why now: the field is shifting from short coding tasks to autonomous iteration, and this benchmark measures the capability frontier that matters for real automation.
  • Skeptical about: results are necessarily harness- and hardware-dependent, and the benchmark covers measurable engineering workflows more than open-ended scientific discovery.

4. Inference-Time Vulnerability Beyond Shallow Safety: Alignment Along Generation Trajectories

  • Shows that short token injections at arbitrary decoding steps can redirect aligned models, extending “shallow safety” into a broader trajectory-level vulnerability.
  • Proposes trajectory augmentation plus SimPO preference optimization, cutting injection ASR dramatically; e.g. Llama-3.1-8B on AdvBench drops from 92.12% to 4.42% in the reported setting.
  • Generalizes beyond the training attack to PAIR, Prefilling, and I-GCG while largely preserving utility.
  • Why now: many deployed attacks effectively control early or mid-generation tokens, so input-only alignment is no longer a sufficient defense model.
  • Skeptical about: training uses a single chosen injection phrase and greedy decoding, so robustness breadth under more diverse perturbations is still open.

5. CyberGym-E2E: Scalable Real-World Benchmark for AI Agents’ End-to-End Cybersecurity Capabilities

  • Builds a 920-instance benchmark from 139 OSS-Fuzz projects with reproducible environments, PoCs, patches, and validated tests.
  • Cleanly separates patch-only from end-to-end performance, showing discovery/PoC generation is the main bottleneck.
  • Example gap is stark: Opus 4.5 with Claude Code reaches 82.3% patch-only success but only 19.2% end-to-end S3 on the initial evaluation.
  • Why now: cyber capability claims are increasingly dual-use sensitive, and this benchmark gives a more realistic measure of what agents can actually do.
  • Skeptical about: current coverage is concentrated on memory-safety C/C++ vulnerabilities with sanitizer-based oracles and still requires human validation steps.

5) Practical next steps

  • Harden persistent state first: separate untrusted inputs from memory-write decisions, add provenance on every write, and gate retrieval/incorporation by source trust and recency.
  • Instrument stage-wise metrics in your agent stack: track write success, incorporation, activation, retrieval influence, and downstream action effects rather than only final task success.
  • Audit all persistent artifacts as attack surfaces: memory stores, AGENTS.md-like files, MCP tool descriptions, cached plans, and post-training datasets should all get integrity checks and review gates.
  • Adopt structured recovery and control surfaces: prefer machine-readable API repair hints, typed tool side-effect metadata, and explicit memory/block permissions over prose-only instructions.
  • Evaluate planning separately from execution: use planning diagnostics before end-to-end benchmarks so you can distinguish decomposition/tool-choice failures from environment noise.
  • Stress-test with weak-signal attacks: contextual disguise, policy-conformant memory writes, vernacular jailbreaks, and mid-generation injections should be part of routine red-teaming.
  • Add provenance and audit logs now, even if partial: claim-to-evidence links, tool-call lineage, memory write lineage, and rollback points will pay off for debugging and safety review.
  • Experiment with finer-grained training signals: token-level advantage reweighting, future-aware distillation, or trajectory-pair preference training are promising where scalar outcome rewards are too blunt.
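As a starting point for the first step above (provenance on every write, retrieval gated by source trust and recency), the following is a minimal sketch, not a hardened design. `MemoryStore`, `MemoryEntry`, the `TRUSTED_SOURCES` set, and the source labels are all hypothetical; a real system would need authenticated source labels and a tamper-evident log.

```python
import time
from dataclasses import dataclass, field

# Assumption for this sketch: only these writers are trusted at retrieval time.
TRUSTED_SOURCES = {"user", "system"}


@dataclass(frozen=True)
class MemoryEntry:
    content: str
    source: str        # who proposed the write, e.g. "user", "web", "tool:search"
    written_at: float  # Unix timestamp of the write


@dataclass
class MemoryStore:
    entries: list = field(default_factory=list)
    audit_log: list = field(default_factory=list)  # append-only write lineage

    def write(self, content, source, now=None):
        """Persist an entry with provenance and record the write in the log."""
        entry = MemoryEntry(content, source, time.time() if now is None else now)
        self.entries.append(entry)
        self.audit_log.append(("write", source, entry.written_at))
        return entry

    def retrieve(self, max_age_s, now=None):
        """Return only entries from trusted sources that are recent enough."""
        now = time.time() if now is None else now
        return [
            e for e in self.entries
            if e.source in TRUSTED_SOURCES and now - e.written_at <= max_age_s
        ]
```

In this sketch untrusted writes are still stored and logged, so they remain available for audit, but they never reach the model's context through `retrieve`. Whether to quarantine at write time or at retrieval time is a design decision the sketch leaves open.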

Generated from per-paper analyses; no external browsing.