June 5, 2026 Research Brief
Agent safety turns stateful.
Today’s strongest papers show agent risk and evaluation moving from single prompts and final answers toward persistent state, process tracing, and structured control surfaces.
Takeaways
- **Persistent state is now the main agent security boundary.** Multiple papers show that memory, files, tool descriptions, and other stored context can be poisoned or misdescribed, with cross-session attacks succeeding at non-trivial rates and existing prompt-injection defenses missing weak-signal variants.
- **Process-level evaluation is replacing endpoint-only evaluation.** New benchmarks increasingly score planning, long-horizon iteration, clinical process, companionship, cyber workflows, and autonomous agent development by tracing intermediate decisions rather than just final answers.
- **A recurring design pattern is “structure over prose.”** Stronger results come from explicit structure: provenance graphs, typed invariants, compile-time memory control, machine-readable API recovery hints, constraint-level verification, and trajectory-aware alignment.
Start with: From Untrusted Input to Trusted Memory: A Systematic Study of Memory Poisoning Attacks in LLM Agents
Why it catches my eye: It gives the clearest reusable map of persistent agent compromise, with a taxonomy, benchmark, and stage-wise metrics for memory poisoning.
Read skeptically for: Results use one base model and partially emulated pipelines, so transfer to deployed agent stacks is not fully settled.
Themes
Papers Worth Your Reading Time
Ranked for research usefulness: novelty, method pattern, evidence quality, and skepticism value.
From Untrusted Input to Trusted Memory: A Systematic Study of Memory Poisoning Attacks in LLM Agents
#1Useful if you build agents with memory: it decomposes persistent compromise into concrete write and activation paths you can actually audit.
- Why now
- Persistent memory is becoming a default agent feature, making stored-state compromise a practical deployment risk.
- Skepticism
- One-model evaluation and partially emulated pipelines may understate architecture-specific behavior.
Provably Auditable and Safe LLM Agents from Human-Authored Ontologies
#2A strong companion read because it offers a concrete alternative architecture: auditable agents grounded in ontologies, invariants, and append-only logs.
- Why now
- As persistent-state attacks rise, formal control and auditability are becoming more relevant than prompt-only safeguards.
- Skepticism
- The guarantees depend on correct ontology and invariant design, with limited large-scale deployment evidence so far.
What If Prompt Injection Never Left? Exploring Cross-Session Stored Prompt Injection in Agentic Systems
#3It sharpens the threat model by showing how prompt injection persists through working memory and shared artifacts across sessions.
- Why now
- Agent stacks are standardizing shared files and memory, so cross-session persistence is becoming a baseline security assumption.
- Skepticism
- The benchmark is intentionally early and may miss newer persistence channels in fast-changing agent frameworks.
Chinese version: [中文]
Run stats
- Candidates: 307
- Selected: 30
- Deepread completed: 30
- Window (UTC): 2026-06-03T00:00:00Z → 2026-06-04T00:00:00Z (arxiv_announce, expanded=0)
Show selected papers
| arXiv ID | Title / Links | Categories | Score | Why | Tags |
|---|---|---|---|---|---|
2606.04329 | From Untrusted Input to Trusted Memory: A Systematic Study of Memory Poisoning Attacks in LLM Agents | cs.CR, cs.AI | 96 | Systematic memory-poisoning study with taxonomy, benchmark, and strong agent-safety relevance. | agent-safety, prompt-injection, memory, benchmark, security |
2606.04425 | What If Prompt Injection Never Left? Exploring Cross-Session Stored Prompt Injection in Agentic Systems | cs.CR, cs.AI | 95 | Introduces cross-session stored prompt injection, a realistic persistent threat in agentic systems. | agent-safety, prompt-injection, persistent-state, security, agents |
2606.04769 | Description-Code Inconsistency in Real-world MCP Servers: Measurement, Detection, and Security Implications | cs.CR, cs.AI, cs.SE | 95 | Studies MCP tool description/code mismatch, with taxonomy and detection for agent tool-use security. | agents, tool-use, MCP, security, prompt-injection, auditing |
2606.04903 | Provably Auditable and Safe LLM Agents from Human-Authored Ontologies | cs.LO, cs.AI, cs.MA, cs.PL | 95 | Auditable, ontology-grounded LLM agents with formal safety/correctness claims and append-only logs. | agent-safety, auditing, formal-methods, ontologies, governance |
2606.04929 | Sequential Data Poisoning in LLM Post-Training | cs.LG, cs.CR | 94 | Introduces sequential poisoning across SFT and preference stages; strong relevance to LLM post-training security. | LLM, data-poisoning, post-training, RLHF, DPO, security |
2606.04778 | Inference-Time Vulnerability Beyond Shallow Safety: Alignment Along Generation Trajectories | cs.AI, cs.CL, cs.LG | 93 | Shows inference-time safety can fail mid-generation and proposes trajectory-level alignment. | alignment, robustness, jailbreaks, inference-time, llm-safety |
2606.04483 | Off-Distribution Voices: Fanfiction Subgenres as Universal Vernacular Jailbreaks for Aligned LLMs | cs.CL | 93 | Natural-language jailbreak family with high ASR across models; important robustness failure mode. | jailbreaks, alignment, robustness, red-teaming, safety-eval |
2606.04867 | AICompanionBench: Benchmarking LLMs-as-Judges for AI Companion Safety | cs.AI | 92 | Public benchmark for AI companion safety; evaluates LLM judges on real harmful conversations. | safety, benchmark, llm-as-judge, harm-detection, evaluation |
2606.04455 | The Meta-Agent Challenge: Are Current Agents Capable of Autonomous Agent Development? | cs.AI, cs.CL | 91 | New benchmark for autonomous agent development with anti-reward-hacking safeguards. | agents, benchmark, evaluation, reward-hacking, autonomy |
2606.04460 | CyberGym-E2E: Scalable Real-World Benchmark for AI Agents' End-to-End Cybersecurity Capabilities | cs.CR, cs.AI, cs.LG | 91 | Large-scale end-to-end cyber benchmark for AI agents across vuln discovery, exploit, and patching. | agents, cybersecurity, benchmark, evaluation, red-teaming |
2606.05152 | Reinforcement Learning from Rich Feedback with Distributional DAgger | cs.LG, cs.AI, cs.CL | 90 | General RL recipe for rich feedback beyond binary rewards; relevant to reasoning and agent training. | rlhf, reasoning, training, imitation-learning, credit-assignment |
2606.04628 | RAMPART: Registry-based Agentic Memory with Priority-Aware Runtime Transformation | cs.CL, cs.MA | 90 | Agent memory architecture with permissions/provenance; directly relevant to safer agent runtime design. | agents, memory, safety, permissions, provenance, runtime |
2606.04874 | Agent Planning Benchmark: A Diagnostic Framework for Planning Capabilities in LLM Agents | cs.CL | 89 | Large diagnostic planning benchmark exposing long-horizon, tool-noise, and refusal weaknesses. | agents, planning, benchmark, evaluation, multimodal |
2606.05080 | AutoLab: Can Frontier Models Solve Long-Horizon Auto Research and Engineering Tasks? | cs.AI, cs.LG | 89 | Benchmark for ultra long-horizon closed-loop optimization, a key gap in agent capability evaluation. | agents, benchmark, long-horizon, evaluation, autonomy |
2606.04486 | Global Sketch-Based Watermarking for Diffusion Language Models | cs.CR, cs.CL, cs.LG, stat.ML | 88 | Watermarking for diffusion LMs via global sketches; notable for provenance and misuse detection. | watermarking, diffusion-language-models, security, provenance, generation |
2606.04435 | Cascading Hallucination in Agentic RAG: The CHARM Framework for Detection and Mitigation | cs.AI, cs.CL, cs.CR, cs.IR | 87 | Targets cascading hallucination in agentic RAG with taxonomy and mitigation framework. | rag, hallucination, agents, reliability, grounding |
2606.05122 | Self-Evaluation Is Already There: Eliciting Latent Judge Calibration in Base LLMs with Minimal Data | cs.CL | 87 | Elicits latent self-evaluation/judge calibration in base LLMs with minimal data; useful for reliability. | LLM, self-evaluation, calibration, judging, reliability |
2606.04507 | Self-Evolving Deep Research via Joint Generation and Evaluation | cs.CL, cs.AI | 87 | Targets deep-research agents with joint generator-evaluator training for open-ended tasks. | agents, evaluation, self-improvement, deep-research, post-training |
2606.04915 | Caliper: Probing Lexical Anchors versus Causal Structure in LLMs | cs.CL, cs.IR | 87 | Controlled benchmark shows causal reasoning often relies on lexical anchors, not structure. | evaluation, reasoning, causal-reasoning, robustness, benchmarks |
2606.04889 | GRAIL: Gradient-Reweighted Advantages for Reinforcement Learning with Verifiable Rewards | cs.CL | 86 | Token-wise advantage reweighting for verifiable-reward RL could improve reasoning post-training efficiency. | llm-training, rlvr, reasoning, post-training, optimization |
2606.04660 | LifeSide: Benchmarking Agents as Lifelong Digital Companions | cs.CL | 85 | Benchmark for lifelong companion agents covering memory, privacy, and emotional adaptation across sessions. | agents, benchmark, memory, privacy, evaluation |
2606.04816 | Beyond Objective Equivalence: Constraint Injection for LLM-Based Optimization Modeling on Vehicle Routing Problems | cs.AI, cs.LG | 85 | Improves reliability of LLM-generated optimization code by exposing omitted or spurious constraints. | reliability, code-generation, verification, optimization, evaluation |
2606.04990 | From Agent Traces to Trust: Evidence Tracing and Execution Provenance in LLM Agents | cs.CR, cs.AI | 84 | Useful survey on evidence tracing and provenance for auditing complex LLM agent executions. | agents, auditing, provenance, survey, trust |
2606.04928 | Data Attribution in Large Language Models via Bidirectional Gradient Optimization | cs.LG, cs.CL | 84 | LLM training-data attribution supports provenance, accountability, and governance of model outputs. | data-attribution, governance, interpretability, provenance, llms |
2606.05158 | Streaming Communication in Multi-Agent Reasoning | cs.CL, cs.AI, cs.MA | 84 | Streaming multi-agent communication cuts latency and may improve reasoning quality in agent pipelines. | multi-agent, reasoning, latency, systems, agents |
2606.05112 | Evaluating Large Language Models in Dynamic Clinical Decision-Making with Standardized Patient Cases | cs.CL | 83 | Interactive standardized-patient benchmark for dynamic clinical agent evaluation over full encounters. | agents, benchmark, clinical, evaluation, interactive |
2606.05025 | Invariant Gradient Alignment for Robust Reasoning Distillation | cs.LG, cs.AI | 83 | Addresses shortcut learning in reasoning distillation with cross-domain invariant gradient alignment. | reasoning, distillation, ood-robustness, generalization, training |
2606.04751 | FALSIFYBENCH: Evaluating Inductive Reasoning in LLMs with Rule Discovery Games | cs.AI | 83 | Benchmark for hypothesis-driven inductive reasoning in LLM agents; useful for scientific-agent eval. | evaluation, agents, inductive-reasoning, benchmark, scientific-reasoning |
2606.05037 | Self-Reflective APIs: Structure Beats Verbosity for AI Agent Recovery | cs.SE, cs.AI | 82 | Concrete agent-recovery result: structured API feedback beats verbose errors in pilot studies. | agents, tool-use, reliability, apis, evaluation |
2606.05030 | Imbuing Large Language Models with Bidirectional Logic for Robust Chain Repair | cs.CL, cs.SC | 82 | Goal-conditioned infilling to repair faulty CoT targets error snowballing in decoder-only LLMs. | reasoning, chain-of-thought, robustness, training, inference |
AI Paper Insight Brief
2026-06-05
0) Executive takeaways (read this first)
- Persistent state is now the main agent security boundary. Multiple papers show that memory, files, tool descriptions, and other stored context can be poisoned or misdescribed, with cross-session attacks succeeding at non-trivial rates and existing prompt-injection defenses missing weak-signal variants.
- Process-level evaluation is replacing endpoint-only evaluation. New benchmarks increasingly score planning, long-horizon iteration, clinical process, companionship, cyber workflows, and autonomous agent development by tracing intermediate decisions rather than just final answers.
- A recurring design pattern is “structure over prose.” Stronger results come from explicit structure: provenance graphs, typed invariants, compile-time memory control, machine-readable API recovery hints, constraint-level verification, and trajectory-aware alignment.
- Inference-time robustness remains shallow in many models. Safety-aligned models can still be redirected mid-generation; jailbreaks exploit under-covered natural registers; and lexical cues still dominate purported causal reasoning.
- Training signals are getting more granular. Several papers improve learning by moving beyond scalar outcome rewards: token-level gradient reweighting, future-aware distillation, evaluator–solver co-evolution, and trajectory-pair preference optimization.
- Frontier capability is increasingly bottlenecked by persistence, time-awareness, and iteration discipline—not just raw model quality. In long-horizon engineering and meta-agent settings, many failures come from premature stopping, budget misuse, brittle iteration, or exploit-seeking behavior.
2) Key themes (clusters)
Theme: Persistent-state attacks and agent security
- Why it matters: Agent risk is shifting from single-session prompt injection to attacks that persist in memory, files, and tool ecosystems. Once malicious state is stored, later sessions can reactivate it without the attacker being present.
- Representative papers:
- From Untrusted Input to Trusted Memory: A Systematic Study of Memory Poisoning Attacks in LLM Agents
- What If Prompt Injection Never Left? Exploring Cross-Session Stored Prompt Injection in Agentic Systems
- Description-Code Inconsistency in Real-world MCP Servers: Measurement, Detection, and Security Implications
- Sequential Data Poisoning in LLM Post-Training
- Common approach:
- Formalize the write/incorporation pipeline rather than treating attacks as prompt-only phenomena.
- Break exploitability into stages such as write, retrieval/incorporation, and activation.
- Build benchmarks with stage-wise metrics (ASR/RSR, WSR/IR/AR, end-to-end ASR).
- Analyze system artifacts beyond prompts: memory stores, AGENTS.md-like files, MCP descriptions, and post-training datasets.
- Open questions / failure modes:
- Weak-signal attacks remain hard to detect even when strong-signal prompt injection is filtered.
- Results are architecture-sensitive; write aggression and retrieval policy strongly change vulnerability.
- Static semantic checks for tools and memory cannot fully capture runtime-dependent behavior.
- End-to-end defenses across multi-stage post-training pipelines remain underdeveloped.
Theme: Process-level evaluation for agents in the wild
- Why it matters: Benchmarks are moving from “did the model get the answer?” to “did the system act correctly over time?” This is crucial for planning, medicine, cyber, companionship, and long-horizon engineering where intermediate mistakes dominate real-world failure.
- Representative papers:
- AutoLab: Can Frontier Models Solve Long-Horizon Auto Research and Engineering Tasks?
- Agent Planning Benchmark: A Diagnostic Framework for Planning Capabilities in LLM Agents
- CyberGym-E2E: Scalable Real-World Benchmark for AI Agents’ End-to-End Cybersecurity Capabilities
- Evaluating Large Language Models in Dynamic Clinical Decision-Making with Standardized Patient Cases
- Common approach:
- Use executable or interactive environments with held-out verification rather than static QA.
- Separate upstream competencies (planning, PoC discovery, process adherence) from downstream execution.
- Score partial progress continuously, not just binary success.
- Audit failure modes manually or with structured taxonomies to localize bottlenecks.
- Open questions / failure modes:
- Harness choice can materially change outcomes, making model-vs-system attribution difficult.
- Many benchmarks remain expensive, domain-specific, or dependent on proprietary judges.
- Long-horizon success is often limited by time-awareness and iteration policy rather than first-try quality.
- Simulated environments may still miss real-world messiness, especially in clinical and social domains.
Theme: Structured control, provenance, and auditable agent architectures
- Why it matters: Several papers argue that safer agents need explicit structure around memory, execution, and policy enforcement—not just better prompting. Provenance, invariants, and compile-time control are emerging as core systems primitives.
- Representative papers:
- Provably Auditable and Safe LLM Agents from Human-Authored Ontologies
- From Agent Traces to Trust: Evidence Tracing and Execution Provenance in LLM Agents
- RAMPART: Registry-based Agentic Memory with Priority-Aware Runtime Transformation
- Self-Reflective APIs: Structure Beats Verbosity for AI Agent Recovery
- Common approach:
- Replace opaque context assembly with typed, addressable, provenance-bearing structures.
- Enforce policy through architecture: centralized adjudication, append-only logs, permissioned memory blocks, structured recovery payloads.
- Treat execution traces and memory lineage as first-class audit artifacts.
- Use compile-time or runtime structural controls rather than relying on natural-language instructions alone.
- Open questions / failure modes:
- Many proposals are conceptual or theory-heavy and lack large-scale deployment evidence.
- Strong guarantees depend on correct invariant specification and disciplined integration.
- Provenance systems still lack unified schemas and low-overhead runtime implementations.
- Synthetic probes may overstate generality of position and grouping effects in memory systems.
Theme: Robustness failures beyond obvious jailbreaks
- Why it matters: Alignment failures are not limited to adversarial suffixes. Models remain vulnerable to natural-register jailbreaks, mid-generation redirection, lexical shortcuts, and cascade amplification in multi-step pipelines.
- Representative papers:
- Off-Distribution Voices: Fanfiction Subgenres as Universal Vernacular Jailbreaks for Aligned LLMs
- Inference-Time Vulnerability Beyond Shallow Safety: Alignment Along Generation Trajectories
- Cascading Hallucination in Agentic RAG: The CHARM Framework for Detection and Mitigation
- Caliper: Probing Lexical Anchors versus Causal Structure in LLMs
- Common approach:
- Stress models under distribution shifts that preserve task semantics but alter surface form or trajectory.
- Evaluate intermediate states, not just final outputs.
- Use ablations to separate structure from lexical or stylistic cues.
- Add monitoring layers or trajectory-aware training to interrupt failure propagation.
- Open questions / failure modes:
- Input-layer defenses can suppress old jailbreak templates while leaving stronger distributional attacks intact.
- Hidden-state refusal signals do not reliably predict robustness to mid-sequence perturbation.
- Some detectors rely on synthetic injections and may not transfer cleanly to organic failures.
- Robustness claims remain sensitive to decoding mode, judge choice, and benchmark lexicalization.
Theme: Better training signals for reasoning and open-ended agents
- Why it matters: A common bottleneck is coarse credit assignment. New work tries to replace flat sequence-level rewards with token-, trajectory-, or rubric-level signals that better match what actually matters.
- Representative papers:
- GRAIL: Gradient-Reweighted Advantages for Reinforcement Learning with Verifiable Rewards
- Reinforcement Learning from Rich Feedback with Distributional DAgger
- Self-Evolving Deep Research via Joint Generation and Evaluation
- Self-Evaluation Is Already There: Eliciting Latent Judge Calibration in Base LLMs with Minimal Data
- Common approach:
- Derive finer-grained supervision from intrinsic gradients, teacher distributions, evaluator consistency, or self-predicted judge scores.
- Keep external supervision lightweight by eliciting latent capability or using black-box feedback.
- Alternate generation and evaluation updates to avoid static-reward saturation.
- Preserve answer quality while improving calibration or reasoning accuracy.
- Open questions / failure modes:
- Better credit assignment often adds substantial compute overhead.
- Many methods are validated on narrow domains such as math or report writing.
- Judge-aligned self-evaluation may inherit judge bias rather than human preference.
- Co-evolving evaluators risk collapse or reward gaming without strong external anchors.
3) Technical synthesis
- Stage-wise decomposition is becoming standard: write/incorporation/activation for stored prompt injection, ASR/RSR for memory poisoning, S1–S4 for cyber tasks, and plan-grade/error taxonomies for planning.
- Architectural choices dominate security outcomes: HERMES was much more vulnerable than OpenClaw in memory poisoning, and direct-loading channels outperformed conditional-loading channels for stored prompt injection.
- Weak-signal attacks are the recurring blind spot: policy-conformant memory poisoning, contextual-disguise SPI payloads, vernacular jailbreaks, and description-code mismatches all exploit semantics that evade surface detectors.
- Many papers separate “can patch” from “can discover”: CyberGym-E2E shows patch generation is much easier than end-to-end vulnerability discovery; APB separates planning from execution; AutoLab shows first-attempt quality is less predictive than iterative refinement.
- Trajectory matters more than static state: CHARM monitors cross-stage drift, trajectory alignment trains on injected continuations, DistIL adds future-credit terms, and TRI repairs only broken spans between verified anchors.
- Verification is shifting from scalar outcomes to structured constraints: VRP constraint injection checks omitted/spurious constraints; MedSP1000 scores rubric completion; Agentic Redux enforces invariants; self-reflective APIs return literal repair actions.
- Prompting alone is often insufficient: several papers replace or augment prompt defenses with provenance, memory hardening, compile-time context control, or deterministic verifiers.
- Benchmarks increasingly include anti-hacking design: MAC uses dual containers and auditing, AutoLab uses sealed verifiers and immutable files, CyberGym-E2E validates post-patch functionality, and self-reflective APIs explicitly audit leakage.
- Model scale helps, but not uniformly: larger models often perform better on planning, companion safety judging, and long-horizon tasks, but reasoning add-ons or specialized medical models do not always beat stronger general models.
- Theoretical work is converging with systems work: DistIL, IGA, TRI, watermarking, and Agentic Redux all pair formal guarantees with practical mechanisms, though empirical validation remains uneven.
4) Top 5 papers (with “why now”)
1. From Untrusted Input to Trusted Memory: A Systematic Study of Memory Poisoning Attacks in LLM Agents
- Identifies four memory write channels and nine structural vulnerabilities, giving a concrete map of where persistent compromise happens.
- Introduces MPBench with 3,240 adversarial cases and explicit ASR/RSR metrics for persistent write and cross-session influence.
- Shows large real vulnerability differences across agent designs: HERMES averages 66.67% ASR / 64.70% RSR vs OpenClaw 34.25% / 17.40%.
- Why now: persistent memory is moving from optional feature to core agent substrate, and this paper makes clear that current write paths are a major unprotected boundary.
- Skeptical about: evaluation uses one base model and some benchmark deliveries emulate rather than fully exercise deployed retrieval/tool pipelines.
2. What If Prompt Injection Never Left? Exploring Cross-Session Stored Prompt Injection in Agentic Systems
- Formalizes stored prompt injection as a system-level threat spanning injection source, persistence channel, and incorporation mechanism.
- SPI-Benchmark’s stage-wise metrics show meaningful end-to-end exploitability across models, with overall E2E-ASR from 32.1% to 42.0%.
- Finds fact manipulation is especially effective, while direct-loading channels like working memory and AGENTS.md are riskier than conditional archival memory.
- Why now: many agent stacks are standardizing persistent artifacts and shared state, making stored prompt injection a likely default threat model.
- Skeptical about: benchmark scope is intentionally initial and may miss emerging persistence mechanisms in rapidly changing agent architectures.
3. AutoLab: Can Frontier Models Solve Long-Horizon Auto Research and Engineering Tasks?
- Provides a hack-resistant, multi-hour benchmark of 36 closed-loop optimization tasks across systems, puzzles, model development, and CUDA.
- Large-scale evaluation across 17 models shows claude-opus-4.6 leads (Avg@3 0.68, Dominance 0.93), but many failures come from poor persistence and time-awareness rather than inability to code.
- Manual analysis of 302 zero-score rollouts surfaces concrete behavioral bottlenecks like premature stopping and budget exhaustion.
- Why now: the field is shifting from short coding tasks to autonomous iteration, and this benchmark measures the capability frontier that matters for real automation.
- Skeptical about: results are necessarily harness- and hardware-dependent, and the benchmark covers measurable engineering workflows more than open-ended scientific discovery.
4. Inference-Time Vulnerability Beyond Shallow Safety: Alignment Along Generation Trajectories
- Shows that short token injections at arbitrary decoding steps can redirect aligned models, extending “shallow safety” into a broader trajectory-level vulnerability.
- Proposes trajectory augmentation plus SimPO preference optimization, cutting injection ASR dramatically; e.g. Llama-3.1-8B on AdvBench drops from 92.12% to 4.42% in the reported setting.
- Generalizes beyond the training attack to PAIR, Prefilling, and I-GCG while largely preserving utility.
- Why now: many deployed attacks effectively control early or mid-generation tokens, so input-only alignment is no longer a sufficient defense model.
- Skeptical about: training uses a single chosen injection phrase and greedy decoding, so robustness breadth under more diverse perturbations is still open.
5. CyberGym-E2E: Scalable Real-World Benchmark for AI Agents’ End-to-End Cybersecurity Capabilities
- Builds a 920-instance benchmark from 139 OSS-Fuzz projects with reproducible environments, PoCs, patches, and validated tests.
- Cleanly separates patch-only from end-to-end performance, showing discovery/PoC generation is the main bottleneck.
- Example gap is stark: Opus 4.5 with Claude Code reaches 82.3% patch-only success but only 19.2% end-to-end S3 on the initial evaluation.
- Why now: cyber capability claims are increasingly dual-use sensitive, and this benchmark gives a more realistic measure of what agents can actually do.
- Skeptical about: current coverage is concentrated on memory-safety C/C++ vulnerabilities with sanitizer-based oracles and still requires human validation steps.
5) Practical next steps
- Harden persistent state first: separate untrusted inputs from memory-write decisions, add provenance on every write, and gate retrieval/incorporation by source trust and recency.
- Instrument stage-wise metrics in your agent stack: track write success, incorporation, activation, retrieval influence, and downstream action effects rather than only final task success.
- Audit all persistent artifacts as attack surfaces: memory stores, AGENTS.md-like files, MCP tool descriptions, cached plans, and post-training datasets should all get integrity checks and review gates.
- Adopt structured recovery and control surfaces: prefer machine-readable API repair hints, typed tool side-effect metadata, and explicit memory/block permissions over prose-only instructions.
- Evaluate planning separately from execution: use planning diagnostics before end-to-end benchmarks so you can distinguish decomposition/tool-choice failures from environment noise.
- Stress-test with weak-signal attacks: contextual disguise, policy-conformant memory writes, vernacular jailbreaks, and mid-generation injections should be part of routine red-teaming.
- Add provenance and audit logs now, even if partial: claim-to-evidence links, tool-call lineage, memory write lineage, and rollback points will pay off for debugging and safety review.
- Experiment with finer-grained training signals: token-level advantage reweighting, future-aware distillation, or trajectory-pair preference training are promising where scalar outcome rewards are too blunt.
Generated from per-paper analyses; no external browsing.