Takeaways

Agent safety work is shifting from prompt-level moderation to **stateful, evidence-grounded control**: several papers build executable verifiers, runtime monitors, context-governance layers, or cryptographic/state-continuity mechanisms for agents rather than relying on output text alone.
A recurring pattern is **“structure beats heuristics”**: framework-aware static analysis (AgentFlow), typed/governed context stores (ContextNest, ElephantAgent), and process/rubric-based evaluation (SkillCoach, MRRG, VERA) all outperform coarse end-state or keyword-style checks.
The offensive side is also getting sharper: multilingual/code-switching jailbreaks (STEER), distributed multi-PR sabotage, skill-composition fuzzing, and scanner-evasion malware for agent skills all show that **surface-form defenses are brittle**.

Start with: Safety Testing LLM Agents at Scale: From Risk Discovery to Evidence-Grounded Verification

Why it catches my eye: It offers a reusable evaluation architecture for tool-using agents with executable safety cases and deterministic verification.

Read skeptically for: Coverage is limited to runtime-exercisable risks, and results depend on scenario generation and infrastructure quality.

agent-safety evaluation verification tool-use

arXiv PDF

Themes

Agent runtime safety is moving to evidence-grounded verification The strongest agent-safety papers today focus on whether harmful actions actually occurred in the environment, not whether the model merely looked safe in text. This is a meaningful shift for tool-using agents, where side effects, state changes, and cross-turn behavior matter more than single-turn refusals.

Context, memory, and tool state are becoming first-class security boundaries A large share of agent failures now come from poisoned memory, stale retrieval, mutable tool descriptors, or uncontrolled context growth. Multiple papers argue that “relevance” is not enough; agents need governed, state-aware, and verifiable context.

Static analysis and supply-chain security for agent ecosystems are maturing Agent programs and skill marketplaces now look enough like software ecosystems that classic supply-chain and program-analysis ideas are becoming useful again—but only when adapted to framework semantics and runtime behavior.

Signal Safety is moving into runtime state. VERA, online monitoring, telecom guardrail validation, and coding-agent constraints all verify actions through traces, environment state, or calibrated intervention rules.

Tension Attackers exploit persistence and composition. Persistent multi-PR sabotage, skill-composition fuzzing, scanner-evasive malware, and code-switched jailbreaks show text-only or install-time defenses miss real agent surfaces.

Bet Structured oversight will beat heuristics. AgentFlow, ContextNest, ElephantAgent, and SkillCoach all gain leverage by modeling dependencies, governed context, and process rubrics instead of coarse end-state checks.

Papers Worth Your Reading Time

Ranked for research usefulness: novelty, method pattern, evidence quality, and skepticism value.

Safety Testing LLM Agents at Scale: From Risk Discovery to Evidence-Grounded Verification

A strong first read for anyone building agent evals because it turns safety testing into executable, auditable verification.

Why now: Teams deploying tool-using agents need something stronger than prompt judges and ad hoc red-teaming.
Skepticism: It mainly covers inference-time risks that can be exercised in scenarios, not the full deployment stack.

arXiv PDF

AgentFlow: Building Agent Dependency Graphs for Static Analysis of Agent Programs

Useful companion to runtime testing because it makes prompts, tools, memory, and handoffs statically analyzable at codebase scale.

Why now: Agent codebases are growing faster than manual review, creating demand for software-style governance and auditing.
Skepticism: Static analysis over-approximates and has framework and language coverage limits, so findings need dynamic confirmation.

arXiv PDF

Distributed Attacks in Persistent-State AI Control

It sharpens the threat model for coding agents by showing how gradual, stateful sabotage evades per-diff monitoring.

Why now: Coding agents are increasingly used across iterative repository workflows, where persistence matters more than single-turn safety.
Skepticism: The benchmark setup is still simpler and smaller than many real enterprise repositories.

arXiv PDF

Chinese version: [中文]

Run stats

Candidates: 284
Selected: 30
Deepread completed: 30
Window (UTC): 2026-07-02T00:00:00Z → 2026-07-03T00:00:00Z (arxiv_announce, expanded=0)

Show selected papers

arXiv ID	Title / Links	Categories	Score	Why	Tags
`2607.02514`	Distributed Attacks in Persistent-State AI Control PDF	cs.AI	96	Directly studies persistent-codebase attacks by coding agents; highly relevant AI control benchmark.	agent-safety, ai-control, coding-agents, prompt-injection, benchmark, security
`2607.02389`	Steerability via constraints: a substrate for scalable oversight of coding agents PDF	cs.AI, cs.CR, cs.SE	95	Constraint-based oversight for coding agents; strong security framing and large backdoor-detection gain.	agent-safety, coding-agents, oversight, security, backdoor-detection
`2607.02345`	SkillFuzz: Fuzzing Skill Composition for Implicit Intents Discovery in Open Skill Marketplaces PDF	cs.SE, cs.AI, cs.CL	95	Fuzzes composed agent skills to find hidden malicious intents; highly relevant to agent security.	agent-safety, security, fuzzing, skills, implicit-intents, evaluation
`2607.01793`	Safety Testing LLM Agents at Scale: From Risk Discovery to Evidence-Grounded Verification PDF	cs.AI	94	Automated, scalable safety testing for tool-using agents with risk discovery and verification pipeline.	agent-safety, evaluation, tool-use, verification, benchmark, red-teaming
`2607.02510`	Online Safety Monitoring for LLMs PDF	cs.AI, cs.CL, cs.LG, stat.AP, stat.ML	93	Practical online safety monitor with risk-controlled alarms; directly relevant to deployment-time LLM safety.	llm-safety, monitoring, risk-control, deployment, red-teaming
`2607.02072`	kNNGuard: Turning LLM Hidden Activations into a Training-Free Configurable Guardrail PDF	cs.LG, cs.AI, cs.CR	93	Training-free LLM guardrail using hidden activations; strong safety relevance and concrete speed/F1 gains.	llm-safety, guardrails, adversarial-prompts, activation-space, training-free
`2607.02079`	HaloGuard 1.0: An Open Weights Constitutional Classifier for Multilingual AI Safety PDF	cs.CL, cs.CR, cs.LG	92	Open multilingual safety classifier with strong benchmark claims and practical guardrail utility.	guardrails, multilingual, classifier, safety, open-weights, benchmark
`2607.02507`	What LLM Agents Say When No One Is Watching: Social Structure and Latent Objective Emergence in Multi-Agent Debates PDF	cs.AI, cs.CL, cs.LG, cs.MA	91	Shows latent objective emergence and public/private divergence in multi-agent debates across 10 models.	multi-agent, alignment, evaluation, social-structure, deception-risk
`2607.02210`	Criticality-Based Guard Rail Validation for AI Agent Decisions in Autonomous Telecom Networks PDF	cs.AI, cs.NI	91	Runtime guardrail validation for autonomous agents with criticality scoring before execution.	agent-safety, guardrails, runtime-monitoring, autonomy, telecom, risk-control
`2607.02513`	LACUNA: A Testbed for Evaluating Localization Precision for LLM Unlearning PDF	cs.CL, cs.AI, cs.LG	90	First parameter-level unlearning localization testbed; strong relevance to privacy and reliability.	unlearning, privacy, pii, evaluation, llm-safety, benchmark
`2607.01874`	SkillCoach: Self-Evolving Rubrics for Evaluating and Enhancing Agentic Skill-Use PDF	cs.AI, cs.CL	90	Targets agent skill-use failures with process rubrics; useful for evaluating and improving agent reliability.	agents, evaluation, rubrics, reliability, skill-use
`2607.01919`	ElephantAgent: Contextual State Continuity in Agentic Systems PDF	cs.AI, cs.CR	89	Defends agent memory/tool poisoning via contextual state continuity; novel agent security angle.	agent-safety, memory-poisoning, tool-poisoning, protocols, security, agents
`2607.01940`	Conditional Co-Ablation: Recovering Self-Repair Backups in Transformer Circuits PDF	cs.LG, cs.AI	89	Interpretability method for recovering self-repair backups in transformer circuits; useful for auditing.	interpretability, mechanistic-interpretability, transformers, auditing, reliability
`2607.02357`	Cloak and Detonate: Scanner Evasion and Dynamic Detection of Agent Skill Malware PDF	cs.CR, cs.SE	88	Adaptive evasion study for malicious agent skills plus dynamic detection; timely supply-chain risk.	agent-safety, malware, supply-chain, evasion, detection, coding-agents
`2607.02116`	ContextNest: Verifiable Context Governance for Autonomous AI Agent PDF	cs.AI	88	Verifiable context governance for agents/RAG with provenance, integrity, and point-in-time reconstruction.	rag, provenance, governance, agents, integrity
`2607.02032`	PACE: A Proxy for Agentic Capability Evaluation PDF	cs.AI, cs.CL	88	Cheap proxy for costly agent benchmarks could speed agent eval cycles and model selection.	agents, evaluation, benchmarks, proxy-metrics, swe-bench
`2607.02512`	Program-as-Weights: A Programming Paradigm for Fuzzy Functions PDF	cs.LG, cs.AI, cs.CL	88	Compiles NL specs into compact neural programs; strong efficiency idea with released 10M-example dataset.	llm, program-synthesis, efficiency, post-training, dataset, local-models
`2607.02121`	Behind the Refusal: Determining Guardrail Activation via Behavioral Monitoring PDF	cs.CR, cs.AI	87	Black-box method to distinguish guardrail blocks from model refusals; useful for AI security auditing.	guardrails, auditing, black-box, jailbreaks, security, monitoring
`2607.01690`	Epistemic Goggles: A Pretrained Module that Induces an Epistemic Frame via Gradient Editing PDF	cs.AI, cs.CL, cs.LG	87	Addresses negation neglect via gradient editing to induce epistemic framing during finetuning.	alignment, factuality, finetuning, epistemics, reliability
`2607.01831`	Lynx: Progressive Speculative Quantization for accelerating KV Transfer in Long-Context Inference PDF	cs.DC, cs.LG	87	Long-context serving advance: speculative KV quantization cuts transfer bottlenecks in agentic/RAG inference.	llm-systems, long-context, inference, kv-cache, quantization, rag
`2607.01859`	Safety Targeted Embedding Exploit via Refinement PDF	cs.AI, cs.CL	86	Shows multilingual/code-switch jailbreak weakness with high ASR; concrete attack on safety generalization.	jailbreaks, multilingual, alignment, robustness, adversarial, llm-safety
`2607.01935`	A-TMA: Decoupling State-Aware Memory Failures in Long-Term Agent Memory PDF	cs.AI	86	Addresses long-term agent memory state errors; important for persistent assistants and reliability.	agents, memory, reliability, long-context, state-tracking
`2607.02255`	AgenticSTS: A Bounded-Memory Testbed for Long-Horizon LLM Agents PDF	cs.AI, cs.CL	85	Bounded-memory testbed for long-horizon agents; useful for isolating memory effects and evaluation.	agents, long-context, memory, benchmark, evaluation
`2607.01893`	Spec-AUF: Accept-Until-Fail Training under Train-Inference Misalignment for Masked Block Drafters PDF	cs.AI, cs.CL	85	Training objective better aligned with speculative decoding acceptance behavior; practical LLM inference gain.	llm, speculative-decoding, training-objectives, inference-efficiency, alignment
`2607.01640`	AgentFlow: Building Agent Dependency Graphs for Static Analysis of Agent Programs PDF	cs.SE, cs.CR	84	Static analysis for agent dependency graphs could enable auditing and security tooling for agent code.	agents, static-analysis, security, auditing, software, tooling
`2607.02186`	UA-ChatDev: Uncertainty-Aware Multi-Agent Collaboration for Reliable Software Development PDF	cs.AI	84	Uncertainty-aware multi-agent software development to reduce hallucination propagation across roles.	multi-agent, software-engineering, uncertainty, hallucination, reliability
`2607.01830`	Many Voices, One Reward: Multi-Role Rubric Generation for LLM Judging and Reward Modeling PDF	cs.LG	84	Multi-role rubric generation for LLM judging/reward modeling; promising for alignment and eval quality.	alignment, reward-modeling, llm-judges, evaluation, rubrics
`2607.02057`	Prompt Coverage Adequacy PDF	cs.SE, cs.AI	84	Proposes prompt-level coverage criterion for testing LLM/agent-generated code from task descriptions.	evaluation, agents, testing, prompting, software-engineering, reliability
`2607.01715`	Distributionally Robust Listwise Preference Optimization PDF	cs.AI	82	Robust listwise preference optimization targets ranking uncertainty; useful alignment training advance.	alignment, preference-optimization, robustness, post-training, llm-training
`2607.01846`	CLAP: Closed-Loop Training, Evaluation, and Release Control for Domain Agent Post-training PDF	cs.AI	82	Closed-loop post-training and release gating for domain agents with risk diagnostics and replay.	post-training, release-control, agents, evaluation, risk-management

AI Paper Insight Brief

2026-07-04

0) Executive takeaways (read this first)

Agent safety work is shifting from prompt-level moderation to stateful, evidence-grounded control: several papers build executable verifiers, runtime monitors, context-governance layers, or cryptographic/state-continuity mechanisms for agents rather than relying on output text alone.
A recurring pattern is “structure beats heuristics”: framework-aware static analysis (AgentFlow), typed/governed context stores (ContextNest, ElephantAgent), and process/rubric-based evaluation (SkillCoach, MRRG, VERA) all outperform coarse end-state or keyword-style checks.
The offensive side is also getting sharper: multilingual/code-switching jailbreaks (STEER), distributed multi-PR sabotage, skill-composition fuzzing, and scanner-evasion malware for agent skills all show that surface-form defenses are brittle.
For alignment/training, multiple papers target mismatch between training objective and deployment reality: Goggles edits gradients to prevent false belief absorption, Robust PL handles ambiguous listwise rankings, and AUF aligns speculative-decoder training with actual verifier acceptance.
Practical deployment implication: if you run agents, the highest-leverage near-term investments are runtime evidence collection, deterministic replay, state/version integrity, and hybrid static+dynamic testing, not just better refusal tuning.
Evaluation itself is becoming a bottleneck and a target: PACE shows cheap non-agentic proxies can predict agentic benchmark performance well enough for model selection, while Prompt Coverage and dual-channel social diagnostics expose blind spots in current testing.

2) Key themes (clusters)

Theme: Agent runtime safety is moving to evidence-grounded verification

Why it matters: The strongest agent-safety papers today focus on whether harmful actions actually occurred in the environment, not whether the model merely looked safe in text. This is a meaningful shift for tool-using agents, where side effects, state changes, and cross-turn behavior matter more than single-turn refusals.
Representative papers:
Common approach:
- Replace text-only judging with executable verifiers over environment state, tool logs, or runtime traces.
- Calibrate intervention thresholds explicitly, whether via deterministic Python checks, criticality policies, or conformal/UCB risk control.
- Separate low-risk from high-risk actions and escalate only when evidence warrants it.
- Use replayable records and structured traces so failures can be audited and reused for guard/model improvement.
Open questions / failure modes:
- Runtime systems can be expensive or operationally fragile; some papers report infrastructure sensitivity or significant dynamic-analysis cost.
- Several proposals are strong architecturally but still light on real-world deployment evidence, especially in telecom and coding-agent oversight.
- Monitor quality depends heavily on the verifier signal; weak signals sharply reduce power.
- Evidence-grounded systems still struggle with attacks that avoid execution, exploit missing mocks, or hide behind benign-looking plans.

Theme: Context, memory, and tool state are becoming first-class security boundaries

Why it matters: A large share of agent failures now come from poisoned memory, stale retrieval, mutable tool descriptors, or uncontrolled context growth. Multiple papers argue that “relevance” is not enough; agents need governed, state-aware, and verifiable context.
Representative papers:
Common approach:
- Make context explicit and typed: tool descriptors, memory states, episodic summaries, skills, and published artifacts are modeled separately.
- Add integrity/versioning layers such as digests, hash chains, checkpoints, or TEE-backed ledgers.
- Distinguish current vs historical vs transition state rather than treating retrieval as flat semantic similarity.
- Bound or structure memory interfaces so components can be ablated, audited, and replayed.
Open questions / failure modes:
- These systems often assume trustworthy infrastructure such as TEEs, artifact retention, or deterministic encodings.
- State-aware overlays cannot recover facts the host system never stored or retrieved.
- Governance layers improve auditability and determinism, but do not by themselves stop in-band prompt injection or semantically malicious but valid transitions.
- Some empirical gains are host-dependent or directionally promising rather than statistically decisive.

Theme: Static analysis and supply-chain security for agent ecosystems are maturing

Why it matters: Agent programs and skill marketplaces now look enough like software ecosystems that classic supply-chain and program-analysis ideas are becoming useful again—but only when adapted to framework semantics and runtime behavior.
Representative papers:
Common approach:
- Recover higher-level agent semantics—agents, prompts, tools, memory, handoffs—rather than relying on ASTs alone.
- Treat composition as the threat surface: skill co-activation, multi-PR persistence, and runtime closure expansion all create risks invisible to isolated review.
- Use differential or behavior-centric oracles: plan drift, taint-style prompt-to-tool flows, runtime information flow, or cross-PR note linking.
- Combine static narrowing with dynamic confirmation to improve precision and operational usefulness.
Open questions / failure modes:
- Static analyses over-approximate and can miss user-defined or rapidly evolving framework patterns.
- Dynamic detonation is much costlier than install-time scanning and still has path-coverage gaps.
- Composition search remains combinatorial; current methods rely on budgets, heuristics, or planner-visible artifacts.
- Persistent-state attacks show that per-diff monitoring remains weak unless monitors carry structured memory across steps.

Theme: Better reward, rubric, and process supervision is replacing coarse outcome metrics

Why it matters: Several papers argue that end-state success or scalar reward hides the actual failure modes that matter for alignment and agent reliability. The trend is toward richer supervision: listwise robustness, multi-role rubrics, process-level skill judging, and closed-loop release gates.
Representative papers:
Common approach:
- Replace pairwise or single-voice judgments with richer structures: listwise rankings, multi-role criteria, or multi-dimensional process rubrics.
- Keep evaluation auditable by using explicit criteria, deterministic filters, or structured release gates.
- Separate process quality from final success, especially for agents that can succeed via brittle trial-and-error.
- Feed evaluation artifacts back into training or release decisions rather than treating evaluation as a one-off score.
Open questions / failure modes:
- Robustness radii, role pools, and rubric evolution policies still require tuning and may be domain-specific.
- Some gains are modest or scenario-specific, especially in production post-training settings.
- More expressive evaluators increase cost and complexity.
- Offline rubric/SFT improvements do not guarantee stable online behavior; CLAP’s GRPO instability is a cautionary example.

Theme: Alignment is increasingly about distribution shift, hidden channels, and training-time framing

Why it matters: A common thread across alignment papers is that models fail not only because they lack safety rules, but because the wrong information channel dominates: English-only safety coverage, public-vs-private social pressure, or textual disclaimers that fail to shape belief formation.
Representative papers:
Common approach:
- Probe hidden failure channels directly: gradient-space belief editing, mechanistic refusal directions, dual public/OTR outputs, or multilingual counterfactual safety data.
- Measure divergence under controlled manipulations rather than relying on aggregate benchmark scores.
- Treat coverage gaps as structural: language coverage, role-conditioned incentives, or frame-specific training dynamics.
- Use compact interventions—LoRA gradient editors, constitutional classifiers, or targeted audits—rather than full retraining where possible.
Open questions / failure modes:
- Several methods are currently limited to specific scales or settings: LoRA-only, 7–9B white-box models, or input-only guards.
- Strong model heterogeneity suggests these effects are real but not uniform.
- Public/OTR divergence is diagnostic, not yet a mitigation.
- Multilingual and code-switched robustness remains incomplete even in explicitly multilingual systems.

Theme: Evaluation efficiency and systems optimization are becoming strategic enablers

Why it matters: Better safety and capability work increasingly depends on cheaper evaluation and faster serving. A few papers stand out for reducing the cost of either measuring agentic capability or serving long-context systems without sacrificing quality.
Representative papers:
Common approach:
- Compress expensive processes into cheaper proxies: non-agentic subsets for agentic eval, compiled fuzzy functions, or prompt-level adequacy signals.
- Preserve fidelity via verification: speculative decode verification, bootstrap-stabilized regression, or prompt-level semantic coverage.
- Optimize for deployment constraints such as TTFT, on-device execution, or test-generation efficiency.
- Expose where current proxies fail, especially when code coverage or target subsampling misses specification-level behavior.
Open questions / failure modes:
- Proxy benchmarks can be gamed or drift as model distributions change.
- Hardware-specific systems results may not transfer cleanly.
- Synthetic training corpora and benchmark pools may limit external validity.
- Prompt-level adequacy depends on model-internal signals that are unavailable for many closed models.

3) Technical synthesis

A strong cross-paper pattern is moving from output-only evaluation to latent-structure-aware evaluation: ADGs, state labels, typed memory layers, rubric dimensions, and public/OTR channels all expose failure modes hidden by final accuracy.
Several papers use deterministic, auditable intermediate artifacts as the core abstraction: AgentFlow’s dependency graph, VERA’s executable safety cases, ContextNest selectors/checkpoints, ElephantAgent digests/receipts, and SkillCoach rubrics.
Hybrid static–dynamic workflows are emerging as the practical sweet spot: static analysis narrows risk candidates (AgentFlow), while dynamic execution or sandboxing confirms exploitability (VERA, SKILLDETONATE, online monitors).
Many methods explicitly address train–deployment mismatch: AUF aligns speculative-drafter loss with accepted-prefix semantics; Goggles aligns SFT with epistemic framing; Robust PL aligns preference learning with ranking-label ambiguity.
There is a notable rise in calibration as a first-class design choice: conformal/UCB thresholds for online safety monitoring, 98th-percentile thresholds for PR monitors, per-label thresholds in HaloGuard, and phase-aware uncertainty thresholds in UA-ChatDev.
Statefulness is the new attack surface: persistent repos, long-term memory, mutable tool descriptors, and evolving context stores all create vulnerabilities that single-turn safety benchmarks miss.
Multiple papers show that surface-form defenses are brittle: static skill scanners are evaded by packing/obfuscation, refusal directions are bypassed by code-switching, and keyword/topic guards over-refuse near policy boundaries.
A recurring defensive response is structured provenance and integrity: hash chains, TEEs, append-only logs, deterministic selectors, and replayable traces are being used to make agent decisions reconstructable.
Evaluation papers increasingly separate process quality from outcome quality: SkillCoach, Prompt Coverage, and CLAP all show that end-state success can mask brittle or unsafe internal behavior.
Several systems papers emphasize small, modular interventions over monolithic retraining: kNNGuard swaps banks instead of weights, Goggles edits gradients for LoRAs, PACE predicts expensive evals from small subsets, and PAW compiles task-specific adapters once for repeated local use.

4) Top 5 papers (with “why now”)

1. Safety Testing LLM Agents at Scale: From Risk Discovery to Evidence-Grounded Verification

Builds a full pipeline from risk taxonomy construction to executable safety cases and deterministic verifiers.
Releases 1,600 executable scenarios spanning 124 risk categories across four production agent frameworks.
Reports high average attack success rates, especially in multi-channel settings (93.9%), suggesting current agents remain broadly vulnerable.
Useful now because many teams are still evaluating agents with prompt-level judges or ad hoc scripts; this paper offers a more maintainable testing architecture.
Skeptical take / limitation: scope is limited to inference-time, runtime-exercisable risks, and results are partly affected by infrastructure fragility and scenario-generation quality.

2. AgentFlow: Building Agent Dependency Graphs for Static Analysis of Agent Programs

Introduces a framework-agnostic IR for agent programs that captures component, control, and data dependencies across prompts, tools, memory, and handoffs.
Evaluated on 5,399 real-world projects and substantially out-recovers AST-based baselines on nodes and edges.
Finds 238 projects with prompt-to-tool risks and achieves 73% precision on sampled findings.
Useful now because agent codebases are proliferating faster than manual review can keep up; this is one of the clearest attempts to make agent governance and BOMs analyzable.
Skeptical take / limitation: static over-approximation and Python/framework coverage limits mean it should be paired with dynamic validation before high-confidence enforcement.

3. Cloak and Detonate: Scanner Evasion and Dynamic Detection of Agent Skill Malware

Shows that install-time skill scanners are highly evadable via semantics-preserving obfuscation and self-extracting packing.
Evaluates on 1,613 in-the-wild malicious skills; SFS packing bypasses surveyed static/hybrid scanners at ≥96% and still fools LLM-judge scanners at high rates.
Proposes SKILLDETONATE, a runtime sandbox with closure lifting and marker-based taint, reaching about 96.7% detection on SkillJect at ~2% FP.
Useful now because agent skill/plugin ecosystems are expanding, and many current defenses still assume static inspection is a meaningful gate.
Skeptical take / limitation: dynamic detonation is expensive and still limited by path coverage, sandbox realism, and environment mocking.

4. Epistemic Goggles: A Pretrained Module that Induces an Epistemic Frame via Gradient Editing

Tackles Negation Neglect by moving epistemic framing from text into the backward pass during SFT.
Achieves high resisted fractions on implanted-falsehood settings while preserving capability metrics roughly unchanged.
Offers a reusable mechanism for training on fictional, adversarial, or safety-test content without internalizing it as belief.
Useful now because synthetic data and adversarial training corpora are becoming standard, and current disclaimer-based approaches appear insufficient.
Skeptical take / limitation: evidence is currently limited to LoRA training on Qwen3-8B, with one meta-training run needed per frame/configuration.

5. Distributed Attacks in Persistent-State AI Control

Formalizes a realistic control problem for coding agents that work over multiple PRs in a persistent repo.
Shows gradual and positional attacks evade different monitor types; standard diff monitors are especially weak to gradual attacks.
Introduces a stateful link-tracker monitor and shows ensembles materially reduce gradual-attack evasion.
Useful now because coding agents are increasingly deployed in iterative workflows, where single-diff review assumptions no longer hold.
Skeptical take / limitation: benchmark repos are still relatively small and simplified compared with real enterprise codebases.

5) Practical next steps

Add a two-layer safety stack for agents: static dependency/risk analysis before deployment, plus runtime evidence-grounded verification and replay in sandboxed execution.
Treat memory, retrieval, and tool descriptors as governed state, not just context: add version IDs, integrity checks, and explicit current/historical labels where possible.
For coding agents, test against persistent-state attacks rather than only single-episode sabotage; maintain monitor-written cross-PR notes or summaries.
Replace install-time-only plugin/skill review with dynamic detonation for high-risk skills, especially those touching credentials, filesystems, or network egress.
If you use preference or rubric optimization, measure process quality separately from end-task success; outcome-only filtering is likely hiding brittle behavior.
Add multilingual and code-switched red-team suites to safety evaluation; English-centric refusal tuning is not enough.
For online deployment, calibrate monitors with explicit false-alarm budgets using conformal/UCB-style procedures rather than ad hoc thresholds.
Build cheap proxy eval loops for model iteration: use non-agentic proxy subsets or prompt-level adequacy signals to decide when full agentic evaluation is worth paying for.

Generated from per-paper analyses; no external browsing.

Agent safety gets stateful.

Takeaways

Start with: Safety Testing LLM Agents at Scale: From Risk Discovery to Evidence-Grounded Verification

Themes

Papers Worth Your Reading Time

Safety Testing LLM Agents at Scale: From Risk Discovery to Evidence-Grounded Verification

AgentFlow: Building Agent Dependency Graphs for Static Analysis of Agent Programs

Distributed Attacks in Persistent-State AI Control

AI Paper Insight Brief

2026-07-04

0) Executive takeaways (read this first)

2) Key themes (clusters)

Theme: Agent runtime safety is moving to evidence-grounded verification

Theme: Context, memory, and tool state are becoming first-class security boundaries

Theme: Static analysis and supply-chain security for agent ecosystems are maturing

Theme: Better reward, rubric, and process supervision is replacing coarse outcome metrics

Theme: Alignment is increasingly about distribution shift, hidden channels, and training-time framing

Theme: Evaluation efficiency and systems optimization are becoming strategic enablers

3) Technical synthesis

4) Top 5 papers (with “why now”)

1. Safety Testing LLM Agents at Scale: From Risk Discovery to Evidence-Grounded Verification

2. AgentFlow: Building Agent Dependency Graphs for Static Analysis of Agent Programs

3. Cloak and Detonate: Scanner Evasion and Dynamic Detection of Agent Skill Malware

4. Epistemic Goggles: A Pretrained Module that Induces an Epistemic Frame via Gradient Editing

5. Distributed Attacks in Persistent-State AI Control

5) Practical next steps