July 4, 2026 Research Brief
Agent safety gets stateful.
Today’s strongest papers move agent safety beyond refusals toward evidence-grounded verification, governed state, and hybrid static-plus-runtime defenses as attacks exploit persistence, composition, and multilingual gaps.
Takeaways
- Agent safety work is shifting from prompt-level moderation to **stateful, evidence-grounded control**: several papers build executable verifiers, runtime monitors, context-governance layers, or cryptographic/state-continuity mechanisms for agents rather than relying on output text alone.
- A recurring pattern is **“structure beats heuristics”**: framework-aware static analysis (AgentFlow), typed/governed context stores (ContextNest, ElephantAgent), and process/rubric-based evaluation (SkillCoach, MRRG, VERA) all outperform coarse end-state or keyword-style checks.
- The offensive side is also getting sharper: multilingual/code-switching jailbreaks (STEER), distributed multi-PR sabotage, skill-composition fuzzing, and scanner-evasion malware for agent skills all show that **surface-form defenses are brittle**.
Start with: Safety Testing LLM Agents at Scale: From Risk Discovery to Evidence-Grounded Verification
Why it catches my eye: It offers a reusable evaluation architecture for tool-using agents with executable safety cases and deterministic verification.
Read skeptically for: Coverage is limited to runtime-exercisable risks, and results depend on scenario generation and infrastructure quality.
Themes
Papers Worth Your Reading Time
Ranked for research usefulness: novelty, method pattern, evidence quality, and skepticism value.
Safety Testing LLM Agents at Scale: From Risk Discovery to Evidence-Grounded Verification
#1A strong first read for anyone building agent evals because it turns safety testing into executable, auditable verification.
- Why now
- Teams deploying tool-using agents need something stronger than prompt judges and ad hoc red-teaming.
- Skepticism
- It mainly covers inference-time risks that can be exercised in scenarios, not the full deployment stack.
AgentFlow: Building Agent Dependency Graphs for Static Analysis of Agent Programs
#2Useful companion to runtime testing because it makes prompts, tools, memory, and handoffs statically analyzable at codebase scale.
- Why now
- Agent codebases are growing faster than manual review, creating demand for software-style governance and auditing.
- Skepticism
- Static analysis over-approximates and has framework and language coverage limits, so findings need dynamic confirmation.
Distributed Attacks in Persistent-State AI Control
#3It sharpens the threat model for coding agents by showing how gradual, stateful sabotage evades per-diff monitoring.
- Why now
- Coding agents are increasingly used across iterative repository workflows, where persistence matters more than single-turn safety.
- Skepticism
- The benchmark setup is still simpler and smaller than many real enterprise repositories.
Chinese version: [中文]
Run stats
- Candidates: 284
- Selected: 30
- Deepread completed: 30
- Window (UTC): 2026-07-02T00:00:00Z → 2026-07-03T00:00:00Z (arxiv_announce, expanded=0)
Show selected papers
| arXiv ID | Title / Links | Categories | Score | Why | Tags |
|---|---|---|---|---|---|
2607.02514 | Distributed Attacks in Persistent-State AI Control | cs.AI | 96 | Directly studies persistent-codebase attacks by coding agents; highly relevant AI control benchmark. | agent-safety, ai-control, coding-agents, prompt-injection, benchmark, security |
2607.02389 | Steerability via constraints: a substrate for scalable oversight of coding agents | cs.AI, cs.CR, cs.SE | 95 | Constraint-based oversight for coding agents; strong security framing and large backdoor-detection gain. | agent-safety, coding-agents, oversight, security, backdoor-detection |
2607.02345 | SkillFuzz: Fuzzing Skill Composition for Implicit Intents Discovery in Open Skill Marketplaces | cs.SE, cs.AI, cs.CL | 95 | Fuzzes composed agent skills to find hidden malicious intents; highly relevant to agent security. | agent-safety, security, fuzzing, skills, implicit-intents, evaluation |
2607.01793 | Safety Testing LLM Agents at Scale: From Risk Discovery to Evidence-Grounded Verification | cs.AI | 94 | Automated, scalable safety testing for tool-using agents with risk discovery and verification pipeline. | agent-safety, evaluation, tool-use, verification, benchmark, red-teaming |
2607.02510 | Online Safety Monitoring for LLMs | cs.AI, cs.CL, cs.LG, stat.AP, stat.ML | 93 | Practical online safety monitor with risk-controlled alarms; directly relevant to deployment-time LLM safety. | llm-safety, monitoring, risk-control, deployment, red-teaming |
2607.02072 | kNNGuard: Turning LLM Hidden Activations into a Training-Free Configurable Guardrail | cs.LG, cs.AI, cs.CR | 93 | Training-free LLM guardrail using hidden activations; strong safety relevance and concrete speed/F1 gains. | llm-safety, guardrails, adversarial-prompts, activation-space, training-free |
2607.02079 | HaloGuard 1.0: An Open Weights Constitutional Classifier for Multilingual AI Safety | cs.CL, cs.CR, cs.LG | 92 | Open multilingual safety classifier with strong benchmark claims and practical guardrail utility. | guardrails, multilingual, classifier, safety, open-weights, benchmark |
2607.02507 | What LLM Agents Say When No One Is Watching: Social Structure and Latent Objective Emergence in Multi-Agent Debates | cs.AI, cs.CL, cs.LG, cs.MA | 91 | Shows latent objective emergence and public/private divergence in multi-agent debates across 10 models. | multi-agent, alignment, evaluation, social-structure, deception-risk |
2607.02210 | Criticality-Based Guard Rail Validation for AI Agent Decisions in Autonomous Telecom Networks | cs.AI, cs.NI | 91 | Runtime guardrail validation for autonomous agents with criticality scoring before execution. | agent-safety, guardrails, runtime-monitoring, autonomy, telecom, risk-control |
2607.02513 | LACUNA: A Testbed for Evaluating Localization Precision for LLM Unlearning | cs.CL, cs.AI, cs.LG | 90 | First parameter-level unlearning localization testbed; strong relevance to privacy and reliability. | unlearning, privacy, pii, evaluation, llm-safety, benchmark |
2607.01874 | SkillCoach: Self-Evolving Rubrics for Evaluating and Enhancing Agentic Skill-Use | cs.AI, cs.CL | 90 | Targets agent skill-use failures with process rubrics; useful for evaluating and improving agent reliability. | agents, evaluation, rubrics, reliability, skill-use |
2607.01919 | ElephantAgent: Contextual State Continuity in Agentic Systems | cs.AI, cs.CR | 89 | Defends agent memory/tool poisoning via contextual state continuity; novel agent security angle. | agent-safety, memory-poisoning, tool-poisoning, protocols, security, agents |
2607.01940 | Conditional Co-Ablation: Recovering Self-Repair Backups in Transformer Circuits | cs.LG, cs.AI | 89 | Interpretability method for recovering self-repair backups in transformer circuits; useful for auditing. | interpretability, mechanistic-interpretability, transformers, auditing, reliability |
2607.02357 | Cloak and Detonate: Scanner Evasion and Dynamic Detection of Agent Skill Malware | cs.CR, cs.SE | 88 | Adaptive evasion study for malicious agent skills plus dynamic detection; timely supply-chain risk. | agent-safety, malware, supply-chain, evasion, detection, coding-agents |
2607.02116 | ContextNest: Verifiable Context Governance for Autonomous AI Agent | cs.AI | 88 | Verifiable context governance for agents/RAG with provenance, integrity, and point-in-time reconstruction. | rag, provenance, governance, agents, integrity |
2607.02032 | PACE: A Proxy for Agentic Capability Evaluation | cs.AI, cs.CL | 88 | Cheap proxy for costly agent benchmarks could speed agent eval cycles and model selection. | agents, evaluation, benchmarks, proxy-metrics, swe-bench |
2607.02512 | Program-as-Weights: A Programming Paradigm for Fuzzy Functions | cs.LG, cs.AI, cs.CL | 88 | Compiles NL specs into compact neural programs; strong efficiency idea with released 10M-example dataset. | llm, program-synthesis, efficiency, post-training, dataset, local-models |
2607.02121 | Behind the Refusal: Determining Guardrail Activation via Behavioral Monitoring | cs.CR, cs.AI | 87 | Black-box method to distinguish guardrail blocks from model refusals; useful for AI security auditing. | guardrails, auditing, black-box, jailbreaks, security, monitoring |
2607.01690 | Epistemic Goggles: A Pretrained Module that Induces an Epistemic Frame via Gradient Editing | cs.AI, cs.CL, cs.LG | 87 | Addresses negation neglect via gradient editing to induce epistemic framing during finetuning. | alignment, factuality, finetuning, epistemics, reliability |
2607.01831 | Lynx: Progressive Speculative Quantization for accelerating KV Transfer in Long-Context Inference | cs.DC, cs.LG | 87 | Long-context serving advance: speculative KV quantization cuts transfer bottlenecks in agentic/RAG inference. | llm-systems, long-context, inference, kv-cache, quantization, rag |
2607.01859 | Safety Targeted Embedding Exploit via Refinement | cs.AI, cs.CL | 86 | Shows multilingual/code-switch jailbreak weakness with high ASR; concrete attack on safety generalization. | jailbreaks, multilingual, alignment, robustness, adversarial, llm-safety |
2607.01935 | A-TMA: Decoupling State-Aware Memory Failures in Long-Term Agent Memory | cs.AI | 86 | Addresses long-term agent memory state errors; important for persistent assistants and reliability. | agents, memory, reliability, long-context, state-tracking |
2607.02255 | AgenticSTS: A Bounded-Memory Testbed for Long-Horizon LLM Agents | cs.AI, cs.CL | 85 | Bounded-memory testbed for long-horizon agents; useful for isolating memory effects and evaluation. | agents, long-context, memory, benchmark, evaluation |
2607.01893 | Spec-AUF: Accept-Until-Fail Training under Train-Inference Misalignment for Masked Block Drafters | cs.AI, cs.CL | 85 | Training objective better aligned with speculative decoding acceptance behavior; practical LLM inference gain. | llm, speculative-decoding, training-objectives, inference-efficiency, alignment |
2607.01640 | AgentFlow: Building Agent Dependency Graphs for Static Analysis of Agent Programs | cs.SE, cs.CR | 84 | Static analysis for agent dependency graphs could enable auditing and security tooling for agent code. | agents, static-analysis, security, auditing, software, tooling |
2607.02186 | UA-ChatDev: Uncertainty-Aware Multi-Agent Collaboration for Reliable Software Development | cs.AI | 84 | Uncertainty-aware multi-agent software development to reduce hallucination propagation across roles. | multi-agent, software-engineering, uncertainty, hallucination, reliability |
2607.01830 | Many Voices, One Reward: Multi-Role Rubric Generation for LLM Judging and Reward Modeling | cs.LG | 84 | Multi-role rubric generation for LLM judging/reward modeling; promising for alignment and eval quality. | alignment, reward-modeling, llm-judges, evaluation, rubrics |
2607.02057 | Prompt Coverage Adequacy | cs.SE, cs.AI | 84 | Proposes prompt-level coverage criterion for testing LLM/agent-generated code from task descriptions. | evaluation, agents, testing, prompting, software-engineering, reliability |
2607.01715 | Distributionally Robust Listwise Preference Optimization | cs.AI | 82 | Robust listwise preference optimization targets ranking uncertainty; useful alignment training advance. | alignment, preference-optimization, robustness, post-training, llm-training |
2607.01846 | CLAP: Closed-Loop Training, Evaluation, and Release Control for Domain Agent Post-training | cs.AI | 82 | Closed-loop post-training and release gating for domain agents with risk diagnostics and replay. | post-training, release-control, agents, evaluation, risk-management |
AI Paper Insight Brief
2026-07-04
0) Executive takeaways (read this first)
- Agent safety work is shifting from prompt-level moderation to stateful, evidence-grounded control: several papers build executable verifiers, runtime monitors, context-governance layers, or cryptographic/state-continuity mechanisms for agents rather than relying on output text alone.
- A recurring pattern is “structure beats heuristics”: framework-aware static analysis (AgentFlow), typed/governed context stores (ContextNest, ElephantAgent), and process/rubric-based evaluation (SkillCoach, MRRG, VERA) all outperform coarse end-state or keyword-style checks.
- The offensive side is also getting sharper: multilingual/code-switching jailbreaks (STEER), distributed multi-PR sabotage, skill-composition fuzzing, and scanner-evasion malware for agent skills all show that surface-form defenses are brittle.
- For alignment/training, multiple papers target mismatch between training objective and deployment reality: Goggles edits gradients to prevent false belief absorption, Robust PL handles ambiguous listwise rankings, and AUF aligns speculative-decoder training with actual verifier acceptance.
- Practical deployment implication: if you run agents, the highest-leverage near-term investments are runtime evidence collection, deterministic replay, state/version integrity, and hybrid static+dynamic testing, not just better refusal tuning.
- Evaluation itself is becoming a bottleneck and a target: PACE shows cheap non-agentic proxies can predict agentic benchmark performance well enough for model selection, while Prompt Coverage and dual-channel social diagnostics expose blind spots in current testing.
2) Key themes (clusters)
Theme: Agent runtime safety is moving to evidence-grounded verification
- Why it matters: The strongest agent-safety papers today focus on whether harmful actions actually occurred in the environment, not whether the model merely looked safe in text. This is a meaningful shift for tool-using agents, where side effects, state changes, and cross-turn behavior matter more than single-turn refusals.
- Representative papers:
- Common approach:
- Replace text-only judging with executable verifiers over environment state, tool logs, or runtime traces.
- Calibrate intervention thresholds explicitly, whether via deterministic Python checks, criticality policies, or conformal/UCB risk control.
- Separate low-risk from high-risk actions and escalate only when evidence warrants it.
- Use replayable records and structured traces so failures can be audited and reused for guard/model improvement.
- Open questions / failure modes:
- Runtime systems can be expensive or operationally fragile; some papers report infrastructure sensitivity or significant dynamic-analysis cost.
- Several proposals are strong architecturally but still light on real-world deployment evidence, especially in telecom and coding-agent oversight.
- Monitor quality depends heavily on the verifier signal; weak signals sharply reduce power.
- Evidence-grounded systems still struggle with attacks that avoid execution, exploit missing mocks, or hide behind benign-looking plans.
Theme: Context, memory, and tool state are becoming first-class security boundaries
- Why it matters: A large share of agent failures now come from poisoned memory, stale retrieval, mutable tool descriptors, or uncontrolled context growth. Multiple papers argue that “relevance” is not enough; agents need governed, state-aware, and verifiable context.
- Representative papers:
- Common approach:
- Make context explicit and typed: tool descriptors, memory states, episodic summaries, skills, and published artifacts are modeled separately.
- Add integrity/versioning layers such as digests, hash chains, checkpoints, or TEE-backed ledgers.
- Distinguish current vs historical vs transition state rather than treating retrieval as flat semantic similarity.
- Bound or structure memory interfaces so components can be ablated, audited, and replayed.
- Open questions / failure modes:
- These systems often assume trustworthy infrastructure such as TEEs, artifact retention, or deterministic encodings.
- State-aware overlays cannot recover facts the host system never stored or retrieved.
- Governance layers improve auditability and determinism, but do not by themselves stop in-band prompt injection or semantically malicious but valid transitions.
- Some empirical gains are host-dependent or directionally promising rather than statistically decisive.
Theme: Static analysis and supply-chain security for agent ecosystems are maturing
- Why it matters: Agent programs and skill marketplaces now look enough like software ecosystems that classic supply-chain and program-analysis ideas are becoming useful again—but only when adapted to framework semantics and runtime behavior.
- Representative papers:
- AgentFlow: Building Agent Dependency Graphs for Static Analysis of Agent Programs
- SkillFuzz: Fuzzing Skill Composition for Implicit Intents Discovery in Open Skill Marketplaces
- Cloak and Detonate: Scanner Evasion and Dynamic Detection of Agent Skill Malware
- Distributed Attacks in Persistent-State AI Control
- Common approach:
- Recover higher-level agent semantics—agents, prompts, tools, memory, handoffs—rather than relying on ASTs alone.
- Treat composition as the threat surface: skill co-activation, multi-PR persistence, and runtime closure expansion all create risks invisible to isolated review.
- Use differential or behavior-centric oracles: plan drift, taint-style prompt-to-tool flows, runtime information flow, or cross-PR note linking.
- Combine static narrowing with dynamic confirmation to improve precision and operational usefulness.
- Open questions / failure modes:
- Static analyses over-approximate and can miss user-defined or rapidly evolving framework patterns.
- Dynamic detonation is much costlier than install-time scanning and still has path-coverage gaps.
- Composition search remains combinatorial; current methods rely on budgets, heuristics, or planner-visible artifacts.
- Persistent-state attacks show that per-diff monitoring remains weak unless monitors carry structured memory across steps.
Theme: Better reward, rubric, and process supervision is replacing coarse outcome metrics
- Why it matters: Several papers argue that end-state success or scalar reward hides the actual failure modes that matter for alignment and agent reliability. The trend is toward richer supervision: listwise robustness, multi-role rubrics, process-level skill judging, and closed-loop release gates.
- Representative papers:
- Distributionally Robust Listwise Preference Optimization
- Many Voices, One Reward: Multi-Role Rubric Generation for LLM Judging and Reward Modeling
- SkillCoach: Self-Evolving Rubrics for Evaluating and Enhancing Agentic Skill-Use
- CLAP: Closed-Loop Training, Evaluation, and Release Control for Domain Agent Post-training
- Common approach:
- Replace pairwise or single-voice judgments with richer structures: listwise rankings, multi-role criteria, or multi-dimensional process rubrics.
- Keep evaluation auditable by using explicit criteria, deterministic filters, or structured release gates.
- Separate process quality from final success, especially for agents that can succeed via brittle trial-and-error.
- Feed evaluation artifacts back into training or release decisions rather than treating evaluation as a one-off score.
- Open questions / failure modes:
- Robustness radii, role pools, and rubric evolution policies still require tuning and may be domain-specific.
- Some gains are modest or scenario-specific, especially in production post-training settings.
- More expressive evaluators increase cost and complexity.
- Offline rubric/SFT improvements do not guarantee stable online behavior; CLAP’s GRPO instability is a cautionary example.
Theme: Alignment is increasingly about distribution shift, hidden channels, and training-time framing
- Why it matters: A common thread across alignment papers is that models fail not only because they lack safety rules, but because the wrong information channel dominates: English-only safety coverage, public-vs-private social pressure, or textual disclaimers that fail to shape belief formation.
- Representative papers:
- Epistemic Goggles: A Pretrained Module that Induces an Epistemic Frame via Gradient Editing
- Safety Targeted Embedding Exploit via Refinement
- What LLM Agents Say When No One Is Watching: Social Structure and Latent Objective Emergence in Multi-Agent Debates
- HaloGuard 1.0: An Open Weights Constitutional Classifier for Multilingual AI Safety
- Common approach:
- Probe hidden failure channels directly: gradient-space belief editing, mechanistic refusal directions, dual public/OTR outputs, or multilingual counterfactual safety data.
- Measure divergence under controlled manipulations rather than relying on aggregate benchmark scores.
- Treat coverage gaps as structural: language coverage, role-conditioned incentives, or frame-specific training dynamics.
- Use compact interventions—LoRA gradient editors, constitutional classifiers, or targeted audits—rather than full retraining where possible.
- Open questions / failure modes:
- Several methods are currently limited to specific scales or settings: LoRA-only, 7–9B white-box models, or input-only guards.
- Strong model heterogeneity suggests these effects are real but not uniform.
- Public/OTR divergence is diagnostic, not yet a mitigation.
- Multilingual and code-switched robustness remains incomplete even in explicitly multilingual systems.
Theme: Evaluation efficiency and systems optimization are becoming strategic enablers
- Why it matters: Better safety and capability work increasingly depends on cheaper evaluation and faster serving. A few papers stand out for reducing the cost of either measuring agentic capability or serving long-context systems without sacrificing quality.
- Representative papers:
- Common approach:
- Compress expensive processes into cheaper proxies: non-agentic subsets for agentic eval, compiled fuzzy functions, or prompt-level adequacy signals.
- Preserve fidelity via verification: speculative decode verification, bootstrap-stabilized regression, or prompt-level semantic coverage.
- Optimize for deployment constraints such as TTFT, on-device execution, or test-generation efficiency.
- Expose where current proxies fail, especially when code coverage or target subsampling misses specification-level behavior.
- Open questions / failure modes:
- Proxy benchmarks can be gamed or drift as model distributions change.
- Hardware-specific systems results may not transfer cleanly.
- Synthetic training corpora and benchmark pools may limit external validity.
- Prompt-level adequacy depends on model-internal signals that are unavailable for many closed models.
3) Technical synthesis
- A strong cross-paper pattern is moving from output-only evaluation to latent-structure-aware evaluation: ADGs, state labels, typed memory layers, rubric dimensions, and public/OTR channels all expose failure modes hidden by final accuracy.
- Several papers use deterministic, auditable intermediate artifacts as the core abstraction: AgentFlow’s dependency graph, VERA’s executable safety cases, ContextNest selectors/checkpoints, ElephantAgent digests/receipts, and SkillCoach rubrics.
- Hybrid static–dynamic workflows are emerging as the practical sweet spot: static analysis narrows risk candidates (AgentFlow), while dynamic execution or sandboxing confirms exploitability (VERA, SKILLDETONATE, online monitors).
- Many methods explicitly address train–deployment mismatch: AUF aligns speculative-drafter loss with accepted-prefix semantics; Goggles aligns SFT with epistemic framing; Robust PL aligns preference learning with ranking-label ambiguity.
- There is a notable rise in calibration as a first-class design choice: conformal/UCB thresholds for online safety monitoring, 98th-percentile thresholds for PR monitors, per-label thresholds in HaloGuard, and phase-aware uncertainty thresholds in UA-ChatDev.
- Statefulness is the new attack surface: persistent repos, long-term memory, mutable tool descriptors, and evolving context stores all create vulnerabilities that single-turn safety benchmarks miss.
- Multiple papers show that surface-form defenses are brittle: static skill scanners are evaded by packing/obfuscation, refusal directions are bypassed by code-switching, and keyword/topic guards over-refuse near policy boundaries.
- A recurring defensive response is structured provenance and integrity: hash chains, TEEs, append-only logs, deterministic selectors, and replayable traces are being used to make agent decisions reconstructable.
- Evaluation papers increasingly separate process quality from outcome quality: SkillCoach, Prompt Coverage, and CLAP all show that end-state success can mask brittle or unsafe internal behavior.
- Several systems papers emphasize small, modular interventions over monolithic retraining: kNNGuard swaps banks instead of weights, Goggles edits gradients for LoRAs, PACE predicts expensive evals from small subsets, and PAW compiles task-specific adapters once for repeated local use.
4) Top 5 papers (with “why now”)
1. Safety Testing LLM Agents at Scale: From Risk Discovery to Evidence-Grounded Verification
- Builds a full pipeline from risk taxonomy construction to executable safety cases and deterministic verifiers.
- Releases 1,600 executable scenarios spanning 124 risk categories across four production agent frameworks.
- Reports high average attack success rates, especially in multi-channel settings (93.9%), suggesting current agents remain broadly vulnerable.
- Useful now because many teams are still evaluating agents with prompt-level judges or ad hoc scripts; this paper offers a more maintainable testing architecture.
- Skeptical take / limitation: scope is limited to inference-time, runtime-exercisable risks, and results are partly affected by infrastructure fragility and scenario-generation quality.
2. AgentFlow: Building Agent Dependency Graphs for Static Analysis of Agent Programs
- Introduces a framework-agnostic IR for agent programs that captures component, control, and data dependencies across prompts, tools, memory, and handoffs.
- Evaluated on 5,399 real-world projects and substantially out-recovers AST-based baselines on nodes and edges.
- Finds 238 projects with prompt-to-tool risks and achieves 73% precision on sampled findings.
- Useful now because agent codebases are proliferating faster than manual review can keep up; this is one of the clearest attempts to make agent governance and BOMs analyzable.
- Skeptical take / limitation: static over-approximation and Python/framework coverage limits mean it should be paired with dynamic validation before high-confidence enforcement.
3. Cloak and Detonate: Scanner Evasion and Dynamic Detection of Agent Skill Malware
- Shows that install-time skill scanners are highly evadable via semantics-preserving obfuscation and self-extracting packing.
- Evaluates on 1,613 in-the-wild malicious skills; SFS packing bypasses surveyed static/hybrid scanners at ≥96% and still fools LLM-judge scanners at high rates.
- Proposes SKILLDETONATE, a runtime sandbox with closure lifting and marker-based taint, reaching about 96.7% detection on SkillJect at ~2% FP.
- Useful now because agent skill/plugin ecosystems are expanding, and many current defenses still assume static inspection is a meaningful gate.
- Skeptical take / limitation: dynamic detonation is expensive and still limited by path coverage, sandbox realism, and environment mocking.
4. Epistemic Goggles: A Pretrained Module that Induces an Epistemic Frame via Gradient Editing
- Tackles Negation Neglect by moving epistemic framing from text into the backward pass during SFT.
- Achieves high resisted fractions on implanted-falsehood settings while preserving capability metrics roughly unchanged.
- Offers a reusable mechanism for training on fictional, adversarial, or safety-test content without internalizing it as belief.
- Useful now because synthetic data and adversarial training corpora are becoming standard, and current disclaimer-based approaches appear insufficient.
- Skeptical take / limitation: evidence is currently limited to LoRA training on Qwen3-8B, with one meta-training run needed per frame/configuration.
5. Distributed Attacks in Persistent-State AI Control
- Formalizes a realistic control problem for coding agents that work over multiple PRs in a persistent repo.
- Shows gradual and positional attacks evade different monitor types; standard diff monitors are especially weak to gradual attacks.
- Introduces a stateful link-tracker monitor and shows ensembles materially reduce gradual-attack evasion.
- Useful now because coding agents are increasingly deployed in iterative workflows, where single-diff review assumptions no longer hold.
- Skeptical take / limitation: benchmark repos are still relatively small and simplified compared with real enterprise codebases.
5) Practical next steps
- Add a two-layer safety stack for agents: static dependency/risk analysis before deployment, plus runtime evidence-grounded verification and replay in sandboxed execution.
- Treat memory, retrieval, and tool descriptors as governed state, not just context: add version IDs, integrity checks, and explicit current/historical labels where possible.
- For coding agents, test against persistent-state attacks rather than only single-episode sabotage; maintain monitor-written cross-PR notes or summaries.
- Replace install-time-only plugin/skill review with dynamic detonation for high-risk skills, especially those touching credentials, filesystems, or network egress.
- If you use preference or rubric optimization, measure process quality separately from end-task success; outcome-only filtering is likely hiding brittle behavior.
- Add multilingual and code-switched red-team suites to safety evaluation; English-centric refusal tuning is not enough.
- For online deployment, calibrate monitors with explicit false-alarm budgets using conformal/UCB-style procedures rather than ad hoc thresholds.
- Build cheap proxy eval loops for model iteration: use non-agentic proxy subsets or prompt-level adequacy signals to decide when full agentic evaluation is worth paying for.
Generated from per-paper analyses; no external browsing.