July 3, 2026 Research Brief

Agent safety moves to runtime.

Today’s papers shift AI safety from prompt hardening to runtime control and behavioral audits, as realistic agent attacks and broken evaluation proxies expose weaknesses in deployed workflows.

Takeaways

  1. Agent security is shifting from prompt-only threats to **workflow and infrastructure threats**: today’s strongest papers show practical attacks on mobile agents, function-calling systems, and agentic RAG by exploiting screenshots, tool traces, validation loops, and public reasoning signals rather than just user prompts.
  2. Several papers argue that **current evaluation proxies are misleading**: perplexity/NLL for test-time training, CLIP/FID for T2I safety, aggregate pass/fail for pragmatic safety, and benchmark leaderboards for coding/perf agents can all overstate real capability or safety.
  3. A recurring design pattern is **runtime governance over static alignment**: gear-based action gating, object-level context garbage collection, task-state wrappers, budgeted DB sessions, and uncertainty propagation all add control at execution time rather than trusting the base model.
#1

Start with: Beyond Perplexity: A Behavioral Evaluation Framework for Deployment-Memory Claims in LLM Test-Time Training

Why it catches my eye: It gives a reusable evaluation ladder for testing deployment-memory claims and shows proxy gains can completely miss behavioral recall.

Read skeptically for: The core negative result is centered on one-step LoRA and Qwen3, so generality across memory methods remains open.

Tags: evaluation, memory, behavior, reliability

Themes

Agent attack surfaces are moving below the prompt
The most practical attacks in this batch do not rely on clever wording alone. They exploit the execution substrate around agents (screenshots, tool schemas, public traces, retrieval chains, and host-side channels), where many deployed systems still assume trusted context.

Runtime governance is becoming the practical safety layer
Multiple papers converge on the idea that static alignment is insufficient once agents act over long horizons or in physical/data systems. Safety is increasingly being implemented as runtime control over authority, context, and execution budgets.

Evaluation proxies are breaking under deployment claims
Several papers show that standard metrics can support claims they do not actually justify. This is especially important for memory, safety alignment, and benchmark-driven progress reporting.

Signal
Prompt security is no longer enough. Mobile agents, function-calling systems, and agentic RAG are attacked through screenshots, tool traces, and retrieval channels rather than prompts alone.

Tension
Proxy metrics keep overstating safety. Papers on test-time memory, text-to-image safety, pragmatic safety, and coding-agent benchmarks all show standard metrics can miss real behavioral failures.

Bet
Runtime control will beat static alignment. Gear-based governance, context management, task-state wrappers, budgeted DB sessions, and uncertainty propagation all add safety at execution time.

Papers Worth Your Reading Time

Ranked for research usefulness: novelty, method pattern, evidence quality, and skepticism value.

Beyond Perplexity: A Behavioral Evaluation Framework for Deployment-Memory Claims in LLM Test-Time Training

#1

Useful if you make or evaluate memory claims: it separates adaptation from actual recall with a concrete behavioral framework.

Why now
Memory features and test-time training claims are spreading faster than matched deployment evidence.
Skepticism
Its strongest demonstration uses one-step LoRA on one model family, so it is more calibration than final verdict.

(A)I Sees What You Don't: Exploiting New Attack Surfaces in Third-Party Mobile Agents

#2

A strong companion read because it shows where deployed agents actually break: screenshots, control channels, and host-side execution.

Why now
Mobile and desktop agents are moving into real workflows while many teams still defend mainly at the prompt layer.
Skepticism
Results are on third-party Android agent stacks, so transfer to first-party or iOS systems is not guaranteed.

Can Agents Generalize to the Open World? Unveiling the Fragility of Static Training in Tool Use

#3

Worth opening for its clear taxonomy of open-world tool-use failures and its distinction between SFT and RL weaknesses.

Why now
Tool-using agents are leaving static benchmarks for changing APIs, schemas, and environments.
Skepticism
Most evidence comes from a controlled sandbox with one backbone and one RL setup.


Run stats

  • Candidates: 250
  • Selected: 30
  • Deepread completed: 30
  • Window (UTC): 2026-07-01T00:00:00Z → 2026-07-02T00:00:00Z (arxiv_announce, expanded=0)
Selected papers

| arXiv ID | Title | Categories | Score | Why | Tags |
|---|---|---|---|---|---|
| 2607.00481 | Beyond the Prompt: Jailbreaking Function-Calling LLMs via Simulated Moderation Traces | cs.CR, cs.AI | 95 | Black-box jailbreak on function-calling LLMs exposes a key agent security flaw beyond prompts. | jailbreak, function-calling, agent-security, prompt-injection, black-box |
| 2607.01208 | Distill to Detect: Exposing Stealth Biases in LLMs through Cartridge Distillation | cs.CL, cs.AI, cs.LG | 95 | Targets stealth LLM bias detection under supply-chain threat; strong safety relevance and concrete method. | llm-safety, bias-detection, supply-chain, distillation, auditing |
| 2607.01153 | Adversarial Pragmatics for AI Safety Evaluation: A Benchmark for Instruction Conflict, Embedded Commands, and Policy Ambiguity | cs.CL, cs.AI, cs.SE | 93 | Benchmark targets ambiguity, embedded commands, and instruction conflict in safety evaluation. | safety-eval, benchmark, instruction-following, embedded-commands, agents |
| 2607.00422 | KidnapRAG: A Black-Box Attack for Hijacking Reasoning in Agentic Retrieval-Augmented Generation Systems | cs.CR | 92 | Black-box poisoning attack on agentic RAG is highly relevant to deployed retrieval agents. | RAG, poisoning, agent-security, black-box, adversarial |
| 2607.00402 | The Illusion of High Utility in Safety Alignment of Text-to-Image Diffusion Models | cs.CV, cs.AI, cs.LG | 92 | Shows safety-image alignment can hide major semantic utility loss under coarse metrics. | safety, diffusion, evaluation, multimodal, utility, benchmark |
| 2607.00415 | A Mechanistic View of Authority Hierarchy in LLM Sycophancy | cs.CL, cs.LG | 92 | Mechanistic study of authority-driven sycophancy; directly relevant to LLM reliability and alignment. | sycophancy, mechanistic-interpretability, reliability, alignment, medical-qa |
| 2607.01071 | MemSyco-Bench: Benchmarking Sycophancy in Agent Memory | cs.IR, cs.AI | 91 | Benchmark targets memory-induced sycophancy in agents, a concrete and underexplored safety risk. | agent-safety, benchmark, memory, sycophancy, evaluation |
| 2607.00572 | HARC: Coupling Harmfulness and Refusal Directions for Robust Safety Alignment | cs.AI, cs.CR | 90 | Mechanistic safety work on harmfulness/refusal directions could inform robust anti-jailbreak alignment. | alignment, interpretability, jailbreaks, refusal, mechanistic |
| 2607.00692 | Self-GC: Self-Governing Context for Long-Horizon LLM Agents | cs.AI | 90 | Structured context governance for long-horizon agents addresses memory, evidence retention, and control. | agents, long-context, memory, context-management, tool-use |
| 2607.00871 | Self-Evolving Agents with Anytime-Valid Certificates | cs.AI, cs.CL | 89 | Auditable certificates for self-evolving agents address a core safety gap in self-modifying systems. | agents, safety, verification, self-modification, auditing |
| 2607.00334 | Managed Autonomy at Runtime: Gear-Based Safety and Governance for Single- and Multi-Agent Cyber-Physical Systems | cs.AI | 89 | Runtime governance framework for agent autonomy with formal safety/stability claims. | agents, safety, governance, multi-agent, runtime-control, formal-methods |
| 2607.00972 | Bayesian Uncertainty Propagation for Agentic RAG Pipelines: A Proof-of-Concept Study on Multi-Hop Question Answering | cs.AI | 88 | Uncertainty propagation for agentic RAG directly supports reliability, monitoring, and failure diagnosis. | agentic-rag, uncertainty, reliability, bayesian, evaluation |
| 2607.00502 | A Task-State Representation for Long-Horizon Mobile GUI Agents | cs.CL | 88 | Training-free task-state wrapper for long-horizon GUI agents; practical reliability gain for agent execution. | agents, gui-agents, task-state, long-horizon, reliability |
| 2607.00751 | SessionBound: Turning Enterprise Task Approval into Budgeted Database Sessions | cs.DB, cs.CR | 87 | Practical permissioning framework for enterprise agents with bounded, auditable DB sessions. | agent-security, authorization, databases, auditing, enterprise |
| 2607.00368 | Beyond Perplexity: A Behavioral Evaluation Framework for Deployment-Memory Claims in LLM Test-Time Training | cs.CL | 87 | Argues TTT memory claims need behavioral evidence, not proxy metrics alone. | llm, test-time-training, evaluation, memory, reliability, behavior |
| 2607.00447 | Understanding Why Language Models Hallucinate: Testing Reasoning Against Priors | cs.CL | 87 | Studies hallucination as inference misalignment and introduces a controlled benchmark for testing it. | hallucination, reasoning, reliability, benchmark, inference |
| 2607.01084 | Can Agents Generalize to the Open World? Unveiling the Fragility of Static Training in Tool Use | cs.AI | 86 | Open-world tool-use benchmark reveals fragility of static agent training under realistic shifts. | agents, tool-use, generalization, benchmark, robustness |
| 2607.01211 | Are Performance-Optimization Benchmarks Reliably Measuring Coding Agents? | cs.SE, cs.AI | 86 | Audits coding-agent benchmarks and exposes reliability issues in reported progress. | agents, coding, benchmark, evaluation, reliability, software-engineering |
| 2607.01181 | Right in the Right Way: LM Training with Verifiable Rewards and Human Demonstrations | cs.LG, cs.AI, cs.CL | 86 | Combines verifiable rewards with human demos to reduce reward hacking and unnatural RLVR behavior. | rlvr, alignment, reward-hacking, post-training, human-feedback |
| 2607.00724 | MSQA: A Natively Sourced Multilingual and Multicultural SimpleQA Benchmark | cs.CL | 86 | New multilingual cultural QA benchmark exposing limits of apparent alignment beyond language fluency. | evaluation, benchmark, multilingual, cultural-alignment, llms |
| 2607.00361 | ReShift: Aha-Moment-Driven Reasoning-Level Backdoor Attacks on Vision-Language Models | cs.CR | 85 | Backdoor attack on VLM reasoning traces is highly relevant to model security. | security, backdoor, vlm, reasoning, adversarial, safety |
| 2607.00605 | Auditing Forgetting in Limited Memory Language Models | cs.CL, cs.AI, cs.LG | 84 | Causal auditing of forgetting in memory-augmented LMs is useful for unlearning and leakage analysis. | unlearning, memory, auditing, privacy, reliability |
| 2607.01138 | Antaeus: Hunting Repository-Level Logic Vulnerabilities via Context-Grounded LLM Reasoning | cs.CR | 84 | Repository-grounded LLM reasoning for logic vuln detection targets real agent limits. | llm, security, code, vulnerability-detection, agents, repository-context |
| 2607.01232 | Is One Layer Enough? Training A Single Transformer Layer Can Match Full-Parameter RL Training | cs.LG, cs.CL | 84 | Layer-wise RL post-training result could materially change efficient alignment and adaptation practice. | llm-training, rl-post-training, efficiency, alignment, transformers |
| 2607.00597 | Multi-Turn Agentic Scientific Literature Search via Workflow Induction | cs.CL, cs.IR | 84 | Agentic literature search via explicit workflows improves inspectability and controllability of search agents. | agents, scientific-search, workflow-induction, inspectability, information-retrieval |
| 2607.00895 | Beyond Document Grounding: Span-Level Hallucination Detection over Code, Tool Output, and Documents | cs.CL | 82 | Span-level hallucination benchmark extends grounding checks to code, tools, and structured evidence. | hallucination, grounding, benchmark, code-agents, RAG |
| 2607.01087 | Cheap Code, Costly Judgment: A Case Study on Governable Agentic Software Engineering | cs.SE, cs.AI | 82 | Case study on governable coding agents emphasizes inspectability and correction loops. | agents, governance, software-engineering, coding, oversight, deployment |
| 2607.00990 | SWE-Doctor: Guiding Software Engineering Agents with Runtime Diagnosis from Multi-Faceted Bug Reproduction Tests | cs.SE, cs.AI | 82 | Improves software agents with runtime diagnosis from bug tests; relevant to agent reliability and tooling. | software-agents, runtime-diagnosis, tool-use, reliability, evaluation |
| 2607.00394 | When Classic Cache Policies Fail: Learning-Augmented Replacement for Semantic Retrieval Buffers | cs.DB, cs.CL | 82 | Semantic cache replacement for LLM memory buffers with strong empirical finding against standard heuristics. | retrieval, agent-memory, semantic-cache, efficiency, memorybench |
| 2607.00333 | (A)I Sees What You Don't: Exploiting New Attack Surfaces in Third-Party Mobile Agents | cs.CR | 81 | Identifies novel attack surfaces in VLM-powered mobile agents with real deployment relevance. | mobile-agents, VLM, security, attack-surface, agents |

AI Paper Insight Brief

2026-07-03

0) Executive takeaways (read this first)

  • Agent security is shifting from prompt-only threats to workflow and infrastructure threats: today’s strongest papers show practical attacks on mobile agents, function-calling systems, and agentic RAG by exploiting screenshots, tool traces, validation loops, and public reasoning signals rather than just user prompts.
  • Several papers argue that current evaluation proxies are misleading: perplexity/NLL for test-time training, CLIP/FID for T2I safety, aggregate pass/fail for pragmatic safety, and benchmark leaderboards for coding/perf agents can all overstate real capability or safety.
  • A recurring design pattern is runtime governance over static alignment: gear-based action gating, object-level context garbage collection, task-state wrappers, budgeted DB sessions, and uncertainty propagation all add control at execution time rather than trusting the base model.
  • Memory is emerging as a major reliability/safety fault line: papers show failures in semantic cache replacement, deployment-memory claims, memory-induced sycophancy, and deletion-based unlearning audits, suggesting that “memory” needs much more explicit structure and auditing.
  • Mechanistic and low-dimensional views are proving useful: authority-induced sycophancy localizes to late-layer representation erasure; harmfulness/refusal can be coupled in a small subspace; RL gains concentrate in middle transformer layers; hidden biases can be amplified via tiny prefix adapters.
  • For practitioners, the immediate implication is to instrument agents like distributed systems: secure channels, provenance checks, runtime gates, explicit state objects, calibrated uncertainty, and benchmark audits now look more actionable than another round of generic prompt hardening.
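
To make the "runtime gates" item above concrete, here is a minimal sketch of a gate that sits between a model's proposed tool call and its execution. It is not taken from any of the papers: the `RuntimeGate` class, its field names, the 0.7 confidence threshold, and the `run_step` helper are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


class GateDenied(Exception):
    """Raised when a proposed action fails a runtime check."""


@dataclass
class RuntimeGate:
    # All fields are illustrative assumptions, not any paper's design.
    allowed_tools: frozenset       # scoped permissions for this session
    max_calls: int                 # execution budget for this session
    min_confidence: float = 0.7    # below this the agent must abstain
    calls_used: int = 0
    audit_log: list = field(default_factory=list)

    def check(self, tool: str, confidence: float) -> None:
        """Raise GateDenied unless the call is within policy; count it if allowed."""
        if tool not in self.allowed_tools:
            reason = "tool not in scope"
        elif self.calls_used >= self.max_calls:
            reason = "budget exhausted"
        elif confidence < self.min_confidence:
            reason = "confidence below threshold"
        else:
            reason = None
        # Every decision is logged, allowed or not, so the run can be audited.
        self.audit_log.append((tool, confidence, reason or "allowed"))
        if reason is not None:
            raise GateDenied(reason)
        self.calls_used += 1


def run_step(gate: RuntimeGate, tool: str, confidence: float, execute):
    """Run `execute` only if the gate allows it; otherwise return an abstention."""
    try:
        gate.check(tool, confidence)
    except GateDenied as exc:
        return {"status": "abstained", "reason": str(exc)}
    return {"status": "ok", "result": execute()}
```

The point of the design is that the base model is untouched: the gate only sees a tool name and a confidence score, and a denial becomes an explicit abstention rather than a silent failure.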

2) Key themes (clusters)

Theme: Agent attack surfaces are moving below the prompt

Theme: Runtime governance is becoming the practical safety layer

Theme: Evaluation proxies are breaking under deployment claims

Theme: Memory is now a systems problem, not just a retrieval feature

Theme: Mechanistic and low-dimensional interventions are paying off

Theme: Open-world and long-horizon agents need explicit structure

3) Technical synthesis

  • A strong cross-paper pattern is moving from token-level to trajectory-level evaluation: ReShift targets CoT trajectories, KidnapRAG measures reasoning-path divergence, MemSyco-Bench audits post-retrieval decisions, and adversarial pragmatics uses minimal-pair contrasts instead of aggregate refusal labels.
  • Several papers expose proxy/behavior gaps: lower NLL without recall in TTT memory, stable CLIP/FID with degraded TIFA in T2I safety, benchmark scores unstable under replay/scoring changes in coding optimization, and local judge agreement varying sharply by label family in pragmatic safety evals.
  • Runtime wrappers beat monolithic retraining in many settings: the task-state representation (TSR) for GUI agents, Self-GC for context, SessionBound for DB access, and EntropyRuntime for cyber-physical systems (CPS) all leave the base model mostly intact while constraining execution.
  • Security work increasingly assumes black-box or low-privilege attackers rather than white-box omniscience: the KidnapRAG attacker only publishes documents, the simulated-moderation-trace (SMT) jailbreak uses only public function-calling APIs, and the mobile-agent attack uses a non-root malicious app.
  • Multiple papers rely on structured intermediate artifacts as the control point: JSON task state, typed workflow DAGs, diagnosis records, signed task tokens, indexed context objects, and repository context bundles.
  • There is a notable rise in causal decomposition methods: deletion audits split parametric leakage vs retrieval-mediated correctness; sycophancy work separates suppression from erasure; benchmark audits separate scoring artifacts from true task difficulty.
  • Low-dimensional adaptation appears repeatedly: HARC couples a small harmfulness/refusal subspace, Distill to Detect (D2D) uses tiny prefix cartridges, and single-layer RL often matches full-parameter training.
  • Several methods use formal guarantees with explicit assumptions rather than informal safety claims: EntropyRuntime’s theorems, SOLAR’s competitive ratio/regret bounds, ReShift’s entropy/KL theorem, and the anytime-valid gating framework of Self-Evolving Agents (SEA).
  • Across agent papers, exact evidence preservation is a recurring requirement: Self-GC preserves recoverable anchors, SWE-Doctor uses runtime-grounded traces, Antaeus adds local and repository-level code evidence, and mobile-agent attacks exploit when such evidence channels are unauthenticated.
  • A practical systems lesson is that memory, retrieval, and context are now first-class safety surfaces: cache replacement, retrieval poisoning, memory-induced sycophancy, forgetting audits, and context GC all point to the same operational bottleneck.
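
The proxy/behavior gap noted above (lower NLL without recall) can be checked with a very small harness. The sketch below is an illustration under stated assumptions, not the evaluation code of any paper: the function names, the substring-match definition of recall, and the 0.5 `min_recall` threshold are all hypothetical choices.

```python
def mean(xs):
    """Arithmetic mean of a non-empty sequence of numbers or booleans."""
    return sum(xs) / len(xs)


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so matching is not format-sensitive."""
    return " ".join(text.lower().split())


def recall_rate(generated, gold):
    """Fraction of items whose generated answer contains the gold string."""
    hits = [normalize(g) in normalize(out) for out, g in zip(generated, gold)]
    return mean(hits)


def proxy_behavior_report(nll_before, nll_after, generated, gold, min_recall=0.5):
    """Compare a proxy gain (mean NLL drop) against behavioral recall.

    `proxy_only` is True when the proxy improved but generated answers
    still fail to recall the target facts.
    """
    nll_gain = mean(nll_before) - mean(nll_after)
    recall = recall_rate(generated, gold)
    return {
        "nll_gain": nll_gain,
        "recall": recall,
        "proxy_only": nll_gain > 0 and recall < min_recall,
    }
```

Here `generated` is meant to hold answers produced without the supporting context in the prompt, so that recall measures what the adapted model actually retained. A report with `proxy_only` set is the pattern to treat as a warning sign: the metric moved, the behavior did not.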

4) Top 5 papers (with “why now”)

  • (A)I Sees What You Don’t: Exploiting New Attack Surfaces in Third-Party Mobile Agents
    • Shows seven concrete attacks against five open-source mobile-agent frameworks, with all agents vulnerable to at least six of seven attacks.
    • Demonstrates that screenshot perception and repurposed control/debug channels are enough for credential theft, workflow hijack, and host-side RCE.
    • Especially useful because the attacker only needs a low-privilege Android app, making the threat model operationally realistic.
    • Skeptical about: evaluation is on third-party Android agents using ADB/Accessibility; first-party and iOS systems may differ.
  • Beyond Perplexity: A Behavioral Evaluation Framework for Deployment-Memory Claims in LLM Test-Time Training
    • Gives a clean S/B/D evidence ladder that separates stream adaptation from true deployment-time memory claims.
    • The diagnostic result is sharp: one-step LoRA lowers support/answer NLL yet yields 0% generated recall across tested Qwen3 sizes.
    • Useful now because “memory” claims are proliferating in product and research narratives without matched behavioral evidence.
    • Skeptical about: the controlled experiment centers on one-step LoRA and one model family, so it is a calibration paper more than a universal negative result.
  • Can Agents Generalize to the Open World? Unveiling the Fragility of Static Training in Tool Use
    • Provides one of the clearest controlled taxonomies of open-world tool-use failure: perception, interaction, reasoning, internalization.
    • Distinguishes SFT and RL failure modes rather than just reporting aggregate degradation, then proposes PAFT as a practical fix.
    • Useful now because many tool-using agents are moving from benchmark sandboxes to changing APIs and schemas.
    • Skeptical about: most evidence comes from a POI-focused sandbox with one backbone and one RL setup.
  • HARC: Coupling Harmfulness and Refusal Directions for Robust Safety Alignment
    • Connects mechanistic interpretability to practical safety tuning by coupling harmfulness and refusal directions at prompt and response positions.
    • Reports strong robustness-capability-usability tradeoffs and multi-model scaling, with 4.67×–4.75× ASR reductions versus base models.
    • Useful because it offers a targeted alternative to broad safety fine-tuning that often causes over-refusal.
    • Skeptical about: the defense can be undone by adversarial fine-tuning with weight access, and it depends on the base model already encoding harmfulness signals.
  • Is One Layer Enough? Training A Single Transformer Layer Can Match Full-Parameter RL Training
    • Shows that RL gains are highly non-uniform across depth: training a single middle layer often recovers most of the full-parameter RL gain, and sometimes exceeds it.
    • Turns that insight into practical layer-aware strategies that outperform uniform RL and into ensembles with complementary strengths.
    • Useful now because RL post-training is expensive and noisy; this suggests a simpler optimization target with interpretability benefits.
    • Skeptical about: guided strategies are validated mainly on math in the main results, and some larger-model scans are partial.

5) Practical next steps

  • Audit every agent pipeline for non-prompt trust boundaries: screenshot acquisition, tool schemas, validation messages, retrieval traces, broadcast channels, and host-shell construction.
  • Add runtime enforcement layers before execution: scoped permissions, signed task/session tokens, utility or confidence gates, and explicit refusal/abstention paths for unsolved states.
  • Replace proxy-heavy evals with behavioral tests matched to the claim: no-context recall for memory, structured utility for T2I, minimal-pair pragmatic tests for prompt-injection resistance, and cross-machine replay for performance benchmarks.
  • Treat memory as a governed subsystem: measure post-retrieval misuse, interference, stale-memory effects, and deletion closure; do not rely on hit rate or NLL alone.
  • For long-horizon agents, externalize state into structured objects rather than raw transcript growth: task-state summaries, workflow DAGs, diagnosis records, or indexed context objects with recoverable anchors.
  • Add provenance and anomaly checks to retrieval/tooling: source credibility, chain-consistency checks, signed tool outputs, and retrieval-path divergence monitors.
  • Explore low-dimensional safety interventions first when fine-tuning: targeted LoRA/subspace coupling, layer-selective RL, or adapter-based audits before full-model retraining.
  • Build eval suites that separate capability failure from governance failure: retrieval succeeded but the decision failed, the model knew the fact but chose the shortcut, or the benchmark score changed because of aggregation, not capability.
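
As one way to picture the "signed task/session tokens" and "scoped permissions" items above, the sketch below issues and verifies an HMAC-signed, expiring, scope-limited token using only the Python standard library. The token layout, the claim names (`task`, `scopes`, `exp`), and both function names are assumptions for illustration; this is not SessionBound's format or any paper's protocol.

```python
import base64
import hashlib
import hmac
import json
import time


def issue_token(key: bytes, task_id: str, scopes, ttl_s: int, now=None) -> str:
    """Return 'payload.signature', both URL-safe base64 (hypothetical layout)."""
    now = time.time() if now is None else now
    payload = json.dumps(
        {"task": task_id, "scopes": sorted(scopes), "exp": now + ttl_s},
        sort_keys=True,
    ).encode()
    sig = hmac.new(key, payload, hashlib.sha256).digest()
    return (base64.urlsafe_b64encode(payload).decode()
            + "." + base64.urlsafe_b64encode(sig).decode())


def verify_token(key: bytes, token: str, needed_scope: str, now=None) -> bool:
    """Accept only an untampered, unexpired token that grants `needed_scope`."""
    now = time.time() if now is None else now
    try:
        p64, s64 = token.split(".")
        payload = base64.urlsafe_b64decode(p64)
        sig = base64.urlsafe_b64decode(s64)
    except ValueError:  # wrong part count or malformed base64
        return False
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    # Constant-time comparison; the payload is parsed only after the check.
    if not hmac.compare_digest(sig, expected):
        return False
    claims = json.loads(payload)
    return now < claims["exp"] and needed_scope in claims["scopes"]
```

A tool wrapper would call `verify_token` before each action, for example requiring a `"db:read"` scope before running a query, so that approval for one task cannot be replayed for a different action or after the time window closes.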

Generated from per-paper analyses; no external browsing.