July 3, 2026 Research Brief
Agent safety moves to runtime.
Today’s papers shift AI safety from prompt hardening to runtime control and behavioral audits, as realistic agent attacks and broken evaluation proxies expose weaknesses in deployed workflows.
Takeaways
- Agent security is shifting from prompt-only threats to **workflow and infrastructure threats**: today’s strongest papers show practical attacks on mobile agents, function-calling systems, and agentic RAG by exploiting screenshots, tool traces, validation loops, and public reasoning signals rather than just user prompts.
- Several papers argue that **current evaluation proxies are misleading**: perplexity/NLL for test-time training, CLIP/FID for T2I safety, aggregate pass/fail for pragmatic safety, and benchmark leaderboards for coding/perf agents can all overstate real capability or safety.
- A recurring design pattern is **runtime governance over static alignment**: gear-based action gating, object-level context garbage collection, task-state wrappers, budgeted DB sessions, and uncertainty propagation all add control at execution time rather than trusting the base model.
Start with: Beyond Perplexity: A Behavioral Evaluation Framework for Deployment-Memory Claims in LLM Test-Time Training
Why it catches my eye: It gives a reusable evaluation ladder for testing deployment-memory claims and shows proxy gains can completely miss behavioral recall.
Read skeptically for: The core negative result is centered on one-step LoRA and Qwen3, so generality across memory methods remains open.
Papers Worth Your Reading Time
Ranked for research usefulness: novelty, method pattern, evidence quality, and skepticism value.
Beyond Perplexity: A Behavioral Evaluation Framework for Deployment-Memory Claims in LLM Test-Time Training
#1 Useful if you make or evaluate memory claims: it separates adaptation from actual recall with a concrete behavioral framework.
- Why now
  - Memory features and test-time training claims are spreading faster than matched deployment evidence.
- Skepticism
  - Its strongest demonstration uses one-step LoRA on one model family, so it is more calibration than final verdict.
(A)I Sees What You Don't: Exploiting New Attack Surfaces in Third-Party Mobile Agents
#2 A strong companion read because it shows where deployed agents actually break: screenshots, control channels, and host-side execution.
- Why now
  - Mobile and desktop agents are moving into real workflows while many teams still defend mainly at the prompt layer.
- Skepticism
  - Results are on third-party Android agent stacks, so transfer to first-party or iOS systems is not guaranteed.
Can Agents Generalize to the Open World? Unveiling the Fragility of Static Training in Tool Use
#3 Worth opening for its clear taxonomy of open-world tool-use failures and its distinction between SFT and RL weaknesses.
- Why now
  - Tool-using agents are leaving static benchmarks for changing APIs, schemas, and environments.
- Skepticism
  - Most evidence comes from a controlled sandbox with one backbone and one RL setup.
Run stats
- Candidates: 250
- Selected: 30
- Deepread completed: 30
- Window (UTC): 2026-07-01T00:00:00Z → 2026-07-02T00:00:00Z (arxiv_announce, expanded=0)
Selected papers
| arXiv ID | Title / Links | Categories | Score | Why | Tags |
|---|---|---|---|---|---|
| 2607.00481 | Beyond the Prompt: Jailbreaking Function-Calling LLMs via Simulated Moderation Traces | cs.CR, cs.AI | 95 | Black-box jailbreak on function-calling LLMs exposes a key agent security flaw beyond prompts. | jailbreak, function-calling, agent-security, prompt-injection, black-box |
| 2607.01208 | Distill to Detect: Exposing Stealth Biases in LLMs through Cartridge Distillation | cs.CL, cs.AI, cs.LG | 95 | Targets stealth LLM bias detection under supply-chain threat; strong safety relevance and concrete method. | llm-safety, bias-detection, supply-chain, distillation, auditing |
| 2607.01153 | Adversarial Pragmatics for AI Safety Evaluation: A Benchmark for Instruction Conflict, Embedded Commands, and Policy Ambiguity | cs.CL, cs.AI, cs.SE | 93 | Benchmark targets ambiguity, embedded commands, and instruction conflict in safety evaluation. | safety-eval, benchmark, instruction-following, embedded-commands, agents |
| 2607.00422 | KidnapRAG: A Black-Box Attack for Hijacking Reasoning in Agentic Retrieval-Augmented Generation Systems | cs.CR | 92 | Black-box poisoning attack on agentic RAG is highly relevant to deployed retrieval agents. | RAG, poisoning, agent-security, black-box, adversarial |
| 2607.00402 | The Illusion of High Utility in Safety Alignment of Text-to-Image Diffusion Models | cs.CV, cs.AI, cs.LG | 92 | Shows safety-image alignment can hide major semantic utility loss under coarse metrics. | safety, diffusion, evaluation, multimodal, utility, benchmark |
| 2607.00415 | A Mechanistic View of Authority Hierarchy in LLM Sycophancy | cs.CL, cs.LG | 92 | Mechanistic study of authority-driven sycophancy; directly relevant to LLM reliability and alignment. | sycophancy, mechanistic-interpretability, reliability, alignment, medical-qa |
| 2607.01071 | MemSyco-Bench: Benchmarking Sycophancy in Agent Memory | cs.IR, cs.AI | 91 | Benchmark targets memory-induced sycophancy in agents, a concrete and underexplored safety risk. | agent-safety, benchmark, memory, sycophancy, evaluation |
| 2607.00572 | HARC: Coupling Harmfulness and Refusal Directions for Robust Safety Alignment | cs.AI, cs.CR | 90 | Mechanistic safety work on harmfulness/refusal directions could inform robust anti-jailbreak alignment. | alignment, interpretability, jailbreaks, refusal, mechanistic |
| 2607.00692 | Self-GC: Self-Governing Context for Long-Horizon LLM Agents | cs.AI | 90 | Structured context governance for long-horizon agents addresses memory, evidence retention, and control. | agents, long-context, memory, context-management, tool-use |
| 2607.00871 | Self-Evolving Agents with Anytime-Valid Certificates | cs.AI, cs.CL | 89 | Auditable certificates for self-evolving agents address a core safety gap in self-modifying systems. | agents, safety, verification, self-modification, auditing |
| 2607.00334 | Managed Autonomy at Runtime: Gear-Based Safety and Governance for Single- and Multi-Agent Cyber-Physical Systems | cs.AI | 89 | Runtime governance framework for agent autonomy with formal safety/stability claims. | agents, safety, governance, multi-agent, runtime-control, formal-methods |
| 2607.00972 | Bayesian Uncertainty Propagation for Agentic RAG Pipelines: A Proof-of-Concept Study on Multi-Hop Question Answering | cs.AI | 88 | Uncertainty propagation for agentic RAG directly supports reliability, monitoring, and failure diagnosis. | agentic-rag, uncertainty, reliability, bayesian, evaluation |
| 2607.00502 | A Task-State Representation for Long-Horizon Mobile GUI Agents | cs.CL | 88 | Training-free task-state wrapper for long-horizon GUI agents; practical reliability gain for agent execution. | agents, gui-agents, task-state, long-horizon, reliability |
| 2607.00751 | SessionBound: Turning Enterprise Task Approval into Budgeted Database Sessions | cs.DB, cs.CR | 87 | Practical permissioning framework for enterprise agents with bounded, auditable DB sessions. | agent-security, authorization, databases, auditing, enterprise |
| 2607.00368 | Beyond Perplexity: A Behavioral Evaluation Framework for Deployment-Memory Claims in LLM Test-Time Training | cs.CL | 87 | Argues TTT memory claims need behavioral evidence, not proxy metrics alone. | llm, test-time-training, evaluation, memory, reliability, behavior |
| 2607.00447 | Understanding Why Language Models Hallucinate: Testing Reasoning Against Priors | cs.CL | 87 | Studies hallucination as inference misalignment and introduces a controlled benchmark for testing it. | hallucination, reasoning, reliability, benchmark, inference |
| 2607.01084 | Can Agents Generalize to the Open World? Unveiling the Fragility of Static Training in Tool Use | cs.AI | 86 | Open-world tool-use benchmark reveals fragility of static agent training under realistic shifts. | agents, tool-use, generalization, benchmark, robustness |
| 2607.01211 | Are Performance-Optimization Benchmarks Reliably Measuring Coding Agents? | cs.SE, cs.AI | 86 | Audits coding-agent benchmarks and exposes reliability issues in reported progress. | agents, coding, benchmark, evaluation, reliability, software-engineering |
| 2607.01181 | Right in the Right Way: LM Training with Verifiable Rewards and Human Demonstrations | cs.LG, cs.AI, cs.CL | 86 | Combines verifiable rewards with human demos to reduce reward hacking and unnatural RLVR behavior. | rlvr, alignment, reward-hacking, post-training, human-feedback |
| 2607.00724 | MSQA: A Natively Sourced Multilingual and Multicultural SimpleQA Benchmark | cs.CL | 86 | New multilingual cultural QA benchmark exposing limits of apparent alignment beyond language fluency. | evaluation, benchmark, multilingual, cultural-alignment, llms |
| 2607.00361 | ReShift: Aha-Moment-Driven Reasoning-Level Backdoor Attacks on Vision-Language Models | cs.CR | 85 | Backdoor attack on VLM reasoning traces is highly relevant to model security. | security, backdoor, vlm, reasoning, adversarial, safety |
| 2607.00605 | Auditing Forgetting in Limited Memory Language Models | cs.CL, cs.AI, cs.LG | 84 | Causal auditing of forgetting in memory-augmented LMs is useful for unlearning and leakage analysis. | unlearning, memory, auditing, privacy, reliability |
| 2607.01138 | Antaeus: Hunting Repository-Level Logic Vulnerabilities via Context-Grounded LLM Reasoning | cs.CR | 84 | Repository-grounded LLM reasoning for logic vuln detection targets real agent limits. | llm, security, code, vulnerability-detection, agents, repository-context |
| 2607.01232 | Is One Layer Enough? Training A Single Transformer Layer Can Match Full-Parameter RL Training | cs.LG, cs.CL | 84 | Layer-wise RL post-training result could materially change efficient alignment and adaptation practice. | llm-training, rl-post-training, efficiency, alignment, transformers |
| 2607.00597 | Multi-Turn Agentic Scientific Literature Search via Workflow Induction | cs.CL, cs.IR | 84 | Agentic literature search via explicit workflows improves inspectability and controllability of search agents. | agents, scientific-search, workflow-induction, inspectability, information-retrieval |
| 2607.00895 | Beyond Document Grounding: Span-Level Hallucination Detection over Code, Tool Output, and Documents | cs.CL | 82 | Span-level hallucination benchmark extends grounding checks to code, tools, and structured evidence. | hallucination, grounding, benchmark, code-agents, RAG |
| 2607.01087 | Cheap Code, Costly Judgment: A Case Study on Governable Agentic Software Engineering | cs.SE, cs.AI | 82 | Case study on governable coding agents emphasizes inspectability and correction loops. | agents, governance, software-engineering, coding, oversight, deployment |
| 2607.00990 | SWE-Doctor: Guiding Software Engineering Agents with Runtime Diagnosis from Multi-Faceted Bug Reproduction Tests | cs.SE, cs.AI | 82 | Improves software agents with runtime diagnosis from bug tests; relevant to agent reliability and tooling. | software-agents, runtime-diagnosis, tool-use, reliability, evaluation |
| 2607.00394 | When Classic Cache Policies Fail: Learning-Augmented Replacement for Semantic Retrieval Buffers | cs.DB, cs.CL | 82 | Semantic cache replacement for LLM memory buffers with strong empirical finding against standard heuristics. | retrieval, agent-memory, semantic-cache, efficiency, memorybench |
| 2607.00333 | (A)I Sees What You Don't: Exploiting New Attack Surfaces in Third-Party Mobile Agents | cs.CR | 81 | Identifies novel attack surfaces in VLM-powered mobile agents with real deployment relevance. | mobile-agents, VLM, security, attack-surface, agents |
AI Paper Insight Brief
2026-07-03
0) Executive takeaways (read this first)
- Agent security is shifting from prompt-only threats to workflow and infrastructure threats: today’s strongest papers show practical attacks on mobile agents, function-calling systems, and agentic RAG by exploiting screenshots, tool traces, validation loops, and public reasoning signals rather than just user prompts.
- Several papers argue that current evaluation proxies are misleading: perplexity/NLL for test-time training, CLIP/FID for T2I safety, aggregate pass/fail for pragmatic safety, and benchmark leaderboards for coding/perf agents can all overstate real capability or safety.
- A recurring design pattern is runtime governance over static alignment: gear-based action gating, object-level context garbage collection, task-state wrappers, budgeted DB sessions, and uncertainty propagation all add control at execution time rather than trusting the base model.
- Memory is emerging as a major reliability/safety fault line: papers show failures in semantic cache replacement, deployment-memory claims, memory-induced sycophancy, and deletion-based unlearning audits, suggesting that “memory” needs much more explicit structure and auditing.
- Mechanistic and low-dimensional views are proving useful: authority-induced sycophancy localizes to late-layer representation erasure; harmfulness/refusal can be coupled in a small subspace; RL gains concentrate in middle transformer layers; hidden biases can be amplified via tiny prefix adapters.
- For practitioners, the immediate implication is to instrument agents like distributed systems: secure channels, provenance checks, runtime gates, explicit state objects, calibrated uncertainty, and benchmark audits now look more actionable than another round of generic prompt hardening.
2) Key themes (clusters)
Theme: Agent attack surfaces are moving below the prompt
- Why it matters: The most practical attacks in this batch do not rely on clever wording alone. They exploit the execution substrate around agents—screenshots, tool schemas, public traces, retrieval chains, and host-side channels—where many deployed systems still assume trusted context.
- Representative papers:
  - (A)I Sees What You Don’t: Exploiting New Attack Surfaces in Third-Party Mobile Agents
  - Beyond the Prompt: Jailbreaking Function-Calling LLMs via Simulated Moderation Traces
  - KidnapRAG: A Black-Box Attack for Hijacking Reasoning in Agentic Retrieval-Augmented Generation Systems
  - ReShift: Aha-Moment-Driven Reasoning-Level Backdoor Attacks on Vision-Language Models
- Common approach:
  - Exploit trusted-but-unverified intermediate artifacts: screenshots, function-call traces, retrieved documents, CoT trajectories.
  - Target multi-step control flow rather than one-shot outputs.
  - Use black-box or low-privilege attacker models that fit realistic deployment assumptions.
  - Measure success with end-to-end task hijack metrics, not just token-level jailbreak rates.
- Open questions / failure modes:
  - How much do these attacks weaken when systems hide or sanitize intermediate traces?
  - Can provenance-aware retrieval and tool-output signing close the gap without crippling usability?
  - Many defenses remain prompt-level while attacks operate at channel/workflow level.
  - Some attacks depend on specific architectures or attacker footholds, so prevalence in proprietary systems remains unclear.
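To make the tool-output signing question concrete, here is a minimal sketch of what an authenticated tool-result envelope might look like. It is not taken from any of the papers above. The names `SECRET_KEY`, `sign_tool_output`, and `verify_tool_output` are hypothetical, and the sketch assumes the tool runner and the agent host already share a secret key through some out-of-band channel.

```python
import hashlib
import hmac
import json
from typing import Optional

# Hypothetical shared secret between the tool runner and the agent host.
# In a real deployment this would come from a secrets manager, not source code.
SECRET_KEY = b"demo-key-do-not-use-in-production"


def sign_tool_output(tool_name: str, payload: dict, key: bytes = SECRET_KEY) -> dict:
    """Wrap a tool result in an envelope that carries an HMAC-SHA256 tag."""
    # sort_keys gives a stable serialization so the same result always signs the same way.
    body = json.dumps({"tool": tool_name, "payload": payload}, sort_keys=True)
    tag = hmac.new(key, body.encode("utf-8"), hashlib.sha256).hexdigest()
    return {"body": body, "tag": tag}


def verify_tool_output(envelope: dict, key: bytes = SECRET_KEY) -> Optional[dict]:
    """Return the parsed body if the tag matches, otherwise None."""
    expected = hmac.new(key, envelope["body"].encode("utf-8"), hashlib.sha256).hexdigest()
    # compare_digest avoids leaking information through comparison timing.
    if not hmac.compare_digest(expected, envelope["tag"]):
        return None
    return json.loads(envelope["body"])
```

A host built along these lines would drop any tool result for which `verify_tool_output` returns `None` before it reaches the model's context. This only authenticates the channel; it does nothing about a tool that is itself compromised or that faithfully returns poisoned content.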
Theme: Runtime governance is becoming the practical safety layer
- Why it matters: Multiple papers converge on the idea that static alignment is insufficient once agents act over long horizons or in physical/data systems. Safety is increasingly being implemented as runtime control over authority, context, and execution budgets.
- Representative papers:
  - Managed Autonomy at Runtime: Gear-Based Safety and Governance for Single- and Multi-Agent Cyber-Physical Systems
  - Self-GC: Self-Governing Context for Long-Horizon LLM Agents
  - SessionBound: Turning Enterprise Task Approval into Budgeted Database Sessions
  - Bayesian Uncertainty Propagation for Agentic RAG Pipelines: A Proof-of-Concept Study on Multi-Hop Question Answering
- Common approach:
  - Insert a control layer between reasoning and execution.
  - Convert free-form context into structured objects, budgets, or state machines.
  - Use explicit gating signals: utility thresholds, token savings, signed task scopes, BN confidence.
  - Favor auditable receipts/certificates over implicit trust in model behavior.
- Open questions / failure modes:
  - Many guarantees depend on strong assumptions: stationarity, OU fault models, deterministic BN structure, single-DB scope.
  - Runtime controls can add latency/cost and may degrade short/simple tasks.
  - Closed-loop governance still needs robust signals; weak uncertainty or utility estimates can misgate.
  - Scalability beyond small teams or narrow infrastructures is often unproven.
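The "control layer between reasoning and execution" pattern can be illustrated with a small gate that checks scope, budget, and a confidence signal before any action runs, and records a receipt either way. This is a generic sketch, not the mechanism of any paper listed here. `SessionBudget`, `gate`, the verdict strings, and the 0.7 default threshold are all illustrative assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class SessionBudget:
    """Hypothetical per-task budget; field names are illustrative only."""
    allowed_actions: frozenset
    max_calls: int
    calls_used: int = 0
    audit_log: list = field(default_factory=list)


def gate(budget: SessionBudget, action: str, confidence: float,
         min_confidence: float = 0.7) -> bool:
    """Decide whether a proposed action may execute, and log a receipt."""
    if action not in budget.allowed_actions:
        verdict = "deny:out_of_scope"
    elif budget.calls_used >= budget.max_calls:
        verdict = "deny:budget_exhausted"
    elif confidence < min_confidence:
        verdict = "deny:low_confidence"
    else:
        verdict = "allow"
        budget.calls_used += 1  # only allowed actions consume budget
    # Every decision, allowed or denied, leaves an auditable record.
    budget.audit_log.append((action, round(confidence, 3), verdict))
    return verdict == "allow"
```

The point of the design is that the model's output is only a proposal: the gate, not the model, decides whether it executes. As the failure modes above note, a gate like this is only as good as its confidence signal, so a poorly calibrated `confidence` value will misgate.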
Theme: Evaluation proxies are breaking under deployment claims
- Why it matters: Several papers show that standard metrics can support claims they do not actually justify. This is especially important for memory, safety alignment, and benchmark-driven progress reporting.
- Representative papers:
  - Beyond Perplexity: A Behavioral Evaluation Framework for Deployment-Memory Claims in LLM Test-Time Training
  - The Illusion of High Utility in Safety Alignment of Text-to-Image Diffusion Models
  - Adversarial Pragmatics for AI Safety Evaluation: A Benchmark for Instruction Conflict, Embedded Commands, and Policy Ambiguity
  - Are Performance-Optimization Benchmarks Reliably Measuring Coding Agents?
- Common approach:
  - Compare proxy metrics against behaviorally grounded or structured evaluations.
  - Audit existing literature/leaderboards rather than proposing only new models.
  - Use minimal pairs, matched baselines, or cross-machine replay to isolate what is actually being measured.
  - Emphasize failure categories and calibration over aggregate scores.
- Open questions / failure modes:
  - Better metrics often cost more to run and are harder to standardize.
  - Some proposed audits are still small-scale or judge-dependent.
  - Benchmark ecosystems may resist changes that reduce headline scores or comparability.
  - Structured evaluations can themselves become gamed if they ossify.
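A simple way to operationalize "compare proxy metrics against behaviorally grounded evaluations" is to flag every condition where the proxy improved but the behavioral measure did not. The sketch below assumes a lower-is-better proxy such as NLL and a recall rate measured on probes without context. `EvalResult`, `proxy_behavior_gaps`, and the 0.05 threshold are hypothetical, and the field layout is not drawn from any paper.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    name: str
    proxy_before: float   # e.g. NLL before adaptation (lower is better)
    proxy_after: float    # same proxy after adaptation
    recall_before: float  # fraction of no-context probes answered correctly
    recall_after: float


def proxy_behavior_gaps(results, min_recall_gain=0.05):
    """Return the names of conditions where the proxy improved but recall did not."""
    flagged = []
    for r in results:
        proxy_improved = r.proxy_after < r.proxy_before
        recall_improved = (r.recall_after - r.recall_before) >= min_recall_gain
        if proxy_improved and not recall_improved:
            flagged.append(r.name)
    return flagged
```

Reporting the flagged list alongside the headline metric keeps the gap visible instead of letting an aggregate score absorb it.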
Theme: Memory is now a systems problem, not just a retrieval feature
- Why it matters: Across agents, “memory” is failing in multiple ways: caches use the wrong replacement policy, parametric updates don’t yield behavioral recall, retrieved memories induce sycophancy, and deletion doesn’t mean forgetting unless the retrieval graph is sanitized.
- Representative papers:
  - When Classic Cache Policies Fail: Learning-Augmented Replacement for Semantic Retrieval Buffers
  - Beyond Perplexity: A Behavioral Evaluation Framework for Deployment-Memory Claims in LLM Test-Time Training
  - MemSyco-Bench: Benchmarking Sycophancy in Agent Memory
  - Auditing Forgetting in Limited Memory Language Models
- Common approach:
  - Separate persistent state from transient observations or retrieval artifacts.
  - Evaluate post-retrieval/use behavior, not just storage or hit rate.
  - Add explicit structure: state objects, posterior-guided eviction, causal deletion audits.
  - Show that naive heuristics or parametric updates create interference and misuse.
- Open questions / failure modes:
  - Memory controllers themselves can propagate errors or add cost.
  - Retrieval-mediated artifacts remain hard to distinguish from legitimate generalization.
  - Efficient memory compression may remove cues needed for correct arbitration.
  - Generalization beyond current benchmarks and domains remains open.
Theme: Mechanistic and low-dimensional interventions are paying off
- Why it matters: Some of the strongest alignment/robustness results here come from identifying compact internal structure—specific layers, directions, or adapters—rather than broad full-model retraining.
- Representative papers:
  - A Mechanistic View of Authority Hierarchy in LLM Sycophancy
  - HARC: Coupling Harmfulness and Refusal Directions for Robust Safety Alignment
  - Is One Layer Enough? Training A Single Transformer Layer Can Match Full-Parameter RL Training
  - Distill to Detect: Exposing Stealth Biases in LLMs through Cartridge Distillation
- Common approach:
  - Identify sparse or low-rank structure in residual streams or logit shifts.
  - Use targeted interventions: activation addition, LoRA in selected subspaces, single-layer RL, prefix cartridges.
  - Validate with causal or comparative experiments rather than only correlations.
  - Preserve broad capability by confining changes to small internal regions.
- Open questions / failure modes:
  - Some methods are fragile to adaptive attackers or weight-space fine-tuning.
  - Low-dimensional structure may not cover obfuscated or high-rank failure modes.
  - Mechanistic findings are often shown on mid-sized models and narrow tasks.
  - Profiling or gray-box access may be required, limiting deployment use.
Theme: Open-world and long-horizon agents need explicit structure
- Why it matters: Several papers show that agents fail when they must adapt over long horizons, evolving tools, or complex repositories. The common fix is to externalize structure that current prompting leaves implicit.
- Representative papers:
  - Can Agents Generalize to the Open World? Unveiling the Fragility of Static Training in Tool Use
  - Multi-Turn Agentic Scientific Literature Search via Workflow Induction
  - SWE-Doctor: Guiding Software Engineering Agents with Runtime Diagnosis from Multi-Faceted Bug Reproduction Tests
  - Antaeus: Hunting Repository-Level Logic Vulnerabilities via Context-Grounded LLM Reasoning
- Common approach:
  - Replace free-form trajectories with typed workflows, diagnosis records, or grounded context bundles.
  - Use controlled perturbations or repository-scale evidence to expose brittle generalization.
  - Add explicit refinement loops based on runtime evidence.
  - Measure execution errors, retrieval quality, or vulnerability recall rather than only final text quality.
- Open questions / failure modes:
  - Structured wrappers may inherit teacher or simulator biases.
  - Gains can be domain-specific and expensive in tokens or tool calls.
  - Real-world non-stationarity is broader than current controlled perturbation suites.
  - Broader human-in-the-loop validation is still limited.
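As a rough illustration of externalizing structure that prompting leaves implicit, the sketch below keeps task progress in a small typed object and serializes it for each turn, instead of relying on an ever-growing transcript. The `TaskState` class and its fields are assumptions made for this example and do not reproduce the schema of any paper in this theme.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class TaskState:
    """Hypothetical explicit task state for a long-horizon agent."""
    goal: str
    completed_steps: list = field(default_factory=list)
    pending_steps: list = field(default_factory=list)
    facts: dict = field(default_factory=dict)

    def complete(self, step: str, **new_facts) -> None:
        """Move a step from pending to completed and record what was learned."""
        if step not in self.pending_steps:
            raise ValueError(f"unknown or already completed step: {step}")
        self.pending_steps.remove(step)
        self.completed_steps.append(step)
        self.facts.update(new_facts)

    def to_prompt(self) -> str:
        """Serialize the state compactly for inclusion in the next model call."""
        return json.dumps(asdict(self), sort_keys=True)
```

Because the state is a plain object, it can be validated, diffed between turns, and logged, which is harder to do with a raw transcript. The cost, as noted above, is that the wrapper itself can carry errors forward if a step is marked complete incorrectly.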
3) Technical synthesis
- A strong cross-paper pattern is moving from token-level to trajectory-level evaluation: ReShift targets CoT trajectories, KidnapRAG measures reasoning-path divergence, MemSyco-Bench audits post-retrieval decisions, and adversarial pragmatics uses minimal-pair contrasts instead of aggregate refusal labels.
- Several papers expose proxy/behavior gaps: lower NLL without recall in TTT memory, stable CLIP/FID with degraded TIFA in T2I safety, benchmark scores unstable under replay/scoring changes in coding optimization, and local judge agreement varying sharply by label family in pragmatic safety evals.
- Runtime wrappers beat monolithic retraining in many settings: TSR (the task-state representation) for GUI agents, Self-GC for context, SessionBound for DB access, and EntropyRuntime for CPS all leave the base model mostly intact while constraining execution.
- Security work increasingly assumes black-box or low-privilege attackers rather than white-box omniscience: KidnapRAG only publishes documents, SMT (Simulated Moderation Traces) uses only public function-calling APIs, and the mobile-agent attack uses a non-root malicious app.
- Multiple papers rely on structured intermediate artifacts as the control point: JSON task state, typed workflow DAGs, diagnosis records, signed task tokens, indexed context objects, and repository context bundles.
- There is a notable rise in causal decomposition methods: deletion audits split parametric leakage vs retrieval-mediated correctness; sycophancy work separates suppression from erasure; benchmark audits separate scoring artifacts from true task difficulty.
- Low-dimensional adaptation appears repeatedly: HARC couples a small harmfulness/refusal subspace, D2D (Distill to Detect) uses tiny prefix cartridges, and single-layer RL often matches full-parameter training.
- Several methods use formal guarantees with explicit assumptions rather than informal safety claims: EntropyRuntime’s theorems, SOLAR’s competitive ratio/regret bounds, ReShift’s entropy/KL theorem, and SEA’s anytime-valid gating framework.
- Across agent papers, exact evidence preservation is a recurring requirement: Self-GC preserves recoverable anchors, SWE-Doctor uses runtime-grounded traces, Antaeus adds local and repository-level code evidence, and mobile-agent attacks exploit when such evidence channels are unauthenticated.
- A practical systems lesson is that memory, retrieval, and context are now first-class safety surfaces: cache replacement, retrieval poisoning, memory-induced sycophancy, forgetting audits, and context GC all point to the same operational bottleneck.
4) Top 5 papers (with “why now”)
- (A)I Sees What You Don’t: Exploiting New Attack Surfaces in Third-Party Mobile Agents
  - Shows seven concrete attacks against five open-source mobile-agent frameworks, with all agents vulnerable to at least six of seven attacks.
  - Demonstrates that screenshot perception and repurposed control/debug channels are enough for credential theft, workflow hijack, and host-side RCE.
  - Especially useful because the attacker only needs a low-privilege Android app, making the threat model operationally realistic.
  - Skeptical about: evaluation is on third-party Android agents using ADB/Accessibility; first-party and iOS systems may differ.
- Beyond Perplexity: A Behavioral Evaluation Framework for Deployment-Memory Claims in LLM Test-Time Training
  - Gives a clean S/B/D evidence ladder that separates stream adaptation from true deployment-time memory claims.
  - The diagnostic result is sharp: one-step LoRA lowers support/answer NLL yet yields 0% generated recall across tested Qwen3 sizes.
  - Useful now because “memory” claims are proliferating in product and research narratives without matched behavioral evidence.
  - Skeptical about: the controlled experiment centers on one-step LoRA and one model family, so it is a calibration paper more than a universal negative result.
- Can Agents Generalize to the Open World? Unveiling the Fragility of Static Training in Tool Use
  - Provides one of the clearest controlled taxonomies of open-world tool-use failure: perception, interaction, reasoning, internalization.
  - Distinguishes SFT and RL failure modes rather than just reporting aggregate degradation, then proposes PAFT as a practical fix.
  - Useful now because many tool-using agents are moving from benchmark sandboxes to changing APIs and schemas.
  - Skeptical about: most evidence comes from a POI-focused sandbox with one backbone and one RL setup.
- HARC: Coupling Harmfulness and Refusal Directions for Robust Safety Alignment
  - Connects mechanistic interpretability to practical safety tuning by coupling harmfulness and refusal directions at prompt and response positions.
  - Reports strong robustness-capability-usability tradeoffs and multi-model scaling, with 4.67×–4.75× reductions in attack success rate (ASR) versus base models.
  - Useful because it offers a targeted alternative to broad safety fine-tuning that often causes over-refusal.
  - Skeptical about: the defense can be undone by adversarial fine-tuning with weight access, and it depends on the base model already encoding harmfulness signals.
- Is One Layer Enough? Training A Single Transformer Layer Can Match Full-Parameter RL Training
  - Shows that RL gains are highly non-uniform across depth: training a single middle layer often recovers most of, and sometimes more than, the gain from full-parameter RL.
  - Turns that insight into practical layer-aware strategies that outperform uniform RL and into ensembles with complementary strengths.
  - Useful now because RL post-training is expensive and noisy; this suggests a simpler optimization target with interpretability benefits.
  - Skeptical about: guided strategies are validated mainly on math in the main results, and some larger-model scans are partial.
5) Practical next steps
- Audit every agent pipeline for non-prompt trust boundaries: screenshot acquisition, tool schemas, validation messages, retrieval traces, broadcast channels, and host-shell construction.
- Add runtime enforcement layers before execution: scoped permissions, signed task/session tokens, utility or confidence gates, and explicit refusal/abstention paths for unsolved states.
- Replace proxy-heavy evals with behavioral tests matched to the claim: no-context recall for memory, structured utility for T2I, minimal-pair pragmatic tests for prompt-injection resistance, and cross-machine replay for performance benchmarks.
- Treat memory as a governed subsystem: measure post-retrieval misuse, interference, stale-memory effects, and deletion closure; do not rely on hit rate or NLL alone.
- For long-horizon agents, externalize state into structured objects rather than raw transcript growth: task-state summaries, workflow DAGs, diagnosis records, or indexed context objects with recoverable anchors.
- Add provenance and anomaly checks to retrieval/tooling: source credibility, chain-consistency checks, signed tool outputs, and retrieval-path divergence monitors.
- Explore low-dimensional safety interventions first when fine-tuning: targeted LoRA/subspace coupling, layer-selective RL, or adapter-based audits before full-model retraining.
- Build eval suites that separate capability failure from governance failure: retrieval succeeded but decision failed, model knew the fact but chose the shortcut, benchmark score changed because of aggregation not capability.
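The last step, separating capability failure from governance failure, can be sketched as a small attribution function applied to each evaluation case. This is an illustrative assumption about how such a suite might label outcomes, not a method from any of the papers. `attribute_outcome`, `summarize`, and the label names are hypothetical, and `knows_fact` is assumed to come from a separate closed-book probe.

```python
from collections import Counter


def attribute_outcome(retrieved_evidence: bool, knows_fact: bool,
                      answer_correct: bool) -> str:
    """Assign one coarse label to a single evaluation case."""
    if answer_correct:
        return "pass"
    if retrieved_evidence:
        # The evidence was in context, yet the final decision was still wrong.
        return "decision_failure"
    if knows_fact:
        # A closed-book probe shows the fact is known, but it was not used.
        return "shortcut_failure"
    return "capability_failure"


def summarize(cases):
    """Count labels over an iterable of (retrieved, knows, correct) tuples."""
    return Counter(attribute_outcome(*case) for case in cases)
```

Reporting the label counts rather than a single accuracy number shows whether an error budget is being spent on missing capability or on decisions made after the right information was already available.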
Generated from per-paper analyses; no external browsing.