Takeaways

The strongest thread today is a shift from average-case benchmark scores toward **operational guarantees and failure localization**: papers focus on wrong-action budgets, instruction-hierarchy preservation, persistent-state governance, and rubric verification under long contexts.
**Inference-time control is getting more practical and targeted**: IHDec enforces role hierarchy during multi-turn decoding, ADAPT steers multimodal cross-attention when grounding degrades, and NPM/CPE use internal activations or low-rank perturbations to recover latent skills or behaviors without full retraining.
Security work is increasingly about **system surfaces, not just model outputs**: model hubs, web agents, skill registries, prompt injection, ASCII-art moderation bypasses, and model-merging defenses all show that deployment infrastructure and composition layers are major attack surfaces.

Start with: Budgeted Act-or-Defer Multi-Agent LLM Deliberation with Local Reliability Bounds

Why it catches my eye: It turns a declared wrong-action budget into an auditable autonomy rule for multi-agent systems with human escalation.

Read skeptically for: Its guarantees rely on local bias-envelope and representation-gap assumptions that may break under harder deployment shifts.

Tags: agents, reliability, calibration, human-in-the-loop


Themes

Inference-time control and mechanistic steering: A notable share of today’s work tries to improve behavior without expensive retraining, using decoding controls, activation steering, or localized weight perturbations. This is attractive for safety because it can be deployed faster, audited more directly, and targeted to specific failure modes.

Safety evaluation is moving from outputs to operating conditions: The most useful evaluations today are less about “can the model answer?” and more about “can the system act safely under budget, timing, hierarchy, and long-context constraints?” This is closer to deployment reality.

Security is shifting to ecosystem and composition attacks: The attack surface is no longer just the base model. Today’s strongest security papers target model hubs, web agents, skill registries, model merging, multilingual jailbreaks, and moderation bypasses, all places where composition and infrastructure create exploitable gaps.

Signal: Safety is becoming operational. Act-or-defer bounds, rubric-verification benchmarks, child-safety audits, and emotional-support evaluation all measure safe action under constraints, not just answer quality.

Tension: Internal control beats internal monitoring. IHDec, ADAPT, and activation steering show targeted inference-time gains, while pre-action probes report negative results for reliably detecting misaligned actions.

Bet: System surfaces will dominate failures. Model hubs, web agents, prompt injection, model merging, and skill supply chains all expose attack paths beyond the base model output.

Papers Worth Your Reading Time

Ranked for research usefulness: novelty, method pattern, evidence quality, and skepticism value.

Budgeted Act-or-Defer Multi-Agent LLM Deliberation with Local Reliability Bounds

Useful if you need deployable autonomy thresholds with explicit human escalation and measurable wrong-action budgets.

Why now: Agent deployments increasingly need auditable abstention policies rather than higher average accuracy alone.
Skepticism: The reliability bounds are conditional on assumptions that may be hard to validate in messy real settings.


Can LLM-as-a-Judge Reliably Verify Rubrics in Agentic Scenarios?

A strong companion paper because it tests whether the judges used to score agent trajectories are trustworthy enough for deployment.

Why now: LLM judges are now embedded in rewards, filtering, and safety audits for long-horizon agents.
Skepticism: It covers only two domains and binary rubric labels, so generality is still limited.


IHDec: Divergence-Steered Contrastive Decoding for Securing Multi-Turn Instruction Hierarchies

Worth opening for a practical inference-time defense against multi-turn role-conflict and prompt-injection failures.

Why now: Instruction-hierarchy failures are becoming central as agents operate over longer, more adversarial conversations.
Skepticism: It needs multiple counterfactual forward passes and logit access, which limits cheap or API-only deployment.



Run stats

Candidates: 1416
Selected: 30
Deepread completed: 30
Window (UTC): 2026-07-03T00:00:00Z → 2026-07-04T00:00:00Z (weekend_backlog_sat, expanded=0)


arXiv ID	Title / Links	Categories	Score	Why	Tags
`2606.29685`	CAREBench: A Child-Safety Risk Benchmark for Language Models PDF	cs.LG	95	Child-safety benchmark for upstream LM risks; highly relevant safety eval with concrete categories.	safety, benchmark, evaluation, child-safety, risk-assessment
`2606.30449`	Internal-State Probes Read the Situation, Not the Action: Three Negative Results for Pre-Action Misalignment Monitoring PDF	cs.LG	94	Important negative result on internal-monitoring for agent misalignment; directly safety-relevant.	ai-safety, monitoring, interpretability, agents, negative-results
`2606.29920`	Can LLM-as-a-Judge Reliably Verify Rubrics in Agentic Scenarios? PDF	cs.CL	94	Benchmarking LLM-judge reliability for agentic rubric verification is highly relevant to safe evals.	evaluation, llm-as-judge, agents, benchmark, reliability
`2606.30899`	Curvature-Guided Module Localization for Low-Rank Detoxification of Backdoored Large Language Models PDF	cs.CR, cs.AI	93	Targets LLM backdoors with mechanistic localization and low-rank repair; strong security relevance.	llm-security, backdoor, detoxification, mechanistic-interpretability, model-repair
`2607.02329`	Grounded autonomous research: a fault-tolerant LLM pipeline from corpus to manuscript in frontier computational physics PDF	cs.AI, cond-mat.mtrl-sci, physics.comp-ph	93	Grounded autonomous research pipeline tackles hallucination/calibration in agentic science workflows.	agents, llm, grounding, hallucination, scientific-ai, evaluation
`2606.29602`	An Empirical Evaluation of Prompt Injection Vulnerabilities in Large Language Models Across Multilingual and Obfuscated Attack Scenarios PDF	cs.CR	92	Broad empirical study of prompt injection across models, languages, and obfuscation scenarios.	LLM-security, prompt-injection, multilingual, adversarial-evaluation, benchmarking
`2606.29654`	Budgeted Act-or-Defer Multi-Agent LLM Deliberation with Local Reliability Bounds PDF	cs.AI, cs.MA	92	Act-or-defer reliability bounds for multi-agent LLM deliberation with human escalation.	agents, reliability, calibration, human-in-the-loop, multi-agent
`2606.30306`	Always-On Agents: A Survey of Persistent Memory, State, and Governance in LLM Agents PDF	cs.MA, cs.AI	92	Comprehensive survey of persistent-state LLM agents with governance, audit, rollback, and authority axes.	agents, memory, governance, survey, safety
`2606.29649`	Resolution Thresholds in VLM Detection of Harmful ASCII Art Across Construction Modes and Languages PDF	cs.CL	92	Directly probes VLM moderation failure on ASCII-art jailbreaks; strong safety relevance.	VLM, jailbreak, content-moderation, robustness, evaluation
`2606.29171`	Symbolic Mechanistic Data Attribution: Tracing Training Influence to Learned Behavioral Policies PDF	cs.LG, cs.AI, cs.CL	92	Mechanistic data attribution for refusal behavior; strong alignment interpretability angle.	alignment, interpretability, data-attribution, refusal, SAE, LLM
`2606.30119`	On the Internet, Nobody Knows You're an LLM Bot: Unmasking Web Agents with Multi-Layer Fingerprinting PDF	cs.CR	92	Directly targets detection of LLM web agents; strong agent security relevance.	agent-safety, web-agents, bot-detection, security, fingerprinting
`2607.01595`	Safe and Adaptive Cloud Healing: Verifying LLM-Generated Recovery Plans with a Neural-Symbolic World Model PDF	cs.AI, cs.CL	91	Verifies LLM-generated recovery plans with neuro-symbolic world model; strong agent safety angle.	agent-safety, verification, neuro-symbolic, planning, reliability, cloud
`2606.30256`	EMPATH: A Multilingual Auditor-Judge Benchmark for Safety Evaluation of Emotional-Support Chatbots PDF	cs.AI, cs.CY	91	Multilingual multi-turn safety benchmark for emotional-support chatbots with auditor-judge setup.	safety, benchmark, chatbots, multilingual, evaluation
`2607.00700`	LLVM-Bench: Benchmarking and Advancing Large Language Models for LLVM Compiler Issue Resolution PDF	cs.SE, cs.AI, cs.PL	91	Strong LLM agent benchmark/framework for real LLVM issue resolution with validated tasks and eval gym.	llm, agents, benchmark, code, software-engineering, evaluation
`2606.29315`	Hierarchical Experimentalist Agents PDF	cs.AI, cs.LG	91	Agent learns via active experimentation and reusable skills; strong agentic capability relevance.	agents, active-learning, self-improvement, long-horizon, experimentation
`2606.30573`	SWE-INTERACT: Reimagining SWE Benchmarks as User-Driven Long-Horizon Coding Sessions PDF	cs.LG	91	Interactive long-horizon coding-agent benchmark with evolving requirements; highly reusable eval.	agents, evaluation, coding-agents, benchmark, interactive, SWE
`2607.01136`	Skills Are Not Islands: Measuring Dependency and Risk in Agent Skill Supply Chains PDF	cs.SE, cs.AI	90	Introduces agent skill supply-chain risk framing plus dependency analysis benchmark/tooling.	agents, supply-chain-security, provenance, dependencies, benchmark
`2607.02201`	The Eticas AI Risk Taxonomy: Open Infrastructure for Operationalizing AI Audits PDF	cs.CY, cs.AI	90	Operationalizes AI audits with concrete risk testing; strong governance and evaluation relevance.	ai-auditing, risk-taxonomy, evaluation, governance, privacy
`2606.30518`	Regime-Aware Peer Specialization for Robust RAG under Heterogeneous Knowledge Conflicts PDF	cs.CL	90	Targets RAG failures under conflicting knowledge, including adversarial context, with regime-aware training.	RAG, grounding, adversarial, reliability, knowledge-conflict
`2606.30479`	COHORT: Collaborative Orchestration for Hardening via Offensive Replay on Emulated Topologies PDF	cs.NI, cs.AI, cs.CR, cs.MA	90	Automates network hardening via multi-agent LLMs and offensive replay on realistic emulated topologies.	agents, cybersecurity, defense, multi-agent, evaluation
`2606.31054`	ADAPT: Attention Dynamics Alignment with Preference Tuning for Faithful MLLMs PDF	cs.CV, cs.AI, cs.CL, cs.MM	90	Targets MLLM hallucination via cross-attention dynamics with preference tuning; strong reliability relevance.	MLLM, hallucination, faithfulness, attention, preference-tuning
`2606.29960`	IHDec: Divergence-Steered Contrastive Decoding for Securing Multi-Turn Instruction Hierarchies PDF	cs.CL	89	Training-free defense for multi-turn instruction hierarchy failures, central to agent robustness.	LLM-safety, instruction-hierarchy, contrastive-decoding, multi-turn, robustness
`2606.30373`	Your Space is My Zone: Demystifying the Security Risks of AI-Powered Applications on Pre-Trained Model Hubs PDF	cs.CR	89	Systematic security analysis of AI-app hubs exposes real deployment attack surfaces.	security, ai-apps, model-hubs, deployment, owasp
`2607.00436`	PHREEQC-MCQ-200: A Diagnostic Benchmark for Tool-Augmented Scientific Simulator Agents PDF	cs.AI	89	Useful benchmark for tool-augmented scientific agents; shows tool access can both help and hurt.	benchmark, agents, tool-use, evaluation, scientific-llms, reliability
`2606.29604`	Mechanistically Eliciting Latent Behaviors in Language Models PDF	cs.LG, cs.AI	89	Unsupervised method to elicit latent LLM behaviors; useful for risk discovery and interpretability.	interpretability, llms, behavior-elicitation, safety-evaluation, lora
`2606.30360`	On the Vulnerability of Parameter-Level Defenses to Model Merging PDF	cs.LG, cs.CV	89	Shows model-merging defenses can be bypassed; concrete attack on AI model protection.	security, model-merging, attack, defense-evasion, weights
`2606.29824`	Neural Procedural Memory: Empowering LLM Agents with Implicit Activation Steering PDF	cs.CL, cs.AI	89	Agent memory via activation steering is novel, reusable, and directly relevant to LLM agents.	llm-agents, memory, activation-steering, reliability
`2607.01751`	MedStreamBench: A Time-Aware Benchmark for Streaming and Proactive Medical Video Understanding PDF	cs.CV, cs.AI	89	Time-aware benchmark for when medical video models should answer, defer, or proactively alert.	benchmark, evaluation, multimodal, medical-ai, streaming, reliability
`2606.29445`	Bridging VideoQA and Video-Guided Agentic Tasks via Generalized Keyframe Extraction PDF	cs.CV, cs.AI	89	Benchmark for video-guided GUI agents; evaluates tutorial-to-action transfer in agentic settings.	agents, benchmark, multimodal, GUI-agents, evaluation
`2606.30182`	MirrorCode: AI can rebuild entire programs from behavior alone PDF	cs.AI	89	Long-horizon coding benchmark for rebuilding whole programs from behavior alone.	agents, coding, benchmark, software-engineering, evaluation, autonomy

AI Paper Insight Brief

2026-07-06

0) Executive takeaways (read this first)

- The strongest thread today is a shift from average-case benchmark scores toward operational guarantees and failure localization: papers focus on wrong-action budgets, instruction-hierarchy preservation, persistent-state governance, and rubric verification under long contexts.
- Inference-time control is getting more practical and targeted: IHDec enforces role hierarchy during multi-turn decoding, ADAPT steers multimodal cross-attention when grounding degrades, and NPM/CPE use internal activations or low-rank perturbations to recover latent skills or behaviors without full retraining.
- Security work is increasingly about system surfaces, not just model outputs: model hubs, web agents, skill registries, prompt injection, ASCII-art moderation bypasses, and model-merging defenses all show that deployment infrastructure and composition layers are major attack surfaces.
- Tool use helps, but often non-monotonically: simulator access, interactive coding, and long-horizon SWE settings improve aggregate performance while also causing regressions on previously solved items, making retention and trajectory-level diagnostics more important than headline accuracy.
- Several papers argue that judge reliability is now a first-class bottleneck: rubric verification in agentic settings, emotional-support auditing, and child-safety evaluation all show that uncalibrated judges can flatten meaningful differences or miss subtle harms.
- For frontier safety work, the actionable pattern is clear: build systems that can defer, audit, replay, localize, and rollback, rather than assuming a single aligned model or benchmark score is sufficient.

2) Key themes (clusters)

Theme: Inference-time control and mechanistic steering

Why it matters: A notable share of today’s work tries to improve behavior without expensive retraining, using decoding controls, activation steering, or localized weight perturbations. This is attractive for safety because it can be deployed faster, audited more directly, and targeted to specific failure modes.
Representative papers:
Common approach:
- Use internal signals as control surfaces: role-level JSD influence, cross-attention anchors, residual-stream steering vectors, or rank-1 LoRA perturbations.
- Intervene sparsely or conditionally rather than globally, e.g. only when hierarchy violations or attention drift are detected.
- Favor training-free or low-cost adaptation loops that work on frozen models or with small adapters.
- Evaluate on behaviorally meaningful tasks: jailbreaks, hallucination, sandbagging, procedural execution, and multi-turn hierarchy conflicts.
Open questions / failure modes:
- Most methods require internal access to logits, activations, or attention, limiting applicability to closed APIs.
- Several gains are scenario-specific; cross-domain generalization remains under-tested.
- Steering can move behavior for the wrong reasons, as shown by negative results on pre-action monitoring and specificity controls.
- Inference overhead is real for counterfactual decoding and anchor-building methods.
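
To make the "role-level JSD influence" idea in the list above concrete, here is a minimal sketch of how a Jensen-Shannon divergence between next-token distributions with and without a given role's turns could be computed. It is not IHDec's actual algorithm: the toy distributions, the variable names, and the `hierarchy_at_risk` trigger are all assumptions made for illustration.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence: symmetric and bounded above by ln 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Toy next-token distributions over a 3-token vocabulary (made-up numbers).
full_context = [0.70, 0.20, 0.10]      # all roles present
without_system = [0.68, 0.21, 0.11]    # system turns ablated
without_user = [0.10, 0.20, 0.70]      # latest user turn ablated

# Influence of a role = how far the distribution moves when it is removed.
system_influence = jsd(full_context, without_system)
user_influence = jsd(full_context, without_user)

# Hypothetical trigger: flag the step when the lower-priority role
# influences the next token more than the system role does.
hierarchy_at_risk = user_influence > system_influence
```

Each ablated distribution costs an extra forward pass in a real model, which is where the inference overhead noted above comes from.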

Theme: Safety evaluation is moving from outputs to operating conditions

Why it matters: The most useful evaluations today are less about “can the model answer?” and more about “can the system act safely under budget, timing, hierarchy, and long-context constraints?” This is closer to deployment reality.
Representative papers:
Common approach:
- Replace single scalar accuracy with richer metrics: wrong-action budget usage, acted-on accuracy, responsiveness, stability, rubric-level balanced accuracy, or failure-rate by risk type.
- Treat judges as instruments that need calibration, not as ground truth.
- Evaluate full trajectories or deployed-system behavior rather than isolated single-turn outputs.
- Use explicit decomposition of uncertainty sources: calibration error, representation gap, prompt sensitivity, or judge leniency.
Open questions / failure modes:
- Guarantees are often conditional on assumptions about local smoothness, state compression, or judge behavior.
- Benchmarks remain narrow in modality, language, or domain despite better methodology.
- Long-context and multi-turn settings expose judge brittleness, especially for dispersed evidence.
- Human-grounded validation is still limited in several high-stakes domains.
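
As a small illustration of why rubric-level balanced accuracy is more informative than plain accuracy when judges are lenient, the sketch below compares the two on made-up binary rubric labels. The data and the function name are hypothetical; none of the papers' actual metrics code is reproduced here.

```python
def balanced_accuracy(human, judge):
    """Mean of true-positive rate and true-negative rate for binary labels."""
    pos = sum(1 for h in human if h)
    neg = len(human) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative label")
    tp = sum(1 for h, j in zip(human, judge) if h and j)
    tn = sum(1 for h, j in zip(human, judge) if not h and not j)
    return 0.5 * (tp / pos + tn / neg)

def accuracy(human, judge):
    """Fraction of rubric items where the judge matches the human label."""
    return sum(1 for h, j in zip(human, judge) if h == j) / len(human)

# Made-up example: 8 rubric items satisfied, 2 not satisfied.
human_labels = [True] * 8 + [False] * 2
# A maximally lenient judge marks every rubric item as satisfied.
lenient_judge = [True] * 10

plain = accuracy(human_labels, lenient_judge)              # 0.8
balanced = balanced_accuracy(human_labels, lenient_judge)  # 0.5
```

The lenient judge looks acceptable on plain accuracy but is at chance on balanced accuracy, which is the kind of flattening a calibration suite is meant to expose.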

Theme: Security is shifting to ecosystem and composition attacks

Why it matters: The attack surface is no longer just the base model. Today’s strongest security papers target model hubs, web agents, skill registries, model merging, multilingual jailbreaks, and moderation bypasses—places where composition and infrastructure create exploitable gaps.
Representative papers:
Common approach:
- Analyze full stacks: code, containers, logs, browser fingerprints, TLS, dependency graphs, or transformed checkpoints.
- Show that hidden transitive structure matters: inherited package exposure, anchor-dominated protected weights, or cross-layer fingerprints.
- Pair large-scale measurement with concrete exploit or recovery procedures.
- Emphasize governance artifacts such as SBOM-like manifests, audit trails, and platform-side mitigations.
Open questions / failure modes:
- Many findings are snapshot-dependent because platforms, models, and defenses evolve quickly.
- Detection and mitigation often require privileged access or platform cooperation.
- Some attacks exploit fundamental geometry or ecosystem incentives, not just implementation bugs.
- Precision/recall tradeoffs remain significant in large-scale scanning pipelines.
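
The point about hidden transitive structure can be sketched with a toy dependency graph: a skill may have no risky direct dependency and still inherit one several hops away. The skill names, package names, and graph below are invented for illustration and do not come from any of the papers.

```python
def transitive_deps(graph, root):
    """Return every package reachable from `root` (cycle-safe DFS)."""
    seen = set()
    stack = list(graph.get(root, []))
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(graph.get(node, []))
    return seen

# Hypothetical skill registry: skill or package -> direct dependencies.
graph = {
    "pdf-summarizer": ["http-fetch"],
    "http-fetch": ["net-utils"],
    "net-utils": ["legacy-crypto"],
    "calendar": ["date-utils"],
}
risky = {"legacy-crypto"}  # assumed to be flagged by some external audit

direct_hit = bool(risky & set(graph["pdf-summarizer"]))
transitive_hit = bool(risky & transitive_deps(graph, "pdf-summarizer"))
```

Here `direct_hit` is false while `transitive_hit` is true, which is why SBOM-like manifests need to record the full closure rather than only first-level dependencies.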

Theme: Tool-augmented agents help, but interfaces and workflows dominate outcomes

Why it matters: Multiple papers show that giving agents tools, simulators, or interactive users can unlock large gains—but also introduces new failure modes. The bottleneck is often interface design, retrieval structure, or workflow decomposition rather than raw model capability.
Representative papers:
Common approach:
- Evaluate end-to-end loops: propose, test, inspect, revise, and commit.
- Add structured external memory or skill banks to amortize exploration across episodes.
- Measure not just solve rate but retention, gained/lost items, step budgets, token cost, and failure trajectories.
- Use realistic constraints: hidden tests, simulator APIs, user feedback, or long-horizon execution budgets.
Open questions / failure modes:
- Tool access can reduce retention on items models previously solved without tools.
- Mid-tier models often struggle more with navigation overhead than with underlying reasoning.
- Large gains can require substantial inference budgets, making evaluation expensive.
- Benchmarks still cover narrow environments relative to real-world deployment diversity.
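
The kept/gained/lost accounting mentioned above is simple to compute once per-item results exist for both the tool-free and tool-augmented runs. The following is a minimal sketch with made-up item IDs; the function name and report fields are assumptions, not a format defined by any of the benchmarks.

```python
def retention_report(baseline, augmented):
    """Compare per-item solve results (item_id -> bool) between two runs."""
    ids = baseline.keys() & augmented.keys()
    kept = sorted(i for i in ids if baseline[i] and augmented[i])
    gained = sorted(i for i in ids if not baseline[i] and augmented[i])
    lost = sorted(i for i in ids if baseline[i] and not augmented[i])
    previously_solved = len(kept) + len(lost)
    return {
        "kept": kept,
        "gained": gained,
        "lost": lost,
        "net": len(gained) - len(lost),
        # Share of previously solved items that are still solved.
        "retention": len(kept) / previously_solved if previously_solved else None,
    }

# Made-up results: the tool run gains two items but loses one.
no_tool = {"q1": True, "q2": True, "q3": False, "q4": False}
with_tool = {"q1": True, "q2": False, "q3": True, "q4": True}

report = retention_report(no_tool, with_tool)
```

In this toy case the net gain is +1, yet retention is only 50%, a regression that mean accuracy alone would hide.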

Theme: Better diagnostics for hidden behavior, attribution, and monitoring

Why it matters: Several papers push beyond output-level evaluation to ask what internal policy, training data, or latent mode is actually driving behavior. This is useful for alignment, but today’s results also show how easy it is to overclaim from internal probes.
Representative papers:
Common approach:
- Define an intermediate object of analysis: symbolic policy over SAE features, module-level trigger pathways, or pre-action probe states.
- Use causal or quasi-causal decompositions: activation patching, finite-difference influence, Fisher/K-FAC curvature, or threshold-crossing tests.
- Stress-test whether internal signals generalize across scenarios and remain specific to the claimed behavior.
- Prefer localized interventions over full-model retraining when repairing or auditing behavior.
Open questions / failure modes:
- Internal readouts can correlate with situation cues rather than action precursors.
- Symbolic or linear surrogates leave substantial variance unexplained.
- Causal validation via retraining or removal remains rare.
- Controlled trigger or single-model settings limit external validity.

3) Technical synthesis

- A recurring design pattern is conditional intervention: act only when a confidence bound, attention score, or hierarchy-violation signal crosses a threshold.
- Several papers use same-scale auxiliary models or peers instead of larger teachers: HExA’s evolver, RAPS-DA’s regime specialists, and judge ensembles all avoid assuming a stronger oracle.
- Counterfactual comparison is central across methods: ablated-role decoding in IHDec, clean-vs-trigger activation patching for backdoor repair, full-vs-ablated prompt influence, and no-tool vs tool-augmented retention analysis.
- Many evaluations now separate aggregate gains from item-level regressions, especially in tool use and interactive coding; “gained/lost/kept” is becoming more informative than mean accuracy.
- There is a strong move toward structured external artifacts: skill banks, SkillBOMs, persistent-state ledgers, visible/hidden test harnesses, and event-stream audit protocols.
- Calibration is no longer just probability calibration; it includes judge calibration, local bias envelopes, severity bands for audits, and threshold selection for sparse interventions.
- Multiple papers expose a geometry problem: anchor-dominant protected weights in model merging, local neighborhood bias in act-or-defer bounds, and layer-specific separability or non-separability in probes and steering.
- Long-context agent evaluation increasingly relies on evidence localization rather than holistic scoring: rubric verification, keyframe search, and TOC-based simulator output access all try to reduce search burden.
- Security papers repeatedly show that transitive structure dominates direct signals: transitive package exposure in skill supply chains, inherited platform risk in AI-app hubs, and cross-layer fingerprints for web agents.
- A notable methodological split is emerging between papers that use internal signals for control and papers that use them for monitoring; today’s negative results suggest control may currently be easier than reliable pre-action detection.
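
The conditional-intervention pattern can be sketched as an act-or-defer rule that acts only when a lower confidence bound on local correctness fits inside a declared wrong-action budget. The version below uses a generic Hoeffding-style bound as a stand-in. It is not the bound from the act-or-defer paper, and the function names, the `delta` default, and the example counts are assumptions.

```python
import math

def hoeffding_lower_bound(correct, total, delta=0.05):
    """Lower bound on true accuracy holding with probability >= 1 - delta."""
    if total == 0:
        return 0.0  # no local evidence: assume nothing
    slack = math.sqrt(math.log(1.0 / delta) / (2.0 * total))
    return max(0.0, correct / total - slack)

def act_or_defer(correct, total, wrong_action_budget, delta=0.05):
    """Act only if the bounded error rate fits inside the declared budget."""
    worst_case_error = 1.0 - hoeffding_lower_bound(correct, total, delta)
    return "act" if worst_case_error <= wrong_action_budget else "defer"

# Made-up local calibration counts for two situations, budget of 10%.
well_supported = act_or_defer(correct=990, total=1000, wrong_action_budget=0.10)
thin_evidence = act_or_defer(correct=9, total=10, wrong_action_budget=0.10)
```

With the same 10% budget, the well-supported case acts while the thin-evidence case defers to a human even though its observed accuracy is 90%, because the bound is too loose with only ten samples.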

4) Top 5 papers (with “why now”)

Your Space is My Zone: Demystifying the Security Risks of AI-Powered Applications on Pre-Trained Model Hubs
- Analyzes 972,546 public AI-apps across major model hubs, making this one of the broadest ecosystem security measurements in the batch.
- Finds both platform-design flaws and app-level issues: Ghost Token, Identifier Reuse, credential leakage, vulnerable SDKs, backdoors, and cryptojacking.
- Useful now because model hubs are becoming default deployment surfaces, and this paper shows the risk is not hypothetical but already measurable at scale.
- Skepticism / limitation: the scanner is a screening tool with nontrivial precision limits, and the study focuses mainly on public containerized apps.
Budgeted Act-or-Defer Multi-Agent LLM Deliberation with Local Reliability Bounds
- Converts a declared wrong-action budget into a deployable stopping rule for multi-agent deliberation.
- Empirically uses only ~9–12% of the declared budget on activated datasets while reaching up to 84% automation and 96% acted-on accuracy.
- Useful now because many agent deployments need auditable autonomy thresholds, not just better average accuracy.
- Skepticism / limitation: guarantees depend on local bias-envelope and representation-gap assumptions that are diagnosable but not automatically verified.
IHDec: Divergence-Steered Contrastive Decoding for Securing Multi-Turn Instruction Hierarchies
- Targets a concrete deployment failure—lower-priority turns overriding system instructions in multi-turn settings.
- Shows large gains in conflict scenarios while preserving benign utility, with reported scaling benefits on larger Qwen models.
- Useful now because prompt injection and role confusion are increasingly multi-turn and agentic, where training-only defenses lag.
- Skepticism / limitation: requires multiple counterfactual forward passes and logit access, so deployment cost and API compatibility are constraints.
Hierarchical Experimentalist Agents
- Demonstrates a training-free actor–evolver–retriever loop that turns experimental trajectories into reusable skills.
- Delivers large gains on Interphyre, including strong zero-shot cross-level transfer and better low-data adaptation than matched-budget GRPO early on.
- Useful now because it offers a practical path to sample-efficient agent improvement even for closed models.
- Skepticism / limitation: evidence is confined to a 2D physics domain, and the asymptotic ceiling versus gradient RL remains unclear.
Can LLM-as-a-Judge Reliably Verify Rubrics in Agentic Scenarios?
- Introduces a 2,458-instance benchmark for rubric verification over long, agentic outputs rather than short-form judging.
- Shows frontier judges can be strong but still noisy, especially in coding trajectories with long contexts and dispersed evidence.
- Useful now because rubric verification is increasingly used for rewards, filtering, and monitoring in agent pipelines.
- Skepticism / limitation: benchmark scope is limited to two domains and binary rubric labels.

5) Practical next steps

- Add retention accounting to agent evaluations: for any tool-augmented or interactive setup, track kept/gained/lost items rather than only net accuracy.
- Pilot act-or-defer policies for high-risk agent actions using local confidence bounds or calibrated abstention, especially where human review is available.
- Test multi-turn hierarchy defenses under real prompt-injection workloads; if logit access exists, benchmark inference-time controls like role-aware contrastive decoding.
- Build judge calibration suites before relying on LLM judges for reward modeling or safety audits; include strict rubrics, cross-family judges, and long-context stress tests.
- Treat persistent memory and skills as governed state, not just retrieval context: add provenance, deletion, rollback, and authority metadata to memory/skill stores.
- For multimodal systems, instrument internal grounding signals such as cross-attention drift and compare sparse intervention against output-only hallucination mitigations.
- Run ecosystem-level security reviews on deployment surfaces: model hubs, runtime logs, embedded apps, skill registries, and browser/TLS fingerprints for agents.
- For interpretability-based safety claims, require scenario-generalization and specificity controls before promoting probes into production monitors.
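
For the recommendation to treat memory as governed state, one possible shape is an append-only log whose entries carry provenance and authority, with deletion recorded as a tombstone and rollback as truncation to an earlier version. This is a minimal sketch under those assumptions; the class, field names, and authority labels are hypothetical and not taken from the survey.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MemoryEntry:
    key: str
    value: str
    source: str      # provenance: where the content came from
    authority: str   # who was allowed to write it, e.g. "system" or "tool"
    deleted: bool = False

class GovernedMemory:
    def __init__(self):
        self._log = []  # append-only; version n = first n entries

    def write(self, key, value, source, authority):
        self._log.append(MemoryEntry(key, value, source, authority))
        return len(self._log)  # version number after this write

    def delete(self, key, source, authority):
        # Deletion is a tombstone, so the audit trail keeps the request.
        self._log.append(MemoryEntry(key, "", source, authority, deleted=True))
        return len(self._log)

    def read(self, key):
        for entry in reversed(self._log):
            if entry.key == key:
                return None if entry.deleted else entry
        return None

    def rollback(self, version):
        # Discard everything written after `version`.
        del self._log[version:]

mem = GovernedMemory()
v1 = mem.write("deploy_target", "staging", source="user:turn-3", authority="user")
mem.write("deploy_target", "production", source="web:page-17", authority="tool")
poisoned = mem.read("deploy_target").value
mem.rollback(v1)
restored = mem.read("deploy_target").value
```

Truncating on rollback keeps the sketch short but discards the rolled-back entries; a real store would more likely archive them for audit.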

Generated from per-paper analyses; no external browsing.

Safety moves to operations.
