Takeaways

**Agent memory is shifting from static retrieval to adaptive, governed, and budgeted systems.** Multiple papers converge on step-wise retrieval, active reconstruction, write-time retention, and explicit memory governance rather than “retrieve once at episode start.”
**Safety work is moving from generic refusal to system-level control surfaces.** The strongest ideas today are not just better classifiers, but typed skill graphs, autonomy gating, consequence-aware compute routing, contradiction-safe memory writes, and two-stage memory-use safeguards.
**Benchmarks are getting closer to deployment reality.** New evaluations emphasize underspecified user intent, multi-round refinement, adaptive defense, first-person normative action generation, memory-use boundaries, and joint memory+long-document reasoning.

Start with: Not All Errors Are Equal: Consequence-Aware Reasoning Compute Allocation

Why it catches my eye: It reframes inference routing around consequence, offering a reusable deployment method for asymmetric-risk decisions rather than another average-accuracy gain.

Read skeptically for: The evidence is mainly offline and depends on coarse consequence labels rather than live production interventions.

Tags: risk-aware, test-time-compute, reliability, deployment

Themes

Adaptive memory becomes the core agent bottleneck

A large share of today’s papers argue that agent failures come less from raw model capability and more from how experience is stored, updated, and re-used over long horizons. The common move is away from static top-k retrieval toward adaptive, state-aware, or budget-aware memory operations.

Governance and control planes for agent autonomy

A second cluster focuses on making agent behavior governable at runtime: who authorized what, when autonomy should increase, and how to recover when quality drifts. This is especially relevant for enterprise and high-stakes deployments.

Realistic benchmarks are replacing toy one-shot evaluations

Several papers argue that current benchmarks miss the actual failure modes of deployed agents: underspecified requests, iterative repair, adaptive defenders, long documents plus memory, and privacy-sensitive personalization.

Signal: Memory is becoming a control plane. AdaMEM, Graph Memory, EMBER, TOKI, and memory-boundary evaluation all separate what gets stored, exposed, and used.

Tension: More structure helps, but adds fragility. Governed memory, skill graphs, and adaptive workflows improve control, yet latency, judge dependence, and maintenance remain recurring limits.

Bet: Deployment wins will come from routing. Consequence-aware compute, state-grounded retrieval, replay reuse, and prompt optimization suggest smarter allocation may beat brute-force scaling.

Papers Worth Your Reading Time

Ranked for research usefulness: novelty, method pattern, evidence quality, and skepticism value.

Not All Errors Are Equal: Consequence-Aware Reasoning Compute Allocation

Useful if you deploy tiered models: it shows consequence-aware routing can outperform difficulty-based compute allocation.

Why now: Inference budgets increasingly matter in products where some mistakes are far costlier than others.
Skepticism: The setup is offline and may not capture live routing behavior or richer consequence definitions.

Memory is Reconstructed, Not Retrieved: Graph Memory for LLM Agents

A strong companion paper because it argues long-horizon agents need active memory reconstruction, not static top-k retrieval.

Why now: Persistent assistants are hitting the limits of simple RAG-style memory under long tasks and token constraints.
Skepticism: Sequential reconstruction may trade better recall for higher latency and harder memory maintenance.

When Should Memory Stay Silent: Measuring Memory-Use Boundaries in Memory-Augmented Conversational Agents

It gives a deployment-relevant evaluation lens for when agents should avoid using sensitive or unnecessary memory.

Why now: Memory-augmented assistants are moving into privacy-sensitive settings without clear boundaries for acceptable recall.
Skepticism: Boundary judgments can be task- and norm-dependent, limiting direct transfer across products.

Run stats

Candidates: 634
Selected: 30
Deepread completed: 30
Window (UTC): 2026-06-05T00:00:00Z → 2026-06-06T00:00:00Z (weekend_backlog_sat, expanded=0)

arXiv ID	Title / Links	Categories	Score	Why	Tags
`2606.05958`	Steering Vectors are an Adversarial Attack Surface PDF	cs.LG	95	Poisoned steering vectors jailbreak models while preserving benign behavior; strong new LLM attack surface.	llm-safety, jailbreak, activation-steering, data-poisoning, adversarial-ml
`2606.06055`	When Should Memory Stay Silent: Measuring Memory-Use Boundaries in Memory-Augmented Conversational Agents PDF	cs.AI	95	Evaluates when agents should avoid using sensitive memory; strong privacy/safety relevance.	agent-safety, memory, privacy, evaluation, conversational-agents
`2606.05567`	ZERO-APT: A Closed-Loop Adversarial Framework for LLM-Driven Automated Penetration Testing under Intelligent Defense PDF	cs.CR, cs.MA	93	Closed-loop attacker-defender benchmark for LLM pentesting adds realism, consistency, and auditability.	agent-security, red-teaming, penetration-testing, evaluation, llm-agents
`2606.06244`	Steering LLM Viewpoints through Fabricated Evidence Injection PDF	cs.CR	93	Fabricated evidence injection exploits LLM trust in context; directly relevant to RAG and persuasion safety.	llm-safety, rag, context-poisoning, misinformation, adversarial-evaluation
`2606.04402`	Not All Errors Are Equal: Consequence-Aware Reasoning Compute Allocation PDF	cs.AI	92	Allocates reasoning compute by consequence, not just difficulty; strong deployment-safety framing.	reasoning, test-time-compute, risk-aware, reliability, deployment
`2606.05566`	GuardNet: Ensemble Strategies of Shallow Neural Networks for Robust Prompt Injection and Jailbreak Detection PDF	cs.AI, cs.CR	91	Directly targets prompt injection and jailbreak detection with efficient guardrail ensemble design.	prompt-injection, jailbreaks, guardrails, llm-security, detection
`2606.05609`	SlotGCG: Exploiting the Positional Vulnerability in LLMs for Jailbreak Attacks PDF	cs.CR, cs.AI, cs.LG	91	Finds positional jailbreak vulnerability and proposes slot-based attack scoring; useful for red-teaming defenses.	llm-safety, jailbreak, prompt-injection, adversarial-attacks, evaluation
`2606.04321`	The Digital Apprentice: A Framework for Human-Directed Agentic AI Development PDF	cs.AI	91	Human-directed autonomy tiers for safer agent deployment; strong governance framing for agentic AI.	agents, safety, governance, human-in-the-loop, alignment
`2606.05684`	AdaMEM: Test-Time Adaptive Memory for Language Agents PDF	cs.AI	91	Adaptive memory for language agents at test time; strong agent capability relevance.	agents, memory, test-time adaptation, long-horizon, llm
`2606.04465`	SePO: Self-Evolving Prompt Agent for System Prompt Optimization PDF	cs.CL, cs.AI	91	Self-optimizing system prompts for agents; directly relevant to agent behavior and controllability.	agents, prompt-optimization, system-prompts, self-improvement, llm
`2606.05922`	Retrospective Harness Optimization: Improving LLM Agents via Self-Preference over Trajectory Rollouts PDF	cs.AI, cs.CL, cs.LG	91	Self-supervised agent harness optimization from past trajectories; strong agent improvement relevance.	agents, self-improvement, trajectory-optimization, post-training, evaluation
`2606.04781`	AIP: A Graph Representation for Learning and Governing Agent Skills PDF	cs.AI, cs.LG	90	Structured skill graphs for agents target reliability and governance of agent behavior.	agents, agent-skills, governance, reliability, framework
`2606.05646`	Enhancing Software Engineering Through Closed-Loop Memory Optimization PDF	cs.SE, cs.AI	90	Closed-loop memory eval for SE agents with validated downstream impact; strong agent capability relevance.	llm-agents, memory, software-engineering, evaluation
`2606.04391`	Online Skill Learning for Web Agents via State-Grounded Dynamic Retrieval PDF	cs.AI	90	State-grounded skill retrieval for web agents targets realistic long-horizon agent behavior.	agents, web-agents, skill-learning, retrieval, automation
`2606.04806`	NoRA: Evaluating Grounded Reasonableness in Visual First-person Normative Action Reasoning PDF	cs.CV, cs.AI	89	Benchmark for grounded normative action reasoning in first-person settings; strong agent safety relevance.	agent-safety, benchmark, normative-reasoning, multimodal, evaluation
`2606.06388`	Humans' ALMANAC: A Human Collaboration Dataset of Action-Level Mental Model Annotations for Agent Collaboration PDF	cs.AI, cs.CL	89	Action-level mental-model dataset for human-agent collaboration; valuable supervision for safer collaborative agents.	agents, dataset, human-ai-collaboration, mental-models, evaluation
`2606.06462`	Benchmark Everything Everywhere All at Once PDF	cs.AI	89	Autonomous benchmark-building agent; high reuse value for LLM/VLM evaluation.	benchmarking, agents, evaluation, llm, multimodal
`2606.04560`	Rollout-Level Advantage-Prioritized Experience Replay for GRPO PDF	cs.LG, cs.AI	89	Improves GRPO sample efficiency for reasoning LLM post-training with concrete replay design.	llm, reasoning, post-training, grpo, rl
`2606.06036`	Memory is Reconstructed, Not Retrieved: Graph Memory for LLM Agents PDF	cs.AI, cs.IR	89	Graph memory with active reconstruction for LLM agents; promising for long-horizon reasoning.	agents, memory, reasoning, graph-memory, long-context
`2606.05894`	EMBER: Efficient Memory via Budgeted Evidence Retention for Long-Horizon Agents PDF	cs.CL	89	Long-horizon agent memory retention under token budgets; practical and reusable for agent systems.	agents, memory, long-context, retrieval, efficiency
`2606.05670`	Do More Agents Help? Controlled and Protocol-Aligned Evaluation of LLM Agent Workflows PDF	cs.AI	88	Careful protocol-aligned study questions whether multi-agent workflows actually help over single agents.	agents, evaluation, multi-agent, tool-use, benchmarking
`2606.05920`	Asuka-Bench: Benchmarking Code Agents on Underspecified User Intent and Multi-Round Refinement PDF	cs.SE, cs.CL	88	Code-agent benchmark for underspecified intent and multi-round refinement; realistic eval.	code agents, benchmark, evaluation, interactive, web
`2606.05952`	Learning of Robot Safety Policies via Adversarial Synthetic Scenarios PDF	cs.RO, cs.AI	88	Adversarial red-team/blue-team synthetic scenarios for robot safety policy learning; clear safety focus.	robot-safety, red-teaming, adversarial-training, physical-ai
`2606.06087`	LatentSkill: From In-Context Textual Skills to In-Weight Latent Skills for LLM Agents PDF	cs.CL, cs.AI	88	Moves agent skills from prompt text to latent adapters, improving efficiency and modularity.	agents, skills, efficiency, LoRA, modularity
`2606.04703`	Rethinking Continual Experience Internalization for Self-Evolving LLM Agents PDF	cs.CL, cs.LG	87	Studies continual learning failure modes in self-evolving LLM agents and proposes more durable internalization.	llm-agents, continual-learning, reliability, self-improvement, agent-memory
`2606.05799`	CaliDist: Calibrating Large Language Models via Behavioral Robustness to Distraction PDF	cs.LG, cs.CL	87	Calibrates LLM confidence via robustness to distractors; directly targets reliability under misleading context.	llm, calibration, robustness, reliability, uncertainty
`2606.04780`	PersonaTree: Structured Lifecycle Memory for Person Understanding in LLM Agents PDF	cs.CL	87	Structured long-term memory for person understanding in LLM agents with explicit evidence paths.	agents, memory, person-modeling, long-context, reliability
`2606.06240`	TOKI: A Bitemporal Operator Algebra for Contradiction Resolution in LLM-Agent Persistent Memory PDF	cs.DB, cs.AI	87	Formalizes contradiction resolution in LLM-agent memory with isolation/provenance guarantees.	agents, memory, formal methods, reliability, provenance
`2606.04442`	MemoryDocDataSet: A Benchmark for Joint Conversational Memory and Long Document Reasoning PDF	cs.CL, cs.AI	87	Benchmark for joint conversational memory and long-document reasoning; useful for agent evaluation.	benchmark, long-context, memory, reasoning, evaluation
`2606.06058`	MDP-GRPO: Stabilized Group Relative Policy Optimization for Multi-Constraint Instruction Following PDF	cs.LG, cs.AI, cs.CL	87	Stabilizes GRPO for multi-constraint instruction following; relevant post-training advance.	RLHF, GRPO, instruction-following, post-training, alignment

AI Paper Insight Brief

2026-06-08

0) Executive takeaways (read this first)

Agent memory is shifting from static retrieval to adaptive, governed, and budgeted systems. Multiple papers converge on step-wise retrieval, active reconstruction, write-time retention, and explicit memory governance rather than “retrieve once at episode start.”
Safety work is moving from generic refusal to system-level control surfaces. The strongest ideas today are not just better classifiers, but typed skill graphs, autonomy gating, consequence-aware compute routing, contradiction-safe memory writes, and two-stage memory-use safeguards.
Benchmarks are getting closer to deployment reality. New evaluations emphasize underspecified user intent, multi-round refinement, adaptive defense, first-person normative action generation, memory-use boundaries, and joint memory+long-document reasoning.
Several papers expose underappreciated attack surfaces in the stack around the model. Notable examples: positional jailbreak slots, poisoned steering vectors, fabricated-evidence viewpoint steering, and contamination-sensitive guardrail evaluation.
Lightweight structural changes often beat brute-force scaling. Examples include state-grounded skill retrieval for web agents, rollout replay for GRPO, prompt-agent self-evolution, and shallow ensemble guardrails with calibrated thresholds.
A recurring design pattern is “separate write-time from read-time.” This appears in memory retention, contradiction resolution, preference logging, and auditability: systems improve when they explicitly track what gets stored, why, and under what contract.

2) Key themes (clusters)

Theme: Adaptive memory becomes the core agent bottleneck

Why it matters: A large share of today’s papers argue that agent failures come less from raw model capability and more from how experience is stored, updated, and re-used over long horizons. The common move is away from static top-k retrieval toward adaptive, state-aware, or budget-aware memory operations.
Representative papers:
Common approach:
- Replace one-shot retrieval with step-wise or iterative memory access conditioned on current state.
- Distinguish long-term storage from short-term strategy synthesis or active reconstruction.
- Optimize memory using downstream utility signals rather than heuristic storage rules.
- Make memory source-backed and auditable, especially under token budgets.
Open questions / failure modes:
- Retrieval quality can improve while generation still misuses exposed memory.
- Deep reconstruction and iterative retrieval can raise latency and call-count costs.
- Most evaluations remain on benchmarks rather than live deployments.
- Several systems still underuse failure trajectories or lack robust memory maintenance/forgetting.
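
The first point under "Common approach" can be sketched as follows. This is a minimal illustration, not any paper's method: `score` is a toy word-overlap function standing in for a real retriever, and the function names, memory strings, and task are all hypothetical.

```python
def score(query, memory):
    """Toy relevance: count of shared lowercase words (assumed stand-in for a retriever)."""
    return len(set(query.lower().split()) & set(memory.lower().split()))


def retrieve(query, store, k=1):
    """Return the k memories with the highest toy relevance score."""
    ranked = sorted(store, key=lambda m: score(query, m), reverse=True)
    return ranked[:k]


def episode_start_retrieval(task, states, store, k=1):
    """Baseline: retrieve once from the task description and reuse it at every step."""
    fixed = retrieve(task, store, k)
    return [fixed for _ in states]


def stepwise_retrieval(task, states, store, k=1):
    """Re-query at every step, conditioning on the current state as well as the task."""
    return [retrieve(f"{task} {state}", store, k) for state in states]


store = [
    "login page requires two factor code",
    "checkout form rejects expired card",
    "search results load slowly",
]
task = "buy a book"
states = ["login page shown", "checkout form shown"]

static_memories = episode_start_retrieval(task, states, store)
adaptive_memories = stepwise_retrieval(task, states, store)
```

In this toy run the static baseline shows the same memory at both steps, while the step-wise variant surfaces the login note first and the checkout note second. It also makes one retrieval call per step, which is the latency and call-count cost listed under the open questions.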

Theme: Governance and control planes for agent autonomy

Why it matters: A second cluster focuses on making agent behavior governable at runtime: who authorized what, when autonomy should increase, and how to recover when quality drifts. This is especially relevant for enterprise and high-stakes deployments.
Representative papers:
Common approach:
- Introduce typed or explicit control surfaces: autonomy tiers, execution graphs, contradiction operators, editable harnesses.
- Preserve audit trails through preference tuples, provenance rows, or structured CTI-style reports.
- Use runtime diagnostics to trigger recalibration, demotion, or localized repair.
- Treat prompts/skills/workflows as optimizable system artifacts, not fixed glue code.
Open questions / failure modes:
- Many results are from single-corpus or single-model pilots.
- Governance mechanisms can still rely on LLM judges, creating replay or bias risks.
- Some systems remain specification-only, without enforced runtime execution.
- Human oversight can fail through reviewer disengagement or weak approval workflows.
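
A rough sketch of how autonomy tiers, runtime demotion, and an audit trail might fit together. The tier names, thresholds, and the `SkillGate` class are hypothetical and are not taken from any of the papers; the only ideas borrowed from the brief are that promotion needs both an empirical check and human authorization, and that drift can trigger demotion.

```python
from dataclasses import dataclass, field

# Hypothetical tier names, ordered from least to most autonomy.
TIERS = ["suggest", "act_with_approval", "act_autonomously"]


@dataclass
class SkillGate:
    skill: str
    tier: int = 0
    audit: list = field(default_factory=list)

    def review(self, pass_rate, human_approved,
               promote_at=0.95, demote_below=0.80):
        """Move at most one tier per review and log every decision."""
        before = self.tier
        if pass_rate < demote_below and self.tier > 0:
            # Quality drifted: demote regardless of approval.
            self.tier -= 1
        elif (pass_rate >= promote_at and human_approved
              and self.tier < len(TIERS) - 1):
            # Promotion needs the empirical check AND human authorization.
            self.tier += 1
        self.audit.append({
            "skill": self.skill,
            "pass_rate": pass_rate,
            "human_approved": human_approved,
            "from": TIERS[before],
            "to": TIERS[self.tier],
        })
        return TIERS[self.tier]


gate = SkillGate("issue_refund")
first = gate.review(pass_rate=0.97, human_approved=False)   # stays at "suggest"
second = gate.review(pass_rate=0.97, human_approved=True)   # promoted one tier
third = gate.review(pass_rate=0.70, human_approved=True)    # demoted on drift
```

Logging the unchanged decisions as well as the transitions keeps the audit trail complete, at the price of a larger log.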

Theme: Realistic benchmarks are replacing toy one-shot evaluations

Why it matters: Several papers argue that current benchmarks miss the actual failure modes of deployed agents: underspecified requests, iterative repair, adaptive defenders, long documents plus memory, and privacy-sensitive personalization.
Representative papers:
Common approach:
- Build tasks that require multi-stage reasoning, not just final-answer correctness.
- Evaluate behavior under feedback loops, hidden requirements, or adaptive environments.
- Use structured decompositions of failure: action alignment vs grounding, retrieval exposure vs integration, doc-only vs hybrid.
- Validate automated judges with human agreement checks where possible.
Open questions / failure modes:
- Many datasets are synthetic or partially LLM-generated, limiting realism.
- Evaluator dependence remains high; some benchmarks use the same model family for generation and judging.
- Coverage is often English-only and domain-limited.
- Strong benchmark gains may still reflect protocol or evaluator choices, not general capability.
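
One way to run the human agreement check mentioned above is Cohen's kappa, which corrects raw agreement for chance. The brief does not say which statistic the papers use, so treat the choice of kappa and the labels below as assumptions.

```python
from collections import Counter


def cohens_kappa(judge_labels, human_labels):
    """Chance-corrected agreement between two label sequences of equal length."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(judge_labels)
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    judge_counts = Counter(judge_labels)
    human_counts = Counter(human_labels)
    labels = set(judge_counts) | set(human_counts)
    # Agreement expected if both raters labeled independently at their own base rates.
    expected = sum(judge_counts[l] * human_counts[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical verdicts on four benchmark items.
judge = ["pass", "pass", "fail", "fail"]
human = ["pass", "fail", "fail", "fail"]
kappa = cohens_kappa(judge, human)
```

Here raw agreement is 0.75 but kappa is 0.5, because half of the agreement would be expected by chance given each rater's base rates.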

Theme: New attack surfaces beyond classic prompt jailbreaks

Why it matters: Security papers today broaden the threat model from prompt suffixes to the full agent stack: retrieval context, steering artifacts, insertion position, and benchmark contamination. This suggests defenses need to cover infrastructure and artifacts, not just prompts.
Representative papers:
Common approach:
- Identify a previously implicit assumption—suffix-only attacks, trusted steering bundles, benign retrieved evidence, uncontaminated benchmarks.
- Show that small structural perturbations can materially change harmful behavior.
- Evaluate under blind or defense-aware settings rather than only in-distribution dev sets.
- Pair attacks with simple mitigations such as orthogonalization, threshold calibration, or safeguard policies.
Open questions / failure modes:
- Some attacks require white-box access or open weights.
- Several defenses remain partial; stronger models or filters still outperform lightweight guards in some settings.
- Blind evaluation sets are often small, and contamination remains hard to rule out.
- Real-world transfer to closed products and production pipelines is still underexplored.
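
Threshold calibration, one of the simple mitigations listed above, can be sketched for an ensemble guard as follows. This is an illustrative recipe rather than any paper's procedure: the mean combiner, the false-positive budget, and all scores are assumed.

```python
def ensemble_score(member_scores):
    """Combine member scores with a simple mean (an assumed combiner)."""
    return sum(member_scores) / len(member_scores)


def calibrate_threshold(benign_scores, max_fpr):
    """Pick a threshold so that at most max_fpr of benign validation scores exceed it."""
    if not benign_scores:
        raise ValueError("need at least one benign validation score")
    ranked = sorted(benign_scores)
    n = len(ranked)
    # Number of benign examples we tolerate flagging; always keep at least one unflagged.
    allowed = min(int(max_fpr * n), n - 1)
    return ranked[n - allowed - 1]


def is_flagged(member_scores, threshold):
    """Flag an input when the ensemble score is strictly above the threshold."""
    return ensemble_score(member_scores) > threshold


# Hypothetical ensemble scores on ten known-benign validation prompts.
benign_validation = [0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.40, 0.45, 0.60, 0.90]
threshold = calibrate_threshold(benign_validation, max_fpr=0.2)
false_positive_rate = sum(s > threshold for s in benign_validation) / len(benign_validation)
```

Calibrating on benign data only fixes the false-positive side. Detection rate still has to be measured separately on held-out attacks, ideally a blind set, given the contamination concerns above.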

Theme: Efficiency through smarter allocation, replay, and modularity

Why it matters: Another strong thread is improving capability without retraining giant models end-to-end. Papers show gains from better compute routing, replay reuse, prompt optimizer evolution, and modular skill compilation.
Representative papers:
Common approach:
- Reallocate scarce resources based on marginal utility, not average difficulty.
- Reuse expensive trajectories or prompts via replay, evolution, or compilation.
- Move from plaintext prompt artifacts to modular latent or executable representations.
- Validate with ablations that isolate the contribution of each mechanism.
Open questions / failure modes:
- Many studies are still offline or benchmark-bound.
- Hyperparameter sensitivity remains nontrivial in RL and prompt evolution setups.
- Transfer across model families and domains is often untested.
- Some gains may saturate quickly with deeper search or larger budgets.
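
The replay-reuse idea can be illustrated with a generic buffer that prioritizes rollouts by absolute advantage and drops stale ones. This is a sketch under stated assumptions, not the GRPO replay paper's algorithm: the brief does not define "fresh anchoring", so it is interpreted here as always putting the current step's rollouts in the batch, and the `Rollout` and `ReplayBuffer` names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    prompt_id: str
    advantage: float
    step: int  # training step at which the rollout was sampled


class ReplayBuffer:
    def __init__(self, max_age):
        self.max_age = max_age
        self.items = []

    def add(self, rollouts):
        self.items.extend(rollouts)

    def build_batch(self, fresh, step, n_replay):
        """Return the fresh rollouts plus the top-|advantage| replayed ones."""
        # Age cap: evict rollouts sampled too many steps ago.
        self.items = [r for r in self.items if step - r.step <= self.max_age]
        replayed = sorted(self.items, key=lambda r: abs(r.advantage),
                          reverse=True)[:n_replay]
        batch = list(fresh) + replayed  # fresh rollouts are always included
        self.add(fresh)                 # fresh rollouts become replay candidates
        return batch


buffer = ReplayBuffer(max_age=2)
buffer.add([
    Rollout("a", 0.9, step=0),   # too old at step 5, will be evicted
    Rollout("b", -0.5, step=3),
    Rollout("c", 0.1, step=4),
])
batch = buffer.build_batch([Rollout("d", 0.2, step=5)], step=5, n_replay=1)
batch_ids = [r.prompt_id for r in batch]
```

The age cap limits how off-policy the replayed samples can get, which is one plausible reading of why the brief groups it with stability fixes.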

3) Technical synthesis

A common systems move is to decouple storage from use: long-term memory vs short-term strategy (AdaMEM), retained evidence vs read-time retrieval (EMBER), current vs audit rows (TOKI), and preference logging vs model updates (Digital Apprentice).
Several papers replace scalar quality with multi-dimensional telemetry: Digital Apprentice uses a 6D rubric; NoRA decomposes action alignment, factual grounding, and support binding; RBI-Eval separates exposure from integration.
State-conditioned adaptation is emerging as the default for agents: SGDR retrieves skills per step from current webpage state, AdaMEM refreshes strategy during episodes, and MRAgent chooses traversal actions based on accumulated evidence.
Multiple works show that difficulty is a poor proxy for value: consequence-aware routing finds difficulty can anti-correlate with premium-tier marginal gain, while memory papers show more retrieval is not always better if it increases noise.
There is a broad shift from free-form text artifacts to structured intermediate representations: AIP graphs, PersonaTree hierarchies, Cue–Tag–Content graphs, evidence capsules, contradiction operators, and latent skill adapters.
Several papers use judge-mediated optimization loops, but also expose their fragility: Digital Apprentice, RHO, MemoryDocDataSet, and Ghostwriter all depend on LLM judges, while TOKI explicitly argues keyed logging is needed for replay consistency.
RL papers converge on stability fixes for sparse/discrete rewards: rollout age caps and fresh anchoring in replay, dual-anchor advantages and asymmetric KL in MDP-GRPO.
Benchmark papers increasingly evaluate closed-loop behavior, not static outputs: Asuka-Bench, ZERO-APT, BenchAgent, and RHO all measure iterative adaptation under shared protocols or active opposition.
Security papers repeatedly show that artifact-level trust is unsafe: steering vectors, retrieved evidence, benchmark datasets, and insertion positions all become attack surfaces once shared or reused.
A recurring empirical pattern is that retrieval/filtering reduces exposure but not misuse after exposure—seen clearly in RBI-Eval and echoed by memory and security papers where generation-time safeguards remain necessary.
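
The "current vs audit rows" pattern from the first point above can be sketched with a small append-only store. This is a simplified, single-timeline illustration and not TOKI's operator algebra; the `Row` and `MemoryStore` names, fields, and example facts are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Row:
    key: str
    value: str
    source: str                      # provenance of the write
    written_at: int
    superseded_at: Optional[int] = None  # None means the row is current


class MemoryStore:
    def __init__(self):
        self.rows = []

    def write(self, key, value, source, now):
        """Never overwrite: a contradicting write closes the old row and appends a new one."""
        for row in self.rows:
            if row.key == key and row.superseded_at is None:
                if row.value == value:
                    return row           # same fact, nothing to resolve
                row.superseded_at = now  # old row becomes an audit row
        new_row = Row(key, value, source, now)
        self.rows.append(new_row)
        return new_row

    def current(self, key):
        for row in self.rows:
            if row.key == key and row.superseded_at is None:
                return row.value
        return None

    def as_of(self, key, t):
        """What the store believed at time t, reconstructed from audit rows."""
        for row in self.rows:
            still_open = row.superseded_at is None or t < row.superseded_at
            if row.key == key and row.written_at <= t and still_open:
                return row.value
        return None


store = MemoryStore()
store.write("home_city", "Paris", source="chat-1", now=1)
store.write("home_city", "Berlin", source="chat-7", now=5)
```

Reads see only the current row, while the superseded row stays available for audit and replay, which is the storage/use separation the synthesis describes.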

4) Top 5 papers (with “why now”)

Not All Errors Are Equal: Consequence-Aware Reasoning Compute Allocation
- Reframes adaptive compute as a cost-weighted routing problem, not an accuracy-maximization problem.
- Shows consequence is roughly orthogonal to difficulty, so standard difficulty-aware routing can waste premium compute.
- At matched budgets, reports 21.8% lower cost-weighted loss vs difficulty-aware routing, and 30.7% with priority-aware routing.
- Useful now because frontier deployments increasingly need budgeted inference with asymmetric failure costs.
- Skeptical about: consequence labels are coarse and the main study is an offline multi-model tier experiment, not a live token-budget intervention.

Memory is Reconstructed, Not Retrieved: Graph Memory for LLM Agents
- Makes a strong conceptual shift: memory access should be active and sequential, not one-shot retrieval.
- Combines a Cue–Tag–Content graph with LLM-guided traversal and includes a formal expressivity separation over passive retrieval.
- Reports large gains on LoCoMo and LongMemEval plus major token-cost reductions.
- Useful now because long-horizon assistants are hitting the limits of static RAG-style memory.
- Skeptical about: deeper reconstruction raises latency, and the graph currently lacks robust maintenance/consolidation.

Asuka-Bench: Benchmarking Code Agents on Underspecified User Intent and Multi-Round Refinement
- Introduces a benchmark that actually matches how many coding tasks happen: underspecified requests plus iterative user feedback.
- Separates first-pass generation from repair-from-feedback, which many current benchmarks miss.
- Shows a wide spread across models, with even strong systems far from saturated after 3 rounds.
- Useful now because code agents are increasingly sold as interactive builders, not one-shot code generators.
- Skeptical about: evaluator dependence is high, with GPT-5.4 used in evaluator roles.

The Digital Apprentice: A Framework for Human-Directed Agentic AI Development
- Offers a concrete governance model where autonomy is earned per skill and gated by empirical checks plus human authorization.
- ADAPT adds a practical control plane: multi-policy inference, telemetry, preference emission, and runtime recalibration.
- The pilot suggests policy switching can recover drifted dimensions like actionability.
- Useful now because enterprises need deployable patterns for auditable autonomy escalation, not just abstract alignment principles.
- Skeptical about: evidence is from a single-corpus, judge-measured pilot without inter-rater agreement or significance testing.

Steering LLM Viewpoints through Fabricated Evidence Injection
- Identifies a practical alignment vulnerability: models can internalize pseudo-authoritative fabricated evidence rather than merely quote it.
- Ghostwriter shows this works across HVD, BBQ, and ToxiGen, including against some classifier-guarded systems.
- Also provides a concrete mitigation path with a tailored safeguard policy reporting ~80.5% detection on attacked HVD responses.
- Useful now because retrieval, tool use, and third-party context channels are becoming standard attack paths.
- Skeptical about: the main hazardous-viewpoint dataset is LLM-generated, and the paper does not claim compromise of official deployed products.
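
The cost-weighted routing idea from the first paper in this list can be illustrated with a toy comparison. This is a minimal sketch under assumptions, not the paper's method: it assumes every premium call costs the same (so the budget is a number of slots), and the consequence weights and error rates are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Query:
    name: str
    consequence: float   # cost of a wrong answer, arbitrary units (assumed label)
    err_cheap: float     # estimated error rate on the cheap tier (assumed)
    err_premium: float   # estimated error rate on the premium tier (assumed)


def difficulty_key(q):
    """Difficulty-aware routing: send the hardest queries to the premium tier."""
    return q.err_cheap


def consequence_key(q):
    """Consequence-aware routing: rank by expected cost-weighted loss avoided."""
    return q.consequence * (q.err_cheap - q.err_premium)


def route(queries, premium_slots, key):
    """Return the names of the queries that get a premium slot."""
    ranked = sorted(queries, key=key, reverse=True)
    return {q.name for q in ranked[:premium_slots]}


def cost_weighted_loss(queries, premium):
    """Expected loss: consequence times the error rate of the assigned tier."""
    return sum(
        q.consequence * (q.err_premium if q.name in premium else q.err_cheap)
        for q in queries
    )


queries = [
    Query("trivia", consequence=1.0, err_cheap=0.40, err_premium=0.20),
    Query("dosage", consequence=50.0, err_cheap=0.10, err_premium=0.02),
    Query("refund", consequence=10.0, err_cheap=0.20, err_premium=0.10),
]
by_difficulty = route(queries, premium_slots=1, key=difficulty_key)
by_consequence = route(queries, premium_slots=1, key=consequence_key)
loss_difficulty = cost_weighted_loss(queries, by_difficulty)
loss_consequence = cost_weighted_loss(queries, by_consequence)
```

With one premium slot, difficulty-aware routing spends it on the hard but low-stakes trivia query, while consequence-aware routing spends it on the easier, high-stakes dosage query and ends with a lower cost-weighted loss in this toy setup.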

5) Practical next steps

Add two-stage memory safeguards to any persistent assistant: first filter sensitive retrieval exposure, then separately audit whether the generator actually integrates exposed memory.
For agent memory stacks, test step-wise retrieval/refresh against your current episode-start retrieval baseline; measure not just task success but token cost, latency, and failure recovery.
If you run premium/cheap model routing, replace difficulty-only heuristics with consequence- or marginal-gain-aware scheduling and track cost-weighted loss, not just accuracy.
Treat prompts, skills, and workflows as versioned system artifacts with audit logs; consider typed skill graphs or explicit harness diffs instead of prose-only instructions.
Red-team beyond suffix jailbreaks: evaluate multi-slot insertion, fabricated-evidence context injection, and artifact poisoning for any shared steering vectors or skill bundles.
For long-horizon agents, instrument the full write/read chain: what was stored, what was retrieved, what was shown to the model, and what was actually used in the answer.
Benchmark agents under closed-loop, protocol-aligned settings before adopting multi-agent workflows; measure whether extra agents improve accuracy enough to justify token and latency overhead.
In RLVR or GRPO pipelines, test fresh-anchored replay and stability-oriented advantage shaping on strict constraint tasks before scaling rollout budgets.
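
The two-stage memory safeguard in the first step above can be sketched like this. The sensitivity flag, the substring-based audit, and the example memories are all hypothetical simplifications; a real audit would need something far stronger than exact substring matching.

```python
from dataclasses import dataclass


@dataclass
class Memory:
    text: str
    sensitive: bool


def filter_exposure(memories, allow_sensitive=False):
    """Stage 1: decide which retrieved memories the generator is allowed to see."""
    return [m for m in memories if allow_sensitive or not m.sensitive]


def audit_integration(response, memories):
    """Stage 2: list sensitive memories whose text appears in the response.

    Toy check: case-insensitive substring match (an assumption for illustration).
    """
    lowered = response.lower()
    return [m for m in memories if m.sensitive and m.text.lower() in lowered]


retrieved = [
    Memory("prefers window seats", sensitive=False),
    Memory("diagnosed with asthma", sensitive=True),
]
exposed = filter_exposure(retrieved)
leak = audit_integration(
    "Since you were diagnosed with asthma, pick a low-pollen city.", retrieved
)
clean = audit_integration("I booked a window seat.", retrieved)
```

Running the audit against everything that was retrieved, not just what stage 1 let through, catches cases where sensitive content reaches the answer by another path.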

Generated from per-paper analyses; no external browsing.

Agent control gets concrete.