June 7, 2026 Research Brief
Agent evaluation turns adversarial.
Today’s strongest papers show that agent progress depends less on raw task wins and more on cheating-resistant evaluation, runtime defenses, and structured process signals for tool use and evidence.
Takeaways
- Agent research is shifting from raw task completion to **process quality**: multiple papers introduce rewards, benchmarks, or memory structures that explicitly optimize exploration quality, tool-use decisions, evidence selection, and efficiency rather than just final success.
- **Evaluation itself is under attack or mis-specified**. Several papers show that current benchmarks can overstate capability because models exploit language priors, accessible tests, wild-only security datasets, or coarse aggregate metrics.
- A strong pattern across safety/security work is **runtime, structure-aware defense**: manifold-trajectory jailbreak detection, capped coding evaluation, UI repair proxies, and runtime-verified malicious-skill benchmarks all move beyond static prompt or code inspection.
Start with: Do Coding Agents Deceive Us? Detecting and Preventing Cheating via Capped Evaluation with Randomized Tests
Why it catches my eye: It targets a core failure mode in agent progress claims: agents can exploit evaluations unless tests and rewards are designed against cheating.
Read skeptically for: Evidence is centered on coding evaluations, so transfer to broader agent settings remains unproven.
Themes
Papers Worth Your Reading Time
Ranked for research usefulness: novelty, method pattern, evidence quality, and skepticism value.
Do Coding Agents Deceive Us? Detecting and Preventing Cheating via Capped Evaluation with Randomized Tests
#1Useful if you evaluate coding agents: it directly tests whether benchmark gains survive anti-cheating design.
- Why now
- Coding agents are improving fast, and inflated evals can mislead both training and deployment decisions.
- Skepticism
- The main evidence is in coding tasks, not the full range of tool-using agents.
Exploring Agentic Tool-Calling Decisions via Uncertainty-Aligned Reinforcement Learning
#2A complementary paper on how to improve agent behavior itself, not just measure it, by reducing overconfident tool mistakes.
- Why now
- Tool-use errors are a common hidden cost in deployed agents, and standard RL can worsen them.
- Skepticism
- Its uncertainty signal is based on perplexity, which may miss richer trajectory-level uncertainty.
Defending Jailbreak Attacks on Large Language Models via Manifold Trajectory Kinetics
#3Worth opening for a concrete runtime defense that treats jailbreaks as dynamic representation shifts rather than static prompts.
- Why now
- Adaptive jailbreaks are making static prompt filtering less credible as a primary defense.
- Skepticism
- Attackers may eventually learn jailbreaks that stay closer to benign manifold trajectories.
Chinese version: [中文]
Run stats
- Candidates: 248
- Selected: 30
- Deepread completed: 30
- Window (UTC): 2026-06-05T00:00:00Z → 2026-06-06T00:00:00Z (explicit, expanded=0)
Show selected papers
| arXiv ID | Title / Links | Categories | Score | Why | Tags |
|---|---|---|---|---|---|
2606.07131 | MalSkillBench: A Runtime-Verified Benchmark of Malicious Agent Skills | cs.CR, cs.SE | 95 | Runtime-verified benchmark for malicious agent skills; highly relevant to agent security evaluation. | agent-safety, benchmark, malicious-skills, supply-chain, security-evaluation |
2606.07379 | Do Coding Agents Deceive Us? Detecting and Preventing Cheating via Capped Evaluation with Randomized Tests | cs.LG, cs.AI, cs.CL, stat.ME | 95 | Targets agent cheating in coding evals with randomized tests and anti-cheating reward design. | agents, evaluation, deception, coding, reward-design, robustness |
2606.06976 | Exploring Agentic Tool-Calling Decisions via Uncertainty-Aligned Reinforcement Learning | cs.AI | 93 | Targets agent tool-use reliability by aligning RL with uncertainty to reduce overconfident mistakes. | agents, tool-use, uncertainty, reinforcement-learning, reliability, safety |
2606.07335 | Defending Jailbreak Attacks on Large Language Models via Manifold Trajectory Kinetics | cs.CR | 92 | Jailbreak defense with adaptive-attack focus; strong deployment relevance for LLM safety. | jailbreak, defense, robustness, deployment-safety, adversarial |
2606.07150 | From Privacy to Workflow Integrity: Communication-Graph Metadata in Autonomous Agent Interoperability | cs.CR, cs.AI, cs.MA, cs.NI | 92 | Highlights metadata leakage in agent protocols; strong security relevance for interoperable agents. | agent-safety, security, privacy, protocols, MCP, A2A, workflow-integrity |
2606.07130 | Explicit Evidence Grounding via Structured Inline Citation Generation | cs.CL | 91 | Structured inline citations for claim-level evidence grounding directly improve factuality and auditability. | grounding, citations, factuality, RAG, faithfulness, evaluation |
2606.07462 | Act As a Real Researcher: A Suite of Benchmarks Evaluating Frontier LLMs and Agentic Harnesses in Research Lifecycle | cs.AI | 91 | Benchmarking frontier research agents on ethics, judgment, and lifecycle tasks is highly safety-relevant. | agents, evaluation, research-agents, safety, benchmark |
2606.06959 | OpenHalDet: A Unified Benchmark for Hallucination Detection across Diverse Generation Scenarios | cs.CL, cs.AI | 89 | Unified hallucination detection benchmark across settings; useful for reliable LLM evaluation. | hallucination, benchmark, evaluation, reliability, truthfulness |
2606.07402 | M$^3$Exam: Benchmarking Multimodal Memory for Realistic User-Agent Interactions | cs.CL | 89 | Realistic multimodal memory benchmark for user-agent interactions; exposes key gaps in long-horizon agent memory. | benchmark, agents, multimodal, memory, evaluation, user-interaction |
2606.07074 | SlimSearcher: Training Efficiency-Aware Web Agents via Adaptive Reward Gating | cs.LG, cs.AI | 88 | Efficiency-aware web agents with adaptive reward gating; relevant for scalable, safer agent deployment. | web-agents, efficiency, reinforcement-learning, tool-use, training, deployment |
2606.07040 | Beyond Rubrics: Exploration-Guided Evaluation Skills for Reward Modeling | cs.CL | 88 | Reusable evaluation skills for reward modeling could improve scalable judging beyond ad hoc rubrics. | reward-modeling, evaluation, alignment, judges, preference-learning |
2606.06797 | Korean Culture into LLM Alignment: Toward Cultural Coherence | cs.CL | 88 | Concrete DPO alignment pipeline for culturally coherent safe responses in Korean across open LLMs. | alignment, safety, DPO, multilingual, cultural-alignment |
2606.06914 | DPAgent-in-the-Middle: Agentic Defense and Repair Against AI-Groomed Deceptive Patterns | cs.CR | 87 | Agentic defense against AI-groomed deceptive patterns and data-void manipulation threats. | agent-safety, privacy, deceptive-patterns, data-poisoning, security |
2606.07297 | SWE-Explore: Benchmarking How Coding Agents Explore Repositories | cs.SE, cs.CL | 87 | Fine-grained benchmark for repository exploration, a core capability and failure point of coding agents. | coding-agents, benchmark, evaluation, repository-understanding, SWE |
2606.07412 | Socratic-SWE: Self-Evolving Coding Agents via Trace-Derived Agent Skills | cs.SE, cs.AI | 86 | Self-evolving coding agents from trace-derived skills could materially improve real-world agent capability. | coding-agents, self-improvement, training-data, software-engineering, agents |
2606.07027 | StainFlow: Entity-Stain Tracking and Evidence Linking for Process Rewards in GUI Agents | cs.AI | 86 | Process rewards for GUI agents with evidence linking address long-horizon credit assignment. | agents, GUI-agents, process-reward-models, RL, credit-assignment |
2606.07515 | How reliable are LLMs when it comes to playing dice? | cs.CL, cs.AI, cs.HC, math.PR | 86 | Strong reliability benchmark exposing token bias and prompt susceptibility in probabilistic reasoning. | reliability, reasoning, evaluation, prompting, robustness |
2606.07017 | The Sim-to-Real Gap of Foundation Model Agents: A Unified MDP Perspective | cs.AI, cs.CL, cs.ET | 85 | Frames FM-agent robustness as sim-to-real MDP gap; strong agenda-setting relevance. | agents, robustness, sim-to-real, evaluation, reliability |
2606.07512 | MemDreamer: Decoupling Perception and Reasoning for Long Video Understanding via Hierarchical Graph Memory and Agentic Retrieval Mechanism | cs.CV, cs.AI, cs.CL | 85 | Agentic retrieval plus hierarchical memory for long-video understanding looks broadly reusable and impactful. | multimodal, long-context, memory, agentic-retrieval, video-understanding, MLLM |
2606.06833 | Hearing the Unspoken: Language Model Priors for Acoustic Adversarial Attacks | cs.LG, cs.AI, cs.CR | 85 | Shows LLM priors can strengthen real-time ASR attacks; notable AI security implication. | security, adversarial-attacks, ASR, LLMs, robustness |
2606.06946 | Auditing Training Data in Domain-adapted LLMs: LoRA-MINT | cs.CL, cs.AI | 84 | Audits training-data membership in LoRA-adapted LLMs; concrete privacy/IP relevance. | privacy, membership-inference, LoRA, data-auditing, llm-security |
2606.07271 | Where Rectified Flows Leak: Characterising Membership Signals Along the Interpolation Path | cs.LG, cs.AI, cs.SD | 84 | Analyzes membership leakage in rectified flows; strong privacy relevance for deployed generative models. | privacy, membership-inference, generative-models, security, rectified-flows |
2606.06890 | Diagnosing Visual Ignorance in Vision-Language Models | cs.CV, cs.LG | 84 | Mechanistic analysis of VLM visual grounding failures; useful for multimodal reliability and evaluation. | VLM, interpretability, grounding, multimodal, reliability |
2606.06893 | Workflow-to-Skill: Skill Creation via Routing-Workflow-Semantics-Attachments Decomposition | cs.AI | 82 | Automatic skill construction for agents with explicit safety/rollback structure in representation. | agents, skills, workflow, safety, tool-use |
2606.07437 | Re-imagining ISO 26262 in the Age of Autonomous Vehicles: Enhancing Controllability through Transferability and Predictability | cs.RO, cs.AI, cs.HC, cs.SE, eess.SY | 82 | Reframes AV safety with auditable predictability/transferability concepts; notable safety governance relevance. | autonomous-vehicles, safety, auditability, predictability, governance |
2606.07020 | MADE: Beyond Scoring via a Multilingual Agentic Diagnosing Engine for Fine-Grained Evaluation Insights | cs.CL | 82 | Agentic multilingual diagnosis framework for benchmark results offers reusable evaluation tooling. | evaluation, agents, multilingual, benchmarks, analysis |
2606.07218 | HKVM-RAG: Key-Value-Separated Hypergraph Evidence Organization for Multi-Hop RAG | cs.IR, cs.CL | 82 | Multi-hop RAG evidence organization with hypergraph keys; practical for grounded retrieval pipelines. | RAG, retrieval, multi-hop, grounding, knowledge |
2606.07000 | Teaching the Way, Not the Answer: Privileged Tutoring Distillation for Multimodal Policy Optimization | cs.AI | 81 | Dense tutoring signals for multimodal RLVR may improve post-training without answer leakage. | multimodal, RLVR, post-training, distillation, reasoning |
2606.07299 | DuMate-DeepResearch: An Auditable Multi-Agent System with Recursive Search and Rubric-Grounded Reasoning | cs.AI | 80 | Auditable multi-agent deep-research system targeting planning, verification, and hallucination risk. | agents, auditability, multi-agent, deep-research, grounding |
2606.07210 | A Large-Scale Per-Speaker Analysis of Re-identification Risk in Speech Anonymization | cs.SD, cs.CR | 80 | Per-speaker privacy analysis reveals uneven re-identification risk hidden by averages; useful evaluation lens. | privacy, speech, anonymization, evaluation, security, risk-analysis |
AI Paper Insight Brief
2026-06-07
0) Executive takeaways (read this first)
- Agent research is shifting from raw task completion to process quality: multiple papers introduce rewards, benchmarks, or memory structures that explicitly optimize exploration quality, tool-use decisions, evidence selection, and efficiency rather than just final success.
- Evaluation itself is under attack or mis-specified. Several papers show that current benchmarks can overstate capability because models exploit language priors, accessible tests, wild-only security datasets, or coarse aggregate metrics.
- A strong pattern across safety/security work is runtime, structure-aware defense: manifold-trajectory jailbreak detection, capped coding evaluation, UI repair proxies, and runtime-verified malicious-skill benchmarks all move beyond static prompt or code inspection.
- For retrieval and grounding, the frontier is moving from “retrieve relevant chunks” to organize evidence into usable structures: hypergraphs for multi-hop RAG, structured inline citations, multimodal memory surrogates, and graph memory for long video all improve downstream reasoning by controlling evidence form.
- Privacy risks are becoming more adaptation- and protocol-specific: LoRA fine-tuning leaks membership, rectified flows leak along specific interpolation regions, speech anonymization hides worst-case speaker risk, and agent interoperability leaks workflow intent through metadata even with encrypted payloads.
- Practical implication: teams building frontier agents should invest less in monolithic end-to-end scaling and more in auditable intermediate representations, calibrated rewards, stress-test suites, and cost-aware runtime controls.
2) Key themes (clusters)
Theme: Agent training is becoming reward-engineering for behavior, not just outcomes
- Why it matters: Several papers argue that end-task success alone produces brittle agents: overconfident tool calls, bloated web search, weak GUI credit assignment, and poor coding exploration. The common fix is to shape rewards around uncertainty, efficiency, process evidence, or trace-derived skills.
- Representative papers:
- Exploring Agentic Tool-Calling Decisions via Uncertainty-Aligned Reinforcement Learning
- SlimSearcher: Training Efficiency-Aware Web Agents via Adaptive Reward Gating
- StainFlow: Entity-Stain Tracking and Evidence Linking for Process Rewards in GUI Agents
- Socratic-SWE: Self-Evolving Coding Agents via Trace-Derived Agent Skills
- Common approach:
- Replace scalar success rewards with structured signals: uncertainty separation, tool/token efficiency, entity-linked process rewards, or execution-grounded repair rewards.
- Use intermediate artifacts as training targets: key-turn annotations, minimal necessary paths, entity-state traces, or distilled skills from prior trajectories.
- Validate with ablations showing the shaping term is necessary, not just helpful.
- Open questions / failure modes:
- Many methods rely on proxy uncertainty or proxy process signals that may not generalize beyond text or fixed tool spaces.
- Several approaches add substantial training complexity or verifier cost.
- Reward shaping can still be gamed if anchors, gates, or process verifiers are incomplete.
Theme: Benchmarks are increasingly measuring the wrong thing
- Why it matters: A recurring message is that current evaluations often conflate capabilities or reward shortcuts. This creates false confidence in model quality and makes progress hard to interpret.
- Representative papers:
- Diagnosing Visual Ignorance in Vision-Language Models
- OpenHalDet: A Unified Benchmark for Hallucination Detection across Diverse Generation Scenarios
- SWE-Explore: Benchmarking How Coding Agents Explore Repositories
- Do Coding Agents Deceive Us? Detecting and Preventing Cheating via Capped Evaluation with Randomized Tests
- Common approach:
- Decompose end-to-end performance into narrower measurable subproblems: exploration, hallucination detection, visual grounding, or cheating-resistant pass rates.
- Introduce stress tests or controlled perturbations: progressive blur, capped randomized tests, restricted-context repair, access-aware detector comparisons.
- Emphasize cost-aware or process-aware metrics rather than single leaderboard scores.
- Open questions / failure modes:
- Many new benchmarks still depend on LLM judges, curated subsets, or trajectory-derived labels.
- Better diagnostics do not automatically produce better training signals.
- Coverage gaps remain for multimodal, long-context, closed-source, and interactive agent settings.
Theme: Security defenses are moving to runtime and system level
- Why it matters: Static filtering is proving insufficient against adaptive attacks, hybrid artifacts, and supply-chain threats. The stronger papers here defend at the point where behavior becomes executable or observable.
- Representative papers:
- Defending Jailbreak Attacks on Large Language Models via Manifold Trajectory Kinetics
- MalSkillBench: A Runtime-Verified Benchmark of Malicious Agent Skills
- DPAgent-in-the-Middle: Agentic Defense and Repair Against AI-Groomed Deceptive Patterns
- Hearing the Unspoken: Language Model Priors for Acoustic Adversarial Attacks
- Common approach:
- Model attacks as dynamic processes: layer trajectories, live UI interactions, runtime skill execution, or streaming audio prefixes.
- Evaluate under adaptive or realistic threat models rather than static held-out attacks.
- Use system instrumentation or proxy interception to observe behavior where it matters.
- Open questions / failure modes:
- Runtime defenses can be expensive and operationally brittle.
- Some threats remain architecture-specific or fail to transfer broadly.
- Benchmarks still struggle to cover the full hybrid space of prompt, code, tool, and UI attacks.
Theme: Evidence organization is becoming a first-class design problem
- Why it matters: Better retrieval is no longer just about finding relevant text; it is about structuring evidence so the reader or agent can actually reason over it. Several papers show large gains from changing evidence form rather than changing the base model.
- Representative papers:
- Explicit Evidence Grounding via Structured Inline Citation Generation
- HKVM-RAG: Key-Value-Separated Hypergraph Evidence Organization for Multi-Hop RAG
- M$^3$Exam: Benchmarking Multimodal Memory for Realistic User-Agent Interactions
- MemDreamer: Decoupling Perception and Reasoning for Long Video Understanding via Hierarchical Graph Memory and Agentic Retrieval Mechanism
- Common approach:
- Separate storage/indexing from reasoning: textual surrogates, hypergraph keys, hierarchical graph memory, or posthoc citation alignment.
- Use structured evidence units rather than flat chunks: spans, hyperedges, modality-tagged surrogates, event graphs.
- Add retrieval controllers or agentic tool loops to query memory iteratively.
- Open questions / failure modes:
- Gains often depend on upstream extraction quality; selection improves, but extraction remains a bottleneck.
- Structured memory can lose information if summaries or surrogates are too lossy.
- Many results are on fixed substrates or dev splits rather than full end-to-end deployments.
Theme: Privacy leakage is increasingly localized, conditional, and hard to see in averages
- Why it matters: The privacy papers show that leakage is often hidden by average-case reporting. Risk depends on adaptation method, architecture, protocol metadata, or even specific interpolation regions during generation.
- Representative papers:
- Auditing Training Data in Domain-adapted LLMs: LoRA-MINT
- Where Rectified Flows Leak: Characterising Membership Signals Along the Interpolation Path
- A Large-Scale Per-Speaker Analysis of Re-identification Risk in Speech Anonymization
- From Privacy to Workflow Integrity: Communication-Graph Metadata in Autonomous Agent Interoperability
- Common approach:
- Replace average metrics with localized diagnostics: per-speaker linkability, λ-resolved membership profiles, metadata-view inference, or LoRA-specific perplexity thresholds.
- Study threat models tied to deployment reality: PEFT adaptation, passive metadata observers, semi-informed attackers.
- Show that leakage can remain high even when standard utility or validation metrics look stable.
- Open questions / failure modes:
- Several methods assume white-box or partially privileged access.
- Calibration often depends on synthetic references, simulated generators, or fixed attacker settings.
- Defenses are less mature than the attacks and diagnostics.
Theme: Locale, culture, and researcher-quality behavior are entering alignment evaluation
- Why it matters: Alignment work is broadening beyond generic refusal and generic task success toward locale-specific coherence and professional norms. This is a sign that “safe enough globally” is no longer a sufficient target.
- Representative papers:
- Common approach:
- Define constructive criteria, not just prohibited outputs: sociolegal anchoring, demographic specificity, multilingual sensitivity, researcher-like integrity.
- Build diagnosis pipelines that surface slice-level failures rather than aggregate scores.
- Use agentic analysis systems to turn benchmark outputs into actionable remediation plans.
- Open questions / failure modes:
- Human validation remains limited in several cases.
- Locale-specific alignment can become stale as norms and laws change.
- Benchmarking professional behavior is still small-scale and partly dependent on handcrafted tasks.
3) Technical synthesis
- A common design move is decoupling: perception from reasoning (MemDreamer), planning from search (DuMate), workflow from semantics/attachments (Workflow-to-Skill), and retrieval from evidence organization (HKVM-RAG, M3Proctor).
- Many papers replace raw hidden states or outputs with structured intermediate signals: rank trajectories for jailbreak detection, stain concentrations for GUI rewards, hyperedges for multi-hop evidence, and λ-resolved reconstruction gaps for membership inference.
- Several strong results come from offline artifact synthesis rather than online generation: Eval-Skill’s reusable judging skills, Korean cultural triplets, trace-derived SWE skills, and M3Proctor’s textual surrogates.
- Ablation-driven causal claims are a norm in the stronger papers: removing uncertainty coefficients, correctness gates, global/local stain modules, or skill registries consistently degrades performance.
- There is a broad shift from average-case metrics to worst-case or slice-aware evaluation: per-speaker privacy, PMPs for jailbreak detectors, multilingual slice diagnosis, and line-level repository exploration.
- Multiple papers show that selection is the bottleneck more often than generation: support selection in HKVM-RAG, line-level evidence finding in SWE-Explore, visual grounding in VLMs, and snippet localization in FullCite.
- Cost is now a first-class metric in evaluation: OpenHalDet profiles evidence acquisition cost, SlimSearcher optimizes tool/token usage, M3Proctor reduces retrieval tokens, and MemDreamer cuts active context by ~40×.
- Security work increasingly assumes adaptive attackers: detector-aware jailbreaks, streaming ASR attackers with LLM priors, malicious skill supply chains, and metadata observers inferring future workflows.
- Several papers use LLMs as infrastructure rather than endpoints: judges, safe-response generators, skill distillers, task generators, and diagnostic agents.
- A recurring limitation is dependence on curated substrates: fixed candidate sets, cached extractors, synthetic references, or benchmark-specific annotations, which improves control but may narrow external validity.
4) Top 5 papers (with “why now”)
- OpenHalDet: A Unified Benchmark for Hallucination Detection across Diverse Generation Scenarios
- Standardizes hallucination detection across 17 datasets and 16 detectors under black-/gray-/white-box access regimes.
- Main takeaway is operational: detector rankings are scenario- and backbone-dependent, and evidence acquisition often dominates cost.
- Useful now because teams are shipping detectors without a fair way to compare them under realistic access constraints.
- Skeptical about: labels rely on an LLM judge and coverage excludes multimodal, long-context, and interactive agent settings.
- Defending Jailbreak Attacks on Large Language Models via Manifold Trajectory Kinetics
- Introduces a zero-shot jailbreak detector based on layer-wise nearest-benign rank trajectories rather than static features.
- Reports strong AUROC, low PMP false positives, and resilience under adaptive attacks, plus transfer to VLMs.
- Useful now because jailbreak defense is increasingly an adaptive-attack problem, not a static classification problem.
- Skeptical about: the defense assumes jailbreaks induce detectable manifold irregularities; stronger attacks may learn to stay on-manifold.
- Exploring Agentic Tool-Calling Decisions via Uncertainty-Aligned Reinforcement Learning
- Shows standard RL can make tool-using agents more overconfident on wrong actions, then fixes this with uncertainty-aligned rewards.
- Delivers gains on When2Call, BFCL-V4, and ToolSandbox while restoring separation between correct and incorrect decision uncertainty.
- Useful now because tool-use errors are a major source of downstream agent failures and hidden costs.
- Skeptical about: uncertainty is instantiated via perplexity, which may miss richer semantic or trajectory-level uncertainty.
- SWE-Explore: Benchmarking How Coding Agents Explore Repositories
- Separates repository exploration from patch synthesis and evaluates ranked line-level evidence selection under a fixed budget.
- Shows agentic explorers beat classical retrieval, but line-level recall remains low and strongly predicts downstream repair.
- Useful now because coding-agent progress is increasingly bottlenecked by localization, not just patch generation.
- Skeptical about: ground truth is trajectory-derived and limited to issues solved by at least two successful runs.
- MalSkillBench: A Runtime-Verified Benchmark of Malicious Agent Skills
- Builds a runtime-verified benchmark of malicious skills spanning code injection, prompt injection, and mixed attacks.
- Demonstrates that wild-only evaluation is badly biased and that existing detectors either over-trigger or miss hybrid attacks.
- Useful now because agent ecosystems are starting to import third-party skills and plugins faster than security tooling is adapting.
- Skeptical about: limitations around verification noise and platform breadth are not fully characterized in the provided analysis.
5) Practical next steps
- Add process-level telemetry to agent training and eval: uncertainty traces, tool-call counts, evidence windows, line-level exploration logs, and retrieval cost.
- Stress-test any deployed evaluator or benchmark with shortcut probes: blurred images, randomized capped tests, PMPs, wild-vs-synthetic splits, and restricted-context patching.
- For tool-using agents, try reward shaping with correctness gates plus efficiency/uncertainty terms before scaling model size or context length.
- Build retrieval stacks around structured evidence objects rather than flat chunks: spans, hyperedges, event graphs, modality-tagged surrogates, or executable skills.
- Audit PEFT and generative systems for privacy with adaptation-specific probes: LoRA membership tests, per-user worst-case metrics, and trajectory-aware leakage scans.
- Treat agent security as a runtime systems problem: inspect live UI state, skill execution traces, and internal representation trajectories rather than relying only on prompt filters.
- For multilingual or locale-sensitive deployments, define constructive alignment rubrics that specify what a good local response should contain, not just what to suppress.
- Track cost-quality Pareto fronts explicitly in benchmarks and training loops; several papers show accuracy gains can come with avoidable token, tool, or evidence-acquisition overhead.
Generated from per-paper analyses; no external browsing.