Takeaways

Evaluation is shifting from static capability tests to deployment-shaped benchmarks: today’s strongest papers stress dynamic scheduling, long-form judging, UX, value conflicts, coding harnesses, and enterprise pre-deployment assurance rather than raw task accuracy alone.
A recurring pattern is that scaffolding often matters as much as the base model: harnesses, critics, verifiers, adapters, and controllers produced large gains in multimodal tasks, GUI control, coding agents, and safety-aligned on-device deployment.
RAG and context-bearing systems remain a major attack surface, but the failure modes are diversifying: beyond classic prompt injection, papers show cost-exhaustion via poisoned retrieval, brand suppression from safety overreaction, and long-horizon context poisoning.

Start with: Benchmarking LLM-as-a-Judge for Long-Form Output Evaluation

Why it catches my eye: It challenges a widely used evaluation shortcut by showing long-form LLM judges are only moderately reliable under realistic document lengths.

Read skeptically for: Benchmark breadth is strong, but it does not test retrieval-backed or multi-agent judge architectures.

Tags: evaluation, llm-as-judge, long-form, reliability

Themes

Deployment-realistic evaluation replaces static benchmarks

Many papers argue current benchmarks overestimate readiness because they ignore partial observability, long documents, user behavior, regulatory constraints, or harness effects. The result is a stronger push toward evaluations that mirror real operating conditions and expose failure modes earlier.

Harnesses, critics, and verifiers are becoming first-class capability multipliers

Multiple papers show that frozen or modestly tuned models can improve substantially when wrapped with better execution logic, verification, or critique. This suggests near-term gains may come more from systems design than from retraining larger base models.

RAG and persistent-context systems face new attack classes

The attack surface is moving from direct prompt attacks to indirect manipulation of retrieved documents, persistent memory, and safety-trained behaviors. This is operationally important because these attacks can scale through shared corpora and standard agent pipelines.

Signal: Static evals are losing credibility. Workplace agents, long-form judges, UXBench, coding harnesses, and enterprise assurance all test operating conditions that static accuracy benchmarks miss.

Tension: Scaffolds help, but add fragility. MUSE, grounded critics, hierarchical tool learning, and integrity gates improve outcomes, yet depend on verifiers, orchestration, and extra system complexity.

Bet: RAG security becomes operational. Today’s attacks target cost inflation, brand suppression, route safety, and context poisoning, pushing defenses toward audits, controllers, and enforceable boundaries.

Papers Worth Your Reading Time

Ranked for research usefulness: novelty, method pattern, evidence quality, and skepticism value.

Benchmarking LLM-as-a-Judge for Long-Form Output Evaluation

Useful if you rely on scalable evaluation: it shows long-form judge accuracy is much weaker than short-form results imply.

Why now: Research agents and review workflows increasingly depend on long-output judging.
Skepticism: It leaves newer judge designs, including retrieval-backed or multi-agent variants, largely untested.

Inference Cost Attacks for Retrieval-Augmented Large Language Models

It reframes RAG poisoning as an availability and cost problem, not just a factuality problem.

Why now: RAG is default infrastructure, so token-cost amplification is becoming a practical production risk.
Skepticism: Results are shown on three QA datasets and assume attackers can inject retrievable documents.

MUSE: A Unified Agentic Harness for MLLMs

A strong example of system design beating raw model scaling through verifiers, tools, and repair loops.

Why now: Harness-level gains are one of the few durable levers across rapidly changing multimodal base models.
Skepticism: Its gains may depend on having reliable task-specific verifiers and deterministic tools.

Run stats

Candidates: 2527
Selected: 30
Deepread completed: 30
Window (UTC): 2026-06-12T00:00:00Z → 2026-06-13T00:00:00Z (weekend_backlog_unknown, expanded=0)

arXiv ID	Title / Links	Categories	Score	Why	Tags
`2606.10747`	The Arbiter Agent: Continually Monitoring Multi-Agent Conversations to Detect Emergent Misalignment PDF	cs.AI	95	Directly targets monitoring emergent misalignment in multi-agent LLM conversations.	agent-safety, multi-agent, monitoring, misalignment, evaluation
`2606.09315`	Brain-Prompt Injection: A Route-Safety Audit for BCI-LLM Agents PDF	cs.CR, cs.AI	94	Novel audit framework for BCI-LLM agent routing attacks; strong agent safety relevance.	agent-safety, prompt-injection, BCI, auditing, tool-use, security
`2606.09204`	The Injection Paradox: Brand-Level Suppression in Safety-Trained LLM Recommendations via RAG Context Injection PDF	cs.LG, cs.CL, cs.CR	94	Concrete prompt-injection finding in RAG; safety training causes measurable brand suppression side effect.	RAG, prompt-injection, LLM-safety, security, evaluation
`2606.02643`	Inference Cost Attacks for Retrieval-Augmented Large Language Models PDF	cs.CR, cs.AI, cs.DB	93	Targets RAG via KB poisoning to inflate inference cost; practical security risk with clear attack model.	RAG, security, data-poisoning, inference-cost, adversarial
`2606.02947`	BYORn: Bootstrap Your Own Responses to Defend Large Vision-Language Models Against Backdoor Attacks PDF	cs.LG, cs.CV	92	Concrete defense against VLM backdoor attacks in open-ended fine-tuning settings.	security, backdoor-defense, VLM, robustness, fine-tuning
`2606.09388`	Distilling Safe LLM Systems via Soft Prompts for On Device Settings PDF	cs.LG	92	Practical safety distillation for on-device LLMs; strong relevance to deployable guardrails.	llm-safety, distillation, on-device, guardrails, alignment
`2606.04037`	Toward Pre-Deployment Assurance for Enterprise AI Agents: Ontology-Grounded Simulation and Trust Certification PDF	cs.AI, cs.LG, cs.SE	91	Pre-deployment assurance framework for enterprise AI agents with scenario generation and certification.	agents, assurance, verification, certification, enterprise-ai, safety
`2606.03793`	Exploring Adversarial Robustness and Safety Alignment in Multilingual Multi-Modal Large Language Models PDF	cs.CL, cs.CV	91	Systematic multilingual MLLM safety/robustness study shows cross-lingual adversarial transfer.	multimodal, safety, adversarial, multilingual, evaluation
`2606.09178`	Culturally-Adapted Red-Teaming Across East and Southeast Asian Contexts: A Methodological and Comparative Analysis PDF	cs.CL, cs.AI	91	Shows translated safety benchmarks miss culturally grounded risks; strong red-teaming relevance.	red-teaming, multilingual, safety-evaluation, jailbreaks, benchmark
`2606.12344`	Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks PDF	cs.LG, cs.CL	91	Benchmark for coding agents with fair harness comparison; highly reusable for agent evaluation.	agents, benchmark, coding, evaluation, SWE-bench
`2606.02958`	Echelon: Auditable Aggregate-Only Language-Model Adaptation Across Privacy Boundaries PDF	cs.CR, cs.AI	91	Boundary-first LM adaptation with auditable aggregate-only exchange is highly relevant to privacy-safe deployment.	privacy, federated-learning, auditing, governance, lm-adaptation, security
`2606.09570`	UXBench: Benchmarking User Experience in AI Assistants PDF	cs.CL, cs.HC	91	Real-user UX benchmark for assistants; strong alignment/eval relevance and broad reuse potential.	benchmark, alignment, evaluation, user-experience, assistants
`2606.11078`	A History-Aware Visually Grounded Critic for Computer Use Agents PDF	cs.AI, cs.CL, cs.CV	91	History-aware, visually grounded critic for computer-use agents; strong agent reliability relevance.	agents, computer-use, multimodal, test-time, reliability, GUI
`2606.01629`	Benchmarking LLM-as-a-Judge for Long-Form Output Evaluation PDF	cs.CL	91	Long-form LLM-as-judge benchmark targets a key reliability gap in scalable evaluation.	evaluation, llm-as-a-judge, reliability, benchmark, long-form
`2606.09499`	Targeting World Models to Compromise Robot Learning Pipelines PDF	cs.RO, cs.AI, cs.CR	90	Shows stealthy poisoning of robot learning via world models; important AI supply-chain risk.	robotics, data-poisoning, world-models, supply-chain, safety, security
`2606.03695`	Don't Forget Your Embeddings: Robust Knowledge Erasure via Precise Editing of Embeddings PDF	cs.CL	90	Knowledge erasure for safety/compliance with adversarial recovery explicitly considered.	unlearning, model-editing, safety, compliance, robustness
`2606.09371`	Capability-Aligned Hierarchical Learning for Tool-Augmented LLMs PDF	cs.AI	90	Joint planner-executor RL for tool LLMs; strong agentic relevance and concrete benchmark gains.	agents, tool-use, hierarchical-RL, alignment, evaluation
`2606.03312`	RobotValues: Evaluating Household Robots When Human Values Conflict PDF	cs.RO, cs.AI	90	10K benchmark for robot value conflicts directly targets embodied AI alignment and evaluation.	robotics, alignment, benchmark, human-values, evaluation, safety
`2606.09475`	Emergent alignment and the projectability of ethical personas PDF	cs.AI, cs.LG	90	Directly studies emergent alignment via finetuning and ethical personas; strong alignment relevance.	alignment, finetuning, personas, constitutional-ai, safety
`2606.09118`	ComplexConstraints and Beyond: Expert Rubrics for RLVR PDF	cs.AI	89	Rubric-based evaluation for complex instruction following and enterprise agents is broadly reusable.	evaluation, agents, instruction-following, rubrics, llm-judge
`2606.02866`	When Helping Hurts and How to Fix It: Multi-Agent Debate for Data Cleaning PDF	cs.AI, cs.CL, cs.MA	89	Large study of multi-agent debate finds when it helps or harms; actionable reliability insight.	multi-agent, debate, reliability, evaluation, data-cleaning
`2606.11145`	OpenPCC: Open and Confidential LLM Serving on Commodity TEEs PDF	cs.CR	89	Confidential LLM serving on commodity TEEs; strong privacy/security relevance for deployed agents.	llm-security, privacy, TEE, deployment, confidential-compute
`2606.05748`	UNIVID: Unified Vision-Language Model for Video Moderation PDF	cs.MM, cs.AI, cs.CL	89	Unified VLM for video moderation with interpretable policy-aware captions; strong safety deployment relevance.	multimodal, moderation, safety, policy-alignment, evaluation
`2606.09500`	Deterministic Integrity Gates for LLM-Assisted Clinical Manuscript Preparation: An Auditable Biomedical Informatics Architecture PDF	cs.AI, cs.DL	89	Auditable integrity gates for LLM writing; concrete verification architecture for high-stakes use.	safety, verification, auditing, clinical, hallucination
`2606.12212`	Mind your key: An Empirical Study of LLM API Credential Leakage in iOS Apps PDF	cs.SE, cs.CR	89	First empirical study of LLM API key leakage in iOS apps with dynamic analysis framework.	security, LLM-apps, credential-leakage, mobile, evaluation
`2606.03005`	MUSE: A Unified Agentic Harness for MLLMs PDF	cs.CV, cs.AI	89	Agentic harness for frozen MLLMs with verification/repair could strongly improve reliability.	agents, multimodal, tool-use, verification, reasoning
`2606.10322`	Game-Theoretic Multi-Agent Control for Robust Contextual Reasoning in LLMs PDF	cs.CR, cs.MA	88	Targets prompt injection and context poisoning across turns with controller-based MCP defense.	prompt-injection, context-poisoning, multi-agent, MCP, security, LLM-agents
`2601.08173`	The Agent's First Day: Benchmarking Learning, Exploration, and Scheduling in the Workplace Scenarios PDF	cs.AI	88	Dynamic workplace benchmark for agent learning, exploration, and scheduling beyond static tasks.	agents, benchmark, evaluation, exploration, scheduling
`2606.05792`	Can LLMs Write Correct TLA+ Specifications? Evaluating Natural-Language-to-TLA+ Generation PDF	cs.AI, cs.LG, cs.LO, cs.SE	88	Systematic evaluation of LLMs generating formal specs; strong reliability signal.	evaluation, reliability, formal-methods, code, LLMs
`2606.02800`	Cosmos 3: Omnimodal World Models for Physical AI PDF	cs.CV, cs.AI, cs.LG, cs.MM, cs.RO	88	Potentially major frontier omnimodal world model with broad agent impact and strong claimed results.	frontier-models, multimodal, world-models, agents, physical-AI

AI Paper Insight Brief

2026-06-14

0) Executive takeaways (read this first)

Evaluation is shifting from static capability tests to deployment-shaped benchmarks: today’s strongest papers stress dynamic scheduling, long-form judging, UX, value conflicts, coding harnesses, and enterprise pre-deployment assurance rather than raw task accuracy alone.
A recurring pattern is that scaffolding often matters as much as the base model: harnesses, critics, verifiers, adapters, and controllers produced large gains in multimodal tasks, GUI control, coding agents, and safety-aligned on-device deployment.
RAG and context-bearing systems remain a major attack surface, but the failure modes are diversifying: beyond classic prompt injection, papers show cost-exhaustion via poisoned retrieval, brand suppression from safety overreaction, and long-horizon context poisoning.
Several papers expose “false confidence” in current oversight tools: LLM judges are only moderately reliable on long-form outputs, direct-translation safety evals understate multilingual risk, and low unsafe rates can reflect comprehension failure rather than real alignment.
Multi-agent methods are not uniformly beneficial: debate can hurt generation while helping detection, and monitoring/controller layers need explicit grounding, budgets, and recovery logic to avoid emergent misalignment or context drift.
Security/privacy work is becoming more operational: auditable aggregate-only training, confidential TEE-based serving, iOS API-key leakage measurement, and deterministic integrity gates all emphasize enforceable system contracts over aspirational policy claims.

2) Key themes (clusters)

Theme: Deployment-realistic evaluation replaces static benchmarks

Why it matters: Many papers argue current benchmarks overestimate readiness because they ignore partial observability, long documents, user behavior, regulatory constraints, or harness effects. The result is a stronger push toward evaluations that mirror real operating conditions and expose failure modes earlier.
Representative papers:
Common approach:
- Build benchmarks around realistic artifacts: streaming tasks, 9k-token outputs, real user logs, or repo-level coding workflows.
- Separate latent variables that prior benchmarks conflated, such as model vs harness vs adapter.
- Use richer metrics than top-line accuracy: checkpoint scores, BAD recall, apply-failure rate, cost, latency, and recovery quality.
- Stress intermediate failure modes rather than only final success.
Open questions / failure modes:
- Benchmark realism still depends on handcrafted rules, synthetic labels, or single-product datasets.
- LLM-based judges remain a weak link in several pipelines.
- Single-run or narrow-domain evaluations may overstate ranking stability.
- Better realism often increases evaluation cost and complexity.

Theme: Harnesses, critics, and verifiers are becoming first-class capability multipliers

Why it matters: Multiple papers show that frozen or modestly tuned models can improve substantially when wrapped with better execution logic, verification, or critique. This suggests near-term gains may come more from systems design than from retraining larger base models.
Representative papers:
Common approach:
- Add deterministic verifiers, repair loops, or grounded critics around a frozen or lightly adapted model.
- Use task-specific structure: simulator-backed checks, GUI coordinate markers, schema validators, or content-hash manifests.
- Convert failures into actionable feedback rather than scalar rejection.
- Optimize for pre-execution prevention, not just post-hoc detection.
Open questions / failure modes:
- These systems often depend on reliable task-specific verifiers, limiting generality.
- Extra calls and orchestration add latency and engineering complexity.
- Manual harness design remains a bottleneck.
- Critics can still fail on subtle perception or open-ended generation.
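
The verify-then-repair pattern in the list above can be sketched generically as follows. This is a minimal illustration and not the implementation from MUSE or any other paper listed here: `run_with_repair`, `toy_generate`, and `toy_verify` are hypothetical names, and the toy verifier only checks that an answer is sorted.

```python
from typing import Callable, Optional, Tuple

# Hypothetical interfaces: a generator takes (task, feedback) and returns a
# candidate; a verifier returns (ok, feedback_message).
Generator = Callable[[str, Optional[str]], str]
Verifier = Callable[[str], Tuple[bool, str]]


def run_with_repair(task: str, generate: Generator, verify: Verifier,
                    max_attempts: int = 3) -> Tuple[Optional[str], int]:
    """Generate, check deterministically, and retry with actionable feedback.

    Returns (accepted_candidate_or_None, attempts_used).
    """
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        candidate = generate(task, feedback)
        ok, message = verify(candidate)
        if ok:
            return candidate, attempt
        # Pass the verifier's message back instead of a bare reject signal.
        feedback = message
    return None, max_attempts


# Toy stand-ins for a frozen model and a deterministic checker (demo only).
def toy_generate(task: str, feedback: Optional[str]) -> str:
    return "1,2,3" if feedback else "3,1,2"


def toy_verify(candidate: str) -> Tuple[bool, str]:
    values = [int(x) for x in candidate.split(",")]
    if values == sorted(values):
        return True, ""
    return False, "output is not sorted ascending"


result, attempts = run_with_repair("sort 3,1,2", toy_generate, toy_verify)
print(result, attempts)
```

The point of the design is that the verifier returns a message the generator can act on, which mirrors the "actionable feedback rather than scalar rejection" item; the attempt cap bounds the extra latency that the open questions flag.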

Theme: RAG and persistent-context systems face new attack classes

Why it matters: The attack surface is moving from direct prompt attacks to indirect manipulation of retrieved documents, persistent memory, and safety-trained behaviors. This is operationally important because these attacks can scale through shared corpora and standard agent pipelines.
Representative papers:
Common approach:
- Model attacks as control over external context rather than over the user prompt.
- Evaluate stealthy objectives: preserve answer correctness, evade detection, or satisfy agreement predicates.
- Introduce controller or audit layers that gate context updates, routing, or execution.
- Measure system-level outcomes such as token-cost inflation, brand suppression, drift, or routed unsafe actions.
Open questions / failure modes:
- Most defenses are still prompt- or controller-level and may not address root retrieval vulnerabilities.
- Several studies use scoped settings: static corpora, white-box access, or finite horizons.
- Mechanisms behind safety overreaction and context drift remain underdetermined.
- End-to-end retrieval, chunking, and downstream action loops are often not fully modeled.

Theme: Oversight tools are brittle unless grounded, calibrated, and culturally aware

Why it matters: A common result across judging, red-teaming, and multilingual safety is that apparent safety or evaluation quality can be misleading. Systems can look aligned because they misunderstand the input, because judges are biased, or because translated prompts miss real local threat context.
Representative papers:
Common approach:
- Replace shallow labels with expert rubrics, culturally adapted prompts, or scenario-specific references.
- Audit judges for position bias, context overflow, self-preference, and refusal artifacts.
- Distinguish true refusal/alignment from comprehension failure.
- Use denser reward/eval signals to expose trainable failure modes.
Open questions / failure modes:
- LLM judges still show instability, family bias, and scenario dependence.
- Cultural adaptation is expensive and often human-dependent.
- Expert rubric construction yields high-quality signals but is labor-intensive.
- Better evaluation does not automatically yield robust training signals without judge calibration.
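
One way to separate true refusal from comprehension failure is to report them as distinct rates rather than a single "unsafe rate". The sketch below assumes each response has already been annotated with one of four labels; the label set and the `safety_by_failure_share` metric are illustrative assumptions, not definitions taken from the papers above.

```python
from collections import Counter
from typing import Dict, Iterable

# Hypothetical label set, assumed to come from human or rubric-based annotation.
LABELS = ("unsafe", "refusal", "miscomprehension", "safe_answer")


def safety_breakdown(labels: Iterable[str]) -> Dict[str, float]:
    """Per-label rates plus the share of non-unsafe responses that are
    explained by the model not understanding the prompt."""
    counts = Counter(labels)
    unknown = set(counts) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no labels provided")
    rates = {name: counts[name] / total for name in LABELS}
    non_unsafe = total - counts["unsafe"]
    rates["safety_by_failure_share"] = (
        counts["miscomprehension"] / non_unsafe if non_unsafe else 0.0
    )
    return rates


demo = (["refusal"] * 2 + ["miscomprehension"] * 6
        + ["unsafe"] * 1 + ["safe_answer"] * 1)
report = safety_breakdown(demo)
print(report)
```

In this toy sample the unsafe rate is only 10%, yet two thirds of the non-unsafe responses are miscomprehensions, which is the kind of false comfort a single top-line number would hide.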

Theme: Alignment is moving toward system contracts, auditable boundaries, and targeted adaptation

Why it matters: Several papers move beyond generic “safer model” claims toward explicit contracts: what information may cross boundaries, what values should be followed, what gets erased, or what safety behavior can be distilled onto constrained hardware.
Representative papers:
Common approach:
- Define explicit invariants or targets: non-export of per-device state, value-conditioned action choice, erased concepts, or distilled refusal behavior.
- Use lightweight adaptation layers or localized edits rather than full retraining.
- Evaluate robustness under transfer, relearning, or conflicting instructions.
- Pair empirical results with auditable artifacts or formal framing.
Open questions / failure modes:
- Many results are scoped to small/medium models, LoRA regimes, or synthetic settings.
- Alignment to explicit values remains weak when it conflicts with model defaults.
- Distilled or edited systems may inherit base-model vulnerabilities.
- Auditability does not imply full privacy, DP, or adversarial robustness.

3) Technical synthesis

Verifiability is becoming a design primitive: papers repeatedly use deterministic checks, execution traces, checkpoint scoring, schema validation, or formal parsers/model checkers instead of relying on free-form self-evaluation.
Several strong results come from decomposing tasks into controllable subproblems: planner/executor in CAHL, reasoner/generator in Cosmos 3, boundary/global planes in Echelon, and adapter/orchestrator in Claw-SWE-Bench.
Reward shaping is getting denser and more structured: RLVR with expert rubrics, MA-GRPO for adversarial document generation, and high-/low-level verifiable rewards for tool use all replace sparse end-task rewards.
Cross-agent disagreement is increasingly used as a signal, but papers show it must be grounded: debate helps detection but can hurt generation, and GT-MCP checks causal consistency and drift rather than agreement alone.
Long-context evaluation is a weak point across domains: long-form judges suffer from overflow and position bias, persistent-context systems drift over time, and workplace agents degrade with task concurrency.
Safety failures often arise from system interactions rather than base-model intent: RAG poisoning, harness bugs, unsafe proxies, and world-model poisoning all exploit surrounding infrastructure.
Multiple papers show that “more calls” is not the explanation for gains: MUSE beats compute-matched self-consistency, and grounded critics outperform generic verbal or scalar critics.
Multilingual safety evaluation needs disentangling of capability vs alignment: low unsafe rates can reflect poor comprehension, and direct translation systematically underestimates risk.
Robustness work is shifting from direct prompt attacks to supply-chain and indirect attacks: poisoned corpora, training data backdoors, world-model poisoning, and leaked API credentials.
Cost is now part of the benchmark contract: Claw-SWE-Bench, OpenPCC, Echelon, and UNIVID all report latency, throughput, or dollar cost alongside quality.

4) Top 5 papers (with “why now”)

1. Benchmarking LLM-as-a-Judge for Long-Form Output Evaluation

Introduces LongJudgeBench for document-level judging across five scenarios and six datasets, with outputs averaging 9,249.7 tokens.
Shows current long-form judges are only modestly reliable: mean accuracy 0.5627, with best configuration Qwen3-Max + Reference at 0.6721.
Identifies practical failure modes that matter immediately for research-agent products: position bias, context-window overflow, and safety-policy rejections.
Why now: teams are increasingly using LLM judges for long reports, research agents, and review workflows, but this paper shows those pipelines are much less trustworthy than short-form judge results suggest.
Skeptical about / limitation: benchmark coverage is broad but not exhaustive, and it does not test more advanced judge architectures like retrieval-augmented or multi-agent judging.
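
A basic position-bias check for a pairwise judge is to ask the same comparison twice with the candidates swapped and count how often the verdict follows the content rather than the slot. The sketch below is a generic illustration, not the LongJudgeBench protocol; the `PairwiseJudge` interface and both toy judges are hypothetical.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical judge interface: given (prompt, output_a, output_b),
# return "A" or "B" for the preferred output.
PairwiseJudge = Callable[[str, str, str], str]


def position_consistency(judge: PairwiseJudge,
                         items: Iterable[Tuple[str, str, str]]) -> float:
    """Fraction of items whose verdict survives swapping the two outputs."""
    consistent = 0
    total = 0
    for prompt, first, second in items:
        original = judge(prompt, first, second)
        swapped = judge(prompt, second, first)
        # A position-independent judge prefers the same underlying output both
        # times, so the returned letter must flip when the order flips.
        if (original, swapped) in {("A", "B"), ("B", "A")}:
            consistent += 1
        total += 1
    if total == 0:
        raise ValueError("no items to evaluate")
    return consistent / total


# Toy judges for demonstration only.
def always_first(prompt: str, a: str, b: str) -> str:
    return "A"


def prefers_longer(prompt: str, a: str, b: str) -> str:
    return "A" if len(a) >= len(b) else "B"


pairs = [("q1", "short", "a much longer answer"),
         ("q2", "detailed reply", "ok")]
biased_score = position_consistency(always_first, pairs)
stable_score = position_consistency(prefers_longer, pairs)
print(biased_score, stable_score)
```

A judge that always picks the first slot scores 0.0 here, while a content-driven judge scores 1.0. A low score on real data suggests that single-order verdicts should not be treated as ground truth.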

2. Inference Cost Attacks for Retrieval-Augmented Large Language Models

Formalizes retrieval-augmented inference cost attacks where poisoned external documents inflate token usage while preserving answer correctness.
CREEP + MA-GRPO achieves large cost amplification, with a reported maximum weighted token consumption ratio of 13.12× against GPT-5.
Shows transfer across datasets and victim models, suggesting attack patterns are not narrowly overfit.
Why now: RAG is becoming default infrastructure, and this paper reframes poisoning as an availability/cost attack rather than only a factuality attack.
Skeptical about / limitation: evaluation scope is limited to three QA datasets and a black-box attacker who can inject retrievable documents.
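
To track this risk in a deployed RAG system, a cost-amplification ratio can be computed from token usage with and without the suspect documents in the retrieval set. The sketch below is an assumption-laden illustration: the paper's exact weighting scheme is not reproduced here, and the `Usage` type and the default input/output weights are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Usage:
    input_tokens: int
    output_tokens: int


def weighted_tokens(usage: Usage, input_weight: float = 1.0,
                    output_weight: float = 4.0) -> float:
    """Weighted token count; the default weights are illustrative placeholders,
    for example reflecting that output tokens are often priced higher."""
    return input_weight * usage.input_tokens + output_weight * usage.output_tokens


def amplification_ratio(clean: Usage, attacked: Usage, **weights: float) -> float:
    """Weighted cost of the attacked run divided by the clean baseline."""
    baseline = weighted_tokens(clean, **weights)
    if baseline <= 0:
        raise ValueError("clean usage must be positive")
    return weighted_tokens(attacked, **weights) / baseline


clean = Usage(input_tokens=1000, output_tokens=250)
attacked = Usage(input_tokens=1200, output_tokens=1200)
ratio = amplification_ratio(clean, attacked)
print(round(ratio, 2))
```

Because the attack described above preserves answer correctness, accuracy monitoring alone would not flag it; a ratio like this one, tracked per query or per corpus update, is one possible alerting signal.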

3. MUSE: A Unified Agentic Harness for MLLMs

Demonstrates that a black-box harness with verifiers, perception tools, and repair loops can materially improve frozen MLLMs across diverse visual tasks.
Gains are large and concrete: e.g., GPT-4o on CoMT improves from 101 to 175 correct; Word Search improves from 3 to 21.
Ablations show improvements are not just from extra sampling; compute-matched self-consistency does not explain the gains.
Why now: frontier multimodal models are changing quickly, and harness-level improvements are one of the few durable, model-agnostic levers available to product teams.
Skeptical about / limitation: applicability depends on having reliable task-specific verifiers and deterministic tools.

4. When Helping Hurts and How to Fix It: Multi-Agent Debate for Data Cleaning

Provides a rare negative result on debate: it degrades generative workflow quality while strongly improving error detection.
Identifies critique-induced confusion as the mechanism and gives a predictive condition for when debate helps: critic verification odds weighted by fixability must exceed generator accuracy odds.
Shows a practical fix: code-execution grounding plus evidence-gated generation yields the first significant debate win over single-agent generation (+5.3pp).
Why now: multi-agent debate is being adopted broadly, often without task-specific justification; this paper gives a decision rule instead of blanket optimism.
Skeptical about / limitation: tested topology is mainly a two-agent Generator–Critic setup on relatively small tables.
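
The predictive condition can be written as a small check. This is one literal reading of the sentence above (fixability multiplied by the critic's verification odds, compared against the generator's accuracy odds); the paper's exact formulation may differ, and the function and parameter names are hypothetical.

```python
def odds(p: float) -> float:
    """Convert a probability in [0, 1) to odds p / (1 - p)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("probability must be in [0, 1)")
    return p / (1.0 - p)


def debate_likely_helps(critic_verify_acc: float, fixability: float,
                        generator_acc: float) -> bool:
    """Assumed rule: debate helps when fixability-weighted critic odds
    exceed the generator's accuracy odds."""
    return fixability * odds(critic_verify_acc) > odds(generator_acc)


# Strong critic, moderately accurate generator: debate is predicted to help.
helps = debate_likely_helps(critic_verify_acc=0.90, fixability=0.5, generator_acc=0.70)
# Weaker critic, already-accurate generator: debate is predicted to hurt.
hurts = debate_likely_helps(critic_verify_acc=0.80, fixability=0.5, generator_acc=0.90)
print(helps, hurts)
```

Under this reading, adding a critic to an already accurate generator is the risky case, which is consistent with the negative result on generative workflows.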

5. OpenPCC: Open and Confidential LLM Serving on Commodity TEEs

Presents an open confidential inference stack using Intel TDX + NVIDIA H100 confidential computing, with composite attestation binding session keys to attested code.
Reports low serving overhead on Llama-3 8B: median time-to-first-token (TTFT) overhead of 6.73% and decode throughput overhead of around 3.78%.
Separates OpenPCC’s software overhead from the underlying TEE hardware floor, making the deployment tradeoff clearer.
Why now: confidential inference is moving from vendor-specific claims to auditable infrastructure requirements, especially for enterprise and regulated deployments.
Skeptical about / limitation: current prototype is single-GPU, does not fully solve network anonymity, and leaves side channels out of scope.

5) Practical next steps

Add deployment-shaped evals to your stack: at minimum, test long-form judging, persistent-context drift, task concurrency, and recovery behavior rather than only final-answer accuracy.
Treat harness design as a tunable product surface: benchmark verifier-guided repair, grounded critics, and adapter quality before assuming model upgrades are the main lever.
For RAG systems, measure three separate risks: factual corruption, token-cost amplification, and safety-overreaction/suppression effects from injected context.
Audit multilingual safety with culturally adapted prompts, not direct translations alone; separately track refusal rate and comprehension to avoid “safety-by-failure” false comfort.
If using LLM judges, add reference/rubric variants, position-bias checks, and overflow diagnostics; avoid treating a single judge score as ground truth for long outputs.
For tool or GUI agents, log invalid calls, redundant calls, silent failures, and pre-execution critic interventions; these are often more actionable than task success alone.
In regulated or enterprise settings, define explicit system contracts: what state may cross boundaries, what evidence is required for approval, and what artifacts are auditable after the fact.
For safety adaptation on constrained devices, test lightweight distillation or soft-prompt methods against a dual-model guard baseline, and include over-refusal and adversarial robustness checks.
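
For the tool and GUI agent logging step above, a per-episode summary could look like the following sketch. The `ToolCall` record and its fields are hypothetical; in particular, "silent failure" is defined here, as an assumption, as a call that did not succeed but surfaced no error to the agent.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class ToolCall:
    name: str
    args: str              # serialized arguments
    valid: bool            # passed schema validation
    error: Optional[str]   # error surfaced to the agent, if any
    succeeded: bool        # outcome as observed from the environment


def summarize_calls(calls: Iterable[ToolCall]) -> Dict[str, int]:
    """Count invalid, redundant (exact repeat), and silently failing calls."""
    stats = {"total": 0, "invalid": 0, "redundant": 0, "silent_failure": 0}
    seen = set()
    for call in calls:
        stats["total"] += 1
        if not call.valid:
            stats["invalid"] += 1
        key = (call.name, call.args)
        if key in seen:
            stats["redundant"] += 1
        seen.add(key)
        if not call.succeeded and call.error is None:
            stats["silent_failure"] += 1
    return stats


episode = [
    ToolCall("click", "x=1", valid=True, error=None, succeeded=True),
    ToolCall("click", "x=1", valid=True, error=None, succeeded=False),
    ToolCall("type", "", valid=False, error="bad args", succeeded=False),
]
summary = summarize_calls(episode)
print(summary)
```

Counters like these can be reported next to task success so that two agents with the same success rate can still be distinguished by how wastefully or opaquely they got there.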

Generated from per-paper analyses; no external browsing.
