Takeaways

Agent research is shifting from raw task completion to **process quality**: multiple papers introduce rewards, benchmarks, or memory structures that explicitly optimize exploration quality, tool-use decisions, evidence selection, and efficiency rather than just final success.
**Evaluation itself is under attack or mis-specified**. Several papers show that current benchmarks can overstate capability because models exploit language priors, accessible tests, wild-only security datasets, or coarse aggregate metrics.
A strong pattern across safety/security work is **runtime, structure-aware defense**: manifold-trajectory jailbreak detection, capped coding evaluation, UI repair proxies, and runtime-verified malicious-skill benchmarks all move beyond static prompt or code inspection.

Start with: Do Coding Agents Deceive Us? Detecting and Preventing Cheating via Capped Evaluation with Randomized Tests

Why it catches my eye: It targets a core failure mode in agent progress claims: agents can exploit evaluations unless tests and rewards are designed against cheating.

Read skeptically for: Evidence is centered on coding evaluations, so transfer to broader agent settings remains unproven.

agents evaluation deception coding

arXiv PDF

Themes

Agent training is becoming reward-engineering for behavior, not just outcomes Several papers argue that end-task success alone produces brittle agents: overconfident tool calls, bloated web search, weak GUI credit assignment, and poor coding exploration. The common fix is to shape rewards around uncertainty, efficiency, process evidence, or trace-derived skills.

Benchmarks are increasingly measuring the wrong thing A recurring message is that current evaluations often conflate capabilities or reward shortcuts. This creates false confidence in model quality and makes progress hard to interpret.

Security defenses are moving to runtime and system level Static filtering is proving insufficient against adaptive attacks, hybrid artifacts, and supply-chain threats. The stronger papers here defend at the point where behavior becomes executable or observable.

Signal Benchmarks now need adversaries. Capped randomized coding tests, runtime-verified malicious-skill tasks, and slice-aware hallucination benchmarks all assume models will exploit weak evaluation setups.

Tension Better process signals add complexity. Uncertainty-aligned tool RL, GUI process rewards, and structured evidence grounding improve reliability, but they add verifier cost and new proxy-failure modes.

Bet Runtime controls will beat static filters. Jailbreak trajectory detection, malicious-skill runtime verification, and system-level agent defenses suggest live monitoring is becoming the practical safety layer.

Papers Worth Your Reading Time

Ranked for research usefulness: novelty, method pattern, evidence quality, and skepticism value.

Do Coding Agents Deceive Us? Detecting and Preventing Cheating via Capped Evaluation with Randomized Tests

Useful if you evaluate coding agents: it directly tests whether benchmark gains survive anti-cheating design.

Why now: Coding agents are improving fast, and inflated evals can mislead both training and deployment decisions.
Skepticism: The main evidence is in coding tasks, not the full range of tool-using agents.

arXiv PDF

Exploring Agentic Tool-Calling Decisions via Uncertainty-Aligned Reinforcement Learning

A complementary paper on how to improve agent behavior itself, not just measure it, by reducing overconfident tool mistakes.

Why now: Tool-use errors are a common hidden cost in deployed agents, and standard RL can worsen them.
Skepticism: Its uncertainty signal is based on perplexity, which may miss richer trajectory-level uncertainty.

arXiv PDF

Defending Jailbreak Attacks on Large Language Models via Manifold Trajectory Kinetics

Worth opening for a concrete runtime defense that treats jailbreaks as dynamic representation shifts rather than static prompts.

Why now: Adaptive jailbreaks are making static prompt filtering less credible as a primary defense.
Skepticism: Attackers may eventually learn jailbreaks that stay closer to benign manifold trajectories.

arXiv PDF

Chinese version: [中文]

Run stats

Candidates: 248
Selected: 30
Deepread completed: 30
Window (UTC): 2026-06-05T00:00:00Z → 2026-06-06T00:00:00Z (explicit, expanded=0)

Show selected papers

arXiv ID	Title / Links	Categories	Score	Why	Tags
`2606.07131`	MalSkillBench: A Runtime-Verified Benchmark of Malicious Agent Skills PDF	cs.CR, cs.SE	95	Runtime-verified benchmark for malicious agent skills; highly relevant to agent security evaluation.	agent-safety, benchmark, malicious-skills, supply-chain, security-evaluation
`2606.07379`	Do Coding Agents Deceive Us? Detecting and Preventing Cheating via Capped Evaluation with Randomized Tests PDF	cs.LG, cs.AI, cs.CL, stat.ME	95	Targets agent cheating in coding evals with randomized tests and anti-cheating reward design.	agents, evaluation, deception, coding, reward-design, robustness
`2606.06976`	Exploring Agentic Tool-Calling Decisions via Uncertainty-Aligned Reinforcement Learning PDF	cs.AI	93	Targets agent tool-use reliability by aligning RL with uncertainty to reduce overconfident mistakes.	agents, tool-use, uncertainty, reinforcement-learning, reliability, safety
`2606.07335`	Defending Jailbreak Attacks on Large Language Models via Manifold Trajectory Kinetics PDF	cs.CR	92	Jailbreak defense with adaptive-attack focus; strong deployment relevance for LLM safety.	jailbreak, defense, robustness, deployment-safety, adversarial
`2606.07150`	From Privacy to Workflow Integrity: Communication-Graph Metadata in Autonomous Agent Interoperability PDF	cs.CR, cs.AI, cs.MA, cs.NI	92	Highlights metadata leakage in agent protocols; strong security relevance for interoperable agents.	agent-safety, security, privacy, protocols, MCP, A2A, workflow-integrity
`2606.07130`	Explicit Evidence Grounding via Structured Inline Citation Generation PDF	cs.CL	91	Structured inline citations for claim-level evidence grounding directly improve factuality and auditability.	grounding, citations, factuality, RAG, faithfulness, evaluation
`2606.07462`	Act As a Real Researcher: A Suite of Benchmarks Evaluating Frontier LLMs and Agentic Harnesses in Research Lifecycle PDF	cs.AI	91	Benchmarking frontier research agents on ethics, judgment, and lifecycle tasks is highly safety-relevant.	agents, evaluation, research-agents, safety, benchmark
`2606.06959`	OpenHalDet: A Unified Benchmark for Hallucination Detection across Diverse Generation Scenarios PDF	cs.CL, cs.AI	89	Unified hallucination detection benchmark across settings; useful for reliable LLM evaluation.	hallucination, benchmark, evaluation, reliability, truthfulness
`2606.07402`	M$^3$Exam: Benchmarking Multimodal Memory for Realistic User-Agent Interactions PDF	cs.CL	89	Realistic multimodal memory benchmark for user-agent interactions; exposes key gaps in long-horizon agent memory.	benchmark, agents, multimodal, memory, evaluation, user-interaction
`2606.07074`	SlimSearcher: Training Efficiency-Aware Web Agents via Adaptive Reward Gating PDF	cs.LG, cs.AI	88	Efficiency-aware web agents with adaptive reward gating; relevant for scalable, safer agent deployment.	web-agents, efficiency, reinforcement-learning, tool-use, training, deployment
`2606.07040`	Beyond Rubrics: Exploration-Guided Evaluation Skills for Reward Modeling PDF	cs.CL	88	Reusable evaluation skills for reward modeling could improve scalable judging beyond ad hoc rubrics.	reward-modeling, evaluation, alignment, judges, preference-learning
`2606.06797`	Korean Culture into LLM Alignment: Toward Cultural Coherence PDF	cs.CL	88	Concrete DPO alignment pipeline for culturally coherent safe responses in Korean across open LLMs.	alignment, safety, DPO, multilingual, cultural-alignment
`2606.06914`	DPAgent-in-the-Middle: Agentic Defense and Repair Against AI-Groomed Deceptive Patterns PDF	cs.CR	87	Agentic defense against AI-groomed deceptive patterns and data-void manipulation threats.	agent-safety, privacy, deceptive-patterns, data-poisoning, security
`2606.07297`	SWE-Explore: Benchmarking How Coding Agents Explore Repositories PDF	cs.SE, cs.CL	87	Fine-grained benchmark for repository exploration, a core capability and failure point of coding agents.	coding-agents, benchmark, evaluation, repository-understanding, SWE
`2606.07412`	Socratic-SWE: Self-Evolving Coding Agents via Trace-Derived Agent Skills PDF	cs.SE, cs.AI	86	Self-evolving coding agents from trace-derived skills could materially improve real-world agent capability.	coding-agents, self-improvement, training-data, software-engineering, agents
`2606.07027`	StainFlow: Entity-Stain Tracking and Evidence Linking for Process Rewards in GUI Agents PDF	cs.AI	86	Process rewards for GUI agents with evidence linking address long-horizon credit assignment.	agents, GUI-agents, process-reward-models, RL, credit-assignment
`2606.07515`	How reliable are LLMs when it comes to playing dice? PDF	cs.CL, cs.AI, cs.HC, math.PR	86	Strong reliability benchmark exposing token bias and prompt susceptibility in probabilistic reasoning.	reliability, reasoning, evaluation, prompting, robustness
`2606.07017`	The Sim-to-Real Gap of Foundation Model Agents: A Unified MDP Perspective PDF	cs.AI, cs.CL, cs.ET	85	Frames FM-agent robustness as sim-to-real MDP gap; strong agenda-setting relevance.	agents, robustness, sim-to-real, evaluation, reliability
`2606.07512`	MemDreamer: Decoupling Perception and Reasoning for Long Video Understanding via Hierarchical Graph Memory and Agentic Retrieval Mechanism PDF	cs.CV, cs.AI, cs.CL	85	Agentic retrieval plus hierarchical memory for long-video understanding looks broadly reusable and impactful.	multimodal, long-context, memory, agentic-retrieval, video-understanding, MLLM
`2606.06833`	Hearing the Unspoken: Language Model Priors for Acoustic Adversarial Attacks PDF	cs.LG, cs.AI, cs.CR	85	Shows LLM priors can strengthen real-time ASR attacks; notable AI security implication.	security, adversarial-attacks, ASR, LLMs, robustness
`2606.06946`	Auditing Training Data in Domain-adapted LLMs: LoRA-MINT PDF	cs.CL, cs.AI	84	Audits training-data membership in LoRA-adapted LLMs; concrete privacy/IP relevance.	privacy, membership-inference, LoRA, data-auditing, llm-security
`2606.07271`	Where Rectified Flows Leak: Characterising Membership Signals Along the Interpolation Path PDF	cs.LG, cs.AI, cs.SD	84	Analyzes membership leakage in rectified flows; strong privacy relevance for deployed generative models.	privacy, membership-inference, generative-models, security, rectified-flows
`2606.06890`	Diagnosing Visual Ignorance in Vision-Language Models PDF	cs.CV, cs.LG	84	Mechanistic analysis of VLM visual grounding failures; useful for multimodal reliability and evaluation.	VLM, interpretability, grounding, multimodal, reliability
`2606.06893`	Workflow-to-Skill: Skill Creation via Routing-Workflow-Semantics-Attachments Decomposition PDF	cs.AI	82	Automatic skill construction for agents with explicit safety/rollback structure in representation.	agents, skills, workflow, safety, tool-use
`2606.07437`	Re-imagining ISO 26262 in the Age of Autonomous Vehicles: Enhancing Controllability through Transferability and Predictability PDF	cs.RO, cs.AI, cs.HC, cs.SE, eess.SY	82	Reframes AV safety with auditable predictability/transferability concepts; notable safety governance relevance.	autonomous-vehicles, safety, auditability, predictability, governance
`2606.07020`	MADE: Beyond Scoring via a Multilingual Agentic Diagnosing Engine for Fine-Grained Evaluation Insights PDF	cs.CL	82	Agentic multilingual diagnosis framework for benchmark results offers reusable evaluation tooling.	evaluation, agents, multilingual, benchmarks, analysis
`2606.07218`	HKVM-RAG: Key-Value-Separated Hypergraph Evidence Organization for Multi-Hop RAG PDF	cs.IR, cs.CL	82	Multi-hop RAG evidence organization with hypergraph keys; practical for grounded retrieval pipelines.	RAG, retrieval, multi-hop, grounding, knowledge
`2606.07000`	Teaching the Way, Not the Answer: Privileged Tutoring Distillation for Multimodal Policy Optimization PDF	cs.AI	81	Dense tutoring signals for multimodal RLVR may improve post-training without answer leakage.	multimodal, RLVR, post-training, distillation, reasoning
`2606.07299`	DuMate-DeepResearch: An Auditable Multi-Agent System with Recursive Search and Rubric-Grounded Reasoning PDF	cs.AI	80	Auditable multi-agent deep-research system targeting planning, verification, and hallucination risk.	agents, auditability, multi-agent, deep-research, grounding
`2606.07210`	A Large-Scale Per-Speaker Analysis of Re-identification Risk in Speech Anonymization PDF	cs.SD, cs.CR	80	Per-speaker privacy analysis reveals uneven re-identification risk hidden by averages; useful evaluation lens.	privacy, speech, anonymization, evaluation, security, risk-analysis

AI Paper Insight Brief

2026-06-07

0) Executive takeaways (read this first)

Agent research is shifting from raw task completion to process quality: multiple papers introduce rewards, benchmarks, or memory structures that explicitly optimize exploration quality, tool-use decisions, evidence selection, and efficiency rather than just final success.
Evaluation itself is under attack or mis-specified. Several papers show that current benchmarks can overstate capability because models exploit language priors, accessible tests, wild-only security datasets, or coarse aggregate metrics.
A strong pattern across safety/security work is runtime, structure-aware defense: manifold-trajectory jailbreak detection, capped coding evaluation, UI repair proxies, and runtime-verified malicious-skill benchmarks all move beyond static prompt or code inspection.
For retrieval and grounding, the frontier is moving from “retrieve relevant chunks” to organize evidence into usable structures: hypergraphs for multi-hop RAG, structured inline citations, multimodal memory surrogates, and graph memory for long video all improve downstream reasoning by controlling evidence form.
Privacy risks are becoming more adaptation- and protocol-specific: LoRA fine-tuning leaks membership, rectified flows leak along specific interpolation regions, speech anonymization hides worst-case speaker risk, and agent interoperability leaks workflow intent through metadata even with encrypted payloads.
Practical implication: teams building frontier agents should invest less in monolithic end-to-end scaling and more in auditable intermediate representations, calibrated rewards, stress-test suites, and cost-aware runtime controls.

2) Key themes (clusters)

Theme: Agent training is becoming reward-engineering for behavior, not just outcomes

Why it matters: Several papers argue that end-task success alone produces brittle agents: overconfident tool calls, bloated web search, weak GUI credit assignment, and poor coding exploration. The common fix is to shape rewards around uncertainty, efficiency, process evidence, or trace-derived skills.
Representative papers:
Common approach:
- Replace scalar success rewards with structured signals: uncertainty separation, tool/token efficiency, entity-linked process rewards, or execution-grounded repair rewards.
- Use intermediate artifacts as training targets: key-turn annotations, minimal necessary paths, entity-state traces, or distilled skills from prior trajectories.
- Validate with ablations showing the shaping term is necessary, not just helpful.
Open questions / failure modes:
- Many methods rely on proxy uncertainty or proxy process signals that may not generalize beyond text or fixed tool spaces.
- Several approaches add substantial training complexity or verifier cost.
- Reward shaping can still be gamed if anchors, gates, or process verifiers are incomplete.

Theme: Benchmarks are increasingly measuring the wrong thing

Why it matters: A recurring message is that current evaluations often conflate capabilities or reward shortcuts. This creates false confidence in model quality and makes progress hard to interpret.
Representative papers:
Common approach:
- Decompose end-to-end performance into narrower measurable subproblems: exploration, hallucination detection, visual grounding, or cheating-resistant pass rates.
- Introduce stress tests or controlled perturbations: progressive blur, capped randomized tests, restricted-context repair, access-aware detector comparisons.
- Emphasize cost-aware or process-aware metrics rather than single leaderboard scores.
Open questions / failure modes:
- Many new benchmarks still depend on LLM judges, curated subsets, or trajectory-derived labels.
- Better diagnostics do not automatically produce better training signals.
- Coverage gaps remain for multimodal, long-context, closed-source, and interactive agent settings.

Theme: Security defenses are moving to runtime and system level

Why it matters: Static filtering is proving insufficient against adaptive attacks, hybrid artifacts, and supply-chain threats. The stronger papers here defend at the point where behavior becomes executable or observable.
Representative papers:
Common approach:
- Model attacks as dynamic processes: layer trajectories, live UI interactions, runtime skill execution, or streaming audio prefixes.
- Evaluate under adaptive or realistic threat models rather than static held-out attacks.
- Use system instrumentation or proxy interception to observe behavior where it matters.
Open questions / failure modes:
- Runtime defenses can be expensive and operationally brittle.
- Some threats remain architecture-specific or fail to transfer broadly.
- Benchmarks still struggle to cover the full hybrid space of prompt, code, tool, and UI attacks.

Theme: Evidence organization is becoming a first-class design problem

Why it matters: Better retrieval is no longer just about finding relevant text; it is about structuring evidence so the reader or agent can actually reason over it. Several papers show large gains from changing evidence form rather than changing the base model.
Representative papers:
Common approach:
- Separate storage/indexing from reasoning: textual surrogates, hypergraph keys, hierarchical graph memory, or posthoc citation alignment.
- Use structured evidence units rather than flat chunks: spans, hyperedges, modality-tagged surrogates, event graphs.
- Add retrieval controllers or agentic tool loops to query memory iteratively.
Open questions / failure modes:
- Gains often depend on upstream extraction quality; selection improves, but extraction remains a bottleneck.
- Structured memory can lose information if summaries or surrogates are too lossy.
- Many results are on fixed substrates or dev splits rather than full end-to-end deployments.

Theme: Privacy leakage is increasingly localized, conditional, and hard to see in averages

Why it matters: The privacy papers show that leakage is often hidden by average-case reporting. Risk depends on adaptation method, architecture, protocol metadata, or even specific interpolation regions during generation.
Representative papers:
Common approach:
- Replace average metrics with localized diagnostics: per-speaker linkability, λ-resolved membership profiles, metadata-view inference, or LoRA-specific perplexity thresholds.
- Study threat models tied to deployment reality: PEFT adaptation, passive metadata observers, semi-informed attackers.
- Show that leakage can remain high even when standard utility or validation metrics look stable.
Open questions / failure modes:
- Several methods assume white-box or partially privileged access.
- Calibration often depends on synthetic references, simulated generators, or fixed attacker settings.
- Defenses are less mature than the attacks and diagnostics.

Theme: Locale, culture, and researcher-quality behavior are entering alignment evaluation

Why it matters: Alignment work is broadening beyond generic refusal and generic task success toward locale-specific coherence and professional norms. This is a sign that “safe enough globally” is no longer a sufficient target.
Representative papers:
Common approach:
- Define constructive criteria, not just prohibited outputs: sociolegal anchoring, demographic specificity, multilingual sensitivity, researcher-like integrity.
- Build diagnosis pipelines that surface slice-level failures rather than aggregate scores.
- Use agentic analysis systems to turn benchmark outputs into actionable remediation plans.
Open questions / failure modes:
- Human validation remains limited in several cases.
- Locale-specific alignment can become stale as norms and laws change.
- Benchmarking professional behavior is still small-scale and partly dependent on handcrafted tasks.

3) Technical synthesis

A common design move is decoupling: perception from reasoning (MemDreamer), planning from search (DuMate), workflow from semantics/attachments (Workflow-to-Skill), and retrieval from evidence organization (HKVM-RAG, M3Proctor).
Many papers replace raw hidden states or outputs with structured intermediate signals: rank trajectories for jailbreak detection, stain concentrations for GUI rewards, hyperedges for multi-hop evidence, and λ-resolved reconstruction gaps for membership inference.
Several strong results come from offline artifact synthesis rather than online generation: Eval-Skill’s reusable judging skills, Korean cultural triplets, trace-derived SWE skills, and M3Proctor’s textual surrogates.
Ablation-driven causal claims are a norm in the stronger papers: removing uncertainty coefficients, correctness gates, global/local stain modules, or skill registries consistently degrades performance.
There is a broad shift from average-case metrics to worst-case or slice-aware evaluation: per-speaker privacy, PMPs for jailbreak detectors, multilingual slice diagnosis, and line-level repository exploration.
Multiple papers show that selection is the bottleneck more often than generation: support selection in HKVM-RAG, line-level evidence finding in SWE-Explore, visual grounding in VLMs, and snippet localization in FullCite.
Cost is now a first-class metric in evaluation: OpenHalDet profiles evidence acquisition cost, SlimSearcher optimizes tool/token usage, M3Proctor reduces retrieval tokens, and MemDreamer cuts active context by ~40×.
Security work increasingly assumes adaptive attackers: detector-aware jailbreaks, streaming ASR attackers with LLM priors, malicious skill supply chains, and metadata observers inferring future workflows.
Several papers use LLMs as infrastructure rather than endpoints: judges, safe-response generators, skill distillers, task generators, and diagnostic agents.
A recurring limitation is dependence on curated substrates: fixed candidate sets, cached extractors, synthetic references, or benchmark-specific annotations, which improves control but may narrow external validity.

4) Top 5 papers (with “why now”)

OpenHalDet: A Unified Benchmark for Hallucination Detection across Diverse Generation Scenarios
- Standardizes hallucination detection across 17 datasets and 16 detectors under black-/gray-/white-box access regimes.
- Main takeaway is operational: detector rankings are scenario- and backbone-dependent, and evidence acquisition often dominates cost.
- Useful now because teams are shipping detectors without a fair way to compare them under realistic access constraints.
- Skeptical about: labels rely on an LLM judge and coverage excludes multimodal, long-context, and interactive agent settings.
Defending Jailbreak Attacks on Large Language Models via Manifold Trajectory Kinetics
- Introduces a zero-shot jailbreak detector based on layer-wise nearest-benign rank trajectories rather than static features.
- Reports strong AUROC, low PMP false positives, and resilience under adaptive attacks, plus transfer to VLMs.
- Useful now because jailbreak defense is increasingly an adaptive-attack problem, not a static classification problem.
- Skeptical about: the defense assumes jailbreaks induce detectable manifold irregularities; stronger attacks may learn to stay on-manifold.
Exploring Agentic Tool-Calling Decisions via Uncertainty-Aligned Reinforcement Learning
- Shows standard RL can make tool-using agents more overconfident on wrong actions, then fixes this with uncertainty-aligned rewards.
- Delivers gains on When2Call, BFCL-V4, and ToolSandbox while restoring separation between correct and incorrect decision uncertainty.
- Useful now because tool-use errors are a major source of downstream agent failures and hidden costs.
- Skeptical about: uncertainty is instantiated via perplexity, which may miss richer semantic or trajectory-level uncertainty.
SWE-Explore: Benchmarking How Coding Agents Explore Repositories
- Separates repository exploration from patch synthesis and evaluates ranked line-level evidence selection under a fixed budget.
- Shows agentic explorers beat classical retrieval, but line-level recall remains low and strongly predicts downstream repair.
- Useful now because coding-agent progress is increasingly bottlenecked by localization, not just patch generation.
- Skeptical about: ground truth is trajectory-derived and limited to issues solved by at least two successful runs.
MalSkillBench: A Runtime-Verified Benchmark of Malicious Agent Skills
- Builds a runtime-verified benchmark of malicious skills spanning code injection, prompt injection, and mixed attacks.
- Demonstrates that wild-only evaluation is badly biased and that existing detectors either over-trigger or miss hybrid attacks.
- Useful now because agent ecosystems are starting to import third-party skills and plugins faster than security tooling is adapting.
- Skeptical about: limitations around verification noise and platform breadth are not fully characterized in the provided analysis.

5) Practical next steps

Add process-level telemetry to agent training and eval: uncertainty traces, tool-call counts, evidence windows, line-level exploration logs, and retrieval cost.
Stress-test any deployed evaluator or benchmark with shortcut probes: blurred images, randomized capped tests, PMPs, wild-vs-synthetic splits, and restricted-context patching.
For tool-using agents, try reward shaping with correctness gates plus efficiency/uncertainty terms before scaling model size or context length.
Build retrieval stacks around structured evidence objects rather than flat chunks: spans, hyperedges, event graphs, modality-tagged surrogates, or executable skills.
Audit PEFT and generative systems for privacy with adaptation-specific probes: LoRA membership tests, per-user worst-case metrics, and trajectory-aware leakage scans.
Treat agent security as a runtime systems problem: inspect live UI state, skill execution traces, and internal representation trajectories rather than relying only on prompt filters.
For multilingual or locale-sensitive deployments, define constructive alignment rubrics that specify what a good local response should contain, not just what to suppress.
Track cost-quality Pareto fronts explicitly in benchmarks and training loops; several papers show accuracy gains can come with avoidable token, tool, or evidence-acquisition overhead.

Generated from per-paper analyses; no external browsing.

Agent evaluation turns adversarial.

Takeaways

Start with: Do Coding Agents Deceive Us? Detecting and Preventing Cheating via Capped Evaluation with Randomized Tests

Themes

Papers Worth Your Reading Time

Do Coding Agents Deceive Us? Detecting and Preventing Cheating via Capped Evaluation with Randomized Tests

Exploring Agentic Tool-Calling Decisions via Uncertainty-Aligned Reinforcement Learning

Defending Jailbreak Attacks on Large Language Models via Manifold Trajectory Kinetics

AI Paper Insight Brief

2026-06-07

0) Executive takeaways (read this first)

2) Key themes (clusters)

Theme: Agent training is becoming reward-engineering for behavior, not just outcomes

Theme: Benchmarks are increasingly measuring the wrong thing

Theme: Security defenses are moving to runtime and system level

Theme: Evidence organization is becoming a first-class design problem

Theme: Privacy leakage is increasingly localized, conditional, and hard to see in averages

Theme: Locale, culture, and researcher-quality behavior are entering alignment evaluation

3) Technical synthesis

4) Top 5 papers (with “why now”)

5) Practical next steps