Takeaways

Agent evaluation is shifting from static accuracy to **process-aware robustness**: today’s strongest papers measure infeasibility detection, recovery after self-caused errors, memory use, multimodal tool execution, and social interaction failure modes rather than just final-task success.
A recurring pattern is that **the scaffold matters as much as the base model**: verifier hooks, structured intermediates, explicit memory, targeted resampling, and domain-aware multi-agent decomposition often produce larger gains than generic prompting or naive RL.
Safety work is becoming more **deployment-realistic and localized**: several benchmarks target Korean cultural multimodal harms, Chinese obfuscation/evasion, bilingual medical information dilution, audio jailbreaks, and post-alignment fragility under quantization/noise.

Start with: How Coding Agents Fail Their Users: A Large-Scale Analysis of Developer-Agent Misalignment in 20,574 Real-World Sessions

Why it catches my eye: It grounds agent reliability in large real-world usage logs and yields a failure taxonomy teams can directly instrument against.

Read skeptically for: It only captures failures visible in logged user pushback, so silent or unreported misalignment may be missed.

Tags: coding agents, deployment, misalignment, real-world eval


Themes

Agent robustness is now about recovery, stopping, and memory
Many agent failures are no longer “can it solve the task?” but “can it notice it is stuck, remember transient facts, recover from its own mistakes, or stop when success is impossible?” These are directly tied to cost, safety, and user trust.

Verification-first architectures are beating unconstrained generation
Across security, factuality, and long-form generation, systems that separate proposing from checking are more reliable than end-to-end freeform generation. The strongest systems explicitly preserve provenance, enforce schemas, or halt when evidence is insufficient.

Safety evaluation is becoming culturally specific, multimodal, and operational
English-only, text-only safety benchmarks are missing real deployment risks. Today’s papers show materially different failure modes under Korean cultural context, Chinese obfuscation, audio attacks, and bilingual medical phrasing.

Signal: Agent eval is becoming process-aware. Feasibility awareness, GUI recovery, memory studies, phone-use environments, and coding-session logs all measure stopping, recovery, and user-visible failure modes.

Tension: Verification helps, but costs rise. Verifiable deep-research and neuro-symbolic pipelines improve grounding and auditability, but add latency, tooling complexity, and dependence on external evidence coverage.

Bet: Localized safety tests will matter more. Chinese, Korean, bilingual medical, and audio jailbreak benchmarks show English-only text safety evaluation misses operational harms and over-refusal trade-offs.

Papers Worth Your Reading Time

Ranked for research usefulness: novelty, method pattern, evidence quality, and skepticism value.

How Coding Agents Fail Their Users: A Large-Scale Analysis of Developer-Agent Misalignment in 20,574 Real-World Sessions

A rare large-scale real-world study that turns coding-agent failure into concrete, product-relevant categories.

Why now: Coding agents are already in production, so ecological validity matters more than sandbox wins.
Skepticism: Logged pushback may undercount silent failures, off-chat corrections, and unobserved user dissatisfaction.


Do Agents Know What They Can't Do? Evaluating Feasibility Awareness in Tool-Using Agents

It makes abstention measurable and shows that failing to stop wastes substantial tokens and reliability budget.

Why now: As tool-using agents get deployed, cost-bounded halting is becoming a practical requirement, not a nice-to-have.
Skepticism: Results rely on closed tool pools, so open-world feasibility awareness may be harder.


Towards Verifiable Multimodal Deep Research: A Multi-Agent Harness for Interleaved Report Generation

A strong blueprint for verifier-separated, evidence-grounded agent workflows rather than unconstrained report generation.

Why now: Deep-research products are proliferating, and auditability is becoming a differentiator.
Skepticism: High latency and pipeline complexity may limit practical deployment.


Run stats

Candidates: 8456
Selected: 30
Deepread completed: 30
Window (UTC): 2026-05-29T00:00:00Z → 2026-05-30T00:00:00Z (weekend_backlog_sat, expanded=0)

arXiv ID	Title / Links	Categories	Score	Why	Tags
`2605.29396`	Aligned but Fragile: Enhancing LLM Safety Robustness via Zeroth-Order Optimization PDF	cs.AI	95	Targets robustness of LLM safety alignment under noise/quantization; directly relevant to deployment safety.	llm-safety, alignment, robustness, post-training, optimization
`2605.29442`	How Coding Agents Fail Their Users: A Large-Scale Analysis of Developer-Agent Misalignment in 20,574 Real-World Sessions PDF	cs.SE, cs.AI, cs.HC	95	Large real-world study of coding-agent misalignment; highly relevant to agent safety and deployment.	agent-safety, coding-agents, misalignment, human-ai-interaction, deployment
`2605.30031`	Audio Jailbreaks in Large Audio-Language Models: Taxonomy, Attack-Defense Analysis, and Cost-Aware Evaluation PDF	cs.SD, cs.AI, cs.CL	94	Unified taxonomy and controlled eval of audio jailbreaks/defenses for agentic speech systems.	llm-safety, jailbreaks, audio-language-models, benchmark, defenses, red-teaming
`2605.29667`	Beyond English and Evasion: A Human-Annotated Multi-Domain Benchmark for High-Stakes LLM Safety Evaluation in Chinese PDF	cs.CL	94	Human-annotated Chinese safety benchmark targets real jailbreak/evasion gaps in high-stakes LLM deployment.	llm-safety, benchmark, jailbreaks, multilingual, evaluation, adversarial-prompts
`2605.29447`	Recovering Policy-Induced Errors: Benchmarking and Trajectory Synthesis for Robust GUI Agents PDF	cs.CV, cs.CL	92	Strong GUI agent robustness benchmark plus 800k recovery trajectories for error correction.	agents, gui-agents, robustness, benchmark, synthetic-data, error-recovery
`2605.29659`	Opir: Efficient Multi-Task Safety Classification for Toxicity, Jailbreaks, Hate Speech, and Harmful Content PDF	cs.LG, cs.AI, cs.CL	92	Practical multi-task guardrail classifiers for toxicity, jailbreaks, and harmful content with efficient edge variants.	llm-safety, guardrails, classification, jailbreak-detection, toxicity, deployment
`2605.30169`	Dissociative Identity: Language Model Agents Lack Grounding for Reputation Mechanisms PDF	cs.CY, cs.AI, cs.MA	92	Conceptual agent-safety paper on why identity/reputation may fail for LLM agents in the wild.	agents, governance, trust, reputation, agent-safety
`2605.27820`	EgoBench: An Interactive Egocentric Multimodal Benchmark for Tool-Using Agents PDF	cs.AI	92	Interactive multimodal benchmark for tool-using agents in realistic settings; strong eval value.	agents, multimodal, tool-use, benchmark, evaluation
`2605.28774`	Agent Explorative Policy Optimization for Multimodal Agentic Reasoning PDF	cs.CL	92	Targets tool-use failure in multimodal agents with RL fix for the thinking-acting gap.	agents, multimodal, tool-use, RL, reasoning
`2605.28224`	When Does Memory Help Multi-Trajectory Inference for Tool-Use LLM Agents? PDF	cs.AI	92	Systematic study of memory for tool-use agents across strategies; directly relevant to agent reliability.	agents, tool-use, memory, inference, reliability, evaluation
`2605.29910`	Agora: Toward Autonomous Bug Detection in Production-Level Consensus Protocols with LLM Agents PDF	cs.SE, cs.AI	92	Multi-agent LLM framework for protocol bug finding; strong agentic security relevance and concrete verification setup.	agents, security, code, verification, multi-agent, bug-detection
`2605.29269`	HunterAgent: Neuro-Symbolic Attack Trace Reconstruction under Anti-Forensics PDF	cs.CR	90	Neuro-symbolic verifier for attack-trace reconstruction addresses LLM hallucination in security workflows.	security, agents, verification, forensics, neuro-symbolic
`2605.28532`	Do Agents Know What They Can't Do? Evaluating Feasibility Awareness in Tool-Using Agents PDF	cs.AI	90	Evaluates whether tool-using agents detect infeasible tasks and stop early; strong practical safety value.	agents, tool-use, evaluation, reliability, efficiency, safety
`2605.28188`	Framing Matters: Addressing Framing Sensitivity in Decision-Making through Behaviorally-Grounded Value Alignment PDF	cs.CL	90	Strong alignment benchmark on framing sensitivity; exposes decision instability in high-stakes LLM use.	alignment, reliability, benchmark, decision-making, robustness, safety
`2605.29486`	PhoneWorld: Scaling Phone-Use Agent Environments PDF	cs.CL, cs.AI, cs.LG	90	Scalable phone-use agent environment pipeline with verifiers and rollouts; high reuse for agent evaluation.	agents, benchmark, evaluation, mobile, environments, tool-use
`2605.28013`	KSAFE-MM: A Multimodal Safety Benchmark via Localized Contextualization for Korean Cultural Risks PDF	cs.CL	89	Useful multimodal safety benchmark covering Korean and culture-specific risks beyond English.	multimodal-safety, benchmark, cultural-context, evaluation, mllm, localized-risks
`2605.29861`	Towards Verifiable Multimodal Deep Research: A Multi-Agent Harness for Interleaved Report Generation PDF	cs.CL, cs.AI	89	Verifiable multimodal deep-research harness with claim-grounded evidence and source-aligned visuals.	agents, verification, multimodal, deep-research, grounding
`2605.29324`	STAMP: Training Explicit Memory for Mobile GUI Agents in Controllable and Scalable Virtual Environments PDF	cs.CL, cs.CV	89	Explicit memory training for mobile GUI agents targets a key long-horizon failure mode.	agents, GUI agents, memory, long-horizon, virtual environments
`2605.30096`	How Reliable Are AI Attackers Against a Fixed Vulnerable Target? A 400-Run Empirical Study of LLM Penetration Testing Consistency PDF	cs.CR, cs.AI	89	Large empirical study of autonomous cyberattack consistency; important for agent risk assessment.	cybersecurity, agents, red-teaming, offensive-capability, evaluation, safety
`2605.28338`	SafeMed-R1: Clinician-Audited Safety and Ethics Alignment for Medical Large Language Models PDF	cs.AI	88	Clinician-audited medical LLM alignment with traceable reasoning and adversarial safety testing.	medical-llm, alignment, safety, auditing, red-teaming, ethics
`2605.29512`	MINDGAMES: A Live Arena for Evaluating Social and Strategic Reasoning in Multi-Agent LLMs PDF	cs.AI	88	Live arena for multi-agent social/strategic reasoning; useful for evaluating agentic risks and deception.	agents, multi-agent, evaluation, theory-of-mind, deception, benchmark
`2605.29568`	DeepTool: Scaling Interleaved Deliberation in Tool-Integrated Reasoning via Process-Supervised Reinforcement Learning PDF	cs.AI	88	Process-supervised RL for interleaved tool reasoning could improve capable, robust agent behavior.	tool-use, reasoning, reinforcement-learning, agents, process-supervision
`2605.14587`	Angel or Demon: Investigating the Plasticity Interventions' Impact on Backdoor Threats in Deep Reinforcement Learning PDF	cs.LG, cs.AI, cs.CR	88	Large empirical study of DRL backdoors under plasticity interventions; actionable security findings.	security, backdoors, deep-reinforcement-learning, robustness, empirical-study
`2605.28726`	How VLAs Fail Differently: Black-Box Action Monitoring Reveals Architecture-Specific Failure Signatures PDF	cs.RO, cs.LG	88	Black-box monitoring finds architecture-specific VLA failure signatures; actionable for robot safety.	VLA, robotics, monitoring, safety, evaluation, failure-analysis
`2605.29648`	Verifiable Rewards Beyond Math and Code: Lightweight Corpus-Grounded Process Supervision for Factual Question Answering PDF	cs.CL	88	Corpus-grounded process rewards for factual QA RL; practical supervision beyond math/code with clear alignment value.	alignment, rl, factuality, process-supervision, qa, rewards
`2605.30241`	CommunityFact: A Dynamic, Multilingual, Multi-domain Benchmark for Misinformation Detection in the Wild PDF	cs.CL, cs.CY, cs.SI	87	Dynamic multilingual misinformation benchmark with web-search analysis targets real-world reliability.	benchmark, misinformation, reliability, web-search, multilingual
`2605.28148`	DeltaMCP: Incremental Regeneration via Spec-Aware Transformation for MCP servers PDF	cs.SE, cs.AI	87	Spec-aware MCP server regeneration is directly relevant to reliable agent tooling infrastructure.	agents, MCP, tool use, API integration, software infrastructure
`2605.28025`	MIRA: A Bilingual Benchmark for Medical Information Response Audit PDF	cs.AI, cs.CL, cs.CY	87	Bilingual benchmark for unequal medical responses across phrasing and literacy; valuable safety evaluation.	medical, benchmark, fairness, reliability, evaluation, multilingual
`2605.21917`	MAVEN: A Multi-stage Agentic Annotation Pipeline for Video Reasoning Tasks PDF	cs.CV, cs.AI	87	Agentic pipeline for scalable video reasoning data with CoT traces and domain adaptation.	agents, video, data-generation, reasoning, VLM
`2605.27345`	MATCHA: Matching Text via Contrastive Semantic Alignment PDF	cs.CL	86	Evaluation metric targets contradictions missed by ROUGE/BERTScore; broadly useful for reliability.	evaluation, reliability, factuality, metrics, contradiction-detection, llm

AI Paper Insight Brief

2026-06-01

0) Executive takeaways (read this first)

Agent evaluation is shifting from static accuracy to process-aware robustness: today’s strongest papers measure infeasibility detection, recovery after self-caused errors, memory use, multimodal tool execution, and social interaction failure modes rather than just final-task success.
A recurring pattern is that the scaffold matters as much as the base model: verifier hooks, structured intermediates, explicit memory, targeted resampling, and domain-aware multi-agent decomposition often produce larger gains than generic prompting or naive RL.
Safety work is becoming more deployment-realistic and localized: several benchmarks target Korean cultural multimodal harms, Chinese obfuscation/evasion, bilingual medical information dilution, audio jailbreaks, and post-alignment fragility under quantization/noise.
Multiple papers show that simple defenses are brittle or trade off heavily with usability: defensive prompts cut attack success rate (ASR) but sharply raise benign refusals, prompt/CoT baselines can amplify framing sensitivity, and some “safety” interventions or optimizers can worsen risk under perturbation or backdoor settings.
For practitioners, the most actionable direction is to build systems with explicit verification and abstention paths: deterministic validators, evidence-grounded rewards, cost-bounded halting, and feasibility-aware stopping consistently outperform unconstrained generation.
Data generation is becoming a core capability bottleneck: scalable progress is coming from synthetic but verifiable environments and annotation pipelines for video reasoning, phone/GUI agents, and multimodal report generation.

2) Key themes (clusters)

Theme: Agent robustness is now about recovery, stopping, and memory

Why it matters: Many agent failures are no longer “can it solve the task?” but “can it notice it is stuck, remember transient facts, recover from its own mistakes, or stop when success is impossible?” These are directly tied to cost, safety, and user trust.
Common approach:
- Build benchmarks around failure states rather than only successful trajectories.
- Add explicit structure: STOP actions, memory fields, recovery traces, or scoped memory abstractions.
- Use controllable synthetic environments to generate verifiable supervision at scale.
- Evaluate both success and efficiency metrics such as token cost, trajectory length, or post-error recovery rate.
Open questions / failure modes:
- Many methods assume closed tool sets or simulator control, limiting open-world transfer.
- Memory gains depend heavily on search strategy and task structure.
- Recovery methods can over-reflect or consume too much inference budget.
- Sim-to-real transfer for GUI/mobile memory and recovery remains unresolved.

Theme: Verification-first architectures are beating unconstrained generation

Why it matters: Across security, factuality, and long-form generation, systems that separate proposing from checking are more reliable than end-to-end freeform generation. The strongest systems explicitly preserve provenance, enforce schemas, or halt when evidence is insufficient.
Common approach:
- Split generation into planner/researcher/writer or generator/verifier roles.
- Ground outputs in external evidence: telemetry, citations, corpus counts, executable tests, or DB state.
- Use structured intermediates and typed schemas to constrain search.
- Prefer conservative halting or abstention over unsupported completion.
Open questions / failure modes:
- Verification adds substantial latency and systems complexity.
- Coverage is bounded by surviving telemetry, corpus coverage, or available tools.
- Some pipelines still rely on LLM judges, introducing secondary reliability concerns.
- Modular decomposition may improve trustworthiness while limiting end-to-end optimization.
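
The generator/verifier split described above can be illustrated with a minimal sketch. It is not taken from any of the papers: the `Claim` fields, the `verify` rule (the cited source must exist and contain the quoted span), and the `assemble_report` helper are all hypothetical names chosen for illustration. The only idea borrowed from the brief is that the pipeline halts with an explicit insufficient-evidence status instead of emitting unsupported text.

```python
from dataclasses import dataclass

# Hypothetical status token; the halting behavior, not the name, is the point.
INSUFFICIENT_EVIDENCE = "INSUFFICIENT_EVIDENCE"


@dataclass(frozen=True)
class Claim:
    text: str        # what the generator wants to assert
    source_id: str   # which evidence document it cites
    quote: str       # exact span it says supports the claim


def verify(claim: Claim, evidence: dict[str, str]) -> bool:
    """Deterministic check: the cited source exists and contains the quote."""
    doc = evidence.get(claim.source_id)
    return doc is not None and bool(claim.quote) and claim.quote in doc


def assemble_report(claims: list[Claim], evidence: dict[str, str]) -> dict:
    """Keep only verified claims; halt if nothing survives verification."""
    verified = [c for c in claims if verify(c, evidence)]
    if not verified:
        return {"status": INSUFFICIENT_EVIDENCE, "claims": []}
    return {"status": "OK", "claims": verified}


# Illustrative, made-up inputs.
evidence = {"doc1": "The service restarted at 02:14 UTC after an OOM kill."}
claims = [
    Claim("The service restarted overnight.", "doc1", "restarted at 02:14 UTC"),
    Claim("The attacker logged in over SSH.", "doc2", "ssh login"),  # no such source
]
report = assemble_report(claims, evidence)
empty_report = assemble_report(claims[1:], evidence)
```

In this sketch the proposer can be as free-form as it likes, because nothing reaches the report without passing a check that does not involve a model. A stricter variant could halt as soon as any single claim fails rather than dropping it.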

Theme: Safety evaluation is becoming culturally specific, multimodal, and operational

Why it matters: English-only, text-only safety benchmarks are missing real deployment risks. Today’s papers show materially different failure modes under Korean cultural context, Chinese obfuscation, audio attacks, and bilingual medical phrasing.
Common approach:
- Localize prompts, images, and attack styles to region-specific language and culture.
- Measure trade-offs beyond ASR, especially refusal rate and underinformative responses.
- Include human annotation or validation rather than relying only on translated or synthetic prompts.
- Stress-test models with jailbreaks, obfuscation, and multimodal triggers.
Open questions / failure modes:
- Judge reliability and annotation consistency remain bottlenecks.
- Many datasets are gated, partially annotated, or limited to one region/language.
- Stronger defenses often achieve safety through over-refusal.
- Synthetic images/audio may not fully capture real-world attack conditions.
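
To make the "trade-offs beyond ASR" point concrete, here is a small sketch that reports attack success rate next to benign refusal rate for each language slice. The record layout, the `safety_tradeoff` name, and the simplification that an attack "succeeds" whenever the model does not refuse are assumptions made for illustration; real benchmarks typically use human or model judges to decide success.

```python
from collections import defaultdict


def safety_tradeoff(records):
    """records: iterable of (lang, is_attack, refused) tuples.

    Returns, per language, the attack success rate (ASR) and the
    benign refusal rate (BRR), or None when a slice has no samples.
    """
    counts = defaultdict(lambda: {"attacks": 0, "successes": 0,
                                  "benign": 0, "refusals": 0})
    for lang, is_attack, refused in records:
        c = counts[lang]
        if is_attack:
            c["attacks"] += 1
            c["successes"] += int(not refused)  # assumption: not refused = success
        else:
            c["benign"] += 1
            c["refusals"] += int(refused)       # over-refusal on a benign prompt
    return {
        lang: {
            "asr": c["successes"] / c["attacks"] if c["attacks"] else None,
            "brr": c["refusals"] / c["benign"] if c["benign"] else None,
        }
        for lang, c in counts.items()
    }


# Made-up records, only to show the shape of the output.
records = [
    ("en", True, True), ("en", True, True),
    ("en", False, False), ("en", False, False),
    ("ko", True, False), ("ko", True, True),
    ("ko", False, True), ("ko", False, False),
]
by_lang = safety_tradeoff(records)
```

Reporting both numbers per slice keeps a defense from looking good simply because it refuses everything, and it makes gaps between languages visible instead of averaging them away.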

Theme: RL for tool use is moving toward denser, targeted credit assignment

Why it matters: Standard outcome-only RL is too sparse for tool use. The most promising work today improves learning by assigning credit at the tool-call boundary, sentence level, or per-step action level.
Common approach:
- Replace coarse terminal rewards with process-level or localized rewards.
- Focus exploration budget on high-leverage decisions such as tool-call tokens.
- Use lightweight verifiable signals instead of expensive neural judges where possible.
- Combine standard first-order training with targeted refinement rather than full replacement.
Open questions / failure modes:
- Many methods assume binary verifiability or narrow task families.
- Gains may not transfer to richer tool ecosystems or larger trainable models.
- Reward design can still miss semantic correctness even when it improves efficiency.
- Robustness improvements are mostly shown on limited model/dataset sets.
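
The difference between outcome-only and process-level credit can be shown with a toy reward assignment. This is a sketch under simple assumptions, not any paper's reward function: each step of a trajectory is assumed to carry a cheap binary check (`step_ok`), and `step_weight` is a hypothetical knob that splits credit between per-step checks and the final outcome.

```python
def outcome_only_rewards(step_ok, success):
    """All credit sits on the last step; earlier steps get no signal."""
    rewards = [0.0] * len(step_ok)
    if rewards:
        rewards[-1] = 1.0 if success else 0.0
    return rewards


def process_rewards(step_ok, success, step_weight=0.5):
    """Spread part of the credit over steps that pass a verifiable check."""
    n = len(step_ok)
    if n == 0:
        return []
    per_step = step_weight / n
    rewards = [per_step if ok else 0.0 for ok in step_ok]
    if success:
        rewards[-1] += 1.0 - step_weight  # remaining credit for the final outcome
    return rewards


# A failed four-step trajectory whose third tool call was invalid.
step_ok = [True, True, False, True]
sparse = outcome_only_rewards(step_ok, success=False)
dense = process_rewards(step_ok, success=False)
```

For the failed trajectory, the outcome-only vector is all zeros, so the learner cannot tell which decision went wrong, while the process-level vector singles out the third step. As the open questions above note, a per-step check of this kind can still miss semantic correctness.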

Theme: Benchmarks are exposing capability gaps in multimodal and social agents

Why it matters: New benchmarks are less about leaderboard increments and more about revealing where current systems fundamentally break: egocentric tool use, phone environments, social deduction, and real coding workflows all show low reliability or confounded performance.
Common approach:
- Use interactive environments with deterministic or auditable scoring.
- Measure process failures, invalid actions, and user-visible misalignment rather than only final answers.
- Release trajectories and environment artifacts to support reproducible diagnosis.
- Analyze failure taxonomies and confounds, not just aggregate scores.
Open questions / failure modes:
- Leaderboards can be dominated by environment-specific error handling rather than true reasoning skill.
- Real-world logs reveal persistent misalignment that benchmarks may still undercapture.
- Many environments remain partially synthetic or unreleased.
- Strong performance in one environment often transfers poorly to another.

3) Technical synthesis

Several papers converge on a structured intermediate representation as the key to reliability: MSTED for video reasoning, Visual Working Memory for multimodal reports, explicit memory fields for GUI agents, and typed ontologies for forensic reconstruction.
A common design pattern is asymmetric generation and verification: propose flexibly, validate deterministically. HunterAgent, PTAH, Agora, and EgoBench all use this pattern in different domains.
Process metrics are replacing single-score evaluation: ASR/BRR/latency, Error-Awareness/Post-Error Success, FCR/token waste, ICQ/MPQ, joint success via tool coverage + DB state, and symptom/cause/outcome taxonomies.
Multiple papers show prompt-only fixes are weak or counterproductive: framing robustness baselines often worsen flips; defensive prompts reduce jailbreak ASR but sharply increase benign refusal; naive full regeneration overwrites useful MCP customizations.
There is a strong trend toward controllable synthetic environments with executable verification: PhoneWorld, STAMP, GUI-RobustEval/RoTS, and MAVEN all use synthetic or semi-synthetic pipelines to create scalable supervision.
Search/inference strategy is a hidden confound in agent systems: memory effectiveness depends on best-of-N vs beam vs MCTS; social-agent rankings depend on environment error handling; penetration-testing consistency depends on orchestration details and provider failures.
Several papers identify non-obvious optimizer or intervention effects: SAM can amplify DRL backdoors; short ZO refinement improves post-alignment robustness; AXPO’s targeted resampling beats simply increasing rollout count.
Abstention is emerging as a first-class safety behavior: FeasiGen rewards STOP on infeasible tasks; HunterAgent halts with INSUFFICIENT_EVIDENCE; verifier-heavy systems prefer conservative failure over unsupported completion.
The strongest factuality work uses cheap external signals instead of expensive judges: corpus-grounded co-occurrence rewards and evidence-guided retrieval improve scalability and auditability.
Across domains, deployment realism exposes trade-offs that benchmark-only work often hides: latency, token cost, over-refusal, API outages, quantization fragility, and preservation of custom enterprise logic.

4) Top 5 papers (with “why now”)

How Coding Agents Fail Their Users: A Large-Scale Analysis of Developer-Agent Misalignment in 20,574 Real-World Sessions

Uses 20,574 real IDE/CLI sessions and 16,118 validated misalignment episodes, giving unusually strong ecological validity.
Shows the dominant failures are not exotic: developer constraint violations, intent misreads, and inaccurate self-reporting.
Useful now because coding agents are moving into production workflows, and this paper gives concrete failure taxonomies for training and product instrumentation.
Skeptical about: it only captures misalignment visible through developer pushback in logs, so silent failures and off-chat corrections are missed.

Do Agents Know What They Can’t Do? Evaluating Feasibility Awareness in Tool-Using Agents

Introduces FeasiGen to create 1,036 infeasible tasks and shows that even the best model still has a 23.5% false-continuation rate.
Quantifies the real cost of not stopping: failed runs consume 2.3×–5.0× more tokens than runs that stop early.
Useful now because agent deployments increasingly pay for wasted trajectories, not just wrong answers.
Skeptical about: the setup assumes closed tool pools, so open-world agents may behave differently.

Towards Verifiable Multimodal Deep Research: A Multi-Agent Harness for Interleaved Report Generation

Presents a full planning-research-writing harness with verifier gates and a visual working memory, not just a report generator.
Improves both textual quality and multimodal evidence quality, with 87.53% citation accuracy and strong ICQ/MPQ gains.
Useful now because “deep research” products are proliferating, and this is one of the clearest blueprints for making them auditable.
Skeptical about: latency is high (about 1,015 s on average), and the modular pipeline may be hard to operationalize cheaply.

Aligned but Fragile: Enhancing LLM Safety Robustness via Zeroth-Order Optimization

Shows aligned models can lose safety under realistic perturbations like quantization and noise, then offers a practical FO→ZO refinement fix.
The method is lightweight enough to be deployment-relevant: short ZO refinement, lower peak memory, and targeted layer selection.
Useful now because many production systems quantize or otherwise perturb aligned models after training.
Skeptical about: evidence is limited to two base models and a narrow perturbation set.

KSAFE-MM: A Multimodal Safety Benchmark via Localized Contextualization for Korean Cultural Risks

Builds a 14,135-sample benchmark showing culturally grounded multimodal attacks and jailbreaks reveal vulnerabilities missed by generic English-centric evaluation.
Surfaces the practical ASR vs refusal-rate trade-off, rather than treating low ASR as unambiguously good.
Useful now because frontier safety evaluation is still overly Anglocentric, while deployment is not.
Skeptical about: judge agreement is imperfect and the benchmark is culturally specific, so transfer to other regions is not automatic.

5) Practical next steps

Add explicit abstention/stopping metrics to agent evals: false continuation rate, token cost to failure, and safe-halt rate should sit beside task success.
For tool-using agents, instrument the tool-call boundary: log attempt rate, all-wrong subgroup rate, recovery-after-resampling, and per-tool-call uncertainty.
Build verifier-separated pipelines for high-stakes workflows: generation should emit structured claims/actions, and a separate module should validate citations, schema, DB state, or executable outcomes.
Stress-test aligned models under post-training perturbations you actually deploy: quantization, activation noise, parameter noise, and optimizer/intervention changes.
Expand safety evaluation beyond English with localized, human-built adversarial sets and track both harmful compliance and over-refusal.
For multimodal/reporting systems, maintain a source-aligned visual memory and score cross-modal consistency, not just text quality.
In GUI/phone agents, train on synthetic but executable recovery and memory tasks rather than only successful demonstrations.
For enterprise tool stacks, prefer incremental regeneration and preservation of custom logic over full regeneration when syncing agent-facing interfaces.
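
As a starting point for the first item in this list, the abstention metrics could be computed from run logs roughly as follows. The `Run` fields and the metric definitions are assumptions made for illustration (false continuation is taken to mean "the task was infeasible and the agent never emitted an explicit stop"); the papers may define these quantities differently.

```python
from dataclasses import dataclass


@dataclass
class Run:
    feasible: bool   # ground-truth label: could the task be completed at all?
    stopped: bool    # did the agent explicitly stop or abstain?
    succeeded: bool  # did the run complete the task?
    tokens: int      # total tokens spent on the run


def _mean(values):
    values = list(values)
    return sum(values) / len(values) if values else None


def abstention_metrics(runs):
    feasible = [r for r in runs if r.feasible]
    infeasible = [r for r in runs if not r.feasible]
    continued = [r for r in infeasible if not r.stopped]
    halted = [r for r in infeasible if r.stopped]
    tokens_continued = _mean(r.tokens for r in continued)
    tokens_halted = _mean(r.tokens for r in halted)
    return {
        "task_success": _mean(int(r.succeeded) for r in feasible),
        "false_continuation_rate": len(continued) / len(infeasible) if infeasible else None,
        "safe_halt_rate": len(halted) / len(infeasible) if infeasible else None,
        "mean_tokens_continued": tokens_continued,
        "mean_tokens_halted": tokens_halted,
        # How much more a non-stopping run costs than an early stop.
        "token_waste_ratio": (tokens_continued / tokens_halted
                              if tokens_continued and tokens_halted else None),
    }


# Made-up runs, only to show the shape of the output.
runs = [
    Run(feasible=False, stopped=True, succeeded=False, tokens=100),
    Run(feasible=False, stopped=False, succeeded=False, tokens=400),
    Run(feasible=False, stopped=True, succeeded=False, tokens=200),
    Run(feasible=False, stopped=False, succeeded=False, tokens=500),
    Run(feasible=True, stopped=False, succeeded=True, tokens=300),
]
metrics = abstention_metrics(runs)
```

Under these definitions the safe-halt rate is simply the complement of the false continuation rate on infeasible tasks. A fuller version would also track stops on feasible tasks, since an agent that always stops would otherwise score perfectly.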

Generated from per-paper analyses; no external browsing.

Agent reliability gets operational.
