June 1, 2026 Research Brief
Agent reliability gets operational.
Today’s papers push agents and safety systems toward deployment reality: process-aware evaluation, verifier-first scaffolds, and localized multimodal safety tests expose failures static benchmarks miss.
Takeaways
- Agent evaluation is shifting from static accuracy to **process-aware robustness**: today’s strongest papers measure infeasibility detection, recovery after self-caused errors, memory use, multimodal tool execution, and social interaction failure modes rather than just final-task success.
- A recurring pattern is that **the scaffold matters as much as the base model**: verifier hooks, structured intermediates, explicit memory, targeted resampling, and domain-aware multi-agent decomposition often produce larger gains than generic prompting or naive RL.
- Safety work is becoming more **deployment-realistic and localized**: several benchmarks target Korean cultural multimodal harms, Chinese obfuscation/evasion, bilingual medical information dilution, audio jailbreaks, and post-alignment fragility under quantization/noise.
Start with: How Coding Agents Fail Their Users: A Large-Scale Analysis of Developer-Agent Misalignment in 20,574 Real-World Sessions
Why it catches my eye: It grounds agent reliability in large real-world usage logs and yields a failure taxonomy teams can directly instrument against.
Read skeptically for: It only captures failures visible in logged user pushback, so silent or unreported misalignment may be missed.
Papers Worth Your Reading Time
Ranked for research usefulness: novelty, method pattern, evidence quality, and skepticism value.
How Coding Agents Fail Their Users: A Large-Scale Analysis of Developer-Agent Misalignment in 20,574 Real-World Sessions
#1 A rare large-scale real-world study that turns coding-agent failure into concrete, product-relevant categories.
- Why now: Coding agents are already in production, so ecological validity matters more than sandbox wins.
- Skepticism: Logged pushback may undercount silent failures, off-chat corrections, and unobserved user dissatisfaction.
Do Agents Know What They Can't Do? Evaluating Feasibility Awareness in Tool-Using Agents
#2 It makes abstention measurable and shows that failing to stop wastes substantial tokens and reliability budget.
- Why now: As tool-using agents get deployed, cost-bounded halting is becoming a practical requirement, not a nice-to-have.
- Skepticism: Results rely on closed tool pools, so open-world feasibility awareness may be harder.
Towards Verifiable Multimodal Deep Research: A Multi-Agent Harness for Interleaved Report Generation
#3 A strong blueprint for verifier-separated, evidence-grounded agent workflows rather than unconstrained report generation.
- Why now: Deep-research products are proliferating, and auditability is becoming a differentiator.
- Skepticism: High latency and pipeline complexity may limit practical deployment.
Run stats
- Candidates: 8456
- Selected: 30
- Deepread completed: 30
- Window (UTC): 2026-05-29T00:00:00Z → 2026-05-30T00:00:00Z (weekend_backlog_sat, expanded=0)
Selected papers
| arXiv ID | Title / Links | Categories | Score | Why | Tags |
|---|---|---|---|---|---|
| 2605.29396 | Aligned but Fragile: Enhancing LLM Safety Robustness via Zeroth-Order Optimization | cs.AI | 95 | Targets robustness of LLM safety alignment under noise/quantization; directly relevant to deployment safety. | llm-safety, alignment, robustness, post-training, optimization |
| 2605.29442 | How Coding Agents Fail Their Users: A Large-Scale Analysis of Developer-Agent Misalignment in 20,574 Real-World Sessions | cs.SE, cs.AI, cs.HC | 95 | Large real-world study of coding-agent misalignment; highly relevant to agent safety and deployment. | agent-safety, coding-agents, misalignment, human-ai-interaction, deployment |
| 2605.30031 | Audio Jailbreaks in Large Audio-Language Models: Taxonomy, Attack-Defense Analysis, and Cost-Aware Evaluation | cs.SD, cs.AI, cs.CL | 94 | Unified taxonomy and controlled eval of audio jailbreaks/defenses for agentic speech systems. | llm-safety, jailbreaks, audio-language-models, benchmark, defenses, red-teaming |
| 2605.29667 | Beyond English and Evasion: A Human-Annotated Multi-Domain Benchmark for High-Stakes LLM Safety Evaluation in Chinese | cs.CL | 94 | Human-annotated Chinese safety benchmark targets real jailbreak/evasion gaps in high-stakes LLM deployment. | llm-safety, benchmark, jailbreaks, multilingual, evaluation, adversarial-prompts |
| 2605.29447 | Recovering Policy-Induced Errors: Benchmarking and Trajectory Synthesis for Robust GUI Agents | cs.CV, cs.CL | 92 | Strong GUI agent robustness benchmark plus 800k recovery trajectories for error correction. | agents, gui-agents, robustness, benchmark, synthetic-data, error-recovery |
| 2605.29659 | Opir: Efficient Multi-Task Safety Classification for Toxicity, Jailbreaks, Hate Speech, and Harmful Content | cs.LG, cs.AI, cs.CL | 92 | Practical multi-task guardrail classifiers for toxicity, jailbreaks, and harmful content with efficient edge variants. | llm-safety, guardrails, classification, jailbreak-detection, toxicity, deployment |
| 2605.30169 | Dissociative Identity: Language Model Agents Lack Grounding for Reputation Mechanisms | cs.CY, cs.AI, cs.MA | 92 | Conceptual agent-safety paper on why identity/reputation may fail for LLM agents in the wild. | agents, governance, trust, reputation, agent-safety |
| 2605.27820 | EgoBench: An Interactive Egocentric Multimodal Benchmark for Tool-Using Agents | cs.AI | 92 | Interactive multimodal benchmark for tool-using agents in realistic settings; strong eval value. | agents, multimodal, tool-use, benchmark, evaluation |
| 2605.28774 | Agent Explorative Policy Optimization for Multimodal Agentic Reasoning | cs.CL | 92 | Targets tool-use failure in multimodal agents with RL fix for the thinking-acting gap. | agents, multimodal, tool-use, RL, reasoning |
| 2605.28224 | When Does Memory Help Multi-Trajectory Inference for Tool-Use LLM Agents? | cs.AI | 92 | Systematic study of memory for tool-use agents across strategies; directly relevant to agent reliability. | agents, tool-use, memory, inference, reliability, evaluation |
| 2605.29910 | Agora: Toward Autonomous Bug Detection in Production-Level Consensus Protocols with LLM Agents | cs.SE, cs.AI | 92 | Multi-agent LLM framework for protocol bug finding; strong agentic security relevance and concrete verification setup. | agents, security, code, verification, multi-agent, bug-detection |
| 2605.29269 | HunterAgent: Neuro-Symbolic Attack Trace Reconstruction under Anti-Forensics | cs.CR | 90 | Neuro-symbolic verifier for attack-trace reconstruction addresses LLM hallucination in security workflows. | security, agents, verification, forensics, neuro-symbolic |
| 2605.28532 | Do Agents Know What They Can't Do? Evaluating Feasibility Awareness in Tool-Using Agents | cs.AI | 90 | Evaluates whether tool-using agents detect infeasible tasks and stop early; strong practical safety value. | agents, tool-use, evaluation, reliability, efficiency, safety |
| 2605.28188 | Framing Matters: Addressing Framing Sensitivity in Decision-Making through Behaviorally-Grounded Value Alignment | cs.CL | 90 | Strong alignment benchmark on framing sensitivity; exposes decision instability in high-stakes LLM use. | alignment, reliability, benchmark, decision-making, robustness, safety |
| 2605.29486 | PhoneWorld: Scaling Phone-Use Agent Environments | cs.CL, cs.AI, cs.LG | 90 | Scalable phone-use agent environment pipeline with verifiers and rollouts; high reuse for agent evaluation. | agents, benchmark, evaluation, mobile, environments, tool-use |
| 2605.28013 | KSAFE-MM: A Multimodal Safety Benchmark via Localized Contextualization for Korean Cultural Risks | cs.CL | 89 | Useful multimodal safety benchmark covering Korean and culture-specific risks beyond English. | multimodal-safety, benchmark, cultural-context, evaluation, mllm, localized-risks |
| 2605.29861 | Towards Verifiable Multimodal Deep Research: A Multi-Agent Harness for Interleaved Report Generation | cs.CL, cs.AI | 89 | Verifiable multimodal deep-research harness with claim-grounded evidence and source-aligned visuals. | agents, verification, multimodal, deep-research, grounding |
| 2605.29324 | STAMP: Training Explicit Memory for Mobile GUI Agents in Controllable and Scalable Virtual Environments | cs.CL, cs.CV | 89 | Explicit memory training for mobile GUI agents targets a key long-horizon failure mode. | agents, GUI agents, memory, long-horizon, virtual environments |
| 2605.30096 | How Reliable Are AI Attackers Against a Fixed Vulnerable Target? A 400-Run Empirical Study of LLM Penetration Testing Consistency | cs.CR, cs.AI | 89 | Large empirical study of autonomous cyberattack consistency; important for agent risk assessment. | cybersecurity, agents, red-teaming, offensive-capability, evaluation, safety |
| 2605.28338 | SafeMed-R1: Clinician-Audited Safety and Ethics Alignment for Medical Large Language Models | cs.AI | 88 | Clinician-audited medical LLM alignment with traceable reasoning and adversarial safety testing. | medical-llm, alignment, safety, auditing, red-teaming, ethics |
| 2605.29512 | MINDGAMES: A Live Arena for Evaluating Social and Strategic Reasoning in Multi-Agent LLMs | cs.AI | 88 | Live arena for multi-agent social/strategic reasoning; useful for evaluating agentic risks and deception. | agents, multi-agent, evaluation, theory-of-mind, deception, benchmark |
| 2605.29568 | DeepTool: Scaling Interleaved Deliberation in Tool-Integrated Reasoning via Process-Supervised Reinforcement Learning | cs.AI | 88 | Process-supervised RL for interleaved tool reasoning could improve capable, robust agent behavior. | tool-use, reasoning, reinforcement-learning, agents, process-supervision |
| 2605.14587 | Angel or Demon: Investigating the Plasticity Interventions' Impact on Backdoor Threats in Deep Reinforcement Learning | cs.LG, cs.AI, cs.CR | 88 | Large empirical study of DRL backdoors under plasticity interventions; actionable security findings. | security, backdoors, deep-reinforcement-learning, robustness, empirical-study |
| 2605.28726 | How VLAs Fail Differently: Black-Box Action Monitoring Reveals Architecture-Specific Failure Signatures | cs.RO, cs.LG | 88 | Black-box monitoring finds architecture-specific VLA failure signatures; actionable for robot safety. | VLA, robotics, monitoring, safety, evaluation, failure-analysis |
| 2605.29648 | Verifiable Rewards Beyond Math and Code: Lightweight Corpus-Grounded Process Supervision for Factual Question Answering | cs.CL | 88 | Corpus-grounded process rewards for factual QA RL; practical supervision beyond math/code with clear alignment value. | alignment, rl, factuality, process-supervision, qa, rewards |
| 2605.30241 | CommunityFact: A Dynamic, Multilingual, Multi-domain Benchmark for Misinformation Detection in the Wild | cs.CL, cs.CY, cs.SI | 87 | Dynamic multilingual misinformation benchmark with web-search analysis targets real-world reliability. | benchmark, misinformation, reliability, web-search, multilingual |
| 2605.28148 | DeltaMCP: Incremental Regeneration via Spec-Aware Transformation for MCP servers | cs.SE, cs.AI | 87 | Spec-aware MCP server regeneration is directly relevant to reliable agent tooling infrastructure. | agents, MCP, tool use, API integration, software infrastructure |
| 2605.28025 | MIRA: A Bilingual Benchmark for Medical Information Response Audit | cs.AI, cs.CL, cs.CY | 87 | Bilingual benchmark for unequal medical responses across phrasing and literacy; valuable safety evaluation. | medical, benchmark, fairness, reliability, evaluation, multilingual |
| 2605.21917 | MAVEN: A Multi-stage Agentic Annotation Pipeline for Video Reasoning Tasks | cs.CV, cs.AI | 87 | Agentic pipeline for scalable video reasoning data with CoT traces and domain adaptation. | agents, video, data-generation, reasoning, VLM |
| 2605.27345 | MATCHA: Matching Text via Contrastive Semantic Alignment | cs.CL | 86 | Evaluation metric targets contradictions missed by ROUGE/BERTScore; broadly useful for reliability. | evaluation, reliability, factuality, metrics, contradiction-detection, llm |
AI Paper Insight Brief
2026-06-01
0) Executive takeaways (read this first)
- Agent evaluation is shifting from static accuracy to process-aware robustness: today’s strongest papers measure infeasibility detection, recovery after self-caused errors, memory use, multimodal tool execution, and social interaction failure modes rather than just final-task success.
- A recurring pattern is that the scaffold matters as much as the base model: verifier hooks, structured intermediates, explicit memory, targeted resampling, and domain-aware multi-agent decomposition often produce larger gains than generic prompting or naive RL.
- Safety work is becoming more deployment-realistic and localized: several benchmarks target Korean cultural multimodal harms, Chinese obfuscation/evasion, bilingual medical information dilution, audio jailbreaks, and post-alignment fragility under quantization/noise.
- Multiple papers show that simple defenses are brittle or trade off heavily with usability: defensive prompts cut ASR but spike benign refusal, prompt/CoT baselines can amplify framing sensitivity, and some “safety” interventions or optimizers can worsen risk under perturbation or backdoor settings.
- For practitioners, the most actionable direction is to build systems with explicit verification and abstention paths: deterministic validators, evidence-grounded rewards, cost-bounded halting, and feasibility-aware stopping consistently outperform unconstrained generation.
- Data generation is becoming a core capability bottleneck: scalable progress is coming from synthetic but verifiable environments and annotation pipelines for video reasoning, phone/GUI agents, and multimodal report generation.
2) Key themes (clusters)
Theme: Agent robustness is now about recovery, stopping, and memory
- Why it matters: Many agent failures are no longer “can it solve the task?” but “can it notice it is stuck, remember transient facts, recover from its own mistakes, or stop when success is impossible?” These are directly tied to cost, safety, and user trust.
- Representative papers:
  - Do Agents Know What They Can’t Do? Evaluating Feasibility Awareness in Tool-Using Agents
  - Recovering Policy-Induced Errors: Benchmarking and Trajectory Synthesis for Robust GUI Agents
  - STAMP: Training Explicit Memory for Mobile GUI Agents in Controllable and Scalable Virtual Environments
  - When Does Memory Help Multi-Trajectory Inference for Tool-Use LLM Agents?
- Common approach:
  - Build benchmarks around failure states rather than only successful trajectories.
  - Add explicit structure: STOP actions, memory fields, recovery traces, or scoped memory abstractions.
  - Use controllable synthetic environments to generate verifiable supervision at scale.
  - Evaluate both success and efficiency metrics such as token cost, trajectory length, or post-error recovery rate.
- Open questions / failure modes:
  - Many methods assume closed tool sets or simulator control, limiting open-world transfer.
  - Memory gains depend heavily on how memory interacts with search strategy and task structure.
  - Recovery methods can over-reflect or consume too much inference budget.
  - Sim-to-real transfer for GUI/mobile memory and recovery remains unresolved.
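The explicit STOP action and cost accounting described in this theme can be sketched as a small agent loop. This is a minimal illustration under stated assumptions, not any paper's implementation: `run_episode`, `Step`, `EpisodeResult`, the toy policies, and the `max_tokens` budget are all hypothetical names and values.

```python
from dataclasses import dataclass
from typing import Callable, List

STOP = "STOP"  # hypothetical explicit abstention action


@dataclass
class Step:
    action: str  # a tool name, or STOP
    tokens: int  # tokens spent producing this step


@dataclass
class EpisodeResult:
    steps: List[Step]
    halted_by: str  # "stop_action", "budget", or "max_steps"

    @property
    def total_tokens(self) -> int:
        return sum(s.tokens for s in self.steps)


def run_episode(policy: Callable[[List[Step]], Step],
                max_tokens: int = 1000,
                max_steps: int = 20) -> EpisodeResult:
    """Run a policy until it emits STOP or exhausts its token/step budget."""
    steps: List[Step] = []
    spent = 0
    for _ in range(max_steps):
        step = policy(steps)
        steps.append(step)
        spent += step.tokens
        if step.action == STOP:
            return EpisodeResult(steps, "stop_action")
        if spent >= max_tokens:
            return EpisodeResult(steps, "budget")
    return EpisodeResult(steps, "max_steps")


# Toy policies (assumptions): one gives up after two failed tries, one never does.
def feasibility_aware(history: List[Step]) -> Step:
    return Step(STOP, 20) if len(history) >= 2 else Step("search", 100)


def never_stops(history: List[Step]) -> Step:
    return Step("search", 100)


aware = run_episode(feasibility_aware)
stuck = run_episode(never_stops)
print(aware.halted_by, aware.total_tokens)  # stop_action 220
print(stuck.halted_by, stuck.total_tokens)  # budget 1000
```

Recording `halted_by` separately from task success lets an evaluation distinguish a deliberate abstention from a run that was merely cut off by the budget.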
Theme: Verification-first architectures are beating unconstrained generation
- Why it matters: Across security, factuality, and long-form generation, systems that separate proposing from checking are more reliable than end-to-end freeform generation. The strongest systems explicitly preserve provenance, enforce schemas, or halt when evidence is insufficient.
- Representative papers:
  - HunterAgent: Neuro-Symbolic Attack Trace Reconstruction under Anti-Forensics
  - Towards Verifiable Multimodal Deep Research: A Multi-Agent Harness for Interleaved Report Generation
  - Verifiable Rewards Beyond Math and Code: Lightweight Corpus-Grounded Process Supervision for Factual Question Answering
  - Agora: Toward Autonomous Bug Detection in Production-Level Consensus Protocols with LLM Agents
- Common approach:
  - Split generation into planner/researcher/writer or generator/verifier roles.
  - Ground outputs in external evidence: telemetry, citations, corpus counts, executable tests, or DB state.
  - Use structured intermediates and typed schemas to constrain search.
  - Prefer conservative halting or abstention over unsupported completion.
- Open questions / failure modes:
  - Verification adds substantial latency and systems complexity.
  - Coverage is bounded by surviving telemetry, corpus coverage, or available tools.
  - Some pipelines still rely on LLM judges, introducing secondary reliability concerns.
  - Modular decomposition may improve trustworthiness while limiting end-to-end optimization.
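The propose-then-check split can be illustrated with a toy generator/verifier boundary. The sketch below is an assumption-laden example, not the design of any paper listed here: `Claim`, `verify_claim`, `build_report`, and the substring-match rule are hypothetical, and a real verifier would use stronger checks than exact quoting.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

INSUFFICIENT_EVIDENCE = "INSUFFICIENT_EVIDENCE"


@dataclass(frozen=True)
class Claim:
    text: str       # what the generator wants to assert
    source_id: str  # which source it cites
    quote: str      # span the generator says supports the claim


def verify_claim(claim: Claim, sources: Dict[str, str]) -> bool:
    """Deterministic check: the cited source exists and contains the quote."""
    doc = sources.get(claim.source_id)
    return doc is not None and bool(claim.quote) and claim.quote in doc


def build_report(claims: List[Claim], sources: Dict[str, str],
                 min_supported: int = 1) -> Tuple[str, List[Claim]]:
    """Keep only verified claims; halt conservatively if too few survive."""
    supported = [c for c in claims if verify_claim(c, sources)]
    if len(supported) < min_supported:
        return INSUFFICIENT_EVIDENCE, []
    return "OK", supported


# Made-up sources and claims for illustration only.
sources = {
    "log1": "The service restarted at 09:14 UTC.",
    "log2": "Average latency was high.",
}
claims = [
    Claim("The service restarted in the morning.", "log1", "restarted at 09:14 UTC"),
    Claim("Latency was low.", "log2", "latency was low"),
    Claim("A third host was involved.", "log9", "third host"),
]

status, kept = build_report(claims, sources)
status_weak, kept_weak = build_report(claims[1:], sources)
print(status, [c.text for c in kept])
print(status_weak, kept_weak)
```

The generator can be as flexible as needed, because nothing reaches the report without passing a check that does not depend on a model's judgment.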
Theme: Safety evaluation is becoming culturally specific, multimodal, and operational
- Why it matters: English-only, text-only safety benchmarks are missing real deployment risks. Today’s papers show materially different failure modes under Korean cultural context, Chinese obfuscation, audio attacks, and bilingual medical phrasing.
- Representative papers:
  - KSAFE-MM: A Multimodal Safety Benchmark via Localized Contextualization for Korean Cultural Risks
  - Beyond English and Evasion: A Human-Annotated Multi-Domain Benchmark for High-Stakes LLM Safety Evaluation in Chinese
  - Audio Jailbreaks in Large Audio-Language Models: Taxonomy, Attack-Defense Analysis, and Cost-Aware Evaluation
  - MIRA: A Bilingual Benchmark for Medical Information Response Audit
- Common approach:
  - Localize prompts, images, and attack styles to region-specific language and culture.
  - Measure trade-offs beyond ASR, especially refusal rate and underinformative responses.
  - Include human annotation or validation rather than relying only on translated or synthetic prompts.
  - Stress-test models with jailbreaks, obfuscation, and multimodal triggers.
- Open questions / failure modes:
  - Judge reliability and annotation consistency remain bottlenecks.
  - Many datasets are gated, partially annotated, or limited to one region/language.
  - Stronger defenses often achieve safety through over-refusal.
  - Synthetic images/audio may not fully capture real-world attack conditions.
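The ASR versus over-refusal trade-off is easy to miss when only one number is reported. The sketch below computes both from labeled evaluation records. It assumes a simplified setup in which any non-refusal on an attack prompt counts as attack success; the `Record` fields, the `safety_tradeoff` helper, and the sample data are hypothetical, and the benchmarks above may define their metrics differently.

```python
from typing import Dict, List, NamedTuple


class Record(NamedTuple):
    is_attack: bool  # prompt is adversarial or harmful
    complied: bool   # model gave a substantive answer instead of refusing


def safety_tradeoff(records: List[Record]) -> Dict[str, float]:
    """Return attack success rate and benign refusal rate for one system."""
    attacks = [r for r in records if r.is_attack]
    benign = [r for r in records if not r.is_attack]
    if not attacks or not benign:
        raise ValueError("need at least one attack and one benign prompt")
    asr = sum(r.complied for r in attacks) / len(attacks)
    brr = sum(not r.complied for r in benign) / len(benign)
    return {"asr": asr, "benign_refusal_rate": brr}


# Made-up results: a baseline model and the same model with a defensive prompt.
baseline = (
    [Record(True, True)] * 2 + [Record(True, False)] * 2
    + [Record(False, True)] * 4
)
defended = (
    [Record(True, True)] * 1 + [Record(True, False)] * 3
    + [Record(False, True)] * 2 + [Record(False, False)] * 2
)

base_scores = safety_tradeoff(baseline)
def_scores = safety_tradeoff(defended)
print(base_scores)  # {'asr': 0.5, 'benign_refusal_rate': 0.0}
print(def_scores)   # {'asr': 0.25, 'benign_refusal_rate': 0.5}
```

In this toy data the defense halves ASR while refusing half of the benign prompts, which is the pattern a single-metric report would hide.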
Theme: RL for tool use is moving toward denser, targeted credit assignment
- Why it matters: Standard outcome-only RL is too sparse for tool use. The most promising work today improves learning by assigning credit at the tool-call boundary, sentence level, or per-step action level.
- Representative papers:
  - Agent Explorative Policy Optimization for Multimodal Agentic Reasoning
  - DeepTool: Scaling Interleaved Deliberation in Tool-Integrated Reasoning via Process-Supervised Reinforcement Learning
  - Verifiable Rewards Beyond Math and Code: Lightweight Corpus-Grounded Process Supervision for Factual Question Answering
  - Aligned but Fragile: Enhancing LLM Safety Robustness via Zeroth-Order Optimization
- Common approach:
  - Replace coarse terminal rewards with process-level or localized rewards.
  - Focus exploration budget on high-leverage decisions such as tool-call tokens.
  - Use lightweight verifiable signals instead of expensive neural judges where possible.
  - Combine standard first-order training with targeted refinement rather than full replacement.
- Open questions / failure modes:
  - Many methods assume binary verifiability or narrow task families.
  - Gains may not transfer to richer tool ecosystems or larger trainable models.
  - Reward design can still miss semantic correctness even when it improves efficiency.
  - Robustness improvements are mostly shown on limited model/dataset sets.
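To see why outcome-only rewards are sparse, it helps to compare the per-step credit each scheme produces. The sketch below uses plain discounted returns-to-go on a made-up four-step trajectory. The reward values and the `returns_to_go` helper are illustrative assumptions, not the reward design of any paper above.

```python
from typing import List


def returns_to_go(rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Discounted sum of rewards from each step to the end of the trajectory."""
    out: List[float] = []
    running = 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        out.append(running)
    return out[::-1]


# Hypothetical trajectory: valid tool call, invalid tool call, valid call, success.
terminal_only = [0.0, 0.0, 0.0, 1.0]   # reward only for the final outcome
process_level = [0.5, -0.5, 0.5, 1.0]  # extra per-step signal at each tool call

terminal_credit = returns_to_go(terminal_only)
process_credit = returns_to_go(process_level)
print(terminal_credit)  # [1.0, 1.0, 1.0, 1.0]
print(process_credit)   # [1.5, 1.0, 1.5, 1.0]
```

With the terminal-only signal every step receives identical credit, so the invalid tool call is reinforced as much as the valid ones. The process-level signal separates the steps, at the cost of needing a per-step check that is cheap and trustworthy.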
Theme: Benchmarks are exposing capability gaps in multimodal and social agents
- Why it matters: New benchmarks are less about leaderboard increments and more about revealing where current systems fundamentally break: egocentric tool use, phone environments, social deduction, and real coding workflows all show low reliability or confounded performance.
- Representative papers:
  - EgoBench: An Interactive Egocentric Multimodal Benchmark for Tool-Using Agents
  - PhoneWorld: Scaling Phone-Use Agent Environments
  - MINDGAMES: A Live Arena for Evaluating Social and Strategic Reasoning in Multi-Agent LLMs
  - How Coding Agents Fail Their Users: A Large-Scale Analysis of Developer-Agent Misalignment in 20,574 Real-World Sessions
- Common approach:
  - Use interactive environments with deterministic or auditable scoring.
  - Measure process failures, invalid actions, and user-visible misalignment rather than only final answers.
  - Release trajectories and environment artifacts to support reproducible diagnosis.
  - Analyze failure taxonomies and confounds, not just aggregate scores.
- Open questions / failure modes:
  - Leaderboards can be dominated by environment-specific error handling rather than true reasoning skill.
  - Real-world logs reveal persistent misalignment that benchmarks may still undercapture.
  - Many environments remain partially synthetic or unreleased.
  - Strong performance in one environment often transfers poorly to another.
3) Technical synthesis
- Several papers converge on a structured intermediate representation as the key to reliability: MSTED for video reasoning, Visual Working Memory for multimodal reports, explicit memory fields for GUI agents, and typed ontologies for forensic reconstruction.
- A common design pattern is asymmetric generation and verification: propose flexibly, validate deterministically. HunterAgent, PTAH, Agora, and EgoBench all use this pattern in different domains.
- Process metrics are replacing single-score evaluation: ASR/BRR/latency, Error-Awareness/Post-Error Success, FCR/token waste, ICQ/MPQ, joint success via tool coverage + DB state, and symptom/cause/outcome taxonomies.
- Multiple papers show prompt-only fixes are weak or counterproductive: framing robustness baselines often worsen flips; defensive prompts reduce jailbreak ASR but sharply increase benign refusal; naive full regeneration overwrites useful MCP customizations.
- There is a strong trend toward controllable synthetic environments with executable verification: PhoneWorld, STAMP, GUI-RobustEval/RoTS, and MAVEN all use synthetic or semi-synthetic pipelines to create scalable supervision.
- Search/inference strategy is a hidden confound in agent systems: memory effectiveness depends on best-of-N vs beam vs MCTS; social-agent rankings depend on environment error handling; penetration-testing consistency depends on orchestration details and provider failures.
- Several papers identify non-obvious optimizer or intervention effects: SAM can amplify DRL backdoors; short ZO refinement improves post-alignment robustness; AXPO’s targeted resampling beats simply increasing rollout count.
- Abstention is emerging as a first-class safety behavior: FeasiGen rewards STOP on infeasible tasks; HunterAgent halts with INSUFFICIENT_EVIDENCE; verifier-heavy systems prefer conservative failure over unsupported completion.
- The strongest factuality work uses cheap external signals instead of expensive judges: corpus-grounded co-occurrence rewards and evidence-guided retrieval improve scalability and auditability.
- Across domains, deployment realism exposes trade-offs that benchmark-only work often hides: latency, token cost, over-refusal, API outages, quantization fragility, and preservation of custom enterprise logic.
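4) Top 5 papers (with “why now”)
How Coding Agents Fail Their Users: A Large-Scale Analysis of Developer-Agent Misalignment in 20,574 Real-World Sessions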
- Uses 20,574 real IDE/CLI sessions and 16,118 validated misalignment episodes, giving unusually strong ecological validity.
- Shows the dominant failures are not exotic: developer constraint violations, intent misreads, and inaccurate self-reporting.
- Useful now because coding agents are moving into production workflows, and this paper gives concrete failure taxonomies for training and product instrumentation.
- Skeptical about: it only captures misalignment visible through developer pushback in logs, so silent failures and off-chat corrections are missed.
Do Agents Know What They Can’t Do? Evaluating Feasibility Awareness in Tool-Using Agents
- Introduces FeasiGen to create 1,036 infeasible tasks and shows even the best model still has 23.5% false continuation.
- Quantifies the real cost of not stopping: failure runs consume 2.3×–5.0× more tokens than early-stop behavior.
- Useful now because agent deployments increasingly pay for wasted trajectories, not just wrong answers.
- Skeptical about: the setup assumes closed tool pools, so open-world agents may behave differently.
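The two headline quantities here, false continuation and the token cost of not stopping, can be computed from run logs along the following lines. This is a sketch under assumed definitions: the `Run` record, both helper functions, and the sample numbers are hypothetical, and the paper's exact metric definitions may differ.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class Run:
    feasible: bool       # whether the task could be solved with the given tools
    stopped_early: bool  # whether the agent explicitly abstained
    tokens: int          # total tokens consumed by the run


def false_continuation_rate(runs: List[Run]) -> float:
    """Share of infeasible tasks on which the agent kept going."""
    infeasible = [r for r in runs if not r.feasible]
    if not infeasible:
        raise ValueError("no infeasible runs to score")
    return sum(not r.stopped_early for r in infeasible) / len(infeasible)


def token_waste_ratio(runs: List[Run]) -> float:
    """Mean tokens of continued infeasible runs over mean tokens of early stops."""
    infeasible = [r for r in runs if not r.feasible]
    continued = [r.tokens for r in infeasible if not r.stopped_early]
    stopped = [r.tokens for r in infeasible if r.stopped_early]
    if not continued or not stopped:
        raise ValueError("need both continued and early-stopped infeasible runs")
    return mean(continued) / mean(stopped)


# Made-up logs: four infeasible tasks and one feasible task.
runs = [
    Run(feasible=False, stopped_early=True, tokens=1000),
    Run(feasible=False, stopped_early=True, tokens=1000),
    Run(feasible=False, stopped_early=True, tokens=1000),
    Run(feasible=False, stopped_early=False, tokens=3000),
    Run(feasible=True, stopped_early=False, tokens=1500),
]

fcr = false_continuation_rate(runs)
waste = token_waste_ratio(runs)
print(fcr, waste)  # 0.25 3.0
```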
Towards Verifiable Multimodal Deep Research: A Multi-Agent Harness for Interleaved Report Generation
- Presents a full planning-research-writing harness with verifier gates and a visual working memory, not just a report generator.
- Improves both textual quality and multimodal evidence quality, with 87.53% citation accuracy and strong ICQ/MPQ gains.
- Useful now because “deep research” products are proliferating, and this is one of the clearest blueprints for making them auditable.
- Skeptical about: latency is high (~1015s average), and the modular pipeline may be hard to operationalize cheaply.
Aligned but Fragile: Enhancing LLM Safety Robustness via Zeroth-Order Optimization
- Shows aligned models can lose safety under realistic perturbations like quantization and noise, then offers a practical FO→ZO refinement fix.
- The method is lightweight enough to be deployment-relevant: short ZO refinement, lower peak memory, and targeted layer selection.
- Useful now because many production systems quantize or otherwise perturb aligned models after training.
- Skeptical about: evidence is limited to two base models and a narrow perturbation set.
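For readers unfamiliar with zeroth-order (ZO) optimization, the core idea is to estimate a gradient from function evaluations alone, without backpropagation. The sketch below shows a generic two-point random-direction estimator on a toy quadratic. It is not the paper's FO→ZO procedure; `zo_gradient`, the toy `loss`, the step size, and the iteration count are all assumptions chosen for illustration.

```python
import random
from typing import Callable, List


def zo_gradient(f: Callable[[List[float]], float], x: List[float],
                eps: float = 1e-3, rng=random) -> List[float]:
    """Two-point estimate: (f(x + eps*u) - f(x - eps*u)) / (2*eps) * u, u ~ N(0, I)."""
    u = [rng.gauss(0.0, 1.0) for _ in x]
    x_plus = [xi + eps * ui for xi, ui in zip(x, u)]
    x_minus = [xi - eps * ui for xi, ui in zip(x, u)]
    scale = (f(x_plus) - f(x_minus)) / (2.0 * eps)
    return [scale * ui for ui in u]


def loss(x: List[float]) -> float:
    """Toy objective with its minimum at (1, 1, 1)."""
    return sum((xi - 1.0) ** 2 for xi in x)


rng = random.Random(0)
x = [0.0, 0.0, 0.0]
initial = loss(x)
for _ in range(200):
    g = zo_gradient(loss, x, rng=rng)
    x = [xi - 0.05 * gi for xi, gi in zip(x, g)]
final = loss(x)
print(initial, final)
```

Each update needs only two forward evaluations and no stored activations, which is one plausible reason a ZO refinement stage can run with lower peak memory than first-order training.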
KSAFE-MM: A Multimodal Safety Benchmark via Localized Contextualization for Korean Cultural Risks
- Builds a 14,135-sample benchmark showing culturally grounded multimodal attacks and jailbreaks reveal vulnerabilities missed by generic English-centric evaluation.
- Surfaces the practical ASR vs refusal-rate trade-off, rather than treating low ASR as unambiguously good.
- Useful now because frontier safety evaluation is still overly Anglocentric, while deployment is not.
- Skeptical about: judge agreement is imperfect and the benchmark is culturally specific, so transfer to other regions is not automatic.
5) Practical next steps
- Add explicit abstention/stopping metrics to agent evals: false continuation rate, token cost to failure, and safe-halt rate should sit beside task success.
- For tool-using agents, instrument the tool-call boundary: log attempt rate, all-wrong subgroup rate, recovery-after-resampling, and per-tool-call uncertainty.
- Build verifier-separated pipelines for high-stakes workflows: generation should emit structured claims/actions, and a separate module should validate citations, schema, DB state, or executable outcomes.
- Stress-test aligned models under post-training perturbations you actually deploy: quantization, activation noise, parameter noise, and optimizer/intervention changes.
- Expand safety evaluation beyond English with localized, human-built adversarial sets and track both harmful compliance and over-refusal.
- For multimodal/reporting systems, maintain a source-aligned visual memory and score cross-modal consistency, not just text quality.
- In GUI/phone agents, train on synthetic but executable recovery and memory tasks rather than only successful demonstrations.
- For enterprise tool stacks, prefer incremental regeneration and preservation of custom logic over full regeneration when syncing agent-facing interfaces.
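The last recommendation, incremental regeneration that preserves custom logic, amounts to a three-way merge between the previous generated output, the newly generated output, and the currently deployed (possibly hand-edited) version. The sketch below illustrates that idea on handlers stored as strings in dictionaries. It is a hypothetical simplification, not DeltaMCP's algorithm: `incremental_regenerate` and the handler names are invented for the example.

```python
from typing import Dict, List, Tuple


def incremental_regenerate(old_gen: Dict[str, str],
                           new_gen: Dict[str, str],
                           current: Dict[str, str]) -> Tuple[Dict[str, str], List[str]]:
    """Merge a fresh generation into deployed handlers, keeping customizations.

    Returns the merged handlers and the names needing manual review, that is,
    handlers that were customized locally and also changed in the spec.
    """
    merged: Dict[str, str] = {}
    conflicts: List[str] = []
    for name, fresh in new_gen.items():
        customized = (name in current and name in old_gen
                      and current[name] != old_gen[name])
        spec_changed = old_gen.get(name) != fresh
        if customized:
            merged[name] = current[name]  # preserve the local edit
            if spec_changed:
                conflicts.append(name)    # flag for a human to reconcile
        else:
            merged[name] = fresh          # safe to take the new generation
    # Handlers added by hand and never generated are carried over untouched.
    for name, body in current.items():
        if name not in old_gen and name not in new_gen:
            merged[name] = body
    return merged, conflicts


old_gen = {"get_user": "v1", "list_orders": "v1"}
current = {"get_user": "v1 + audit log", "list_orders": "v1", "health": "custom"}
new_gen = {"get_user": "v1", "list_orders": "v2", "create_order": "v1"}

merged, conflicts = incremental_regenerate(old_gen, new_gen, current)
print(merged)
print(conflicts)
```

A naive full regeneration would return `new_gen` as-is and silently drop both the audit-log edit and the hand-written `health` handler.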
Generated from per-paper analyses; no external browsing.