Daily AI Paper Report (2026-05-16)

Published:

Chinese version: [中文]

Run stats

  • Candidates: 371
  • Selected: 30
  • Deepread completed: 30
  • Window (UTC): 2026-05-14T00:00:00Z → 2026-05-15T00:00:00Z (arxiv_announce, expanded=0)
Show selected papers
arXiv IDTitle / LinksCategoriesScoreWhyTags
2605.15030WARD: Adversarially Robust Defense of Web Agents Against Prompt Injections
PDF
cs.CR, cs.AI95Robust prompt-injection defense for web agents with large-scale dataset and deployment focus.agent-safety, prompt-injection, web-agents, defense, benchmark, security
2605.14421MemLineage: Lineage-Guided Enforcement for LLM Agent Memory
PDF
cs.CR, cs.AI95Cryptographic provenance for agent memory directly targets persistent prompt-injection risks.agent-safety, memory-security, provenance, prompt-injection, guardrails
2605.14746Selective Safety Steering via Value-Filtered Decoding
PDF
cs.LG94Decoding-time safety steering that reduces unnecessary interventions while improving safety.safety, alignment, decoding, steering, reliability
2605.14271Auditing Agent Harness Safety
PDF
cs.CL, cs.CY93Audits full agent trajectories for permission and info-flow violations beyond final outputs.agent-safety, auditing, execution-traces, permissions, information-flow, evaluation
2605.14786Known By Their Actions: Fingerprinting LLM Browser Agents via UI Traces
PDF
cs.CR, cs.AI, cs.HC, cs.LG93Shows passive fingerprinting of browser agents, enabling targeted attacks on known model weaknesses.agent-security, browser-agents, fingerprinting, privacy, adversarial
2605.14865Holistic Evaluation and Failure Diagnosis of AI Agents
PDF
cs.AI, cs.CL93Strong agent evaluation/diagnosis framework with span-level localization and reported SOTA gains.agents, evaluation, diagnosis, benchmarks, reliability
2605.15188FutureSim: Replaying World Events to Evaluate Adaptive Agents
PDF
cs.LG, cs.AI, cs.CL93Grounded benchmark for adaptive agents in evolving real-world settings; strong eval value.agents, evaluation, benchmark, forecasting, real-world
2605.14859Do Coding Agents Understand Least-Privilege Authorization?
PDF
cs.CR, cs.AI92Least-privilege benchmark for coding agents targets a core real-world deployment safety gap.agent-safety, coding-agents, authorization, least-privilege, benchmark, security
2605.14605One Step to the Side: Why Defenses Against Malicious Finetuning Fail Under Adaptive Adversaries
PDF
cs.CR, cs.AI, cs.LG91Strong adaptive-attack critique showing current anti-malicious-finetuning defenses broadly fail.alignment, robustness, finetuning, adaptive-attacks, safety-evaluation, open-weights
2605.14750EVA: Editing for Versatile Alignment against Jailbreaks
PDF
cs.CR, cs.AI91Model-editing defense against jailbreaks for LLMs/VLMs with safety-utility focus.jailbreak-defense, alignment, model-editing, VLM, robustness
2605.15040Orchard: An Open-Source Agentic Modeling Framework
PDF
cs.AI, cs.CL91Open-source agentic modeling stack with sandbox primitives and scalable training recipes.agents, frameworks, open-source, sandboxing, training
2605.15109Why Neighborhoods Matter: Traversal Context and Provenance in Agentic GraphRAG
PDF
cs.AI, cs.IR91Important Agentic GraphRAG citation-faithfulness study with trajectory-level provenance framing.RAG, agents, evaluation, citations, provenance, factuality
2605.15134Training ML Models with Predictable Failures
PDF
cs.LG91Targets deployment-scale failure prediction for safety assessment with concrete training objective.safety, evaluation, reliability, failure-prediction, deployment
2605.15118Talk is (Not) Cheap: A Taxonomy and Benchmark Coverage Audit for LLM Attacks
PDF
cs.CR, cs.CL90Threat-surface taxonomy and coverage audit for LLM attack benchmarks; highly reusable evaluation lens.llm-security, taxonomy, benchmarking, attacks, evaluation, agents
2605.15138Forgetting That Sticks: Quantization-Permanent Unlearning via Circuit Attribution
PDF
cs.LG, cs.CL, cs.ET90Important unlearning result: quantization can undo forgetting; proposes mechanistic fix.unlearning, quantization, model-safety, mechanistic-interpretability, deployment
2605.15000Quantifying and Mitigating Premature Closure in Frontier LLMs
PDF
cs.CL, cs.AI90Directly studies unsafe premature commitment in frontier LLMs with mitigation on medical tasks.llm-safety, reliability, uncertainty, abstention, evaluation
2605.15152Widening the Gap: Exploiting LLM Quantization via Outlier Injection
PDF
cs.LG, cs.AI89Practical attack on advanced quantization schemes exposes deployment-time LLM security risk.quantization, model-security, backdoor-risk, deployment, adversarial
2605.14498GroupMemBench: Benchmarking LLM Agent Memory in Multi-Party Conversations
PDF
cs.CL89Useful benchmark for multi-user agent memory, belief tracking, and audience-aware responses.benchmark, agents, memory, multi-party, evaluation
2605.14404Knowledge Beyond Language: Bridging the Gap in Multilingual Machine Unlearning Evaluation
PDF
cs.CL89Multilingual unlearning metrics for cross-lingual privacy leakage; highly relevant to LLM safety.unlearning, privacy, multilingual, evaluation, llm-safety
2605.14454LiSA: Lifelong Safety Adaptation via Conservative Policy Induction
PDF
cs.LG, cs.CL, cs.CR88Practical framework for adapting guardrails from sparse deployment feedback in agent settings.guardrails, agent-safety, online-adaptation, policy-learning, deployment, reliability
2605.15128MemEye: A Visual-Centric Evaluation Framework for Multimodal Agent Memory
PDF
cs.CV, cs.CL, cs.IR88Targets multimodal agent memory with visual-grounded benchmark for evidence preservation.multimodal, agents, memory, evaluation, benchmark
2605.15155Self-Distilled Agentic Reinforcement Learning
PDF
cs.LG, cs.AI, cs.CL88Post-training method for LLM agents combining RL with self-distillation for long-horizon tasks.llm-agents, reinforcement-learning, post-training, self-distillation, reasoning
2605.14290Web Agents Should Adopt the Plan-Then-Execute Paradigm
PDF
cs.CR, cs.AI, cs.CL, cs.SE87Argues plan-then-execute reduces web prompt-injection control-flow risk by design.web-agents, prompt-injection, agent-architecture, security, plan-then-execute
2605.14968GraphFlow: An Architecture for Formally Verifiable Visual Workflows Enabling Reliable Agentic AI Automation
PDF
cs.AI87Formally verifiable workflows for reliable agentic automation in mission-critical settings.agents, formal-methods, reliability, workflows, verification
2605.15077Concurrency without Model Changes: Future-based Asynchronous Function Calling for LLMs
PDF
cs.CL, cs.AI, cs.LG87Practical agent efficiency advance: async tool calling without model changes or retraining.agents, tool-use, systems, efficiency, function-calling
2605.14483LEMON: Learning Executable Multi-Agent Orchestration via Counterfactual Reinforcement Learning
PDF
cs.AI87Learns executable multi-agent orchestration with counterfactual credit assignment; useful for agents.multi-agent, orchestration, reinforcement-learning, agents, automation
2605.14604Sycophancy is an Educational Safety Risk: Why LLM Tutors Need Sycophancy Benchmarks
PDF
cs.AI, cs.HC86Sycophancy benchmark for LLM tutors highlights a concrete, underexplored safety failure mode.sycophancy, benchmark, education, alignment, evaluation
2605.14747Video2GUI: Synthesizing Large-Scale Interaction Trajectories for Generalized GUI Agent Pretraining
PDF
cs.CL, cs.AI, cs.CV, cs.LG86Large-scale GUI-agent pretraining data from unlabeled videos could materially boost agent capability.agents, gui, pretraining, datasets, multimodal
2605.14570Uncertainty Quantification for Large Language Diffusion Models
PDF
cs.CL86First systematic uncertainty quantification study for language diffusion models; safety-relevant.uncertainty, reliability, diffusion-lm, hallucination, evaluation
2605.14932Toward Securing AI Agents Like Operating Systems
PDF
cs.CR85OS-inspired security framing for AI agents offers useful systems perspective on isolation and privilege.agent-security, systems, sandboxing, privilege-separation, architecture, survey

AI Paper Insight Brief

2026-05-16

0) Executive takeaways (read this first)

  • Agent safety evaluation is shifting from final-answer scoring to trajectory-, harness-, and provenance-level auditing. Multiple papers show that high task completion can coexist with serious boundary violations, unsafe memory reuse, or misleading citations.
  • The strongest security pattern today is architectural separation of control from untrusted content: plan-then-execute for web agents, OS-style runtime isolation, lineage-aware memory gating, and typed/verified workflows all aim to remove attack paths rather than merely detect bad outputs.
  • Several papers expose deployment-time reversals: unlearning can fail after quantization, malicious behavior can be activated only after quantization, and fine-tuning defenses fail under adaptive attackers. Safety claims that ignore downstream deployment transforms are increasingly unreliable.
  • Memory is emerging as a first-class failure surface. Benchmarks and defenses show current systems lose speaker grounding, temporal validity, visual evidence, and provenance; in some settings, simple retrieval baselines still beat sophisticated memory ingestion pipelines.
  • Practical mitigations are becoming more selective and calibrated: value-filtered decoding bounds unnecessary interventions, LiSA adapts guardrails conservatively from sparse feedback, and SDAR gates privileged distillation to stabilize agent RL.
  • Infrastructure matters as much as model quality: open environment layers, async tool execution, large-scale GUI pretraining data, and better orchestration learning all improve agent capability—but they also expand the need for stronger runtime controls and auditing.

2) Key themes (clusters)

Theme: Agent safety is moving from outputs to execution traces

Theme: Structural defenses are replacing prompt-only defenses for web and agent security

Theme: Prompt injection remains central, but defenses are diversifying

Theme: Memory is now a core capability and a core attack surface

Theme: Deployment transforms break many safety assumptions

Theme: Better agent infrastructure is improving capability—but also clarifying bottlenecks

3) Technical synthesis

  • Multiple papers converge on a control/data separation principle: PTE isolates control flow from web content; MemLineage separates provenance from content; GraphFlow separates verified structure from runtime nondeterminism; OS-style agent security separates runtime enforcement from model intent.
  • Evaluation is becoming trajectory-native: HarnessAudit normalizes traces into a unified schema, TRAIL-style diagnosis scores leaf spans, and GraphRAG provenance work tests answer dependence via graph ablations rather than citation inspection alone.
  • Several works replace monolithic judgments with factorized scoring: safety adherence vs task completion, retrieval vs reasoning failure, sufficiency vs tightness, or broad vs local policy memory.
  • A recurring failure pattern is misalignment between utility and safety: high task completion can coexist with boundary violations, overbroad permissions, stale memory retrieval, or unsafe intermediate actions.
  • Memory papers consistently show that ingestion is the bottleneck: GroupMemBench finds retrieval failures dominate; MemEye shows stale-evidence selection and caption loss; MemLineage shows provenance loss enables laundering attacks.
  • Deployment robustness increasingly requires post-transform evaluation: quantization changes unlearning outcomes, activates hidden attacks, and should be treated as part of the threat model rather than a downstream implementation detail.
  • Several methods use selective intervention instead of blanket steering: value-filtered decoding only intervenes above a threshold, LiSA gates broad policies by Beta-posterior confidence, and SDAR gates token-level distillation with detached sigmoids.
  • There is a strong move toward typed interfaces and structured artifacts: YAML orchestration specs, typed site APIs, future-valued function schemas, CBOR provenance entries, and explicit permission whitelists all make agent behavior more auditable.
  • Adaptive evaluation is becoming a baseline expectation: WARD uses attacker–guard co-evolution, SIDESTEPPER attacks MFT defenses with mixed objectives, and browser-agent fingerprinting studies retraining-aware adversaries.
  • Systems papers suggest that execution-layer changes can yield large gains without model changes: AsyncFC speeds tool use, Orchard lowers rollout cost/latency, and PTE reframes web-agent security as an architecture choice rather than a robustness patch.

4) Top 5 papers (with “why now”)

  • Auditing Agent Harness Safety
    • Reframes agent safety as a trajectory-level harness problem, not an output-level one.
    • Introduces HarnessAudit-Bench with 210 tasks across 8 domains and 525 perturbation cases.
    • Finds task completion and safe execution are poorly aligned; multi-agent setups amplify violations.
    • Useful now because many teams are shipping multi-agent/tool-using systems with little visibility into mid-trajectory failures.
    • Skepticism / limitation: the paper surfaces failures well, but mitigation strategies are not the main focus.
  • Web Agents Should Adopt the Plan-Then-Execute Paradigm
    • Makes a strong architectural claim: ReAct is intrinsically vulnerable on the web because untrusted content appears exactly where action decisions are made.
    • Shows all 860 WebArena tasks are compatible with PTE under trusted-API assumptions, with 81.28% solvable without runtime LLM subroutines.
    • Useful now because prompt injection on the web is increasingly a deployment blocker, and this offers a structural alternative rather than another detector.
    • Skepticism / limitation: deployability depends heavily on complete, trusted, typed APIs or maintained SDKs.
  • WARD: Adversarially Robust Defense of Web Agents Against Prompt Injections
    • Combines a large multimodal dataset, guard-targeted attack training, and adaptive adversarial training into a compact guard model.
    • Reports strong OOD detection, 100% recall under guard-targeted injections after PIG training, low false positives, and efficient parallel deployment.
    • Useful now because it is one of the more deployment-oriented web-agent defenses in the batch.
    • Skepticism / limitation: explicitly out of scope for pixel-level imperceptible attacks, and camouflaged task-aligned UI remains a failure mode.
  • Forgetting That Sticks: Quantization-Permanent Unlearning via Circuit Attribution
    • Shows standard unlearning can be undone by 4-bit quantization because updates are too small to survive binning.
    • Proposes MANSU, combining circuit localization, restricted null-space projection, and magnitude flooring to make forgetting survive NF4 quantization.
    • Useful now because many release pipelines quantize models after safety work, making pre-quantization unlearning claims incomplete.
    • Skepticism / limitation: evidence is strongest on 8B-class models and factual-recall benchmarks, with nontrivial compute cost.
  • Widening the Gap: Exploiting LLM Quantization via Outlier Injection
    • Exposes a supply-chain style attack where a benign full-precision model becomes malicious only after users quantize it.
    • Demonstrates high post-quantization ASR across practical quantizers including GPTQ and AWQ while preserving full-precision utility.
    • Useful now because quantized model distribution is widespread and often treated as a benign compression step rather than an attack surface.
    • Skepticism / limitation: requires white-box pre-release access to modify weights, so threat relevance depends on model provenance and distribution channel.

5) Practical next steps

  • Add trajectory logging and hidden-policy audits to agent evals; do not rely on final-answer success as a safety proxy.
  • For web agents, prototype plan-then-execute on a narrow set of high-value sites with typed APIs/SDKs and compare security/latency against ReAct.
  • Treat memory as a security boundary: add provenance metadata, trust labels, and sensitive-action gates before allowing memory-derived actions.
  • Evaluate all unlearning and safety edits after deployment transforms: at minimum test post-quantization, post-distillation, and multilingual recovery paths.
  • Red-team MFT defenses with adaptive mixed-objective attackers, not just harmful-loss-only fine-tuning.
  • Benchmark memory systems on speaker grounding, temporal validity, and visual evidence retention; compare against simple BM25 or raw retrieval baselines before adding complex ingestion.
  • Use selective, calibrated steering where possible: measure unnecessary intervention rates, not just refusal/safety gains.
  • If building agent infrastructure, separate concerns explicitly: environment service, harness, planner, executor, and policy enforcement should be independently testable and replaceable.

Generated from per-paper analyses; no external browsing.