Daily AI Paper Report (2026-05-16)
Published:
Chinese version: [中文]
Run stats
- Candidates: 371
- Selected: 30
- Deepread completed: 30
- Window (UTC): 2026-05-14T00:00:00Z → 2026-05-15T00:00:00Z (arxiv_announce, expanded=0)
Show selected papers
| arXiv ID | Title / Links | Categories | Score | Why | Tags |
|---|---|---|---|---|---|
2605.15030 | WARD: Adversarially Robust Defense of Web Agents Against Prompt Injections | cs.CR, cs.AI | 95 | Robust prompt-injection defense for web agents with large-scale dataset and deployment focus. | agent-safety, prompt-injection, web-agents, defense, benchmark, security |
2605.14421 | MemLineage: Lineage-Guided Enforcement for LLM Agent Memory | cs.CR, cs.AI | 95 | Cryptographic provenance for agent memory directly targets persistent prompt-injection risks. | agent-safety, memory-security, provenance, prompt-injection, guardrails |
2605.14746 | Selective Safety Steering via Value-Filtered Decoding | cs.LG | 94 | Decoding-time safety steering that reduces unnecessary interventions while improving safety. | safety, alignment, decoding, steering, reliability |
2605.14271 | Auditing Agent Harness Safety | cs.CL, cs.CY | 93 | Audits full agent trajectories for permission and info-flow violations beyond final outputs. | agent-safety, auditing, execution-traces, permissions, information-flow, evaluation |
2605.14786 | Known By Their Actions: Fingerprinting LLM Browser Agents via UI Traces | cs.CR, cs.AI, cs.HC, cs.LG | 93 | Shows passive fingerprinting of browser agents, enabling targeted attacks on known model weaknesses. | agent-security, browser-agents, fingerprinting, privacy, adversarial |
2605.14865 | Holistic Evaluation and Failure Diagnosis of AI Agents | cs.AI, cs.CL | 93 | Strong agent evaluation/diagnosis framework with span-level localization and reported SOTA gains. | agents, evaluation, diagnosis, benchmarks, reliability |
2605.15188 | FutureSim: Replaying World Events to Evaluate Adaptive Agents | cs.LG, cs.AI, cs.CL | 93 | Grounded benchmark for adaptive agents in evolving real-world settings; strong eval value. | agents, evaluation, benchmark, forecasting, real-world |
2605.14859 | Do Coding Agents Understand Least-Privilege Authorization? | cs.CR, cs.AI | 92 | Least-privilege benchmark for coding agents targets a core real-world deployment safety gap. | agent-safety, coding-agents, authorization, least-privilege, benchmark, security |
2605.14605 | One Step to the Side: Why Defenses Against Malicious Finetuning Fail Under Adaptive Adversaries | cs.CR, cs.AI, cs.LG | 91 | Strong adaptive-attack critique showing current anti-malicious-finetuning defenses broadly fail. | alignment, robustness, finetuning, adaptive-attacks, safety-evaluation, open-weights |
2605.14750 | EVA: Editing for Versatile Alignment against Jailbreaks | cs.CR, cs.AI | 91 | Model-editing defense against jailbreaks for LLMs/VLMs with safety-utility focus. | jailbreak-defense, alignment, model-editing, VLM, robustness |
2605.15040 | Orchard: An Open-Source Agentic Modeling Framework | cs.AI, cs.CL | 91 | Open-source agentic modeling stack with sandbox primitives and scalable training recipes. | agents, frameworks, open-source, sandboxing, training |
2605.15109 | Why Neighborhoods Matter: Traversal Context and Provenance in Agentic GraphRAG | cs.AI, cs.IR | 91 | Important Agentic GraphRAG citation-faithfulness study with trajectory-level provenance framing. | RAG, agents, evaluation, citations, provenance, factuality |
2605.15134 | Training ML Models with Predictable Failures | cs.LG | 91 | Targets deployment-scale failure prediction for safety assessment with concrete training objective. | safety, evaluation, reliability, failure-prediction, deployment |
2605.15118 | Talk is (Not) Cheap: A Taxonomy and Benchmark Coverage Audit for LLM Attacks | cs.CR, cs.CL | 90 | Threat-surface taxonomy and coverage audit for LLM attack benchmarks; highly reusable evaluation lens. | llm-security, taxonomy, benchmarking, attacks, evaluation, agents |
2605.15138 | Forgetting That Sticks: Quantization-Permanent Unlearning via Circuit Attribution | cs.LG, cs.CL, cs.ET | 90 | Important unlearning result: quantization can undo forgetting; proposes mechanistic fix. | unlearning, quantization, model-safety, mechanistic-interpretability, deployment |
2605.15000 | Quantifying and Mitigating Premature Closure in Frontier LLMs | cs.CL, cs.AI | 90 | Directly studies unsafe premature commitment in frontier LLMs with mitigation on medical tasks. | llm-safety, reliability, uncertainty, abstention, evaluation |
2605.15152 | Widening the Gap: Exploiting LLM Quantization via Outlier Injection | cs.LG, cs.AI | 89 | Practical attack on advanced quantization schemes exposes deployment-time LLM security risk. | quantization, model-security, backdoor-risk, deployment, adversarial |
2605.14498 | GroupMemBench: Benchmarking LLM Agent Memory in Multi-Party Conversations | cs.CL | 89 | Useful benchmark for multi-user agent memory, belief tracking, and audience-aware responses. | benchmark, agents, memory, multi-party, evaluation |
2605.14404 | Knowledge Beyond Language: Bridging the Gap in Multilingual Machine Unlearning Evaluation | cs.CL | 89 | Multilingual unlearning metrics for cross-lingual privacy leakage; highly relevant to LLM safety. | unlearning, privacy, multilingual, evaluation, llm-safety |
2605.14454 | LiSA: Lifelong Safety Adaptation via Conservative Policy Induction | cs.LG, cs.CL, cs.CR | 88 | Practical framework for adapting guardrails from sparse deployment feedback in agent settings. | guardrails, agent-safety, online-adaptation, policy-learning, deployment, reliability |
2605.15128 | MemEye: A Visual-Centric Evaluation Framework for Multimodal Agent Memory | cs.CV, cs.CL, cs.IR | 88 | Targets multimodal agent memory with visual-grounded benchmark for evidence preservation. | multimodal, agents, memory, evaluation, benchmark |
2605.15155 | Self-Distilled Agentic Reinforcement Learning | cs.LG, cs.AI, cs.CL | 88 | Post-training method for LLM agents combining RL with self-distillation for long-horizon tasks. | llm-agents, reinforcement-learning, post-training, self-distillation, reasoning |
2605.14290 | Web Agents Should Adopt the Plan-Then-Execute Paradigm | cs.CR, cs.AI, cs.CL, cs.SE | 87 | Argues plan-then-execute reduces web prompt-injection control-flow risk by design. | web-agents, prompt-injection, agent-architecture, security, plan-then-execute |
2605.14968 | GraphFlow: An Architecture for Formally Verifiable Visual Workflows Enabling Reliable Agentic AI Automation | cs.AI | 87 | Formally verifiable workflows for reliable agentic automation in mission-critical settings. | agents, formal-methods, reliability, workflows, verification |
2605.15077 | Concurrency without Model Changes: Future-based Asynchronous Function Calling for LLMs | cs.CL, cs.AI, cs.LG | 87 | Practical agent efficiency advance: async tool calling without model changes or retraining. | agents, tool-use, systems, efficiency, function-calling |
2605.14483 | LEMON: Learning Executable Multi-Agent Orchestration via Counterfactual Reinforcement Learning | cs.AI | 87 | Learns executable multi-agent orchestration with counterfactual credit assignment; useful for agents. | multi-agent, orchestration, reinforcement-learning, agents, automation |
2605.14604 | Sycophancy is an Educational Safety Risk: Why LLM Tutors Need Sycophancy Benchmarks | cs.AI, cs.HC | 86 | Sycophancy benchmark for LLM tutors highlights a concrete, underexplored safety failure mode. | sycophancy, benchmark, education, alignment, evaluation |
2605.14747 | Video2GUI: Synthesizing Large-Scale Interaction Trajectories for Generalized GUI Agent Pretraining | cs.CL, cs.AI, cs.CV, cs.LG | 86 | Large-scale GUI-agent pretraining data from unlabeled videos could materially boost agent capability. | agents, gui, pretraining, datasets, multimodal |
2605.14570 | Uncertainty Quantification for Large Language Diffusion Models | cs.CL | 86 | First systematic uncertainty quantification study for language diffusion models; safety-relevant. | uncertainty, reliability, diffusion-lm, hallucination, evaluation |
2605.14932 | Toward Securing AI Agents Like Operating Systems | cs.CR | 85 | OS-inspired security framing for AI agents offers useful systems perspective on isolation and privilege. | agent-security, systems, sandboxing, privilege-separation, architecture, survey |
AI Paper Insight Brief
2026-05-16
0) Executive takeaways (read this first)
- Agent safety evaluation is shifting from final-answer scoring to trajectory-, harness-, and provenance-level auditing. Multiple papers show that high task completion can coexist with serious boundary violations, unsafe memory reuse, or misleading citations.
- The strongest security pattern today is architectural separation of control from untrusted content: plan-then-execute for web agents, OS-style runtime isolation, lineage-aware memory gating, and typed/verified workflows all aim to remove attack paths rather than merely detect bad outputs.
- Several papers expose deployment-time reversals: unlearning can fail after quantization, malicious behavior can be activated only after quantization, and fine-tuning defenses fail under adaptive attackers. Safety claims that ignore downstream deployment transforms are increasingly unreliable.
- Memory is emerging as a first-class failure surface. Benchmarks and defenses show current systems lose speaker grounding, temporal validity, visual evidence, and provenance; in some settings, simple retrieval baselines still beat sophisticated memory ingestion pipelines.
- Practical mitigations are becoming more selective and calibrated: value-filtered decoding bounds unnecessary interventions, LiSA adapts guardrails conservatively from sparse feedback, and SDAR gates privileged distillation to stabilize agent RL.
- Infrastructure matters as much as model quality: open environment layers, async tool execution, large-scale GUI pretraining data, and better orchestration learning all improve agent capability—but they also expand the need for stronger runtime controls and auditing.
2) Key themes (clusters)
Theme: Agent safety is moving from outputs to execution traces
- Why it matters: Several papers argue that final outputs are too coarse to assess real agent safety. Unsafe tool use, unauthorized access, and failure causes often occur mid-trajectory and remain invisible if evaluation only checks the last answer.
- Representative papers:
- Common approach:
- Audit full traces rather than terminal responses
- Decompose failures into boundary, execution, retrieval, or planning subtypes
- Use hidden artifacts, span-level judges, or ablation-based provenance tests
- Measure both utility and safety/exposure jointly
- Open questions / failure modes:
- How to aggregate many local failures without over-penalizing minor issues
- Whether benchmark taxonomies align with real production failure modes
- How to make provenance complete when uncited context still influences answers
- How to infer least-privilege boundaries before execution, not after harm
Theme: Structural defenses are replacing prompt-only defenses for web and agent security
- Why it matters: The most credible defenses in this batch reduce attack surface by construction: separating planning from execution, enforcing runtime isolation, or binding memory/tool use to explicit authority. This is a stronger pattern than asking the model to “be careful.”
- Representative papers:
- Common approach:
- Separate control flow from untrusted observations
- Enforce permissions, provenance, or contracts outside the model
- Treat tools/memory/runtime as security-critical system components
- Favor typed interfaces, append-only logs, and replayable execution
- Open questions / failure modes:
- Heavy dependence on trusted APIs, SDKs, or runtime infrastructure
- Dynamic tasks may still require replanning or human recovery paths
- Provenance systems depend on attribution quality and trusted inference
- Formal verification often stops at workflow structure, not external boundary behavior
Theme: Prompt injection remains central, but defenses are diversifying
- Why it matters: Web and multimodal agents remain highly exposed to indirect prompt injection and jailbreaks. The notable change is that defenses now span architecture, compact guard models, and direct model editing rather than a single filtering layer.
- Representative papers:
- Common approach:
- Train guards on multimodal and guard-targeted attack data
- Patch harmful internal representations with localized edits
- Restrict runtime control flow so untrusted content cannot add actions
- Study passive reconnaissance channels like UI-trace fingerprinting
- Open questions / failure modes:
- Pixel-level or highly camouflaged attacks remain difficult
- Adaptive attackers can still recover some attack success
- Guard robustness may depend on attacker distribution seen in training
- Fingerprinting enables model-specific exploitation even without direct probing
Theme: Memory is now a core capability and a core attack surface
- Why it matters: Persistent memory is no longer just a convenience feature. It is a source of poisoning, stale-state errors, speaker confusion, provenance gaps, and visual evidence loss—often dominating downstream failures.
- Representative papers:
- Common approach:
- Benchmark memory under multi-party, temporal, or multimodal constraints
- Distinguish retrieval failure from reasoning failure
- Preserve provenance or confidence metadata alongside memory entries
- Use structured memory policies instead of raw append-only recall
- Open questions / failure modes:
- Ingestion often destroys thread structure, speaker grounding, or visual detail
- Retrieval may surface stale but semantically related evidence
- Memory poisoning can be laundered through agent-authored summaries
- Strong offline managers or judges may be required for robust adaptation
Theme: Deployment transforms break many safety assumptions
- Why it matters: Several papers show that safety properties measured in one regime do not survive realistic deployment steps like quantization or downstream fine-tuning. This is a major warning for release processes that certify only pre-deployment checkpoints.
- Representative papers:
- Forgetting That Sticks: Quantization-Permanent Unlearning via Circuit Attribution
- Widening the Gap: Exploiting LLM Quantization via Outlier Injection
- One Step to the Side: Why Defenses Against Malicious Finetuning Fail Under Adaptive Adversaries
- Knowledge Beyond Language: Bridging the Gap in Multilingual Machine Unlearning Evaluation
- Common approach:
- Evaluate safety after quantization, multilingual transfer, or adaptive attack
- Replace behavioral-only metrics with structural or cross-context diagnostics
- Stress defenses with attacker-aware objectives rather than naive baselines
- Focus on persistence across deployment conditions, not one-shot suppression
- Open questions / failure modes:
- Quantization can both erase intended edits and activate hidden malicious behavior
- Unlearning may fail in hold-out languages even when training languages look clean
- Many defenses only block naive optimization paths, not adaptive ones
- Structural guarantees remain limited to certain model sizes or task types
Theme: Better agent infrastructure is improving capability—but also clarifying bottlenecks
- Why it matters: A second cluster focuses on making agents faster, more scalable, and easier to train: orchestration learning, async function calling, open environment services, GUI pretraining, and long-horizon adaptation benchmarks. These papers sharpen where current bottlenecks really are.
- Representative papers:
- Common approach:
- Move improvements into the execution layer or environment layer
- Learn structured orchestration specs rather than ad hoc prompting
- Use richer benchmarks that expose adaptation, latency, and memory bottlenecks
- Combine SFT with RL and targeted systems engineering
- Open questions / failure modes:
- Gains can be highly harness-dependent or in-distribution
- More capable infrastructure increases the need for stronger safety boundaries
- Long-horizon adaptation remains weak even for frontier agents
- Training/evaluation cost remains high for orchestration and adaptive benchmarks
3) Technical synthesis
- Multiple papers converge on a control/data separation principle: PTE isolates control flow from web content; MemLineage separates provenance from content; GraphFlow separates verified structure from runtime nondeterminism; OS-style agent security separates runtime enforcement from model intent.
- Evaluation is becoming trajectory-native: HarnessAudit normalizes traces into a unified schema, TRAIL-style diagnosis scores leaf spans, and GraphRAG provenance work tests answer dependence via graph ablations rather than citation inspection alone.
- Several works replace monolithic judgments with factorized scoring: safety adherence vs task completion, retrieval vs reasoning failure, sufficiency vs tightness, or broad vs local policy memory.
- A recurring failure pattern is misalignment between utility and safety: high task completion can coexist with boundary violations, overbroad permissions, stale memory retrieval, or unsafe intermediate actions.
- Memory papers consistently show that ingestion is the bottleneck: GroupMemBench finds retrieval failures dominate; MemEye shows stale-evidence selection and caption loss; MemLineage shows provenance loss enables laundering attacks.
- Deployment robustness increasingly requires post-transform evaluation: quantization changes unlearning outcomes, activates hidden attacks, and should be treated as part of the threat model rather than a downstream implementation detail.
- Several methods use selective intervention instead of blanket steering: value-filtered decoding only intervenes above a threshold, LiSA gates broad policies by Beta-posterior confidence, and SDAR gates token-level distillation with detached sigmoids.
- There is a strong move toward typed interfaces and structured artifacts: YAML orchestration specs, typed site APIs, future-valued function schemas, CBOR provenance entries, and explicit permission whitelists all make agent behavior more auditable.
- Adaptive evaluation is becoming a baseline expectation: WARD uses attacker–guard co-evolution, SIDESTEPPER attacks MFT defenses with mixed objectives, and browser-agent fingerprinting studies retraining-aware adversaries.
- Systems papers suggest that execution-layer changes can yield large gains without model changes: AsyncFC speeds tool use, Orchard lowers rollout cost/latency, and PTE reframes web-agent security as an architecture choice rather than a robustness patch.
4) Top 5 papers (with “why now”)
- Auditing Agent Harness Safety
- Reframes agent safety as a trajectory-level harness problem, not an output-level one.
- Introduces HarnessAudit-Bench with 210 tasks across 8 domains and 525 perturbation cases.
- Finds task completion and safe execution are poorly aligned; multi-agent setups amplify violations.
- Useful now because many teams are shipping multi-agent/tool-using systems with little visibility into mid-trajectory failures.
- Skepticism / limitation: the paper surfaces failures well, but mitigation strategies are not the main focus.
- Web Agents Should Adopt the Plan-Then-Execute Paradigm
- Makes a strong architectural claim: ReAct is intrinsically vulnerable on the web because untrusted content appears exactly where action decisions are made.
- Shows all 860 WebArena tasks are compatible with PTE under trusted-API assumptions, with 81.28% solvable without runtime LLM subroutines.
- Useful now because prompt injection on the web is increasingly a deployment blocker, and this offers a structural alternative rather than another detector.
- Skepticism / limitation: deployability depends heavily on complete, trusted, typed APIs or maintained SDKs.
- WARD: Adversarially Robust Defense of Web Agents Against Prompt Injections
- Combines a large multimodal dataset, guard-targeted attack training, and adaptive adversarial training into a compact guard model.
- Reports strong OOD detection, 100% recall under guard-targeted injections after PIG training, low false positives, and efficient parallel deployment.
- Useful now because it is one of the more deployment-oriented web-agent defenses in the batch.
- Skepticism / limitation: explicitly out of scope for pixel-level imperceptible attacks, and camouflaged task-aligned UI remains a failure mode.
- Forgetting That Sticks: Quantization-Permanent Unlearning via Circuit Attribution
- Shows standard unlearning can be undone by 4-bit quantization because updates are too small to survive binning.
- Proposes MANSU, combining circuit localization, restricted null-space projection, and magnitude flooring to make forgetting survive NF4 quantization.
- Useful now because many release pipelines quantize models after safety work, making pre-quantization unlearning claims incomplete.
- Skepticism / limitation: evidence is strongest on 8B-class models and factual-recall benchmarks, with nontrivial compute cost.
- Widening the Gap: Exploiting LLM Quantization via Outlier Injection
- Exposes a supply-chain style attack where a benign full-precision model becomes malicious only after users quantize it.
- Demonstrates high post-quantization ASR across practical quantizers including GPTQ and AWQ while preserving full-precision utility.
- Useful now because quantized model distribution is widespread and often treated as a benign compression step rather than an attack surface.
- Skepticism / limitation: requires white-box pre-release access to modify weights, so threat relevance depends on model provenance and distribution channel.
5) Practical next steps
- Add trajectory logging and hidden-policy audits to agent evals; do not rely on final-answer success as a safety proxy.
- For web agents, prototype plan-then-execute on a narrow set of high-value sites with typed APIs/SDKs and compare security/latency against ReAct.
- Treat memory as a security boundary: add provenance metadata, trust labels, and sensitive-action gates before allowing memory-derived actions.
- Evaluate all unlearning and safety edits after deployment transforms: at minimum test post-quantization, post-distillation, and multilingual recovery paths.
- Red-team MFT defenses with adaptive mixed-objective attackers, not just harmful-loss-only fine-tuning.
- Benchmark memory systems on speaker grounding, temporal validity, and visual evidence retention; compare against simple BM25 or raw retrieval baselines before adding complex ingestion.
- Use selective, calibrated steering where possible: measure unnecessary intervention rates, not just refusal/safety gains.
- If building agent infrastructure, separate concerns explicitly: environment service, harness, planner, executor, and policy enforcement should be independently testable and replaceable.
Generated from per-paper analyses; no external browsing.
