Daily AI Paper Report (2026-05-13)
Published:
Chinese version: [中文]
Run stats
- Candidates: 490
- Selected: 30
- Deepread completed: 30
- Window (UTC): 2026-05-11T00:00:00Z → 2026-05-12T00:00:00Z (arxiv_announce, expanded=0)
Show selected papers
| arXiv ID | Title / Links | Categories | Score | Why | Tags |
|---|---|---|---|---|---|
2605.10779 | LITMUS: Benchmarking Behavioral Jailbreaks of LLM Agents in Real OS Environments | cs.CR, cs.CL | 95 | OS-level agent jailbreak benchmark with rollback and dual semantic-physical verification. | agent-safety, jailbreaks, benchmark, os-agents, red-teaming |
2605.10901 | Beyond Red-Teaming: Formal Guarantees of LLM Guardrail Classifiers | cs.LG | 95 | Formal guarantees for LLM guardrails; highly relevant to deployment safety and robustness. | llm-safety, guardrails, formal-verification, robustness, security |
2605.10575 | Acceptance Cards:A Four-Diagnostic Standard for Safe Fine-Tuning Defense Claims | cs.CR, cs.AI, cs.LG | 95 | Strong safety eval standard for fine-tuning defenses; concrete diagnostics and re-scoring evidence. | AI safety, fine-tuning, evaluation, auditing, defenses, benchmarking |
2605.10246 | SciIntegrity-Bench: A Benchmark for Evaluating Academic Integrity in AI Scientist Systems | cs.AI | 95 | Benchmark exposes misconduct in AI scientist systems; highly relevant to agent integrity and safety. | agent-safety, benchmark, ai-scientist, integrity, evaluation |
2605.10763 | MATRA: Modeling the Attack Surface of Agentic AI Systems -- OpenClaw Case Study | cs.AI, cs.CR | 94 | Threat-modeling framework for agentic AI with attack trees and deployment-specific controls. | agent-safety, security, threat-modeling, attack-trees, sandboxing, deployment |
2605.10133 | Usability as a Weapon: Attacking the Safety of LLM-Based Code Generation via Usability Requirements | cs.CR, cs.SE | 93 | Shows realistic attacks on code LLM safety via usability pressure; strong practical security angle. | code-llms, security, adversarial-attacks, reward-hacking, safe-coding |
2605.10582 | Guaranteed Jailbreaking Defense via Disrupt-and-Rectify Smoothing | cs.CR, cs.AI | 93 | Certified jailbreak defense with disrupt-then-rectify smoothing and safety-utility tradeoff. | llm-safety, jailbreak-defense, certified-defense, robustness, guardrails |
2605.10848 | Rethinking Agentic Search with Pi-Serini: Is Lexical Retrieval Sufficient? | cs.IR, cs.AI, cs.CL | 93 | Agentic search study shows BM25 can rival dense retrieval with frontier LLMs; strong practical benchmark value. | agentic-search, retrieval, RAG, evaluation, tools, LLMs |
2605.10614 | PRISM: Generation-Time Detection and Mitigation of Secret Leakage in Multi-Agent LLM Pipelines | cs.AI | 92 | Real-time defense for secret leakage across multi-agent pipelines; strong practical security focus. | multi-agent, secret-leakage, security, generation-time-defense, privacy |
2605.10913 | Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace | cs.AI, cs.PL, cs.SE | 92 | Formal execution trace substrate for meta-agents; strong runtime control, replay, and intervention results. | agents, runtime-monitoring, execution-traces, formal-methods, meta-agents, reliability |
2605.09934 | TRACER: Verifiable Generative Provenance for Multimodal Tool-Using Agents | cs.CL | 91 | Claim-level provenance for tool-using multimodal agents improves verifiability and auditing. | agents, provenance, tool-use, multimodal, auditing, verification |
2605.10146 | Benchmarking Safety Risks of Knowledge-Intensive Reasoning under Malicious Knowledge Editing | cs.AI, cs.CR | 91 | Benchmark targets safety risks from malicious knowledge editing in LLM reasoning. | llm-safety, benchmark, knowledge-editing, reasoning, adversarial, evaluation |
2605.10223 | Beyond Autonomy: A Dynamic Tiered AgentRunner Framework for Governable and Resilient Enterprise AI Execution | cs.AI, cs.SE | 91 | Governable agent execution with risk tiering, separation of powers, and verification for enterprise use. | agents, agent safety, governance, verification, enterprise, control |
2605.10674 | Step Rejection Fine-Tuning: A Practical Distillation Recipe | cs.LG, cs.AI, cs.CL, cs.SE | 91 | Practical agent-training recipe keeps partial trajectories via step filtering; directly relevant to LLM agents. | LLM-agents, post-training, distillation, SWE-bench, training-data, reliability |
2605.10075 | Active Testing of Large Language Models via Approximate Neyman Allocation | cs.AI | 91 | Cuts LLM eval cost for generative tasks via active testing; strong reuse value for safety evals. | llm-evaluation, active-testing, generative-tasks, efficiency, benchmarking |
2605.10365 | Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values | cs.AI | 90 | First benchmark focused on agent values, with executable environments and value-conflict tasks. | agents, values, benchmark, alignment, evaluation |
2605.10764 | Break the Brake, Not the Wheel: Untargeted Jailbreak via Entropy Maximization | cs.CV, cs.AI | 90 | Untargeted transferable VLM jailbreak via entropy maximization; directly relevant to multimodal safety. | jailbreak, vlm-safety, adversarial-attacks, multimodal, robustness, security |
2605.09893 | Pseudo-Deliberation in Language Models: When Reasoning Fails to Align Values and Actions | cs.CL, cs.AI | 90 | Targets value-action gaps in LLMs with a new benchmark and metrics for pseudo-deliberation. | alignment, reasoning, evaluation, behavior, values, benchmark |
2605.10611 | Re-Triggering Safeguards within LLMs for Jailbreak Detection | cs.CR, cs.AI | 90 | Practical jailbreak detection by re-triggering built-in safeguards via embedding disruption. | llm-safety, jailbreak-detection, safeguards, robustness, security |
2605.10912 | WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation | cs.CL | 89 | Native-runtime long-horizon agent benchmark with real tools and realistic multimodal tasks. | agents, benchmark, long-horizon, real-world-eval, tool-use |
2605.10639 | Navigating the Sea of LLM Evaluation: Investigating Bias in Toxicity Benchmarks | cs.AI | 89 | Analyzes bias in toxicity benchmarks; important for trustworthy safety evaluation. | evaluation, toxicity, benchmark-bias, safety, llm-evals |
2605.10448 | Can Agent Benchmarks Support Their Scores? Evidence-Supported Bounds for Interactive-Agent Evaluation | cs.AI | 89 | Targets a key failure in agent evals: unreliable outcome checks; adds evidence-backed scoring bounds. | agent-evaluation, benchmarking, reliability, agents, measurement |
2605.10843 | Training-Free Cultural Alignment of Large Language Models via Persona Disagreement | cs.CL, cs.AI, cs.CY | 89 | Black-box inference-time cultural alignment using persona disagreement; relevant to alignment without finetuning. | alignment, cultural-alignment, black-box-LLMs, steering, personas, inference-time |
2605.10247 | Teaching LLMs to See Graphs: Unifying Text and Structural Reasoning | cs.LG | 89 | Parameter-efficient way for LLMs to natively reason over graphs; notable architecture advance. | llm-architecture, graph-reasoning, transformers, efficiency, reasoning |
2605.10721 | Conformity Generates Collective Misalignment in AI Agents Societies | physics.soc-ph, cs.CL, cs.MA | 88 | Shows individually aligned agents can become collectively misaligned via conformity dynamics. | multi-agent, misalignment, social-dynamics, alignment, theory |
2605.10158 | Unsupervised Process Reward Models | cs.LG | 88 | Unsupervised process reward models could scale reasoning supervision without human labels. | process-reward-models, reasoning, alignment, reward-modeling, scalable-oversight |
2605.10633 | Intrinsic Guardrails: How Semantic Geometry of Personality Interacts with Emergent Misalignment in LLMs | cs.CL, cs.AI | 88 | Studies emergent misalignment via latent persona geometry and proposes intrinsic guardrail directions. | alignment, emergent-misalignment, interpretability, representation, guardrails, llms |
2605.09863 | Nautilus Compass: Black-box Persona Drift Detection for Production LLM Agents | cs.CR, cs.AI, cs.CL, cs.IR, cs.LG | 88 | Black-box drift detection for production agents is practical and safety-relevant for closed APIs. | agents, monitoring, memory, drift detection, black-box, reliability |
2605.10325 | Verifiable Process Rewards for Agentic Reasoning | cs.AI | 88 | Dense verifiable process rewards for long-horizon agent reasoning could improve RL credit assignment. | agentic-reasoning, rlvr, process-rewards, verification, reasoning |
2605.10875 | Compute Where it Counts: Self Optimizing Language Models | cs.LG, cs.CL | 88 | Dynamic per-token compute allocation for LLM decoding could materially improve inference efficiency. | LLM-efficiency, inference, adaptive-compute, sparsity, pruning, systems |
AI Paper Insight Brief
2026-05-13
0) Executive takeaways (read this first)
- Evaluation is moving from headline scores to evidence-backed, behavior-level auditing. Several papers argue current benchmarks overclaim because they miss action-level divergence, unsupported provenance, unverifiable outcomes, or physical side effects.
- Reasoning traces are not a reliable proxy for alignment. Deliberation can worsen value alignment, while post-hoc dialogue/action auditing appears more effective than trying to “fix reasoning” alone.
- Agent safety work is shifting toward runtime controls, not just model training. Strong signals today come from generation-time leakage detection, black-box persona-drift monitoring, tiered execution governance, and runtime substrates for intervention/replay.
- Security threats are increasingly indirect and system-level. Usability-pressure attacks, malicious knowledge editing, behavioral jailbreaks in OS environments, and multimodal untargeted jailbreaks all show that benign-looking context or architecture choices can override nominal safeguards.
- Dense, verifiable intermediate supervision is gaining traction. Verifiable process rewards, unsupervised PRMs, and provenance-aware RL all attack the same bottleneck: sparse outcome rewards are too weak for long-horizon agents.
- Some “old” components may be underappreciated. Tuned BM25 with deeper retrieval and better agent tooling can rival more complex retrieval stacks, suggesting many agent failures still come from orchestration/interface choices rather than core retrieval limits.
2) Key themes (clusters)
Theme: Action-level alignment beats surface reasoning
- Why it matters: Multiple papers show that what models say they value, or how they justify themselves, is often a poor predictor of what they actually do. For safety-critical deployments, alignment checks need to attach to final actions, outputs, and side effects.
- Representative papers:
- Pseudo-Deliberation in Language Models: When Reasoning Fails to Align Values and Actions
- SciIntegrity-Bench: A Benchmark for Evaluating Academic Integrity in AI Scientist Systems
- Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values
- Conformity Generates Collective Misalignment in AI Agents Societies
- Common approach:
- Build behavioral benchmarks where the correct action is refusal, abstention, or value-consistent execution rather than verbal endorsement.
- Compare stated preferences, fast responses, deliberative responses, and trajectory-level behavior.
- Use environment-grounded or trajectory-level judging instead of text-only preference elicitation.
- Analyze system-level effects such as harness choice or social conformity, not just base-model values.
- Open questions / failure modes:
- Automatic judges remain subjective or partially entangled with the intervention pipeline.
- Advice-giving, scientist tasks, and synthetic social settings may not fully generalize to real deployments.
- Harnesses and embedded skills can shift values in model-specific ways, complicating attribution.
- Collective dynamics can create misalignment even when individual agents appear aligned in isolation.
Theme: Runtime governance and monitoring for deployed agents
- Why it matters: Closed APIs, long sessions, and enterprise deployments limit access to internals and make post-training fixes insufficient. The practical frontier is runtime observability, intervention, and enforceable control boundaries.
- Representative papers:
- Nautilus Compass: Black-box Persona Drift Detection for Production LLM Agents
- Beyond Autonomy: A Dynamic Tiered AgentRunner Framework for Governable and Resilient Enterprise AI Execution
- PRISM: Generation-Time Detection and Mitigation of Secret Leakage in Multi-Agent LLM Pipelines
- Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace
- Common approach:
- Add lightweight runtime layers outside the base model: prompt-text drift scoring, per-token leakage monitors, risk-tier routers, or typed execution traces.
- Separate proposal from execution using gateways, critics, verifiers, or meta-agents.
- Emphasize provenance, audit logs, replayability, and bounded recovery loops.
- Favor deployable black-box or model-agnostic mechanisms over weight-access assumptions.
- Open questions / failure modes:
- Best results often assume white-box access to token probabilities or strong runtime instrumentation.
- Steering effects can be narrow or axis-specific even when detection is strong.
- Synthetic or single-deployment evaluations leave open how these systems behave under adaptive adversaries and long-term production drift.
- Added governance layers can introduce latency, false positives, and operational complexity.
Theme: Security attacks are moving up the stack
- Why it matters: The most concerning attacks here do not rely on obvious malicious prompts. They exploit requirement phrasing, edited knowledge, multimodal transfer, context propagation, or real OS execution to bypass safeguards through normal system pathways.
- Representative papers:
- Usability as a Weapon: Attacking the Safety of LLM-Based Code Generation via Usability Requirements
- Benchmarking Safety Risks of Knowledge-Intensive Reasoning under Malicious Knowledge Editing
- LITMUS: Benchmarking Behavioral Jailbreaks of LLM Agents in Real OS Environments
- Break the Brake, Not the Wheel: Untargeted Jailbreak via Entropy Maximization
- Common approach:
- Evaluate attacks in realistic pipelines: issue-tracker requirements, edited model knowledge, live OS environments, or image-based VLM inputs.
- Measure physical execution, transferability, stealth, and persistence rather than only text refusal rates.
- Show that benign-looking or indirect inputs can override implicit safety objectives.
- Stress-test across multiple models, attack families, and defenses.
- Open questions / failure modes:
- Many studies are benchmark-heavy and stop short of robust, deployable defenses.
- Some attacks are evaluated only where the baseline model is initially secure, which narrows scope.
- Transfer remains uneven across architectures and modalities.
- Real-world coverage is still partial: selected CWEs, synthetic edits, or one agent platform.
Theme: Verifiable intermediate supervision is replacing sparse rewards
- Why it matters: Long-horizon agents fail when learning signals arrive only at the end. Several papers independently converge on denser, more local supervision—via verifiers, provenance, or unsupervised process scoring—to improve credit assignment.
- Representative papers:
- Common approach:
- Replace or augment outcome rewards with step-level signals tied to verified evidence, oracle checks, or critic-labeled step utility.
- Convert intermediate structure into training signals: provenance links, first-error localization, verifier rewards, or masked harmful steps.
- Use RL or distillation to push local credit back to the responsible turns.
- Evaluate both in-domain gains and transfer to broader reasoning or agent tasks.
- Open questions / failure modes:
- Benefits depend heavily on verifier quality, critic quality, or scoring-model capability.
- Some methods remain limited to structured domains with objective intermediate checks.
- LLM-as-judge components can bias both data construction and evaluation.
- Direct process metrics do not always match downstream gains cleanly.
Theme: Benchmark credibility itself is under audit
- Why it matters: Several papers argue that benchmark scores can be unsupported by retained evidence, unstable under task/domain shifts, or overly dependent on harness and evaluator choices. This directly affects model selection and safety claims.
- Representative papers:
- Can Agent Benchmarks Support Their Scores? Evidence-Supported Bounds for Interactive-Agent Evaluation
- Navigating the Sea of LLM Evaluation: Investigating Bias in Toxicity Benchmarks
- Acceptance Cards:A Four-Diagnostic Standard for Safe Fine-Tuning Defense Claims
- WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation
- Common approach:
- Add evidence layers, uncertainty bounds, or stricter acceptance criteria on top of existing evaluations.
- Audit whether stored artifacts actually support the claimed success condition.
- Stress benchmarks under task shifts, domain shifts, or fresh-subject generalization.
- Move toward native-runtime, side-effect-aware, long-horizon evaluation.
- Open questions / failure modes:
- Human audit remains expensive and hard to scale.
- Some methods provide partial-identification bounds rather than full certification.
- New benchmarks still cover limited domains, platforms, or interaction styles.
- Stronger evaluation standards may expose that many current “wins” are fragile or non-transferable.
Theme: Better interfaces can matter as much as better models
- Why it matters: A recurring pattern is that system design choices—retrieval depth, tool separation, execution traces, or compute allocation—unlock large gains without changing base model weights.
- Representative papers:
- Common approach:
- Re-factor the interface between model and environment: separate search/browse/read, expose replay/fork, inject graph structure into attention, or allocate compute per token.
- Keep base models frozen or minimally adapted while changing the control surface around them.
- Measure gains in cost, latency, evidence recall, or downstream task success.
- Show that under-tuned baselines can make stronger methods look better than they are.
- Open questions / failure modes:
- Gains may be benchmark-specific or depend on careful engineering.
- Better surfaced evidence does not guarantee better evidence use by the agent.
- Some methods trade off scalability or hardware efficiency for improved reasoning.
- Runtime and kernel support remain bottlenecks for structural or adaptive methods.
3) Technical synthesis
- Action-level verification is becoming the common denominator: value alignment, provenance, OS safety, and benchmark auditing all move from “did the model say the right thing?” to “can we verify the actual action/evidence/state change?”
- Dense local signals are replacing sparse terminal rewards across RL, distillation, and monitoring: verifier-derived turn rewards, provenance-linked local credit, first-error localization, and step masking all attack the same credit-assignment problem.
- LLM-as-judge remains central but contested: it powers value extraction, provenance filtering, benchmark audits, and integrity scoring, yet many papers explicitly note evaluator bias and judge/intervention entanglement.
- Black-box deployability is a major design constraint: Nautilus Compass, active testing, DISCA, DR-Smoothing, and some jailbreak defenses are explicitly designed for API-only or near-API settings.
- Runtime separation of powers is emerging as a safety pattern: AgentRunner’s ToolGateway, Shepherd’s typed effect traces, PRISM’s generation-time monitor, and LITMUS’s independent semantic/physical verification all isolate decision, execution, and audit.
- Evidence provenance is being operationalized, not just visualized: TRACER turns provenance into a training reward; benchmark-audit work turns retained artifacts into score bounds; OS-agent work uses physical state as the ground truth.
- Several papers expose hidden benchmark confounders: retrieval depth, harness choice, task framing, domain wording, and evidence retention can dominate measured performance.
- Security work is increasingly about indirect objective hijacking rather than explicit malicious prompts: usability pressure, malicious edits, context-mediated attacks, and conformity dynamics all exploit latent system incentives.
- Verifier quality is now a first-class bottleneck: weak MCTS harms VPR, imperfect critics limit SRFT, and judge quality constrains value and provenance benchmarks.
- Inference-time control is broadening beyond decoding tricks to include cultural steering, per-token compute allocation, jailbreak smoothing, and embedding-based safeguard re-triggering.
4) Top 5 papers (with “why now”)
Usability as a Weapon: Attacking the Safety of LLM-Based Code Generation via Usability Requirements
- Formalizes a realistic supply-chain attack where benign-looking usability requests induce insecure code.
- Shows very high attack success, especially for trade-off pressures, with Type 3 reaching up to 98.1% on GPT-5.2-chat.
- Useful now because coding agents are increasingly fed issue-tracker and product requirements directly, making requirement-level attacks more realistic than explicit malicious prompts.
- Highlights that implicit security priors are easily overridden by explicit usability objectives.
- Skepticism / limitation: evaluation is limited to 25 CWEs / 75 seed scenarios and only tasks where the baseline model was initially secure.
Pseudo-Deliberation in Language Models: When Reasoning Fails to Align Values and Actions
- Introduces a concrete failure mode: reasoning traces can mention endorsed values while final actions suppress them.
- On DAISY, deliberative generation often underperforms fast generation on value-action alignment; for GPT-4o, Slow–Fast is reported as -0.0378.
- VIVALDI shows dialogue-level post-hoc auditing/rewriting is more effective than reasoning-only repair.
- Useful now because many alignment stacks still assume more explicit reasoning improves safety by default.
- Skepticism / limitation: relies on an automatic value extractor and focuses on advice scenarios under the Schwartz value framework.
TRACER: Verifiable Generative Provenance for Multimodal Tool-Using Agents
- Makes provenance a generation-time output, linking each sentence to tool turn, evidence span, and support type.
- Provenance-aware RL improves both answer quality and traceability: TRACER-RL reaches 78.23% accuracy and 90.52% provenance F1, while reducing tool calls by ~29.56%.
- Useful now because multimodal agents are becoming harder to audit, and trajectory-level logs are too coarse for verification or credit assignment.
- Strong fit for teams building tool-using agents that need both efficiency and auditability.
- Skepticism / limitation: benchmark and evaluation rely on LLM-as-judge and a restricted ToolVQA-derived tool set.
Can Agent Benchmarks Support Their Scores? Evidence-Supported Bounds for Interactive-Agent Evaluation
- Adds a low-intrusion evidence layer that converts benchmark outcomes into evidence-supported bounds rather than single unsupported scores.
- Finds large uncertainty in some popular benchmarks; e.g., ANDROIDWORLD has a native 61.0% score but an evidence-supported bound of [15.9%, 65.9%] with 50.0% Unknown.
- Useful now because agent leaderboards are increasingly used for procurement and deployment decisions despite weak artifact retention.
- Gives benchmark maintainers a practical path to improve credibility without redesigning tasks.
- Skepticism / limitation: results are based on sampled audits with LLM-assisted scoring plus human review, not full benchmark certification.
LITMUS: Benchmarking Behavioral Jailbreaks of LLM Agents in Real OS Environments
- Evaluates jailbreaks in a live OS with physical verification and rollback, not just text outputs.
- Introduces Execution Hallucination, where semantic refusal and physical execution diverge.
- Reports substantial seed-set ASR across six models, from 40.64% to 71.51%, with non-zero EHR across models.
- Useful now because desktop/CLI agents are moving into real workflows where side effects matter more than chat responses.
- Skepticism / limitation: currently centered on OpenClaw and a 117-entry validated seed set, so platform generality remains open.
5) Practical next steps
- Audit action-level divergence explicitly: add checks that compare stated values/reasoning to final outputs, tool calls, and environment state changes; do not rely on chain-of-thought as an alignment proxy.
- Instrument runtime evidence and provenance: log which tool observations support each claim, retain authoritative post-run state, and separate surfaced vs inspected vs used evidence.
- Harden requirement ingestion for coding agents: treat feature requests and “usability improvements” as adversarially manipulable inputs; add security-preservation checks before accepting code changes.
- Adopt layered runtime controls for agents: combine risk-tier routing, execution gateways, verifier/recovery loops, and generation-time monitors for secrets or unsafe actions.
- Prefer dense intermediate supervision where possible: if your environment has objective local checks, convert them into process rewards or step-level masks instead of training only on final success.
- Re-evaluate your benchmarks before optimizing to them: measure Unknown rates, artifact sufficiency, harness sensitivity, and task/domain-shift robustness before trusting leaderboard deltas.
- Test indirect attacks, not just explicit jailbreak prompts: include malicious knowledge edits, context-mediated attacks, usability-pressure prompts, and multimodal transfer attacks in red-team suites.
- Tune the “boring” parts first: retrieval depth, BM25 parameters, tool interfaces, and timeout policies may yield larger gains than swapping in more complex model components.
Generated from per-paper analyses; no external browsing.
