Daily AI Paper Report (2026-05-04)

Published:

Chinese version: [中文]

Run stats

  • Candidates: 4848
  • Selected: 30
  • Deepread completed: 30
  • Window (UTC): 2026-05-01T00:00:00Z → 2026-05-02T00:00:00Z (weekend_backlog_sat, expanded=0)
Selected papers

  • 2604.26235 LATTICE: Evaluating Decision Support Utility of Crypto Agents
    Categories: cs.CR, cs.AI, cs.CL | Score: 90
    Why: Benchmark for crypto agents' decision support; directly relevant to agent evaluation and safety.
    Tags: agents, benchmark, evaluation, decision-support, crypto, llm-judges
  • 2604.24155 The Alignment Target Problem: Divergent Moral Judgments of Humans, AI Systems, and Their Designers
    Categories: cs.CY, cs.AI, cs.HC | Score: 89
    Why: Directly probes alignment targets via human vs AI vs designer moral judgments; high safety relevance.
    Tags: alignment, AI ethics, human values, evaluation, moral judgment
  • 2604.25684 Think Before You Act -- A Neurocognitive Governance Model for Autonomous AI Agents
    Categories: cs.AI | Score: 89
    Why: Agent governance via internalized deliberation; directly targets autonomous agent safety.
    Tags: agent-safety, governance, autonomous-agents, guardrails, decision-making
  • 2604.24341 GoAT-X: A Graph of Auditing Thoughts for Securing Token Transactions in Cross-Chain Contracts
    Categories: cs.CR | Score: 88
    Why: LLM-inspired auditing framework for cross-chain contracts with explicit reasoning over dependencies.
    Tags: security, smart-contracts, auditing, reasoning-framework
  • 2604.26522 AGEL-Comp: A Neuro-Symbolic Framework for Compositional Generalization in Interactive Agents
    Categories: cs.AI, cs.LG, cs.LO, cs.MA, cs.SC | Score: 88
    Why: Neuro-symbolic LLM agent targets compositional generalization failures with grounded verification.
    Tags: llm-agents, neuro-symbolic, compositional-generalization, reasoning, reliability
  • 2604.26197 Hierarchical Long-Term Semantic Memory for LinkedIn's Hiring Agent
    Categories: cs.IR, cs.LG | Score: 88
    Why: Industrial long-term memory for LLM agents with privacy and observability considerations.
    Tags: llm-agents, memory, retrieval, privacy, personalization, deployment
  • 2604.25849 ADEMA: A Knowledge-State Orchestration Architecture for Long-Horizon Knowledge Synthesis with LLM Agents
    Categories: cs.AI | Score: 88
    Why: Long-horizon LLM-agent orchestration with explicit knowledge state, governance, memory, and fallback.
    Tags: llm-agents, long-horizon, orchestration, memory, reliability, evaluation
  • 2604.21352 CARE: Counselor-Aligned Response Engine for Online Mental-Health Support
    Categories: cs.CL | Score: 88
    Why: LLM mental-health assistant in a high-risk domain; counselor alignment and real-time support matter.
    Tags: llm, safety, mental-health, alignment, high-stakes
  • 2604.25602 OxyGent: Making Multi-Agent Systems Modular, Observable, and Evolvable via Oxy Abstraction
    Categories: cs.AI | Score: 88
    Why: Modular multi-agent framework with observability and permission-driven planning; strong agent safety relevance.
    Tags: agents, multi-agent, observability, permissions, frameworks
  • 2604.25152 MGTEVAL: An Interactive Platform for Systematic Evaluation of Machine-Generated Text Detectors
    Categories: cs.CR, cs.CL | Score: 86
    Why: Reusable platform for systematic evaluation of machine-generated text detectors under attacks.
    Tags: evaluation, text-detection, robustness, benchmark, security
  • 2604.26516 Lyapunov-Guided Self-Alignment: Test-Time Adaptation for Offline Safe Reinforcement Learning
    Categories: cs.LG, cs.AI | Score: 86
    Why: Test-time self-alignment for offline safe RL with Lyapunov-guided safety constraints.
    Tags: safe-rl, alignment, test-time-adaptation, offline-rl, safety
  • 2604.26394 SecMate: Multi-Agent Adaptive Cybersecurity Troubleshooting with Tri-Context Personalization
    Categories: cs.CR, cs.AI | Score: 86
    Why: Multi-agent cybersecurity assistant with user study; strong real-world agentic security relevance.
    Tags: agent-safety, cybersecurity, multi-agent, evaluation, human-study, troubleshooting
  • 2604.17788 SoK: Analysis of Privacy Risks and Mitigation in Online Propaganda Detection through the PROMPT Framework
    Categories: cs.CR, cs.SI | Score: 86
    Why: Privacy-risk framework and compliance scoring for propaganda detection pipelines.
    Tags: privacy, security, survey, framework, compliance, evaluation
  • 2604.25085 Optimally Auditing Adversarial Agents
    Categories: cs.GT, cs.AI, cs.CY | Score: 86
    Why: General audit-policy design for adversarial agents; strong strategic oversight relevance and concrete algorithms.
    Tags: agents, auditing, game-theory, adversarial, mechanism-design
  • 2604.20273 ActuBench: A Multi-Agent LLM Pipeline for Generation and Evaluation of Actuarial Reasoning Tasks
    Categories: cs.AI, cs.CL | Score: 84
    Why: Multi-agent LLM benchmark pipeline with verifier/repair roles and broad model evaluation.
    Tags: llm-evaluation, multi-agent, benchmark, reasoning, dataset
  • 2604.25665 LLM-ReSum: A Framework for LLM Reflective Summarization through Self-Evaluation
    Categories: cs.CL, cs.AI, cs.DL, cs.IR | Score: 84
    Why: Strong LLM evaluation/meta-eval plus self-evaluative summarization across long documents.
    Tags: llm-evaluation, summarization, self-evaluation, long-context, benchmarks
  • 2604.07900 AnomalyAgent: Agentic Industrial Anomaly Synthesis via Tool-Augmented Reinforcement Learning
    Categories: cs.CV, cs.AI | Score: 84
    Why: Tool-augmented RL agent with self-reflection for iterative anomaly synthesis; relevant to agent evaluation.
    Tags: agents, tool-use, reinforcement-learning, industrial-ai, self-reflection
  • 2603.07897 LeJOT-AutoML: LLM-Driven Feature Engineering for Job Execution Time Prediction in Databricks Cost Optimization
    Categories: cs.LG | Score: 84
    Why: LLM-agent AutoML for cost prediction; concrete enterprise use with RAG and lifecycle automation.
    Tags: llm, agents, automl, rag, enterprise, prediction
  • 2604.18302 Toward Zero-Egress Psychiatric AI: On-Device LLM Deployment for Privacy-Preserving Mental Health Decision Support
    Categories: cs.AI | Score: 84
    Why: On-device LLM deployment for sensitive mental-health use directly targets privacy-preserving AI.
    Tags: llm, privacy, on-device, healthcare, deployment, security
  • 2604.25247 R-CoT: A Reasoning-Layer Watermark via Redundant Chain-of-Thought in Large Language Models
    Categories: cs.CR | Score: 84
    Why: Embeds watermarking into reasoning paths; relevant to LLM misuse resistance and model ownership.
    Tags: llm-security, watermarking, reasoning, chain-of-thought, misuse
  • 2604.20795 Automatic Ontology Construction Using LLMs as an External Layer of Memory, Verification, and Planning for Hybrid Intelligent Systems
    Categories: cs.AI | Score: 84
    Why: Structured external memory with ontology validation could improve grounding, verification, and agent reliability.
    Tags: llm, knowledge-graphs, grounding, verification, agents, rag
  • 2604.25591 Walking Through Uncertainty: An Empirical Study of Uncertainty Estimation for Audio-Aware Large Language Models
    Categories: eess.AS, cs.AI, cs.CL, cs.LG, cs.SD | Score: 84
    Why: First systematic study of uncertainty estimation for audio-aware LLMs; useful reliability benchmark.
    Tags: uncertainty, multimodal, LLM, hallucination, evaluation
  • 2604.24346 SycoPhantasy: Quantifying Sycophancy and Hallucination in Small Open Weight VLMs for Vision-Language Scoring of Fantasy Characters
    Categories: cs.CV, cs.AI | Score: 83
    Why: Targets VLM sycophancy/hallucination with a new metric and large-scale benchmark.
    Tags: vlm, hallucination, sycophancy, evaluation, benchmark, reliability
  • 2604.05489 SCMAPR: Self-Correcting Multi-Agent Prompt Refinement for Complex-Scenario Text-to-Video Generation
    Categories: cs.AI, cs.MA | Score: 82
    Why: Multi-agent self-correcting prompt refinement with structured verification; relevant to agent reliability.
    Tags: multi-agent, prompting, self-correction, verification, text-to-video
  • 2604.20598 Self-Aware Vector Embeddings for Retrieval-Augmented Generation: A Neuroscience-Inspired Framework for Temporal, Confidence-Weighted, and Relational Knowledge
    Categories: cs.IR, cs.CL, cs.DB, cs.LG | Score: 82
    Why: RAG framework adds temporal, confidence, and relational signals to reduce stale retrieval errors.
    Tags: rag, retrieval, factuality, knowledge-updates, reliability
  • 2604.24076 An Information-Geometric Framework for Stability Analysis of Large Language Models under Entropic Stress
    Categories: cs.AI, cs.CL, cs.CR | Score: 82
    Why: LLM stability under uncertainty targets reliability in high-stakes deployment, though claims seem abstract.
    Tags: LLM reliability, stability, uncertainty, evaluation, safety
  • 2604.25491 The Forensic Cost of Watermark Removal
    Categories: cs.CV, cs.AI | Score: 82
    Why: Adds forensic detectability to watermark removal evaluation; useful for provenance/security.
    Tags: watermarking, forensics, security, detection, generative-media
  • 2604.20842 SpeechParaling-Bench: A Comprehensive Benchmark for Paralinguistic-Aware Speech Generation
    Categories: cs.CL, cs.AI, cs.SD | Score: 82
    Why: Large benchmark for paralinguistic-aware speech generation; reusable eval resource for audio LMs.
    Tags: benchmark, speech, audio-language-models, evaluation, multimodal
  • 2604.21766 AUDITA: A New Dataset to Audit Humans vs. AI Skill at Audio QA
    Categories: cs.CL | Score: 82
    Why: New audio QA benchmark targeting shortcut resistance and real auditory reasoning; useful for evals.
    Tags: benchmark, evaluation, audio, reasoning, robustness
  • 2604.11394 Optimizing IoT Intrusion Detection with Tabular Foundation Models for Smart City Forensics
    Categories: cs.CR | Score: 82
    Why: Security-focused evaluation of tabular foundation models for fast IoT intrusion detection.
    Tags: security, intrusion-detection, foundation-models, tabular, evaluation, iot

AI Paper Insight Brief

2026-05-04

0) Executive takeaways (read this first)

  • Agentic systems are shifting from “LLM as monolith” to LLM + constrained runtime structure: multiple papers show gains from adding explicit verification, tool-use boundaries, memory/state abstractions, or governance loops rather than relying on raw prompting alone.
  • A strong theme is evaluation becoming more decision-centric and failure-aware: new benchmarks target utility, uncertainty, sycophancy, privacy/compliance, paralinguistics, audio reasoning, and long-document summarization rather than just top-line accuracy.
  • For safety/security, the most credible wins come from hybrid pipelines that combine symbolic/static structure with learned components: e.g., cross-chain auditing, offline safe RL adaptation, and ontology/memory systems all improve by grounding model reasoning in explicit constraints or state.
  • Several papers show that runtime context quality dominates model quality in practice: device evidence, hierarchical memory, runtime-derived features, and scenario-aware prompt refinement often produce larger gains than swapping base models alone.
  • Production-oriented work is increasingly explicit about latency, cost, provenance, and update loops. The best systems report not just accuracy but deployment tradeoffs: cost per project, inference speedups, token savings, or engineering-cycle compression.
  • A recurring caution: many promising results still rest on synthetic benchmarks, narrow domains, or LLM-as-judge protocols, so operational adoption should prioritize adversarial validation, judge calibration, and real-world holdout tests.

2) Key themes (clusters)

  • Structured agent orchestration for production tasks
  • Verification-first safety and security pipelines
  • Better benchmarks for real failure modes, not just accuracy
  • Retrieval, memory, and knowledge representations are becoming temporal and structured
  • Privacy, forensics, and ownership are moving toward architectural guarantees

3) Technical synthesis

  • A common systems pattern is proposal → verification → revision: SCMAPR, GoAT-X, LLM-ReSum, PAGRL, and ActuBench all use an initial generative step followed by structured checking and targeted repair.
  • Several papers replace static pipelines with runtime-adaptive control loops: AnomalyAgent’s tool-augmented RL, SecMate’s confidence-guided troubleshooting, SAS’s imagined safe-fragment prompting, and OxyGent’s permission-driven planning.
  • Offline preprocessing to improve online latency is everywhere: HLTM’s bottom-up aggregation, LeJOT’s cached feature extraction, SmartVector’s consolidation path, and ActuBench’s hardest-subset curation.
  • The strongest retrieval/memory papers add non-semantic signals to ranking: time validity, confidence, relations, hierarchy, privacy scope, or answerable QA views.
  • Evaluation is increasingly focused on operating characteristics, not just average score: TPR@low FPR in watermark/detector work, AURAC/AUROC for uncertainty, and human preference or item response theory (IRT) for benchmark validity.
  • Multiple papers show that judge design is now a core methodological variable: LLM-as-judge appears in actuarial reasoning, summarization, speech, crypto utility, and VLM sycophancy, often with explicit concern about bias.
  • There is a visible shift from “alignment as training” to alignment as runtime governance: PAGRL, audit optimization, zero-egress design, and test-time safe RL all emphasize deployment-time control.
  • Hybrid neuro-symbolic methods remain competitive when the task requires compositionality, formal constraints, or persistent state, as seen in AGEL-Comp, ontology construction, and GoAT-X.
  • Several practical papers report that context/evidence quality beats model scaling: device clues in SecMate, runtime features in LeJOT, and hierarchical memory in HLTM all materially change outcomes.
  • A recurring limitation across the set is external validity: synthetic benchmarks, single-domain case studies, and proprietary components remain common, so many reported gains should be treated as strong prototypes rather than settled best practices.
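The proposal → verification → revision pattern recurs often enough to pin down as a control loop. A minimal Python sketch, where `generate`, `verify`, and `revise` are hypothetical stand-ins for the model and checker calls a real system would make:

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CheckResult:
    ok: bool
    feedback: str = ""


def propose_verify_revise(
    generate: Callable[[str], str],
    verify: Callable[[str], CheckResult],
    revise: Callable[[str, str], str],
    task: str,
    max_rounds: int = 3,
) -> Optional[str]:
    """Generic proposal -> verification -> targeted-revision loop.

    Returns the first draft that passes verification, or None when the
    revision budget is exhausted (the caller should then escalate, not act).
    """
    draft = generate(task)
    for _ in range(max_rounds):
        result = verify(draft)
        if result.ok:
            return draft
        # Repair is targeted: the verifier's feedback, not the whole
        # task, drives the next revision.
        draft = revise(draft, result.feedback)
    return None
```

The important design choice, shared by the papers above, is that failure exits the loop explicitly rather than letting an unverified draft through.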
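The non-semantic ranking signals mentioned above can be illustrated with a toy scorer that weights semantic similarity by temporal decay and stored confidence. The field names and the multiplicative combination are assumptions for illustration, not any paper's actual formula:

```python
import time


def rank_memories(memories, now=None, half_life_days=30.0):
    """Rank retrieved items by semantic similarity weighted with
    non-semantic signals: exponential temporal decay and a stored
    confidence value.

    Each memory is a dict with keys `text`, `embedding_sim`
    (precomputed similarity in [0, 1]), `written_at` (unix seconds),
    and `confidence` (in [0, 1]).
    """
    now = time.time() if now is None else now

    def score(m):
        age_days = max(0.0, (now - m["written_at"]) / 86400.0)
        # Weight halves every `half_life_days`, so stale facts sink.
        recency = 0.5 ** (age_days / half_life_days)
        return m["embedding_sim"] * recency * m["confidence"]

    return sorted(memories, key=score, reverse=True)
```

Even this crude decay term changes orderings: a slightly less similar but fresh, high-confidence memory outranks a stale near-duplicate, which is exactly the stale-retrieval failure these papers target.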
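Operating-characteristic metrics like TPR at a fixed low FPR are cheap to compute directly from detector scores. A small threshold-based sketch, assuming higher scores mean "positive":

```python
def tpr_at_fpr(pos_scores, neg_scores, max_fpr=0.01):
    """True-positive rate at the most permissive threshold whose
    false-positive rate stays within `max_fpr` (e.g. TPR@1%FPR,
    the standard reporting point for watermark/MGT detectors).
    """
    neg = sorted(neg_scores, reverse=True)
    # How many negatives we may misclassify under the FPR budget.
    allowed = int(max_fpr * len(neg))
    # Threshold sits at the (allowed+1)-th highest negative score, so at
    # most `allowed` negatives score strictly above it.
    threshold = neg[allowed] if allowed < len(neg) else float("-inf")
    return sum(s > threshold for s in pos_scores) / len(pos_scores)
```

Reporting this instead of AUROC alone exposes detectors that look strong on average but collapse in the low-false-positive regime where deployment actually operates.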

4) Top 5 papers (with “why now”)

  • GoAT-X: A Graph of Auditing Thoughts for Securing Token Transactions in Cross-Chain Contracts
    • Combines static analysis, formal cross-chain predicates, LLM ensembles, and RAG into a constrained audit pipeline.
    • Reports 92% audit-point recall and project-level recall/precision/F1 of 0.95/0.83/0.88 on 673 contracts across 20 projects.
    • Wild scan found 117 confirmed risks from 128 alerts, with low reported cost and runtime per project.
    • Why now: it is a concrete example of how to make LLM-based security analysis useful by tightly grounding it in program structure rather than free-form reasoning.
    • Skeptical about: implicit semantic/arithmetic dependencies still cause misses, and some alerts still require manual exploitability triage.
  • Hierarchical Long-Term Semantic Memory for LinkedIn’s Hiring Agent
    • Introduces a schema-aligned hierarchical memory with facets, answerable QA, and summaries plus privacy-scoped subtree retrieval.
    • Supports lossless incremental updates and provenance-aware answers, addressing real production memory constraints.
    • On a production-derived benchmark, reports semantic correctness 0.798 and Token-F1 0.635, with better latency/quality tradeoffs than baselines.
    • Why now: long-term memory is becoming the bottleneck for enterprise agents, and this is one of the clearest production-grounded designs in the batch.
    • Skeptical about: evaluation is small and domain-specific, and some hierarchical baselines were not fully compared.
  • SecMate: Multi-Agent Adaptive Cybersecurity Troubleshooting with Tri-Context Personalization
    • Integrates device evidence, online user proficiency profiling, and recommendation into a single troubleshooting assistant.
    • Device grounding via the Clue Collector raised correct resolutions from about 50% to 90.9% in a 144-participant study.
    • Profiling improves quickly within a few turns, and the system reports user-preferred stepwise solution delivery.
    • Why now: it shows a practical recipe for making support agents actually useful—ground them in local evidence and adapt to user skill.
    • Skeptical about: participant population is relatively homogeneous, and per-conversation cost/latency are non-trivial.
  • LLM-ReSum: A Framework for LLM Reflective Summarization through Self-Evaluation
    • Provides a broad meta-evaluation of 14 summarization metrics across seven datasets and shows lexical metrics often fail badly on long/specialized documents.
    • Multi-agent LLM evaluators align better with humans on linguistic dimensions, and the refinement loop improves weak summaries without finetuning.
    • Reports up to +33% factual accuracy and +39% coverage improvements on low-quality summaries, with 89% human preference for refined outputs.
    • Why now: evaluation-driven self-improvement is becoming a practical alternative to retraining, especially for enterprise summarization stacks.
    • Skeptical about: long-document performance still degrades sharply, and evaluator self-preference remains a concern.
  • Lyapunov-Guided Self-Alignment: Test-Time Adaptation for Offline Safe Reinforcement Learning
    • Uses imagined rollouts and occupancy-based Lyapunov scoring to select safe trajectory fragments as in-context prompts for a pretrained transformer policy.
    • Avoids parameter updates at deployment while reducing constraint cost and failure rates and maintaining or improving reward across Safety Gymnasium and MuJoCo settings.
    • Includes a probabilistic bound linking safety to computational budget.
    • Why now: test-time alignment without retraining is increasingly relevant for deployed agents and robotics-like systems where updates are expensive or impossible.
    • Skeptical about: inference overhead is significant, and safety depends on offline coverage and world-model quality.
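The fragment-selection step described for the Lyapunov paper can be caricatured as: keep only imagined fragments along which a Lyapunov-style risk value is non-increasing, then rank by total decrease. Everything below (the scalar scorer, the monotonicity test) is an illustrative stand-in, not the paper's actual estimator:

```python
def select_safe_fragments(fragments, lyapunov, k=3, tol=0.0):
    """Pick the k fragments whose Lyapunov-style risk value is
    non-increasing along the trajectory (a proxy for "moving toward
    the safe set"), ranked by total decrease.

    `lyapunov` maps a state to a scalar risk value; each fragment is
    a list of states. Illustrative sketch only.
    """
    def non_increasing(frag):
        vals = [lyapunov(s) for s in frag]
        return all(b <= a + tol for a, b in zip(vals, vals[1:]))

    def total_drop(frag):
        return lyapunov(frag[0]) - lyapunov(frag[-1])

    safe = [f for f in fragments if non_increasing(f)]
    return sorted(safe, key=total_drop, reverse=True)[:k]
```

In the paper's setup the selected fragments would then be prepended as in-context conditioning for the pretrained policy, which is where the reported inference overhead comes from.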

5) Practical next steps

  • Build verification-first agent loops: require proposal, explicit check, and targeted revision before any consequential action or external tool call.
  • For agent memory, test hierarchical or temporal retrieval rather than flat vector search; measure stale-answer rate, provenance coverage, and update cost, not just retrieval recall.
  • Add runtime observability hooks now: structured traces, checkpoints, intermediate artifacts, and per-step latency/cost accounting are becoming table stakes for debugging and governance.
  • When evaluating assistants, move beyond accuracy to decision-support metrics: actionability, uncertainty handling, evidence coverage, and user burden.
  • Stress-test any LLM-as-judge setup with human spot checks, pairwise comparisons, and bias audits before using it for model ranking or automated refinement.
  • For safety-critical agents, prototype pre-action governance layers with explicit proceed/self-correct/escalate outcomes and measure false escalations, missed escalations, and latency overhead.
  • In security workflows, prefer hybrid static/symbolic + LLM designs over pure prompting; measure low-FPR performance and analyst triage burden.
  • If deploying privacy-sensitive systems, prioritize architectural guarantees such as on-device inference, scoped memory, and no-egress defaults rather than relying only on policy promises.
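A pre-action governance layer with explicit proceed/self-correct/escalate outcomes can start as simply as a thresholded gate. The thresholds and the upstream risk scorer here are deployment-specific assumptions, not taken from any of the papers:

```python
from enum import Enum


class Verdict(Enum):
    PROCEED = "proceed"
    SELF_CORRECT = "self_correct"
    ESCALATE = "escalate"


def govern(risk_score, can_self_correct, proceed_below=0.3, escalate_above=0.7):
    """Pre-action gate: low-risk actions proceed, high-risk (or
    uncorrectable) actions escalate to a human, and the middle band
    gets one self-correction attempt. Thresholds are illustrative and
    should be tuned against measured false/missed escalation rates.
    """
    if risk_score < proceed_below:
        return Verdict.PROCEED
    if risk_score >= escalate_above or not can_self_correct:
        return Verdict.ESCALATE
    return Verdict.SELF_CORRECT
```

Logging every verdict alongside the eventual outcome gives exactly the false-escalation / missed-escalation counts the next-steps bullet recommends measuring.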
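One cheap bias audit for an LLM-as-judge setup is an order-swap test: re-run each pairwise comparison with the answers swapped and count verdicts that track position rather than content. `judge` here is a hypothetical callable returning "A" or "B" for whichever answer was presented first or second:

```python
def position_bias_rate(judge, pairs):
    """Fraction of pairs where the judge's verdict follows slot
    position instead of content. A content-consistent judge that picks
    answer x as "A" in (x, y) should pick it as "B" in (y, x); returning
    the same letter both times means the verdict tracked position.
    """
    positional = 0
    for a, b in pairs:
        first = judge(a, b)
        swapped = judge(b, a)
        if first == swapped:  # same slot won both times
            positional += 1
    return positional / len(pairs)
```

A rate near zero suggests content-driven verdicts; a high rate means model rankings built on this judge should not be trusted until the bias is mitigated (e.g. by averaging over both orders).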

Generated from per-paper analyses; no external browsing.