Daily AI Paper Report (2026-04-28)

Published:

Chinese version: [中文]

Run stats

  • Candidates: 4364
  • Selected: 30
  • Deepread completed: 30
  • Window (UTC): 2026-04-24T00:00:00Z → 2026-04-25T00:00:00Z (weekend_backlog_sun, expanded=0)
Selected papers

  • 2604.21395 · Supervised Learning Has a Necessary Geometric Blind Spot: Theory, Consequences, and Minimal Repair (PDF)
    Categories: cs.LG, cs.AI, cs.CV · Score: 92
    Why: Theory: ERM forces sensitivity to spurious label-correlated nuisances; unifies robustness failures + minimal fix.
    Tags: robustness, theory, spurious-features, adversarial, representation-learning, generalization
  • 2604.18473 · Train Separately, Merge Together: Modular Post-Training with Mixture-of-Experts (PDF)
    Categories: cs.LG · Score: 92
    Why: Modular post-training via MoE to add domains without regressions; scalable update path.
    Tags: LLM, post-training, mixture-of-experts, modularity, router, continual-learning
  • 2604.21841 · Cross-Modal Phantom: Coordinated Camera-LiDAR Spoofing Against Multi-Sensor Fusion in Autonomous Vehicles (PDF)
    Categories: cs.CR · Score: 90
    Why: Coordinated camera+LiDAR spoofing to defeat fusion redundancy; important AV security threat model.
    Tags: adversarial-attacks, sensor-fusion, autonomous-vehicles, spoofing, robustness, security
  • 2604.19211 · ClawNet: Human-Symbiotic Agent Network for Cross-User Autonomous Cooperation (PDF)
    Categories: cs.AI · Score: 90
    Why: Cross-user agent collaboration + governance framing; important for multi-agent safety & permissions.
    Tags: agents, governance, multi-user, coordination, security, infrastructure
  • 2604.18478 · WorldDB: A Vector Graph-of-Worlds Memory Engine with Ontology-Aware Write-Time Reconciliation (PDF)
    Categories: cs.AI, cs.CL · Score: 90
    Why: Agent memory engine with ontology-aware reconciliation; tackles contradiction/supersession in RAG.
    Tags: agents, memory, RAG, knowledge-graphs, long-term, consistency
  • 2604.19667 · Chat2Workflow: A Benchmark for Generating Executable Visual Workflows with Natural Language (PDF)
    Categories: cs.CL, cs.AI, cs.CV, cs.LG, cs.MA · Score: 90
    Why: Benchmark + agentic framework for generating executable workflows; targets reliability/execution errors.
    Tags: agents, workflow-generation, benchmark, tool-use, execution, reliability, evaluation
  • 2604.17944 · ReCoQA: A Benchmark for Tool-Augmented and Multi-Step Reasoning in Real Estate Question and Answering (PDF)
    Categories: cs.CL · Score: 88
    Why: Large tool-augmented multi-step QA benchmark with verifiable SQL/API steps; strong agent eval.
    Tags: agents, tool-use, benchmark, planning, SQL, evaluation
  • 2604.19606 · AblateCell: A Reproduce-then-Ablate Agent for Virtual Cell Repositories (PDF)
    Categories: cs.AI, cs.MA · Score: 88
    Why: Reproduce-then-ablate coding agent with verification artifacts; strong for auditing scientific agent claims.
    Tags: agents, reproducibility, verification, automated-ablation, scientific-ml, evaluation
  • 2604.17883 · Scaling Human-AI Coding Collaboration Requires a Governable Consensus Layer (PDF)
    Categories: cs.SE, cs.HC, cs.LG · Score: 87
    Why: Proposes governable consensus layer for AI coding; tackles control/traceability failures in dev workflows.
    Tags: AI-assisted coding, governance, traceability, world-models, software-engineering, agents
  • 2603.18788 · Mi:dm K 2.5 Pro (PDF)
    Categories: cs.CL, cs.AI · Score: 86
    Why: Enterprise 32B LLM with reasoning-focused data + training (DuS depth upscaling); likely impactful if results hold.
    Tags: LLM, reasoning, pretraining, data curation, efficiency, Korean
  • 2604.20677 · Intersectional Fairness in Large Language Models (PDF)
    Categories: cs.CL · Score: 86
    Why: Systematic intersectional fairness eval across LLMs; highlights metric pitfalls & stereotype effects.
    Tags: fairness, bias, evaluation, intersectionality, LLMs
  • 2604.19685 · An Answer is just the Start: Related Insight Generation for Open-Ended Document-Grounded QA (PDF)
    Categories: cs.CL · Score: 86
    Why: New doc-grounded “related insight” task + SCOpE-QA dataset for iterative open-ended QA.
    Tags: RAG, document-grounded QA, dataset, evaluation, interactive QA
  • 2604.21598 · DryRUN: On the Role of Public Tests in LLM-Driven Code Generation (PDF)
    Categories: cs.SE, cs.AI · Score: 86
    Why: Analyzes reliance on public tests in LLM code agents; targets a key unrealistic assumption in eval/training loops.
    Tags: code-generation, agents, evaluation, testing, debugging, software-engineering
  • 2604.12440 · IAD-Unify: A Region-Grounded Unified Model for Industrial Anomaly Segmentation, Understanding, and Generation (PDF)
    Categories: cs.CV, cs.AI · Score: 86
    Why: Unified anomaly segmentation + explanation + generation; new Anomaly-56K benchmark; practical VLM design.
    Tags: industrial-anomaly-detection, vision-language-models, grounding, benchmark, DINOv2, Qwen
  • 2604.20805 · Relative Principals, Pluralistic Alignment, and the Structural Value Alignment Problem (PDF)
    Categories: cs.CY, cs.AI, cs.MA · Score: 86
    Why: Governance-focused reframing of alignment via principal-agent axes; useful lens for real deployments.
    Tags: ai-safety, value-alignment, governance, principal-agent, pluralism
  • 2604.19342 · Are Large Language Models Economically Viable for Industry Deployment? (PDF)
    Categories: cs.CL · Score: 86
    Why: Adds cost/latency/energy benchmarking for LLM deployment; closes accuracy-only evaluation gap.
    Tags: llm-evaluation, deployment, latency, energy, cost, benchmarking, systems
  • 2604.06899 · Data Leakage in Automotive Perception: Practitioners' Insights (PDF)
    Categories: cs.CR, cs.LG, cs.SE · Score: 84
    Why: Practitioner study on data leakage in safety-critical automotive perception; actionable reliability insights.
    Tags: data-leakage, evaluation, automotive, ml-reliability, safety, industry-practice
  • 2604.19653 · A Dual Perspective on Synthetic Trajectory Generators: Utility Framework and Privacy Vulnerabilities (PDF)
    Categories: cs.AI · Score: 84
    Why: Analyzes privacy vulnerabilities of synthetic mobility trajectories; concrete privacy-utility evaluation angle.
    Tags: privacy, synthetic-data, trajectory, generative-models, evaluation, data-leakage
  • 2604.17778 · TeleEmbedBench: A Multi-Corpus Embedding Benchmark for RAG in Telecommunications (PDF)
    Categories: cs.LG · Score: 84
    Why: TeleEmbedBench targets embedding eval for RAG on acronym-dense telecom corpora.
    Tags: RAG, embeddings, benchmark, domain evaluation, telecom
  • 2604.21282 · Strategic Heterogeneous Multi-Agent Architecture for Cost-Effective Code Vulnerability Detection (PDF)
    Categories: cs.CR, cs.LG, cs.SE · Score: 84
    Why: Heterogeneous multi-agent LLM setup for vulnerability detection with a local adversarial verifier; cost/accuracy trade-off.
    Tags: cybersecurity, vulnerability-detection, multi-agent, LLM, verification, secure-coding
  • 2604.20134 · AgentSOC: A Multi-Layer Agentic AI Framework for Security Operations Automation (PDF)
    Categories: cs.CR, cs.AI, cs.CL · Score: 84
    Why: Agentic SOC automation with risk-based planning and policy-compliant actions; relevant to agent safety.
    Tags: agents, security-operations, tool-use, risk-assessment, policy-compliance, cybersecurity
  • 2604.18349 · HiGMem: A Hierarchical and LLM-Guided Memory System for Long-Term Conversational Agents (PDF)
    Categories: cs.CL · Score: 84
    Why: LLM-guided hierarchical memory retrieval to reduce bloated context and improve precision/inspectability.
    Tags: agents, memory, retrieval, long-context, RAG, efficiency
  • 2604.19278 · Explicit Trait Inference for Multi-Agent Coordination (PDF)
    Categories: cs.AI, cs.MA · Score: 84
    Why: Trait tracking improves multi-agent coordination; addresses goal drift/error cascades in MAS.
    Tags: multi-agent, coordination, agent-reliability, interaction-modeling, benchmarks
  • 2604.17805 · Ranking Abuse via Strategic Pairwise Data Perturbations (PDF)
    Categories: cs.LG, cs.AI, cs.GT · Score: 82
    Why: Studies adversarial manipulation of pairwise ranking; relevant to preference aggregation and eval integrity.
    Tags: robustness, adversarial, ranking, preference-modeling, data-poisoning, security
  • 2604.19031 · SAGE: Signal-Amplified Guided Embeddings for LLM-based Vulnerability Detection (PDF)
    Categories: cs.CR · Score: 82
    Why: SAGE tackles “signal submersion” to improve LLM-based vulnerability detection robustness.
    Tags: LLM security, vulnerability detection, representation, software security
  • 2604.21345 · Evaluating AI Meeting Summaries with a Reusable Cross-Domain Pipeline (PDF)
    Categories: cs.AI, cs.CL · Score: 82
    Why: Reusable, typed artifact-based eval pipeline for meeting summaries; supports aggregation + statistical testing.
    Tags: evaluation, summarization, benchmarks, pipelines, reliability, offline-eval
  • 2604.11741 · Collaborative Multi-Agent Scripts Generation for Enhancing Imperfect-Information Reasoning in Murder Mystery Games (PDF)
    Categories: cs.AI · Score: 82
    Why: Multi-agent script generation for deception/imperfect-information reasoning; useful eval setting for agentic VLMs.
    Tags: multi-agent, deception, imperfect-information, evaluation, reasoning, VLM
  • 2604.18206 · A Control Architecture for Training-Free Memory Use (PDF)
    Categories: cs.AI · Score: 82
    Why: Training-free control over when and which memory to use; uncertainty routing + governance of the memory bank.
    Tags: agents, memory, routing, uncertainty, reliability, control
  • 2604.19262 · CulturALL: Benchmarking Multilingual and Multicultural Competence of LLMs on Grounded Tasks (PDF)
    Categories: cs.CL, cs.AI · Score: 82
    Why: Grounded multilingual/multicultural benchmark; useful for safety-relevant global deployment evaluation.
    Tags: benchmark, multilingual, culture, grounded-evaluation, robustness, llm-eval
  • 2604.06865 · Physical Adversarial Attacks on AI Surveillance Systems: Detection, Tracking, and Visible-Infrared Evasion (PDF)
    Categories: cs.CV, cs.AI · Score: 81
    Why: Survey of physical adversarial attacks for real surveillance pipelines (tracking, RGB-IR); clarifies threat models.
    Tags: physical-attacks, surveillance, adversarial-examples, tracking, thermal, security

AI Paper Insight Brief

2026-04-28

1) Executive takeaways (read this first)

  • “System-level” robustness is the new baseline: across surveillance and autonomous driving, papers argue that per-frame/per-sensor metrics miss the real threat; persistence over time, cross-modal consistency, and pipeline-aware objectives determine operational risk.
  • Memory is shifting from “retrieve more” to “control + governance”: training-free applicability control (TAG) and write-time semantic reconciliation (WorldDB) both show that when/how memory is applied (and how it evolves) can dominate raw retrieval quality.
  • Benchmarks are becoming more executable and artifact-backed: ReCoQA (SQL+API traces), Chat2Workflow (import+execution), and the meeting-summary pipeline (persisted GT/claims/judgments + significance tests) all push evaluation toward verifiable intermediate steps and end-to-end execution.
  • Modularity is emerging as a practical post-training strategy: BAR (MoE modular post-training) shows near “full retrain” performance while enabling independent domain upgrades—useful for organizations that need frequent capability refreshes without catastrophic forgetting.
  • Security work is increasingly mechanistic: SAGE diagnoses an internal representation failure (“signal submersion”) and fixes it with layerwise sparse feature amplification; ranking manipulation work shows phase transitions where small perturbation budgets cause large outcome shifts.

2) Key themes (clusters)

Theme: System-level physical security (time + modality + pipeline)

  • Why it matters: Real deployments don’t fail on single frames—they fail when evasion persists through tracking, survives sensor redundancy, or induces downstream unsafe actions. Evaluations that ignore these factors can dramatically understate risk.
  • Representative papers: Physical Adversarial Attacks on AI Surveillance Systems (2604.06865); Cross-Modal Phantom (2604.21841).
  • Common approach:
    • Reframe threat models around operational objectives (ID corruption, false trajectories, emergency braking) rather than detector mAP.
    • Emphasize temporal persistence (tracking) and cross-modal transfer/consistency (visible–IR; camera–LiDAR).
    • Propose staged evaluation protocols that increase realism (from digital to activation-aware, multimodal, temporally persistent tests).
  • Open questions / failure modes:
    • How well do digital/simulated attacks transfer to physical conditions (distance, lighting, timing, calibration drift)?
    • What defenses work against coordinated consistency attacks (where sensors agree on a fake object)?
    • How to benchmark identity-level harms (ID switches, long-horizon tracking corruption) consistently across pipelines?
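
The gap between per-frame and pipeline-level metrics can be shown with a toy sketch (my own illustration, not drawn from either paper; the gap-tolerant tracker and all thresholds are hypothetical). Two attacks with identical per-frame evasion rates differ sharply in whether they corrupt a track's identity:

```python
def per_frame_evasion_rate(detections):
    """Fraction of frames in which the attacker evades detection
    (detections[i] is True when the object was detected in frame i)."""
    return sum(1 for d in detections if not d) / len(detections)

def track_broken(detections, max_gap=3):
    """A tracker with simple gap tolerance loses the track (identity
    corruption) only after max_gap consecutive missed detections."""
    gap = 0
    for d in detections:
        gap = 0 if d else gap + 1
        if gap >= max_gap:
            return True
    return False

# Both sequences evade detection in 40% of frames...
scattered = [False, True, True, False, True, False, True, True, False, True]
concentrated = [True, True, True, False, False, False, False, True, True, True]
assert per_frame_evasion_rate(scattered) == per_frame_evasion_rate(concentrated)

# ...but only the temporally persistent one corrupts the track.
assert not track_broken(scattered)
assert track_broken(concentrated)
```

A per-frame metric scores both attacks identically; an identity-level metric separates them, which is the point of the staged, persistence-aware protocols above.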

Theme: Memory for agents—control, hierarchy, and write-time semantics

  • Why it matters: Long-running agents fail when memory is applied in the wrong state, when contradictions accumulate, or when retrieval bloats context. New work suggests memory needs policies and semantics, not just embeddings.
  • Representative papers: A Control Architecture for Training-Free Memory Use (TAG, 2604.18206); HiGMem (2604.18349); WorldDB (2604.18478).
  • Common approach:
    • Add applicability control: uncertainty-gated routing + selective acceptance/rollback + retirement of harmful entries (TAG).
    • Use hierarchical structures (event summaries → turn selection) to raise precision while keeping recall (HiGMem).
    • Enforce write-time reconciliation semantics (supersedes/contradicts/same_as handlers) and auditable immutability (WorldDB).
  • Open questions / failure modes:
    • Control policies depend on confidence separability and bank quality; when does confidence fail as a gate?
    • Write-time semantics increase ingest complexity/cost; how to scale extraction/resolution reliably?
    • Generalization beyond the evaluated settings (e.g., HiGMem’s weaker DialSim results; WorldDB evaluated on LongMemEval-s).
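
The applicability-control idea can be sketched as follows (a minimal toy of my own, loosely inspired by the TAG description above; the gate threshold, retirement count, and `MemoryBank` interface are all hypothetical): retrieve only when the model is uncertain, accept a memory-assisted answer only if confidence improves, and retire entries that repeatedly fail to help.

```python
class MemoryBank:
    """Toy memory store; real systems would retrieve by similarity."""
    def __init__(self, entries):
        self.entries = list(entries)
    def retrieve(self, question):
        return self.entries[0] if self.entries else None
    def retire(self, entry):
        self.entries.remove(entry)

def answer_with_memory_control(question, model, memory, gate=0.8, retire_after=3):
    """model(question, context) -> (answer, confidence)."""
    base_answer, base_conf = model(question, context=None)
    if base_conf >= gate:                    # confident: skip memory entirely
        return base_answer
    entry = memory.retrieve(question)
    if entry is None:
        return base_answer
    mem_answer, mem_conf = model(question, context=entry["text"])
    if mem_conf > base_conf:                 # selective acceptance
        entry["harm"] = 0
        return mem_answer
    entry["harm"] = entry.get("harm", 0) + 1 # rollback to the base answer
    if entry["harm"] >= retire_after:        # evidence-based retirement
        memory.retire(entry)
    return base_answer

# Helpful memory is accepted; harmful memory is rolled back, then retired.
def helpful(q, context=None):
    return ("with-memory", 0.9) if context else ("plain", 0.3)
def harmful(q, context=None):
    return ("plain", 0.3)

good_bank = MemoryBank([{"text": "relevant note", "harm": 0}])
assert answer_with_memory_control("q", helpful, good_bank) == "with-memory"

bad_bank = MemoryBank([{"text": "stale note", "harm": 0}])
for _ in range(3):
    answer_with_memory_control("q", harmful, bad_bank)
assert bad_bank.entries == []                # harmful entry retired
```

As the open questions note, everything here hinges on the confidence signal actually separating cases where memory helps from cases where it hurts.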

Theme: Executable, traceable evaluation for tool/agent workflows

  • Why it matters: Final-answer scores hide where tool and agent pipelines actually fail. Executable traces and persisted artifacts let evaluation separate format validity, intermediate-step correctness, and end-to-end execution success.
  • Representative papers: ReCoQA (2604.17944); Chat2Workflow (2604.19667); Evaluating AI Meeting Summaries (2604.21345); DryRUN (2604.21598).
  • Common approach:
    • Require verifiable intermediate steps (SLU labels, SQL, cached API calls) so scoring is deterministic and auditable.
    • Separate syntactic validity from operational success (Chat2Workflow's Pass vs Resolve distinction).
    • Persist typed evaluation artifacts (ground truth, claims, judgments) to support aggregation and statistical testing.
  • Open questions / failure modes:
    • How well do cached/deterministic tool outputs reflect live, drifting APIs?
    • Reliance on public tests (DryRUN) shows that evaluation assumptions can leak into training loops.

Theme: Modular post-training and enterprise-grade model building

  • Why it matters: Organizations need frequent capability upgrades (math/code/tools/safety, domain language) without full retraining or catastrophic forgetting. Two complementary strategies appear: end-to-end enterprise pipelines and modular MoE composition.
  • Representative papers: Mi:dm K 2.5 Pro (2603.18788); Train Separately, Merge Together (BAR, 2604.18473).
  • Common approach:
    • Heavy emphasis on data curation and targeted synthesis (AST-based code filtering; math gap-filling).
    • Multi-stage post-training (Reasoning SFT, RL variants, merging/fusion) to balance reasoning, fluency, tool use, and robustness.
    • Modular experts trained independently (mid-training→SFT→RLVR) then composed with lightweight router training (BAR).
  • Open questions / failure modes:
    • Inference cost grows with number of experts; BAR notes performance drops when activating fewer experts.
    • Reproducibility gaps: proprietary data/benchmarks and limited compute disclosure (Mi:dm K 2.5 Pro).
    • How to upgrade the anchor/base model without retraining all experts (BAR limitation).
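
The compose-without-retraining step can be sketched in a few lines (my own toy, not BAR's architecture; the zero-initialized router row stands in for the "lightweight router training" the theme mentions, and dimensions are made up). The point is that adding a domain expert touches only the router, never the frozen anchor or existing experts:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class ModularMoE:
    """Frozen anchor expert plus independently trained domain experts,
    mixed per input by a small linear router."""
    def __init__(self, anchor, dim):
        self.experts = [anchor]              # anchor stays frozen
        self.router_w = np.zeros((1, dim))   # one routing row per expert
    def add_expert(self, expert, dim):
        # Only the router grows; existing experts are untouched.
        self.experts.append(expert)
        self.router_w = np.vstack([self.router_w, np.zeros(dim)])
    def forward(self, x):
        gates = softmax(self.router_w @ x)
        return sum(g * e(x) for g, e in zip(gates, self.experts))

moe = ModularMoE(anchor=lambda x: x, dim=4)
moe.add_expert(lambda x: 2 * x, dim=4)       # new "domain expert"
y = moe.forward(np.ones(4))                  # equal gates -> 1.5 * x here
assert np.allclose(y, 1.5 * np.ones(4))
```

The open questions above map directly onto this sketch: every added expert raises inference cost in `forward`, and swapping the anchor invalidates the router rows trained against it.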

Theme: Security & reliability via internal/mechanistic and socio-technical lenses

  • Why it matters: Robustness failures come from both model internals (representation bottlenecks) and process failures (data leakage, governance). This cluster provides concrete diagnostics and attack surfaces.
  • Representative papers: SAGE (2604.19031); Data Leakage in Automotive Perception (2604.06899); Ranking Abuse via Strategic Pairwise Data Perturbations (2604.17805).
  • Common approach:
    • Identify a specific failure mechanism (e.g., “signal submersion” across layers; role-fragmented leakage understanding; MLE ranking phase transitions).
    • Provide actionable interventions or attacks (layerwise SAEs; process controls like immutable eval sets; ASSA manipulation algorithm).
    • Use diagnostics beyond aggregate accuracy (MCC under imbalance; qualitative role-based themes; Kendall Tau distance to target ranking).
  • Open questions / failure modes:
    • SAGE can only amplify signals already present in the backbone; may not help truly novel vulnerability classes.
    • Leakage prevention remains largely process-driven; tooling standardization and cross-role alignment are unresolved.
    • Ranking attacks assume white-box access and heuristic optimization; defenses are not provided.
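
The ranking-manipulation diagnostic can be made concrete with a toy numeric example (my own, not the paper's ASSA algorithm; the data and flip budget are made up): fit a Bradley-Terry model to pairwise comparisons, flip a handful of outcomes, and measure how far the recovered ranking moves using Kendall tau distance.

```python
import itertools

def bradley_terry(wins, n, iters=200):
    """Bradley-Terry scores via the standard minorization-maximization
    update; wins[i][j] = number of times item i beat item j."""
    s = [1.0] * n
    for _ in range(iters):
        s_new = []
        for i in range(n):
            num = sum(wins[i][j] for j in range(n) if j != i)
            den = sum((wins[i][j] + wins[j][i]) / (s[i] + s[j])
                      for j in range(n) if j != i)
            s_new.append(num / den if den else s[i])
        total = sum(s_new)
        s = [v / total for v in s_new]
    return s

def ranking(scores):
    return sorted(range(len(scores)), key=lambda i: -scores[i])

def kendall_tau_distance(r1, r2):
    """Number of item pairs ordered differently by the two rankings."""
    pos1 = {v: i for i, v in enumerate(r1)}
    pos2 = {v: i for i, v in enumerate(r2)}
    return sum(1 for a, b in itertools.combinations(r1, 2)
               if (pos1[a] - pos1[b]) * (pos2[a] - pos2[b]) < 0)

wins = [[0, 8, 9], [2, 0, 7], [1, 3, 0]]       # clean data: 0 > 1 > 2
base = ranking(bradley_terry(wins, 3))
flipped = [[0, 2, 9], [8, 0, 7], [1, 3, 0]]    # flip six 0-vs-1 outcomes
attacked = ranking(bradley_terry(flipped, 3))
assert kendall_tau_distance(base, attacked) == 1   # top spot changes hands
```

Kendall tau distance to a target ranking is exactly the kind of outcome-level diagnostic the cluster uses instead of aggregate accuracy.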

3) Technical synthesis

  • “Applicability” is a recurring control variable: TAG’s route/accept/retire decisions for memory mirror broader agent/tool pipelines where when to invoke a component matters as much as the component itself (also echoed by hierarchical agent decomposition in ReCoQA).
  • Evaluation is moving from single scalar scores to staged pipelines: Chat2Workflow’s Pass vs Resolve, meeting-summary claim extraction + coverage/completeness, and surveillance’s stage ladder all separate syntactic validity from operational success.
  • LLM-as-judge appears in multiple roles: reward shaping (Mi:dm K 2.5 Pro RL; murder-mystery ScoreAgent), benchmark construction/validation (TeleEmbedBench validator), and evaluation (meeting summaries; CulturALL correctness judging).
  • Long-context and long-memory are diverging: Mi:dm K 2.5 Pro pushes 128K context, while WorldDB/HiGMem argue persistence needs structured memory with reconciliation/hierarchy—context length alone doesn’t solve drift/contradiction.
  • Modularity shows up both in models and systems: BAR composes domain experts; ClawNet composes identity-scoped agents; both aim to reduce interference (capability or privacy) via separation + controlled interfaces.
  • Security attacks increasingly target the “glue”: cross-modal fusion (camera–LiDAR), tracking pipelines (surveillance), and ranking aggregation (Bradley–Terry MLE) are attacked at the system/aggregation layer, not just the base predictor.
  • Mechanistic representation interventions are gaining traction: SAGE’s intermediate-layer sparse projection is a concrete example of “fix the representation bottleneck” rather than only prompting or full fine-tuning.
  • Cost/throughput constraints are being formalized: EDGE-EVAL introduces lifecycle metrics (break-even requests, cold-start tax), while TeleEmbedBench and vulnerability-detection architectures explicitly measure latency/cost trade-offs.
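
The synthesis above mentions lifecycle metrics such as "break-even requests"; EDGE-EVAL's exact definition may differ, but a natural reading is sketched here with made-up prices (all numbers hypothetical):

```python
import math

def break_even_requests(fixed_cost, per_request_self, per_request_api):
    """Requests after which self-hosting (fixed setup cost plus a lower
    marginal cost) becomes cheaper than a pay-per-request API."""
    margin = per_request_api - per_request_self
    if margin <= 0:
        return math.inf            # self-hosting never catches up
    return math.ceil(fixed_cost / margin)

# Hypothetical: $100 setup, $1/request self-hosted vs $3/request API.
assert break_even_requests(100.0, 1.0, 3.0) == 50
```

A "cold-start tax" would enter the same calculation as an addition to `fixed_cost` (or as a per-cold-start surcharge), which is why accuracy-only comparisons can invert once request volume is taken into account.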

4) Top 5 papers (with “why now”)

1) WorldDB: A Vector Graph-of-Worlds Memory Engine with Ontology-Aware Write-Time Reconciliation

  • Introduces write-time programmable edges (supersedes/contradicts/same_as handlers) and content-addressed immutability for auditable memory.
  • Shows very strong LongMemEval-s results (overall 96.40%, task-avg 97.11%) and ablations attributing gains to the engine layer.
  • “Why now”: long-running agents are hitting context rot and contradiction/identity drift; this is a concrete substrate-level proposal with ablations and engineering benchmarks.
  • Skepticism / limitation: higher ingest-time overhead; composed embeddings are parameter-free and the paper notes learned aggregators are future work; evaluation scope centered on LongMemEval-s.
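
Write-time reconciliation can be sketched as follows (a toy of my own, not WorldDB's engine; the field names and single-active-fact-per-subject invariant are simplifying assumptions). On every write, a handler decides whether the new fact supersedes, contradicts, or duplicates an existing one, and superseded facts are deactivated rather than deleted, mirroring the auditable-immutability idea:

```python
class MemoryStore:
    def __init__(self):
        self.facts = []   # facts are never deleted, only deactivated

    def write(self, subject, value, t):
        """Reconcile at ingest time; returns the handler that fired."""
        for fact in self.facts:
            if fact["subject"] == subject and fact["active"]:
                if fact["value"] == value:
                    return "same_as"           # duplicate: nothing to add
                newer = t > fact["t"]
                if newer:
                    fact["active"] = False     # old fact superseded
                self.facts.append({"subject": subject, "value": value,
                                   "t": t, "active": newer})
                return "supersedes" if newer else "contradicts"
        self.facts.append({"subject": subject, "value": value,
                           "t": t, "active": True})
        return "new"

    def query(self, subject):
        return [f["value"] for f in self.facts
                if f["subject"] == subject and f["active"]]

m = MemoryStore()
assert m.write("user.city", "Paris", t=1) == "new"
assert m.write("user.city", "Paris", t=2) == "same_as"
assert m.write("user.city", "Berlin", t=3) == "supersedes"
assert m.write("user.city", "Rome", t=0) == "contradicts"  # stale write
assert m.query("user.city") == ["Berlin"]   # queries never see the conflict
```

The cost noted in the limitation shows up directly: every write pays a reconciliation scan so that queries stay clean.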

2) Train Separately, Merge Together: Modular Post-Training with Mixture-of-Experts

  • BAR converts a post-trained dense model into an MoE with an anchor expert (frozen) plus domain experts trained independently (mid-training→SFT→RLVR).
  • At 7B scale, BAR’s overall score (49.1) beats several retraining baselines and supports incremental add/upgrade of experts.
  • “Why now”: frequent model updates are operationally necessary; modularity offers a path to reduce catastrophic forgetting and retraining cost.
  • Skepticism / limitation: inference cost and parameter growth scale with number of experts; performance degrades with sparse expert activation; upgrading the anchor requires retraining experts.

3) A Control Architecture for Training-Free Memory Use

  • TAG provides a training-free control stack: uncertainty-gated retrieval, selective accept/rollback, and evidence-based retirement.
  • Under compute-matched controls, shows sizable arithmetic gains (e.g., SVAMP +7.0, ASDiv +7.67) where “retry” alone is flat.
  • “Why now”: many deployments can’t retrain models but still want memory; this isolates the value of control policy vs “more retrieval.”
  • Skepticism / limitation: strongest wins concentrate on arithmetic; effectiveness depends on confidence separability and memory-bank quality.

4) SAGE: Signal-Amplified Guided Embeddings for LLM-based Vulnerability Detection

  • Diagnoses “Signal Submersion” and uses pan-layer extraction + JumpReLU sparse autoencoders with task-conditional alignment to amplify vulnerability cues.
  • Reports strong MCC results (e.g., BigVul MCC 0.7874 for one setting) and mechanistic evidence (SNR amplification up to 12.7×; concentrated sparse neurons).
  • “Why now”: vulnerability detection is high-impact and suffers from imbalance + distribution shift; this offers a frozen-backbone, mechanistically motivated fix.
  • Skepticism / limitation: cannot create knowledge absent from pretraining; low-resource language subsets are small; SAE training scales with number of probed layers.
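
The amplification mechanism can be sketched numerically (a toy of my own, not SAGE's implementation; the tiny dictionary, threshold, and gain are made up): a JumpReLU-style sparse code over a frozen hidden state, with selected task-aligned features scaled up before decoding.

```python
import numpy as np

def jumprelu(z, theta):
    """JumpReLU: pass activations through only above the threshold theta."""
    return np.where(z > theta, z, 0.0)

def amplify(h, W_enc, W_dec, theta, task_features, gain=4.0):
    z = jumprelu(W_enc @ h, theta)   # sparse feature activations
    z = z.copy()
    z[task_features] *= gain         # boost task-aligned features only
    return W_dec @ z                 # decode back to hidden space

W_enc = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # 3 features, dim 2
W_dec = W_enc.T
h = np.array([1.0, 0.2])
# Feature 1 (0.2) falls below the threshold and is dropped ("submerged");
# feature 2 is treated as task-aligned and boosted 4x.
out = amplify(h, W_enc, W_dec, theta=0.5, task_features=[2])
assert np.allclose(out, [5.8, 4.8])
```

Note the limitation carries over exactly: `amplify` can only rescale features the encoder already extracts; a signal absent from the backbone's representation stays absent.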

5) ReCoQA: A Benchmark for Tool-Augmented and Multi-Step Reasoning in Real Estate Question and Answering

  • Provides 29,270 QA instances with verifiable intermediate traces (SLU labels, SQL, cached API calls) enabling deterministic evaluation.
  • Hierarchical HIRE-Agent improves average accuracy and F1 by about +0.20 over a single-agent baseline; GT-signal probing still leaves a gap (avg accuracy 0.8864).
  • “Why now”: tool-augmented agents need benchmarks where intermediate steps are executable and auditable, not just final answers.
  • Skepticism / limitation: Chinese-language and tied to Chinese map services; single-turn only; template-based generation artifacts remain a concern.
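
Trace-first scoring in this spirit can be sketched as follows (field names and the example data are hypothetical, not ReCoQA's schema): intermediate steps (SQL, cached API calls) are compared against ground truth separately from the final answer, so a wrong answer with a correct trace is distinguishable from a wrong trace.

```python
def score_trace(pred, gold):
    """Score each verifiable step independently, then the final answer."""
    steps = {
        "sql": pred.get("sql") == gold["sql"],
        "api_calls": pred.get("api_calls") == gold["api_calls"],
        "answer": pred.get("answer") == gold["answer"],
    }
    steps["trace_correct"] = steps["sql"] and steps["api_calls"]
    return steps

gold = {"sql": "SELECT price FROM listings WHERE id = 7",
        "api_calls": [("geocode", "Main St 7")],
        "answer": "420000"}
pred = {"sql": "SELECT price FROM listings WHERE id = 7",
        "api_calls": [("geocode", "Main St 7")],
        "answer": "410000"}          # correct trace, wrong final synthesis

result = score_trace(pred, gold)
assert result["trace_correct"] and not result["answer"]
```

Because the tool outputs are cached, this comparison is deterministic, which is what makes the intermediate steps auditable rather than judged.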

5) Practical next steps

  • For agent memory systems, separate memory content from memory-use policy: implement TAG-like routing + accept/rollback and measure compute-matched gains vs “always retrieve.”
  • If building long-term memory, add write-time semantics (supersession/contradiction) and auditability; evaluate on long-memory tasks with ablations that isolate “engine” vs “answerer.”
  • For tool-using agents, adopt trace-first evaluation: require cached/deterministic tool outputs (like ReCoQA) and score both intermediate correctness and final synthesis.
  • In workflow-generation products, track Pass vs Resolve (format/import vs execution correctness) and build error-driven repair loops; measure the pass–resolve gap as a primary KPI.
  • For security robustness in perception, expand tests to temporal + multimodal settings (tracking, visible–IR, camera–LiDAR fusion) and report identity-level or action-level outcomes, not just detector failures.
  • For vulnerability detection, try intermediate-layer feature extraction + sparse amplification (SAGE-style) as a low-cost alternative to full fine-tuning; evaluate under deduped and distribution-shifted splits.
  • For model maintenance, prototype modular expert upgrades (BAR-style) and quantify: (i) domain gain, (ii) general-capability retention, (iii) inference cost vs expert sparsity.
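
The Pass-vs-Resolve split suggested above can be sketched as a harness (my own minimal framing; `load` and `execute` are placeholders for whatever format check and runtime your workflow engine provides): Pass means the generated workflow parses/imports, Resolve means it also executes to the expected result, and the gap isolates runtime failures that format-level checks miss.

```python
def evaluate_workflows(workflows, load, execute):
    """Return pass rate, resolve rate, and the pass-resolve gap."""
    passed = resolved = 0
    for wf in workflows:
        try:
            obj = load(wf)            # format/import check -> Pass
        except Exception:
            continue
        passed += 1
        try:
            if execute(obj):          # end-to-end execution check -> Resolve
                resolved += 1
        except Exception:
            pass
    n = len(workflows)
    return {"pass_rate": passed / n, "resolve_rate": resolved / n,
            "gap": (passed - resolved) / n}

# Toy stand-ins for a real loader/executor:
def load(wf):
    if wf == "bad_format":
        raise ValueError("does not import")
    return wf

def execute(obj):
    if obj == "runtime_fail":
        raise RuntimeError("fails at execution")
    return True

report = evaluate_workflows(["ok", "runtime_fail", "bad_format"], load, execute)
assert report["pass_rate"] == 2 / 3 and report["resolve_rate"] == 1 / 3
```

Tracking `gap` over time, as the bullet suggests, makes the error-driven repair loop measurable: repairs that fix imports but not execution move Pass without moving Resolve.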

Generated from per-paper analyses; no external browsing.