Daily AI Paper Report (2026-04-28)
Run stats
- Candidates: 4364
- Selected: 30
- Deepread completed: 30
- Window (UTC): 2026-04-24T00:00:00Z → 2026-04-25T00:00:00Z (weekend_backlog_sun, expanded=0)
Selected papers
| arXiv ID | Title / Links | Categories | Score | Why | Tags |
|---|---|---|---|---|---|
2604.21395 | Supervised Learning Has a Necessary Geometric Blind Spot: Theory, Consequences, and Minimal Repair | cs.LG, cs.AI, cs.CV | 92 | Theory: ERM forces sensitivity to spurious label-correlated nuisances; unifies robustness failures + minimal fix | robustness, theory, spurious-features, adversarial, representation-learning, generalization |
2604.18473 | Train Separately, Merge Together: Modular Post-Training with Mixture-of-Experts | cs.LG | 92 | Modular post-training via MoE to add domains without regressions; scalable update path. | LLM, post-training, mixture-of-experts, modularity, router, continual-learning |
2604.21841 | Cross-Modal Phantom: Coordinated Camera-LiDAR Spoofing Against Multi-Sensor Fusion in Autonomous Vehicles | cs.CR | 90 | Coordinated camera+LiDAR spoofing to defeat fusion redundancy; important AV security threat model. | adversarial-attacks, sensor-fusion, autonomous-vehicles, spoofing, robustness, security |
2604.19211 | ClawNet: Human-Symbiotic Agent Network for Cross-User Autonomous Cooperation | cs.AI | 90 | Cross-user agent collaboration + governance framing; important for multi-agent safety & permissions. | agents, governance, multi-user, coordination, security, infrastructure |
2604.18478 | WorldDB: A Vector Graph-of-Worlds Memory Engine with Ontology-Aware Write-Time Reconciliation | cs.AI, cs.CL | 90 | Agent memory engine with ontology-aware reconciliation; tackles contradiction/supersession in RAG. | agents, memory, RAG, knowledge-graphs, long-term, consistency |
2604.19667 | Chat2Workflow: A Benchmark for Generating Executable Visual Workflows with Natural Language | cs.CL, cs.AI, cs.CV, cs.LG, cs.MA | 90 | Benchmark + agentic framework for generating executable workflows; targets reliability/execution errors. | agents, workflow-generation, benchmark, tool-use, execution, reliability, evaluation |
2604.17944 | ReCoQA: A Benchmark for Tool-Augmented and Multi-Step Reasoning in Real Estate Question and Answering | cs.CL | 88 | Large tool-augmented multi-step QA benchmark with verifiable SQL/API steps; strong agent eval. | agents, tool-use, benchmark, planning, SQL, evaluation |
2604.19606 | AblateCell: A Reproduce-then-Ablate Agent for Virtual Cell Repositories | cs.AI, cs.MA | 88 | Reproduce-then-ablate coding agent with verification artifacts; strong for auditing scientific agent claims. | agents, reproducibility, verification, automated-ablation, scientific-ml, evaluation |
2604.17883 | Scaling Human-AI Coding Collaboration Requires a Governable Consensus Layer | cs.SE, cs.HC, cs.LG | 87 | Proposes governable consensus layer for AI coding; tackles control/traceability failures in dev workflows. | AI-assisted coding, governance, traceability, world-models, software-engineering, agents |
2603.18788 | Mi:dm K 2.5 Pro | cs.CL, cs.AI | 86 | Enterprise 32B LLM w/ reasoning-focused data+training (DuS depth upscaling); likely impactful if results solid | LLM, reasoning, pretraining, data curation, efficiency, Korean |
2604.20677 | Intersectional Fairness in Large Language Models | cs.CL | 86 | Systematic intersectional fairness eval across LLMs; highlights metric pitfalls & stereotype effects | fairness, bias, evaluation, intersectionality, LLMs |
2604.19685 | An Answer is just the Start: Related Insight Generation for Open-Ended Document-Grounded QA | cs.CL | 86 | New doc-grounded “related insight” task + SCOpE-QA dataset for iterative open-ended QA | RAG, document-grounded QA, dataset, evaluation, interactive QA |
2604.21598 | DryRUN: On the Role of Public Tests in LLM-Driven Code Generation | cs.SE, cs.AI | 86 | Analyzes reliance on public tests in LLM code agents; targets a key unrealistic assumption in eval/training loops | code-generation, agents, evaluation, testing, debugging, software-engineering |
2604.12440 | IAD-Unify: A Region-Grounded Unified Model for Industrial Anomaly Segmentation, Understanding, and Generation | cs.CV, cs.AI | 86 | Unified anomaly segmentation+explanation+generation; new Anomaly-56K benchmark; practical VLM design | industrial-anomaly-detection, vision-language-models, grounding, benchmark, DINOv2, Qwen |
2604.20805 | Relative Principals, Pluralistic Alignment, and the Structural Value Alignment Problem | cs.CY, cs.AI, cs.MA | 86 | Governance-focused reframing of alignment via principal-agent axes; useful lens for real deployments | ai-safety, value-alignment, governance, principal-agent, pluralism |
2604.19342 | Are Large Language Models Economically Viable for Industry Deployment? | cs.CL | 86 | Adds cost/latency/energy benchmarking for LLM deployment; closes accuracy-only evaluation gap. | llm-evaluation, deployment, latency, energy, cost, benchmarking, systems |
2604.06899 | Data Leakage in Automotive Perception: Practitioners' Insights | cs.CR, cs.LG, cs.SE | 84 | Practitioner study on data leakage in safety-critical automotive perception; actionable reliability insights. | data-leakage, evaluation, automotive, ml-reliability, safety, industry-practice |
2604.19653 | A Dual Perspective on Synthetic Trajectory Generators: Utility Framework and Privacy Vulnerabilities | cs.AI | 84 | Analyzes privacy vulnerabilities of synthetic mobility trajectories; concrete privacy-utility evaluation angle. | privacy, synthetic-data, trajectory, generative-models, evaluation, data-leakage |
2604.17778 | TeleEmbedBench: A Multi-Corpus Embedding Benchmark for RAG in Telecommunications | cs.LG | 84 | TeleEmbedBench benchmark targets embedding eval for RAG on acronym-dense telecom corpora | RAG, embeddings, benchmark, domain evaluation, telecom |
2604.21282 | Strategic Heterogeneous Multi-Agent Architecture for Cost-Effective Code Vulnerability Detection | cs.CR, cs.LG, cs.SE | 84 | Heterogeneous multi-agent LLM setup for vuln detection with local adversarial verifier; cost/accuracy trade-off | cybersecurity, vulnerability-detection, multi-agent, LLM, verification, secure-coding |
2604.20134 | AgentSOC: A Multi-Layer Agentic AI Framework for Security Operations Automation | cs.CR, cs.AI, cs.CL | 84 | Agentic SOC automation with risk-based planning and policy-compliant actions; relevant to agent safety | agents, security-operations, tool-use, risk-assessment, policy-compliance, cybersecurity |
2604.18349 | HiGMem: A Hierarchical and LLM-Guided Memory System for Long-Term Conversational Agents | cs.CL | 84 | LLM-guided hierarchical memory retrieval to reduce bloated context and improve precision/inspectability. | agents, memory, retrieval, long-context, RAG, efficiency |
2604.19278 | Explicit Trait Inference for Multi-Agent Coordination | cs.AI, cs.MA | 84 | Trait tracking improves multi-agent coordination; addresses goal drift/error cascades in MAS. | multi-agent, coordination, agent-reliability, interaction-modeling, benchmarks |
2604.17805 | Ranking Abuse via Strategic Pairwise Data Perturbations | cs.LG, cs.AI, cs.GT | 82 | Studies adversarial manipulation of pairwise ranking; relevant to preference aggregation and eval integrity. | robustness, adversarial, ranking, preference-modeling, data-poisoning, security |
2604.19031 | SAGE: Signal-Amplified Guided Embeddings for LLM-based Vulnerability Detection | cs.CR | 82 | SAGE tackles “signal submersion” to improve LLM-based vulnerability detection robustness | LLM security, vulnerability detection, representation, software security |
2604.21345 | Evaluating AI Meeting Summaries with a Reusable Cross-Domain Pipeline | cs.AI, cs.CL | 82 | Reusable, typed artifact-based eval pipeline for meeting summaries; supports aggregation + statistical testing | evaluation, summarization, benchmarks, pipelines, reliability, offline-eval |
2604.11741 | Collaborative Multi-Agent Scripts Generation for Enhancing Imperfect-Information Reasoning in Murder Mystery Games | cs.AI | 82 | Multi-agent script generation for deception/imperfect info reasoning; useful eval setting for agentic VLMs | multi-agent, deception, imperfect-information, evaluation, reasoning, VLM |
2604.18206 | A Control Architecture for Training-Free Memory Use | cs.AI | 82 | Training-free control for when/which memory to use; uncertainty routing + governance of memory bank. | agents, memory, routing, uncertainty, reliability, control |
2604.19262 | CulturALL: Benchmarking Multilingual and Multicultural Competence of LLMs on Grounded Tasks | cs.CL, cs.AI | 82 | Grounded multilingual/multicultural benchmark; useful for safety-relevant global deployment evaluation. | benchmark, multilingual, culture, grounded-evaluation, robustness, llm-eval |
2604.06865 | Physical Adversarial Attacks on AI Surveillance Systems: Detection, Tracking, and Visible–Infrared Evasion | cs.CV, cs.AI | 81 | Survey of physical adversarial attacks for real surveillance pipelines (tracking, RGB-IR); clarifies threat models. | physical-attacks, surveillance, adversarial-examples, tracking, thermal, security
AI Paper Insight Brief
2026-04-28
0) Executive takeaways (read this first)
- “System-level” robustness is the new baseline: across surveillance and autonomous driving, papers argue that per-frame/per-sensor metrics miss the real threat; persistence over time, cross-modal consistency, and pipeline-aware objectives determine operational risk.
- Memory is shifting from “retrieve more” to “control + governance”: training-free applicability control (TAG) and write-time semantic reconciliation (WorldDB) both show that when/how memory is applied (and how it evolves) can dominate raw retrieval quality.
- Benchmarks are becoming more executable and artifact-backed: ReCoQA (SQL+API traces), Chat2Workflow (import+execution), and the meeting-summary pipeline (persisted GT/claims/judgments + significance tests) all push evaluation toward verifiable intermediate steps and end-to-end execution.
- Modularity is emerging as a practical post-training strategy: BAR (MoE modular post-training) shows near “full retrain” performance while enabling independent domain upgrades—useful for organizations that need frequent capability refreshes without catastrophic forgetting.
- Security work is increasingly mechanistic: SAGE diagnoses an internal representation failure (“signal submersion”) and fixes it with layerwise sparse feature amplification; ranking manipulation work shows phase transitions where small perturbation budgets cause large outcome shifts.
2) Key themes (clusters)
Theme: System-level physical security (time + modality + pipeline)
- Why it matters: Real deployments don’t fail on single frames—they fail when evasion persists through tracking, survives sensor redundancy, or induces downstream unsafe actions. Evaluations that ignore these factors can dramatically understate risk.
- Representative papers:
- Physical Adversarial Attacks on AI Surveillance Systems: Detection, Tracking, and Visible–Infrared Evasion (2604.06865)
- Cross-Modal Phantom: Coordinated Camera-LiDAR Spoofing Against Multi-Sensor Fusion in Autonomous Vehicles (2604.21841)
- Common approach:
- Reframe threat models around operational objectives (ID corruption, false trajectories, emergency braking) rather than detector mAP.
- Emphasize temporal persistence (tracking) and cross-modal transfer/consistency (visible–IR; camera–LiDAR).
- Propose staged evaluation protocols that increase realism (from digital to activation-aware, multimodal, temporally persistent tests).
- Open questions / failure modes:
- How well do digital/simulated attacks transfer to physical conditions (distance, lighting, timing, calibration drift)?
- What defenses work against coordinated consistency attacks (where sensors agree on a fake object)?
- How to benchmark identity-level harms (ID switches, long-horizon tracking corruption) consistently across pipelines?
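One way to make the last question concrete is to score identity-level harm directly. The sketch below is a toy ID-switch counter over per-frame ground-truth-to-track assignments; it is an illustrative metric under assumed inputs, not any surveyed paper's evaluation protocol (a full treatment would use CLEAR-MOT/HOTA-style matching).

```python
from collections import defaultdict

def count_id_switches(frames):
    """Count identity switches: a ground-truth object is counted once each
    time the track ID assigned to it changes between consecutive detections.

    `frames` is a list of dicts mapping ground-truth object ID -> assigned
    track ID (or None if the object was missed in that frame).
    Illustrative metric only, not a full CLEAR-MOT implementation.
    """
    last_track = {}               # gt_id -> last non-None track ID seen
    switches = defaultdict(int)
    for assignments in frames:
        for gt_id, track_id in assignments.items():
            if track_id is None:
                continue          # missed detection; not counted as a switch here
            if gt_id in last_track and last_track[gt_id] != track_id:
                switches[gt_id] += 1
            last_track[gt_id] = track_id
    return dict(switches), sum(switches.values())

# Toy example: an evasion patch corrupts person "p1"'s identity mid-sequence.
frames = [
    {"p1": "t1", "p2": "t2"},
    {"p1": "t1", "p2": "t2"},
    {"p1": "t7", "p2": "t2"},   # p1 re-identified as a new track
    {"p1": "t7", "p2": None},
]
per_object, total = count_id_switches(frames)
print(per_object, total)  # {'p1': 1} 1
```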
Theme: Memory for agents—control, hierarchy, and write-time semantics
- Why it matters: Long-running agents fail when memory is applied in the wrong state, when contradictions accumulate, or when retrieval bloats context. New work suggests memory needs policies and semantics, not just embeddings.
- Representative papers:
- A Control Architecture for Training-Free Memory Use (TAG, 2604.18206)
- HiGMem: A Hierarchical and LLM-Guided Memory System for Long-Term Conversational Agents (2604.18349)
- WorldDB: A Vector Graph-of-Worlds Memory Engine with Ontology-Aware Write-Time Reconciliation (2604.18478)
- Common approach:
- Add applicability control: uncertainty-gated routing + selective acceptance/rollback + retirement of harmful entries (TAG); a minimal control-loop sketch follows this theme's bullets.
- Use hierarchical structures (event summaries → turn selection) to raise precision while keeping recall (HiGMem).
- Enforce write-time reconciliation semantics (supersedes/contradicts/same_as handlers) and auditable immutability (WorldDB).
- Open questions / failure modes:
- Control policies depend on confidence separability and bank quality; when does confidence fail as a gate?
- Write-time semantics increase ingest complexity/cost; how to scale extraction/resolution reliably?
- Generalization beyond the evaluated settings (e.g., HiGMem’s weaker DialSim results; WorldDB evaluated on LongMemEval-s).
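To make the "applicability control" pattern above concrete, here is a minimal, training-free control-loop sketch: retrieval is gated on a confidence score, a memory-conditioned answer is accepted only if a validator prefers it (otherwise rolled back), and entries whose evidence turns negative are retired. All callables and thresholds are placeholders supplied by the surrounding system; this is not TAG's actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class MemoryEntry:
    text: str
    hits: int = 0      # times the entry helped
    misses: int = 0    # times it was rolled back

@dataclass
class MemoryController:
    """Training-free control loop: route -> accept/rollback -> retire.

    `answer(q)`, `answer_with(q, mem)`, `confidence(q, a)`, `validate(q, a)`
    and `retrieve(bank, q)` are assumed to be provided by the surrounding
    system (e.g. the base LLM plus a cheap verifier score); placeholders only.
    """
    bank: list[MemoryEntry] = field(default_factory=list)
    route_threshold: float = 0.6   # only consult memory when the model is unsure
    retire_after: int = 3          # evidence-based retirement margin

    def run(self, q, answer, answer_with, confidence, validate, retrieve):
        base = answer(q)
        if confidence(q, base) >= self.route_threshold:
            return base                      # confident: skip memory entirely
        entry = retrieve(self.bank, q)
        if entry is None:
            return base
        candidate = answer_with(q, entry.text)
        if validate(q, candidate) > validate(q, base):
            entry.hits += 1                  # accept the memory-conditioned answer
            return candidate
        entry.misses += 1                    # rollback: keep the original answer
        if entry.misses - entry.hits >= self.retire_after:
            self.bank.remove(entry)          # retire consistently harmful entries
        return base
```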
Theme: Executable, traceable evaluation for tool/agent workflows
- Why it matters: “Looks right” outputs are not enough—agents must produce executable artifacts and verifiable intermediate steps. This theme pushes evaluation toward reproducibility, diagnosis, and regression gating.
- Representative papers:
- ReCoQA: A Benchmark for Tool-Augmented and Multi-Step Reasoning in Real Estate Question and Answering (2604.17944)
- Chat2Workflow: A Benchmark for Generating Executable Visual Workflows with Natural Language (2604.19667)
- Evaluating AI Meeting Summaries with a Reusable Cross-Domain Pipeline (2604.21345)
- Common approach:
- Provide machine-verifiable traces (SQL + cached API calls in ReCoQA).
- Separate format validity from execution correctness (Pass Rate vs Resolve Rate in Chat2Workflow); see the metric sketch at the end of this theme.
- Persist artifact-backed evaluation (structured GT, extracted claims, judge outputs) and run significance tests for release decisions.
- Open questions / failure modes:
- Residual errors persist even with perfect intermediate labels (ReCoQA reports accuracy below 1.0 even when ground-truth SLU/SQL/API signals are provided), implying a hard global synthesis/planning bottleneck.
- Benchmarks may be limited in scale/ontology (Chat2Workflow: 27 tasks, 20 node types) and risk overfitting to platform conventions.
- Judge variance and GT omissions can confound “unsupported” labels in summarization pipelines.
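The Pass-vs-Resolve separation above can be expressed in a few lines: a generated workflow "passes" if it parses/imports, and "resolves" only if executing it produces the expected result. The JSON node format and the `execute` callable below are hypothetical stand-ins, not Chat2Workflow's harness.

```python
import json

def evaluate_workflows(generated, execute):
    """Separate format validity from execution correctness.

    `generated` is a list of (workflow_json_str, expected_output) pairs.
    `execute(workflow_dict)` is a placeholder for importing the workflow into
    the target platform and running it; it returns the final output or raises.
    Returns (pass_rate, resolve_rate, pass_resolve_gap).
    """
    n = len(generated)
    passed = resolved = 0
    for raw, expected in generated:
        try:
            wf = json.loads(raw)           # format validity: does it parse/import?
        except json.JSONDecodeError:
            continue
        passed += 1
        try:
            if execute(wf) == expected:    # execution correctness
                resolved += 1
        except Exception:
            pass
    pass_rate, resolve_rate = passed / n, resolved / n
    return pass_rate, resolve_rate, pass_rate - resolve_rate

# Toy usage with a trivial "executor" that sums a list of node values.
def toy_exec(wf):
    return sum(node["value"] for node in wf["nodes"])

data = [
    ('{"nodes": [{"value": 1}, {"value": 2}]}', 3),  # passes and resolves
    ('{"nodes": [{"value": 1}]}', 5),                # passes, wrong result
    ('{"nodes": [', 4),                              # malformed: fails Pass
]
print(evaluate_workflows(data, toy_exec))  # (0.667, 0.333, 0.333)
```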
Theme: Modular post-training and enterprise-grade model building
- Why it matters: Organizations need frequent capability upgrades (math/code/tools/safety, domain language) without full retraining or catastrophic forgetting. Two complementary strategies appear: end-to-end enterprise pipelines and modular MoE composition.
- Representative papers:
- Train Separately, Merge Together: Modular Post-Training with Mixture-of-Experts (BAR, 2604.18473)
- Mi:dm K 2.5 Pro (2603.18788)
- Common approach:
- Heavy emphasis on data curation and targeted synthesis (AST-based code filtering; math gap-filling).
- Multi-stage post-training (Reasoning SFT, RL variants, merging/fusion) to balance reasoning, fluency, tool use, and robustness.
- Modular experts trained independently (mid-training→SFT→RLVR) then composed with lightweight router training (BAR).
- Open questions / failure modes:
- Inference cost grows with number of experts; BAR notes performance drops when activating fewer experts.
- Reproducibility gaps: proprietary data/benchmarks and limited compute disclosure (Mi:dm K 2.5 Pro).
- How to upgrade the anchor/base model without retraining all experts (BAR limitation).
Theme: Security & reliability via internal/mechanistic and socio-technical lenses
- Why it matters: Robustness failures come from both model internals (representation bottlenecks) and process failures (data leakage, governance). This cluster provides concrete diagnostics and attack surfaces.
- Representative papers:
- SAGE: Signal-Amplified Guided Embeddings for LLM-based Vulnerability Detection (2604.19031)
- Data Leakage in Automotive Perception: Practitioners' Insights (2604.06899)
- Ranking Abuse via Strategic Pairwise Data Perturbations (2604.17805)
- Common approach:
- Identify a specific failure mechanism (e.g., “signal submersion” across layers; role-fragmented leakage understanding; MLE ranking phase transitions).
- Provide actionable interventions or attacks (layerwise SAEs; process controls like immutable eval sets; ASSA manipulation algorithm).
- Use diagnostics beyond aggregate accuracy (MCC under imbalance; qualitative role-based themes; Kendall Tau distance to target ranking).
- Open questions / failure modes:
- SAGE can only amplify signals already present in the backbone; may not help truly novel vulnerability classes.
- Leakage prevention remains largely process-driven; tooling standardization and cross-role alignment are unresolved.
- Ranking attacks assume white-box access and heuristic optimization; defenses are not provided.
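A toy illustration of why pairwise aggregation is a fragile target: fit Bradley–Terry strengths with the standard MM-style MLE, flip a small budget of comparisons around the top item, and measure how far the induced ranking moves (Kendall tau distance). This is a didactic sketch of the sensitivity, not the paper's manipulation algorithm or threat model.

```python
import itertools
import numpy as np

def bradley_terry(wins, iters=200):
    """MM-algorithm MLE for Bradley-Terry strengths.
    wins[i, j] = number of times item i beat item j."""
    n = wins.shape[0]
    p = np.ones(n)
    comps = wins + wins.T                      # total comparisons per pair
    for _ in range(iters):
        for i in range(n):
            denom = sum(comps[i, j] / (p[i] + p[j]) for j in range(n) if j != i)
            if denom > 0:
                p[i] = wins[i].sum() / denom
        p /= p.sum()
    return p

def kendall_tau_distance(rank_a, rank_b):
    """Number of discordant pairs between two rankings (lists of item ids)."""
    pos_b = {item: k for k, item in enumerate(rank_b)}
    return sum(1 for x, y in itertools.combinations(rank_a, 2)
               if pos_b[x] > pos_b[y])          # pair ordered differently in rank_b

rng = np.random.default_rng(0)
n = 6
true_strength = np.arange(1, n + 1, dtype=float)
wins = np.zeros((n, n))
for i, j in itertools.combinations(range(n), 2):
    k = 20                                              # comparisons per pair
    pi = true_strength[i] / (true_strength[i] + true_strength[j])
    w = rng.binomial(k, pi)
    wins[i, j], wins[j, i] = w, k - w

clean_rank = list(np.argsort(-bradley_terry(wins)))

# Flip a small budget of outcomes in comparisons involving the top item.
poisoned = wins.copy()
top = clean_rank[0]
for j in range(n):
    if j != top:
        poisoned[top, j] -= 3                           # 3 flipped outcomes per pair
        poisoned[j, top] += 3
poisoned = np.clip(poisoned, 0, None)
poisoned_rank = list(np.argsort(-bradley_terry(poisoned)))
print(clean_rank, poisoned_rank,
      kendall_tau_distance(clean_rank, poisoned_rank))
```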
3) Technical synthesis
- “Applicability” is a recurring control variable: TAG’s route/accept/retire decisions for memory mirror broader agent/tool pipelines where when to invoke a component matters as much as the component itself (also echoed by hierarchical agent decomposition in ReCoQA).
- Evaluation is moving from single scalar scores to staged pipelines: Chat2Workflow’s Pass vs Resolve, meeting-summary claim extraction + coverage/completeness, and surveillance’s stage ladder all separate syntactic validity from operational success.
- LLM-as-judge appears in multiple roles: reward shaping (Mi:dm K 2.5 Pro RL; murder-mystery ScoreAgent), benchmark construction/validation (TeleEmbedBench validator), and evaluation (meeting summaries; CulturALL correctness judging).
- Long-context and long-memory are diverging: Mi:dm K 2.5 Pro pushes 128K context, while WorldDB/HiGMem argue persistence needs structured memory with reconciliation/hierarchy—context length alone doesn’t solve drift/contradiction.
- Modularity shows up both in models and systems: BAR composes domain experts; ClawNet composes identity-scoped agents; both aim to reduce interference (capability or privacy) via separation + controlled interfaces.
- Security attacks increasingly target the “glue”: cross-modal fusion (camera–LiDAR), tracking pipelines (surveillance), and ranking aggregation (Bradley–Terry MLE) are attacked at the system/aggregation layer, not just the base predictor.
- Mechanistic representation interventions are gaining traction: SAGE’s intermediate-layer sparse projection is a concrete example of “fix the representation bottleneck” rather than only prompting or full fine-tuning.
- Cost/throughput constraints are being formalized: EDGE-EVAL introduces lifecycle metrics (break-even requests, cold-start tax), while TeleEmbedBench and vulnerability-detection architectures explicitly measure latency/cost trade-offs.
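The lifecycle-metric idea reduces to back-of-envelope arithmetic. The sketch below uses a deliberately simple model (a fixed self-hosting cost amortized against a per-request API price, plus an average cold-start penalty); the formulas and dollar figures are illustrative assumptions, not EDGE-EVAL's actual metric definitions.

```python
import math

def break_even_requests(fixed_cost, per_request_self, per_request_api):
    """Requests needed before self-hosting (fixed_cost + per_request_self * n)
    becomes cheaper than a pay-per-request API (per_request_api * n).
    Returns math.inf if the self-hosted marginal cost is not actually lower."""
    margin = per_request_api - per_request_self
    if margin <= 0:
        return math.inf
    return math.ceil(fixed_cost / margin)

def cold_start_tax(cold_latency_s, warm_latency_s, cold_fraction):
    """Average extra latency per request attributable to cold starts."""
    return cold_fraction * (cold_latency_s - warm_latency_s)

# Hypothetical numbers: $1,800/month of GPU + ops vs a $0.004/request API,
# with a self-hosted marginal cost around $0.0007/request.
print(break_even_requests(1800, 0.0007, 0.004))   # ~545,455 requests/month
print(cold_start_tax(cold_latency_s=9.0, warm_latency_s=0.8, cold_fraction=0.02))
```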
4) Top 5 papers (with “why now”)
1) WorldDB: A Vector Graph-of-Worlds Memory Engine with Ontology-Aware Write-Time Reconciliation
- Introduces write-time programmable edges (supersedes/contradicts/same_as handlers) and content-addressed immutability for auditable memory; a toy handler sketch follows this item.
- Shows very strong LongMemEval-s results (overall 96.40%, task-avg 97.11%) and ablations attributing gains to the engine layer.
- “Why now”: long-running agents are hitting context rot and contradiction/identity drift; this is a concrete substrate-level proposal with ablations and engineering benchmarks.
- Skepticism / limitation: higher ingest-time overhead; composed embeddings are parameter-free and the paper notes learned aggregators are future work; evaluation scope centered on LongMemEval-s.
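A toy sketch of the write-time reconciliation pattern, under a hypothetical API (this is not WorldDB's data model): each write is reconciled against existing facts at ingest time, superseded entries are marked rather than mutated, and records carry content-derived IDs for auditability.

```python
import hashlib
import time
from dataclasses import dataclass, field

@dataclass
class Fact:
    subject: str
    predicate: str
    value: str
    created_at: float = field(default_factory=time.time)
    superseded_by: str | None = None      # immutability: never overwritten in place
    contradicts: list[str] = field(default_factory=list)

    @property
    def fact_id(self) -> str:
        raw = f"{self.subject}|{self.predicate}|{self.value}|{self.created_at}"
        return hashlib.sha256(raw.encode()).hexdigest()[:12]   # content-addressed

class MemoryStore:
    """Write-time reconciliation: supersedes / contradicts / same_as are
    resolved when a fact is ingested, not at query time. The handlers here
    are simplistic placeholders for ontology-aware logic."""
    def __init__(self):
        self.facts: dict[str, Fact] = {}

    def write(self, fact: Fact) -> str:
        live = [f for f in self.facts.values() if f.superseded_by is None
                and (f.subject, f.predicate) == (fact.subject, fact.predicate)]
        for old in live:
            if old.value == fact.value:
                return old.fact_id                    # same_as: deduplicate
        for old in live:
            old.superseded_by = fact.fact_id          # supersedes: newer wins
            fact.contradicts.append(old.fact_id)      # keep the audit trail
        self.facts[fact.fact_id] = fact
        return fact.fact_id

    def current(self, subject: str, predicate: str) -> Fact | None:
        live = [f for f in self.facts.values()
                if (f.subject, f.predicate) == (subject, predicate)
                and f.superseded_by is None]
        return max(live, key=lambda f: f.created_at, default=None)

store = MemoryStore()
store.write(Fact("alice", "employer", "Acme"))
store.write(Fact("alice", "employer", "Globex"))   # supersedes the Acme fact
print(store.current("alice", "employer").value)    # Globex
```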
2) Train Separately, Merge Together: Modular Post-Training with Mixture-of-Experts
- BAR converts a post-trained dense model into an MoE with an anchor expert (frozen) plus domain experts trained independently (mid-training→SFT→RLVR); a minimal composition sketch follows this item.
- At 7B scale, BAR’s overall score (49.1) beats several retraining baselines and supports incremental add/upgrade of experts.
- “Why now”: frequent model updates are operationally necessary; modularity offers a path to reduce catastrophic forgetting and retraining cost.
- Skepticism / limitation: inference cost and parameter growth scale with number of experts; performance degrades with sparse expert activation; upgrading the anchor requires retraining experts.
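A minimal sketch of the composition pattern (frozen anchor plus frozen, independently trained domain experts, with only a lightweight router learned at merge time). Tiny MLPs stand in for the experts and routing is dense for brevity; this illustrates the idea, not BAR's architecture or training recipe, which activates experts sparsely.

```python
import torch
import torch.nn as nn

class TinyExpert(nn.Module):
    """Stand-in for a post-trained FFN expert (anchor or domain-specific)."""
    def __init__(self, d_model=64, d_hidden=256):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                                nn.Linear(d_hidden, d_model))

    def forward(self, x):
        return self.ff(x)

class MergedMoELayer(nn.Module):
    """Compose a frozen anchor expert with frozen domain experts.
    Only the router is trainable, mirroring 'lightweight router training'."""
    def __init__(self, anchor: TinyExpert, domain_experts: list, d_model=64):
        super().__init__()
        self.anchor = anchor
        self.domain_experts = nn.ModuleList(domain_experts)
        for p in [*self.anchor.parameters(), *self.domain_experts.parameters()]:
            p.requires_grad_(False)                      # experts stay frozen
        self.router = nn.Linear(d_model, len(domain_experts))  # only trainable part

    def forward(self, x):                                # x: (batch, d_model)
        weights = torch.softmax(self.router(x), dim=-1)            # (batch, n_dom)
        domain_out = torch.stack([e(x) for e in self.domain_experts], dim=1)
        mixed = (weights.unsqueeze(-1) * domain_out).sum(dim=1)    # weighted mix
        return self.anchor(x) + mixed                    # anchor always active

layer = MergedMoELayer(TinyExpert(), [TinyExpert(), TinyExpert()])
trainable = [n for n, p in layer.named_parameters() if p.requires_grad]
print(trainable)                          # only router.weight and router.bias
print(layer(torch.randn(4, 64)).shape)    # torch.Size([4, 64])
```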
3) A Control Architecture for Training-Free Memory Use
- TAG provides a training-free control stack: uncertainty-gated retrieval, selective accept/rollback, and evidence-based retirement.
- Under compute-matched controls, shows sizable arithmetic gains (e.g., SVAMP +7.0, ASDiv +7.67) where “retry” alone is flat.
- “Why now”: many deployments can’t retrain models but still want memory; this isolates the value of control policy vs “more retrieval.”
- Skepticism / limitation: strongest wins concentrate on arithmetic; effectiveness depends on confidence separability and memory-bank quality.
4) SAGE: Signal-Amplified Guided Embeddings for LLM-based Vulnerability Detection
- Diagnoses “Signal Submersion” and uses pan-layer extraction + JumpReLU sparse autoencoders with task-conditional alignment to amplify vulnerability cues; a minimal SAE sketch follows this item.
- Reports strong MCC results (e.g., BigVul MCC 0.7874 for one setting) and mechanistic evidence (SNR amplification up to 12.7×; concentrated sparse neurons).
- “Why now”: vulnerability detection is high-impact and suffers from imbalance + distribution shift; this offers a frozen-backbone, mechanistically motivated fix.
- Skepticism / limitation: cannot create knowledge absent from pretraining; low-resource language subsets are small; SAE training scales with number of probed layers.
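A minimal JumpReLU sparse-autoencoder sketch for reading sparse features out of a frozen backbone's intermediate activations, i.e. the general mechanism, not SAGE's code, objective, or task-conditional alignment step. The hard threshold is non-differentiable, so actual training relies on a straight-through-style estimator, omitted here.

```python
import torch
import torch.nn as nn

class JumpReLUSAE(nn.Module):
    """Sparse autoencoder with a JumpReLU activation: features below a learned
    per-unit threshold are zeroed, encouraging sparse, interpretable codes.
    Intended to be applied to hidden activations of a frozen LLM layer."""
    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_sae)
        self.decoder = nn.Linear(d_sae, d_model)
        self.log_threshold = nn.Parameter(torch.zeros(d_sae))  # per-feature theta

    def encode(self, h: torch.Tensor) -> torch.Tensor:
        pre = torch.relu(self.encoder(h))
        theta = self.log_threshold.exp()
        # JumpReLU: keep a feature's value only if it clears its threshold.
        return pre * (pre > theta).to(pre.dtype)

    def forward(self, h: torch.Tensor):
        z = self.encode(h)               # sparse feature activations
        recon = self.decoder(z)          # reconstruction of the hidden state
        return recon, z

sae = JumpReLUSAE(d_model=768, d_sae=4096)
hidden = torch.randn(8, 768)             # stand-in for one layer's activations
recon, feats = sae(hidden)
sparsity = (feats != 0).float().mean().item()
print(recon.shape, feats.shape, f"active fraction ~{sparsity:.2f}")
```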
5) ReCoQA: A Benchmark for Tool-Augmented and Multi-Step Reasoning in Real Estate Question and Answering
- Provides 29,270 QA instances with verifiable intermediate traces (SLU labels, SQL, cached API calls), enabling deterministic evaluation; a toy replay sketch follows this item.
- Hierarchical HIRE-Agent improves average accuracy and F1 by about +0.20 over a single-agent baseline; GT-signal probing still leaves a gap (avg accuracy 0.8864).
- “Why now”: tool-augmented agents need benchmarks where intermediate steps are executable and auditable, not just final answers.
- Skepticism / limitation: Chinese-language and tied to Chinese map services; single-turn only; template-based generation artifacts remain a concern.
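To make "verifiable intermediate traces" concrete, a generic replay sketch: gold and predicted SQL are executed against a fixed SQLite fixture and compared as result sets, and tool calls are answered from a cache rather than a live API. The schema, table names, and cache format are hypothetical, not ReCoQA's.

```python
import sqlite3

def result_set(conn, sql):
    """Execute SQL and return an order-insensitive result set."""
    return frozenset(map(tuple, conn.execute(sql).fetchall()))

def sql_step_correct(conn, predicted_sql, gold_sql):
    """Score an intermediate SQL step by comparing executed results, so that
    syntactically different but equivalent queries still count as correct."""
    try:
        return result_set(conn, predicted_sql) == result_set(conn, gold_sql)
    except sqlite3.Error:
        return False

def cached_api(cache, endpoint, params):
    """Replay a tool call from a cache keyed on (endpoint, sorted params);
    raising on a miss keeps evaluation deterministic and offline."""
    key = (endpoint, tuple(sorted(params.items())))
    if key not in cache:
        raise KeyError(f"uncached call: {key}")
    return cache[key]

# Tiny fixture standing in for a benchmark database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE listings (id INTEGER, district TEXT, price INTEGER)")
conn.executemany("INSERT INTO listings VALUES (?, ?, ?)",
                 [(1, "north", 500), (2, "north", 650), (3, "south", 400)])

gold = "SELECT id FROM listings WHERE district = 'north' AND price < 700"
pred = "SELECT id FROM listings WHERE price < 700 AND district = 'north'"
print(sql_step_correct(conn, pred, gold))          # True (equivalent queries)

api_cache = {("commute_time", (("from", "north"), ("to", "center"))): 23}
print(cached_api(api_cache, "commute_time", {"from": "north", "to": "center"}))  # 23
```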
5) Practical next steps
- For agent memory systems, separate memory content from memory-use policy: implement TAG-like routing + accept/rollback and measure compute-matched gains vs “always retrieve.”
- If building long-term memory, add write-time semantics (supersession/contradiction) and auditability; evaluate on long-memory tasks with ablations that isolate “engine” vs “answerer.”
- For tool-using agents, adopt trace-first evaluation: require cached/deterministic tool outputs (like ReCoQA) and score both intermediate correctness and final synthesis.
- In workflow-generation products, track Pass vs Resolve (format/import vs execution correctness) and build error-driven repair loops; measure the pass–resolve gap as a primary KPI.
- For security robustness in perception, expand tests to temporal + multimodal settings (tracking, visible–IR, camera–LiDAR fusion) and report identity-level or action-level outcomes, not just detector failures.
- For vulnerability detection, try intermediate-layer feature extraction + sparse amplification (SAGE-style) as a low-cost alternative to full fine-tuning; evaluate under deduped and distribution-shifted splits.
- For model maintenance, prototype modular expert upgrades (BAR-style) and quantify: (i) domain gain, (ii) general-capability retention, (iii) inference cost vs expert sparsity.
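As a starting point for the last item, a tiny scorecard that reports the three quantities side by side. The active-parameter cost proxy and the numbers in the example are illustrative assumptions, not measurements from BAR or any other paper.

```python
from dataclasses import dataclass

@dataclass
class UpgradeReport:
    domain_gain: float          # (i) improvement on the upgraded domain's evals
    general_retention: float    # (ii) change on a held-out general suite
    cost_ratio: float           # (iii) active params per token vs the dense baseline

def score_upgrade(before, after, anchor_params, expert_params,
                  n_experts_active, dense_params):
    """Compare eval scores before/after swapping in an upgraded domain expert.
    `before`/`after` are dicts like {"domain": x, "general": y} in [0, 100]."""
    active = anchor_params + n_experts_active * expert_params
    return UpgradeReport(
        domain_gain=after["domain"] - before["domain"],
        general_retention=after["general"] - before["general"],
        cost_ratio=active / dense_params,
    )

# Illustrative numbers only.
report = score_upgrade(
    before={"domain": 61.2, "general": 48.9},
    after={"domain": 66.8, "general": 48.5},
    anchor_params=7e9, expert_params=1.5e9,
    n_experts_active=2, dense_params=7e9,
)
print(report)   # domain_gain≈5.6, general_retention≈-0.4, cost_ratio≈1.43
```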
Generated from per-paper analyses; no external browsing.
