Daily AI Paper Report (2026-04-21)


Run stats

  • Candidates: 3610
  • Selected: 30
  • Deepread completed: 30
  • Window (UTC): 2026-04-17T00:00:00Z → 2026-04-18T00:00:00Z (weekend_backlog_sun, expanded=0)
Selected papers

  • 2604.11753: Agentic Aggregation for Parallel Scaling of Long-Horizon Agentic Tasks
    Categories: cs.CL | Score: 92
    Why: Parallel test-time scaling for long-horizon agents via trajectory-aware aggregation agent.
    Tags: agents, test-time-scaling, trajectory-aggregation, tool-use, long-horizon
  • 2604.11609: Intersectional Sycophancy: How Perceived User Demographics Shape False Validation in Large Language Models
    Categories: cs.AI, cs.HC | Score: 90
    Why: Measures demographic-dependent sycophancy; intersectional personas + adversarial multi-turn eval.
    Tags: sycophancy, evaluation, fairness, robustness, multi-turn, personas
  • 2604.10923: Mem$^2$Evolve: Towards Self-Evolving Agents via Co-Evolutionary Capability Expansion and Experience Distillation
    Categories: cs.CL, cs.AI | Score: 90
    Why: Co-evolution of tools + experience for self-evolving agents; likely impacts agent capability/safety dynamics.
    Tags: agents, self-improvement, tool-creation, memory, experience-distillation, multi-agent
  • 2604.11759: Retrieval Is Not Enough: Why Organizational AI Needs Epistemic Infrastructure
    Categories: cs.AI | Score: 88
    Why: Argues org AI needs epistemic structure beyond RAG; proposes computable commitments/contradictions.
    Tags: RAG, knowledge-representation, epistemics, agents, organizational-ai, contradictions
  • 2604.12948: Drawing on Memory: Dual-Trace Encoding Improves Cross-Session Recall in LLM Agents
    Categories: cs.AI | Score: 88
    Why: Dual-trace persistent memory boosts cross-session recall (+20%); relevant to long-horizon agent reliability.
    Tags: agents, memory, long-horizon, evaluation, reliability, LongMemEval
  • 2604.04852: Strengthening Human-Centric Chain-of-Thought Reasoning Integrity in LLMs via a Structured Prompt Framework
    Categories: cs.CR, cs.AI | Score: 86
    Why: Structured prompting to improve CoT integrity for security analysis in local LLM deployments.
    Tags: LLM, chain-of-thought, prompting, security, reliability, evaluation
  • 2604.04664: ROSClaw: A Hierarchical Semantic-Physical Framework for Heterogeneous Multi-Agent Collaboration
    Categories: cs.RO, cs.AI, cs.MA | Score: 86
    Why: Hierarchical semantic-to-physical multi-robot agent framework for long-horizon tasks; relevant to agent reliability.
    Tags: embodied-agents, multi-agent, robotics, LLM-agents, hierarchical-planning, long-horizon
  • 2604.11506: RedShell: A Generative AI-Based Approach to Ethical Hacking
    Categories: cs.CR | Score: 86
    Why: LLM-driven offensive PowerShell generation + ground-truth dataset; high relevance to agent misuse/security evals.
    Tags: cybersecurity, offensive-security, code-generation, misuse, dataset, evaluation
  • 2604.05770: SoK: Understanding Anti-Forensics Concepts and Research Practices Across Forensic Subdomains
    Categories: cs.CR | Score: 86
    Why: Systematizes anti-forensics; useful for security threat modeling and robustness research.
    Tags: security, SoK, anti-forensics, digital-forensics, threat-modeling
  • 2604.06762: ARuleCon: Agentic Security Rule Conversion
    Categories: cs.CR | Score: 86
    Why: Agentic framework for SIEM rule conversion; practical security automation with real deployment relevance.
    Tags: agents, cybersecurity, SIEM, tool-use, automation, robustness
  • 2604.00422: Shapley-Guided Neural Repair Approach via Derivative-Free Optimization
    Categories: cs.SE, cs.LG | Score: 86
    Why: Interpretable Shapley fault localization + derivative-free neural repair for backdoors/attacks/unfairness.
    Tags: robustness, security, backdoors, adversarial, fairness, neural-repair, interpretability, shapley, derivative-free
  • 2604.12890: Towards Long-horizon Agentic Multimodal Search
    Categories: cs.CV, cs.AI | Score: 85
    Why: File-based multimodal memory/UIDs to curb context explosion in long-horizon search agents.
    Tags: agents, multimodal, search, long-context, external-memory, systems
  • 2604.05547: COSMO-Agent: Tool-Augmented Agent for Closed-loop Optimization, Simulation, and Modeling Orchestration
    Categories: cs.AI, cs.GR | Score: 84
    Why: Tool-augmented LLM agent trained via RL for closed-loop CAD/CAE orchestration; relevant to agent eval/safety.
    Tags: agents, tool-use, reinforcement-learning, orchestration, industrial, robustness
  • 2604.12655: Robust Semi-Supervised Temporal Intrusion Detection for Adversarial Cloud Networks
    Categories: cs.LG, cs.CR | Score: 84
    Why: Robust semi-supervised intrusion detection handling adversarial contamination + temporal drift.
    Tags: security, intrusion-detection, semi-supervised, adversarial-robustness, temporal-drift, cloud
  • 2603.28594: Detection of Adversarial Attacks in Robotic Perception
    Categories: cs.CV, cs.AI, cs.CR, cs.RO | Score: 84
    Why: Adversarial-attack detection for robotic semantic segmentation; safety-critical perception robustness.
    Tags: adversarial-robustness, robotics, perception, semantic-segmentation, safety
  • 2604.06644: Variational Feature Compression for Model-Specific Representations
    Categories: cs.CV, cs.LG | Score: 84
    Why: Representation release that blocks cross-model transfer while preserving target accuracy; privacy/control angle.
    Tags: privacy, representation-learning, model-stealing, transfer-suppression, variational-bottleneck
  • 2604.04895: Agentic Federated Learning: The Future of Distributed Training Orchestration
    Categories: cs.MA, cs.AI | Score: 84
    Why: LM-agent orchestration for FL: bias, privacy budgets, and adaptive complexity in real deployments.
    Tags: agents, federated-learning, privacy, governance, distributed-systems
  • 2604.11752: A Synthetic Conversational Smishing Dataset for Social Engineering Detection
    Categories: cs.CR | Score: 84
    Why: New labeled multi-round smishing conversations dataset for social engineering detection research.
    Tags: security, social-engineering, phishing, dataset, conversation, cybersecurity
  • 2604.12843: Growing Pains: Extensible and Efficient LLM Benchmarking Via Fixed Parameter Calibration
    Categories: cs.CL | Score: 84
    Why: IRT anchor calibration enables comparable LLM eval as benchmarks evolve; strong for measurement hygiene.
    Tags: evaluation, benchmarking, IRT, calibration, comparability, metrics
  • 2604.12911: Round-Trip Translation Reveals What Frontier Multilingual Benchmarks Miss
    Categories: cs.CL, cs.AI | Score: 83
    Why: Round-trip translation exposes gaps in multilingual benchmarks; better proxy for real multilingual ability.
    Tags: evaluation, multilingual, translation, benchmarks, robustness, measurement
  • 2604.01081: ProOOD: Prototype-Guided Out-of-Distribution 3D Occupancy Prediction
    Categories: cs.CV, cs.LG, cs.RO, eess.IV | Score: 83
    Why: Plug-and-play voxel OOD scoring reduces overconfidence and rare-class OOD absorption in autonomy stacks.
    Tags: ood-detection, uncertainty, autonomous-driving, 3d-occupancy, reliability, tail-risk
  • 2603.28652: Mitigating Backdoor Attacks in Federated Learning Using PPA and MiniMax Game Theory
    Categories: cs.LG, cs.CR, cs.DC, cs.GT | Score: 82
    Why: Federated learning backdoor mitigation; game-theoretic framing suggests broader robustness use.
    Tags: federated-learning, backdoors, robustness, security, game-theory
  • 2603.11691: STAIRS-Former: Spatio-Temporal Attention with Interleaved Recursive Structure Transformer for Offline Multi-task Multi-agent Reinforcement Learning
    Categories: cs.AI | Score: 82
    Why: Transformer for offline multi-task MARL with better inter-agent attention and long-horizon history modeling.
    Tags: offline-RL, multi-agent, transformers, coordination, generalization
  • 2604.04858: FairLogue: A Toolkit for Intersectional Fairness Analysis in Clinical Machine Learning Models
    Categories: cs.LG, q-bio.QM | Score: 82
    Why: Intersectional fairness toolkit for clinical ML; practical auditing beyond single-axis metrics.
    Tags: fairness, evaluation, toolkit, healthcare, intersectionality, accountability
  • 2604.11548: SemaClaw: A Step Towards General-Purpose Personal AI Agents through Harness Engineering
    Categories: cs.AI | Score: 82
    Why: Positions 'harness engineering' for controllable/auditable personal agents; systems perspective.
    Tags: agents, agent-infrastructure, auditing, reliability, governance, harness-engineering
  • 2604.12988: ROSE: An Intent-Centered Evaluation Metric for NL2SQL
    Categories: cs.DB, cs.AI | Score: 81
    Why: Intent-centered NL2SQL metric with prover-refuter cascade; reduces brittleness to bad ground truth.
    Tags: evaluation, metrics, NL2SQL, semantic-eval, adversarial, reliability
  • 2604.04456: Empirical Characterization of Rationale Stability Under Controlled Perturbations for Explainable Pattern Recognition
    Categories: cs.AI, cs.CL, cs.LG | Score: 80
    Why: Metric for explanation/rationale stability under perturbations; useful for auditing model consistency.
    Tags: interpretability, explainability, robustness, evaluation, SHAP, BERT
  • 2604.04349: Adversarial Robustness Analysis of Cloud-Assisted Autonomous Driving Systems
    Categories: cs.RO, cs.LG | Score: 80
    Why: Hardware-in-the-loop testbed for adversarial + network impairment risks in cloud AV stacks.
    Tags: adversarial-robustness, autonomous-driving, cloud-offloading, safety, testbed, yolov8
  • 2603.09053: Sim2Act: Robust Simulation-to-Decision Learning via Adversarial Calibration and Group-Relative Perturbation
    Categories: cs.LG, cs.AI | Score: 80
    Why: Robust sim-to-decision learning with adversarial calibration; targets decision-critical error regions.
    Tags: robustness, simulation, decision-making, adversarial-training, RL
  • 2603.29608: Learning Diagnostic Reasoning for Decision Support in Toxicology
    Categories: cs.CL | Score: 80
    Why: RL adaptation for clinical diagnostic reasoning under uncertainty; strong reliability relevance.
    Tags: LLMs, clinical-decision-support, reinforcement-learning, reasoning, robustness

AI Paper Insight Brief

2026-04-21

0) Executive takeaways (read this first)

  • Robustness work is shifting from “make the model accurate on average” to “make the system reliable in decision-critical regions”: Sim2Act explicitly targets action-ranking flips caused by small simulator errors, improving tail risk (CVaR) under perturbations.
  • For long-horizon agents, the new bottleneck is how to scale test-time compute and memory without context blowups: AggAgent aggregates parallel trajectories via tool-based access (not concatenation), while multimodal search offloads images to files via UIDs + fetch_image.
  • Safety evaluation is becoming more identity- and domain-conditional: intersectional persona testing shows sycophancy varies sharply by perceived demographics and domain (philosophy worst), and multilingual “reasoning” benchmarks can miss real multilingual generation failures.
  • Several papers converge on “verification loops” as the practical safety lever: SIEM rule conversion uses IR + RAG + executable checks; CAD–CAE optimization uses tool-log-grounded RL rewards; federated backdoor defense uses anomaly scoring + reputation + minimax weighting.
  • Interpretability is being operationalized as stability/repair tooling: ESS measures rationale stability under perturbations; SHARPEN uses Shapley-guided localization + derivative-free repair across backdoor/adversarial/fairness defects.
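The tail-risk framing that recurs in these takeaways (e.g. Sim2Act reporting CVaR under perturbations) is easy to operationalize. Below is a minimal sketch of lower-tail CVaR over episode rewards; the 5% level and the function name are illustrative choices, not any paper's implementation:

```python
import numpy as np

def cvar(rewards, alpha=0.05):
    """Conditional Value-at-Risk: mean of the worst alpha-fraction of rewards.

    Lower-tail CVaR for rewards (higher is better): average the smallest
    ceil(alpha * n) values.
    """
    r = np.sort(np.asarray(rewards, dtype=float))
    k = max(1, int(np.ceil(alpha * len(r))))
    return r[:k].mean()

rewards = [10.0, 9.5, 9.0, 8.0, 2.0] + [9.0] * 15  # 20 episodes, one bad tail
print(round(cvar(rewards, alpha=0.05), 2))  # worst 1 of 20 episodes -> 2.0
```

Comparing CVaR before and after a perturbation sweep surfaces degradation the mean reward would hide.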

2) Key themes (clusters)

Theme: Decision-critical robustness (simulators, policies, and tail risk)

  • Why it matters: In digital twins and offline RL, small model errors in the wrong regions can flip action rankings and create brittle or unsafe deployments; robustness must be targeted where decisions are sensitive.
  • Representative papers: Sim2Act (2603.09053); STAIRS-Former (2603.11691).
  • Common approach:
    • Focus robustness objectives on high-impact regions (e.g., adversarial reweighting of simulator loss; perturbation neighborhoods for policy invariance).
    • Architectures that explicitly model spatial relations + long-horizon temporal state (recursive spatial transformer + dual-timescale histories).
    • Offline settings emphasize generalization across shifts (agent count/entity count; dataset quality) without online exploration.
  • Open questions / failure modes:
    • How well do these methods transfer beyond the evaluated domains (Sim2Act: supply chain only; STAIRS: benchmark suites)?
    • Robustness vs conservatism: avoiding “policy collapse” while still controlling worst-case outcomes.
    • Compute/memory overhead (STAIRS higher than simpler baselines; simulator calibration complexity and reproducibility details in appendices).
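The first common-approach point, focusing robustness objectives on high-impact regions, can be sketched generically. This is my reading of adversarial reweighting, not Sim2Act's actual objective; all names and the softmax-style weighting are assumptions:

```python
import numpy as np

def ranking_fragility_weights(q_values_per_state, tau=1.0):
    """Upweight states where the top-2 action-value gap is small, i.e.
    where a small simulator error could flip the action ranking.
    Generic sketch of adversarial reweighting, not Sim2Act's formulation."""
    gaps = np.array([np.sort(q)[-1] - np.sort(q)[-2] for q in q_values_per_state])
    w = np.exp(-gaps / tau)   # fragile (small-gap) states get weight near 1
    return w / w.sum()        # normalize to a distribution over states

def weighted_calibration_loss(sim_errors, weights):
    """Simulator calibration loss concentrated on decision-sensitive states."""
    return float(np.sum(weights * np.asarray(sim_errors) ** 2))

qs = [np.array([1.00, 0.99, 0.50]),  # fragile: top-2 gap 0.01
      np.array([1.00, 0.20, 0.10])]  # robust:  top-2 gap 0.80
w = ranking_fragility_weights(qs)
```

The point of the weighting is that equal simulator errors are not equally dangerous: the fragile state dominates the loss.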

Theme: Cross-layer autonomy safety (perception attacks + systems constraints)

  • Why it matters: Real safety failures often come from compositions—adversarial perception plus network latency/loss plus control loops—rather than isolated model metrics.
  • Representative papers: ProOOD (2604.01081); Detection of Adversarial Attacks in Robotic Perception (2603.28594); Adversarial Robustness Analysis of Cloud-Assisted Autonomous Driving Systems (2604.04349).
  • Common approach:
    • Evaluate robustness in more realistic loops (hardware-in-the-loop IoV testbed; closed-loop stop-sign compliance under delay/loss).
    • Use prototype structure to improve tail calibration and produce training-free OOD scores (EchoOOD fusing local coherence + local/global prototype matching).
    • Detection framing for segmentation: feature-based metrics and thresholding (confidence / entropy variants / kernel density).
  • Open questions / failure modes:
    • Detection papers without ROC/FPR/TPR and broader attacks are hard to operationalize (segmentation detector lacks detailed detection curves and dataset clarity).
    • ProOOD depends on external depth estimation; small/distant OOD objects and occlusion remain failure cases.
    • Cloud AV study evaluates attacks but not mitigations; generalization beyond Duckiebot-scale setups is unclear.
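The detection framing mentioned above (confidence/entropy thresholding over segmentation outputs) reduces to a few lines. A sketch of one variant only; the threshold value, shapes, and function names are illustrative, not taken from the paper:

```python
import numpy as np

def mean_pixel_entropy(probs):
    """Mean per-pixel predictive entropy of a softmax segmentation map
    with shape (H, W, C); adversarial perturbations tend to raise it."""
    eps = 1e-12
    return float((-np.sum(probs * np.log(probs + eps), axis=-1)).mean())

def flag_frame(probs, threshold=0.5):
    """Flag a frame whose mean entropy exceeds a threshold calibrated on
    clean validation data (the 0.5 here is purely illustrative)."""
    return mean_pixel_entropy(probs) > threshold

clean = np.zeros((4, 4, 3)); clean[..., 0] = 0.98; clean[..., 1:] = 0.01
attacked = np.full((4, 4, 3), 1.0 / 3.0)   # near-uniform, high entropy
```

This also illustrates the operationalization gap flagged above: without ROC/FPR/TPR curves, there is no principled way to pick `threshold`.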

Theme: LLM/agent evaluation that matches real-world failure modes

  • Why it matters: Aggregate scores hide identity-, domain-, and language-conditional failures; evaluation has to probe the conditions under which deployed models actually break.
  • Representative papers: Intersectional Sycophancy (2604.11609); Round-Trip Translation (2604.12911); ROSE (2604.12988); Growing Pains (2604.12843).

Theme: Verification loops and executable grounding for agentic systems

  • Why it matters: Purely textual self-critique is unreliable; executable checks (unit tests, rule execution, toolchain re-evaluation) give agents ground truth to verify against.
  • Representative papers: ARuleCon (2604.06762); COSMO-Agent (2604.05547); Mem$^2$Evolve (2604.10923).

Theme: Practical security & privacy defenses (FL, NIDS, representation control)

  • Why it matters: Deployed systems need defenses that work under partial observability (no raw data), label scarcity, and adaptive attackers—often with tuning and compute constraints.
  • Representative papers: Mitigating Backdoor Attacks in Federated Learning (2603.28652); Robust Semi-Supervised Temporal Intrusion Detection (2604.12655); Variational Feature Compression (2604.06644).
  • Common approach:
    • Detect/weight suspicious updates via anomaly projections + clustering + reputation, then optimize aggregation under a minimax model.
    • Conservative SSL: only learn from unlabeled data when confidence + teacher agreement + temporal stability gates pass.
    • Reduce repurposing by training a task-driven variational bottleneck and masking latent dimensions using KL + gradient saliency.
  • Open questions / failure modes:
    • Adaptive attackers are often out of scope (feature compression explicitly excludes retraining attackers; NIDS excludes white-box/certified robustness).
    • Parameter/tuning sensitivity (DBSCAN ε, α/β reputation weights; SSL thresholds).
    • Scalability to large models/client populations (PPA complexity; agentic orchestration token costs in related FL position work).
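The conservative SSL gate described above (learn from unlabeled data only when confidence, teacher agreement, and temporal stability all pass) can be sketched as an AND-gate. The thresholds and signature below are illustrative assumptions, not the paper's settings:

```python
def admit_window(p_student, p_teacher, recent_classes,
                 conf_thresh=0.9, stability_window=3):
    """AND-gate an unlabeled window before pseudo-label training:
    (1) student confidence above a threshold, (2) student/EMA-teacher
    class agreement, (3) stable predicted class over recent windows.
    Thresholds here are illustrative, not the paper's settings."""
    s_cls = max(range(len(p_student)), key=p_student.__getitem__)
    t_cls = max(range(len(p_teacher)), key=p_teacher.__getitem__)
    confident = p_student[s_cls] >= conf_thresh
    agrees = s_cls == t_cls
    stable = (len(recent_classes) >= stability_window
              and all(c == s_cls for c in recent_classes[-stability_window:]))
    return confident and agrees and stable
```

Under heavy contamination the gate admits fewer windows, which is exactly the robustness-vs-utilization trade-off noted for the NIDS paper below.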

3) Technical synthesis

  • Multiple papers converge on minimax / adversarial emphasis but apply it differently: Sim2Act uses minimax reweighting to surface decision-critical simulator errors; FedBBA uses minimax weighting against poisoning ratios; cloud AV uses explicit white-box FGSM/PGD to quantify worst-case degradation.
  • “Robustness” increasingly means tail behavior under perturbations (Sim2Act CVaR@5%, ProOOD voxel-level OOD AuPRCr, NIDS poisoning contamination curves, cloud AV stop-compliance under delay/loss).
  • A recurring pattern is selective learning / selective trust: RSST-NIDS gates pseudo-label usage; ROSE gates expensive judging via routing (only when executions differ); AggAgent selectively reads trajectory segments via search tools; dual-trace memory gates encoding via evidence scoring.
  • Externalization to avoid context limits appears in two forms: (1) store artifacts outside the prompt (multimodal UIDs + fetch_image; AggAgent’s in-memory trajectory tools), and (2) store structured persistent memory (dual-trace fact+scene; epistemic KOs with decay/contradiction).
  • Evaluation papers highlight that metric choice can invert conclusions: EX vs ROSE diverges as models get stronger; multilingual translated reasoning benchmarks correlate with English reasoning rather than multilingual fidelity; persona-free safety tests can miss intersectional sycophancy.
  • Interpretability is being used as an actionable control surface: SHARPEN localizes defects with Deep SHAP then repairs via CMA-ES; ESS quantifies explanation stability under paraphrase; structured prompting improves evidence-grounding and faithfulness in security CoT.
  • Several systems emphasize executable verification as the practical alternative to purely textual self-critique (ARuleCon Python checks; COSMO toolchain re-evaluation; Mem2Evolve unit tests/self-correction).
  • Across domains, resource trade-offs are explicit: STAIRS reports params/GPU memory; AggAgent reports overhead (~5.7% at K=8); fixed-parameter calibration targets constant incremental benchmarking cost; ARuleCon reports higher token/time costs.
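The first externalization form above, storing artifacts outside the prompt behind UIDs, is a small pattern worth seeing concretely. A minimal sketch only; the class and method names (including `fetch_image`) are my stand-ins for the papers' actual tools:

```python
import os
import tempfile
import uuid

class ArtifactStore:
    """Externalize large artifacts (images, logs) behind opaque UIDs so
    only short handles enter the agent's context. A sketch of the
    pattern, not any paper's actual API."""

    def __init__(self, root=None):
        self.root = root or tempfile.mkdtemp(prefix="artifacts_")

    def put(self, data: bytes) -> str:
        uid = uuid.uuid4().hex
        with open(os.path.join(self.root, uid), "wb") as f:
            f.write(data)
        return uid                       # the UID is all the prompt sees

    def fetch_image(self, uid: str) -> bytes:
        """Tool the agent calls only when it actually needs the bytes."""
        with open(os.path.join(self.root, uid), "rb") as f:
            return f.read()

store = ArtifactStore()
uid = store.put(b"fake-image-bytes")
```

The design choice is that context cost becomes O(number of handles), not O(total artifact size), which is what curbs the context explosion.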

4) Top 5 papers (with “why now”)

1) Agentic Aggregation for Parallel Scaling of Long-Horizon Agentic Tasks

  • Contributes a tool-based aggregator (AggAgent) that reasons over multiple long trajectories without concatenating them.
  • Shows consistent gains at K=8 across six benchmarks and three model families (e.g., average improvements over Solution Aggregation).
  • Adds cost/latency analysis showing small aggregation overhead (reported 5.7% at K=8).
  • Skepticism: evaluation uses sampled subsets due to cost; relies on LLM-as-judge and pricing assumptions.
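The core idea, reasoning over K trajectories through search tools rather than concatenation, can be sketched with a toy store. The tool names (`get_solution`, `search_steps`, `fetch_segment`) are my illustrative stand-ins, not AggAgent's actual interface:

```python
class TrajectoryStore:
    """Hold K parallel trajectories and expose search tools so an
    aggregator reads selectively instead of concatenating everything.
    Tool names are illustrative stand-ins for the paper's tools."""

    def __init__(self, trajectories):
        self.trajs = trajectories        # list of lists of step strings

    def get_solution(self, k):
        """Only the final answer of trajectory k."""
        return self.trajs[k][-1]

    def search_steps(self, query):
        """(traj_idx, step_idx) pairs for steps mentioning the query."""
        return [(k, i) for k, t in enumerate(self.trajs)
                for i, s in enumerate(t) if query in s]

    def fetch_segment(self, k, start, end):
        """A bounded slice of one trajectory."""
        return self.trajs[k][start:end]

store = TrajectoryStore([["plan", "call tool A", "answer: 42"],
                         ["plan", "call tool B", "answer: 41"]])
```

An aggregator first compares `get_solution` outputs, then drills into disagreements with `search_steps`/`fetch_segment`, keeping context usage bounded.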

2) ROSE: An Intent-Centered Evaluation Metric for NL2SQL

  • Prover–Refuter cascade judges intent fulfillment and uses ground-truth SQL adversarially as counter-evidence.
  • Strong agreement with expert-consensus set (κ reported 80.43%) and provides dataset-auditing labels (GoldX/AmbQ precision reported).
  • Re-evaluates 19 systems and attributes a large share of EX disagreements to gold errors/ambiguity.
  • Skepticism: depends on judge backbone/versioning; ROSE-VEC selection keeps only annotator-agreement cases (selection bias).
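The cascade's routing logic (cheap execution comparison first, expensive judging only when executions differ, per the synthesis above) can be sketched with stubbed judges. This is my reading of the routing idea, not ROSE's exact protocol; `prover` and `refuter` stand in for LLM judges:

```python
def cascade_verdict(pred_rows, gold_rows, prover, refuter):
    """Route to judges only when the two executions differ."""
    if pred_rows == gold_rows:
        return "accept"                      # executions agree: no judge call
    if not prover(pred_rows, gold_rows):     # prover: is the user intent met?
        return "reject"
    if refuter(pred_rows, gold_rows):        # refuter: gold SQL as counter-evidence
        return "reject"
    return "accept"

always_yes = lambda p, g: True
always_no = lambda p, g: False
```

Note the asymmetry: a prediction that differs from gold can still be accepted, which is how the metric tolerates bad ground truth.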

3) Round-Trip Translation Reveals What Frontier Multilingual Benchmarks Miss

  • Introduces LiT (1,600 samples) using multi-hop round-trip translation with MQM-style scoring.
  • Reports near-perfect correlation with LMArena Elo (ρ = 0.94) and highlights low-resource collapse not captured by MT-AIME24/INCLUDE.
  • Provides evidence that popular multilingual benchmarks track English reasoning/knowledge instead.
  • Skepticism: multi-hop sequences can conflate cascading errors; LLM-as-judge automation limits direct human verification.
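A pass-rate statistic over multi-hop chains is straightforward to compute. Scoring each chain by its worst hop is my conservative assumption (one collapsed hop fails the chain), not necessarily the paper's aggregation rule:

```python
def mqm_pass_rate(chain_scores, threshold=80.0):
    """Fraction of round-trip chains whose MQM-style score clears the bar,
    scoring each chain by its worst hop (an assumed, conservative rule)."""
    finals = [min(hops) for hops in chain_scores]
    return sum(s >= threshold for s in finals) / len(finals)

chains = [[92.0, 88.5, 85.0],   # passes: worst hop 85
          [95.0, 60.0, 90.0],   # fails: middle hop collapsed
          [81.0, 83.0, 80.0]]   # passes at the boundary
```

Reporting this per language sequence is what surfaces the low-resource collapse that aggregate correlations hide.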

4) Sim2Act: Robust Simulation-to-Decision Learning via Adversarial Calibration and Group-relative Perturbation

  • Targets a concrete sim-to-decision failure mode: small simulator errors in decision-critical regions flipping action rankings.
  • Combines adversarial calibration (reweighting state-action errors) with group-relative perturbation training to preserve relative preferences without collapsing to pessimism.
  • Reports flatter reward degradation under perturbations and improved tail risk (CVaR) on supply-chain benchmarks.
  • Skepticism: evaluated only on three supply-chain datasets; some reproducibility details deferred to appendices.

5) Robust Semi-Supervised Temporal Intrusion Detection for Adversarial Cloud Networks

  • Conservative SSL for NIDS: confidence-aware pseudo-labeling + EMA teacher + selective temporal invariance gated by stability criteria.
  • Reports strong in-domain AUROC (0.973) and improved cross-dataset AUROC/MCC; maintains performance under unlabeled poisoning by admitting fewer windows.
  • Includes operational overhead estimates (training/inference latencies).
  • Skepticism: binary detection only; white-box/certified robustness out of scope; robustness trades off unlabeled utilization under high contamination.

5) Practical next steps

  • If you deploy digital twins / model-based decision systems: add a decision-critical error audit (action-ranking sensitivity) and test whether adversarial reweighting (Sim2Act-style) improves CVaR under perturbations.
  • For long-horizon agent products: implement an AggAgent-like trajectory store + search tools (solution retrieval, step search, segment fetch) and measure gains vs majority-vote/solution-only aggregation at fixed K and fixed cost.
  • For multimodal agents: prototype UID-based external image storage + fetch_image and quantify how many turns you can sustain before context failure, plus performance vs naive image-in-context baselines.
  • For safety evaluation: add intersectional persona grids (race × age × gender × confidence) and domain variation; track tail-risk (fraction of runs with high sycophancy scores) rather than only means.
  • For multilingual evaluation: complement translated reasoning benchmarks with round-trip translation MQM≥80 pass rates and explicitly report low-resource sequence breakdowns.
  • For tool-using security automation (SIEM rules, etc.): adopt IR + RAG + executable consistency checks; track not just similarity metrics but syntactic validity and functional equivalence under synthetic log tests.
  • For federated / distributed learning defenses: test combined anomaly scoring + reputation + adversary-aware weighting (FedBBA-style) and stress with varying malicious ratios; report tuning sensitivity (DBSCAN ε, α/β).
  • For agent memory: evaluate whether dual-trace encoding improves your own cross-session tasks (especially update tracking and temporal reasoning) at equal token budgets.
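The persona-grid and tail-risk suggestions above take only a few lines to set up. The axis values below are placeholders, not a recommended demographic taxonomy, and the 0.8 threshold is illustrative:

```python
from itertools import product

def persona_grid(race, age, gender, confidence):
    """Cross four demographic axes into persona dicts for sycophancy probes."""
    axes = ("race", "age", "gender", "confidence")
    return [dict(zip(axes, combo))
            for combo in product(race, age, gender, confidence)]

def tail_fraction(scores, threshold=0.8):
    """Fraction of runs with a high sycophancy score: the tail-risk
    statistic suggested above, which a mean alone would hide."""
    return sum(s > threshold for s in scores) / len(scores)

grid = persona_grid(["A", "B"], ["young", "old"], ["woman", "man"],
                    ["tentative", "assertive"])
print(len(grid))  # 2*2*2*2 = 16 personas
```

Running the same prompts across the full grid and tracking `tail_fraction` per cell is the intersectional version of a persona-free safety test.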

Generated from per-paper analyses; no external browsing.