Daily AI Paper Report (2026-03-24)


Chinese version: [中文]

Run stats

  • Candidates: 1193
  • Selected: 30
  • Deepread completed: 30
  • Window (UTC): 2026-03-20T00:00:00Z → 2026-03-21T00:00:00Z (weekend_backlog_sun, expanded=0)
Selected papers
  • 2603.14923 Directional Routing in Transformers [PDF]
    Categories: cs.LG, cs.AI | Score: 94
    Why: New transformer routing; strong causal ablations + mech interp show routing is the dominant pathway.
    Tags: transformers, routing, mechanistic-interpretability, circuits, architecture
  • 2603.14723 Beyond Creed: A Non-Identity Safety Condition - A Strong Empirical Alternative to Identity Framing in Low-Data LoRA Fine-Tuning [PDF]
    Categories: cs.CL | Score: 94
    Why: Shows non-identity safety supervision beats identity framing in low-data LoRA on HarmBench.
    Tags: llm-safety, fine-tuning, LoRA, HarmBench, jailbreak-robustness, supervision-design
  • 2603.18444 Discounted Beta-Bernoulli Reward Estimation for Sample-Efficient Reinforcement Learning with Verifiable Rewards [PDF]
    Categories: cs.LG, cs.AI | Score: 91
    Why: Sample-efficient RLVR via reward distribution estimation; directly targets LLM reasoning post-training.
    Tags: RLVR, post-training, reasoning, sample-efficiency, reward-modeling, LLMs
  • 2603.18545 CoDA: Exploring Chain-of-Distribution Attacks and Post-Hoc Token-Space Repair for Medical Vision-Language Models [PDF]
    Categories: cs.CV, cs.AI | Score: 90
    Why: Clinically plausible distribution-shift attack chain + token-space repair for medical VLM robustness.
    Tags: robustness, distribution-shift, medical, vision-language, attacks, repair
  • 2603.18495 Cross-Domain Demo-to-Code via Neurosymbolic Counterfactual Reasoning [PDF]
    Categories: cs.AI | Score: 90
    Why: Neurosymbolic counterfactuals for demo-to-code; aims at verifiable procedure adaptation under domain shift.
    Tags: agents, robotics, neurosymbolic, counterfactual-reasoning, verification, code-generation, VLM
  • 2603.15600 From Passive Observer to Active Critic: Reinforcement Learning Elicits Process Reasoning for Robotic Manipulation [PDF]
    Categories: cs.RO, cs.AI, cs.CL, cs.CV | Score: 90
    Why: RL turns a video MLLM into a goal-aware process critic for long-horizon robot manipulation monitoring.
    Tags: robotics, process-supervision, reinforcement-learning, multimodal, monitoring
  • 2603.19223 F2LLM-v2: Inclusive, Performant, and Efficient Embeddings for a Multilingual World [PDF]
    Categories: cs.CL, cs.AI | Score: 90
    Why: Large multilingual embedding family (80M–14B) with strong MTEB results; useful for RAG/search.
    Tags: embeddings, multilingual, retrieval, MTEB, efficiency, distillation
  • 2603.18411 TARo: Token-level Adaptive Routing for LLM Test-time Alignment [PDF]
    Categories: cs.CL, cs.AI, cs.LG | Score: 89
    Why: Token-level test-time alignment routing using step-wise reward signals; sizable reasoning gains claimed.
    Tags: test-time-alignment, reasoning, reward-model, routing, inference-time, LLMs
  • 2603.11558 RoboClaw: An Agentic Framework for Scalable Long-Horizon Robotic Tasks [PDF]
    Categories: cs.RO, cs.AI | Score: 88
    Why: Unified VLM-driven long-horizon robotics with self-resetting data collection via entangled action pairs.
    Tags: robotics, agents, VLA, long-horizon, data-collection, self-improvement
  • 2603.17425 Proactive Knowledge Inquiry in Doctor-Patient Dialogue: Stateful Extraction, Belief Updating, and Path-Aware Action Planning [PDF]
    Categories: cs.AI | Score: 88
    Why: POMDP-lite proactive inquiry for doctor-patient dialogue; explicit belief updates and gap-aware planning.
    Tags: agents, planning, POMDP, uncertainty, dialogue-systems, clinical, tool-use
  • 2603.17872 Mitigating LLM Hallucinations through Domain-Grounded Tiered Retrieval [PDF]
    Categories: cs.CL, cs.AI | Score: 86
    Why: Tiered retrieval + verification pipeline (LangGraph) targeting hallucination reduction in high-stakes domains.
    Tags: hallucinations, RAG, verification, agents, reliability, grounding
  • 2603.18914 Security, privacy, and agentic AI in a regulatory view: From definitions and distinctions to provisions and reflections [PDF]
    Categories: cs.CR, cs.AI, cs.CY | Score: 86
    Why: Clear regulatory synthesis for security/privacy of agentic AI; useful for governance & deployment.
    Tags: agentic-ai, regulation, security, privacy, EU-AI-Act, governance
  • 2603.14889 Modeling and Benchmarking Spoken Dialogue Rewards with Modality and Colloquialness [PDF]
    Categories: eess.AS, cs.CL, cs.LG | Score: 86
    Why: Speech dialogue reward model + new preference dataset targeting prosody & colloquialness gaps.
    Tags: reward-modeling, speech, preference-data, evaluation, alignment
  • 2603.19131 From Inference Efficiency to Embodied Efficiency: Revisiting Efficiency Metrics for Vision-Language-Action Models [PDF]
    Categories: cs.LG, cs.RO | Score: 86
    Why: Proposes embodied efficiency metrics for VLA robots; challenges FLOPs/params as proxies for real performance.
    Tags: embodied-agents, VLA, evaluation, efficiency-metrics, robotics
  • 2603.11863 CreativeBench: Benchmarking and Enhancing Machine Creativity via Self-Evolving Challenges [PDF]
    Categories: cs.AI | Score: 86
    Why: Creativity benchmark with executable metrics to separate novelty from hallucination; self-evolving challenges.
    Tags: evaluation, benchmarks, code-generation, self-play, open-endedness, reliability
  • 2603.09868 CarbonBench: A Global Benchmark for Upscaling of Carbon Fluxes Using Zero-Shot Learning [PDF]
    Categories: cs.LG, physics.ao-ph | Score: 86
    Why: Large zero-shot spatial transfer benchmark for carbon fluxes; strong eval protocols + scale.
    Tags: benchmark, evaluation, zero-shot, domain-generalization, time-series, climate
  • 2603.14712 Towards Next-Generation LLM Training: From the Data-Centric Perspective [PDF]
    Categories: cs.CL, cs.LG | Score: 86
    Why: Data-centric LLM training: agentic data pipelines, selection/mixture optimization; high-leverage bottleneck.
    Tags: llm-training, data-centric-ai, data-mixtures, data-selection, agents, scaling
  • 2603.09356 Democratising Clinical AI through Dataset Condensation for Classical Clinical Models [PDF]
    Categories: cs.LG, cs.AI, cs.CR | Score: 86
    Why: DP dataset condensation for non-differentiable clinical models; practical privacy + utility angle.
    Tags: privacy, differential-privacy, dataset-condensation, synthetic-data, healthcare, reliability
  • 2603.14838 The Impact of Ideological Discourses in RAG: A Case Study with COVID-19 Treatments [PDF]
    Categories: cs.CL | Score: 86
    Why: Directly studies ideological retrieval effects in RAG on COVID treatments; relevant to grounding risks.
    Tags: RAG, bias, ideology, misinformation, evaluation, prompting
  • 2603.18388 Reflection in the Dark: Exposing and Escaping the Black Box in Reflective Prompt Optimization [PDF]
    Categories: cs.AI, cs.MA | Score: 86
    Why: Makes reflective prompt optimization more interpretable/robust with multi-agent verification and restarts.
    Tags: prompt-optimization, reflection, agents, robustness, interpretability, evaluation
  • 2603.15262 Probe-then-Plan: Environment-Aware Planning for Industrial E-commerce Search [PDF]
    Categories: cs.AI | Score: 84
    Why: Probe-then-plan grounds LLM search plans in live retrieval state to cut latency and invalid tool plans.
    Tags: agents, tool-use, planning, retrieval, latency, deployment
  • 2603.19185 MIDST Challenge at SaTML 2025: Membership Inference over Diffusion-models-based Synthetic Tabular data [PDF]
    Categories: cs.LG | Score: 84
    Why: Challenge-style benchmark on membership inference vs diffusion synthetic tabular data; concrete privacy eval.
    Tags: privacy, membership-inference, diffusion-models, synthetic-data, tabular, benchmark
  • 2603.17312 Recurrent Reasoning with Vision-Language Models for Estimating Long-Horizon Embodied Task Progress [PDF]
    Categories: cs.CV, cs.AI | Score: 84
    Why: Recurrent snippet-based VLM reasoning for long-horizon task progress; cheaper than full-trajectory video.
    Tags: embodied, VLM, reasoning, long-context, planning, monitoring
  • 2603.11479 Grammar of the Wave: Towards Explainable Multivariate Time Series Event Detection via Neuro-Symbolic VLM Agents [PDF]
    Categories: cs.LG, cs.AI, cs.MA | Score: 84
    Why: Neuro-symbolic agent framework to ground natural-language event specs into time-series intervals.
    Tags: agents, neuro-symbolic, grounding, time-series, evaluation
  • 2603.19225 FinTradeBench: A Financial Reasoning Benchmark for LLMs [PDF]
    Categories: cs.CE, cs.AI, cs.CL, cs.IR, q-fin.CP | Score: 84
    Why: New benchmark for LLM financial reasoning over fundamentals + trading signals; closer to real analyst workflows.
    Tags: LLM-evaluation, benchmark, reasoning, finance, multisignal
  • 2603.15183 Token Coherence: Adapting MESI Cache Protocols to Minimize Synchronization Overhead in Multi-Agent LLM Systems [PDF]
    Categories: cs.DC, cs.AI, cs.LG, cs.MA | Score: 84
    Why: Maps multi-agent LLM sync to cache coherence; proposes lazy invalidation to cut coordination cost.
    Tags: multi-agent, systems, coordination, scalability, synchronization
  • 2603.19002 RADIUS: Ranking, Distribution, and Significance - A Comprehensive Alignment Suite for Survey Simulation [PDF]
    Categories: cs.CL | Score: 84
    Why: Proposes standardized alignment metrics for LLM survey simulation, incl. ranking + distribution.
    Tags: evaluation, alignment-metrics, survey-simulation, benchmarking, distribution-shift
  • 2603.18447 SODIUM: From Open Web Data to Queryable Databases [PDF]
    Categories: cs.DB, cs.AI, cs.CL, cs.CV, cs.IR | Score: 84
    Why: Formalizes a web-to-database agentic pipeline + benchmark; relevant to tool-using agents and data quality.
    Tags: agents, information-extraction, web, databases, benchmark, tool-use
  • 2603.18481 T-QPM: Enabling Temporal Out-Of-Distribution Detection and Domain Generalization for Vision-Language Models in Open-World [PDF]
    Categories: cs.CV, cs.LG | Score: 83
    Why: Temporal OOD detection for VLMs under drift + covariate shift; open-world robustness focus.
    Tags: OOD-detection, robustness, vision-language, distribution-shift, domain-generalization, evaluation
  • 2603.08321 CORE-Acu: Structured Reasoning Traces and Knowledge Graph Safety Verification for Acupuncture Clinical Decision Support [PDF]
    Categories: cs.AI | Score: 82
    Why: Neuro-symbolic CDS: structured reasoning traces + KG safety verification to constrain clinical outputs.
    Tags: safety, clinical, neuro-symbolic, knowledge-graphs, reasoning-traces, verification

AI Paper Insight Brief

2026-03-24

0) Executive takeaways (read this first)

  • “Verify-and-revise” is hardening into a reusable safety pattern: CORE-Acu (clinical KG veto + bounded rewrites), tiered retrieval verification for hallucinations, and neurosymbolic counterfactual verification for robotics all show the same move—generate → check against explicit constraints/world models → revise or refuse.
  • RAG is increasingly treated as an attack surface, not just a factuality fix: ideological retrieval context measurably steers outputs (and explicit ideology descriptions amplify it), while tiered retrieval pipelines still fail on false-premise overclaiming—suggesting “retrieval governance + answerability gating” is becoming mandatory.
  • Lightweight routing/coordination mechanisms are emerging at three levels: (i) architectural (Directional Routing inside Transformers), (ii) decoding-time (TARo token-level adaptive mixing of base+reward logits), and (iii) systems (Token Coherence replacing broadcast sync in multi-agent workflows). All aim to reduce interference/cost while keeping behavior controllable.
  • Temporal and distribution shift is being operationalized with benchmarks + protocols: CarbonBench standardizes zero-shot spatial transfer for carbon flux regression; T-QPM targets temporal OOD for VLMs; CoDA targets pipeline-realistic distribution chains in medical imaging.
  • Low-data alignment can hinge on wording, not just “more data”: a matched non-identity safety framing beats creed/constitutional phrasing in 130-example LoRA across three model families on HarmBench, with negligible MMLU/ARC deltas.
  • Embodied deployment metrics are diverging from inference metrics: compression/pruning/token/action reductions can preserve success rate yet worsen jerk/path length/time—“efficiency” claims for VLA models need embodied-efficiency reporting.
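The "answerability gating" move above can be made concrete with a small pre-retrieval gate. This is an illustrative sketch only: `FALSE_PREMISE_MARKERS` and `gate_query` are invented names, and a real system would use a calibrated classifier or a retrieval-backed premise check rather than keyword heuristics.

```python
from dataclasses import dataclass

# Illustrative heuristic only: a production gate would use a calibrated
# classifier or an LLM judge, not keyword matching.
FALSE_PREMISE_MARKERS = ("why does", "how come", "given that")

@dataclass
class GateDecision:
    action: str  # "answer" | "verify_premise"
    reason: str

def gate_query(query, premise_is_supported):
    """Route a query before retrieval: validate embedded premises first."""
    q = query.lower().strip()
    # Questions like "why does X ..." presuppose that X is actually true.
    if q.startswith(FALSE_PREMISE_MARKERS):
        premise = q.split(" ", 2)[-1]
        if not premise_is_supported(premise):
            return GateDecision("verify_premise",
                                f"unsupported presupposition: {premise!r}")
    return GateDecision("answer", "no unsupported presupposition detected")
```

A `verify_premise` decision would route to a clarification turn or a premise-verification sub-query instead of straight retrieval, which is exactly the false-premise overclaiming failure mode flagged for tiered retrieval.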

1) Key themes (clusters)

Theme: Neuro-symbolic verification loops for high-stakes decisions

Theme: Retrieval as both mitigation and manipulation channel

Theme: Routing/coordination as a general-purpose control knob (model, decoding, systems)

  • Why it matters: As models and agent systems scale, interference and coordination costs dominate. Routing offers a compact way to allocate computation/authority dynamically—potentially improving interpretability, reliability, and cost.
  • Representative papers: Directional Routing in Transformers; TARo; Token Coherence.
  • Common approach:
    • Learn input-dependent suppression/mixing (directional component suppression; per-token α mixing base+reward logits).
    • Replace global/static knobs with adaptive, local decisions (token-level vs fixed interpolation; coherence invalidation vs broadcast).
    • Add formal/causal probes to show load-bearing behavior (router-off collapses induction/recall; TLA+ invariants for sync safety).
  • Open questions / failure modes:
    • Generality and variance: directional routing results are from limited scales/seeds; benchmark gains don’t always follow PPL gains.
    • Reward-model dependence and OOD sensitivity for test-time alignment routers.
    • Coherence protocols rely on assumptions (central authority; simulation vs production traces; liveness under failures).

Theme: Robustness under realistic distribution shift (temporal, spatial, pipeline)

Theme: Evaluation infrastructure for alignment, privacy, and “creativity”

2) Technical synthesis

  • Multiple works converge on structured intermediate artifacts as the unit of verification: syndrome→pathology→principle→acupoint chains (CORE-Acu), ELT schemas (SELA), symbolic operators and scene graphs (NESYCR), and state/event tuples + weights (doctor–patient inquiry).
  • Bounded loops are the dominant safety/control primitive: generate–verify–revise (CORE-Acu; tiered retrieval verification; NESYCR repair), with explicit fallbacks (human confirmation; graceful apology).
  • Routing is becoming ubiquitous: inside the model (directional suppression), at decoding (token-level α), in retrieval (domain/tier routing; dual-track retrieval), and in serving (complexity-aware router for e-commerce).
  • Several papers show metric improvements can be misleading if not aligned to the right objective: directional routing yields large PPL reductions but no multiple-choice benchmark gains; VLA compression improves inference metrics but worsens embodied jerk/path/time.
  • Temporal anchoring appears as a general trick for long-horizon understanding: PRIMO’s (I_init, V_seq, I_curr) input structure; T-QPM’s timestep-conditioned prototypes and drift penalties.
  • Robustness work is shifting from “single corruption” to composed, realistic shift chains (CoDA’s A∘R∘D) and from static OOD to streaming temporal drift (T-QPM).
  • Evaluation is increasingly tail-aware: CarbonBench reports per-site quantiles; T-QPM reports early vs late timestep FPR95/AUROC; Token Coherence analyzes volatility regimes.
  • A recurring failure mode across retrieval/verification systems is premise validation: systems can become confident in the wrong frame (false-premise overclaiming; ideology amplification).
  • Lightweight adaptation is favored when foundations are frozen: LoRA with reweighted loss (CORE-Acu), two-scalar fusion learning (T-QPM), token-space linear adapter repair (CoDA), and inference-only activation steering (EvoRePE).
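As a small reference point for the tail-aware metrics above, here is a minimal FPR95 computation, assuming the common convention that higher detector scores mean "more in-distribution" (this is a generic sketch, not T-QPM's evaluation code):

```python
def fpr_at_95_tpr(id_scores, ood_scores):
    """FPR95: false-positive rate on OOD data at the score threshold
    that keeps 95% of in-distribution (ID) samples accepted.
    Convention: higher score = more in-distribution."""
    id_sorted = sorted(id_scores)
    # Threshold at the 5th percentile of ID scores => 95% of ID pass.
    threshold = id_sorted[int(0.05 * len(id_sorted))]
    false_positives = sum(1 for s in ood_scores if s >= threshold)
    return false_positives / len(ood_scores)
```

Reporting it separately for early versus late timesteps, as T-QPM does, amounts to calling this per time bucket.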

3) Top 5 papers (with “why now”)

1) CORE-Acu: Structured Reasoning Traces and Knowledge Graph Safety Verification for Acupuncture Clinical Decision Support

  • Introduces a full neuro-symbolic safety stack: structured S-CoT + TCM KG + entity-reweighted loss + generate–verify–revise loop.
  • Reports 0/1,000 KG-defined safety violations after verification, vs 8.5% for GPT-4o on the same benchmark.
  • Practical template for other high-stakes domains where token-level entity fidelity and hard contraindication rules matter.
  • Skepticism: safety is only as good as KG coverage; binary veto may miss nuanced clinical trade-offs.
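The KG-veto step can be illustrated in a few lines. This is a toy sketch, not CORE-Acu's implementation; the contraindication entries below are invented for illustration and are not clinical guidance.

```python
# Toy contraindication KG: condition -> acupoints that are forbidden.
# Entries are invented for illustration, not clinical guidance.
CONTRAINDICATION_KG = {
    "pregnancy": {"LI4", "SP6"},
}

def verify_plan(patient_conditions, recommended_points):
    """Hard veto: return the set of KG-defined violations (empty = pass)."""
    violations = set()
    for condition in patient_conditions:
        forbidden = CONTRAINDICATION_KG.get(condition, set())
        violations |= forbidden & set(recommended_points)
    return violations
```

In the paper's loop, a non-empty result triggers a bounded rewrite rather than silent filtering, so the final reasoning trace stays internally consistent with the revised plan.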

2) Token Coherence: Adapting MESI Cache Protocols to Minimize Synchronization Overhead in Multi-Agent LLM Systems

  • Reframes multi-agent context sharing as cache coherence; provides an analytic savings bound and a concrete protocol (CCS).
  • Uses TLA+ model checking to verify invariants (single-writer, monotonic versioning, bounded staleness).
  • Simulation shows ~84–95% token savings for lazy invalidation across volatility regimes—direct cost lever for agent deployments.
  • Skepticism: evaluation is simulation-based; centralized authority and liveness under failures remain concerns.
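A toy token-accounting model makes the broadcast-versus-lazy-invalidation gap concrete. The cost model and event trace below are illustrative assumptions, not the paper's simulator or its CCS protocol.

```python
def broadcast_cost(events, n_agents, ctx_tokens):
    """Baseline: every write pushes the full context to all other agents."""
    writes = sum(1 for op, _ in events if op == "write")
    return writes * (n_agents - 1) * ctx_tokens

def lazy_cost(events, n_agents, ctx_tokens, inval_tokens=1):
    """MESI-style lazy invalidation: a write sends only tiny invalidation
    notices; an agent re-fetches the full context on its next read of a
    stale copy."""
    total, stale = 0, set()
    for op, agent in events:
        if op == "write":
            total += inval_tokens * (n_agents - 1)  # notices, not context
            stale = set(range(n_agents)) - {agent}
        elif op == "read" and agent in stale:
            total += ctx_tokens                     # re-fetch on demand
            stale.discard(agent)
    return total
```

The savings grow when writes outnumber reads of any given agent's copy, which is the volatility-regime dependence the paper analyzes.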

3) Directional Routing in Transformers

  • Adds a small router that suppresses learned head-space directions; routing becomes load-bearing (router-off collapses recall/induction).
  • Reports large domain perplexity reductions (31–56%) with ~3.9% parameter overhead.
  • Provides built-in, causally manipulable “directions” as interpretability hooks.
  • Skepticism: limited seeds/scales; PPL gains didn’t translate to multiple-choice benchmark gains.
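The suppression mechanism as described can be sketched directly: subtract a gated projection onto a learned direction. This is a minimal illustration; in the paper the gate values come from a learned router, and the shapes and names here are assumptions.

```python
def directional_route(hidden, direction, gate):
    """Suppress the component of each token vector along `direction`,
    scaled by a per-token gate in [0, 1].
    gate = 1 removes the direction entirely; gate = 0 is the identity."""
    norm = sum(x * x for x in direction) ** 0.5
    d_hat = [x / norm for x in direction]
    routed = []
    for g, vec in zip(gate, hidden):
        proj = sum(v * u for v, u in zip(vec, d_hat))  # component along d
        routed.append([v - g * proj * u for v, u in zip(vec, d_hat)])
    return routed
```

Because the intervention is a single gated direction per site, "router-off" ablations (forcing gate = 0) give a direct causal probe of whether the routed pathway is load-bearing.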

4) TARo: Token-level Adaptive Routing for LLM Test-time Alignment

  • Learns per-token mixing between base and reward logits, avoiding brittle fixed interpolation in test-time alignment.
  • Reports large MATH500 gains (e.g., 32.0% → 54.4% for Llama-3.1-8B in Table 1) and weak-to-strong transfer to larger backbones.
  • Useful for deployments where retraining is costly but decoding-time steering is feasible.
  • Skepticism: depends on reward model quality/domain bias; full-logits routing can hurt throughput.
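The fixed-versus-adaptive interpolation contrast reduces to a per-step mixing coefficient. This is an illustrative sketch, not TARo's implementation: there the mixing weight is predicted by a learned router, whereas here `alpha` is supplied by the caller.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mixed_next_token_dist(base_logits, reward_logits, alpha):
    """Per-step routing: alpha in [0, 1] mixes base and reward logits.
    alpha = 0 recovers pure base decoding; a constant alpha recovers the
    fixed interpolation that token-level routing aims to replace."""
    mixed = [(1 - alpha) * b + alpha * r
             for b, r in zip(base_logits, reward_logits)]
    return softmax(mixed)
```

At decode time a learned router would predict `alpha` token by token from the current state; producing `reward_logits` over the full vocabulary is where the throughput cost noted above comes from.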

5) The Impact of Ideological Discourses in RAG: A Case Study with COVID-19 Treatments

  • Shows RAG can propagate ideological stance from retrieved texts; adding explicit LMDA descriptions generally amplifies alignment.
  • Provides a concrete methodology (LMDA + controlled retrieval + semantic/lexical similarity + ANOVA) to quantify steering.
  • “Why now”: RAG is ubiquitous in production; this highlights a governance gap beyond hallucinations.
  • Skepticism: domain-specific corpus and curated exemplar selection; effects may vary with retrieval/reranking choices.
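The similarity side of that methodology can be sketched with a crude lexical proxy: bag-of-words cosine here stands in for the paper's semantic and lexical similarity measures, and is not their implementation.

```python
import math
from collections import Counter

def cosine_bow(a, b):
    """Cosine similarity over bag-of-words counts (crude lexical proxy)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def steering_delta(answer, pole_a_exemplar, pole_b_exemplar):
    """Positive => the answer leans lexically toward pole A's exemplar."""
    return cosine_bow(answer, pole_a_exemplar) - cosine_bow(answer, pole_b_exemplar)
```

Computed across controlled retrieval conditions, deltas like this are the quantities an ANOVA would then test for significant stance steering.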

4) Practical next steps

  • For any safety-critical assistant, prototype a generate–verify–revise controller with: (i) explicit intermediate schema, (ii) deterministic constraint checks, (iii) bounded retries, (iv) refusal/handoff policy when unresolved.
  • Add a pre-retrieval answerability / premise-check gate to RAG pipelines to reduce false-premise overclaiming (explicitly flagged as a key failure mode in tiered retrieval verification).
  • Treat retrieval corpora as untrusted inputs: implement retrieval governance (source allowlists, ideology/bias detectors, chunk-level provenance) and test for stance steering under controlled retrieval poles.
  • If running multi-agent workflows, instrument token spend by sync boundary and test coherence-style invalidation vs broadcast; verify invariants (single-writer, version monotonicity, staleness bounds) before rollout.
  • When using test-time alignment, replace fixed mixing with adaptive routing (token-level α) and measure not just accuracy but throughput cost and OOD behavior.
  • For VLM robustness, expand evaluation beyond single corruptions to composed pipeline shifts (CoDA-style) and temporal drift (T-QPM-style); track early/late timestep metrics.
  • For embodied agents, report embodied-efficiency metrics (jerk, path length, completion time, action rate) alongside inference metrics before claiming “efficiency improvements.”
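The controller in the first bullet can be written as a generic skeleton. This is an illustrative sketch, not any paper's code: `generate`, `check`, and the refusal message are placeholders to be swapped for real components.

```python
def controlled_answer(generate, check, max_retries=2,
                      refusal="No verified answer; escalating to a human."):
    """Generate -> deterministic check -> revise with feedback -> refuse.
    `generate(feedback)` returns a candidate answer; `check(candidate)`
    returns a list of violated constraints (empty list = pass)."""
    feedback = None
    for _ in range(max_retries + 1):
        candidate = generate(feedback)
        violations = check(candidate)
        if not violations:
            return candidate
        feedback = f"fix these violations: {violations}"
    # Retries exhausted: refuse / hand off rather than ship unverified output.
    return refusal
```

The bounded loop plus an explicit refusal path covers items (i)–(iv): the schema lives in `candidate`, the deterministic checks in `check`, the retry budget in `max_retries`, and the handoff policy in `refusal`.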

Generated from per-paper analyses; no external browsing.