Daily AI Paper Report (2026-03-24)


Chinese version: [中文]

Run stats

  • Candidates: 1193
  • Selected: 30
  • Deepread completed: 30
  • Window (UTC): 2026-03-20T00:00:00Z → 2026-03-21T00:00:00Z (weekend_backlog_sun, expanded=0)
Selected papers
  • 2603.14923 Directional Routing in Transformers [PDF]
    Categories: cs.LG, cs.AI | Score: 94
    Why: New transformer routing; strong causal ablations + mech interp show routing is the dominant pathway.
    Tags: transformers, routing, mechanistic-interpretability, circuits, architecture
  • 2603.14723 Beyond Creed: A Non-Identity Safety Condition - A Strong Empirical Alternative to Identity Framing in Low-Data LoRA Fine-Tuning [PDF]
    Categories: cs.CL | Score: 94
    Why: Shows non-identity safety supervision beats identity framing in low-data LoRA on HarmBench.
    Tags: llm-safety, fine-tuning, LoRA, HarmBench, jailbreak-robustness, supervision-design
  • 2603.18444 Discounted Beta-Bernoulli Reward Estimation for Sample-Efficient Reinforcement Learning with Verifiable Rewards [PDF]
    Categories: cs.LG, cs.AI | Score: 91
    Why: Sample-efficient RLVR via reward distribution estimation; directly targets LLM reasoning post-training.
    Tags: RLVR, post-training, reasoning, sample-efficiency, reward-modeling, LLMs
  • 2603.18545 CoDA: Exploring Chain-of-Distribution Attacks and Post-Hoc Token-Space Repair for Medical Vision-Language Models [PDF]
    Categories: cs.CV, cs.AI | Score: 90
    Why: Clinically plausible distribution-shift attack chain + token-space repair for medical VLM robustness.
    Tags: robustness, distribution-shift, medical, vision-language, attacks, repair
  • 2603.18495 Cross-Domain Demo-to-Code via Neurosymbolic Counterfactual Reasoning [PDF]
    Categories: cs.AI | Score: 90
    Why: Neurosymbolic counterfactuals for demo-to-code; aims at verifiable procedure adaptation under domain shift.
    Tags: agents, robotics, neurosymbolic, counterfactual-reasoning, verification, code-generation, VLM
  • 2603.15600 From Passive Observer to Active Critic: Reinforcement Learning Elicits Process Reasoning for Robotic Manipulation [PDF]
    Categories: cs.RO, cs.AI, cs.CL, cs.CV | Score: 90
    Why: RL turns a video MLLM into a goal-aware process critic for long-horizon robot manipulation monitoring.
    Tags: robotics, process-supervision, reinforcement-learning, multimodal, monitoring
  • 2603.19223 F2LLM-v2: Inclusive, Performant, and Efficient Embeddings for a Multilingual World [PDF]
    Categories: cs.CL, cs.AI | Score: 90
    Why: Large multilingual embedding family (80M–14B) with strong MTEB results; useful for RAG/search.
    Tags: embeddings, multilingual, retrieval, MTEB, efficiency, distillation
  • 2603.18411 TARo: Token-level Adaptive Routing for LLM Test-time Alignment [PDF]
    Categories: cs.CL, cs.AI, cs.LG | Score: 89
    Why: Token-level test-time alignment routing using step-wise reward signals; sizable reasoning gains claimed.
    Tags: test-time-alignment, reasoning, reward-model, routing, inference-time, LLMs
  • 2603.11558 RoboClaw: An Agentic Framework for Scalable Long-Horizon Robotic Tasks [PDF]
    Categories: cs.RO, cs.AI | Score: 88
    Why: Unified VLM-driven long-horizon robotics with self-resetting data collection via entangled action pairs.
    Tags: robotics, agents, VLA, long-horizon, data-collection, self-improvement
  • 2603.17425 Proactive Knowledge Inquiry in Doctor-Patient Dialogue: Stateful Extraction, Belief Updating, and Path-Aware Action Planning [PDF]
    Categories: cs.AI | Score: 88
    Why: POMDP-lite proactive inquiry for doctor-patient dialogue; explicit belief updates and gap-aware planning.
    Tags: agents, planning, POMDP, uncertainty, dialogue-systems, clinical, tool-use
  • 2603.17872 Mitigating LLM Hallucinations through Domain-Grounded Tiered Retrieval [PDF]
    Categories: cs.CL, cs.AI | Score: 86
    Why: Tiered retrieval + verification pipeline (LangGraph) targeting hallucination reduction in high-stakes domains.
    Tags: hallucinations, RAG, verification, agents, reliability, grounding
  • 2603.18914 Security, privacy, and agentic AI in a regulatory view: From definitions and distinctions to provisions and reflections [PDF]
    Categories: cs.CR, cs.AI, cs.CY | Score: 86
    Why: Clear regulatory synthesis for security/privacy of agentic AI; useful for governance & deployment.
    Tags: agentic-ai, regulation, security, privacy, EU-AI-Act, governance
  • 2603.14889 Modeling and Benchmarking Spoken Dialogue Rewards with Modality and Colloquialness [PDF]
    Categories: eess.AS, cs.CL, cs.LG | Score: 86
    Why: Speech dialogue reward model + new preference dataset targeting prosody & colloquialness gaps.
    Tags: reward-modeling, speech, preference-data, evaluation, alignment
  • 2603.19131 From Inference Efficiency to Embodied Efficiency: Revisiting Efficiency Metrics for Vision-Language-Action Models [PDF]
    Categories: cs.LG, cs.RO | Score: 86
    Why: Proposes embodied efficiency metrics for VLA robots; challenges FLOPs/params as proxies for real performance.
    Tags: embodied-agents, VLA, evaluation, efficiency-metrics, robotics
  • 2603.11863 CreativeBench: Benchmarking and Enhancing Machine Creativity via Self-Evolving Challenges [PDF]
    Categories: cs.AI | Score: 86
    Why: Creativity benchmark with executable metrics to separate novelty from hallucination; self-evolving challenges.
    Tags: evaluation, benchmarks, code-generation, self-play, open-endedness, reliability
  • 2603.09868 CarbonBench: A Global Benchmark for Upscaling of Carbon Fluxes Using Zero-Shot Learning [PDF]
    Categories: cs.LG, physics.ao-ph | Score: 86
    Why: Large zero-shot spatial transfer benchmark for carbon fluxes; strong eval protocols + scale.
    Tags: benchmark, evaluation, zero-shot, domain-generalization, time-series, climate
  • 2603.14712 Towards Next-Generation LLM Training: From the Data-Centric Perspective [PDF]
    Categories: cs.CL, cs.LG | Score: 86
    Why: Data-centric LLM training: agentic data pipelines, selection/mixture optimization; high-leverage bottleneck.
    Tags: llm-training, data-centric-ai, data-mixtures, data-selection, agents, scaling
  • 2603.09356 Democratising Clinical AI through Dataset Condensation for Classical Clinical Models [PDF]
    Categories: cs.LG, cs.AI, cs.CR | Score: 86
    Why: DP dataset condensation for non-differentiable clinical models; practical privacy + utility angle.
    Tags: privacy, differential-privacy, dataset-condensation, synthetic-data, healthcare, reliability
  • 2603.14838 The Impact of Ideological Discourses in RAG: A Case Study with COVID-19 Treatments [PDF]
    Categories: cs.CL | Score: 86
    Why: Directly studies ideological retrieval effects in RAG on COVID treatments; relevant to grounding risks.
    Tags: RAG, bias, ideology, misinformation, evaluation, prompting
  • 2603.18388 Reflection in the Dark: Exposing and Escaping the Black Box in Reflective Prompt Optimization [PDF]
    Categories: cs.AI, cs.MA | Score: 86
    Why: Makes reflective prompt optimization more interpretable/robust with multi-agent verification and restarts.
    Tags: prompt-optimization, reflection, agents, robustness, interpretability, evaluation
  • 2603.15262 Probe-then-Plan: Environment-Aware Planning for Industrial E-commerce Search [PDF]
    Categories: cs.AI | Score: 84
    Why: Probe-then-plan grounds LLM search plans in live retrieval state to cut latency and invalid tool plans.
    Tags: agents, tool-use, planning, retrieval, latency, deployment
  • 2603.19185 MIDST Challenge at SaTML 2025: Membership Inference over Diffusion-models-based Synthetic Tabular data [PDF]
    Categories: cs.LG | Score: 84
    Why: Challenge-style benchmark on membership inference vs diffusion synthetic tabular data; concrete privacy eval.
    Tags: privacy, membership-inference, diffusion-models, synthetic-data, tabular, benchmark
  • 2603.17312 Recurrent Reasoning with Vision-Language Models for Estimating Long-Horizon Embodied Task Progress [PDF]
    Categories: cs.CV, cs.AI | Score: 84
    Why: Recurrent snippet-based VLM reasoning for long-horizon task progress; cheaper than full-trajectory video.
    Tags: embodied, VLM, reasoning, long-context, planning, monitoring
  • 2603.11479 Grammar of the Wave: Towards Explainable Multivariate Time Series Event Detection via Neuro-Symbolic VLM Agents [PDF]
    Categories: cs.LG, cs.AI, cs.MA | Score: 84
    Why: Neuro-symbolic agent framework to ground natural-language event specs into time-series intervals.
    Tags: agents, neuro-symbolic, grounding, time-series, evaluation
  • 2603.19225 FinTradeBench: A Financial Reasoning Benchmark for LLMs [PDF]
    Categories: cs.CE, cs.AI, cs.CL, cs.IR, q-fin.CP | Score: 84
    Why: New benchmark for LLM financial reasoning over fundamentals + trading signals; closer to real analyst workflows.
    Tags: LLM-evaluation, benchmark, reasoning, finance, multisignal
  • 2603.15183 Token Coherence: Adapting MESI Cache Protocols to Minimize Synchronization Overhead in Multi-Agent LLM Systems [PDF]
    Categories: cs.DC, cs.AI, cs.LG, cs.MA | Score: 84
    Why: Maps multi-agent LLM sync to cache coherence; proposes lazy invalidation to cut coordination cost.
    Tags: multi-agent, systems, coordination, scalability, synchronization
  • 2603.19002 RADIUS: Ranking, Distribution, and Significance - A Comprehensive Alignment Suite for Survey Simulation [PDF]
    Categories: cs.CL | Score: 84
    Why: Proposes standardized alignment metrics for LLM survey simulation, incl. ranking + distribution.
    Tags: evaluation, alignment-metrics, survey-simulation, benchmarking, distribution-shift
  • 2603.18447 SODIUM: From Open Web Data to Queryable Databases [PDF]
    Categories: cs.DB, cs.AI, cs.CL, cs.CV, cs.IR | Score: 84
    Why: Formalizes a web-to-database agentic pipeline + benchmark; relevant to tool-using agents and data quality.
    Tags: agents, information-extraction, web, databases, benchmark, tool-use
  • 2603.18481 T-QPM: Enabling Temporal Out-Of-Distribution Detection and Domain Generalization for Vision-Language Models in Open-World [PDF]
    Categories: cs.CV, cs.LG | Score: 83
    Why: Temporal OOD detection for VLMs under drift + covariate shift; open-world robustness focus.
    Tags: OOD-detection, robustness, vision-language, distribution-shift, domain-generalization, evaluation
  • 2603.08321 CORE-Acu: Structured Reasoning Traces and Knowledge Graph Safety Verification for Acupuncture Clinical Decision Support [PDF]
    Categories: cs.AI | Score: 82
    Why: Neuro-symbolic CDS: structured reasoning traces + KG safety verification to constrain clinical outputs.
    Tags: safety, clinical, neuro-symbolic, knowledge-graphs, reasoning-traces, verification

AI Paper Insight Brief

2026-03-24

0) Executive takeaways (read this first)

  • “Verify-and-revise” is hardening into a reusable safety pattern: CORE-Acu (clinical KG veto + bounded rewrites), tiered retrieval verification for hallucinations, and neurosymbolic counterfactual verification for robotics all show the same move—generate → check against explicit constraints/world models → revise or refuse.
  • RAG is increasingly treated as an attack surface, not just a factuality fix: ideological retrieval context measurably steers outputs (and explicit ideology descriptions amplify it), while tiered retrieval pipelines still fail on false-premise overclaiming—suggesting “retrieval governance + answerability gating” is becoming mandatory.
  • Lightweight routing/coordination mechanisms are emerging at three levels: (i) architectural (Directional Routing inside Transformers), (ii) decoding-time (TARo token-level adaptive mixing of base+reward logits), and (iii) systems (Token Coherence replacing broadcast sync in multi-agent workflows). All aim to reduce interference/cost while keeping behavior controllable.
  • Temporal and distribution shift is being operationalized with benchmarks + protocols: CarbonBench standardizes zero-shot spatial transfer for carbon flux regression; T-QPM targets temporal OOD for VLMs; CoDA targets pipeline-realistic distribution chains in medical imaging.
  • Low-data alignment can hinge on wording, not just “more data”: a matched non-identity safety framing beats creed/constitutional phrasing in 130-example LoRA across three model families on HarmBench, with negligible MMLU/ARC deltas.
  • Embodied deployment metrics are diverging from inference metrics: compression/pruning/token/action reductions can preserve success rate yet worsen jerk/path length/time—“efficiency” claims for VLA models need embodied-efficiency reporting.
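The "answerability gating" move above can be made concrete with a small pre-retrieval gate. This is an illustrative sketch only: `FALSE_PREMISE_MARKERS` and `gate_query` are invented names, and a real system would use a calibrated classifier or a retrieval-backed premise check rather than keyword heuristics.

```python
from dataclasses import dataclass

# Illustrative heuristic only: a production gate would use a calibrated
# classifier or an LLM judge, not keyword matching.
FALSE_PREMISE_MARKERS = ("why does", "how come", "given that")

@dataclass
class GateDecision:
    action: str  # "answer" | "verify_premise"
    reason: str

def gate_query(query, premise_is_supported):
    """Route a query before retrieval: validate embedded premises first."""
    q = query.lower().strip()
    # Questions like "why does X ..." presuppose that X is actually true.
    if q.startswith(FALSE_PREMISE_MARKERS):
        premise = q.split(" ", 2)[-1]
        if not premise_is_supported(premise):
            return GateDecision("verify_premise",
                                f"unsupported presupposition: {premise!r}")
    return GateDecision("answer", "no unsupported presupposition detected")
```

A `verify_premise` decision would route to a clarification turn or a premise-verification sub-query instead of straight retrieval, which is exactly the false-premise overclaiming failure mode flagged for tiered retrieval.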

1) Key themes (clusters)

Theme: Neuro-symbolic verification loops for high-stakes decisions

Theme: Retrieval as both mitigation and manipulation channel

Theme: Routing/coordination as a general-purpose control knob (model, decoding, systems)

  • Why it matters: As models and agent systems scale, interference and coordination costs dominate. Routing offers a compact way to allocate computation/authority dynamically—potentially improving interpretability, reliability, and cost.
  • Representative papers: Directional Routing in Transformers; TARo; Token Coherence.
  • Common approach:
    • Learn input-dependent suppression/mixing (directional component suppression; per-token α mixing base+reward logits).
    • Replace global/static knobs with adaptive, local decisions (token-level vs fixed interpolation; coherence invalidation vs broadcast).
    • Add formal/causal probes to show load-bearing behavior (router-off collapses induction/recall; TLA+ invariants for sync safety).
  • Open questions / failure modes:
    • Generality and variance: directional routing results are from limited scales/seeds; benchmark gains don’t always follow PPL gains.
    • Reward-model dependence and OOD sensitivity for test-time alignment routers.
    • Coherence protocols rely on assumptions (central authority; simulation vs production traces; liveness under failures).

Theme: Robustness under realistic distribution shift (temporal, spatial, pipeline)

Theme: Evaluation infrastructure for alignment, privacy, and “creativity”

2) Technical synthesis

  • Multiple works converge on structured intermediate artifacts as the unit of verification: syndrome→pathology→principle→acupoint chains (CORE-Acu), ELT schemas (SELA), symbolic operators and scene graphs (NESYCR), and state/event tuples + weights (doctor–patient inquiry).
  • Bounded loops are the dominant safety/control primitive: generate–verify–revise (CORE-Acu; tiered retrieval verification; NESYCR repair), with explicit fallbacks (human confirmation; graceful apology).
  • Routing is becoming ubiquitous: inside the model (directional suppression), at decoding (token-level α), in retrieval (domain/tier routing; dual-track retrieval), and in serving (complexity-aware router for e-commerce).
  • Several papers show metric improvements can be misleading if not aligned to the right objective: directional routing yields large PPL reductions but no multiple-choice benchmark gains; VLA compression improves inference metrics but worsens embodied jerk/path/time.
  • Temporal anchoring appears as a general trick for long-horizon understanding: PRIMO’s (I_init, V_seq, I_curr) input structure; T-QPM’s timestep-conditioned prototypes and drift penalties.
  • Robustness work is shifting from “single corruption” to composed, realistic shift chains (CoDA’s A∘R∘D) and from static OOD to streaming temporal drift (T-QPM).
  • Evaluation is increasingly tail-aware: CarbonBench reports per-site quantiles; T-QPM reports early vs late timestep FPR95/AUROC; Token Coherence analyzes volatility regimes.
  • A recurring failure mode across retrieval/verification systems is premise validation: systems can become confident in the wrong frame (false-premise overclaiming; ideology amplification).
  • Lightweight adaptation is favored when foundations are frozen: LoRA with reweighted loss (CORE-Acu), two-scalar fusion learning (T-QPM), token-space linear adapter repair (CoDA), and inference-only activation steering (EvoRePE).
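As a small reference point for the tail-aware metrics above, here is a minimal FPR95 computation, assuming the common convention that higher detector scores mean "more in-distribution" (this is a generic sketch, not T-QPM's evaluation code):

```python
def fpr_at_95_tpr(id_scores, ood_scores):
    """FPR95: false-positive rate on OOD data at the score threshold
    that keeps 95% of in-distribution (ID) samples accepted.
    Convention: higher score = more in-distribution."""
    id_sorted = sorted(id_scores)
    # Threshold at the 5th percentile of ID scores => 95% of ID pass.
    threshold = id_sorted[int(0.05 * len(id_sorted))]
    false_positives = sum(1 for s in ood_scores if s >= threshold)
    return false_positives / len(ood_scores)
```

Reporting it separately for early versus late timesteps, as T-QPM does, amounts to calling this per time bucket.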

3) Top 5 papers (with “why now”)

1) CORE-Acu: Structured Reasoning Traces and Knowledge Graph Safety Verification for Acupuncture Clinical Decision Support

  • Introduces a full neuro-symbolic safety stack: structured S-CoT + TCM KG + entity-reweighted loss + generate–verify–revise loop.
  • Reports 0/1,000 KG-defined safety violations after verification, vs 8.5% for GPT-4o on the same benchmark.
  • Practical template for other high-stakes domains where token-level entity fidelity and hard contraindication rules matter.
  • Skepticism: safety is only as good as KG coverage; binary veto may miss nuanced clinical trade-offs.
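The KG-veto step can be illustrated in a few lines. This is a toy sketch, not CORE-Acu's implementation; the contraindication entries below are invented for illustration and are not clinical guidance.

```python
# Toy contraindication KG: condition -> acupoints that are forbidden.
# Entries are invented for illustration, not clinical guidance.
CONTRAINDICATION_KG = {
    "pregnancy": {"LI4", "SP6"},
}

def verify_plan(patient_conditions, recommended_points):
    """Hard veto: return the set of KG-defined violations (empty = pass)."""
    violations = set()
    for condition in patient_conditions:
        forbidden = CONTRAINDICATION_KG.get(condition, set())
        violations |= forbidden & set(recommended_points)
    return violations
```

In the paper's loop, a non-empty result triggers a bounded rewrite rather than silent filtering, so the final reasoning trace stays internally consistent with the revised plan.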

2) Token Coherence: Adapting MESI Cache Protocols to Minimize Synchronization Overhead in Multi-Agent LLM Systems

  • Reframes multi-agent context sharing as cache coherence; provides an analytic savings bound and a concrete protocol (CCS).
  • Uses TLA+ model checking to verify invariants (single-writer, monotonic versioning, bounded staleness).
  • Simulation shows ~84–95% token savings for lazy invalidation across volatility regimes—direct cost lever for agent deployments.
  • Skepticism: evaluation is simulation-based; centralized authority and liveness under failures remain concerns.
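A toy token-accounting model makes the broadcast-versus-lazy-invalidation gap concrete. The cost model and event trace below are illustrative assumptions, not the paper's simulator or its CCS protocol.

```python
def broadcast_cost(events, n_agents, ctx_tokens):
    """Baseline: every write pushes the full context to all other agents."""
    writes = sum(1 for op, _ in events if op == "write")
    return writes * (n_agents - 1) * ctx_tokens

def lazy_cost(events, n_agents, ctx_tokens, inval_tokens=1):
    """MESI-style lazy invalidation: a write sends only tiny invalidation
    notices; an agent re-fetches the full context on its next read of a
    stale copy."""
    total, stale = 0, set()
    for op, agent in events:
        if op == "write":
            total += inval_tokens * (n_agents - 1)  # notices, not context
            stale = set(range(n_agents)) - {agent}
        elif op == "read" and agent in stale:
            total += ctx_tokens                     # re-fetch on demand
            stale.discard(agent)
    return total
```

The savings grow when writes outnumber reads of any given agent's copy, which is the volatility-regime dependence the paper analyzes.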

3) Directional Routing in Transformers

  • Adds a small router that suppresses learned head-space directions; routing becomes load-bearing (router-off collapses recall/induction).
  • Reports large domain perplexity reductions (31–56%) with ~3.9% parameter overhead.
  • Provides built-in, causally manipulable “directions” as interpretability hooks.
  • Skepticism: limited seeds/scales; PPL gains didn’t translate to multiple-choice benchmark gains.
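The suppression mechanism as described can be sketched directly: subtract a gated projection onto a learned direction. This is a minimal illustration; in the paper the gate values come from a learned router, and the shapes and names here are assumptions.

```python
def directional_route(hidden, direction, gate):
    """Suppress the component of each token vector along `direction`,
    scaled by a per-token gate in [0, 1].
    gate = 1 removes the direction entirely; gate = 0 is the identity."""
    norm = sum(x * x for x in direction) ** 0.5
    d_hat = [x / norm for x in direction]
    routed = []
    for g, vec in zip(gate, hidden):
        proj = sum(v * u for v, u in zip(vec, d_hat))  # component along d
        routed.append([v - g * proj * u for v, u in zip(vec, d_hat)])
    return routed
```

Because the intervention is a single gated direction per site, "router-off" ablations (forcing gate = 0) give a direct causal probe of whether the routed pathway is load-bearing.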

4) TARo: Token-level Adaptive Routing for LLM Test-time Alignment

  • Learns per-token mixing between base and reward logits, avoiding brittle fixed interpolation in test-time alignment.
  • Reports large MATH500 gains (e.g., 32.0% → 54.4% for Llama-3.1-8B in Table 1) and weak-to-strong transfer to larger backbones.
  • Useful for deployments where retraining is costly but decoding-time steering is feasible.
  • Skepticism: depends on reward model quality/domain bias; full-logits routing can hurt throughput.
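The fixed-versus-adaptive interpolation contrast reduces to a per-step mixing coefficient. This is an illustrative sketch, not TARo's implementation: there the mixing weight is predicted by a learned router, whereas here `alpha` is supplied by the caller.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mixed_next_token_dist(base_logits, reward_logits, alpha):
    """Per-step routing: alpha in [0, 1] mixes base and reward logits.
    alpha = 0 recovers pure base decoding; a constant alpha recovers the
    fixed interpolation that token-level routing aims to replace."""
    mixed = [(1 - alpha) * b + alpha * r
             for b, r in zip(base_logits, reward_logits)]
    return softmax(mixed)
```

At decode time a learned router would predict `alpha` token by token from the current state; producing `reward_logits` over the full vocabulary is where the throughput cost noted above comes from.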

5) The Impact of Ideological Discourses in RAG: A Case Study with COVID-19 Treatments

  • Shows RAG can propagate ideological stance from retrieved texts; adding explicit LMDA descriptions generally amplifies alignment.
  • Provides a concrete methodology (LMDA + controlled retrieval + semantic/lexical similarity + ANOVA) to quantify steering.
  • “Why now”: RAG is ubiquitous in production; this highlights a governance gap beyond hallucinations.
  • Skepticism: domain-specific corpus and curated exemplar selection; effects may vary with retrieval/reranking choices.
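The similarity side of that methodology can be sketched with a crude lexical proxy: bag-of-words cosine here stands in for the paper's semantic and lexical similarity measures, and is not their implementation.

```python
import math
from collections import Counter

def cosine_bow(a, b):
    """Cosine similarity over bag-of-words counts (crude lexical proxy)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def steering_delta(answer, pole_a_exemplar, pole_b_exemplar):
    """Positive => the answer leans lexically toward pole A's exemplar."""
    return cosine_bow(answer, pole_a_exemplar) - cosine_bow(answer, pole_b_exemplar)
```

Computed across controlled retrieval conditions, deltas like this are the quantities an ANOVA would then test for significant stance steering.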

4) Practical next steps

  • For any safety-critical assistant, prototype a generate–verify–revise controller with: (i) explicit intermediate schema, (ii) deterministic constraint checks, (iii) bounded retries, (iv) refusal/handoff policy when unresolved.
  • Add a pre-retrieval answerability / premise-check gate to RAG pipelines to reduce false-premise overclaiming (explicitly flagged as a key failure mode in tiered retrieval verification).
  • Treat retrieval corpora as untrusted inputs: implement retrieval governance (source allowlists, ideology/bias detectors, chunk-level provenance) and test for stance steering under controlled retrieval poles.
  • If running multi-agent workflows, instrument token spend by sync boundary and test coherence-style invalidation vs broadcast; verify invariants (single-writer, version monotonicity, staleness bounds) before rollout.
  • When using test-time alignment, replace fixed mixing with adaptive routing (token-level α) and measure not just accuracy but throughput cost and OOD behavior.
  • For VLM robustness, expand evaluation beyond single corruptions to composed pipeline shifts (CoDA-style) and temporal drift (T-QPM-style); track early/late timestep metrics.
  • For embodied agents, report embodied-efficiency metrics (jerk, path length, completion time, action rate) alongside inference metrics before claiming “efficiency improvements.”
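The controller in the first bullet can be written as a generic skeleton. This is an illustrative sketch, not any paper's code: `generate`, `check`, and the refusal message are placeholders to be swapped for real components.

```python
def controlled_answer(generate, check, max_retries=2,
                      refusal="No verified answer; escalating to a human."):
    """Generate -> deterministic check -> revise with feedback -> refuse.
    `generate(feedback)` returns a candidate answer; `check(candidate)`
    returns a list of violated constraints (empty list = pass)."""
    feedback = None
    for _ in range(max_retries + 1):
        candidate = generate(feedback)
        violations = check(candidate)
        if not violations:
            return candidate
        feedback = f"fix these violations: {violations}"
    # Retries exhausted: refuse / hand off rather than ship unverified output.
    return refusal
```

The bounded loop plus an explicit refusal path covers items (i)–(iv): the schema lives in `candidate`, the deterministic checks in `check`, the retry budget in `max_retries`, and the handoff policy in `refusal`.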

Generated from per-paper analyses; no external browsing.