Daily AI Paper Report (2026-03-16)
Run stats
- Candidates: 407
- Selected: 30
- Deepread completed: 30
- Window (UTC): 2026-03-13T00:00:00Z → 2026-03-14T00:00:00Z (weekend_backlog_sat, expanded=0)
Selected papers:
| arXiv ID | Title / Links | Categories | Score | Why | Tags |
|---|---|---|---|---|---|
2603.12183 | Proof-Carrying Materials: Falsifiable Safety Certificates for Machine-Learned Interatomic Potentials | cond-mat.mtrl-sci, cs.AI, cs.LG, physics.comp-ph | 93 | Falsifiable safety certificates + adversarial auditing + Lean formalization; strong reliability angle. | safety-certificates, formal-verification, adversarial-testing, uncertainty, reliability, auditing |
2603.12249 | SciMDR: Benchmarking and Advancing Scientific Multimodal Document Reasoning | cs.CL, cs.AI, cs.CV | 92 | 300K scientific multimodal doc-reasoning dataset + expert eval benchmark; reusable for MLLM training/testing | multimodal, document-reasoning, dataset, benchmark, evaluation, scientific-qa, grounding |
2603.11493 | OrthoEraser: Coupled-Neuron Orthogonal Projection for Concept Erasure | cs.CV, cs.AI, cs.CY | 91 | Concept erasure for T2I via SAE disentanglement + orthogonal projection; safety-relevant, less collateral damage | text-to-image, safety, concept-erasure, sparse-autoencoders, feature-disentanglement, robustness |
2603.12145 | Automatic Generation of High-Performance RL Environments | cs.LG, cs.AI, cs.SE | 90 | Agentic workflow to auto-translate RL envs into high-perf code w/ verification; big speedups, reusable recipe | agents, RL, code-generation, verification, simulation, tooling |
2603.11935 | MobileKernelBench: Can LLMs Write Efficient Kernels for Mobile Devices? | cs.LG, cs.AI | 88 | Benchmark for whether LLMs can generate efficient mobile kernels; practical eval + tooling pipeline. | LLM-evaluation, code-generation, systems, efficiency, benchmark, mobile |
2603.11559 | AI Knows What's Wrong But Cannot Fix It: Helicoid Dynamics in Frontier LLMs Under High-Stakes Decisions | cs.AI, cs.HC | 88 | Documents a high-stakes failure mode across frontier LLMs; useful for safety evals despite case-series limits | LLM-safety, failure-modes, high-stakes, evaluation, reliability, behavioral-dynamics |
2603.09160 | RubiCap: Rubric-Guided Reinforcement Learning for Dense Image Captioning | cs.CV, cs.AI, cs.LG | 86 | RL for open-ended captioning using LLM-written rubrics as dense rewards; scalable supervision alternative | RLHF, LLM-judges, rubrics, vision-language, synthetic-data, evaluation |
2603.11650 | QChunker: Learning Question-Aware Text Chunking for Domain RAG via Multi-Agent Debate | cs.CL | 86 | Question-aware chunking via multi-agent debate; directly targets RAG failure mode (chunk quality). | RAG, chunking, multi-agent, retrieval, domain-adaptation |
2603.11414 | MaterialFigBENCH: benchmark dataset with figures for evaluating college-level materials science problem-solving abilities of multimodal large language models | cs.CL, cond-mat.mtrl-sci | 86 | Figure-centric benchmark for college materials problems; targets real multimodal reasoning failure modes | multimodal, benchmark, figures, STEM, evaluation, reasoning |
2603.11974 | Normative Common Ground Replication (NormCoRe): Replication-by-Translation for Studying Norms in Multi-agent AI | cs.AI | 86 | Framework to study norm emergence/coordination in multi-agent AI via translated human experiments | multi-agent, norms, governance, evaluation, social-dynamics |
2603.09643 | MM-tau-p$^2$: Persona-Adaptive Prompting for Robust Multi-Modal Agent Evaluation in Dual-Control Settings | cs.ET, cs.AI | 84 | New multimodal agent benchmark with persona/dual-control robustness; relevant to real deployments. | agent-evaluation, multimodal-agents, robustness, persona, benchmark, TTS |
2603.11811 | RADAR: Closed-Loop Robotic Data Generation via Semantic Planning and Autonomous Causal Environment Reset | cs.RO, cs.AI, cs.CV | 84 | Fully autonomous closed-loop robot data generation using VLM planning + causal resets; reduces human bottleneck | robotics, agents, VLM, data-generation, autonomy, embodied-ai |
2603.09214 | PrivPRISM: Automatically Detecting Discrepancies Between Google Play Data Safety Declarations and Developer Privacy Policies | cs.AI | 84 | LLM-based detection of privacy disclosure inconsistencies at scale; concrete real-world compliance impact. | privacy, policy-analysis, LLMs, compliance, auditing |
2603.09151 | Deep Tabular Research via Continual Experience-Driven Execution | cs.AI | 84 | Agentic framework for long-horizon table reasoning with closed-loop execution; relevant to tool-using agents | agents, tabular-reasoning, long-horizon, planning, tool-use, information-extraction |
2603.09481 | GenePlan: Evolving Better Generalized PDDL Plans using Large Language Models | cs.AI | 84 | LLM+evolution generates interpretable generalized PDDL planners; strong benchmark performance vs baselines | LLM-agents, planning, PDDL, program-synthesis, generalization |
2603.11653 | Simple Recipe Works: Vision-Language-Action Models are Natural Continual Learners with Reinforcement Learning | cs.LG, cs.RO | 83 | Finds simple sequential FT+LoRA avoids forgetting in VLA continual RL; impactful for agent training. | embodied-agents, VLA, continual-learning, reinforcement-learning, LoRA, post-training |
2603.09938 | Model Merging in the Era of Large Language Models: Methods, Applications, and Future Directions | cs.CL | 82 | Comprehensive LLM model merging survey + taxonomy; useful for capability composition and governance. | model-merging, LLMs, survey, taxonomy, fine-tuning, deployment |
2603.11554 | MANSION: Multi-floor lANguage-to-3D Scene generatIOn for loNg-horizon tasks | cs.CV, cs.AI, cs.RO | 82 | Generates multi-floor building-scale 3D scenes + 1k-building dataset for long-horizon embodied tasks | embodied-ai, benchmarks, 3D-scene-generation, long-horizon, robotics |
2603.09827 | MA-EgoQA: Question Answering over Egocentric Videos from Multiple Embodied Agents | cs.CV, cs.AI | 82 | Multi-agent egocentric video QA + memory aggregation benchmark; relevant to embodied multi-agent systems. | embodied-agents, multi-agent, video-QA, long-horizon, benchmark |
2603.11515 | Multi-Agent Collaboration for Automated Design Exploration on High Performance Computing Systems | cs.AI | 82 | LLM multi-agent framework that runs HPC workflows; concrete agentic deployment pattern with real tooling | agents, multi-agent, tool-use, HPC, workflow-automation, scientific-discovery |
2603.11679 | LLMs can construct powerful representations and streamline sample-efficient supervised learning | cs.AI | 80 | Agentic rubric-based representation construction for sample-efficient supervised learning across tasks. | agentic-pipelines, representations, sample-efficiency, LLMs, automation, supervised-learning |
2603.09881 | Do What I Say: A Spoken Prompt Dataset for Instruction-Following | cs.CL | 80 | Multilingual spoken-prompt dataset to evaluate speech LLM instruction following; shows modality gaps | speech-LLMs, instruction-following, benchmark, robustness, multilingual |
2603.09774 | World2Mind: Cognition Toolkit for Allocentric Spatial Reasoning in Foundation Models | cs.AI | 80 | Training-free toolkit builds 3D cognitive maps to boost foundation models' allocentric spatial reasoning | agents, spatial-reasoning, tool-augmented, 3D-mapping, multimodal |
2603.09400 | Reward Prediction with Factorized World States | cs.CL | 79 | Factorized world-state representations via LMs for reward prediction; could improve goal generalization. | agents, reward-modeling, world-models, state-representation, generalization |
2603.11924 | Chem4DLLM: 4D Multimodal LLMs for Chemical Dynamics Understanding | cs.LG, cs.CL | 79 | New task+benchmark for 4D trajectory-to-language chemical dynamics reasoning; enables eval of temporal MLLMs | benchmark, multimodal, scientific-LLMs, temporal-reasoning, chemistry |
2603.09716 | AutoAgent: Evolving Cognition and Elastic Memory Orchestration for Adaptive Agents | cs.AI | 78 | Multi-agent framework with evolving cognition + elastic memory; relevant but claims need scrutiny. | agents, memory, orchestration, multi-agent, tool-use, frameworks |
2603.11798 | DocSage: An Information Structuring Agent for Multi-Doc Multi-Entity Question Answering | cs.AI | 78 | Schema-aware agent for multi-doc multi-entity QA; targets evidence chains beyond vector/graph RAG limits | RAG, agents, information-extraction, multi-document-QA, reasoning |
2603.11395 | ARROW: Augmented Replay for RObust World models | cs.LG, cs.AI | 78 | Model-based continual RL (DreamerV3) with memory-efficient replay; tackles catastrophic forgetting. | continual-RL, world-models, Dreamer, replay, robustness |
2603.09043 | Time, Identity and Consciousness in Language Model Agents | cs.AI | 78 | Proposes instrumented metrics for identity persistence in LM agents; evaluation angle for agentic behavior | agent-evaluation, identity, instrumentation, behavioral-metrics, scaffolding |
2603.11578 | Streaming Translation and Transcription Through Speech-to-Text Causal Alignment | cs.CL | 78 | End-to-end streaming speech translation with WAIT-token policy; strong systems contribution and latency tradeoffs | speech-to-text, simultaneous-translation, streaming, sequence-modeling, latency, training-tricks |
AI Paper Insight Brief
2026-03-16
0) Executive takeaways (read this first)
- “Self-report / recall” evaluations for agents can be structurally misleading: identity ingredients may appear within a window without ever co-instantiating at a decision step, so “stable identity” can fail to constrain actions even when tests pass (Time, Identity and Consciousness in Language Model Agents).
- Structured, executable intermediates are winning across domains: object–attribute world states for reward estimation, relational schemas + SQL for multi-doc QA, meta-graphs + operator execution for messy tables, and AST allocentric maps for spatial reasoning all show sizable gains vs flat-text/RAG baselines.
- LLM-as-judge is increasingly the bottleneck: multimodal agent benchmarks show judge inconsistency and safety-label noise; rubric-guided RL and policy/label pipelines depend heavily on judge calibration and can be gamed or drift (MM-tau-p², RubiCap).
- Agentic “plan–execute–verify–repair” loops are moving from demos to measurable engineering wins: mobile kernel generation jumps from low compile/correctness to high CSR/FCR with multi-agent iteration and hardware-in-loop evaluation (MobileKernelBench); similar closed loops appear in tabular research and RL environment translation.
- Continual learning is splitting into two practical recipes: (a) world-model replay with distribution-matching buffers reduces forgetting in continual RL (ARROW); (b) for large pretrained VLAs, simple sequential LoRA + on-policy RL can yield near-zero forgetting across benchmarks (Simple Recipe Works).
- Safety/compliance work is becoming pipeline-verified and auditable: large-scale PP↔DS discrepancy detection triangulated with APK evidence (PrivPRISM) and falsifiable “proof-carrying” certificates for ML interatomic potentials with adversarial search + Lean proofs (Proof-Carrying Materials).
2) Key themes (clusters)
Theme: Temporal grounding & “identity actually constrains action”
- Why it matters: Agent evaluations that check whether traits/memories appear somewhere in context can overestimate stability and safety; what matters is whether the full grounded conjunction is present at the decision step.
- Representative papers: Time, Identity and Consciousness in Language Model Agents
- Common approach:
  - Formalize agent scaffolds / interaction steps and define operational metrics over traces.
  - Emphasize regimes where meta-recognition or recall exists but behavior does not reliably change.
- Open questions / failure modes:
  - How to instrument real agent stacks to measure co-instantiation (vs ingredient occurrence) at action time.
  - Whether interventions that improve “recognition” (reflection, self-critique) can reliably improve behavior in unverifiable/high-stakes settings.
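The occurrence-vs-co-instantiation distinction above can be made concrete with a small trace-level sketch. This is a simplified stand-in, not the paper's exact Pweak/Pstrong definitions: here weak persistence asks whether each identity ingredient appears *somewhere* in the window, while strong persistence asks how often *all* ingredients are simultaneously active at a decision step.

```python
# Hypothetical sketch: a "trace" is a list of decision steps; each step
# records the set of identity ingredients (traits/memories) active in
# context at that step.

def weak_persistence(trace, ingredients):
    """Fraction of ingredients that appear at *some* step in the window."""
    seen = set().union(*trace) if trace else set()
    return len(seen & set(ingredients)) / len(ingredients)

def strong_persistence(trace, ingredients):
    """Fraction of steps at which *all* ingredients are co-instantiated."""
    required = set(ingredients)
    if not trace:
        return 0.0
    return sum(required <= set(step) for step in trace) / len(trace)

# A trace where every ingredient occurs somewhere, but never all together:
trace = [{"name", "goal"}, {"value"}, {"name", "value"}]
ingredients = ["name", "goal", "value"]
print(weak_persistence(trace, ingredients))    # 1.0 — every test "passes"
print(strong_persistence(trace, ingredients))  # 0.0 — never constrains a step
```

The gap between the two numbers is exactly the failure mode the paper formalizes: recall-style evaluations measure the first quantity while safety arguments need the second.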
Theme: Structured state & reward as a backbone for planning
- Why it matters: Zero-shot/generalizable planning needs dense progress signals; flat text similarity or judge-based rewards often misalign with step-wise task progress.
- Representative papers: Reward Prediction with Factorized World States; ARROW
- Common approach:
  - Build factorized latent/state representations (object–attribute beliefs; RSSM world models).
  - Use replay/imagined rollouts to train policies while controlling forgetting.
- Open questions / failure modes:
  - Sensitivity to embedding/LLM choice for state extraction and similarity geometry (StateFactory).
  - Task-order and reward-scale sensitivity in continual settings; fixed buffer splits and scaling issues (ARROW).
Theme: RAG is becoming “structure-first” (schemas, chunks, operators)
- Why it matters: Retrieval quality is increasingly limited by upstream structuring (chunking, schema discovery, table meta-structure), not just embedding models.
- Representative papers: QChunker; DocSage; Deep Tabular Research via Continual Experience-Driven Execution
- Common approach:
  - Convert unstructured corpora into query-aware structured artifacts (relational tables + SQL; question-aware chunks; operation graphs).
  - Use execution feedback (SQL execution, operator execution) and iterative correction/memory.
- Open questions / failure modes:
  - Cost/latency of multi-stage pipelines and robustness under noisy/contradictory documents.
  - Whether chunk “completion” can introduce subtle leakage or overfitting to document phrasing (needs careful constraints; QChunker claims completion uses only explicit doc info).
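The structure-first recipe can be illustrated end to end in miniature. This is a toy sketch, not DocSage's pipeline: the schema, records, and query are hand-written stand-ins, where in the actual systems an LLM proposes the query-specific schema and fills it from documents under constraint checks. The point is that multi-entity aggregation becomes exact SQL execution with row-level provenance instead of generation over flat retrieved text.

```python
import sqlite3

# Toy "structure-first" QA: extract records into a relational schema,
# then answer by executing SQL rather than by free-form generation.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE papers (title TEXT, year INTEGER, citations INTEGER)")

# In the real pipeline these rows come from constraint-checked LLM
# extraction over many documents; here they are hard-coded.
records = [("A", 2024, 120), ("B", 2025, 40), ("C", 2025, 310)]
conn.executemany("INSERT INTO papers VALUES (?, ?, ?)", records)

# A multi-entity question ("which 2025 paper is most cited?") compiles to
# SQL, so comparison and aggregation are exact and auditable.
row = conn.execute(
    "SELECT title FROM papers WHERE year = 2025 ORDER BY citations DESC LIMIT 1"
).fetchone()
print(row[0])  # C
```

Execution feedback closes the loop: a malformed schema or an extraction that violates a constraint fails loudly at SQL time, which is the signal the agent uses for iterative correction.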
Theme: Multimodal evaluation realism (voice, figures, long papers)
- Why it matters: Text-only evaluation overstates capability; real deployments involve speech, figures, and long multimodal documents where attention dilution and pipeline noise dominate.
- Representative papers: Do What I Say; MaterialFigBENCH; SciMDR; MM-tau-p²
- Common approach:
  - Build benchmarks that force grounding in modality-specific evidence (spoken prompts; figures; full-paper contexts).
  - Add explicit localization / rubric-based judging and analyze modality gaps.
- Open questions / failure modes:
  - Judge inconsistency and correlated label noise (MM-tau-p²).
  - Memorization shortcuts where models answer without using figures (MaterialFigBENCH).
  - Large oracle→full-context drops indicating unresolved long-context multimodal retrieval (SciMDR).
Theme: Agentic engineering loops with hard verifiers (compile/run/measure)
- Why it matters: When verifiers exist (compilers, unit tests, on-device benchmarks), multi-agent iterative repair can turn LLMs into practical automation tools.
- Representative papers: MobileKernelBench; Automatic Generation of High-Performance RL Environments; GenePlan
- Common approach:
  - Iterative generate→validate→repair loops with increasingly strong verification (compile + functional tests + performance; L1–L4 verification; plan validators).
  - Search/optimization over programs (evolutionary selection; multi-agent roles).
- Open questions / failure modes:
  - Generalization across frameworks/devices/backends (MobileKernelBench currently MNN CPU on one SoC).
  - Empirical verification coverage vs rare-path bugs (RL env translation uses finite rollout tests).
  - Domains without compact general strategies remain hard (GenePlan on Sokoban).
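The generate→validate→repair skeleton shared by these systems is simple to state. In this minimal sketch, `propose` is a canned placeholder standing in for an LLM call (its two-attempt behavior is invented so the example runs); the load-bearing part is `verify`, which actually executes the candidate so repair is driven by hard feedback rather than self-report.

```python
import subprocess
import sys
import tempfile

def verify(src):
    """Run the candidate program and return (ok, feedback)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(src)
        path = f.name
    proc = subprocess.run([sys.executable, path], capture_output=True, text=True)
    return proc.returncode == 0, proc.stderr

def propose(task, feedback):
    # Placeholder for an LLM coder/debugger: the first attempt has a bug
    # (undefined name), the repair attempt fixes it given the traceback.
    return "print(undefined_name)" if feedback is None else "print('ok')"

def solve(task, max_iters=4):
    feedback = None
    for _ in range(max_iters):
        candidate = propose(task, feedback)
        ok, feedback = verify(candidate)
        if ok:
            return candidate
    return None  # verifier never passed within budget

print(solve("emit ok") is not None)  # True
```

Real systems layer the verifier (compile → unit tests → on-device performance) and split `propose` into coder/debugger/optimizer roles, but the loop shape is the same.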
Theme: Auditable safety/compliance pipelines (privacy + concept erasure + formal certs)
- Why it matters: Deployment needs inspectable artifacts (matrices, proofs, structured discrepancies) rather than opaque “model says it’s safe”.
- Representative papers: PrivPRISM; OrthoEraser; Proof-Carrying Materials
- Common approach:
  - Combine LLM extraction with verifier models / constraints (self-supervised verifiers; logical envelopes; geometric null-space projection).
  - Triangulate across sources (policy text vs DS labels vs APK evidence; adversarial search + DFT recomputation; safety vs fidelity metrics).
- Open questions / failure modes:
  - Static analysis misses runtime behavior (PrivPRISM).
  - SAE quality and limited null-space for mass erasure (OrthoEraser).
  - Proofs certify reasoning under axioms, not physics; compositional probes may differ from real structures (PCM).
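The geometric idea behind null-space-projected erasure can be shown in a few lines. This is an illustrative sketch, not OrthoEraser's implementation: the dimensions and direction vectors are random stand-ins, where the real method operates on SAE-disentangled features of a diffusion model. Given directions that encode concepts to *preserve*, any erasure update is projected onto their orthogonal complement, so the preserved directions' responses are untouched by construction.

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.standard_normal((512, 8))      # 8 preserved-concept directions (toy)
Q, _ = np.linalg.qr(D)                 # orthonormal basis for span(D)
P_null = np.eye(512) - Q @ Q.T         # projector onto the orthogonal complement

u = rng.standard_normal(512)           # raw "erase this concept" update (toy)
u_safe = P_null @ u                    # update restricted to the null space

# Preserved directions see (numerically) zero change from the projected update:
print(bool(np.abs(D.T @ u_safe).max() < 1e-8))  # True
```

The "limited null-space for mass erasure" caveat above is visible here: each preserved direction removes a dimension from the space edits can live in, so erasing many concepts while preserving many others eventually leaves too little room for an effective update.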
3) Technical synthesis
- Many papers converge on “intermediate representations as contracts”: AST (allocentric spatial tree), object–attribute states, relational schemas/SQL tables, meta-graphs for tables, and rubrics for captions all serve as checkable interfaces between perception/retrieval and generation.
- Closed-loop execution feedback is the dominant robustness lever: operator execution traces (DTR), compile/test/profile (MobileKernelBench), hierarchical verification (RL env translation), SQL execution + constraint checking (DocSage), and VQA boolean success checks + reset FSM (RADAR).
- Evaluation is shifting from single scalar scores to multi-metric dashboards (MM-tau-p²’s 12 metrics; identity weak/strong persistence; compliance matrices; EPIC distance for reward prediction), reflecting that “pass/fail” hides failure modes.
- Judge dependence is a recurring fragility: rubric RL uses an LLM judge; MM-tau-p² shows judge inconsistency; SciMDR uses LLMs for synthesis and evaluation; these pipelines need calibration/robustness checks akin to software testing.
- Continual learning results suggest architecture matters more than “CL tricks” in some regimes: world-model replay buffers (ARROW) vs large-pretrained + LoRA + on-policy RL (VLA continual RL) show different paths to stability.
- RAG-specific insight: upstream chunking/schema/structure can dominate downstream QA quality; QChunker’s ChunkScore correlates strongly with ROUGE-L (λ≈0.3), and DocSage’s structured extraction is the most critical ablated component.
- Safety/robustness is becoming geometry- and logic-aware: OrthoEraser uses null-space projection to avoid collateral damage; DocSage uses cross-record constraints; PCM uses bootstrap envelopes + Lean proofs.
- Modality realism exposes hidden gaps: spoken prompts degrade text-output tasks (DOWIS), persona conditioning can degrade safety recall (MM-tau-p²), and full-paper contexts sharply reduce performance vs oracle contexts (SciMDR).
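The distribution-matching replay idea recurring in the continual-learning points above can be sketched as a balanced buffer. This is a generic illustration in the spirit of the ARROW result, not its actual buffer: total memory is capped, but per-task sub-buffers are kept balanced (reservoir sampling within each task), so replayed training data approximates a uniform mix over tasks seen so far instead of being dominated by the most recent one.

```python
import random

class BalancedReplay:
    """Fixed-capacity replay split evenly across tasks seen so far."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buffers = {}   # task id -> stored transitions
        self.counts = {}    # task id -> transitions ever observed

    def add(self, task, transition):
        buf = self.buffers.setdefault(task, [])
        self.counts[task] = self.counts.get(task, 0) + 1
        per_task = self.capacity // max(len(self.buffers), 1)
        if len(buf) < per_task:
            buf.append(transition)
        else:
            # Reservoir sampling: keep a uniform subsample of this task's stream.
            j = random.randrange(self.counts[task])
            if j < per_task:
                buf[j] = transition
        # A new task shrinks every task's quota; trim older buffers to match.
        for b in self.buffers.values():
            del b[per_task:]

    def sample(self, n):
        pool = [t for buf in self.buffers.values() for t in buf]
        return random.sample(pool, min(n, len(pool)))

rb = BalancedReplay(capacity=100)
for i in range(500):
    rb.add("task_A", ("a", i))
for i in range(500):
    rb.add("task_B", ("b", i))
print({k: len(v) for k, v in rb.buffers.items()})  # {'task_A': 50, 'task_B': 50}
```

Under sequential training this keeps older tasks represented at replay time, which is the mechanism the surveyed work credits for reduced forgetting; the trimming policy here (drop the tail) is a simplification of more careful quota rebalancing.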
4) Top 5 papers (with “why now”)
1) Time, Identity and Consciousness in Language Model Agents
- Formalizes a concrete eval failure: within-window “occurrence” doesn’t imply decision-time co-instantiation (Theorem 3.10), so recall/self-report can be false reassurance.
- Provides instrumentable metrics (weak/strong persistence) and an algorithm to compute them from traces—actionable for agent stack logging.
- Architectural implications: RAG can raise weak persistence while not improving (or reducing) strong persistence; concurrency capacity bounds co-instantiation.
- Skepticism: theoretical/methodological; no empirical measurements reported.
2) DocSage: An Information Structuring Agent for Multi-Doc Multi-Entity Question Answering
- Big empirical jump on MDMEQA: 0.892 vs 0.620 for GPT-4o+RAG on MEBench (+27.2pp).
- Shows a practical recipe: interactive schema discovery + constraint-checked extraction + SQL reasoning with provenance.
- Ablations identify what matters most (structured extraction).
- Skepticism: multi-stage pipeline cost and dependence on foundation model quality; may degrade on very noisy/contradictory corpora.
3) MobileKernelBench: Can LLMs Write Efficient Kernels for Mobile Devices?
- Demonstrates that “LLM writes kernels” is mostly blocked by compilation/API hallucinations—until you add repository-aware multi-agent iteration + device-in-loop verification.
- MoKA reaches CSR 93.7% and FCR 75.3%, with 27.4% kernels >1.5× faster than native MNN; includes a 6.82× LayerNorm2D case study.
- Provides a benchmark + automated pipeline (registration→compile→verify→on-device perf).
- Skepticism: evaluated on one engine (MNN CPU) and one device/SoC; broader backend generality untested.
4) PrivPRISM: Automatically Detecting Discrepancies Between Google Play Data Safety Declarations and Developer Privacy Policies
- Large-scale measurement: ~53% PP–DS discrepancies in 7,770 popular games; ~61% in 1,711 non-game apps.
- Encoder–decoder + self-supervised verifiers is a pragmatic pattern for reducing LLM hallucinations while keeping interpretability.
- Triangulates with APK static analysis and manual audits (e.g., policy URL redirection issues).
- Skepticism: static analysis can miss runtime behavior; some discrepancies may be interpretive/ambiguous.
5) Proof-Carrying Materials: Falsifiable Safety Certificates for Machine-Learned Interatomic Potentials
- Quantifies a severe deployment failure: MLIP screening recall 0.07 on 25k materials (misses 93% of DFT-stable).
- PCM pipeline combines adversarial search, bootstrap safety envelopes, and Lean 4 machine-checked proofs; validates failures with independent DFT recomputation (20/20, median force ratio ~12×).
- Adds a prospective risk model (AUC-ROC 0.938 ± 0.004) and a case study improving thermoelectric yield (+62 stable materials at 20% DFT budget).
- Skepticism: proofs depend on axioms; DFT is ground truth (not experiment); compositional probes introduce approximation gaps.
5) Practical next steps
- Agent evaluation: add trace-level logging of “identity ingredient activations” and compute both Pweak and Pstrong; treat recall/self-report as weak evidence unless co-instantiation at action time is shown.
- RAG systems: pilot structure-first pipelines—(a) question-aware chunking + completion (QChunker-style), or (b) query-specific schema + constraint-checked extraction + SQL (DocSage-style)—and measure gains vs embedding-only RAG.
- Judge reliability: for any LLM-as-judge metric, run multi-judge / multi-seed consistency checks and explicitly track disagreement rates (MM-tau-p² shows correlated label noise on escalation cases).
- Closed-loop agents: where verifiers exist (compile/tests/execution), invest in iterative repair loops with role separation (coder/debugger/optimizer) and hardware-in-loop measurement (MobileKernelBench pattern).
- Continual learning: if you’re doing continual RL, compare (i) world-model replay with distribution-matching buffers (ARROW) vs (ii) SeqFT + LoRA + on-policy RL (for large pretrained VLAs) under the same task-order perturbations.
- Multimodal realism: add spoken-prompt evaluation (DOWIS-style) and full-document multimodal QA with explicit localization (SciMDR-style) to avoid overestimating capability from text-only tests.
- Safety/compliance auditing: adopt “auditable artifacts” (compliance matrices, discrepancy reports, safety envelopes) and triangulate across sources (policy text + declarations + code evidence; adversarial discovery + independent recomputation).
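The judge-reliability step above is cheap to operationalize. A minimal sketch, with toy labels (in practice each row is one judge or seed run over your eval set): report the mean pairwise disagreement rate across judges and flag the items where they split, rather than trusting a single judge's pass/fail.

```python
from itertools import combinations

def disagreement(labels_by_judge):
    """labels_by_judge: equal-length label lists, one per judge/seed.

    Returns (mean pairwise disagreement rate, indices where judges split).
    """
    n_items = len(labels_by_judge[0])
    pair_rates = [
        sum(x != y for x, y in zip(a, b)) / n_items
        for a, b in combinations(labels_by_judge, 2)
    ]
    flagged = [
        i for i in range(n_items)
        if len({judge[i] for judge in labels_by_judge}) > 1
    ]
    return sum(pair_rates) / len(pair_rates), flagged

# Toy example: three judges over four eval items.
judges = [
    ["pass", "fail", "pass", "pass"],
    ["pass", "fail", "fail", "pass"],
    ["pass", "pass", "fail", "pass"],
]
rate, flagged = disagreement(judges)
print(round(rate, 3), flagged)  # 0.333 [1, 2]
```

Flagged items are the ones to audit by hand; a high disagreement rate concentrated on a subset (e.g., escalation cases, as in MM-tau-p²) is a sign of correlated label noise rather than random judge error.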
Generated from per-paper analyses; no external browsing.
