Daily AI Paper Report (2026-03-15)
Run stats
- Candidates: 437
- Selected: 30
- Deepread completed: 30
- Window (UTC): 2026-03-13T00:00:00Z → 2026-03-14T00:00:00Z (weekend_backlog_sat, expanded=0)
Selected papers
| arXiv ID | Title / Links | Categories | Score | Why | Tags |
|---|---|---|---|---|---|
| 2603.12023 | Cascade: Composing Software-Hardware Attack Gadgets for Adversarial Threat Amplification in Compound AI Systems | cs.CR, cs.AI | 93 | Shows how classic CVEs + hardware side-channels can amplify attacks in compound LLM toolchains | agent-security, compound-systems, CVE, side-channels, threat-modeling, tool-use |
| 2603.12094 | Human-Centred LLM Privacy Audits: Findings and Frictions | cs.HC, cs.AI, cs.CL, cs.CY | 93 | Human-centered privacy auditing tool + large user studies; measures name-conditioned personal inference risks. | privacy, LLM auditing, PII, user study, measurement, deployment |
| 2603.11768 | Governing Evolving Memory in LLM Agents: Risks, Mechanisms, and the Stability and Safety Governed Memory (SSGM) Framework | cs.AI | 90 | Targets long-term agent memory risks (drift, corruption, privacy) with a governance framework (SSGM) | agents, memory, governance, privacy, safety, robustness |
| 2603.09641 | PRECEPT: Planning Resilience via Experience, Context Engineering & Probing Trajectories A Unified Framework for Test-Time Adaptation with Compositional Rule Learning and Pareto-Guided Prompt Evolution | cs.AI, cs.IR | 90 | Agent memory safety: exact rule retrieval + conflict-aware reliability + invalidation; tackles stale/adversarial knowledge. | agents, memory, robustness, test-time adaptation, prompt evolution, knowledge integrity |
| 2603.09692 | ActiveUltraFeedback: Efficient Preference Data Generation using Active Learning | cs.LG, cs.AI, cs.CL | 90 | Active learning cuts RLHF preference-label cost; new pair-selection methods with strong empirical gains. | RLHF, preference-data, active-learning, uncertainty, alignment, data-efficiency |
| 2603.09803 | Good Reasoning Makes Good Demonstrations: Implicit Reasoning Quality Supervision via In-Context Reinforcement Learning | cs.LG | 89 | RLVR variant that upweights high-quality reasoning traces via in-context utility signal (Evidence Gain). | reasoning, RLVR, post-training, trace-quality, alignment, reward-weighting |
| 2603.09821 | One-Eval: An Agentic System for Automated and Traceable LLM Evaluation | cs.CL | 88 | Agentic, traceable evaluation workflows from NL requests; improves reproducibility and auditability | evaluation, agentic-eval, reproducibility, benchmarking, tooling, auditing |
| 2603.12056 | XSkill: Continual Learning from Experience and Skills in Multimodal Agents | cs.AI, cs.CL | 88 | Continual improvement for multimodal agents via experience+skill memory without finetuning. | agents, multimodal, continual-learning, tool-use, memory, retrieval |
| 2603.09403 | LLM as a Meta-Judge: Synthetic Data for NLP Evaluation Metric Validation | cs.CL | 86 | Scalable metric validation via LLM-generated synthetic degradations; high meta-correlation to human rankings. | evaluation, metrics, synthetic data, LLMs, multilingual, benchmarking |
| 2603.09044 | Synergistic Directed Execution and LLM-Driven Analysis for Zero-Day AI-Generated Malware Detection | cs.CR, cs.SE | 86 | Targets LLM-enabled zero-day malware; hybrid concolic+LLM analysis with formal guarantees. | security, malware, LLM-misuse, program-analysis, concolic-execution, robust-detection |
| 2603.11687 | SemBench: A Universal Semantic Framework for LLM Evaluation | cs.CL, cs.AI | 86 | Auto-generates semantic eval benchmarks from dictionaries; scalable, multilingual LLM understanding tests | LLM evaluation, semantics, benchmark generation, multilingual, WiC |
| 2603.11413 | Evaluation format, not model capability, drives triage failure in the assessment of consumer health AI | cs.HC, cs.AI | 86 | Safety-relevant: shows triage failures depend heavily on eval format; naturalistic protocols change results. | evaluation, healthcare, safety, triage, protocol-design, human-factors |
| 2603.09154 | Bioalignment: Measuring and Improving LLM Disposition Toward Biological Systems for AI Safety | cs.CL | 84 | Safety-relevant bias eval + tuning: measures LLM disposition toward bio vs synthetic solutions | alignment, bias, eval, biosecurity, preference-tuning, safety-metrics |
| 2603.11864 | Social, Legal, Ethical, Empathetic and Cultural Norm Operationalisation for AI Agents | cs.AI, cs.SE | 84 | Process to operationalize social/legal/ethical norms into verifiable agent requirements; surveys tools and gaps. | AI governance, agent norms, requirements, verification, ethics, safety engineering |
| 2603.11915 | CoMMET: To What Extent Can LLMs Perform Theory of Mind Tasks? | cs.CL | 84 | New multimodal, multi-turn Theory-of-Mind benchmark; useful for social reasoning evals. | evaluation, benchmark, theory-of-mind, multimodal, multi-turn, llm-evals |
| 2603.11665 | Multi-Task Reinforcement Learning for Enhanced Multimodal LLM-as-a-Judge | cs.CL | 84 | Multi-task RL to improve MLLM-as-judge consistency and human correlation; relevant to eval reliability | LLM-as-judge, evaluation, reinforcement learning, multimodal, preference modeling |
| 2603.09835 | Chow-Liu Ordering for Long-Context Reasoning in Chain-of-Agents | cs.CL | 84 | Improves long-context agent pipelines by optimizing chunk order to reduce bounded-memory information loss. | long-context, agents, chain-of-agents, memory-bottleneck, ordering, reasoning |
| 2603.11799 | Exponential-Family Membership Inference: From LiRA and RMIA to BaVarIA | cs.LG, cs.CR | 83 | Unifies LiRA/RMIA/BASE MIAs; practical for privacy auditing of ML/LLMs via a single framework | privacy, membership-inference, auditing, security, theory, evaluation |
| 2603.09052 | From Days to Minutes: An Autonomous AI Agent Achieves Reliable Clinical Triage in Remote Patient Monitoring | cs.AI, cs.CL, cs.LG | 83 | Autonomous clinical triage agent w/ tools and clinician validation; real-world agent reliability. | agents, tool-use, evaluation, healthcare, reliability, MCP |
| 2603.12142 | Understanding Disclosure Risk in Differential Privacy with Applications to Noise Calibration and Auditing (Extended Version) | cs.CR, cs.IT | 82 | New DP disclosure-risk metric spanning membership/attribute/reconstruction; helps calibration & audits | differential-privacy, auditing, noise-calibration, inference-attacks, privacy-metrics |
| 2603.09192 | Explainable Innovation Engine: Dual-Tree Agent-RAG with Methods-as-Nodes and Verifiable Write-Back | cs.AI | 82 | Agent-RAG with auditable method provenance + verifier write-back; improves controllability and traceability. | RAG, agents, provenance, auditing, verification, knowledge base |
| 2603.11838 | DatedGPT: Preventing Lookahead Bias in Large Language Models with Time-Aware Pretraining | cs.CL, q-fin.GN | 82 | Time-aware pretraining to prevent lookahead bias; strong methodology for leakage control. | data-contamination, temporal-generalization, evaluation, pretraining, finance |
| 2603.09344 | Robust Regularized Policy Iteration under Transition Uncertainty | cs.AI, stat.ML | 82 | Robust offline RL via worst-case transition uncertainty; targets distribution shift failures. | offline-rl, robust-rl, distribution-shift, uncertainty, policy-iteration, safety |
| 2603.11545 | One Supervisor, Many Modalities: Adaptive Tool Orchestration for Autonomous Queries | cs.CL, cs.AI, cs.LG | 82 | Agentic tool orchestration across modalities with learned routing; strong efficiency metrics on 2,847 queries. | agents, tool-use, orchestration, routing, multimodal, systems |
| 2603.09454 | ShapeMark: Robust and Diversity-Preserving Watermarking for Diffusion Models | cs.CR | 80 | Diffusion watermarking that preserves diversity while improving robustness; provenance angle. | watermarking, diffusion-models, provenance, robustness, content-authenticity, security |
| 2603.12089 | EmbTracker: Traceable Black-box Watermarking for Federated Language Models | cs.CR | 79 | Client-traceable black-box watermarking for federated LMs; addresses model leakage accountability | watermarking, federated-learning, model-leakage, accountability, backdoors, security |
| 2603.09909 | MedMASLab: A Unified Orchestration Framework for Benchmarking Multimodal Medical Multi-Agent Systems | cs.AI | 79 | Unified benchmark/orchestration for multimodal medical multi-agent systems; standardizes eval. | multi-agent, benchmark, multimodal, evaluation, healthcare, orchestration |
| 2603.11721 | When OpenClaw Meets Hospital: Toward an Agentic Operating System for Dynamic Clinical Workflows | cs.AI | 78 | Proposes restricted-execution, doc-centric agent OS for hospitals; focuses on reliability/security needs | agents, healthcare, sandboxing, permissions, deployment, reliability |
| 2603.09152 | DataFactory: Collaborative Multi-Agent Framework for Advanced Table Question Answering | cs.AI, cs.DB, cs.IR | 78 | Multi-agent TableQA targeting context limits and hallucinations via coordinated DB/KG teams (claims; need details). | multi-agent, TableQA, hallucinations, tool use, structured data, ReAct |
| 2603.11689 | Explicit Logic Channel for Validation and Enhancement of MLLMs on Zero-Shot Tasks | cs.AI | 78 | Adds explicit logic/probabilistic reasoning channel to validate/enhance black-box MLLMs. | multimodal-llm, reasoning, verification, probabilistic-inference, zero-shot, reliability |
AI Paper Insight Brief
2026-03-15
1) Executive takeaways (read this first)
- “Governance + structure” is emerging as the antidote to brittle agent memory and RAG: multiple papers replace flat vector retrieval with structured, auditable memory (manifests, provenance trees, exact-match keys) and add explicit gates/verification to prevent drift, poisoning, and compounding errors.
- Evaluation is increasingly treated as a first-class system problem (not a metric): agentic evaluation planners, semantic judges, and synthetic validation protocols aim to make evaluations traceable, format-robust, and cheaper—and one health triage replication shows format alone can manufacture “failures.”
- Security threat models are widening from “LLM jailbreaks” to cross-stack reality: work on compound AI pipelines shows composed software/hardware gadgets (e.g., bit flips) can bypass guardrails; in parallel, malware defense is moving toward LLM-guided concolic exploration with formal guarantees.
- Robustness is being operationalized via uncertainty and worst-case optimization: offline RL robustifies against transition uncertainty with a tractable KL-regularized robust Bellman operator; preference-data generation uses uncertainty to actively select informative comparisons.
- Provenance and IP protection for generative models is maturing: diffusion watermarking shifts from fragile value-encoding to structural permutation encoding with diversity preservation; federated LMs get client-traceable black-box watermarking via embedding-space triggers.
2) Key themes (clusters)
Theme: Governed long-term memory for agents (drift/poisoning/auditability)
- Why it matters: As agents self-modify memory, failures persist and compound (drift, poisoning, privacy leakage). Governance layers and auditable substrates aim to make long-horizon behavior safer and debuggable.
- Representative papers:
- Governing Evolving Memory in LLM Agents (SSGM)
- PRECEPT: Planning Resilience via Experience, Context Engineering & Probing Trajectories
- When OpenClaw Meets Hospital: Toward an Agentic Operating System for Dynamic Clinical Workflows
- Common approach:
- Insert middleware/gates between cognition and memory (read filtering, write validation, reconciliation).
- Prefer structured memory (document trees/manifests, exact-match condition keys) over pure semantic similarity.
- Add provenance/immutability (append-only ledgers, audit trails) to enable rollback and accountability.
- Open questions / failure modes:
- Latency and scalability costs of governance checks; stability–plasticity trade-offs (over-blocking legitimate updates).
- Concurrency/conflict resolution for multi-writer memory (explicitly noted for hospital AOS-style designs).
- How to empirically validate drift/poisoning resistance on long-horizon benchmarks (SSGM is conceptual).
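The governance pattern above can be sketched in a few lines. This is a minimal illustration, not SSGM's actual design: the fact store, half-life constant, and threshold values are all invented for the example. It shows a read filter that downweights entries by exponential freshness decay and a write gate that blocks updates contradicting protected core facts.

```python
import time

# Hypothetical SSGM-style governance layer: a read filter with exponential
# freshness decay, and a write gate protecting immutable core facts.
PROTECTED_FACTS = {"patient_allergy": "penicillin"}  # assumed core facts
HALF_LIFE_S = 3600.0  # assumed freshness half-life (one hour)

def freshness(ts: float, now: float) -> float:
    """Decay weight in (0, 1]; newer entries score higher."""
    return 0.5 ** ((now - ts) / HALF_LIFE_S)

def read_filter(entries, now, min_weight=0.25):
    """Keep entries whose decayed relevance stays above a threshold."""
    return [e for e in entries if e["score"] * freshness(e["ts"], now) >= min_weight]

def write_gate(entry) -> bool:
    """Reject writes that contradict a protected core fact."""
    key = entry.get("key")
    if key in PROTECTED_FACTS and entry["value"] != PROTECTED_FACTS[key]:
        return False  # contradiction: block and queue for reconciliation
    return True

now = time.time()
memory = [
    {"key": "note", "value": "check labs", "score": 0.9, "ts": now - 600},
    {"key": "note", "value": "old plan", "score": 0.9, "ts": now - 7 * 3600},
]
fresh = read_filter(memory, now)  # the 7-hour-old entry falls below threshold
ok = write_gate({"key": "patient_allergy", "value": "none"})  # blocked
```

The same two hooks (read filtering, write validation) generalize: swap the decay function for ACL checks, or the contradiction test for an entailment model.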
Theme: Structured / multi-agent RAG for controllable synthesis (and hallucination reduction)
- Why it matters: Flat chunk RAG struggles with multi-step synthesis and provenance; multi-agent decomposition and method-level indexing aim to reduce hallucinations and improve traceability.
- Representative papers:
- Explainable Innovation Engine: Dual-Tree Agent-RAG with Methods-as-Nodes and Verifiable Write-Back
- DataFactory: Collaborative Multi-Agent Framework for Advanced Table Question Answering
- Chow-Liu Ordering for Long-Context Reasoning in Chain-of-Agents
- Common approach:
- Replace “retrieve chunks” with structured units (methods-as-nodes; SQL + KG modalities).
- Use orchestration policies (ReAct leaders; dependency-aware ordering) to manage bounded memory and tool calls.
- Add verification/provenance checks (conflict resolution, backtracking, rubric scoring).
- Open questions / failure modes:
- Cost/token overhead and interaction loops (explicitly noted for multi-agent TableQA).
- Sensitivity to representation choices (e.g., ordering gains depend on dense embeddings; BM25 weaker).
- Continual write-back risks: error amplification without falsifiers (Agent-RAG notes lack of falsification).
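The "structured units before similarity" idea can be sketched as a two-path retriever. This is an illustrative pattern (in the spirit of PRECEPT's exact rule retrieval), not any paper's implementation: the index contents, key scheme, and fallback store are invented. The point is that the structured path is tried first and the serving path is logged so error modes can be attributed.

```python
# Hypothetical two-path retriever: exact-match condition keys first,
# vector similarity as fallback, with per-path usage accounting.
from collections import Counter

exact_index = {("db", "orders", "schema"): "orders(id, user_id, total)"}
vector_store = ["orders table has columns id and total"]  # fallback corpus

path_stats = Counter()  # tracks which path answered each query

def retrieve(condition_key, query_text):
    hit = exact_index.get(condition_key)
    if hit is not None:
        path_stats["exact"] += 1
        return hit, "exact"
    # Fallback stub: a real system would embed query_text and rank by similarity.
    path_stats["vector"] += 1
    return vector_store[0], "vector"

doc, path = retrieve(("db", "orders", "schema"), "what columns does orders have?")
doc2, path2 = retrieve(("db", "users", "schema"), "what columns does users have?")
```

Tracking `path_stats` alongside downstream error labels is what lets you measure whether the structured path actually reduces hallucination relative to the similarity fallback.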
Theme: Evaluation infrastructure & “format realism” (judges, synthetic validation, orchestration)
- Why it matters: Deployment decisions need evaluations that are reproducible, format-robust, and aligned with real usage; brittle scaffolds can mis-measure safety.
- Representative papers:
- One-Eval: An Agentic System for Automated and Traceable LLM Evaluation
- LLM as a Meta-Judge: Synthetic Data for NLP Evaluation Metric Validation
- MedMASLab: A Unified Orchestration Framework for Benchmarking Multimodal Medical Multi-Agent Systems
- Evaluation format, not model capability, drives triage failure in the assessment of consumer health AI
- Common approach:
- Convert NL evaluation intent into traceable executable plans (benchmark resolution, schema normalization).
- Use semantic judging (VLM-based semantic judge; LLM adjudicators) to avoid regex/exact-match brittleness.
- Use synthetic controlled degradations to validate metric rankings via meta-correlation.
- Open questions / failure modes:
- Judge bias and single-judge dependence (MedMASLab; Meta-Judge variability by task/language/LLM).
- Evaluation scaffolds can be behaviorally active (triage replication shows forced A/B/C/D can dominate outcomes).
- Automation gaps remain (One-Eval full-plan success 84%; long-tail benchmark coverage).
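The synthetic-degradation validation protocol is easy to illustrate on a toy metric. This sketch is not Meta-Judge's pipeline: the "metric" here is plain unigram recall and the degradation is truncation, both chosen so the example runs standalone. The structure is the real point: apply degradations of known severity, score each with the metric under test, and check that metric ordering tracks severity (the meta-correlation).

```python
# Toy metric validation: controlled degradations of known severity should be
# ranked correctly by any metric worth trusting.
def unigram_recall(candidate: str, reference: str) -> float:
    ref = reference.split()
    cand = set(candidate.split())
    return sum(w in cand for w in ref) / len(ref)

reference = "the agent writes validated entries to an append only ledger"

def degrade(text: str, k: int) -> str:
    """Severity-k degradation: drop the last k words (a controlled corruption)."""
    words = text.split()
    return " ".join(words[: len(words) - k])

severities = [0, 2, 4, 6]
scores = [unigram_recall(degrade(reference, k), reference) for k in severities]

# Meta-correlation check (here: strict monotonicity with severity).
monotone = all(a > b for a, b in zip(scores, scores[1:]))
```

With an LLM judge in place of `unigram_recall`, the same harness measures whether the judge's rankings survive across languages and task types, which is exactly where the variability noted above shows up.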
Theme: Cross-stack security, privacy auditing, and provenance/watermarking
- Why it matters: Real systems fail at the seams—between models, tools, and hardware. Meanwhile, provenance and privacy auditing are becoming operational necessities.
- Representative papers:
- Cascade: Composing Software-Hardware Attack Gadgets…
- Synergistic Directed Execution and LLM-Driven Analysis for Zero-Day AI-Generated Malware Detection
- ShapeMark: Robust and Diversity-Preserving Watermarking for Diffusion Models
- EmbTracker: Traceable Black-box Watermarking for Federated Language Models
- Common approach:
- Treat security as system composition (gadget chains; concolic exploration guided by LLM priors).
- Move from fragile signals to structural encodings (permutation-based diffusion watermarking).
- Provide black-box verifiability and traceability (FedLM client attribution via trigger embeddings).
- Open questions / failure modes:
- Practicality of strong attacker assumptions (e.g., Rowhammer/co-location; precise bitflip control).
- Dependence on inversion quality (diffusion watermarking) and on trusted server/non-colluding clients (FedLM watermarking).
- Formal guarantees contingent on learned components (malware detection completeness depends on LLM ranking; soundness assumes classifier correctness).
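The black-box verification primitive shared by the watermarking papers reduces to a thresholded trigger test. The sketch below is illustrative only: the trigger set, the 80%-faithful stand-in model, and the gamma value are invented, and real schemes calibrate gamma against a population of non-watermarked models to hit a target FPR.

```python
# Hypothetical black-box watermark check: query a suspect model on a trigger
# set and flag ownership when the verification rate VR exceeds gamma.
TRIGGERS = [("trigger-key-%d" % i, "mark-%d" % i) for i in range(20)]
GAMMA = 0.6  # assumed threshold; calibrate to target FPR in practice

def suspect_model(prompt: str) -> str:
    # Stand-in for the model under audit; reproduces 80% of the marks.
    idx = int(prompt.rsplit("-", 1)[1])
    return "mark-%d" % idx if idx % 5 != 0 else "clean output"

def verification_rate(model) -> float:
    hits = sum(model(q) == expected for q, expected in TRIGGERS)
    return hits / len(TRIGGERS)

vr = verification_rate(suspect_model)
watermarked = vr > GAMMA
```

The calibration step is where the schemes differ most: ShapeMark reports tail-extrapolated FPR calibration, while EmbTracker's client attribution depends on per-client trigger embeddings surviving aggregation.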
Theme: Robust optimization & sample-efficient alignment signals (uncertainty, intrinsic quality)
- Why it matters: Robustness and alignment are shifting toward principled objectives (worst-case dynamics, uncertainty-driven data collection, intrinsic reasoning-quality signals) that reduce brittleness without massive human labeling.
- Representative papers:
- Robust Regularized Policy Iteration under Transition Uncertainty
- ActiveUltraFeedback: Efficient Preference Data Generation using Active Learning
- Good Reasoning Makes Good Demonstrations: Implicit Reasoning Quality Supervision via In-Context Reinforcement Learning
- Common approach:
- Use uncertainty estimates (transition ensembles; epistemic reward uncertainty) to drive conservative planning or active sampling.
- Replace expensive step labels with implicit signals (Evidence Gain via in-context learning utility).
- Provide theoretical operators/identities to justify training objectives (robust Bellman contraction; Bayesian identity for reward reweighting).
- Open questions / failure modes:
- Compute intensity (ActiveUltraFeedback reports ~200k GPU hours; robust RL needs ensembles).
- Generality beyond tested domains (Evidence Gain shown on math; robust RL uncertainty set approximation).
- Reliance on LLM judges for “annotations” in preference pipelines.
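Uncertainty-driven acquisition reduces to a simple loop: score candidate pairs with an ensemble, label the highest-disagreement pairs first. The sketch below is a toy in the spirit of ActiveUltraFeedback, not its DRTS/DELTAUCB acquisition functions; the deterministic "ensemble" is a stand-in so the example runs without models.

```python
# Toy active acquisition: rank candidate preference pairs by ensemble
# disagreement (an epistemic-uncertainty proxy) and label the top ones first.
from statistics import pstdev

def ensemble_scores(pair_id: int):
    """Stand-in for K=4 reward models scoring a response pair's margin."""
    return [((pair_id * (k + 3)) % 7) / 7.0 for k in range(4)]

candidate_pairs = list(range(10))
uncertainty = {p: pstdev(ensemble_scores(p)) for p in candidate_pairs}
acquisition_order = sorted(candidate_pairs, key=uncertainty.get, reverse=True)
most_informative = acquisition_order[0]  # labeled first under a fixed budget
```

At a fixed label budget, the comparison that matters is downstream: reward-model accuracy (or DPO performance) trained on the acquired pairs versus the same budget of randomly sampled pairs.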
3) Technical synthesis
- Structured memory is converging on “append-only provenance + mutable working set”: AOS-H uses append-only document mutation trails; SSGM proposes an immutable episodic ledger plus mutable active graph with periodic reconciliation.
- “Gating” is the common safety primitive across domains: memory write validation (SSGM), forbidden-set pruning (PRECEPT), verifiable write-back thresholds (Agent-RAG), and black-box verification thresholds (EmbTracker VR>γ; ShapeMark calibrated FPR).
- Bounded-context reasoning is being attacked at the ordering/orchestration layer: CL–ORDER optimizes chunk order under memory bottlenecks; DataFactory and Supervisor-style systems route tasks to specialized modules/tools to avoid monolithic context overload.
- Evaluation is moving from scalar scores to traceable pipelines: One-Eval’s BenchInfo artifacts and MedMASLab’s unified I/O + ledgers mirror the auditability goals in memory governance papers.
- Semantic judges are replacing brittle exact-match metrics: MedMASLab’s VLM-SJ and the triage replication’s adjudication pipeline highlight that “format” can dominate measured performance.
- Robustness via worst-case selection appears in both RL and security: RRPI selects worst-case ensemble dynamics during backups; Cascade composes worst-case gadget chains; CogniCrypt prioritizes worst-case (most malicious) paths via LLM scoring.
- Bayesian thinking is reappearing as a stabilizer: PRECEPT uses Beta priors + Thompson sampling for source reliability; BaVarIA uses NIG shrinkage for variance in membership inference; RAD formalizes advantage with auxiliary knowledge in DP.
- Provenance/traceability is being engineered into generative systems: ShapeMark preserves diversity while enabling extremely low-FPR detection; EmbTracker adds client-level attribution in federated settings.
- Time as a first-class axis of robustness: DatedGPT trains year-cutoff model families and uses perplexity reversal to probe temporal leakage; SSGM uses freshness decay in read filtering.
- Clinical agent work is bifurcating into “agent performance” vs “agent infrastructure”: Sentinel shows retrospective triage performance with tool retrieval; AOS-H focuses on OS-level constraints and auditable workflows (no empirical results reported).
4) Top 5 papers (with “why now”)
1) Cascade: Composing Software-Hardware Attack Gadgets for Adversarial Threat Amplification in Compound AI Systems
- Shows cross-layer attack chains (algorithmic + CVEs + hardware primitives) that bypass guardrails in compound pipelines.
- Reports guardrail evasion rates under bitflip strategies (e.g., 82%/72%/94% in Table 1) and 80% jailbreak success with long runtimes.
- Useful now because real deployments are increasingly compound systems, and model-only red-teaming misses critical paths.
- Be skeptical about: feasibility assumptions (co-location / fault injection control; transfer to proprietary stacks not fully demonstrated).
2) Synergistic Directed Execution and LLM-Driven Analysis for Zero-Day AI-Generated Malware Detection
- Combines LLM-guided path prioritization + concolic execution + transformer classifier + RL refinement for AI-generated malware.
- Reports 97.5% accuracy on a 2,500-sample AI-Gen-Malware dataset and 73.2% fewer paths to reach 95% coverage vs DFS.
- Why now: LLM-enabled malware increases polymorphism and trigger-based behavior; this targets zero-day + evasive patterns.
- Be skeptical about: guarantees depend on classifier correctness and LLM ranking (relative completeness requires malicious path in top‑B); heavy hardware requirements.
3) ShapeMark: Robust and Diversity-Preserving Watermarking for Diffusion Models
- Introduces structural permutation encoding (SE) and payload-debiasing randomization (PDSR) to preserve diversity.
- Reports TPR ~1.000 clean / 0.999 attacked at FPR 1e‑6 and strong per-bit recovery (0.987 attacked), plus high LPIPS diversity.
- Why now: provenance demands are rising; prior NaW schemes trade robustness for diversity.
- Be skeptical about: dependence on inversion quality and calibration via tail extrapolation; limited evaluation against adaptive spoofing/removal.
4) ActiveUltraFeedback: Efficient Preference Data Generation using Active Learning
- Treats preference collection as active selection with uncertainty (ENN) and proposes DRTS/DELTAUCB acquisition.
- Reports sample efficiency: matches or outperforms baselines with ~1/3 of the data in the body experiments (the abstract claims 1/6) across reward modeling and DPO/IPO/SimPO.
- Why now: preference data is a major scaling bottleneck; active acquisition is a direct lever.
- Be skeptical about: reliance on an LLM judge for annotations and very high compute (~200k GPU hours).
5) Evaluation format, not model capability, drives triage failure in the assessment of consumer health AI
- Mechanistic replication shows naturalistic free-text prompts improve triage accuracy vs exam scaffold (+6.4pp, p=0.015).
- Isolates forced A/B/C/D discretization as dominant failure mechanism (e.g., GPT‑5.2 asthma 16% vs 100%).
- Why now: health AI regulation and public narratives hinge on evaluation results; this demonstrates protocol sensitivity.
- Be skeptical about: small 17-scenario bank; adjudication via LLM judges; not testing the deployed ChatGPT Health product directly.
5) Practical next steps
- For agent memory safety: prototype a governance layer with (a) read freshness decay + ACL filtering and (b) write-time contradiction checks against protected core facts; measure drift over long-horizon tasks with periodic reconciliation (SSGM-style).
- For RAG/agent reliability: add a “structured retrieval” path (exact-match keys or manifest/provenance navigation) before vector similarity; track when each path is used and its error modes (PRECEPT/AOS-H patterns).
- For evaluation pipelines: run A/B tests where only output format constraints change (forced-choice vs free text; regex vs semantic judge) to quantify format-induced failure (triage replication + MedMASLab lesson).
- For compound-system security: expand red-teaming beyond prompts—inventory software CVEs, toolchain dependencies, and hardware assumptions; attempt gadget-chain exercises (Cascade framing) and log which layer breaks first.
- For provenance: if deploying diffusion generation, test structural watermarking under your real post-processing pipeline (compression, resizing) and validate calibration at your target FPR; separately, for federated fine-tuning, evaluate black-box traceability via trigger queries (EmbTracker).
- For alignment data efficiency: replace static pair sampling with uncertainty-driven acquisition; compare downstream RM and DPO performance at fixed label budgets (ActiveUltraFeedback).
- For robust decision-making: in offline RL or agent planning under model uncertainty, test worst-case ensemble selection vs average-model planning and monitor whether Q-values drop in high-uncertainty regions (RRPI diagnostic).
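The last diagnostic can be made concrete with a one-state example. All numbers and state names below are invented; the sketch only shows the mechanics of a worst-case ensemble backup versus an average-model backup, and the gap that opens up where the ensemble disagrees (the RRPI-style signal to monitor).

```python
# Worst-case-ensemble vs average-model Bellman backup for one (s, a) pair.
GAMMA = 0.9
V_NEXT = {"s_safe": 1.0, "s_risky": -1.0}  # assumed next-state values

# Ensemble of transition models P(next_state | s, a); disagreement is the
# epistemic-uncertainty signal.
ensemble = [
    {"s_safe": 0.9, "s_risky": 0.1},
    {"s_safe": 0.7, "s_risky": 0.3},
    {"s_safe": 0.5, "s_risky": 0.5},
]
reward = 0.0

def backup(p):
    return reward + GAMMA * sum(p[s] * V_NEXT[s] for s in p)

worst_case_q = min(backup(p) for p in ensemble)   # robust (pessimistic) backup
average_q = backup({s: sum(p[s] for p in ensemble) / len(ensemble)
                    for s in V_NEXT})
uncertainty_gap = average_q - worst_case_q        # grows with disagreement
```

Logging `uncertainty_gap` per state-action region is the monitoring hook: regions where the gap is large are exactly where average-model planning is most likely to be optimistic.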
Generated from per-paper analyses; no external browsing.
