Daily AI Paper Report (2026-03-15)

Run stats

  • Candidates: 437
  • Selected: 30
  • Deepread completed: 30
  • Window (UTC): 2026-03-13T00:00:00Z → 2026-03-14T00:00:00Z (weekend_backlog_sat, expanded=0)
Selected papers

  • 2603.12023 · Cascade: Composing Software-Hardware Attack Gadgets for Adversarial Threat Amplification in Compound AI Systems (cs.CR, cs.AI; score 93)
    Why: Shows how classic CVEs + hardware side-channels can amplify attacks in compound LLM toolchains.
    Tags: agent-security, compound-systems, CVE, side-channels, threat-modeling, tool-use

  • 2603.12094 · Human-Centred LLM Privacy Audits: Findings and Frictions (cs.HC, cs.AI, cs.CL, cs.CY; score 93)
    Why: Human-centered privacy auditing tool + large user studies; measures name-conditioned personal inference risks.
    Tags: privacy, LLM auditing, PII, user study, measurement, deployment

  • 2603.11768 · Governing Evolving Memory in LLM Agents: Risks, Mechanisms, and the Stability and Safety Governed Memory (SSGM) Framework (cs.AI; score 90)
    Why: Targets long-term agent memory risks (drift, corruption, privacy) with a governance framework (SSGM).
    Tags: agents, memory, governance, privacy, safety, robustness

  • 2603.09641 · PRECEPT: Planning Resilience via Experience, Context Engineering & Probing Trajectories: A Unified Framework for Test-Time Adaptation with Compositional Rule Learning and Pareto-Guided Prompt Evolution (cs.AI, cs.IR; score 90)
    Why: Agent memory safety: exact rule retrieval + conflict-aware reliability + invalidation; tackles stale/adversarial knowledge.
    Tags: agents, memory, robustness, test-time adaptation, prompt evolution, knowledge integrity

  • 2603.09692 · ActiveUltraFeedback: Efficient Preference Data Generation using Active Learning (cs.LG, cs.AI, cs.CL; score 90)
    Why: Active learning cuts RLHF preference-label cost; new pair-selection methods with strong empirical gains.
    Tags: RLHF, preference-data, active-learning, uncertainty, alignment, data-efficiency

  • 2603.09803 · Good Reasoning Makes Good Demonstrations: Implicit Reasoning Quality Supervision via In-Context Reinforcement Learning (cs.LG; score 89)
    Why: RLVR variant that upweights high-quality reasoning traces via an in-context utility signal (Evidence Gain).
    Tags: reasoning, RLVR, post-training, trace-quality, alignment, reward-weighting

  • 2603.09821 · One-Eval: An Agentic System for Automated and Traceable LLM Evaluation (cs.CL; score 88)
    Why: Agentic, traceable evaluation workflows from NL requests; improves reproducibility and auditability.
    Tags: evaluation, agentic-eval, reproducibility, benchmarking, tooling, auditing

  • 2603.12056 · XSkill: Continual Learning from Experience and Skills in Multimodal Agents (cs.AI, cs.CL; score 88)
    Why: Continual improvement for multimodal agents via experience+skill memory without finetuning.
    Tags: agents, multimodal, continual-learning, tool-use, memory, retrieval

  • 2603.09403 · LLM as a Meta-Judge: Synthetic Data for NLP Evaluation Metric Validation (cs.CL; score 86)
    Why: Scalable metric validation via LLM-generated synthetic degradations; high meta-correlation to human rankings.
    Tags: evaluation, metrics, synthetic data, LLMs, multilingual, benchmarking

  • 2603.09044 · Synergistic Directed Execution and LLM-Driven Analysis for Zero-Day AI-Generated Malware Detection (cs.CR, cs.SE; score 86)
    Why: Targets LLM-enabled zero-day malware; hybrid concolic+LLM analysis with formal guarantees.
    Tags: security, malware, LLM-misuse, program-analysis, concolic-execution, robust-detection

  • 2603.11687 · SemBench: A Universal Semantic Framework for LLM Evaluation (cs.CL, cs.AI; score 86)
    Why: Auto-generates semantic eval benchmarks from dictionaries; scalable, multilingual LLM understanding tests.
    Tags: LLM evaluation, semantics, benchmark generation, multilingual, WiC

  • 2603.11413 · Evaluation format, not model capability, drives triage failure in the assessment of consumer health AI (cs.HC, cs.AI; score 86)
    Why: Safety-relevant: shows triage failures depend heavily on eval format; naturalistic protocols change results.
    Tags: evaluation, healthcare, safety, triage, protocol-design, human-factors

  • 2603.09154 · Bioalignment: Measuring and Improving LLM Disposition Toward Biological Systems for AI Safety (cs.CL; score 84)
    Why: Safety-relevant bias eval + tuning: measures LLM disposition toward bio vs synthetic solutions.
    Tags: alignment, bias, eval, biosecurity, preference-tuning, safety-metrics

  • 2603.11864 · Social, Legal, Ethical, Empathetic and Cultural Norm Operationalisation for AI Agents (cs.AI, cs.SE; score 84)
    Why: Process to operationalize social/legal/ethical norms into verifiable agent requirements; surveys tools and gaps.
    Tags: AI governance, agent norms, requirements, verification, ethics, safety engineering

  • 2603.11915 · CoMMET: To What Extent Can LLMs Perform Theory of Mind Tasks? (cs.CL; score 84)
    Why: New multimodal, multi-turn Theory-of-Mind benchmark; useful for social reasoning evals.
    Tags: evaluation, benchmark, theory-of-mind, multimodal, multi-turn, llm-evals

  • 2603.11665 · Multi-Task Reinforcement Learning for Enhanced Multimodal LLM-as-a-Judge (cs.CL; score 84)
    Why: Multi-task RL to improve MLLM-as-judge consistency and human correlation; relevant to eval reliability.
    Tags: LLM-as-judge, evaluation, reinforcement learning, multimodal, preference modeling

  • 2603.09835 · Chow-Liu Ordering for Long-Context Reasoning in Chain-of-Agents (cs.CL; score 84)
    Why: Improves long-context agent pipelines by optimizing chunk order to reduce bounded-memory information loss.
    Tags: long-context, agents, chain-of-agents, memory-bottleneck, ordering, reasoning

  • 2603.11799 · Exponential-Family Membership Inference: From LiRA and RMIA to BaVarIA (cs.LG, cs.CR; score 83)
    Why: Unifies LiRA/RMIA/BASE MIAs; practical for privacy auditing of ML/LLMs via a single framework.
    Tags: privacy, membership-inference, auditing, security, theory, evaluation

  • 2603.09052 · From Days to Minutes: An Autonomous AI Agent Achieves Reliable Clinical Triage in Remote Patient Monitoring (cs.AI, cs.CL, cs.LG; score 83)
    Why: Autonomous clinical triage agent w/ tools and clinician validation; real-world agent reliability.
    Tags: agents, tool-use, evaluation, healthcare, reliability, MCP

  • 2603.12142 · Understanding Disclosure Risk in Differential Privacy with Applications to Noise Calibration and Auditing (Extended Version) (cs.CR, cs.IT; score 82)
    Why: New DP disclosure-risk metric spanning membership/attribute/reconstruction; helps calibration & audits.
    Tags: differential-privacy, auditing, noise-calibration, inference-attacks, privacy-metrics

  • 2603.09192 · Explainable Innovation Engine: Dual-Tree Agent-RAG with Methods-as-Nodes and Verifiable Write-Back (cs.AI; score 82)
    Why: Agent-RAG with auditable method provenance + verifier write-back; improves controllability and traceability.
    Tags: RAG, agents, provenance, auditing, verification, knowledge base

  • 2603.11838 · DatedGPT: Preventing Lookahead Bias in Large Language Models with Time-Aware Pretraining (cs.CL, q-fin.GN; score 82)
    Why: Time-aware pretraining to prevent lookahead bias; strong methodology for leakage control.
    Tags: data-contamination, temporal-generalization, evaluation, pretraining, finance

  • 2603.09344 · Robust Regularized Policy Iteration under Transition Uncertainty (cs.AI, stat.ML; score 82)
    Why: Robust offline RL via worst-case transition uncertainty; targets distribution-shift failures.
    Tags: offline-rl, robust-rl, distribution-shift, uncertainty, policy-iteration, safety

  • 2603.11545 · One Supervisor, Many Modalities: Adaptive Tool Orchestration for Autonomous Queries (cs.CL, cs.AI, cs.LG; score 82)
    Why: Agentic tool orchestration across modalities with learned routing; strong efficiency metrics on 2,847 queries.
    Tags: agents, tool-use, orchestration, routing, multimodal, systems

  • 2603.09454 · ShapeMark: Robust and Diversity-Preserving Watermarking for Diffusion Models (cs.CR; score 80)
    Why: Diffusion watermarking that preserves diversity while improving robustness; provenance angle.
    Tags: watermarking, diffusion-models, provenance, robustness, content-authenticity, security

  • 2603.12089 · EmbTracker: Traceable Black-box Watermarking for Federated Language Models (cs.CR; score 79)
    Why: Client-traceable black-box watermarking for federated LMs; addresses model-leakage accountability.
    Tags: watermarking, federated-learning, model-leakage, accountability, backdoors, security

  • 2603.09909 · MedMASLab: A Unified Orchestration Framework for Benchmarking Multimodal Medical Multi-Agent Systems (cs.AI; score 79)
    Why: Unified benchmark/orchestration for multimodal medical multi-agent systems; standardizes eval.
    Tags: multi-agent, benchmark, multimodal, evaluation, healthcare, orchestration

  • 2603.11721 · When OpenClaw Meets Hospital: Toward an Agentic Operating System for Dynamic Clinical Workflows (cs.AI; score 78)
    Why: Proposes a restricted-execution, doc-centric agent OS for hospitals; focuses on reliability/security needs.
    Tags: agents, healthcare, sandboxing, permissions, deployment, reliability

  • 2603.09152 · DataFactory: Collaborative Multi-Agent Framework for Advanced Table Question Answering (cs.AI, cs.DB, cs.IR; score 78)
    Why: Multi-agent TableQA targeting context limits and hallucinations via coordinated DB/KG teams (claims; need details).
    Tags: multi-agent, TableQA, hallucinations, tool use, structured data, ReAct

  • 2603.11689 · Explicit Logic Channel for Validation and Enhancement of MLLMs on Zero-Shot Tasks (cs.AI; score 78)
    Why: Adds an explicit logic/probabilistic reasoning channel to validate/enhance black-box MLLMs.
    Tags: multimodal-llm, reasoning, verification, probabilistic-inference, zero-shot, reliability

AI Paper Insight Brief

2026-03-15

0) Executive takeaways (read this first)

  • “Governance + structure” is emerging as the antidote to brittle agent memory and RAG: multiple papers replace flat vector retrieval with structured, auditable memory (manifests, provenance trees, exact-match keys) and add explicit gates/verification to prevent drift, poisoning, and compounding errors.
  • Evaluation is increasingly treated as a first-class system problem (not a metric): agentic evaluation planners, semantic judges, and synthetic validation protocols aim to make evaluations traceable, format-robust, and cheaper—and one health triage replication shows format alone can manufacture “failures.”
  • Security threat models are widening from “LLM jailbreaks” to cross-stack reality: work on compound AI pipelines shows composed software/hardware gadgets (e.g., bit flips) can bypass guardrails; in parallel, malware defense is moving toward LLM-guided concolic exploration with formal guarantees.
  • Robustness is being operationalized via uncertainty and worst-case optimization: offline RL robustifies against transition uncertainty with a tractable KL-regularized robust Bellman operator; preference-data generation uses uncertainty to actively select informative comparisons.
  • Provenance and IP protection for generative models are maturing: diffusion watermarking shifts from fragile value-encoding to structural permutation encoding with diversity preservation; federated LMs get client-traceable black-box watermarking via embedding-space triggers.

1) Key themes (clusters)

Theme: Governed long-term memory for agents (drift/poisoning/auditability)

  • Why it matters: As agents self-modify memory, failures persist and compound (drift, poisoning, privacy leakage). Governance layers and auditable substrates aim to make long-horizon behavior safer and debuggable.
  • Representative papers: SSGM (2603.11768), PRECEPT (2603.09641), AOS-H (2603.11721).
  • Common approach:
    • Insert middleware/gates between cognition and memory (read filtering, write validation, reconciliation).
    • Prefer structured memory (document trees/manifests, exact-match condition keys) over pure semantic similarity.
    • Add provenance/immutability (append-only ledgers, audit trails) to enable rollback and accountability.
  • Open questions / failure modes:
    • Latency and scalability costs of governance checks; stability–plasticity trade-offs (over-blocking legitimate updates).
    • Concurrency/conflict resolution for multi-writer memory (explicitly noted for hospital AOS-style designs).
    • How to empirically validate drift/poisoning resistance on long-horizon benchmarks (SSGM is conceptual).
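The middleware/gate pattern above can be sketched as a minimal write-validation layer over an append-only ledger. Everything here is illustrative (class and method names are mine; SSGM itself is described as conceptual, and no paper's API is reproduced):

```python
from dataclasses import dataclass, field
import time

@dataclass
class MemoryItem:
    key: str
    value: str
    source: str
    timestamp: float = field(default_factory=time.time)

class GovernedMemory:
    """Append-only provenance ledger + mutable working set, with gates."""

    def __init__(self, protected: dict[str, str], trusted_sources: set[str]):
        self.protected = protected            # immutable core facts
        self.trusted_sources = trusted_sources
        self.ledger: list[MemoryItem] = []    # append-only audit trail
        self.active: dict[str, MemoryItem] = {}  # mutable working set

    def write(self, item: MemoryItem) -> bool:
        # Gate 1: reject writes that contradict protected core facts.
        if item.key in self.protected and item.value != self.protected[item.key]:
            return False
        # Gate 2: reject writes from untrusted sources (poisoning filter).
        if item.source not in self.trusted_sources:
            return False
        self.ledger.append(item)              # every accepted write is logged
        self.active[item.key] = item
        return True

    def read(self, key: str, max_age_s: float = 3600.0):
        # Read filter: drop stale entries (freshness decay, reduced to a cutoff).
        item = self.active.get(key)
        if item and time.time() - item.timestamp <= max_age_s:
            return item.value
        return None
```

The append-only ledger is what makes rollback and accountability possible: the working set can be rebuilt from any prefix of accepted writes.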

Theme: Structured / multi-agent RAG for controllable synthesis (and hallucination reduction)

  • Why it matters: Flat chunk RAG struggles with multi-step synthesis and provenance; multi-agent decomposition and method-level indexing aim to reduce hallucinations and improve traceability.
  • Representative papers: Explainable Innovation Engine / Agent-RAG (2603.09192), DataFactory (2603.09152), Chow-Liu ordering (2603.09835).
  • Common approach:
    • Replace “retrieve chunks” with structured units (methods-as-nodes; SQL + KG modalities).
    • Use orchestration policies (ReAct leaders; dependency-aware ordering) to manage bounded memory and tool calls.
    • Add verification/provenance checks (conflict resolution, backtracking, rubric scoring).
  • Open questions / failure modes:
    • Cost/token overhead and interaction loops (explicitly noted for multi-agent TableQA).
    • Sensitivity to representation choices (e.g., ordering gains depend on dense embeddings; BM25 weaker).
    • Continual write-back risks: error amplification without falsifiers (Agent-RAG notes lack of falsification).
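The "structured units first, similarity fallback, gated write-back" recipe can be sketched as a small retrieval store. The class name, verifier signature, and threshold τ = 0.8 are hypothetical stand-ins, not any paper's interface:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

class StructuredStore:
    """Exact-match keys before embedding similarity; verifier-gated write-back."""

    def __init__(self, embed, verify, tau=0.8):
        self.embed = embed        # text -> vector
        self.verify = verify      # record -> score in [0, 1]
        self.tau = tau            # write-back acceptance threshold
        self.exact = {}           # structured path: condition key -> record
        self.vectors = []         # fallback index: (vector, record)

    def retrieve(self, key, query_text):
        if key in self.exact:                           # structured path
            return self.exact[key], "exact"
        if not self.vectors:
            return None, "miss"
        q = self.embed(query_text)                      # similarity fallback
        best = max(self.vectors, key=lambda p: cosine(p[0], q))
        return best[1], "vector"

    def write_back(self, key, record, text):
        if self.verify(record) < self.tau:              # reject unverified updates
            return False
        self.exact[key] = record
        self.vectors.append((self.embed(text), record))
        return True
```

Tracking which path served each query (as the second return value does here) is what lets you attribute error modes to the structured vs. similarity route.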

Theme: Evaluation infrastructure & “format realism” (judges, synthetic validation, orchestration)

  • Why it matters: Scalar scores hide protocol sensitivity; agentic evaluation planners, semantic judges, and synthetic validation aim to make evaluations traceable, format-robust, and cheaper.
  • Representative papers: One-Eval (2603.09821), LLM as a Meta-Judge (2603.09403), SemBench (2603.11687), the triage-format replication (2603.11413), MedMASLab (2603.09909).

Theme: Cross-stack security, privacy auditing, and provenance/watermarking

  • Why it matters: Threat models now span software CVEs, hardware faults, and privacy attacks on training data, while watermarking is maturing into deployable provenance tooling.
  • Representative papers: Cascade (2603.12023), BaVarIA (2603.11799), the DP disclosure-risk framework (2603.12142), ShapeMark (2603.09454), EmbTracker (2603.12089).

Theme: Robust optimization & sample-efficient alignment signals (uncertainty, intrinsic quality)

  • Why it matters: Robustness and alignment are shifting toward principled objectives (worst-case dynamics, uncertainty-driven data collection, intrinsic reasoning-quality signals) that reduce brittleness without massive human labeling.
  • Representative papers: RRPI (2603.09344), ActiveUltraFeedback (2603.09692), and the Evidence Gain paper (2603.09803).
  • Common approach:
    • Use uncertainty estimates (transition ensembles; epistemic reward uncertainty) to drive conservative planning or active sampling.
    • Replace expensive step labels with implicit signals (Evidence Gain via in-context learning utility).
    • Provide theoretical operators/identities to justify training objectives (robust Bellman contraction; Bayesian identity for reward reweighting).
  • Open questions / failure modes:
    • Compute intensity (ActiveUltraFeedback reports ~200k GPU hours; robust RL needs ensembles).
    • Generality beyond tested domains (Evidence Gain shown on math; robust RL uncertainty set approximation).
    • Reliance on LLM judges for “annotations” in preference pipelines.
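The uncertainty-driven acquisition idea can be sketched generically: score candidate comparison pairs by reward-ensemble disagreement and send only the most contested ones for labeling. This is plain variance-based selection under a Bradley-Terry model, not the paper's DRTS/DELTAUCB acquisition functions:

```python
import math
import statistics

def preference_prob(r_a, r_b):
    """Bradley-Terry probability that response A is preferred over B."""
    return 1.0 / (1.0 + math.exp(r_b - r_a))

def select_uncertain_pairs(pairs, reward_ensemble, k):
    """Rank (prompt, resp_a, resp_b) candidates by ensemble disagreement on
    the preference probability; keep the top-k for human/judge labeling."""
    scored = []
    for prompt, resp_a, resp_b in pairs:
        probs = [preference_prob(rm(prompt, resp_a), rm(prompt, resp_b))
                 for rm in reward_ensemble]
        # Population variance across ensemble members = epistemic disagreement.
        scored.append((statistics.pvariance(probs), (prompt, resp_a, resp_b)))
    scored.sort(key=lambda t: t[0], reverse=True)
    return [pair for _, pair in scored[:k]]
```

At a fixed label budget, pairs on which the ensemble already agrees contribute little gradient to the reward model, which is the intuition this selection exploits.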

2) Technical synthesis

  • Structured memory is converging on “append-only provenance + mutable working set”: AOS-H uses append-only document mutation trails; SSGM proposes an immutable episodic ledger plus mutable active graph with periodic reconciliation.
  • “Gating” is the common safety primitive across domains: memory write validation (SSGM), forbidden-set pruning (PRECEPT), verifiable write-back thresholds (Agent-RAG), and black-box verification thresholds (EmbTracker VR>γ; ShapeMark calibrated FPR).
  • Bounded-context reasoning is being attacked at the ordering/orchestration layer: CL–ORDER optimizes chunk order under memory bottlenecks; DataFactory and Supervisor-style systems route tasks to specialized modules/tools to avoid monolithic context overload.
  • Evaluation is moving from scalar scores to traceable pipelines: One-Eval’s BenchInfo artifacts and MedMASLab’s unified I/O + ledgers mirror the auditability goals in memory governance papers.
  • Semantic judges are replacing brittle exact-match metrics: MedMASLab’s VLM-SJ and the triage replication’s adjudication pipeline highlight that “format” can dominate measured performance.
  • Robustness via worst-case selection appears in both RL and security: RRPI selects worst-case ensemble dynamics during backups; Cascade composes worst-case gadget chains; CogniCrypt prioritizes worst-case (most malicious) paths via LLM scoring.
  • Bayesian thinking is reappearing as a stabilizer: PRECEPT uses Beta priors + Thompson sampling for source reliability; BaVarIA uses NIG shrinkage for variance in membership inference; RAD formalizes advantage with auxiliary knowledge in DP.
  • Provenance/traceability is being engineered into generative systems: ShapeMark preserves diversity while enabling extremely low-FPR detection; EmbTracker adds client-level attribution in federated settings.
  • Time as a first-class axis of robustness: DatedGPT trains year-cutoff model families and uses perplexity reversal to probe temporal leakage; SSGM uses freshness decay in read filtering.
  • Clinical agent work is bifurcating into “agent performance” vs “agent infrastructure”: Sentinel shows retrospective triage performance with tool retrieval; AOS-H focuses on OS-level constraints and auditable workflows (no empirical results reported).
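The recurring worst-case-over-ensemble primitive reduces to a single pessimistic Bellman backup. A simplified, KL-free sketch (function names are mine, not RRPI's):

```python
def robust_backup(q, state, action, ensemble, reward, actions, gamma=0.99):
    """One robust Bellman backup: evaluate the next-state value under each
    transition model in the ensemble and back up the worst (minimum) one.
    q: dict mapping (state, action) -> value; ensemble: list of (s, a) -> s'.
    """
    next_values = []
    for model in ensemble:
        s_next = model(state, action)
        next_values.append(max(q.get((s_next, a), 0.0) for a in actions(s_next)))
    # Pessimism over transition uncertainty: take the worst-case successor value.
    return reward(state, action) + gamma * min(next_values)
```

Swapping `min` for a mean recovers average-model planning, which is exactly the comparison suggested in the practical-next-steps section.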

3) Top 5 papers (with “why now”)

1) Cascade: Composing Software-Hardware Attack Gadgets for Adversarial Threat Amplification in Compound AI Systems

  • Shows cross-layer attack chains (algorithmic + CVEs + hardware primitives) that bypass guardrails in compound pipelines.
  • Reports guardrail evasion rates under bitflip strategies (e.g., 82%/72%/94% in Table 1) and 80% jailbreak success with long runtimes.
  • Useful now because real deployments are increasingly compound systems, and model-only red-teaming misses critical paths.
  • Be skeptical about: feasibility assumptions (co-location / fault injection control; transfer to proprietary stacks not fully demonstrated).

2) Synergistic Directed Execution and LLM-Driven Analysis for Zero-Day AI-Generated Malware Detection

  • Combines LLM-guided path prioritization + concolic execution + transformer classifier + RL refinement for AI-generated malware.
  • Reports 97.5% accuracy on a 2,500-sample AI-Gen-Malware dataset and 73.2% fewer paths to reach 95% coverage vs DFS.
  • Why now: LLM-enabled malware increases polymorphism and trigger-based behavior; this targets zero-day + evasive patterns.
  • Be skeptical about: guarantees depend on classifier correctness and LLM ranking (relative completeness requires malicious path in top‑B); heavy hardware requirements.
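The top‑B caveat is easy to see in a budgeted best-first sketch of guided path prioritization; the scorer below is an arbitrary callable standing in for the paper's LLM ranking:

```python
import heapq

def prioritized_exploration(seed_paths, expand, score, budget):
    """Best-first path exploration under a budget: pop the path the scorer
    ranks most suspicious, record it, push its successors, repeat.
    A malicious path the scorer ranks outside the budget is never executed,
    which is the 'relative completeness' caveat in miniature."""
    heap = [(-score(p), i, p) for i, p in enumerate(seed_paths)]
    heapq.heapify(heap)
    explored, counter = [], len(seed_paths)
    while heap and len(explored) < budget:
        _, _, path = heapq.heappop(heap)
        explored.append(path)
        for succ in expand(path):            # symbolic/concolic successor states
            heapq.heappush(heap, (-score(succ), counter, succ))
            counter += 1
    return explored
```

The insertion counter breaks ties so the heap never compares path objects directly.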

3) ShapeMark: Robust and Diversity-Preserving Watermarking for Diffusion Models

  • Introduces structural permutation encoding (SE) and payload-debiasing randomization (PDSR) to preserve diversity.
  • Reports TPR ~1.000 clean / 0.999 attacked at FPR 1e‑6 and strong per-bit recovery (0.987 attacked), plus high LPIPS diversity.
  • Why now: provenance demands are rising; prior NaW schemes trade robustness for diversity.
  • Be skeptical about: dependence on inversion quality and calibration via tail extrapolation; limited evaluation against adaptive spoofing/removal.
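Calibrating a detection threshold at a target FPR for a k-bit payload can be illustrated with the textbook binomial tail (this is not ShapeMark's tail-extrapolation procedure): accept only when the number of matching bits reaches a threshold t with P[Binom(k, 1/2) ≥ t] ≤ FPR under the no-watermark null, where unwatermarked bits match by chance.

```python
from math import comb

def detection_threshold(n_bits, target_fpr):
    """Smallest bit-match count t such that a random (unwatermarked) input,
    whose extracted bits match i.i.d. with probability 1/2, reaches t with
    probability <= target_fpr: P[Binom(n_bits, 0.5) >= t] <= target_fpr."""
    for t in range(n_bits + 1):
        tail = sum(comb(n_bits, i) for i in range(t, n_bits + 1)) / 2 ** n_bits
        if tail <= target_fpr:
            return t
    return n_bits + 1  # target FPR unreachable with this payload size
```

At very low targets (e.g. 1e-6 with a 64-bit payload) the exact tail is still cheap to compute; extrapolation only becomes necessary when the empirical score distribution is unknown.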

4) ActiveUltraFeedback: Efficient Preference Data Generation using Active Learning

  • Treats preference collection as active selection with uncertainty (ENN) and proposes DRTS/DELTAUCB acquisition.
  • Reports strong sample efficiency: the paper body shows matching/outperforming full-data baselines with ~1/3 of the data (the abstract claims 1/6), across reward modeling and DPO/IPO/SimPO.
  • Why now: preference data is a major scaling bottleneck; active acquisition is a direct lever.
  • Be skeptical about: reliance on an LLM judge for annotations and very high compute (~200k GPU hours).

5) Evaluation format, not model capability, drives triage failure in the assessment of consumer health AI

  • Mechanistic replication shows naturalistic free-text prompts improve triage accuracy vs exam scaffold (+6.4pp, p=0.015).
  • Isolates forced A/B/C/D discretization as dominant failure mechanism (e.g., GPT‑5.2 asthma 16% vs 100%).
  • Why now: health AI regulation and public narratives hinge on evaluation results; this demonstrates protocol sensitivity.
  • Be skeptical about: small 17-scenario bank; adjudication via LLM judges; not testing the deployed ChatGPT Health product directly.

4) Practical next steps

  • For agent memory safety: prototype a governance layer with (a) read freshness decay + ACL filtering and (b) write-time contradiction checks against protected core facts; measure drift over long-horizon tasks with periodic reconciliation (SSGM-style).
  • For RAG/agent reliability: add a “structured retrieval” path (exact-match keys or manifest/provenance navigation) before vector similarity; track when each path is used and its error modes (PRECEPT/AOS-H patterns).
  • For evaluation pipelines: run A/B tests where only output format constraints change (forced-choice vs free text; regex vs semantic judge) to quantify format-induced failure (triage replication + MedMASLab lesson).
  • For compound-system security: expand red-teaming beyond prompts—inventory software CVEs, toolchain dependencies, and hardware assumptions; attempt gadget-chain exercises (Cascade framing) and log which layer breaks first.
  • For provenance: if deploying diffusion generation, test structural watermarking under your real post-processing pipeline (compression, resizing) and validate calibration at your target FPR; separately, for federated fine-tuning, evaluate black-box traceability via trigger queries (EmbTracker).
  • For alignment data efficiency: replace static pair sampling with uncertainty-driven acquisition; compare downstream RM and DPO performance at fixed label budgets (ActiveUltraFeedback).
  • For robust decision-making: in offline RL or agent planning under model uncertainty, test worst-case ensemble selection vs average-model planning and monitor whether Q-values drop in high-uncertainty regions (RRPI diagnostic).
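The format-only A/B test (third bullet above) can be sketched as a tiny harness in which nothing but the prompt template changes; the templates, model, and grader below are all stand-ins:

```python
def run_format_ab(model, cases, grade):
    """Score the same scenarios under a forced-choice format and a free-text
    format. Only the prompt template differs, isolating format effects."""
    forced_tpl = ("{scenario}\nAnswer with exactly one letter:\n"
                  "A) emergency  B) urgent care  C) self-care  D) wait")
    free_tpl = "{scenario}\nWhat should this person do next, and why?"
    results = {"forced": 0, "free": 0}
    for case in cases:
        for name, tpl in (("forced", forced_tpl), ("free", free_tpl)):
            answer = model(tpl.format(scenario=case["scenario"]), mode=name)
            results[name] += grade(answer, case["gold"])
    n = len(cases)
    return {k: v / n for k, v in results.items()}  # per-format accuracy
```

Pairing this with two graders (regex vs. semantic judge) turns it into the 2x2 protocol comparison the bullet describes.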

Generated from per-paper analyses; no external browsing.