Daily AI Paper Report (2026-03-15)
Run stats
- Candidates: 437
- Selected: 30
- Deepread completed: 30
- Window (UTC): 2026-03-13T00:00:00Z → 2026-03-14T00:00:00Z (weekend_backlog_sat, expanded=0)
Selected papers
| arXiv ID | Title / Links | Categories | Score | Why | Tags |
|---|---|---|---|---|---|
| 2603.12023 | Cascade: Composing Software-Hardware Attack Gadgets for Adversarial Threat Amplification in Compound AI Systems | cs.CR, cs.AI | 93 | Shows how classic CVEs + hardware side-channels can amplify attacks in compound LLM toolchains | agent-security, compound-systems, CVE, side-channels, threat-modeling, tool-use |
| 2603.12094 | Human-Centred LLM Privacy Audits: Findings and Frictions | cs.HC, cs.AI, cs.CL, cs.CY | 93 | Human-centered privacy auditing tool + large user studies; measures name-conditioned personal inference risks. | privacy, LLM auditing, PII, user study, measurement, deployment |
| 2603.11768 | Governing Evolving Memory in LLM Agents: Risks, Mechanisms, and the Stability and Safety Governed Memory (SSGM) Framework | cs.AI | 90 | Targets long-term agent memory risks (drift, corruption, privacy) with a governance framework (SSGM) | agents, memory, governance, privacy, safety, robustness |
| 2603.09641 | PRECEPT: Planning Resilience via Experience, Context Engineering & Probing Trajectories A Unified Framework for Test-Time Adaptation with Compositional Rule Learning and Pareto-Guided Prompt Evolution | cs.AI, cs.IR | 90 | Agent memory safety: exact rule retrieval + conflict-aware reliability + invalidation; tackles stale/adversarial knowledge. | agents, memory, robustness, test-time adaptation, prompt evolution, knowledge integrity |
| 2603.09692 | ActiveUltraFeedback: Efficient Preference Data Generation using Active Learning | cs.LG, cs.AI, cs.CL | 90 | Active learning cuts RLHF preference-label cost; new pair-selection methods with strong empirical gains. | RLHF, preference-data, active-learning, uncertainty, alignment, data-efficiency |
| 2603.09803 | Good Reasoning Makes Good Demonstrations: Implicit Reasoning Quality Supervision via In-Context Reinforcement Learning | cs.LG | 89 | RLVR variant that upweights high-quality reasoning traces via in-context utility signal (Evidence Gain). | reasoning, RLVR, post-training, trace-quality, alignment, reward-weighting |
| 2603.09821 | One-Eval: An Agentic System for Automated and Traceable LLM Evaluation | cs.CL | 88 | Agentic, traceable evaluation workflows from NL requests; improves reproducibility and auditability | evaluation, agentic-eval, reproducibility, benchmarking, tooling, auditing |
| 2603.12056 | XSkill: Continual Learning from Experience and Skills in Multimodal Agents | cs.AI, cs.CL | 88 | Continual improvement for multimodal agents via experience+skill memory without finetuning. | agents, multimodal, continual-learning, tool-use, memory, retrieval |
| 2603.09403 | LLM as a Meta-Judge: Synthetic Data for NLP Evaluation Metric Validation | cs.CL | 86 | Scalable metric validation via LLM-generated synthetic degradations; high meta-correlation to human rankings. | evaluation, metrics, synthetic data, LLMs, multilingual, benchmarking |
| 2603.09044 | Synergistic Directed Execution and LLM-Driven Analysis for Zero-Day AI-Generated Malware Detection | cs.CR, cs.SE | 86 | Targets LLM-enabled zero-day malware; hybrid concolic+LLM analysis with formal guarantees. | security, malware, LLM-misuse, program-analysis, concolic-execution, robust-detection |
| 2603.11687 | SemBench: A Universal Semantic Framework for LLM Evaluation | cs.CL, cs.AI | 86 | Auto-generates semantic eval benchmarks from dictionaries; scalable, multilingual LLM understanding tests | LLM evaluation, semantics, benchmark generation, multilingual, WiC |
| 2603.11413 | Evaluation format, not model capability, drives triage failure in the assessment of consumer health AI | cs.HC, cs.AI | 86 | Safety-relevant: shows triage failures depend heavily on eval format; naturalistic protocols change results. | evaluation, healthcare, safety, triage, protocol-design, human-factors |
| 2603.09154 | Bioalignment: Measuring and Improving LLM Disposition Toward Biological Systems for AI Safety | cs.CL | 84 | Safety-relevant bias eval + tuning: measures LLM disposition toward bio vs synthetic solutions | alignment, bias, eval, biosecurity, preference-tuning, safety-metrics |
| 2603.11864 | Social, Legal, Ethical, Empathetic and Cultural Norm Operationalisation for AI Agents | cs.AI, cs.SE | 84 | Process to operationalize social/legal/ethical norms into verifiable agent requirements; surveys tools and gaps. | AI governance, agent norms, requirements, verification, ethics, safety engineering |
| 2603.11915 | CoMMET: To What Extent Can LLMs Perform Theory of Mind Tasks? | cs.CL | 84 | New multimodal, multi-turn Theory-of-Mind benchmark; useful for social reasoning evals. | evaluation, benchmark, theory-of-mind, multimodal, multi-turn, llm-evals |
| 2603.11665 | Multi-Task Reinforcement Learning for Enhanced Multimodal LLM-as-a-Judge | cs.CL | 84 | Multi-task RL to improve MLLM-as-judge consistency and human correlation; relevant to eval reliability | LLM-as-judge, evaluation, reinforcement learning, multimodal, preference modeling |
| 2603.09835 | Chow-Liu Ordering for Long-Context Reasoning in Chain-of-Agents | cs.CL | 84 | Improves long-context agent pipelines by optimizing chunk order to reduce bounded-memory information loss. | long-context, agents, chain-of-agents, memory-bottleneck, ordering, reasoning |
| 2603.11799 | Exponential-Family Membership Inference: From LiRA and RMIA to BaVarIA | cs.LG, cs.CR | 83 | Unifies LiRA/RMIA/BASE MIAs; practical for privacy auditing of ML/LLMs via a single framework | privacy, membership-inference, auditing, security, theory, evaluation |
| 2603.09052 | From Days to Minutes: An Autonomous AI Agent Achieves Reliable Clinical Triage in Remote Patient Monitoring | cs.AI, cs.CL, cs.LG | 83 | Autonomous clinical triage agent w/ tools and clinician validation; real-world agent reliability. | agents, tool-use, evaluation, healthcare, reliability, MCP |
| 2603.12142 | Understanding Disclosure Risk in Differential Privacy with Applications to Noise Calibration and Auditing (Extended Version) | cs.CR, cs.IT | 82 | New DP disclosure-risk metric spanning membership/attribute/reconstruction; helps calibration & audits | differential-privacy, auditing, noise-calibration, inference-attacks, privacy-metrics |
| 2603.09192 | Explainable Innovation Engine: Dual-Tree Agent-RAG with Methods-as-Nodes and Verifiable Write-Back | cs.AI | 82 | Agent-RAG with auditable method provenance + verifier write-back; improves controllability and traceability. | RAG, agents, provenance, auditing, verification, knowledge base |
| 2603.11838 | DatedGPT: Preventing Lookahead Bias in Large Language Models with Time-Aware Pretraining | cs.CL, q-fin.GN | 82 | Time-aware pretraining to prevent lookahead bias; strong methodology for leakage control. | data-contamination, temporal-generalization, evaluation, pretraining, finance |
| 2603.09344 | Robust Regularized Policy Iteration under Transition Uncertainty | cs.AI, stat.ML | 82 | Robust offline RL via worst-case transition uncertainty; targets distribution shift failures. | offline-rl, robust-rl, distribution-shift, uncertainty, policy-iteration, safety |
| 2603.11545 | One Supervisor, Many Modalities: Adaptive Tool Orchestration for Autonomous Queries | cs.CL, cs.AI, cs.LG | 82 | Agentic tool orchestration across modalities with learned routing; strong efficiency metrics on 2,847 queries. | agents, tool-use, orchestration, routing, multimodal, systems |
| 2603.09454 | ShapeMark: Robust and Diversity-Preserving Watermarking for Diffusion Models | cs.CR | 80 | Diffusion watermarking that preserves diversity while improving robustness; provenance angle. | watermarking, diffusion-models, provenance, robustness, content-authenticity, security |
| 2603.12089 | EmbTracker: Traceable Black-box Watermarking for Federated Language Models | cs.CR | 79 | Client-traceable black-box watermarking for federated LMs; addresses model leakage accountability | watermarking, federated-learning, model-leakage, accountability, backdoors, security |
| 2603.09909 | MedMASLab: A Unified Orchestration Framework for Benchmarking Multimodal Medical Multi-Agent Systems | cs.AI | 79 | Unified benchmark/orchestration for multimodal medical multi-agent systems; standardizes eval. | multi-agent, benchmark, multimodal, evaluation, healthcare, orchestration |
| 2603.11721 | When OpenClaw Meets Hospital: Toward an Agentic Operating System for Dynamic Clinical Workflows | cs.AI | 78 | Proposes restricted-execution, doc-centric agent OS for hospitals; focuses on reliability/security needs | agents, healthcare, sandboxing, permissions, deployment, reliability |
| 2603.09152 | DataFactory: Collaborative Multi-Agent Framework for Advanced Table Question Answering | cs.AI, cs.DB, cs.IR | 78 | Multi-agent TableQA targeting context limits and hallucinations via coordinated DB/KG teams (claims; need details). | multi-agent, TableQA, hallucinations, tool use, structured data, ReAct |
| 2603.11689 | Explicit Logic Channel for Validation and Enhancement of MLLMs on Zero-Shot Tasks | cs.AI | 78 | Adds explicit logic/probabilistic reasoning channel to validate/enhance black-box MLLMs. | multimodal-llm, reasoning, verification, probabilistic-inference, zero-shot, reliability |
AI Paper Insight Brief
2026-03-15
1) Executive takeaways (read this first)
- “Governance + structure” is emerging as the antidote to brittle agent memory and RAG: multiple papers replace flat vector retrieval with structured, auditable memory (manifests, provenance trees, exact-match keys) and add explicit gates/verification to prevent drift, poisoning, and compounding errors.
- Evaluation is increasingly treated as a first-class system problem (not a metric): agentic evaluation planners, semantic judges, and synthetic validation protocols aim to make evaluations traceable, format-robust, and cheaper—and one health triage replication shows format alone can manufacture “failures.”
- Security threat models are widening from “LLM jailbreaks” to cross-stack reality: work on compound AI pipelines shows composed software/hardware gadgets (e.g., bit flips) can bypass guardrails; in parallel, malware defense is moving toward LLM-guided concolic exploration with formal guarantees.
- Robustness is being operationalized via uncertainty and worst-case optimization: offline RL robustifies against transition uncertainty with a tractable KL-regularized robust Bellman operator; preference-data generation uses uncertainty to actively select informative comparisons.
- Provenance and IP protection for generative models is maturing: diffusion watermarking shifts from fragile value-encoding to structural permutation encoding with diversity preservation; federated LMs get client-traceable black-box watermarking via embedding-space triggers.
2) Key themes (clusters)
Theme: Governed long-term memory for agents (drift/poisoning/auditability)
- Why it matters: As agents self-modify memory, failures persist and compound (drift, poisoning, privacy leakage). Governance layers and auditable substrates aim to make long-horizon behavior safer and debuggable.
- Representative papers:
- Governing Evolving Memory in LLM Agents (SSGM)
- PRECEPT: Planning Resilience via Experience, Context Engineering & Probing Trajectories
- When OpenClaw Meets Hospital: Toward an Agentic Operating System for Dynamic Clinical Workflows
- Common approach:
- Insert middleware/gates between cognition and memory (read filtering, write validation, reconciliation).
- Prefer structured memory (document trees/manifests, exact-match condition keys) over pure semantic similarity.
- Add provenance/immutability (append-only ledgers, audit trails) to enable rollback and accountability.
- Open questions / failure modes:
- Latency and scalability costs of governance checks; stability–plasticity trade-offs (over-blocking legitimate updates).
- Concurrency/conflict resolution for multi-writer memory (explicitly noted for hospital AOS-style designs).
- How to empirically validate drift/poisoning resistance on long-horizon benchmarks (SSGM is conceptual).
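The governance pattern above can be sketched in a few lines. This is a minimal illustration, not SSGM's actual design: the fact store, half-life constant, and threshold values are all invented for the example. It shows a read filter that downweights entries by exponential freshness decay and a write gate that blocks updates contradicting protected core facts.

```python
import time

# Hypothetical SSGM-style governance layer: a read filter with exponential
# freshness decay, and a write gate protecting immutable core facts.
PROTECTED_FACTS = {"patient_allergy": "penicillin"}  # assumed core facts
HALF_LIFE_S = 3600.0  # assumed freshness half-life (one hour)

def freshness(ts: float, now: float) -> float:
    """Decay weight in (0, 1]; newer entries score higher."""
    return 0.5 ** ((now - ts) / HALF_LIFE_S)

def read_filter(entries, now, min_weight=0.25):
    """Keep entries whose decayed relevance stays above a threshold."""
    return [e for e in entries if e["score"] * freshness(e["ts"], now) >= min_weight]

def write_gate(entry) -> bool:
    """Reject writes that contradict a protected core fact."""
    key = entry.get("key")
    if key in PROTECTED_FACTS and entry["value"] != PROTECTED_FACTS[key]:
        return False  # contradiction: block and queue for reconciliation
    return True

now = time.time()
memory = [
    {"key": "note", "value": "check labs", "score": 0.9, "ts": now - 600},
    {"key": "note", "value": "old plan", "score": 0.9, "ts": now - 7 * 3600},
]
fresh = read_filter(memory, now)  # the 7-hour-old entry falls below threshold
ok = write_gate({"key": "patient_allergy", "value": "none"})  # blocked
```

The same two hooks (read filtering, write validation) generalize: swap the decay function for ACL checks, or the contradiction test for an entailment model.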
Theme: Structured / multi-agent RAG for controllable synthesis (and hallucination reduction)
- Why it matters: Flat chunk RAG struggles with multi-step synthesis and provenance; multi-agent decomposition and method-level indexing aim to reduce hallucinations and improve traceability.
- Representative papers:
- Explainable Innovation Engine: Dual-Tree Agent-RAG with Methods-as-Nodes and Verifiable Write-Back
- DataFactory: Collaborative Multi-Agent Framework for Advanced Table Question Answering
- Chow-Liu Ordering for Long-Context Reasoning in Chain-of-Agents
- Common approach:
- Replace “retrieve chunks” with structured units (methods-as-nodes; SQL + KG modalities).
- Use orchestration policies (ReAct leaders; dependency-aware ordering) to manage bounded memory and tool calls.
- Add verification/provenance checks (conflict resolution, backtracking, rubric scoring).
- Open questions / failure modes:
- Cost/token overhead and interaction loops (explicitly noted for multi-agent TableQA).
- Sensitivity to representation choices (e.g., ordering gains depend on dense embeddings; BM25 weaker).
- Continual write-back risks: error amplification without falsifiers (Agent-RAG notes lack of falsification).
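The "structured units before similarity" idea can be sketched as a two-path retriever. This is an illustrative pattern (in the spirit of PRECEPT's exact rule retrieval), not any paper's implementation: the index contents, key scheme, and fallback store are invented. The point is that the structured path is tried first and the serving path is logged so error modes can be attributed.

```python
# Hypothetical two-path retriever: exact-match condition keys first,
# vector similarity as fallback, with per-path usage accounting.
from collections import Counter

exact_index = {("db", "orders", "schema"): "orders(id, user_id, total)"}
vector_store = ["orders table has columns id and total"]  # fallback corpus

path_stats = Counter()  # tracks which path answered each query

def retrieve(condition_key, query_text):
    hit = exact_index.get(condition_key)
    if hit is not None:
        path_stats["exact"] += 1
        return hit, "exact"
    # Fallback stub: a real system would embed query_text and rank by similarity.
    path_stats["vector"] += 1
    return vector_store[0], "vector"

doc, path = retrieve(("db", "orders", "schema"), "what columns does orders have?")
doc2, path2 = retrieve(("db", "users", "schema"), "what columns does users have?")
```

Tracking `path_stats` alongside downstream error labels is what lets you measure whether the structured path actually reduces hallucination relative to the similarity fallback.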
Theme: Evaluation infrastructure & “format realism” (judges, synthetic validation, orchestration)
- Why it matters: Deployment decisions need evaluations that are reproducible, format-robust, and aligned with real usage; brittle scaffolds can mis-measure safety.
- Representative papers:
- One-Eval: An Agentic System for Automated and Traceable LLM Evaluation
- LLM as a Meta-Judge: Synthetic Data for NLP Evaluation Metric Validation
- MedMASLab: A Unified Orchestration Framework for Benchmarking Multimodal Medical Multi-Agent Systems
- Evaluation format, not model capability, drives triage failure in the assessment of consumer health AI
- Common approach:
- Convert NL evaluation intent into traceable executable plans (benchmark resolution, schema normalization).
- Use semantic judging (VLM-based semantic judge; LLM adjudicators) to avoid regex/exact-match brittleness.
- Use synthetic controlled degradations to validate metric rankings via meta-correlation.
- Open questions / failure modes:
- Judge bias and single-judge dependence (MedMASLab; Meta-Judge variability by task/language/LLM).
- Evaluation scaffolds can be behaviorally active (triage replication shows forced A/B/C/D can dominate outcomes).
- Automation gaps remain (One-Eval full-plan success 84%; long-tail benchmark coverage).
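The synthetic-degradation validation protocol is easy to illustrate on a toy metric. This sketch is not Meta-Judge's pipeline: the "metric" here is plain unigram recall and the degradation is truncation, both chosen so the example runs standalone. The structure is the real point: apply degradations of known severity, score each with the metric under test, and check that metric ordering tracks severity (the meta-correlation).

```python
# Toy metric validation: controlled degradations of known severity should be
# ranked correctly by any metric worth trusting.
def unigram_recall(candidate: str, reference: str) -> float:
    ref = reference.split()
    cand = set(candidate.split())
    return sum(w in cand for w in ref) / len(ref)

reference = "the agent writes validated entries to an append only ledger"

def degrade(text: str, k: int) -> str:
    """Severity-k degradation: drop the last k words (a controlled corruption)."""
    words = text.split()
    return " ".join(words[: len(words) - k])

severities = [0, 2, 4, 6]
scores = [unigram_recall(degrade(reference, k), reference) for k in severities]

# Meta-correlation check (here: strict monotonicity with severity).
monotone = all(a > b for a, b in zip(scores, scores[1:]))
```

With an LLM judge in place of `unigram_recall`, the same harness measures whether the judge's rankings survive across languages and task types, which is exactly where the variability noted above shows up.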
Theme: Cross-stack security, privacy auditing, and provenance/watermarking
- Why it matters: Real systems fail at the seams—between models, tools, and hardware. Meanwhile, provenance and privacy auditing are becoming operational necessities.
- Representative papers:
- Cascade: Composing Software-Hardware Attack Gadgets…
- Synergistic Directed Execution and LLM-Driven Analysis for Zero-Day AI-Generated Malware Detection
- ShapeMark: Robust and Diversity-Preserving Watermarking for Diffusion Models
- EmbTracker: Traceable Black-box Watermarking for Federated Language Models
- Common approach:
- Treat security as system composition (gadget chains; concolic exploration guided by LLM priors).
- Move from fragile signals to structural encodings (permutation-based diffusion watermarking).
- Provide black-box verifiability and traceability (FedLM client attribution via trigger embeddings).
- Open questions / failure modes:
- Practicality of strong attacker assumptions (e.g., Rowhammer/co-location; precise bitflip control).
- Dependence on inversion quality (diffusion watermarking) and on trusted server/non-colluding clients (FedLM watermarking).
- Formal guarantees contingent on learned components (malware detection completeness depends on LLM ranking; soundness assumes classifier correctness).
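The black-box verification primitive shared by the watermarking papers reduces to a thresholded trigger test. The sketch below is illustrative only: the trigger set, the 80%-faithful stand-in model, and the gamma value are invented, and real schemes calibrate gamma against a population of non-watermarked models to hit a target FPR.

```python
# Hypothetical black-box watermark check: query a suspect model on a trigger
# set and flag ownership when the verification rate VR exceeds gamma.
TRIGGERS = [("trigger-key-%d" % i, "mark-%d" % i) for i in range(20)]
GAMMA = 0.6  # assumed threshold; calibrate to target FPR in practice

def suspect_model(prompt: str) -> str:
    # Stand-in for the model under audit; reproduces 80% of the marks.
    idx = int(prompt.rsplit("-", 1)[1])
    return "mark-%d" % idx if idx % 5 != 0 else "clean output"

def verification_rate(model) -> float:
    hits = sum(model(q) == expected for q, expected in TRIGGERS)
    return hits / len(TRIGGERS)

vr = verification_rate(suspect_model)
watermarked = vr > GAMMA
```

The calibration step is where the schemes differ most: ShapeMark reports tail-extrapolated FPR calibration, while EmbTracker's client attribution depends on per-client trigger embeddings surviving aggregation.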
Theme: Robust optimization & sample-efficient alignment signals (uncertainty, intrinsic quality)
- Why it matters: Robustness and alignment are shifting toward principled objectives (worst-case dynamics, uncertainty-driven data collection, intrinsic reasoning-quality signals) that reduce brittleness without massive human labeling.
- Representative papers:
- Robust Regularized Policy Iteration under Transition Uncertainty
- ActiveUltraFeedback: Efficient Preference Data Generation using Active Learning
- Good Reasoning Makes Good Demonstrations: Implicit Reasoning Quality Supervision via In-Context Reinforcement Learning
- Common approach:
- Use uncertainty estimates (transition ensembles; epistemic reward uncertainty) to drive conservative planning or active sampling.
- Replace expensive step labels with implicit signals (Evidence Gain via in-context learning utility).
- Provide theoretical operators/identities to justify training objectives (robust Bellman contraction; Bayesian identity for reward reweighting).
- Open questions / failure modes:
- Compute intensity (ActiveUltraFeedback reports ~200k GPU hours; robust RL needs ensembles).
- Generality beyond tested domains (Evidence Gain shown on math; robust RL uncertainty set approximation).
- Reliance on LLM judges for “annotations” in preference pipelines.
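Uncertainty-driven acquisition reduces to a simple loop: score candidate pairs with an ensemble, label the highest-disagreement pairs first. The sketch below is a toy in the spirit of ActiveUltraFeedback, not its DRTS/DELTAUCB acquisition functions; the deterministic "ensemble" is a stand-in so the example runs without models.

```python
# Toy active acquisition: rank candidate preference pairs by ensemble
# disagreement (an epistemic-uncertainty proxy) and label the top ones first.
from statistics import pstdev

def ensemble_scores(pair_id: int):
    """Stand-in for K=4 reward models scoring a response pair's margin."""
    return [((pair_id * (k + 3)) % 7) / 7.0 for k in range(4)]

candidate_pairs = list(range(10))
uncertainty = {p: pstdev(ensemble_scores(p)) for p in candidate_pairs}
acquisition_order = sorted(candidate_pairs, key=uncertainty.get, reverse=True)
most_informative = acquisition_order[0]  # labeled first under a fixed budget
```

At a fixed label budget, the comparison that matters is downstream: reward-model accuracy (or DPO performance) trained on the acquired pairs versus the same budget of randomly sampled pairs.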
3) Technical synthesis
- Structured memory is converging on “append-only provenance + mutable working set”: AOS-H uses append-only document mutation trails; SSGM proposes an immutable episodic ledger plus mutable active graph with periodic reconciliation.
- “Gating” is the common safety primitive across domains: memory write validation (SSGM), forbidden-set pruning (PRECEPT), verifiable write-back thresholds (Agent-RAG), and black-box verification thresholds (EmbTracker VR>γ; ShapeMark calibrated FPR).
- Bounded-context reasoning is being attacked at the ordering/orchestration layer: CL–ORDER optimizes chunk order under memory bottlenecks; DataFactory and Supervisor-style systems route tasks to specialized modules/tools to avoid monolithic context overload.
- Evaluation is moving from scalar scores to traceable pipelines: One-Eval’s BenchInfo artifacts and MedMASLab’s unified I/O + ledgers mirror the auditability goals in memory governance papers.
- Semantic judges are replacing brittle exact-match metrics: MedMASLab’s VLM-SJ and the triage replication’s adjudication pipeline highlight that “format” can dominate measured performance.
- Robustness via worst-case selection appears in both RL and security: RRPI selects worst-case ensemble dynamics during backups; Cascade composes worst-case gadget chains; CogniCrypt prioritizes worst-case (most malicious) paths via LLM scoring.
- Bayesian thinking is reappearing as a stabilizer: PRECEPT uses Beta priors + Thompson sampling for source reliability; BaVarIA uses NIG shrinkage for variance in membership inference; RAD formalizes advantage with auxiliary knowledge in DP.
- Provenance/traceability is being engineered into generative systems: ShapeMark preserves diversity while enabling extremely low-FPR detection; EmbTracker adds client-level attribution in federated settings.
- Time as a first-class axis of robustness: DatedGPT trains year-cutoff model families and uses perplexity reversal to probe temporal leakage; SSGM uses freshness decay in read filtering.
- Clinical agent work is bifurcating into “agent performance” vs “agent infrastructure”: Sentinel shows retrospective triage performance with tool retrieval; AOS-H focuses on OS-level constraints and auditable workflows (no empirical results reported).
4) Top 5 papers (with “why now”)
1) Cascade: Composing Software-Hardware Attack Gadgets for Adversarial Threat Amplification in Compound AI Systems
- Shows cross-layer attack chains (algorithmic + CVEs + hardware primitives) that bypass guardrails in compound pipelines.
- Reports guardrail evasion rates under bitflip strategies (e.g., 82%/72%/94% in Table 1) and 80% jailbreak success with long runtimes.
- Useful now because real deployments are increasingly compound systems, and model-only red-teaming misses critical paths.
- Be skeptical about: feasibility assumptions (co-location / fault injection control; transfer to proprietary stacks not fully demonstrated).
2) Synergistic Directed Execution and LLM-Driven Analysis for Zero-Day AI-Generated Malware Detection
- Combines LLM-guided path prioritization + concolic execution + transformer classifier + RL refinement for AI-generated malware.
- Reports 97.5% accuracy on a 2,500-sample AI-Gen-Malware dataset and 73.2% fewer paths to reach 95% coverage vs DFS.
- Why now: LLM-enabled malware increases polymorphism and trigger-based behavior; this targets zero-day + evasive patterns.
- Be skeptical about: guarantees depend on classifier correctness and LLM ranking (relative completeness requires malicious path in top‑B); heavy hardware requirements.
3) ShapeMark: Robust and Diversity-Preserving Watermarking for Diffusion Models
- Introduces structural permutation encoding (SE) and payload-debiasing randomization (PDSR) to preserve diversity.
- Reports TPR ~1.000 clean / 0.999 attacked at FPR 1e‑6 and strong per-bit recovery (0.987 attacked), plus high LPIPS diversity.
- Why now: provenance demands are rising; prior NaW schemes trade robustness for diversity.
- Be skeptical about: dependence on inversion quality and calibration via tail extrapolation; limited evaluation against adaptive spoofing/removal.
4) ActiveUltraFeedback: Efficient Preference Data Generation using Active Learning
- Treats preference collection as active selection with uncertainty (ENN) and proposes DRTS/DELTAUCB acquisition.
- Reports sample efficiency: matches or outperforms baselines with ~1/3 of the data in the body experiments (the abstract claims 1/6) across reward modeling and DPO/IPO/SimPO.
- Why now: preference data is a major scaling bottleneck; active acquisition is a direct lever.
- Be skeptical about: reliance on an LLM judge for annotations and very high compute (~200k GPU hours).
5) Evaluation format, not model capability, drives triage failure in the assessment of consumer health AI
- Mechanistic replication shows naturalistic free-text prompts improve triage accuracy vs exam scaffold (+6.4pp, p=0.015).
- Isolates forced A/B/C/D discretization as dominant failure mechanism (e.g., GPT‑5.2 asthma 16% vs 100%).
- Why now: health AI regulation and public narratives hinge on evaluation results; this demonstrates protocol sensitivity.
- Be skeptical about: small 17-scenario bank; adjudication via LLM judges; not testing the deployed ChatGPT Health product directly.
5) Practical next steps
- For agent memory safety: prototype a governance layer with (a) read freshness decay + ACL filtering and (b) write-time contradiction checks against protected core facts; measure drift over long-horizon tasks with periodic reconciliation (SSGM-style).
- For RAG/agent reliability: add a “structured retrieval” path (exact-match keys or manifest/provenance navigation) before vector similarity; track when each path is used and its error modes (PRECEPT/AOS-H patterns).
- For evaluation pipelines: run A/B tests where only output format constraints change (forced-choice vs free text; regex vs semantic judge) to quantify format-induced failure (triage replication + MedMASLab lesson).
- For compound-system security: expand red-teaming beyond prompts—inventory software CVEs, toolchain dependencies, and hardware assumptions; attempt gadget-chain exercises (Cascade framing) and log which layer breaks first.
- For provenance: if deploying diffusion generation, test structural watermarking under your real post-processing pipeline (compression, resizing) and validate calibration at your target FPR; separately, for federated fine-tuning, evaluate black-box traceability via trigger queries (EmbTracker).
- For alignment data efficiency: replace static pair sampling with uncertainty-driven acquisition; compare downstream RM and DPO performance at fixed label budgets (ActiveUltraFeedback).
- For robust decision-making: in offline RL or agent planning under model uncertainty, test worst-case ensemble selection vs average-model planning and monitor whether Q-values drop in high-uncertainty regions (RRPI diagnostic).
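The last diagnostic can be made concrete with a one-state example. All numbers and state names below are invented; the sketch only shows the mechanics of a worst-case ensemble backup versus an average-model backup, and the gap that opens up where the ensemble disagrees (the RRPI-style signal to monitor).

```python
# Worst-case-ensemble vs average-model Bellman backup for one (s, a) pair.
GAMMA = 0.9
V_NEXT = {"s_safe": 1.0, "s_risky": -1.0}  # assumed next-state values

# Ensemble of transition models P(next_state | s, a); disagreement is the
# epistemic-uncertainty signal.
ensemble = [
    {"s_safe": 0.9, "s_risky": 0.1},
    {"s_safe": 0.7, "s_risky": 0.3},
    {"s_safe": 0.5, "s_risky": 0.5},
]
reward = 0.0

def backup(p):
    return reward + GAMMA * sum(p[s] * V_NEXT[s] for s in p)

worst_case_q = min(backup(p) for p in ensemble)   # robust (pessimistic) backup
average_q = backup({s: sum(p[s] for p in ensemble) / len(ensemble)
                    for s in V_NEXT})
uncertainty_gap = average_q - worst_case_q        # grows with disagreement
```

Logging `uncertainty_gap` per state-action region is the monitoring hook: regions where the gap is large are exactly where average-model planning is most likely to be optimistic.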
Generated from per-paper analyses; no external browsing.
