Daily AI Paper Report (2026-03-22)
Published:
Chinese version: [中文]
Run stats
- Candidates: 1253
- Selected: 30
- Deepread completed: 30
- Window (UTC): 2026-03-20T00:00:00Z → 2026-03-21T00:00:00Z (weekend_backlog_unknown, expanded=0)
Selected papers
| arXiv ID | Title / Links | Categories | Score | Why | Tags |
|---|---|---|---|---|---|
2603.14987 | Beyond Benchmark Islands: Toward Representative Trustworthiness Evaluation for Agentic AI | cs.CL, cs.DB | 93 | Argues for representative trustworthiness eval for agentic AI; proposes HAA framework. | agent-evaluation, trustworthiness, sociotechnical, benchmarks, agents |
2603.19011 | Security awareness in LLM agents: the NDAI zone case | cs.CR, cs.AI | 92 | Measures whether LLM agents can infer secure vs insecure execution; key for TEE/tool-use safety. | agent-security, TEE, situational-awareness, evaluation, tool-use |
2603.18577 | MedForge: Interpretable Medical Deepfake Detection via Forgery-aware Reasoning | cs.AI | 92 | Large benchmark + grounded reasoning for medical deepfake detection; strong safety relevance | deepfake-detection, multimodal, benchmark, grounded-reasoning, medical-safety, localization |
2603.15542 | InterveneBench: Benchmarking LLMs for Intervention Reasoning and Causal Study Design in Real Social Systems | cs.CY, cs.AI | 92 | InterveneBench: 744 real studies to test LLM causal intervention & design reasoning; strong eval gap. | benchmark, evaluation, causal-reasoning, interventions, social-science, LLM |
2603.14761 | BrainBench: Exposing the Commonsense Reasoning Gap in Large Language Models | cs.AI | 92 | New commonsense benchmark; shows big gaps on brainteasers even for frontier LLMs. | evaluation, commonsense, reasoning, benchmark, robustness |
2603.17623 | ARES: Scalable and Practical Gradient Inversion Attack in Federated Learning through Activation Recovery | cs.LG, cs.CR | 92 | Practical gradient inversion attack (no arch mods) reconstructs data from large FL batches. | federated-learning, privacy, gradient-inversion, security, data-leakage, attack |
2603.14730 | GNNVerifier: Graph-based Verifier for LLM Task Planning | cs.LG | 91 | Non-LLM graph verifier for LLM plans; targets structural hallucinations & dependency errors in agents | agents, planning, verification, hallucinations, graph-methods, robustness |
2603.15397 | SFCoT: Safer Chain-of-Thought via Active Safety Evaluation and Calibration | cs.CR, cs.AI | 90 | Monitors/calibrates unsafe intermediate CoT steps to resist jailbreaks, not just final output. | jailbreaks, chain-of-thought, safety-monitoring, calibration, defense |
2603.15615 | Mechanistic Origin of Moral Indifference in Language Models | cs.CL, cs.AI | 90 | Mechanistic study of moral concept collapse + latent “moral indifference”; proposes representation fix. | mechanistic-interpretability, alignment, representations, moral-reasoning, safety |
2603.18895 | From Accuracy to Readiness: Metrics and Benchmarks for Human-AI Decision-Making | cs.HC, cs.AI, cs.LG | 90 | Practical readiness metrics for human-AI teaming; targets miscalibrated reliance & safety signals. | human-AI teaming, evaluation, calibration, reliance, safety-metrics, deployment |
2603.17948 | VideoAtlas: Navigating Long-Form Video in Logarithmic Compute | cs.CV, cs.AI | 90 | Hierarchical lossless video representation enabling long-video navigation with log compute. | long-context, video, agents, memory, efficient-inference, multimodal |
2603.18767 | A Concept is More Than a Word: Diversified Unlearning in Text-to-Image Diffusion Models | cs.AI | 89 | Improves diffusion concept unlearning beyond keywords; reduces brittle/over-forgetting in safety edits | diffusion, unlearning, content-safety, model-editing, robustness |
2603.15364 | CRASH: Cognitive Reasoning Agent for Safety Hazards in Autonomous Driving | cs.AI, cs.CL | 89 | LLM agent for AV incident analysis + curated 2,168-case dataset; practical safety auditing | agent, autonomous-driving, safety, incident-analysis, dataset, LLM |
2603.15372 | SKILLS: Structured Knowledge Injection for LLM-Driven Telecommunications Operations | cs.SE, cs.AI, cs.CR | 88 | Tool-using LLM agent benchmark with live mock APIs + deterministic rubrics for telecom ops. | agents, tool-use, benchmark, evaluation, enterprise, APIs |
2603.14778 | $p^2$RAG: Privacy-Preserving RAG Service Supporting Arbitrary Top-$k$ Retrieval | cs.CR, cs.AI | 88 | Privacy-preserving RAG enabling arbitrary top-k without costly secure sorting; practical for LLM apps. | RAG, privacy, secure-retrieval, cryptography, deployment |
2603.17759 | Harm or Humor: A Multimodal, Multilingual Benchmark for Overt and Covert Harmful Humor | cs.CL, cs.AI | 88 | Multimodal+multilingual benchmark for harmful humor incl. covert harm; strong safety eval value | AI safety, benchmark, harmful content, multimodal, multilingual, humor, toxicity detection, Arabic |
2603.17683 | Sensi: Learn One Thing at a Time -- Curriculum-Based Test-Time Learning for LLM Game Agents | cs.AI, cs.LG | 88 | Structured test-time learning for LLM game agents; curriculum + steerable context control-plane. | llm-agents, test-time-learning, curriculum-learning, agent-architecture, memory, evaluation |
2603.18680 | Revisiting Label Inference Attacks in Vertical Federated Learning: Why They Are Vulnerable and How to Defend | cs.LG, cs.CR | 88 | Reframes label inference in VFL via mutual info; explains vulnerabilities and proposes defenses. | vertical-federated-learning, privacy, label-inference, mutual-information, defense |
2603.18793 | Functional Subspace Watermarking for Large Language Models | cs.CR, cs.AI | 86 | LLM watermarking robust to fine-tune/quantize/distill by anchoring signals in functional subspace. | watermarking, model-ownership, robustness, LLMs, security |
2603.14756 | Towards Privacy-Preserving Machine Translation at the Inference Stage: A New Task and Benchmark | cs.CL, cs.AI | 86 | Defines inference-time privacy task+benchmark for MT; fills evaluation gap for privacy-preserving NLP | privacy, machine-translation, benchmark, inference, evaluation |
2603.14771 | OpenHospital: A Thing-in-itself Arena for Evolving and Benchmarking LLM-based Collective Intelligence | cs.AI | 86 | Interactive arena to evolve/benchmark multi-agent collective intelligence; strong eval framing. | agents, multi-agent, collective-intelligence, benchmark, evaluation, healthcare |
2603.14911 | Fine-tuning RoBERTa for CVE-to-CWE Classification: A 125M Parameter Model Competitive with LLMs | cs.CR, cs.CL | 86 | CVE→CWE classifier competitive with LLMs; large dataset + strong macro-F1 on rare classes. | cybersecurity, vulnerability-classification, CVE, CWE, robustness, dataset |
2603.14855 | PCodeTrans: Translate Decompiled Pseudocode to Compilable and Executable Equivalent | cs.SE, cs.AI | 86 | Feedback + dynamic validation to prevent semantic hallucinations in decompiled code recovery. | code, verification, hallucinations, program-synthesis, security |
2603.15566 | Lore: Repurposing Git Commit Messages as a Structured Knowledge Protocol for AI Coding Agents | cs.SE, cs.AI, eess.SY | 86 | Practical protocol to preserve agent coding rationale in git; improves auditability & safer agent workflows | coding-agents, software-engineering, auditability, agent-workflows, knowledge-management, tooling |
2603.09253 | Efficient Reasoning at Fixed Test-Time Cost via Length-Aware Attention Priors and Gain-Aware Training | cs.LG | 86 | Training-only priors for efficient reasoning at fixed test-time compute; broadly reusable. | efficient-reasoning, test-time-compute, attention, training-tricks, transformers |
2603.18570 | Attack by Unlearning: Unlearning-Induced Adversarial Attacks on Graph Neural Networks | cs.LG, cs.CR | 85 | Shows approximate unlearning can be weaponized into attacks; introduces unlearning corruption. | machine-unlearning, adversarial-attacks, privacy, GNNs, security |
2603.17522 | Detecting the Machine: A Comprehensive Benchmark of AI-Generated Text Detectors Across Architectures, Domains, and Adversarial Conditions | cs.CL, cs.AI | 84 | Broad benchmark of AI-text detectors across domains/LLMs with adversarial conditions; useful for eval. | evaluation, AI-generated-text, robustness, adversarial, benchmark |
2603.18538 | Beyond Passive Aggregation: Active Auditing and Topology-Aware Defense in Decentralized Federated Learning | cs.LG, stat.ME | 84 | Active auditing metrics + topology-aware defenses for decentralized FL backdoors; practical security angle | federated-learning, backdoors, auditing, anomaly-detection, security, graph-topology |
2603.19182 | Box Maze: A Process-Control Architecture for Reliable LLM Reasoning | cs.AI, cs.CL | 84 | Process-control architecture to reduce hallucination/adversarial failures; safety-oriented framing | LLM-safety, hallucination, robustness, process-supervision, architecture, adversarial |
2603.15421 | CLAG: Adaptive Memory Organization via Agent-Driven Clustering for Small Language Model Agents | cs.CL, cs.AI | 84 | Agent memory clustering to reduce irrelevant/corrupt context; practical for small-model agents | agents, memory, retrieval, small language models, RAG, context management, robustness |
AI Paper Insight Brief
2026-03-22
0) Executive takeaways (read this first)
- Verification is shifting from “ask another LLM” to structured, inspectable signals: graph-structured plan verification with node/edge risk (GNNVerifier) and stepwise CoT safety scoring + intervention (SFCoT) both show large robustness gains versus prompt-only baselines.
- Privacy/security work is becoming more “systems-realistic”: private RAG now targets arbitrary large top‑k efficiently (p²RAG), FL attacks remove “architecture modification” assumptions (ARES), and VFL defenses exploit where label information actually lives (move the cut layer).
- Benchmarks are getting more diagnostic (and more multi-dimensional): BrainBench separates accuracy vs consistency (stochasticity), harmful-humor adds multimodal + Arabic + implicit harm, and AI-text detection is stress-tested under length-matching + domain shift + adversarial rewriting.
- Agent reliability bottlenecks are increasingly about representation and memory organization: CLAG’s cluster-local memory evolution improves SLM robustness and reduces latency; the “moral indifference” work argues behavioral alignment can leave latent geometry misaligned, and shows SAE-based steering improves adversarial safety metrics.
- Execution-grounded feedback loops beat static checks in code/security pipelines: PCodeTrans uses in-situ binary substitution + ASan + differential tracing to drive LLM repair to near-perfect function-level equivalence on coreutils/binutils.
2) Key themes (clusters)
Theme: Structured verification & process-level safety for agents
- Why it matters: Agent failures often come from cross-step structure (plans) or intermediate reasoning (CoT) that final-answer filters miss. Verifiers that expose where things go wrong enable targeted fixes and safer autonomy.
- Representative papers:
- GNNVerifier: Graph-based Verifier for LLM Task Planning
- SFCoT: Safer Chain-of-Thought via Active Safety Evaluation and Calibration
- Beyond Benchmark Islands: Toward Representative Trustworthiness Evaluation for Agentic AI
- Common approach:
- Convert unstructured agent artifacts into structured objects (plan graphs; stepwise CoT segments; scenario distributions).
- Produce localized diagnostics (node/edge risk; per-step safety scores) and gate edits/continuations on verifier signals.
- Use synthetic supervision / controlled perturbations when real fine-grained labels are missing (plan-graph perturbations; scenario suites).
- Open questions / failure modes:
- Synthetic perturbations may not match real planner errors (distribution gap in GNNVerifier).
- Runtime overhead and scalability of stepwise CoT evaluation + paraphrase variance checks (SFCoT doesn’t report latency).
- “Representative scenario sampling” remains under-validated at scale (HAAF demo is 24 scenarios, single model).
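The "accept edits only if the verifier score improves" rule common to this theme can be sketched end to end. The sketch below is a toy, not any paper's released code: `score_plan` stands in for a plan-graph verifier (here it just checks dependency integrity) and `propose_edits` stands in for a local-edit generator.

```python
# Verifier-gated plan repair: propose local edits, accept one only when the
# (hypothetical) verifier score strictly improves. Plans are modeled as
# dependency-annotated step lists.

def score_plan(plan):
    # Toy verifier: fraction of steps whose declared dependencies all exist.
    steps = {s["id"] for s in plan}
    ok = sum(1 for s in plan if all(d in steps for d in s["deps"]))
    return ok / max(len(plan), 1)

def propose_edits(plan):
    # Toy local edits: try dropping each step in turn.
    for i in range(len(plan)):
        yield plan[:i] + plan[i + 1:]

def repair(plan, max_rounds=5):
    best, best_score = plan, score_plan(plan)
    for _ in range(max_rounds):
        improved = False
        for candidate in propose_edits(best):
            s = score_plan(candidate)
            if s > best_score:          # acceptance test: score must improve
                best, best_score = candidate, s
                improved = True
                break
        if not improved:
            break
    return best, best_score

plan = [
    {"id": "fetch", "deps": []},
    {"id": "parse", "deps": ["fetch"]},
    {"id": "send",  "deps": ["approve"]},   # dangling dependency
]
fixed, score = repair(plan)
```

A real verifier would score node/edge risk with a trained GNN rather than a rule, but the gating loop (and its stopping condition) is the reusable part.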
Theme: Privacy-preserving inference & leakage-aware ML systems
- Why it matters: As LLM services and federated learning move into sensitive domains, practical privacy hinges on inference-time leakage (RAG/MT) and realistic attacker capabilities (FL/VFL).
- Representative papers:
- $p^2$RAG: Privacy-Preserving RAG Service Supporting Arbitrary Top-$k$ Retrieval
- Towards Privacy-Preserving Machine Translation at the Inference Stage: A New Task and Benchmark
- ARES: Scalable and Practical Gradient Inversion Attack in Federated Learning through Activation Recovery
- Revisiting Label Inference Attacks in Vertical Federated Learning: Why They Are Vulnerable and How to Defend
- Common approach:
- Replace “one-size” cryptographic primitives with protocol redesign (p²RAG bisection thresholding vs secure sorting).
- Define task + metric to make privacy measurable (PPMT + PER).
- Use information-theoretic / mechanistic analysis to explain leakage (VFL MI-by-depth; FL activation recovery framed as sparse recovery).
- Open questions / failure modes:
- Deployment assumptions: trusted dealer + two non-colluding servers (p²RAG); malicious server can set weights/biases (ARES).
- NER quality dominates privacy exposure in implicit MT settings (PPMT).
- Defenses can shift leakage elsewhere (VFL cut-layer defense may increase feature leakage risk).
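The bisection idea behind threshold-style top-k selection can be illustrated in plaintext. In the actual p²RAG protocol the comparisons and counts run under secret sharing; this sketch only shows the control flow that replaces a full secure sort, and the data is invented.

```python
# Plaintext sketch of bisection top-k: binary-search a score threshold until
# roughly k items survive, using only count queries (each of which would be
# one secure-counting round in the real protocol) instead of a sort.

def topk_by_bisection(scores, k, iters=40):
    lo, hi = min(scores), max(scores)
    for _ in range(iters):
        mid = (lo + hi) / 2
        count = sum(s >= mid for s in scores)   # one "counting" round
        if count > k:
            lo = mid      # threshold too low: too many survivors
        elif count < k:
            hi = mid      # threshold too high: too few survivors
        else:
            break
    return [i for i, s in enumerate(scores) if s >= mid]

scores = [0.91, 0.12, 0.55, 0.87, 0.33, 0.70]
idx = topk_by_bisection(scores, k=3)
```

Note that tied scores can leave slightly more than k survivors, which is one intuition for a k+ξ-style functional-leakage term: the protocol reveals a threshold set, not an exact ranking.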
Theme: Memory, long-context navigation, and fixed-compute efficiency
- Why it matters: Many failures are “context management” failures—retrieving distractors, losing long-horizon dependencies, or paying linear cost for long inputs (text/video). New work targets structure and budgeting under fixed compute.
- Representative papers:
- CLAG: Adaptive Memory Organization via Agent-Driven Clustering for Small Language Model Agents
- VideoAtlas: Navigating Long-Form Video in Logarithmic Compute
- Efficient Reasoning at Fixed Test-Time Cost via Length-Aware Attention Priors and Gain-Aware Training
- Common approach:
- Add hierarchy: cluster-first retrieval (CLAG), recursive K×K grids for video (VideoAtlas), length-aware attention priors (RPA).
- Keep inference cost fixed or bounded via cached biases / depth budgets / two-stage retrieval.
- Use training-only controllers or external orchestration to preserve gains without inference overhead (Guardian; Master–Worker; cluster evolution).
- Open questions / failure modes:
- Generalization beyond the demonstrated settings (RPA+Guardian only on WikiText-2; VideoAtlas mainly MCQ QA).
- Router/prompt sensitivity and distribution shift for clustered memory (CLAG).
- “No measurable latency change” claims without microbenchmarks (RPA).
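The cluster-first, two-stage retrieval pattern behind this theme is simple to sketch. Everything below is toy data under assumed structure: 2-D "embeddings", hand-picked centroids, and two invented clusters; a real system would use learned embeddings and an agent-driven clustering step.

```python
# Two-stage retrieval: (1) route the query to the nearest cluster centroid,
# (2) search only inside that cluster's memory. Stage 1 shrinks the search
# space, which is where the latency win comes from.

def nearest(vec, candidates, key):
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(candidates, key=lambda c: dist(vec, key(c)))

clusters = {
    "billing": {"centroid": (1.0, 0.0),
                "items": [((0.9, 0.1), "invoice #42 overdue"),
                          ((0.8, 0.2), "refund policy v3")]},
    "infra":   {"centroid": (0.0, 1.0),
                "items": [((0.1, 0.9), "db failover runbook")]},
}

def retrieve(query_vec):
    # Stage 1: compare against centroids only.
    name = nearest(query_vec, clusters, key=lambda n: clusters[n]["centroid"])
    # Stage 2: exact search restricted to the routed cluster.
    _, text = nearest(query_vec, clusters[name]["items"], key=lambda it: it[0])
    return name, text

cluster, memory = retrieve((0.88, 0.12))
```

The router/prompt-sensitivity caveat above maps directly onto stage 1: if the centroid routing is wrong, stage 2 can only return distractors.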
Theme: Benchmarks that expose reliability gaps (stochasticity, shift, implicit harm)
- Why it matters: Many “solved” metrics hide brittleness: stochastic reasoning, domain shift, implicit harms, and adversarial rewriting. New benchmarks isolate these axes.
- Representative papers:
- BrainBench: Exposing the Commonsense Reasoning Gap in Large Language Models
- Detecting the Machine: A Comprehensive Benchmark of AI-Generated Text Detectors Across Architectures, Domains, and Adversarial Conditions
- Harm or Humor: A Multimodal, Multilingual Benchmark for Overt and Covert Harmful Humor
- InterveneBench: Benchmarking LLMs for Intervention Reasoning and Causal Study Design in Real Social Systems
- Common approach:
- Measure more than accuracy: consistency vs accuracy (BrainBench), multi-metric detector evaluation (AUROC/AUPRC/EER/Brier/FPR@95), implicit vs explicit recall (harmful humor), rubric-based causal design scoring (InterveneBench).
- Stress distribution shift (cross-domain detectors; Chinese translation; Arabic + multimodal).
- Use multi-run protocols to quantify variance and reliability (BrainBench 10 runs/question).
- Open questions / failure modes:
- Small dataset sizes in some diagnostics (BrainBench 100 questions).
- Benchmark dependence on automated judges and proprietary models (harmful humor video; BrainBench judge).
- Detector generalization remains poor across domain+generator simultaneously (AI-text detection).
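The accuracy-vs-consistency split from the multi-run protocol is easy to compute once per-run correctness flags exist. The sketch below assumes a BrainBench-style setup (several sampled generations per question, graded to 0/1); the `runs` data is synthetic.

```python
# Multi-run reliability: accuracy = mean per-run correctness;
# consistency = fraction of questions answered correctly in EVERY run.
# A positive accuracy-consistency gap flags stochastic reasoning.

from statistics import mean

runs = {                       # question -> per-run correctness flags
    "q1": [1, 1, 1, 1, 1],
    "q2": [1, 0, 1, 1, 0],     # right on average, not reliably
    "q3": [0, 0, 0, 0, 0],
}

accuracy = mean(mean(flags) for flags in runs.values())
consistency = mean(all(flags) for flags in runs.values())
gap = accuracy - consistency
```

On this toy data the gap is 20 pp, entirely driven by q2: a question that looks 60% "solved" by accuracy alone but is never solved reliably.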
Theme: Security & provenance for models and ML pipelines
- Why it matters: As models are fine-tuned, quantized, distilled, and redistributed, provenance and robustness need to survive realistic transformations; meanwhile new attack surfaces emerge (unlearning, decentralized FL).
- Representative papers:
- Functional Subspace Watermarking for Large Language Models
- Beyond Passive Aggregation: Active Auditing and Topology-Aware Defense in Decentralized Federated Learning
- Attack by Unlearning: Unlearning-Induced Adversarial Attacks on Graph Neural Networks
- Common approach:
- Target invariance under transformations (FSW uses a Fisher-vs-compression GEVP to find stable subspaces).
- Model dynamics and triggers rather than static metrics (DFL diffusion + active probes; unlearning as the “trigger”).
- Combine theory + empirical stress tests across attacks/defenses (FSW robustness tables; DFL ASR/ACC; unlearning collapse).
- Open questions / failure modes:
- Watermark threat model limits: robustness assumes functional backbone preserved; payload capacity ~16 bits (FSW).
- Active probing cost and non-IID limits in decentralized FL (DFL defense notes extreme non-IID breaks discriminability).
- Unlearning attacks depend on surrogate fidelity and are shown on academic graphs (Cora/Citeseer/Pubmed/Flickr).
3) Technical synthesis
- “Structure-first” is a recurring pattern: plans→graphs (GNNVerifier), CoT→steps (SFCoT), memory→clusters (CLAG), video→recursive grids (VideoAtlas). The shared bet is that explicit structure enables better diagnostics, gating, and compute control.
- Synthetic supervision is becoming the default when fine-grained labels are missing: plan perturbations (REPLACE/DROP/COMPRESS), sandbox scenarios (HAAF), synthetic patients (OpenHospital), medical forgery generation (MedForge-90K).
- Verification loops increasingly require acceptance criteria: GNNVerifier accepts edits only if graph score improves; SFCoT rewrites/truncates based on per-step safety; PCodeTrans iterates until tests + ASan/BP-Diff pass.
- Compute budgeting is being formalized as a first-class knob: VideoAtlas depth bound d; RPA cached bias + training-only controller; CLAG two-stage retrieval reduces search space and latency.
- Information localization matters for privacy: VFL shows label information concentrates in deeper/top layers; defenses can be structural (cut-layer placement) rather than noise-only.
- Attack realism is increasing: ARES assumes attacker can set weights/biases (no architecture change) and uses sparse recovery; unlearning corruption uses legally-mandated deletion as the trigger; p²RAG targets arbitrary top‑k (practical long-context use).
- Reliability is being measured as variance, not just mean: BrainBench’s accuracy–consistency gap (10.3 pp average) highlights stochastic reasoning as a safety/reliability axis.
- “Judge models” are everywhere, but with different roles: grading (InterveneBench), disclosure scoring (NDAI-zone study), reasoning quality (MedForge), and BrainBench answer judging—raising a cross-cutting concern about judge bias and reproducibility.
- Execution-grounded evaluation is a strong differentiator: PCodeTrans uses the original binary + official test suites as an oracle; this is a template for reducing “semantic hallucination” in code transformations.
4) Top 5 papers (with “why now”)
1) GNNVerifier: Graph-based Verifier for LLM Task Planning
- Adds a graph-structured verifier that scores whole plans and localizes risky nodes/edges (tool/step mismatches, dependency issues).
- Uses synthetic perturbations to create node/edge supervision where real labels are missing, enabling diagnosis heads.
- Demonstrates verification-guided local edits (replace/insert) accepted only when the verifier score improves; reports consistent gains vs VeriPlan across datasets/planners.
- Skepticism: synthetic error distribution may not match real planner failures; no live tool-execution evaluation.
2) $p^2$RAG: Privacy-Preserving RAG Service Supporting Arbitrary Top-$k$ Retrieval
- Replaces secure sorting with interactive bisection to support arbitrary/large k efficiently—aligned with long-context LLM trends.
- Uses standard MPC primitives (Shamir sharing, Beaver triples, DCFs) and reports 3–300× speedups vs PRAG for k=16–1024.
- Provides explicit leakage bounds (physical leakage O(log²N) + functional leakage k+ξ).
- Skepticism: assumes trusted dealer + two non-colluding semi-honest servers; PIR and offline stages not benchmarked.
3) SFCoT: Safer Chain-of-Thought via Active Safety Evaluation and Calibration
- Moves safety from final-output filtering to stepwise CoT monitoring with lexical/semantic/policy scoring and gray-zone calibration.
- Reports a large jailbreak reduction: ASR 58.97% → 12.31%, while preserving ~91.2% average utility on MMLU/GSM8K/MBPP.
- Ablations attribute gains to the consistency verifier and rewrite intervention.
- Skepticism: runtime/latency overhead not reported; evaluated on a single model (Qwen3-8B).
4) PCodeTrans: Translate Decompiled Pseudocode to Compilable and Executable Equivalent
- Introduces in-situ substitutable execution: hot-swap repaired functions into the original binary to use real execution as an equivalence oracle.
- Uses ASan (substitute-only) + breakpoint-matched differential tracing to generate actionable runtime deltas for iterative LLM repair.
- Achieves 100% function-level compilation and ~99.6–99.9% behavioral equivalence on coreutils/binutils (unstripped).
- Skepticism: platform-specific (Linux ELF/x86_64); indirect-call signature recovery and standalone recompilation remain hard.
5) Mechanistic Origin of Moral Indifference in Language Models
- Diagnoses “moral indifference” as a latent-geometry problem (categorical/gradient/structural/dimensional) using a prototype-based moral vector ground truth.
- Uses SAEs + targeted feature fine-tuning + additive steering to improve adversarial safety outcomes on Flames (e.g., PSC1 908→953; win-rate peak 75.4%).
- Bridges mechanistic interpretability with alignment by showing a causal intervention on internal features.
- Skepticism: intervention demonstrated mainly on Qwen3-8B; only a tiny fraction of SAE features correlate with moral dimensions; steering is sensitive to α.
5) Practical next steps
- If you build tool-using agents: prototype a plan-graph verifier that outputs node/edge risk and use it to drive local edits with acceptance tests (score must improve), mirroring GNNVerifier.
- For jailbreak defense in CoT-enabled systems: measure ASR with and without stepwise CoT gating; log per-step safety scores and quantify utility retention on your core tasks (SFCoT-style).
- For private RAG: evaluate whether your product needs dynamic/large top‑k; if yes, benchmark threshold/bisection-style retrieval vs sorting-based secure top‑k under realistic RTT and PIR costs (p²RAG highlights what to measure).
- For federated/vertical FL deployments: run MI-by-layer diagnostics to see where label information concentrates, then test cut-layer advancement as a zero-overhead mitigation—while also measuring feature leakage risk (VFL paper’s trade-off).
- For long-context memory in small agents: try cluster-local memory evolution + two-stage retrieval and track both answer quality and latency; ablate localized evolution vs global retrieval (CLAG).
- For evaluation: add multi-run consistency (not just accuracy) to your internal reasoning benchmarks (BrainBench protocol), and include domain shift + adversarial rewriting if you rely on AI-text detectors.
- For provenance/IP: if you distribute models that may be quantized/distilled, test subspace watermark robustness under your actual transformation pipeline and keep payload modest (FSW suggests ~16-bit practical capacity).
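For the MI-by-layer diagnostic suggested above, a histogram-based plug-in estimator is enough to get started. The activations below are synthetic stand-ins chosen to make the contrast obvious; real usage would quantize actual layer outputs from the deployed VFL models and compare layers along the model cut.

```python
# Plug-in mutual-information estimate between quantized layer activations and
# labels: high MI at a layer means label information has concentrated there,
# which is the signal used to motivate cut-layer placement as a defense.

import math
from collections import Counter

def mutual_information(xs, ys):
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(c / n * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

labels = [0, 0, 1, 1, 0, 1, 0, 1]
layer_act = {
    # toy quantized activation per sample, per layer
    "layer1": [0, 1, 0, 1, 0, 1, 1, 0],   # independent of labels -> MI 0
    "layer3": [0, 0, 1, 1, 0, 1, 0, 1],   # label-aligned deep layer -> MI 1 bit
}
mi = {name: mutual_information(acts, labels) for name, acts in layer_act.items()}
```

Plug-in MI is biased upward on small samples, so in practice compare layers against a shuffled-label baseline rather than reading the absolute numbers.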
Generated from per-paper analyses; no external browsing.
