Daily AI Paper Report (2026-03-20)


Run stats

  • Candidates: 211
  • Selected: 30
  • Deepread completed: 30
  • Window (UTC): 2026-03-18T00:00:00Z → 2026-03-19T00:00:00Z (arxiv_announce, expanded=0)
Selected papers (arXiv ID · title · categories · score · why · tags):
  • 2603.17476 · UniSAFE: A Comprehensive Benchmark for Safety Evaluation of Unified Multimodal Models [PDF]
    Categories: cs.CV, cs.AI, cs.CL · Score: 94
    Why: Comprehensive system-level multimodal safety benchmark across 7 I/O modes; strong reuse for eval/red-teaming.
    Tags: multimodal-safety, benchmark, evaluation, red-teaming, UMM, cross-modality

  • 2603.17372 · Understanding and Defending VLM Jailbreaks via Jailbreak-Related Representation Shift [PDF]
    Categories: cs.CV, cs.AI · Score: 94
    Why: Analyzes VLM jailbreak mechanism via rep shift; proposes defense using jailbreak direction.
    Tags: VLM, jailbreaks, representation, robustness, safety, defense

  • 2603.17368 · Towards Safer Large Reasoning Models by Promoting Safety Decision-Making before Chain-of-Thought Generation [PDF]
    Categories: cs.AI · Score: 94
    Why: Targets CoT-induced safety regressions by forcing safety decisions before reasoning.
    Tags: safety, reasoning-models, chain-of-thought, alignment, guardrails

  • 2603.17239 · LAAF: Logic-layer Automated Attack Framework - A Systematic Red-Teaming Methodology for LPCI Vulnerabilities in Agentic Large Language Model Systems [PDF]
    Categories: cs.CR · Score: 93
    Why: Automated red-teaming for agentic systems w/ memory+RAG; LPCI taxonomy + staged escalation looks impactful.
    Tags: agent-security, prompt-injection, red-teaming, memory-attacks, RAG-security, framework

  • 2603.17373 · SafeTutors: Benchmarking Pedagogical Safety in AI Tutoring Systems [PDF]
    Categories: cs.CL · Score: 93
    Why: Benchmark for pedagogical safety in AI tutors with taxonomy of harms; fills eval gap beyond toxicity.
    Tags: AI safety, evaluation, education, tutoring, harm taxonomy, benchmarks, LLM reliability

  • 2603.17292 · SEAL-Tag: Self-Tag Evidence Aggregation with Probabilistic Circuits for PII-Safe Retrieval-Augmented Generation [PDF]
    Categories: cs.CR · Score: 92
    Why: PII-safe RAG runtime: verify-then-route with evidence tables + probabilistic circuits to prevent exfiltration.
    Tags: privacy, PII, RAG, data-exfiltration, tool-use, verification, probabilistic-circuits
  • 2603.17902 · Differential Privacy in Generative AI Agents: Analysis and Optimal Tradeoffs [PDF]
    Categories: cs.CR, cs.AI · Score: 91
    Why: DP framework for enterprise-data leakage in agents; token/message-level DP and tradeoff analysis.
    Tags: privacy, differential privacy, agents, data leakage, enterprise, security, LLMs

  • 2603.17445 · When Only the Final Text Survives: Implicit Execution Tracing for Multi-Agent Attribution [PDF]
    Categories: cs.AI, cs.CL · Score: 91
    Why: Token-level attribution for multi-agent outputs without logs via keyed implicit execution traces.
    Tags: multi-agent, attribution, auditing, monitoring, watermarking, accountability

  • 2603.17639 · VeriGrey: Greybox Agent Validation [PDF]
    Categories: cs.AI · Score: 90
    Why: Greybox testing for LLM agents using tool-invocation feedback; targets rare dangerous tool calls/injections.
    Tags: agent-evaluation, security-testing, greybox-fuzzing, tool-use, prompt-injection, robustness

  • 2603.17815 · Process Supervision for Chain-of-Thought Reasoning via Monte Carlo Net Information Gain [PDF]
    Categories: cs.CL · Score: 90
    Why: Automatic step-level labels for PRMs via info gain; cheaper process supervision for CoT.
    Tags: process-supervision, PRM, reasoning, reliability, information-theory

  • 2603.17915 · IndicSafe: A Benchmark for Evaluating Multilingual LLM Safety in South Asia [PDF]
    Categories: cs.CL, cs.AI · Score: 89
    Why: Large multilingual safety benchmark for 12 Indic languages; shows major cross-language safety drift.
    Tags: safety-evaluation, multilingual, benchmark, toxicity, refusal, low-resource-languages

  • 2603.17775 · CoVerRL: Breaking the Consensus Trap in Label-Free Reasoning via Generator-Verifier Co-Evolution [PDF]
    Categories: cs.CL, cs.AI, cs.LG · Score: 89
    Why: Fixes label-free RL 'consensus trap' with generator-verifier co-evolution for better reasoning.
    Tags: LLM, reasoning, RL, self-training, verification, robustness
  • 2603.17305 · Contrastive Reasoning Alignment: Reinforcement Learning from Hidden Representations [PDF]
    Categories: cs.AI, cs.CL, cs.LG · Score: 88
    Why: Latent-space RL + contrastive learning to separate safe/unsafe reasoning trajectories; aims at jailbreak robustness.
    Tags: alignment, jailbreak-defense, reasoning-models, representation-learning, RL, hidden-states

  • 2603.17973 · TDAD: Test-Driven Agentic Development - Reducing Code Regressions in AI Coding Agents via Graph-Based Impact Analysis [PDF]
    Categories: cs.SE, cs.AI · Score: 88
    Why: Tool+benchmark to cut coding-agent regressions via code-test graphs; strong SWE-bench results.
    Tags: agents, software engineering, evaluation, robustness, regressions, GraphRAG, SWE-bench

  • 2603.17781 · Facts as First Class Objects: Knowledge Objects for Persistent LLM Memory [PDF]
    Categories: cs.AI · Score: 88
    Why: Shows prompt-memory failure modes; proposes hash-addressed Knowledge Objects for persistent facts.
    Tags: LLM memory, RAG, knowledge management, reliability, long-context, evaluation

  • 2603.17839 · How do LLMs Compute Verbal Confidence [PDF]
    Categories: cs.CL, cs.AI, cs.LG · Score: 88
    Why: Mechanistic evidence on how LLMs form verbal confidence; useful for calibration/monitoring.
    Tags: uncertainty, calibration, interpretability, mechanistic, confidence

  • 2603.17357 · WebPII: Benchmarking Visual PII Detection for Computer-Use Agents [PDF]
    Categories: cs.CR, cs.AI · Score: 87
    Why: Web screenshot PII detection benchmark for computer-use agents; fine-grained taxonomy + scalable synthetic gen.
    Tags: privacy, PII-detection, computer-use-agents, benchmark, vision-language, UI-security

  • 2603.17504 · Inducing Epistemological Humility in Large Language Models: A Targeted SFT Approach to Reducing Hallucination [PDF]
    Categories: cs.CL · Score: 87
    Why: Targeted SFT dataset/benchmark to induce uncertainty admission and reduce hallucinations; many runs.
    Tags: hallucinations, calibration, SFT, datasets, benchmarks, LLM reliability, uncertainty
  • 2603.17662 · FINER: MLLMs Hallucinate under Fine-grained Negative Queries [PDF]
    Categories: cs.CV, cs.AI · Score: 86
    Why: New fine-grained negative-query benchmarks for MLLM hallucinations; DPO tuning boosts robustness.
    Tags: MLLM, hallucination, benchmark, DPO, evaluation, robustness

  • 2603.17233 · Draft-and-Prune: Improving the Reliability of Auto-formalization for Logical Reasoning [PDF]
    Categories: cs.AI · Score: 86
    Why: Verification + diversity reduces semantic failures in auto-formalization for sound reasoning.
    Tags: formal-verification, auto-formalization, reasoning, reliability, verification

  • 2603.17419 · Caging the Agents: A Zero Trust Security Architecture for Autonomous AI in Healthcare [PDF]
    Categories: cs.CR, cs.AI · Score: 85
    Why: Zero-trust security architecture for production healthcare agents; practical controls for PHI/HIPAA contexts.
    Tags: agent-security, zero-trust, healthcare, PHI, deployment, access-control, governance

  • 2603.17673 · Post-Training Local LLM Agents for Linux Privilege Escalation with Verifiable Rewards [PDF]
    Categories: cs.CR, cs.AI · Score: 84
    Why: Post-training local 4B agent for Linux privesc with verifiable rewards; relevant to security-agent capability/safety.
    Tags: cybersecurity, agents, post-training, verifiable-rewards, privilege-escalation, local-LLMs

  • 2603.17829 · CodeScout: An Effective Recipe for Reinforcement Learning of Code Search Agents [PDF]
    Categories: cs.SE, cs.AI, cs.CL · Score: 84
    Why: RL recipe trains code-search agents using only a Unix terminal; simplifies agent tooling assumptions.
    Tags: coding agents, reinforcement learning, code search, tool use, agents, efficiency

  • 2603.18000 · AgentFactory: A Self-Evolving Framework Through Executable Subagent Accumulation and Reuse [PDF]
    Categories: cs.AI · Score: 84
    Why: Self-evolving agents that store reusable executable subagents; raises capability & safety stakes.
    Tags: agents, self-improvement, tool-use, code-generation, reusability
  • 2603.17787 · Governed Memory: A Production Architecture for Multi-Agent Workflows [PDF]
    Categories: cs.AI, cs.CL, cs.MA · Score: 82
    Why: Shared memory + governance layer for multi-agent enterprise workflows; tackles context quality and oversight gaps.
    Tags: multi-agent, memory, governance, enterprise, RAG, observability, reliability

  • 2603.17244 · Graph-Native Cognitive Memory for AI Agents: Formal Belief Revision Semantics for Versioned Memory Architectures [PDF]
    Categories: cs.AI, cs.IR, cs.LO · Score: 82
    Why: Formal belief-revision semantics for versioned agent memory graphs; bridges AGM to graph ops.
    Tags: agent memory, belief revision, formal methods, knowledge representation, graphs, AGM

  • 2603.17893 · scicode-lint: Detecting Methodology Bugs in Scientific Python Code with LLM-Generated Patterns [PDF]
    Categories: cs.SE, cs.AI, cs.LG · Score: 82
    Why: LLM-generated lint patterns to catch scientific methodology bugs (leakage/CV/seeds) sustainably.
    Tags: code, LLM tools, reliability, static analysis, data leakage, evaluation

  • 2603.17677 · Adaptive Guidance for Retrieval-Augmented Masked Diffusion Models [PDF]
    Categories: cs.CL, cs.AI, cs.LG · Score: 82
    Why: Training-free adaptive guidance for RAG in diffusion LMs; mitigates noisy retrieval conflicts.
    Tags: RAG, grounding, diffusion-language-models, robustness, retrieval

  • 2603.17863 · Procedural Generation of Algorithm Discovery Tasks in Machine Learning [PDF]
    Categories: cs.LG, cs.AI · Score: 81
    Why: Procedurally generated task suite for ML algorithm discovery; mitigates contamination/saturation.
    Tags: evaluation, benchmarks, procedural generation, AutoML, algorithm discovery, meta-learning

  • 2603.17942 · Efficient Training-Free Multi-Token Prediction via Embedding-Space Probing [PDF]
    Categories: cs.CL · Score: 81
    Why: Training-free multi-token prediction speeds decoding without loss; potentially big inference win.
    Tags: inference, speculative-decoding, multi-token-prediction, efficiency, LLMs

AI Paper Insight Brief

2026-03-20

0) Executive takeaways (read this first)

  • Inference-time “generate many, verify hard, then vote” is winning for correctness-sensitive reasoning: Draft-and-Prune shows that solver-checked well-definedness (existence+uniqueness) can turn many executable-but-wrong formalizations into a high-accuracy auto-formalization (AF) pipeline.
  • Safety failures increasingly look like “representation/state shifts” rather than simple intent-misclassification: the VLM jailbreak work finds a separable jailbreak state and removes its component at inference with large ASR drops while keeping benign utility.
  • Agent security is converging on two complementary tracks: (a) infrastructure zero-trust controls (sandboxing, secret isolation, egress allowlists, audits) for regulated deployments, and (b) systematic agent red-teaming via greybox fuzzing using tool-sequence feedback.
  • Memory is becoming a governed, versioned substrate—not just retrieval: two architectures (graph-native belief revision; enterprise governed memory) emphasize provenance, revision semantics, consolidation safety, and policy routing as first-class primitives.
  • Post-training with verifiable rewards is making small local agents competitive in narrow but real security tasks: a 4B model reaches near-frontier success on Linux priv-esc with >100× lower per-success inference cost (at the evaluated operating point).
  • Benchmarks are shifting toward system-level multimodal risk: UniSAFE’s shared-target, multi-I/O evaluation highlights that multi-image composition and multi-turn editing are disproportionately risky vs text-output tasks.

1) Key themes (clusters)

Theme: Verified reasoning via pruning, process signals, and solvers

  • Why it matters: As models are used for high-stakes reasoning, the bottleneck is often not producing an answer but ensuring the produced reasoning/program is semantically faithful and doesn’t silently fail.
  • Representative papers: Draft-and-Prune (2603.17233); Process Supervision for Chain-of-Thought Reasoning via Monte Carlo Net Information Gain (2603.17815).
  • Common approach:
    • Generate multiple candidates (plans/traces), then gate them with a verifier (solver checks; task validators).
    • Prefer cheap, scalable labeling/verification (existence/uniqueness checks; MCNIG’s O(N) step labeling).
    • Use aggregation/selection (majority vote over pruned formalizations; best-of-K via PRM scoring).
  • Open questions / failure modes:
    • Coverage failures: sampling may never include a faithful formalization (not fixed by pruning).
    • Validator mismatch: step labels/solver checks may not capture all semantic errors or downstream objectives.
    • Compute cost: multi-path sampling + verification can be expensive at inference.
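
The common approach above can be sketched in miniature. This is an illustrative toy, not any paper's code: each candidate "formalization" is modeled as a predicate over a small finite domain, the gate keeps only candidates with exactly one solution (existence + uniqueness), and the surviving answers are majority-voted.

```python
from collections import Counter

def solutions(constraint, domain):
    """Enumerate all domain values satisfying a candidate formalization."""
    return [x for x in domain if constraint(x)]

def draft_and_prune(candidates, domain):
    """Keep only well-defined candidates (exactly one solution), then vote."""
    kept = []
    for constraint in candidates:
        sols = solutions(constraint, domain)
        if len(sols) == 1:          # existence + uniqueness gate
            kept.append(sols[0])
    if not kept:
        return None                 # coverage failure: nothing survived pruning
    return Counter(kept).most_common(1)[0][0]

# Toy problem: "a number that gives 5 when you add 2", over 0..9.
candidates = [
    lambda x: x + 2 == 5,            # faithful: unique solution {3}
    lambda x: x % 2 == 1,            # ambiguous: many solutions -> pruned
    lambda x: x + 2 == 5 and x > 5,  # contradictory: no solution -> pruned
]
answer = draft_and_prune(candidates, range(10))
print(answer)  # -> 3
```

Note how the `None` branch makes the coverage failure mode explicit: pruning can only select among sampled candidates, never repair them.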

Theme: Jailbreak mitigation by acting on internal states (pre-CoT and multimodal shifts)

  • Why it matters: Safety can degrade specifically when models enter “reasoning mode” (CoT) or when images are added; targeted interventions can reduce ASR without large utility loss.
  • Representative papers: Towards Safer Large Reasoning Models by Promoting Safety Decision-Making before Chain-of-Thought Generation (PreSafe, 2603.17368); Understanding and Defending VLM Jailbreaks via Jailbreak-Related Representation Shift (JRS-Rem, 2603.17372).
  • Common approach:
    • Identify a decision/state that predicts unsafe behavior (pre-CoT refusal signal; jailbreak direction in hidden space).
    • Apply lightweight training-time alignment (PreSafe) or training-free inference-time projection removal (JRS-Rem).
    • Evaluate across multiple attacks/benchmarks and check benign utility retention.
  • Open questions / failure modes:
    • Residual ASR on harder sets (e.g., WildJailbreak remains non-trivial for PreSafe).
    • Sensitivity to decoding randomness (PreSafe ASR rises with higher temperature/top-p/top-k).
    • Dependence on baseline alignment quality (JRS-Rem assumes a reasonably aligned LM backbone).
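
The projection-removal step can be sketched in a few lines. This is a toy illustration of the idea only, with made-up vectors standing in for hidden states and the learned jailbreak direction; the real defense operates on model activations:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def remove_jailbreak_component(h, d, tau=1.0):
    """Subtract tau times the projection of hidden state h onto unit direction d."""
    norm = math.sqrt(dot(d, d))
    u = [x / norm for x in d]       # unit-normalize the direction
    proj = dot(h, u)                # scalar component of h along the direction
    return [hi - tau * proj * ui for hi, ui in zip(h, u)]

h = [0.9, -0.2, 0.4, 0.1]   # stand-in hidden state
d = [1.0, 1.0, 0.0, 0.0]    # stand-in jailbreak direction
h_clean = remove_jailbreak_component(h, d, tau=1.0)
# At tau = 1 the cleaned state has no component left along d.
print(abs(dot(h_clean, d)) < 1e-9)  # -> True
```

Sweeping `tau` between 0 and 1 is exactly the safety-utility knob the τ-sweep advice in the next-steps section refers to.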

Theme: Agent security in practice: zero-trust deployment + greybox red-teaming

  • Why it matters: Tool-using agents expand the attack surface (secrets, egress, prompt injection, fleet drift). Practical defenses need both preventive controls and continuous discovery of failures.
  • Representative papers: VeriGrey: Greybox Agent Validation (2603.17639); Caging the Agents (2603.17419); When Only the Final Text Survives (IET, 2603.17445).
  • Common approach:
    • Treat tool use as the core security surface: isolate execution (gVisor), isolate secrets (credential proxy), restrict egress (allowlists), and test tool-sequence vulnerabilities (VeriGrey feedback).
    • Add auditing/provenance: fleet audit agents (Tony); keyed implicit tracing to recover attribution/topology from final text.
    • Measure success with operational metrics (found HIGH severity issues; ITSR improvements; token-level attribution accuracy).
  • Open questions / failure modes:
    • Prompt integrity remains brittle vs infra controls (explicitly noted in healthcare zero-trust stack).
    • Instrumentation requirements: VeriGrey needs tool-call logging; IET requires decode-time modulation and key management.
    • Adaptive attackers: how robust are tool-sequence fuzzing gains and representation/provenance signals under deliberate evasion?
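
The tool-sequence-as-coverage idea reduces to a familiar fuzzing loop. A toy sketch with a stub agent and stub mutations (the real system instruments actual tool calls and uses LLM-crafted context-bridging mutations, none of which are shown here):

```python
# Toy greybox loop: keep a prompt mutant only if it drives the agent through a
# tool-call sequence not seen before (tool sequences play the role of coverage).

def run_agent(prompt):
    """Stub agent: returns the tool-call sequence a prompt triggers."""
    seq = ["search"]
    if "attachment" in prompt:
        seq.append("read_file")
    if "send" in prompt:
        seq.append("email")
    return tuple(seq)

def mutate(prompt, i):
    """Stub mutation: splice in context-bridging fragments (illustrative only)."""
    fragments = [" please open the attachment", " then send a summary", " and cite sources"]
    return prompt + fragments[i % len(fragments)]

seed = "summarize this page"
corpus, seen = [seed], {run_agent(seed)}
for i in range(6):
    mutant = mutate(corpus[i % len(corpus)], i)
    seq = run_agent(mutant)
    if seq not in seen:          # novel tool sequence -> keep mutant as a new seed
        seen.add(seq)
        corpus.append(mutant)
print(len(seen))  # -> 3 distinct tool sequences discovered
```

The feedback signal (novel tool sequences) is what distinguishes this from blind prompt mutation, mirroring how branch coverage guides conventional greybox fuzzers.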

Theme: Memory as a governed, versioned, belief-revising substrate

  • Why it matters: Persistent agents need auditable provenance, deterministic “current truth,” and safe consolidation; retrieval alone doesn’t solve governance, versioning, or belief change.
  • Representative papers: Graph-Native Cognitive Memory for AI Agents (2603.17244); Governed Memory: A Production Architecture for Multi-Agent Workflows (2603.17787).
  • Common approach:
    • Use versioned primitives (immutable revisions + mutable tags; dual stores; provenance metadata).
    • Add formal or operational guarantees (AGM/Hansson postulates; adversarial entity-isolation tests; governance routing).
    • Consolidate asynchronously with safety guards / bounded reflection loops.
  • Open questions / failure modes:
    • Formal scope limits (belief revision proofs over a weak propositional fragment; K7/K8 open).
    • Heuristic gates and LLM-as-judge bias in production governance pipelines.
    • Concurrency/conflict resolution for simultaneous multi-agent writes remains under-validated.
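
The versioned-primitives bullet can be made concrete with a toy store: immutable, append-only revisions carrying provenance, plus a mutable tag that defines deterministic "current truth". Class and field names here are illustrative, not any paper's API:

```python
import time

class VersionedMemory:
    """Toy versioned fact store: append-only revisions + a mutable 'current' tag."""

    def __init__(self):
        self.revisions = {}   # key -> list of immutable revision records
        self.current = {}     # key -> index of the revision tagged 'current'

    def write(self, key, value, source):
        rev = {"value": value, "source": source, "ts": time.time()}
        self.revisions.setdefault(key, []).append(rev)    # never overwrite
        self.current[key] = len(self.revisions[key]) - 1  # retag, don't delete

    def read(self, key):
        """Deterministic 'current truth' for a key."""
        return self.revisions[key][self.current[key]]["value"]

    def history(self, key):
        """Full provenance trail, oldest first."""
        return [(r["value"], r["source"]) for r in self.revisions[key]]

mem = VersionedMemory()
mem.write("user.tz", "UTC", source="signup-form")
mem.write("user.tz", "CET", source="chat-2026-03-18")  # a revision, not an overwrite
print(mem.read("user.tz"))          # -> CET
print(len(mem.history("user.tz")))  # -> 2
```

Belief revision then becomes an operation on the tag and revision graph (retract, supersede) rather than a destructive update, which is what makes auditing and rollback tractable.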

Theme: System-level evaluation & reliability tooling (multimodal safety, methodology linting, diffusion-RAG)

  • Why it matters: Reliability gaps increasingly surface at the system level (multimodal workflows, analysis pipelines, retrieval stacks) rather than in single-model outputs.
  • Representative papers: UniSAFE (2603.17476); scicode-lint (2603.17893); Adaptive Guidance for Retrieval-Augmented Masked Diffusion Models (2603.17677).

2) Technical synthesis

  • Multiple papers converge on a two-stage pattern: diversify candidates (sample plans; sample CoTs; mutate prompts; retrieve contexts) then apply a verifier/gate (solver existence/uniqueness; validators; tool-sequence novelty; judge ensembles).
  • “Verification” is broadening beyond correctness to well-definedness and governance: D&P prunes ambiguous/contradictory executable programs; governed memory enforces entity isolation and governance routing; healthcare stack enforces egress/secret isolation.
  • Representation-level interventions are becoming practical defenses: JRS-Rem subtracts a learned jailbreak direction; PreSafe aligns a pre-CoT latent decision signal—both aim to preserve utility while reducing ASR.
  • Tool invocation sequences are emerging as the agent analogue of coverage: VeriGrey uses them as greybox feedback; zero-trust stacks harden the tool surface; provenance work (IET) encodes agent identity/topology into the output stream.
  • Memory systems are converging on versioning + consolidation: Kumiho’s revision/tag graph with Dream State consolidation parallels Governed Memory’s dedup + reflection-bounded retrieval + schema lifecycle monitoring.
  • Benchmarks are shifting from single-turn text to system-level modalities and workflows: UniSAFE emphasizes multi-image composition and multi-turn editing; LoCoMo/LoCoMo-Plus appear as memory stress tests in Kumiho and Governed Memory.
  • Post-training trends: verifiable reward settings (priv-esc) show strong gains with small models; process supervision (MCNIG) reduces labeling cost for PRMs; both rely on validators rather than human labels.
  • Several methods explicitly trade compute for reliability: D&P (k paths + solver calls), MCNIG (K rollouts but fewer tokens than prior labelers), ARAM (per-token/per-step entropy/KL computations), VeriGrey (campaign executions).
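
The best-of-K-via-PRM pattern recurring above, sketched with a stub step scorer (real PRMs are learned models; scoring a chain by its weakest step, min aggregation, is one common choice and is an assumption here):

```python
# Best-of-K selection over reasoning chains using process-reward-style step scores.

def score_step(step):
    """Stub PRM: penalize steps flagged as unsupported (a real PRM is learned)."""
    return 0.1 if "unsupported" in step else 0.9

def best_of_k(chains):
    """Pick the chain whose weakest step scores highest (min aggregation)."""
    def chain_score(chain):
        return min(score_step(s) for s in chain)
    return max(chains, key=chain_score)

chains = [
    ["define x", "unsupported leap", "answer 7"],
    ["define x", "substitute", "answer 3"],
]
print(best_of_k(chains))  # picks the chain with no weak step
```

Min aggregation encodes the intuition that one bad step invalidates a chain; product or mean aggregation are the usual alternatives when partial credit makes sense.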

3) Top 5 papers (with “why now”)

1) Draft-and-Prune: Improving the Reliability of Auto-formalization for Logical Reasoning

  • Turns AF brittleness into a controllable pipeline by sampling plan diversity then solver-pruning to keep only existence+uniqueness solutions.
  • Large reported AR-LSAT gains (e.g., pruning boosts AccAF from 45.13% to 78.43% at k=20) suggest semantic gating is the main lever, not just syntax repair.
  • “Why now”: as solver-backed reasoning becomes more common, semantic faithfulness is the limiting factor; this is an inference-time, modular fix.
  • Skepticism: higher inference cost; remaining failures when no correct formalization is sampled.

2) UniSAFE: A Comprehensive Benchmark for Safety Evaluation of Unified Multimodal Models

  • Provides a shared-target benchmark across 7 multimodal I/O task types with ASR/ARR/SAS metrics and validated ensemble judging (r=0.962 vs humans).
  • Finds composition (IC) and multi-turn editing (MT) are especially vulnerable; image-output tasks are more vulnerable than text-output tasks.
  • “Why now”: unified any-to-any models are shipping; safety evaluation needs to match real workflows (composition/editing), not single-step prompts.
  • Skepticism: model support differs across tasks; refusal mechanisms complicate apples-to-apples comparisons.
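
The judge-validation step is, at bottom, a correlation between ensemble judge scores and human ratings. A minimal Pearson-r computation over per-item judge means (toy numbers, not UniSAFE's data):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Three judges score the same items; the ensemble score is the per-item mean.
judge_scores = [
    [0.9, 0.2, 0.7, 0.1],
    [0.8, 0.3, 0.6, 0.2],
    [1.0, 0.1, 0.8, 0.0],
]
ensemble = [sum(col) / len(col) for col in zip(*judge_scores)]
human = [0.95, 0.20, 0.70, 0.05]
r = pearson_r(ensemble, human)
print(round(r, 3))  # high agreement on this toy data
```

Averaging judges before correlating is one reasonable ensembling choice; the paper's exact aggregation may differ.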

3) VeriGrey: Greybox Agent Validation

  • Replaces branch coverage with tool-sequence feedback and uses context-bridging mutations to craft more plausible injections.
  • Strong empirical deltas (e.g., +33 pp ITSR on AgentDojo with GPT-4.1; ablations show both feedback and context-bridging matter).
  • “Why now”: indirect prompt injection is a dominant real-world agent failure mode; teams need scalable pre-deploy validation.
  • Skepticism: requires instrumentation; scoped to single-session attacks (not multi-session memory poisoning).

4) Understanding and Defending VLM Jailbreaks via Jailbreak-Related Representation Shift

  • Shows jailbreak/refusal/benign states are separable; defines a jailbreak direction and removes its projection at inference (JRS-Rem).
  • Large ASR reductions (e.g., LLaVA-1.5-7B HADES 77.3%→12.2%) with negligible benign utility change on reported benchmarks.
  • “Why now”: multimodal jailbreaks are a deployment blocker; training-free, low-overhead defenses are attractive.
  • Skepticism: depends on backbone alignment; untested at much larger scales and against adaptive evasion.

5) Post-Training Local LLM Agents for Linux Privilege Escalation with Verifiable Rewards

  • Demonstrates SFT + RLVR can make a local 4B agent highly reliable on a verifiable multi-step security task (95.8% at R=20).
  • Reports >100× lower expected inference cost per successful escalation vs a frontier API model at the chosen operating point.
  • “Why now”: organizations want local, reproducible agents for sensitive environments; verifiable tasks are a natural fit for RLVR.
  • Skepticism: RL training is compute-heavy (4×H100 for ~29h); domain is narrow and generator families may not cover real-world long tail.
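
What makes this task "verifiable" is that the environment can check success directly. A sketch of such a reward function, assuming the episode's success is read off the output of `id` (the paper's exact success criterion may differ):

```python
import re

def privesc_reward(id_output: str) -> float:
    """Return 1.0 iff the final `id` output shows uid 0 (root), else 0.0."""
    m = re.search(r"uid=(\d+)", id_output)
    return 1.0 if m and int(m.group(1)) == 0 else 0.0

print(privesc_reward("uid=0(root) gid=0(root) groups=0(root)"))   # -> 1.0
print(privesc_reward("uid=1000(dev) gid=1000(dev) groups=1000"))  # -> 0.0
```

Because the check is mechanical and binary, it can label thousands of rollouts with no human in the loop, which is what makes RLVR cheap to scale on tasks like this.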

4) Practical next steps

  • For solver-backed reasoning systems: implement existence/uniqueness pruning (or analogous well-definedness checks) and measure how much of your error is “executable but wrong,” as in D&P’s decomposition.
  • For CoT-enabled deployments: evaluate safety with CoT on vs off and test whether a pre-decision alignment approach (like PreSafe’s pre-CoT latent alignment) reduces ASR without harming reasoning on your key tasks.
  • For VLM products: add a representation-shift monitor (projection onto a jailbreak direction) and run a τ sweep to map the safety–utility frontier (as JRS-Rem does).
  • For agent security programs: adopt tool-sequence logging as a first-class telemetry signal; use it both for greybox fuzzing (VeriGrey-style) and for runtime anomaly detection.
  • For regulated agent deployments: prioritize infra controls (sandboxing, secret isolation, egress allowlists) and treat prompt-integrity layers as best-effort; add continuous auditing (like the “Tony” audit agent) with tight privilege scoping.
  • For multi-agent provenance: if you can instrument decoding, consider keyed implicit tracing to preserve attribution/topology even when logs are stripped; define key management and audit workflows early.
  • For memory/RAG stacks: move from “retrieve text” to versioned, governed memory with provenance, dedup, and bounded reflection; explicitly test cross-entity leakage and governance bypass scenarios.
  • For evaluation: add system-level multimodal tasks (composition, multi-turn editing) to your safety suite (UniSAFE-style) and track not just ASR but severity (ARR) and self-awareness (SAS).
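
The error decomposition recommended in the first bullet can be automated over a labeled audit set. A sketch with made-up labels: a candidate counts as "executable but wrong" when it runs but disagrees with the known answer for that audit item.

```python
from collections import Counter

def decompose(candidates, gold):
    """Classify sampled formalizations for one audited item with known answer `gold`.
    Each candidate is (executable, answer); answer is None when non-executable."""
    counts = Counter()
    for executable, answer in candidates:
        if not executable:
            counts["non-executable"] += 1
        elif answer != gold:
            counts["executable-but-wrong"] += 1
        else:
            counts["correct"] += 1
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

# Hypothetical audit of k=5 sampled formalizations for one item (labels made up).
samples = [(False, None), (True, 7), (True, 3), (True, 7), (True, 3)]
fractions = decompose(samples, gold=3)
print(fractions)  # the executable-but-wrong share is what D&P-style pruning targets
```

If the executable-but-wrong fraction dominates, semantic gating (well-definedness checks) is the right lever; if non-executable dominates, invest in syntax repair instead.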

Generated from per-paper analyses; no external browsing.