Daily AI Paper Report (2026-04-02)

Chinese version: [中文]

Run stats

  • Candidates: 235
  • Selected: 30
  • Deepread completed: 30
  • Window (UTC): 2026-03-31T00:00:00Z → 2026-04-01T00:00:00Z (arxiv_announce, expanded=0)
Selected papers

  • 2603.29403 · Security in LLM-as-a-Judge: A Comprehensive SoK (PDF) · cs.CR, cs.AI · score 94
    Why: First SoK on LLM-as-a-Judge security; maps attacks/risks for eval pipelines.
    Tags: LLM-as-a-judge, security, evaluation, adversarial, SoK, reliability
  • 2603.29231 · Beyond pass@1: A Reliability Science Framework for Long-Horizon LLM Agents (PDF) · cs.AI · score 94
    Why: Reliability metrics for long-horizon agents; shows pass@1 fails as duration grows; large eval.
    Tags: agents, reliability, evaluation, long-horizon, benchmarks, deployment
  • 2603.30016 · Architecting Secure AI Agents: Perspectives on System-Level Defenses Against Indirect Prompt Injection Attacks (PDF) · cs.CR, cs.AI · score 92
    Why: System-level design guidance for indirect prompt injection defenses in agents.
    Tags: agents, prompt-injection, system-design, security, tool-use, policies
  • 2603.29993 · Extending MONA in Camera Dropbox: Reproduction, Learned Approval, and Design Implications for Reward-Hacking Mitigation (PDF) · cs.AI · score 92
    Why: Reproduces and extends MONA reward-hacking mitigation; probes learned-approval assumptions & tooling.
    Tags: alignment, reward-hacking, RL, MONA, reproducibility, safety
  • 2603.29357 · BenchScope: How Many Independent Signals Does Your Benchmark Provide? (PDF) · cs.AI · score 92
    Why: Quantifies benchmark redundancy via effective dimensionality; actionable for eval design/leaderboards.
    Tags: evaluation, benchmarks, measurement, leaderboards, metrics
  • 2603.29665 · Near-Miss: Latent Policy Failure Detection in Agentic Workflows (PDF) · cs.CL · score 90
    Why: Detects latent policy failures (near-misses) in agent workflows beyond end-state checks.
    Tags: agents, policy-compliance, evaluation, monitoring, ToolGuard, safety-metrics
  • 2603.29418 · Adversarial Prompt Injection Attack on Multimodal Large Language Models (PDF) · cs.CV, cs.AI · score 90
    Why: Imperceptible visual prompt injection against closed MLLMs; practical multimodal attack surface.
    Tags: security, prompt-injection, multimodal, adversarial, red-teaming, MLLM
  • 2603.29500 · Learning to Generate Formally Verifiable Step-by-Step Logic Reasoning via Structured Formal Intermediaries (PDF) · cs.AI, cs.LG · score 90
    Why: Process reward using structured formal intermediates to improve step reliability without losing accuracy.
    Tags: reasoning, process-reward, formal-methods, RL, reliability
  • 2603.29846 · SNEAK: Evaluating Strategic Communication and Information Leakage in Large Language Models (PDF) · cs.CL · score 88
    Why: Benchmark for strategic communication & secret-keeping; targets info leakage in LLMs.
    Tags: information-leakage, multi-agent, benchmarks, security, strategic-communication, LLM-eval
  • 2603.29429 · CounselReflect: A Toolkit for Auditing Mental-Health Dialogues (PDF) · cs.CL · score 88
    Why: Auditing toolkit for mental-health dialogues with evidence-linked, multi-metric risk reports.
    Tags: evaluation, auditing, safety, mental-health, rubrics, LLM
  • 2603.29353 · Nomad: Autonomous Exploration and Discovery (PDF) · cs.AI · score 88
    Why: Exploration-first agent architecture with hypothesis generation + independent verification; relevant to agent reliability.
    Tags: agents, autonomous-research, tool-use, verification, evaluation
  • 2603.29373 · Beyond Idealized Patients: Evaluating LLMs under Challenging Patient Behaviors in Medical Consultations (PDF) · cs.CL · score 86
    Why: Realistic medical safety eval: challenging patient behaviors + concrete unsafe-failure criteria.
    Tags: medical, safety-evaluation, robustness, hallucinations, high-stakes, LLM
  • 2603.29492 · Calibrated Confidence Expression for Radiology Report Generation (PDF) · cs.CL · score 86
    Why: RL framework to calibrate verbalized confidence in radiology reports; targets hallucination risk.
    Tags: calibration, medical, vision-language, hallucinations, RL, reliability
  • 2603.29632 · An Empirical Study of Multi-Agent Collaboration for Automated Research (PDF) · cs.MA, cs.AI · score 86
    Why: Controlled empirical study of multi-agent coordination for automated research; useful evidence for MAS design/safety.
    Tags: multi-agent, coordination, automated-research, benchmarks, agent-evaluation
  • 2603.29194 · Multi-Layered Memory Architectures for LLM Agents: An Experimental Evaluation of Long-Term Context Retention (PDF) · cs.CV, cs.AI · score 86
    Why: Agent memory layering + retrieval gating reduces drift/false memories under bounded context budgets.
    Tags: agents, memory, long-context, retrieval, reliability
  • 2603.29493 · MemFactory: Unified Inference & Training Framework for Agent Memory (PDF) · cs.CL, cs.AI · score 85
    Why: Unified framework for training/inference of agent memory with modular components; reusable infra.
    Tags: agents, memory, framework, RL, tooling, long-term
  • 2603.29902 · ATP-Bench: Towards Agentic Tool Planning for MLLM Interleaved Generation (PDF) · cs.AI · score 84
    Why: ATP-Bench evaluates agentic tool planning for interleaved multimodal generation.
    Tags: agents, tool-planning, multimodal, benchmark, MLLM, evaluation
  • 2603.29405 · Hallucination-aware intermediate representation edit in large vision-language models (PDF) · cs.CV, cs.AI · score 84
    Why: Low-overhead hallucination mitigation for VLMs via intermediate-representation detection and edits; practical reliability gain.
    Tags: hallucinations, vision-language, reliability, representation-editing, multimodal
  • 2603.29676 · A Comprehensive Information-Decomposition Analysis of Large Vision-Language Models (PDF) · cs.LG, cs.CL, cs.CV · score 84
    Why: PID-based decomposition to measure redundant/unique/synergistic info in 26 LVLMs across tasks.
    Tags: interpretability, vision-language, information-decomposition, multimodal, analysis
  • 2603.29497 · Distilling Human-Aligned Privacy Sensitivity Assessment from Large Language Models (PDF) · cs.CL · score 83
    Why: Distills LLM privacy-sensitivity judgments into small models for scalable deployment.
    Tags: privacy, distillation, data-governance, classification, LLM-judge, efficiency
  • 2603.29318 · PSPA-Bench: A Personalized Benchmark for Smartphone GUI Agent (PDF) · cs.AI · score 83
    Why: Personalization benchmark for smartphone GUI agents with 12.8k instructions across apps/scenarios.
    Tags: agents, benchmarks, GUI, smartphones, personalization, evaluation
  • 2603.29139 · SciVisAgentBench: A Benchmark for Evaluating Scientific Data Analysis and Visualization Agents (PDF) · cs.AI, cs.GR, cs.HC · score 82
    Why: Benchmark for scientific analysis/visualization agents with taxonomy + outcome-centric evaluation.
    Tags: agents, benchmarks, scientific-workflows, tool-use, evaluation, visualization
  • 2603.29466 · An Isotropic Approach to Efficient Uncertainty Quantification with Gradient Norms (PDF) · cs.LG, cs.AI, cs.CL · score 82
    Why: Cheap uncertainty estimates from gradient norms (single backward pass) for large models; helps calibration/monitoring.
    Tags: uncertainty, calibration, gradient-norm, epistemic-uncertainty, monitoring
  • 2603.29288 · Sima AIunty: Caste Audit in LLM-Driven Matchmaking (PDF) · cs.CY, cs.AI, cs.CL, cs.HC, cs.SI · score 82
    Why: Controlled audit of caste bias in LLM matchmaking across model families and income strata.
    Tags: bias, fairness, auditing, sociotechnical, evaluation
  • 2603.29759 · TSHA: A Benchmark for Visual Language Models in Trustworthy Safety Hazard Assessment Scenarios (PDF) · cs.CV, cs.AI · score 81
    Why: Large real-world VLM benchmark for trustworthy indoor safety-hazard assessment.
    Tags: VLM, safety, benchmark, hazard-detection, robust-evaluation, vision-language
  • 2603.29232 · Long-Document QA with Chain-of-Structured-Thought and Fine-Tuned SLMs (PDF) · cs.CL, cs.AI, cs.LG · score 80
    Why: Structured long-doc QA (CoST) enabling verifiable outputs; aims for accuracy + latency with SLMs.
    Tags: long-context, QA, structured-output, verification, SLMs, reliability
  • 2603.29871 · ShapE-GRPO: Shapley-Enhanced Reward Allocation for Multi-Candidate LLM Training (PDF) · cs.AI · score 80
    Why: Shapley-style reward allocation for multi-candidate LLM post-training; reduces free-riding vs set-level rewards.
    Tags: LLM-training, RLHF, GRPO, credit-assignment, shapley
  • 2603.29109 · SemLoc: Structured Grounding of Free-Form LLM Reasoning for Fault Localization (PDF) · cs.SE, cs.AI · score 80
    Why: Grounds free-form LLM reasoning into structured intermediates for more verifiable fault localization.
    Tags: software, debugging, LLM-reasoning, grounding, verification
  • 2603.29088 · WybeCoder: Verified Imperative Code Generation (PDF) · cs.SE, cs.AI · score 79
    Why: Agentic verified code generation with co-evolving invariants/proofs; improves reliability.
    Tags: code-generation, verification, agents, Lean, SMT, reliability
  • 2603.29824 · Curvature-Guided LoRA: Steering in the pretrained NTK subspace (PDF) · cs.LG · score 79
    Why: Curvature/NTK-guided LoRA aims to match full fine-tuning predictions with efficient second-order updates.
    Tags: PEFT, LoRA, optimization, second-order, fine-tuning

AI Paper Insight Brief

2026-04-02

0) Executive takeaways (read this first)

  • Evaluation is shifting from “did it work once?” to “did it work reliably and safely over trajectories?” New metrics/benchmarks target long-horizon reliability decay (RDC/VAF/GDS/MOP), latent policy failures (“near-misses”), and benchmark redundancy (effective dimensionality), suggesting many current leaderboards overstate progress.
  • Structured, executable intermediates are becoming the dominant pattern for grounding LLM reasoning. This shows up in verified imperative code generation (VC subgoals + Lean/SMT), semantic fault localization (LLM→executable constraints), and long-doc QA (LLM→structured outputs distilled into SLMs).
  • Memory is not “free”: naive memory scaffolds can hurt long-horizon performance. A reliability study finds that memory-augmented ReAct never improves long-horizon GDS and often hurts; in contrast, more explicit layered memory (working/episodic/semantic + retention regularization) reports retention/FMR gains, pointing to memory design as the key variable.
  • LLM-as-a-judge is now a critical security dependency. A security SoK catalogs high-ASR attacks (prompt injection, poisoning/backdoors, tokenization exploits) against judges; multiple new benchmarks also rely on MLLM judges, increasing the need for judge hardening and meta-evaluation.
  • Multimodal systems face a two-sided squeeze: stronger benchmarks + stronger attacks. New hazard-assessment and tool-planning benchmarks raise realism/coverage, while covert multimodal prompt injection achieves high black-box ASR against commercial MLLMs—deployment needs system-level defenses, not just model tweaks.
  • Formal verification is expanding from functional proofs to imperative programs at scale. WybeCoder reports high solve rates on translated imperative benchmarks and a large verified artifact (Heapsort), indicating agentic proof+code co-evolution is becoming practical (with caveats).
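
The metric suites above have paper-specific definitions, but the gap they expose is easy to see with toy data. A minimal sketch of why repeated-run reliability diverges from single-shot success (the data and metric names here are illustrative, not the paper's RDC/GDS definitions):

```python
def pass_at_1(results):
    """Fraction of tasks whose first attempt succeeded."""
    return sum(runs[0] for runs in results) / len(results)

def pass_all_k(results):
    """Fraction of tasks that succeed on every one of k repeated runs:
    a stricter, trajectory-level reliability signal."""
    return sum(all(runs) for runs in results) / len(results)

# Toy grid: 4 tasks x 3 repeated episodes (True = success).
results = [
    [True, True, True],
    [True, False, True],    # flaky: pass@1 credits it, all-k does not
    [False, False, False],
    [True, True, False],
]
print(pass_at_1(results))   # 0.75
print(pass_all_k(results))  # 0.25
```

Flaky tasks inflate pass@1 but fail the all-k criterion, which is one mechanism behind the rank inversions the reliability papers report.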

2) Key themes (clusters)

  • Trajectory-level reliability & hidden failures in agents
  • Structured intermediates for grounding, auditability, and distillation
  • Benchmarking the benchmark (redundancy, judge validity, domain realism)
  • Security & privacy risks in evaluators and multimodal systems
  • Memory & long-context retention (what works vs what backfires)
  • Multimodal trustworthiness: hallucinations, calibration, and fusion diagnostics


3) Technical synthesis

  • GRPO is emerging as a common post-training primitive across domains: structured long-doc QA distillation (LITECOST), calibrated radiology confidence (ConRad), memory-RL infrastructure (MemFactory), and process-reward formal reasoning (PRoSFI), plus Shapley-enhanced multi-candidate RL (ShapE-GRPO).
  • “Make it executable” is the unifying anti-hallucination strategy: cbfl-ir constraints executed across tests (SemLoc), VCs discharged by SMT/Lean (WybeCoder), formal step intermediates checked by provers (PRoSFI), and tool-plan tags judged for precision/recall (ATP-Bench).
  • Multi-agent decomposition is used to scale verification and evaluation: WybeCoder dispatches VC subgoals to parallel prover agents; ATP-Bench uses a multi-agent judge (precision/recall/chief); Nomad separates explorer vs verifier for discovery.
  • Reliability failures are increasingly characterized as a distribution over runs, not a point estimate: repeated episodes (k=3) reveal variance amplification and rank inversions, and near-miss analysis shows that a “correct final state” can hide policy violations.
  • Memory is a double-edged sword: explicit layered memory with retention regularization reports retention/FMR gains, while a memory-augmented ReAct scaffold in the reliability experiments never improves long-horizon GDS and often hurts, suggesting interference and overhead dominate unless memory is carefully structured and trained.
  • Judge dependence is expanding, raising security stakes: SciVisAgentBench, ATP-Bench, TSHA, and CounselReflect all use LLM/MLLM judging with robustness checks; the LaaJ security SoK documents high-ASR attacks that could corrupt these pipelines.
  • Benchmark design is becoming more scientific: BenchScope’s effective dimensionality + null/reliability tests provide a way to audit whether a suite actually measures multiple independent capabilities.
  • Multimodal trustworthiness is being attacked and defended at different layers: attacks manipulate inputs (CoTTA), defenses manipulate hidden states (HIRE) or train calibrated confidence (ConRad), while PID tries to measure whether “vision mattered” at all.
  • System-level security proposals converge on “constrain what the model sees/decides”: structured artifacts, decoupled recognition vs action, and programmatic validators echo the same principle used in SemLoc/WybeCoder—reduce free-form degrees of freedom.
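
The “reduce free-form degrees of freedom” principle is simple to operationalize: parse model output against a declared schema and discard anything malformed or ungrounded, instead of executing free-form text. A minimal sketch (the schema and tool registry are hypothetical, not from any of the papers):

```python
import json

REQUIRED_KEYS = {"tool", "args"}          # illustrative step schema
ALLOWED_TOOLS = {"search", "read_file"}   # hypothetical tool registry

def validate_plan(raw: str):
    """Parse a model-emitted tool plan; return the plan only if every
    step is well-formed JSON and references a known tool."""
    try:
        plan = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(plan, list):
        return None
    for step in plan:
        if not isinstance(step, dict) or set(step) != REQUIRED_KEYS:
            return None
        if step["tool"] not in ALLOWED_TOOLS:
            return None  # unknown tool: discard rather than guess
    return plan

good = '[{"tool": "search", "args": {"q": "heap invariant"}}]'
bad = 'Sure! First I will search the web...'
print(validate_plan(good) is not None)  # True
print(validate_plan(bad))               # None
```

The same gate generalizes to constraints, formal steps, or IR fragments: anything that fails to parse or ground is rejected before it can act.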

4) Top 5 papers (with “why now”)

1) Beyond pass@1: A Reliability Science Framework for Long-Horizon LLM Agents

  • Introduces a metric suite (RDC/RDS, VAF, GDS, MOP) that exposes long-horizon failure modes hidden by pass@1.
  • Large-scale study (396 tasks, 23,392 episodes) shows universal reliability decay and rank inversions at long horizons.
  • Finds a sharp, actionable result: memory-augmented ReAct never improves long-horizon GDS and often hurts.
  • Skepticism / limitation: duration buckets use estimated human time (imperfect proxy); only 10 open-weight models and 3 domains.
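
A sketch of the underlying measurement idea: bucket episodes by estimated task duration and track success per bucket, so a falling curve over horizons becomes visible. The bucket thresholds below are illustrative, not the paper's scheme:

```python
from collections import defaultdict

def decay_by_duration(episodes):
    """Success rate per task-duration bucket; a monotone drop across
    buckets is the reliability-decay signature pass@1 alone hides."""
    agg = defaultdict(lambda: [0, 0])  # bucket -> [successes, total]
    for minutes, success in episodes:
        bucket = "short" if minutes < 15 else "medium" if minutes < 60 else "long"
        agg[bucket][0] += int(success)
        agg[bucket][1] += 1
    return {b: s / n for b, (s, n) in agg.items()}

# Toy episodes: (estimated human-minutes, success).
episodes = [(5, True), (10, True), (30, True), (45, False), (90, False), (120, False)]
print(decay_by_duration(episodes))  # {'short': 1.0, 'medium': 0.5, 'long': 0.0}
```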

2) WybeCoder: Verified Imperative Code Generation

  • Demonstrates agentic co-evolution of imperative code + invariants + proofs with SMT + Lean.
  • Reports strong solve rates on translated imperative benchmarks (e.g., 74.1% Verina-Loom, 62.1% Clever-Loom) and a large verified Heapsort artifact.
  • Multi-agent VC subgoal decomposition + proof transfer via deterministic naming is a concrete scaling recipe.
  • Skepticism / limitation: Loom/Velvet pipeline is experimental; managed-memory target; some manual spec/decomposition; open models lag.
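
The code/invariant co-evolution rests on loop invariants that can be checked mechanically. As a lightweight stand-in for the SMT/Lean discharge WybeCoder automates, an invariant can at least be bounded-checked by enumeration (the summation loop here is a toy example, not from the paper):

```python
def check_invariant(n_max=8):
    """Bounded check of a loop invariant for a summation loop:
    on entry to each iteration, total == i * (i - 1) // 2.
    Brute-force enumeration stands in for an SMT/Lean proof."""
    for n in range(n_max + 1):
        total = 0
        for i in range(n):
            assert total == i * (i - 1) // 2, (n, i)  # invariant on entry
            total += i
        assert total == n * (n - 1) // 2              # postcondition
    return True

print(check_invariant())  # True
```

Bounded checking only refutes invariants; turning the surviving ones into proofs is exactly the step the prover agents handle.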

3) SemLoc: Structured Grounding of Free-Form LLM Reasoning for Fault Localization

  • Converts LLM “semantic intent” into executable constraints anchored in SSA, enabling spectrum-style scoring across tests.
  • Big localization gains vs SBFL (e.g., Acc@1 42.8% vs 6.4%, and far fewer suspicious lines).
  • Counterfactual patching step materially improves Acc@1 (ablation shows ~12pp drop without it).
  • Skepticism / limitation: high constraint waste (many never trigger / over-approximate); dataset is single-fault, small programs; repo-scale setup issues.
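
For context on the SBFL baseline SemLoc is compared against: classic spectrum-based suspiciousness such as Ochiai is computed purely from per-line coverage of passing/failing tests, score = ef / sqrt(total_failed * (ef + ep)). A self-contained sketch with toy coverage data:

```python
import math

def ochiai(coverage, outcomes):
    """Ochiai suspiciousness per line. coverage: line -> list of bools
    (covered by test t); outcomes: list of bools (test t passed)."""
    total_failed = sum(not ok for ok in outcomes)
    scores = {}
    for line, covered in coverage.items():
        ef = sum(c and not ok for c, ok in zip(covered, outcomes))  # failing, covered
        ep = sum(c and ok for c, ok in zip(covered, outcomes))      # passing, covered
        denom = math.sqrt(total_failed * (ef + ep))
        scores[line] = ef / denom if denom else 0.0
    return scores

# 3 tests (True = pass); line 12 is covered only by the failing test.
coverage = {10: [True, True, True], 12: [False, False, True]}
outcomes = [True, True, False]
scores = ochiai(coverage, outcomes)
print(max(scores, key=scores.get))  # 12
```

SemLoc's gain comes from replacing raw coverage with executed semantic constraints, but the spectrum-style scoring over tests is the shared backbone.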

4) BenchScope: How Many Independent Signals Does Your Benchmark Provide?

  • Provides a fast diagnostic (Effective Dimensionality) to detect redundant benchmark suites and fragile composites.
  • Empirically shows major suites can collapse to ~1–2 effective axes (e.g., Open LLM Leaderboard ≈1.7).
  • Adds practical maintainer workflow (nulls, saturation, split-half reliability, ED-greedy selection).
  • Skepticism / limitation: ED is population-conditional; binary SVD overestimates dimensionality (needs corrections).
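
One common way to turn a score matrix's spectrum into an effective dimensionality is the participation ratio of its eigenvalues; BenchScope's exact estimator (with its null and binary-SVD corrections) may differ, so treat this as the general idea only:

```python
def effective_dim(eigvals):
    """Participation-ratio effective dimensionality:
    ED = (sum λ_i)^2 / sum λ_i^2. Equals k for k equal eigenvalues;
    collapses toward 1 when one axis dominates."""
    s1 = sum(eigvals)
    s2 = sum(v * v for v in eigvals)
    return s1 * s1 / s2

# One dominant axis + noise: ED near 1 despite 5 benchmarks.
print(effective_dim([10.0, 0.5, 0.4, 0.3, 0.2]))
# Five independent, equal-strength signals: ED = 5.
print(effective_dim([1.0] * 5))  # 5.0
```

In practice the eigenvalues would come from the model-by-benchmark score covariance, and the null/split-half checks guard against reading noise as signal.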

5) Adversarial Prompt Injection Attack on Multimodal Large Language Models

  • Shows a stealthy, expressive multimodal injection (covert text trigger + ℓ∞ perturbation) with high black-box ASR on commercial MLLMs.
  • Dual-target alignment (text + iteratively updated target image) is empirically critical (ablation: large ASR drop without it).
  • Directly relevant to agentic deployments where images are untrusted inputs.
  • Skepticism / limitation: limited tasks (captioning/VQA) and budgets; no human perceptual study reported.

5) Practical next steps

  • Adopt trajectory-level evaluation in your agent stack: run k-repeat episodes and compute reliability decay by task duration; log tool-call entropy to detect meltdowns (MOP-style) and correlate with failures.
  • Add near-miss auditing for any tool-using agent: for each mutating action, verify the required read-only evidence exists earlier in the trace (guard-code replay + history search).
  • Harden LLM-as-judge pipelines: treat judges as attack targets; use constrained schemas, ensemble/committee checks where feasible, and track judge drift/stability (prompt perturbation tests).
  • Prefer structured intermediates over free-form reasoning: require JSON/IR outputs that can be executed/checked (constraints, tool plans, formal steps), and discard malformed/ungrounded outputs.
  • Be cautious with “memory augmentation”: test whether your memory scaffold improves long-horizon GDS (partial credit) rather than just pass@1; consider layered memory with drift regularization rather than naive episodic scratchpads.
  • For multimodal agents, assume untrusted images: evaluate against covert prompt injection; add system-level defenses (plan/policy separation, structured validators) rather than relying on prompt instructions alone.
  • Audit your benchmark suite for redundancy before optimizing: compute effective dimensionality and run split-half/permutation null checks to ensure you’re not overfitting to a single latent axis.
  • If training multi-candidate generators, consider reward allocation that avoids free-riding (candidate-level credit assignment) rather than broadcasting a set-level scalar to all candidates.
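
The near-miss auditing step above can be sketched as a trace scan: for every mutating action, require that its read-only evidence action appears earlier in the trace. Action names, the trace schema, and the evidence map are all hypothetical:

```python
# Hypothetical policy: each mutating action requires a prior read-only
# evidence action, regardless of whether the end state looks correct.
EVIDENCE = {"delete_record": "fetch_record", "send_email": "preview_email"}

def near_misses(trace):
    """Return mutating actions whose required evidence never occurred
    earlier in the trace. trace: list of (action_name, kind) with kind
    in {"read", "mutate"}."""
    seen_reads, flagged = set(), []
    for action, kind in trace:
        if kind == "read":
            seen_reads.add(action)
        elif action in EVIDENCE and EVIDENCE[action] not in seen_reads:
            flagged.append(action)
    return flagged

trace = [("fetch_record", "read"), ("delete_record", "mutate"),
         ("send_email", "mutate")]  # emailed without previewing
print(near_misses(trace))  # ['send_email']
```

A real auditor would replay guard code against logged tool calls, but even this trace-order check catches policy violations that end-state evaluation misses.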

Generated from per-paper analyses; no external browsing.