Daily AI Paper Report (2026-04-02)

Published: April 02, 2026

Chinese version: [中文]

Run stats

Candidates: 235
Selected: 30
Deepread completed: 30
Window (UTC): 2026-03-31T00:00:00Z → 2026-04-01T00:00:00Z (arxiv_announce, expanded=0)

Show selected papers

arXiv ID	Title / Links	Categories	Score	Why	Tags
`2603.29403`	Security in LLM-as-a-Judge: A Comprehensive SoK PDF	cs.CR, cs.AI	94	First SoK on LLM-as-a-Judge security; maps attacks/risks for eval pipelines.	LLM-as-a-judge, security, evaluation, adversarial, SoK, reliability
`2603.29231`	Beyond pass@1: A Reliability Science Framework for Long-Horizon LLM Agents PDF	cs.AI	94	Reliability metrics for long-horizon agents; shows pass@1 fails as duration grows; large eval.	agents, reliability, evaluation, long-horizon, benchmarks, deployment
`2603.30016`	Architecting Secure AI Agents: Perspectives on System-Level Defenses Against Indirect Prompt Injection Attacks PDF	cs.CR, cs.AI	92	System-level design guidance for indirect prompt injection defenses in agents.	agents, prompt-injection, system-design, security, tool-use, policies
`2603.29993`	Extending MONA in Camera Dropbox: Reproduction, Learned Approval, and Design Implications for Reward-Hacking Mitigation PDF	cs.AI	92	Reproduces+extends MONA reward-hacking mitigation; probes learned approval assumptions & tooling.	alignment, reward-hacking, RL, MONA, reproducibility, safety
`2603.29357`	BenchScope: How Many Independent Signals Does Your Benchmark Provide? PDF	cs.AI	92	Quantifies benchmark redundancy via effective dimensionality; actionable for eval design/leaderboards.	evaluation, benchmarks, measurement, leaderboards, metrics
`2603.29665`	Near-Miss: Latent Policy Failure Detection in Agentic Workflows PDF	cs.CL	90	Detects latent policy failures (near-misses) in agent workflows beyond end-state checks.	agents, policy-compliance, evaluation, monitoring, ToolGuard, safety-metrics
`2603.29418`	Adversarial Prompt Injection Attack on Multimodal Large Language Models PDF	cs.CV, cs.AI	90	Imperceptible visual prompt injection against closed MLLMs; practical multimodal attack surface.	security, prompt-injection, multimodal, adversarial, red-teaming, MLLM
`2603.29500`	Learning to Generate Formally Verifiable Step-by-Step Logic Reasoning via Structured Formal Intermediaries PDF	cs.AI, cs.LG	90	Process reward using structured formal intermediates to improve step reliability without losing accuracy.	reasoning, process-reward, formal-methods, RL, reliability
`2603.29846`	SNEAK: Evaluating Strategic Communication and Information Leakage in Large Language Models PDF	cs.CL	88	Benchmark for strategic communication & secret-keeping; targets info leakage in LLMs.	information-leakage, multi-agent, benchmarks, security, strategic-communication, LLM-eval
`2603.29429`	CounselReflect: A Toolkit for Auditing Mental-Health Dialogues PDF	cs.CL	88	Auditing toolkit for mental-health dialogues with evidence-linked, multi-metric risk reports.	evaluation, auditing, safety, mental-health, rubrics, LLM
`2603.29353`	Nomad: Autonomous Exploration and Discovery PDF	cs.AI	88	Exploration-first agent architecture with hypothesis generation + independent verification; relevant to agent reliability.	agents, autonomous-research, tool-use, verification, evaluation
`2603.29373`	Beyond Idealized Patients: Evaluating LLMs under Challenging Patient Behaviors in Medical Consultations PDF	cs.CL	86	Realistic medical safety eval: challenging patient behaviors + concrete unsafe failure criteria.	medical, safety-evaluation, robustness, hallucinations, high-stakes, LLM
`2603.29492`	Calibrated Confidence Expression for Radiology Report Generation PDF	cs.CL	86	RL framework to calibrate verbalized confidence in radiology reports; targets hallucination risk.	calibration, medical, vision-language, hallucinations, RL, reliability
`2603.29632`	An Empirical Study of Multi-Agent Collaboration for Automated Research PDF	cs.MA, cs.AI	86	Controlled empirical study of multi-agent coordination for automated research; useful evidence for MAS design/safety.	multi-agent, coordination, automated-research, benchmarks, agent-evaluation
`2603.29194`	Multi-Layered Memory Architectures for LLM Agents: An Experimental Evaluation of Long-Term Context Retention PDF	cs.CV, cs.AI	86	Agent memory layering + retrieval gating reduces drift/false memories under bounded context budgets.	agents, memory, long-context, retrieval, reliability
`2603.29493`	MemFactory: Unified Inference & Training Framework for Agent Memory PDF	cs.CL, cs.AI	85	Unified framework for training/inference of agent memory with modular components; reusable infra.	agents, memory, framework, RL, tooling, long-term
`2603.29902`	ATP-Bench: Towards Agentic Tool Planning for MLLM Interleaved Generation PDF	cs.AI	84	ATP-Bench evaluates agentic tool planning for interleaved multimodal generation.	agents, tool-planning, multimodal, benchmark, MLLM, evaluation
`2603.29405`	Hallucination-aware intermediate representation edit in large vision-language models PDF	cs.CV, cs.AI	84	Low-overhead hallucination mitigation for VLMs via intermediate-representation detection and edits; practical reliability gain.	hallucinations, vision-language, reliability, representation-editing, multimodal
`2603.29676`	A Comprehensive Information-Decomposition Analysis of Large Vision-Language Models PDF	cs.LG, cs.CL, cs.CV	84	PID-based decomposition to measure redundant/unique/synergistic info in 26 LVLMs across tasks.	interpretability, vision-language, information-decomposition, multimodal, analysis
`2603.29497`	Distilling Human-Aligned Privacy Sensitivity Assessment from Large Language Models PDF	cs.CL	83	Distills LLM privacy sensitivity judgments into small models for scalable deployment.	privacy, distillation, data-governance, classification, LLM-judge, efficiency
`2603.29318`	PSPA-Bench: A Personalized Benchmark for Smartphone GUI Agent PDF	cs.AI	83	Personalization benchmark for smartphone GUI agents with 12.8k instructions across apps/scenarios.	agents, benchmarks, GUI, smartphones, personalization, evaluation
`2603.29139`	SciVisAgentBench: A Benchmark for Evaluating Scientific Data Analysis and Visualization Agents PDF	cs.AI, cs.GR, cs.HC	82	Benchmark for scientific analysis/visualization agents with taxonomy + outcome-centric evaluation.	agents, benchmarks, scientific-workflows, tool-use, evaluation, visualization
`2603.29466`	An Isotropic Approach to Efficient Uncertainty Quantification with Gradient Norms PDF	cs.LG, cs.AI, cs.CL	82	Cheap uncertainty estimates from gradient norms (single backward pass) for large models; helps calibration/monitoring.	uncertainty, calibration, gradient-norm, epistemic-uncertainty, monitoring
`2603.29288`	Sima AIunty: Caste Audit in LLM-Driven Matchmaking PDF	cs.CY, cs.AI, cs.CL, cs.HC, cs.SI	82	Controlled audit of caste bias in LLM matchmaking across model families and income strata.	bias, fairness, auditing, sociotechnical, evaluation
`2603.29759`	TSHA: A Benchmark for Visual Language Models in Trustworthy Safety Hazard Assessment Scenarios PDF	cs.CV, cs.AI	81	Large real-world VLM benchmark for trustworthy indoor safety hazard assessment.	VLM, safety, benchmark, hazard-detection, robust-evaluation, vision-language
`2603.29232`	Long-Document QA with Chain-of-Structured-Thought and Fine-Tuned SLMs PDF	cs.CL, cs.AI, cs.LG	80	Structured long-doc QA (CoST) enabling verifiable outputs; aims for accuracy+latency with SLMs.	long-context, QA, structured-output, verification, SLMs, reliability
`2603.29871`	ShapE-GRPO: Shapley-Enhanced Reward Allocation for Multi-Candidate LLM Training PDF	cs.AI	80	Shapley-style reward allocation for multi-candidate LLM post-training; reduces free-riding vs set-level rewards.	LLM-training, RLHF, GRPO, credit-assignment, shapley
`2603.29109`	SemLoc: Structured Grounding of Free-Form LLM Reasoning for Fault Localization PDF	cs.SE, cs.AI	80	Grounds free-form LLM reasoning into structured intermediates for more verifiable fault localization.	software, debugging, LLM-reasoning, grounding, verification
`2603.29088`	WybeCoder: Verified Imperative Code Generation PDF	cs.SE, cs.AI	79	Agentic verified code generation with co-evolving invariants/proofs; improves reliability.	code-generation, verification, agents, Lean, SMT, reliability
`2603.29824`	Curvature-Guided LoRA: Steering in the pretrained NTK subspace PDF	cs.LG	79	Curvature/NTK-guided LoRA aims to match full fine-tuning predictions with efficient second-order updates.	PEFT, LoRA, optimization, second-order, fine-tuning

AI Paper Insight Brief

2026-04-02

0) Executive takeaways (read this first)

Evaluation is shifting from “did it work once?” to “did it work reliably and safely over trajectories?” New metrics/benchmarks target long-horizon reliability decay (RDC/VAF/GDS/MOP), latent policy failures (“near-misses”), and benchmark redundancy (effective dimensionality), suggesting many current leaderboards overstate progress.
Structured, executable intermediates are becoming the dominant pattern for grounding LLM reasoning. This shows up in verified imperative code generation (VC subgoals + Lean/SMT), semantic fault localization (LLM→executable constraints), and long-doc QA (LLM→structured outputs distilled into SLMs).
Memory is not “free”: naive memory scaffolds can hurt long-horizon performance. Reliability study finds memory-augmented ReAct never improves long-horizon GDS and often hurts; in contrast, more explicit layered memory (working/episodic/semantic + retention regularization) reports retention/FMR gains—pointing to memory design as the key variable.
LLM-as-a-judge is now a critical security dependency. A security SoK catalogs high-ASR attacks (prompt injection, poisoning/backdoors, tokenization exploits) against judges; multiple new benchmarks also rely on MLLM judges, increasing the need for judge hardening and meta-evaluation.
Multimodal systems face a two-sided squeeze: stronger benchmarks + stronger attacks. New hazard-assessment and tool-planning benchmarks raise realism/coverage, while covert multimodal prompt injection achieves high black-box ASR against commercial MLLMs—deployment needs system-level defenses, not just model tweaks.
Formal verification is expanding from functional proofs to imperative programs at scale. WybeCoder reports high solve rates on translated imperative benchmarks and a large verified artifact (Heapsort), indicating agentic proof+code co-evolution is becoming practical (with caveats).

2) Key themes (clusters)

Theme: Trajectory-level reliability & hidden failures in agents

Why it matters: Production agents fail via variance, meltdowns, and policy omissions that outcome-only metrics miss. Measuring how agents succeed (or nearly fail) is becoming as important as success rate.
Representative papers:
Common approach:
- Evaluate repeated episodes and duration buckets, not single-shot pass@1.
- Add trajectory diagnostics (meltdown detection via tool-call entropy; guard-code replay to check required reads before writes).
- Treat security as system architecture (plan/policy separation, constrained judges, programmatic validators), not only prompt hardening.
Open questions / failure modes:
- “Duration” proxies can misalign with agent step complexity; domain effects can invert trends.
- Meltdown detection is descriptive—how to turn MOP into reliable interventions (restart/checkpoint) without harming task completion?
- Near-miss detection depends on correctness of compiled guard code and generalization beyond τ2-verified Airlines.

Theme: Structured intermediates for grounding, auditability, and distillation

Why it matters: Free-form reasoning is hard to verify; structured artifacts make outputs executable, scorable, and transferable to smaller models or formal tools.
Representative papers:
Common approach:
- Convert model outputs into closed schemas (cbfl-ir constraints; CoST serialized structured outputs; JSON formal steps; VCs/subgoals).
- Use execution/verification loops (pytest instrumentation; Lean/SMT discharge; prover checks) to score and refine.
- Distill structured behavior into smaller models (SLMs via SFT→GRPO) or scale via multi-agent subgoal parallelism.
Open questions / failure modes:
- Constraint/step noise is high (e.g., many inferred constraints never trigger); how to improve discriminative coverage without exploding cost?
- Verification can be brittle to spec engineering and decomposition (manual steps still present in imperative verification).
- Reliance on LLM judges for “process consistency” or “soundness” can reintroduce judge vulnerabilities.

Theme: Benchmarking the benchmark (redundancy, judge validity, domain realism)

Why it matters: If benchmarks are redundant or judge pipelines are unstable, leaderboard movement can be illusory; domain-specific agent benchmarks are proliferating and need validity checks.
Representative papers:
Common approach:
- Add taxonomy + expert curation and outcome-centric rubrics; combine deterministic checks with MLLM judging.
- Measure process-level performance (TDG alignment APR/PPR; tool-call precision/recall; artifact recording for post-hoc verification).
- Quantify judge validity/stability (human–LLM alignment, robustness to perturbations; multi-judge aggregation).
Open questions / failure modes:
- Effective dimensionality is population-conditional; suites can compress over time and become fragile to weighting.
- Outcome-centric benchmarks may exclude tasks with multiple valid outputs; process-based evaluation remains hard.
- Judge pipelines are a security and reliability dependency (prompt injection, rubric manipulation, evaluator drift).

Theme: Security & privacy risks in evaluators and multimodal systems

Why it matters: As evaluation and training pipelines depend on LLM judges and multimodal agents, attacks can corrupt rewards, rankings, and downstream behavior.
Representative papers:
Common approach:
- Systematize attack surfaces (inference-time injection, training-time poisoning/backdoors, tokenization exploits) and defenses (ensembles, detectors).
- Demonstrate black-box transfer attacks on commercial MLLMs via surrogate optimization (dual-target alignment).
- Build operational metrics for privacy/leakage (Likert sensitivity distillation; utility–leakage SoftScore).
Open questions / failure modes:
- How to harden judge pipelines without losing scalability (committee methods help but add cost/complexity)?
- Multimodal “imperceptible” attacks are bounded by ℓ∞ but lack human perceptual validation in reported results.
- Privacy sensitivity is a single scalar and inherits teacher biases; context (audience/purpose) is missing.

Theme: Memory & long-context retention (what works vs what backfires)

Why it matters: Long-horizon agents need persistence without drift; but adding memory can increase interference and failure cascades.
Representative papers:
Common approach:
- Explicit memory decomposition (working/episodic/semantic) with gating and drift regularization.
- Treat memory ops as RL-optimizable modules (Extractor/Updater/Retriever) with GRPO training infrastructure.
- Evaluate on long-horizon benchmarks (LOCOMO/LOCCO/LoCoMo) and multi-episode reliability protocols.
Open questions / failure modes:
- Reliability study finds naive episodic scratchpad harms long-horizon GDS—what distinguishes “good” memory from harmful memory?
- OOD behavior can degrade (MemFactory notes slight OOD decrease for smaller model).
- Calibration/validation of memory “truth maintenance” remains under-specified.

Theme: Multimodal trustworthiness: hallucinations, calibration, and fusion diagnostics

Why it matters: LVLMs can be fluent but wrong; deployment needs controllable hallucination reduction, calibrated confidence, and tools to diagnose whether models actually use vision.
Representative papers:
Common approach:
- Inference-time interventions without full retraining (representation editing with selective router; controllable strength α).
- RL fine-tuning with proper scoring rules to elicit calibrated verbalized confidence (GRPO + log scoring rule).
- Quantitative interpretability via PID (redundancy/unique/synergy) and realism-focused safety benchmarks.
Open questions / failure modes:
- Representation editing requires training auxiliary modules and relies on induced hallucination pairs; robustness beyond standard benchmarks is unclear.
- Confidence filtering trades precision vs coverage; sentence-level calibration remains harder than report-level.
- PID uses approximate unimodal masking; extension to open-ended generation is not covered.

3) Technical synthesis

GRPO is emerging as a common post-training primitive across domains: structured long-doc QA distillation (LITECOST), calibrated radiology confidence (ConRad), memory-RL infrastructure (MemFactory), and process-reward formal reasoning (PRoSFI), plus Shapley-enhanced multi-candidate RL (ShapE-GRPO).
“Make it executable” is the unifying anti-hallucination strategy: cbfl-ir constraints executed across tests (SemLoc), VCs discharged by SMT/Lean (WybeCoder), formal step intermediates checked by provers (PRoSFI), and tool-plan tags judged for precision/recall (ATP-Bench).
Multi-agent decomposition is used to scale verification and evaluation: WybeCoder dispatches VC subgoals to parallel prover agents; ATP-Bench uses a multi-agent judge (precision/recall/chief); Nomad separates explorer vs verifier for discovery.
Reliability failures are increasingly characterized as distribution over runs, not a point estimate: repeated episodes (k=3) reveal variance amplification and rank inversions; near-misses show “correct final state” can hide policy violations.
Memory is a double-edged sword: explicit layered memory with retention regularization reports retention/FMR gains, while a memory-augmented ReAct scaffold in reliability experiments never improves long-horizon GDS and often hurts—suggesting interference/overhead dominates unless memory is carefully structured and trained.
Judge dependence is expanding, raising security stakes: SciVisAgentBench, ATP-Bench, TSHA, and CounselReflect all use LLM/MLLM judging with robustness checks; the LaaJ security SoK documents high-ASR attacks that could corrupt these pipelines.
Benchmark design is becoming more scientific: BenchScope’s effective dimensionality + null/reliability tests provide a way to audit whether a suite actually measures multiple independent capabilities.
Multimodal trustworthiness is being attacked and defended at different layers: attacks manipulate inputs (CoTTA), defenses manipulate hidden states (HIRE) or train calibrated confidence (ConRad), while PID tries to measure whether “vision mattered” at all.
System-level security proposals converge on “constrain what the model sees/decides”: structured artifacts, decoupled recognition vs action, and programmatic validators echo the same principle used in SemLoc/WybeCoder—reduce free-form degrees of freedom.

4) Top 5 papers (with “why now”)

1) Beyond pass@1: A Reliability Science Framework for Long-Horizon LLM Agents

Introduces a metric suite (RDC/RDS, VAF, GDS, MOP) that exposes long-horizon failure modes hidden by pass@1.
Large-scale study (396 tasks, 23,392 episodes) shows universal reliability decay and rank inversions at long horizons.
Finds a sharp, actionable result: memory-augmented ReAct never improves long-horizon GDS and often hurts.
Skepticism / limitation: duration buckets use estimated human time (imperfect proxy); only 10 open-weight models and 3 domains.

2) WybeCoder: Verified Imperative Code Generation

Demonstrates agentic co-evolution of imperative code + invariants + proofs with SMT + Lean.
Reports strong solve rates on translated imperative benchmarks (e.g., 74.1% Verina-Loom, 62.1% Clever-Loom) and a large verified Heapsort artifact.
Multi-agent VC subgoal decomposition + proof transfer via deterministic naming is a concrete scaling recipe.
Skepticism / limitation: Loom/Velvet pipeline is experimental; managed-memory target; some manual spec/decomposition; open models lag.

3) SemLoc: Structured Grounding of Free-Form LLM Reasoning for Fault Localization

Converts LLM “semantic intent” into executable constraints anchored in SSA, enabling spectrum-style scoring across tests.
Big localization gains vs SBFL (e.g., Acc@1 42.8% vs 6.4%, and far fewer suspicious lines).
Counterfactual patching step materially improves Acc@1 (ablation shows ~12pp drop without it).
Skepticism / limitation: high constraint waste (many never trigger / over-approximate); dataset is single-fault, small programs; repo-scale setup issues.

4) BenchScope: How Many Independent Signals Does Your Benchmark Provide?

Provides a fast diagnostic (Effective Dimensionality) to detect redundant benchmark suites and fragile composites.
Empirically shows major suites can collapse to ~1–2 effective axes (e.g., Open LLM Leaderboard ≈1.7).
Adds practical maintainer workflow (nulls, saturation, split-half reliability, ED-greedy selection).
Skepticism / limitation: ED is population-conditional; binary SVD overestimates dimensionality (needs corrections).

5) Adversarial Prompt Injection Attack on Multimodal Large Language Models

Shows a stealthy, expressive multimodal injection (covert text trigger + ℓ∞ perturbation) with high black-box ASR on commercial MLLMs.
Dual-target alignment (text + iteratively updated target image) is empirically critical (ablation: large ASR drop without it).
Directly relevant to agentic deployments where images are untrusted inputs.
Skepticism / limitation: limited tasks (captioning/VQA) and budgets; no human perceptual study reported.

5) Practical next steps

Adopt trajectory-level evaluation in your agent stack: run k-repeat episodes and compute reliability decay by task duration; log tool-call entropy to detect meltdowns (MOP-style) and correlate with failures.
Add near-miss auditing for any tool-using agent: for each mutating action, verify the required read-only evidence exists earlier in the trace (guard-code replay + history search).
Harden LLM-as-judge pipelines: treat judges as attack targets; use constrained schemas, ensemble/committee checks where feasible, and track judge drift/stability (prompt perturbation tests).
Prefer structured intermediates over free-form reasoning: require JSON/IR outputs that can be executed/checked (constraints, tool plans, formal steps), and discard malformed/ungrounded outputs.
Be cautious with “memory augmentation”: test whether your memory scaffold improves long-horizon GDS (partial credit) rather than just pass@1; consider layered memory with drift regularization rather than naive episodic scratchpads.
For multimodal agents, assume untrusted images: evaluate against covert prompt injection; add system-level defenses (plan/policy separation, structured validators) rather than relying on prompt instructions alone.
Audit your benchmark suite for redundancy before optimizing: compute effective dimensionality and run split-half/permutation null checks to ensure you’re not overfitting to a single latent axis.
If training multi-candidate generators, consider reward allocation that avoids free-riding (candidate-level credit assignment) rather than broadcasting a set-level scalar to all candidates.

Generated from per-paper analyses; no external browsing.

Di Tang

AI Paper Insight Brief

2026-04-02

0) Executive takeaways (read this first)

2) Key themes (clusters)

Theme: Trajectory-level reliability & hidden failures in agents

Theme: Structured intermediates for grounding, auditability, and distillation

Theme: Benchmarking the benchmark (redundancy, judge validity, domain realism)

Theme: Security & privacy risks in evaluators and multimodal systems

Theme: Memory & long-context retention (what works vs what backfires)

Theme: Multimodal trustworthiness: hallucinations, calibration, and fusion diagnostics

3) Technical synthesis

4) Top 5 papers (with “why now”)

5) Practical next steps