Daily AI Paper Report (2026-04-02)
Published:
Chinese version: [中文]
Run stats
- Candidates: 235
- Selected: 30
- Deepread completed: 30
- Window (UTC): 2026-03-31T00:00:00Z → 2026-04-01T00:00:00Z (arxiv_announce, expanded=0)
Show selected papers
| arXiv ID | Title / Links | Categories | Score | Why | Tags |
|---|---|---|---|---|---|
2603.29403 | Security in LLM-as-a-Judge: A Comprehensive SoK | cs.CR, cs.AI | 94 | First SoK on LLM-as-a-Judge security; maps attacks/risks for eval pipelines. | LLM-as-a-judge, security, evaluation, adversarial, SoK, reliability |
2603.29231 | Beyond pass@1: A Reliability Science Framework for Long-Horizon LLM Agents | cs.AI | 94 | Reliability metrics for long-horizon agents; shows pass@1 fails as duration grows; large eval. | agents, reliability, evaluation, long-horizon, benchmarks, deployment |
2603.30016 | Architecting Secure AI Agents: Perspectives on System-Level Defenses Against Indirect Prompt Injection Attacks | cs.CR, cs.AI | 92 | System-level design guidance for indirect prompt injection defenses in agents. | agents, prompt-injection, system-design, security, tool-use, policies |
2603.29993 | Extending MONA in Camera Dropbox: Reproduction, Learned Approval, and Design Implications for Reward-Hacking Mitigation | cs.AI | 92 | Reproduces+extends MONA reward-hacking mitigation; probes learned approval assumptions & tooling. | alignment, reward-hacking, RL, MONA, reproducibility, safety |
2603.29357 | BenchScope: How Many Independent Signals Does Your Benchmark Provide? | cs.AI | 92 | Quantifies benchmark redundancy via effective dimensionality; actionable for eval design/leaderboards. | evaluation, benchmarks, measurement, leaderboards, metrics |
2603.29665 | Near-Miss: Latent Policy Failure Detection in Agentic Workflows | cs.CL | 90 | Detects latent policy failures (near-misses) in agent workflows beyond end-state checks. | agents, policy-compliance, evaluation, monitoring, ToolGuard, safety-metrics |
2603.29418 | Adversarial Prompt Injection Attack on Multimodal Large Language Models | cs.CV, cs.AI | 90 | Imperceptible visual prompt injection against closed MLLMs; practical multimodal attack surface. | security, prompt-injection, multimodal, adversarial, red-teaming, MLLM |
2603.29500 | Learning to Generate Formally Verifiable Step-by-Step Logic Reasoning via Structured Formal Intermediaries | cs.AI, cs.LG | 90 | Process reward using structured formal intermediates to improve step reliability without losing accuracy. | reasoning, process-reward, formal-methods, RL, reliability |
2603.29846 | SNEAK: Evaluating Strategic Communication and Information Leakage in Large Language Models | cs.CL | 88 | Benchmark for strategic communication & secret-keeping; targets info leakage in LLMs. | information-leakage, multi-agent, benchmarks, security, strategic-communication, LLM-eval |
2603.29429 | CounselReflect: A Toolkit for Auditing Mental-Health Dialogues | cs.CL | 88 | Auditing toolkit for mental-health dialogues with evidence-linked, multi-metric risk reports. | evaluation, auditing, safety, mental-health, rubrics, LLM |
2603.29353 | Nomad: Autonomous Exploration and Discovery | cs.AI | 88 | Exploration-first agent architecture with hypothesis generation + independent verification; relevant to agent reliability. | agents, autonomous-research, tool-use, verification, evaluation |
2603.29373 | Beyond Idealized Patients: Evaluating LLMs under Challenging Patient Behaviors in Medical Consultations | cs.CL | 86 | Realistic medical safety eval: challenging patient behaviors + concrete unsafe failure criteria. | medical, safety-evaluation, robustness, hallucinations, high-stakes, LLM |
2603.29492 | Calibrated Confidence Expression for Radiology Report Generation | cs.CL | 86 | RL framework to calibrate verbalized confidence in radiology reports; targets hallucination risk. | calibration, medical, vision-language, hallucinations, RL, reliability |
2603.29632 | An Empirical Study of Multi-Agent Collaboration for Automated Research | cs.MA, cs.AI | 86 | Controlled empirical study of multi-agent coordination for automated research; useful evidence for MAS design/safety. | multi-agent, coordination, automated-research, benchmarks, agent-evaluation |
2603.29194 | Multi-Layered Memory Architectures for LLM Agents: An Experimental Evaluation of Long-Term Context Retention | cs.CV, cs.AI | 86 | Agent memory layering + retrieval gating reduces drift/false memories under bounded context budgets. | agents, memory, long-context, retrieval, reliability |
2603.29493 | MemFactory: Unified Inference & Training Framework for Agent Memory | cs.CL, cs.AI | 85 | Unified framework for training/inference of agent memory with modular components; reusable infra. | agents, memory, framework, RL, tooling, long-term |
2603.29902 | ATP-Bench: Towards Agentic Tool Planning for MLLM Interleaved Generation | cs.AI | 84 | ATP-Bench evaluates agentic tool planning for interleaved multimodal generation. | agents, tool-planning, multimodal, benchmark, MLLM, evaluation |
2603.29405 | Hallucination-aware intermediate representation edit in large vision-language models | cs.CV, cs.AI | 84 | Low-overhead hallucination mitigation for VLMs via intermediate-representation detection and edits; practical reliability gain. | hallucinations, vision-language, reliability, representation-editing, multimodal |
2603.29676 | A Comprehensive Information-Decomposition Analysis of Large Vision-Language Models | cs.LG, cs.CL, cs.CV | 84 | PID-based decomposition to measure redundant/unique/synergistic info in 26 LVLMs across tasks. | interpretability, vision-language, information-decomposition, multimodal, analysis |
2603.29497 | Distilling Human-Aligned Privacy Sensitivity Assessment from Large Language Models | cs.CL | 83 | Distills LLM privacy sensitivity judgments into small models for scalable deployment. | privacy, distillation, data-governance, classification, LLM-judge, efficiency |
2603.29318 | PSPA-Bench: A Personalized Benchmark for Smartphone GUI Agent | cs.AI | 83 | Personalization benchmark for smartphone GUI agents with 12.8k instructions across apps/scenarios. | agents, benchmarks, GUI, smartphones, personalization, evaluation |
2603.29139 | SciVisAgentBench: A Benchmark for Evaluating Scientific Data Analysis and Visualization Agents | cs.AI, cs.GR, cs.HC | 82 | Benchmark for scientific analysis/visualization agents with taxonomy + outcome-centric evaluation. | agents, benchmarks, scientific-workflows, tool-use, evaluation, visualization |
2603.29466 | An Isotropic Approach to Efficient Uncertainty Quantification with Gradient Norms | cs.LG, cs.AI, cs.CL | 82 | Cheap uncertainty estimates from gradient norms (single backward pass) for large models; helps calibration/monitoring. | uncertainty, calibration, gradient-norm, epistemic-uncertainty, monitoring |
2603.29288 | Sima AIunty: Caste Audit in LLM-Driven Matchmaking | cs.CY, cs.AI, cs.CL, cs.HC, cs.SI | 82 | Controlled audit of caste bias in LLM matchmaking across model families and income strata. | bias, fairness, auditing, sociotechnical, evaluation |
2603.29759 | TSHA: A Benchmark for Visual Language Models in Trustworthy Safety Hazard Assessment Scenarios | cs.CV, cs.AI | 81 | Large real-world VLM benchmark for trustworthy indoor safety hazard assessment. | VLM, safety, benchmark, hazard-detection, robust-evaluation, vision-language |
2603.29232 | Long-Document QA with Chain-of-Structured-Thought and Fine-Tuned SLMs | cs.CL, cs.AI, cs.LG | 80 | Structured long-doc QA (CoST) enabling verifiable outputs; aims for accuracy+latency with SLMs. | long-context, QA, structured-output, verification, SLMs, reliability |
2603.29871 | ShapE-GRPO: Shapley-Enhanced Reward Allocation for Multi-Candidate LLM Training | cs.AI | 80 | Shapley-style reward allocation for multi-candidate LLM post-training; reduces free-riding vs set-level rewards. | LLM-training, RLHF, GRPO, credit-assignment, shapley |
2603.29109 | SemLoc: Structured Grounding of Free-Form LLM Reasoning for Fault Localization | cs.SE, cs.AI | 80 | Grounds free-form LLM reasoning into structured intermediates for more verifiable fault localization. | software, debugging, LLM-reasoning, grounding, verification |
2603.29088 | WybeCoder: Verified Imperative Code Generation | cs.SE, cs.AI | 79 | Agentic verified code generation with co-evolving invariants/proofs; improves reliability. | code-generation, verification, agents, Lean, SMT, reliability |
2603.29824 | Curvature-Guided LoRA: Steering in the pretrained NTK subspace | cs.LG | 79 | Curvature/NTK-guided LoRA aims to match full fine-tuning predictions with efficient second-order updates. | PEFT, LoRA, optimization, second-order, fine-tuning |
AI Paper Insight Brief
2026-04-02
0) Executive takeaways (read this first)
- Evaluation is shifting from “did it work once?” to “did it work reliably and safely over trajectories?” New metrics/benchmarks target long-horizon reliability decay (RDC/VAF/GDS/MOP), latent policy failures (“near-misses”), and benchmark redundancy (effective dimensionality), suggesting many current leaderboards overstate progress.
- Structured, executable intermediates are becoming the dominant pattern for grounding LLM reasoning. This shows up in verified imperative code generation (VC subgoals + Lean/SMT), semantic fault localization (LLM→executable constraints), and long-doc QA (LLM→structured outputs distilled into SLMs).
- Memory is not “free”: naive memory scaffolds can hurt long-horizon performance. Reliability study finds memory-augmented ReAct never improves long-horizon GDS and often hurts; in contrast, more explicit layered memory (working/episodic/semantic + retention regularization) reports retention/FMR gains—pointing to memory design as the key variable.
- LLM-as-a-judge is now a critical security dependency. A security SoK catalogs high-ASR attacks (prompt injection, poisoning/backdoors, tokenization exploits) against judges; multiple new benchmarks also rely on MLLM judges, increasing the need for judge hardening and meta-evaluation.
- Multimodal systems face a two-sided squeeze: stronger benchmarks + stronger attacks. New hazard-assessment and tool-planning benchmarks raise realism/coverage, while covert multimodal prompt injection achieves high black-box ASR against commercial MLLMs—deployment needs system-level defenses, not just model tweaks.
- Formal verification is expanding from functional proofs to imperative programs at scale. WybeCoder reports high solve rates on translated imperative benchmarks and a large verified artifact (Heapsort), indicating agentic proof+code co-evolution is becoming practical (with caveats).
2) Key themes (clusters)
Theme: Trajectory-level reliability & hidden failures in agents
- Why it matters: Production agents fail via variance, meltdowns, and policy omissions that outcome-only metrics miss. Measuring how agents succeed (or nearly fail) is becoming as important as success rate.
- Representative papers:
- Common approach:
- Evaluate repeated episodes and duration buckets, not single-shot pass@1.
- Add trajectory diagnostics (meltdown detection via tool-call entropy; guard-code replay to check required reads before writes).
- Treat security as system architecture (plan/policy separation, constrained judges, programmatic validators), not only prompt hardening.
- Open questions / failure modes:
- “Duration” proxies can misalign with agent step complexity; domain effects can invert trends.
- Meltdown detection is descriptive—how to turn MOP into reliable interventions (restart/checkpoint) without harming task completion?
- Near-miss detection depends on correctness of compiled guard code and generalization beyond τ2-verified Airlines.
Theme: Structured intermediates for grounding, auditability, and distillation
- Why it matters: Free-form reasoning is hard to verify; structured artifacts make outputs executable, scorable, and transferable to smaller models or formal tools.
- Representative papers:
- Common approach:
- Convert model outputs into closed schemas (cbfl-ir constraints; CoST serialized structured outputs; JSON formal steps; VCs/subgoals).
- Use execution/verification loops (pytest instrumentation; Lean/SMT discharge; prover checks) to score and refine.
- Distill structured behavior into smaller models (SLMs via SFT→GRPO) or scale via multi-agent subgoal parallelism.
- Open questions / failure modes:
- Constraint/step noise is high (e.g., many inferred constraints never trigger); how to improve discriminative coverage without exploding cost?
- Verification can be brittle to spec engineering and decomposition (manual steps still present in imperative verification).
- Reliance on LLM judges for “process consistency” or “soundness” can reintroduce judge vulnerabilities.
Theme: Benchmarking the benchmark (redundancy, judge validity, domain realism)
- Why it matters: If benchmarks are redundant or judge pipelines are unstable, leaderboard movement can be illusory; domain-specific agent benchmarks are proliferating and need validity checks.
- Representative papers:
- Common approach:
- Add taxonomy + expert curation and outcome-centric rubrics; combine deterministic checks with MLLM judging.
- Measure process-level performance (TDG alignment APR/PPR; tool-call precision/recall; artifact recording for post-hoc verification).
- Quantify judge validity/stability (human–LLM alignment, robustness to perturbations; multi-judge aggregation).
- Open questions / failure modes:
- Effective dimensionality is population-conditional; suites can compress over time and become fragile to weighting.
- Outcome-centric benchmarks may exclude tasks with multiple valid outputs; process-based evaluation remains hard.
- Judge pipelines are a security and reliability dependency (prompt injection, rubric manipulation, evaluator drift).
Theme: Security & privacy risks in evaluators and multimodal systems
- Why it matters: As evaluation and training pipelines depend on LLM judges and multimodal agents, attacks can corrupt rewards, rankings, and downstream behavior.
- Representative papers:
- Common approach:
- Systematize attack surfaces (inference-time injection, training-time poisoning/backdoors, tokenization exploits) and defenses (ensembles, detectors).
- Demonstrate black-box transfer attacks on commercial MLLMs via surrogate optimization (dual-target alignment).
- Build operational metrics for privacy/leakage (Likert sensitivity distillation; utility–leakage SoftScore).
- Open questions / failure modes:
- How to harden judge pipelines without losing scalability (committee methods help but add cost/complexity)?
- Multimodal “imperceptible” attacks are bounded by ℓ∞ but lack human perceptual validation in reported results.
- Privacy sensitivity is a single scalar and inherits teacher biases; context (audience/purpose) is missing.
Theme: Memory & long-context retention (what works vs what backfires)
- Why it matters: Long-horizon agents need persistence without drift; but adding memory can increase interference and failure cascades.
- Representative papers:
- Common approach:
- Explicit memory decomposition (working/episodic/semantic) with gating and drift regularization.
- Treat memory ops as RL-optimizable modules (Extractor/Updater/Retriever) with GRPO training infrastructure.
- Evaluate on long-horizon benchmarks (LOCOMO/LOCCO/LoCoMo) and multi-episode reliability protocols.
- Open questions / failure modes:
- Reliability study finds naive episodic scratchpad harms long-horizon GDS—what distinguishes “good” memory from harmful memory?
- OOD behavior can degrade (MemFactory notes slight OOD decrease for smaller model).
- Calibration/validation of memory “truth maintenance” remains under-specified.
Theme: Multimodal trustworthiness: hallucinations, calibration, and fusion diagnostics
- Why it matters: LVLMs can be fluent but wrong; deployment needs controllable hallucination reduction, calibrated confidence, and tools to diagnose whether models actually use vision.
- Representative papers:
- Hallucination-aware intermediate representation edit in large vision-language models
- Calibrated Confidence Expression for Radiology Report Generation
- A Comprehensive Information-Decomposition Analysis of Large Vision-Language Models
- TSHA: A Benchmark for Visual Language Models in Trustworthy Safety Hazard Assessment Scenarios
- Common approach:
- Inference-time interventions without full retraining (representation editing with selective router; controllable strength α).
- RL fine-tuning with proper scoring rules to elicit calibrated verbalized confidence (GRPO + log scoring rule).
- Quantitative interpretability via PID (redundancy/unique/synergy) and realism-focused safety benchmarks.
- Open questions / failure modes:
- Representation editing requires training auxiliary modules and relies on induced hallucination pairs; robustness beyond standard benchmarks is unclear.
- Confidence filtering trades precision vs coverage; sentence-level calibration remains harder than report-level.
- PID uses approximate unimodal masking; extension to open-ended generation is not covered.
3) Technical synthesis
- GRPO is emerging as a common post-training primitive across domains: structured long-doc QA distillation (LITECOST), calibrated radiology confidence (ConRad), memory-RL infrastructure (MemFactory), and process-reward formal reasoning (PRoSFI), plus Shapley-enhanced multi-candidate RL (ShapE-GRPO).
- “Make it executable” is the unifying anti-hallucination strategy: cbfl-ir constraints executed across tests (SemLoc), VCs discharged by SMT/Lean (WybeCoder), formal step intermediates checked by provers (PRoSFI), and tool-plan tags judged for precision/recall (ATP-Bench).
- Multi-agent decomposition is used to scale verification and evaluation: WybeCoder dispatches VC subgoals to parallel prover agents; ATP-Bench uses a multi-agent judge (precision/recall/chief); Nomad separates explorer vs verifier for discovery.
- Reliability failures are increasingly characterized as distribution over runs, not a point estimate: repeated episodes (k=3) reveal variance amplification and rank inversions; near-misses show “correct final state” can hide policy violations.
- Memory is a double-edged sword: explicit layered memory with retention regularization reports retention/FMR gains, while a memory-augmented ReAct scaffold in reliability experiments never improves long-horizon GDS and often hurts—suggesting interference/overhead dominates unless memory is carefully structured and trained.
- Judge dependence is expanding, raising security stakes: SciVisAgentBench, ATP-Bench, TSHA, and CounselReflect all use LLM/MLLM judging with robustness checks; the LaaJ security SoK documents high-ASR attacks that could corrupt these pipelines.
- Benchmark design is becoming more scientific: BenchScope’s effective dimensionality + null/reliability tests provide a way to audit whether a suite actually measures multiple independent capabilities.
- Multimodal trustworthiness is being attacked and defended at different layers: attacks manipulate inputs (CoTTA), defenses manipulate hidden states (HIRE) or train calibrated confidence (ConRad), while PID tries to measure whether “vision mattered” at all.
- System-level security proposals converge on “constrain what the model sees/decides”: structured artifacts, decoupled recognition vs action, and programmatic validators echo the same principle used in SemLoc/WybeCoder—reduce free-form degrees of freedom.
4) Top 5 papers (with “why now”)
1) Beyond pass@1: A Reliability Science Framework for Long-Horizon LLM Agents
- Introduces a metric suite (RDC/RDS, VAF, GDS, MOP) that exposes long-horizon failure modes hidden by pass@1.
- Large-scale study (396 tasks, 23,392 episodes) shows universal reliability decay and rank inversions at long horizons.
- Finds a sharp, actionable result: memory-augmented ReAct never improves long-horizon GDS and often hurts.
- Skepticism / limitation: duration buckets use estimated human time (imperfect proxy); only 10 open-weight models and 3 domains.
2) WybeCoder: Verified Imperative Code Generation
- Demonstrates agentic co-evolution of imperative code + invariants + proofs with SMT + Lean.
- Reports strong solve rates on translated imperative benchmarks (e.g., 74.1% Verina-Loom, 62.1% Clever-Loom) and a large verified Heapsort artifact.
- Multi-agent VC subgoal decomposition + proof transfer via deterministic naming is a concrete scaling recipe.
- Skepticism / limitation: Loom/Velvet pipeline is experimental; managed-memory target; some manual spec/decomposition; open models lag.
3) SemLoc: Structured Grounding of Free-Form LLM Reasoning for Fault Localization
- Converts LLM “semantic intent” into executable constraints anchored in SSA, enabling spectrum-style scoring across tests.
- Big localization gains vs SBFL (e.g., Acc@1 42.8% vs 6.4%, and far fewer suspicious lines).
- Counterfactual patching step materially improves Acc@1 (ablation shows ~12pp drop without it).
- Skepticism / limitation: high constraint waste (many never trigger / over-approximate); dataset is single-fault, small programs; repo-scale setup issues.
4) BenchScope: How Many Independent Signals Does Your Benchmark Provide?
- Provides a fast diagnostic (Effective Dimensionality) to detect redundant benchmark suites and fragile composites.
- Empirically shows major suites can collapse to ~1–2 effective axes (e.g., Open LLM Leaderboard ≈1.7).
- Adds practical maintainer workflow (nulls, saturation, split-half reliability, ED-greedy selection).
- Skepticism / limitation: ED is population-conditional; binary SVD overestimates dimensionality (needs corrections).
5) Adversarial Prompt Injection Attack on Multimodal Large Language Models
- Shows a stealthy, expressive multimodal injection (covert text trigger + ℓ∞ perturbation) with high black-box ASR on commercial MLLMs.
- Dual-target alignment (text + iteratively updated target image) is empirically critical (ablation: large ASR drop without it).
- Directly relevant to agentic deployments where images are untrusted inputs.
- Skepticism / limitation: limited tasks (captioning/VQA) and budgets; no human perceptual study reported.
5) Practical next steps
- Adopt trajectory-level evaluation in your agent stack: run k-repeat episodes and compute reliability decay by task duration; log tool-call entropy to detect meltdowns (MOP-style) and correlate with failures.
- Add near-miss auditing for any tool-using agent: for each mutating action, verify the required read-only evidence exists earlier in the trace (guard-code replay + history search).
- Harden LLM-as-judge pipelines: treat judges as attack targets; use constrained schemas, ensemble/committee checks where feasible, and track judge drift/stability (prompt perturbation tests).
- Prefer structured intermediates over free-form reasoning: require JSON/IR outputs that can be executed/checked (constraints, tool plans, formal steps), and discard malformed/ungrounded outputs.
- Be cautious with “memory augmentation”: test whether your memory scaffold improves long-horizon GDS (partial credit) rather than just pass@1; consider layered memory with drift regularization rather than naive episodic scratchpads.
- For multimodal agents, assume untrusted images: evaluate against covert prompt injection; add system-level defenses (plan/policy separation, structured validators) rather than relying on prompt instructions alone.
- Audit your benchmark suite for redundancy before optimizing: compute effective dimensionality and run split-half/permutation null checks to ensure you’re not overfitting to a single latent axis.
- If training multi-candidate generators, consider reward allocation that avoids free-riding (candidate-level credit assignment) rather than broadcasting a set-level scalar to all candidates.
Generated from per-paper analyses; no external browsing.
