Daily AI Paper Report (2026-04-05)
Run stats
- Candidates: 2272
- Selected: 30
- Deepread completed: 30
- Window (UTC): 2026-04-03T00:00:00Z → 2026-04-04T00:00:00Z (weekend_backlog_unknown, expanded=0)
Selected papers
| arXiv ID | Title / Links | Categories | Score | Why | Tags |
|---|---|---|---|---|---|
| 2604.00419 | G-Drift MIA: Membership Inference via Gradient-Induced Feature Drift in LLMs | cs.LG, cs.AI | 93 | White-box LLM membership inference via gradient-induced representation drift; stronger privacy auditing signal. | privacy, membership-inference, LLM-security, white-box, auditing |
| 2604.00430 | Secure Forgetting: A Framework for Privacy-Driven Unlearning in Large Language Model (LLM)-Based Agents | cs.MA, cs.CR | 93 | Framework for privacy-driven unlearning in LLM agents; timely for deployed agent safety/privacy. | LLM-agents, unlearning, privacy, data-deletion, security, governance |
| 2604.01161 | Reasoning Shift: How Context Silently Shortens LLM Reasoning | cs.LG | 92 | Finds context can silently shorten reasoning traces; robustness issue for test-time scaling models | llm-reasoning, robustness, test-time-scaling, long-context, evaluation |
| 2604.01147 | SERSEM: Selective Entropy-Weighted Scoring for Membership Inference in Code Language Models | cs.SE, cs.CR | 92 | Stronger membership inference for code LMs; practical contamination/privacy auditing via AST-weighted signals | privacy, membership-inference, data-contamination, code-llms, security, memorization |
| 2604.00860 | Policy Improvement Reinforcement Learning | cs.LG | 92 | Adds inter-iteration “did policy improve?” feedback to RLVR; targets drift/collapse in LLM post-training | RLVR, post-training, policy-improvement, reasoning, stability, verification |
| 2604.01014 | AutoMIA: Improved Baselines for Membership Inference Attack via Agentic Self-Exploration | cs.CR, cs.CV | 90 | Agentic self-exploration to auto-discover stronger MIA strategies; reusable auditing framework. | privacy, membership-inference, agents, red-teaming, automation |
| 2604.00478 | The Silicon Mirror: Dynamic Behavioral Gating for Anti-Sycophancy in LLM Agents | cs.AI | 90 | Agent anti-sycophancy gating + auditor veto loop; concrete eval on TruthfulQA adversarial dialogs | llm-agents, sycophancy, guardrails, behavioral-gating, oversight, evaluation |
| 2604.00938 | WARP: Guaranteed Inner-Layer Repair of NLP Transformers | cs.LG, cs.AI | 90 | Provable inner-layer transformer repair against adversarial perturbations; bridges robustness+verification. | robustness, adversarial, transformers, formal-methods, model-repair, verification |
| 2603.28101 | Heddle: A Distributed Orchestration System for Agentic RL Rollout | cs.LG | 90 | Trajectory-centric orchestration for agentic RL rollouts; tackles long-tail tool-call bottlenecks | agentic-RL, LLM-agents, systems, tool-use, scaling, scheduling |
| 2604.01128 | Paper Reconstruction Evaluation: Evaluating Presentation and Hallucination in AI-written Papers | cs.CL, cs.AI, cs.LG | 90 | New framework to measure hallucination/presentation risks in agent-written papers; timely eval methodology | evaluation, hallucinations, agent-reliability, scientific-writing, benchmarks, auditing |
| 2604.00555 | Ontology-Constrained Neural Reasoning in Enterprise Agentic Systems: A Neurosymbolic Architecture for Domain-Grounded AI Agents | cs.AI, cs.CL, cs.SE | 89 | Ontology-constrained neurosymbolic agent architecture to reduce hallucination and enforce compliance | agents, governance, neurosymbolic, hallucinations, enterprise, compliance |
| 2604.00442 | Execution-Verified Reinforcement Learning for Optimization Modeling | cs.AI, cs.CL | 89 | Execution-verified RL with sandboxed solver as verifier for NL→optimization code; reusable agentic training recipe | agents, tool-use, RLVR, verifiable-rewards, code-generation, sandboxing, optimization |
| 2603.28309 | VulnScout-C: A Lightweight Transformer for C Code Vulnerability Detection | cs.CR | 88 | Lightweight transformer + curated dataset for C vuln detection; practical secure-dev impact and benchmark value. | code-security, vulnerability-detection, dataset, efficient-LLM, software-security |
| 2603.28386 | COvolve: Adversarial Co-Evolution of Large-Language-Model-Generated Policies and Environments via Two-Player Zero-Sum Game | cs.AI | 88 | Adversarial co-evolution of LLM-generated envs/policies for automated curricula; strong agent generalization. | agents, curriculum, adversarial-training, LLM-codegen, evaluation, continual-learning |
| 2604.01221 | HippoCamp: Benchmarking Contextual Agents on Personal Computers | cs.AI, cs.CV | 88 | Realistic benchmark for contextual PC/file agents with large multimodal data + dense trajectories for analysis | agent-benchmark, tool-use, context, multimodal, evaluation, personal-data |
| 2603.29328 | Beyond Corner Patches: Semantics-Aware Backdoor Attack in Federated Learning | cs.CR, cs.AI, cs.CV, cs.DC, cs.LG | 86 | Realistic semantic in-distribution backdoors in federated learning; stronger threat model than patch triggers. | backdoors, federated-learning, adversarial-ML, security-eval, robustness |
| 2603.28281 | Corruption-robust Offline Multi-agent Reinforcement Learning From Human Feedback | cs.LG | 86 | Provable robustness to ε-fraction corrupted preference data in offline multi-agent RLHF setting | rlhf, offline-rl, multi-agent, robustness, data-corruption, theory, alignment |
| 2603.28673 | FL-PBM: Pre-Training Backdoor Mitigation for Federated Learning | cs.LG, cs.CR, cs.DC | 86 | Client-side pre-training data filtering to mitigate FL backdoors; practical security angle | backdoors, federated-learning, data-poisoning, ml-security, defenses |
| 2603.08269 | SAIL: Test-Time Scaling for In-Context Imitation Learning with VLM | cs.RO, cs.AI | 86 | Test-time compute scaling (MCTS) for VLM imitation; reusable recipe for robust robot agents | agents, robotics, VLM, test-time-scaling, MCTS, imitation-learning, planning |
| 2603.21656 | TrustFed: Enabling Trustworthy Medical AI under Data Privacy Constraints | cs.LG, cs.CY | 86 | Federated uncertainty quantification with distribution-free finite-sample coverage under heterogeneity | uncertainty-quantification, federated-learning, reliability, privacy, healthcare, conformal |
| 2603.29399 | ELT-Bench-Verified: Benchmark Quality Issues Underestimate AI Agent Capabilities | cs.AI, cs.DB | 86 | Auditor-Corrector finds benchmark flaws; shows agent capability underestimation and improves evaluation rigor | agent-evaluation, benchmarks, data-quality, auditing, data-engineering, ELT |
| 2603.28589 | Towards a Medical AI Scientist | cs.AI, cs.LG | 86 | Autonomous “AI Scientist” tailored to clinical research with evidence grounding and traceability. | agents, autonomous-research, medical, evidence-grounding, traceability, LLM |
| 2603.29199 | AEC-Bench: A Multimodal Benchmark for Agentic Systems in Architecture, Engineering, and Construction | cs.AI | 85 | Open benchmark for multimodal agentic AEC tasks; includes harness techniques and baselines | agent-evaluation, benchmarks, multimodal, tool-use, real-world-tasks |
| 2603.29182 | Dummy-Aware Weighted Attack (DAWA): Breaking the Safe Sink in Dummy Class Defenses | cs.LG, cs.CR | 84 | Shows dummy-class defenses fool AutoAttack; proposes DAWA for proper robustness evaluation. | adversarial-robustness, evaluation, attack-methods, security, benchmarks |
| 2603.29410 | AGFT: Alignment-Guided Fine-Tuning for Zero-Shot Adversarial Robustness of Vision-Language Models | cs.CV, cs.AI, cs.LG | 84 | Adversarially robust VLM fine-tuning while preserving cross-modal alignment and zero-shot performance. | vision-language, adversarial-robustness, alignment-preservation, fine-tuning, zero-shot |
| 2603.22904 | Separating Diagnosis from Control: Auditable Policy Adaptation in Agent-Based Simulations with LLM-Based Diagnostics | cs.AI | 84 | Separates LLM diagnosis from deterministic control for auditability in adaptive policy interventions | auditability, LLM, governance, agent-based-simulation, control, safety |
| 2604.00706 | AfrIFact: Cultural Information Retrieval, Evidence Extraction and Fact Checking for African Languages | cs.CL | 84 | Fact-checking dataset for 10 African languages; exposes retrieval gaps and supports grounded verification | fact-checking, retrieval, low-resource, multilingual, grounding, misinformation |
| 2603.09331 | Reward-Zero: Language Embedding Driven Implicit Reward Mechanisms for Reinforcement Learning | cs.LG | 84 | Language-embedding implicit rewards for RL; could impact instruction-following agents & reward hacking analysis. | reinforcement-learning, language-reward, implicit-reward, agents, specification-gaming |
| 2604.00835 | Agentic Tool Use in Large Language Models | cs.CL | 84 | Structured survey of LLM tool-use paradigms, failure modes, and eval gaps; useful map for agent safety work | tool-use, agents, survey, evaluation, failure-modes, LLM-systems |
| 2603.11709 | Scaling Laws for Educational AI Agents | cs.AI | 84 | Proposes “agent scaling law” + structured AgentProfile; relevant to evaluating/engineering agent capability growth. | agents, scaling-laws, agent-evaluation, multi-agent, education, specifications |
AI Paper Insight Brief
2026-04-05
1) Executive takeaways (read this first)
- “Test-time scaling” is moving from text to embodied control: SAIL shows large success gains by spending more inference compute via MCTS over continuous trajectories (25%→73% avg success with 45 nodes), then distilling to recover latency.
- Verification is becoming the dominant training/eval primitive across domains: solvers as verifiers (EVOM), simulators/digital twins as verifiers (SAIL), benchmark verifiers for agents (AEC-Bench, HippoCamp), and auditing pipelines that correct benchmarks themselves (ELT-Bench-Verified).
- Federated learning security is in an arms race: a proactive client-side mitigation (FL-PBM) can drive ASR near 0–5% on tested traffic-sign setups, while a more realistic semantics-aware attack (SABLE) achieves high ASR even under robust aggregators—suggesting patch-trigger evaluations are no longer representative.
- Privacy auditing is shifting to stronger threat models and automation: white-box gradient “feature drift” MIA (G-Drift) reports near-ceiling AUCs on Q&A benchmarks; AutoMIA uses an agent to discover logits-level MIAs across VLMs; SERSEM shows code-specific MIAs need structure-aware weighting to beat generic baselines.
- Robustness evaluation itself is under attack: DAWA demonstrates that standard adversarial objectives can overestimate robustness for dummy-class defenses (e.g., 58.61%→29.52% robust acc on CIFAR-10 for a leading defense).
- Long context can silently reduce deliberation: Reasoning Shift finds up to ~50% shorter reasoning traces under irrelevant long prefixes / multi-turn / bundled subtasks, with reduced self-verification—important for agent pipelines that embed subtasks in long histories.
2) Key themes (clusters)
Theme: Test-time compute & search for robustness (embodied + agents)
- Why it matters: Robustness can be bought at inference time by turning one-shot generation into iterative search with scoring, then optionally distilling for deployment latency.
- Representative papers: SAIL (2603.08269); COvolve (2603.28386)
- Common approach:
- Replace single-pass generation with iterative refinement (MCTS over trajectories; PSRO-style population updates).
- Use learned/LLM/VLM scoring signals to guide search (per-frame progress scoring; payoff matrices + MSNE).
- Maintain archives/populations to prevent regressions (successful rollout archive; environment/policy populations).
- Open questions / failure modes:
- High test-time cost (SAIL MCTS ~645s vs ~72s distilled) and dependence on simulators/digital twins.
- Curriculum generation can become infeasible/over-hard without domain constraints (COvolve relies on helper functions/feasibility checks).
- How to standardize “compute budgets” for fair comparisons across methods.
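The search loop this theme describes can be sketched as a budgeted beam search (a simplification of full MCTS) over trajectories. Everything here is a hypothetical stand-in: `propose` for the action sampler and `score` for the learned/VLM progress scorer; the point is only the shape of the "expand under a node budget, keep the best candidates" loop.

```python
import random

def propose(trajectory, n_children=3):
    # Hypothetical proposal step: extend a trajectory with candidate next states.
    base = trajectory[-1] if trajectory else 0.0
    return [trajectory + [base + random.gauss(0.0, 0.3)] for _ in range(n_children)]

def score(trajectory, target=1.0):
    # Stand-in for a learned scorer: trajectories ending closer to the target score higher.
    return -abs((trajectory[-1] if trajectory else 0.0) - target)

def trajectory_search(node_budget=45, beam_width=5, horizon=10):
    """Beam search over trajectories under a fixed node (expansion) budget."""
    beam = [[]]
    expanded = 0
    for _ in range(horizon):
        candidates = []
        for traj in beam:
            if expanded >= node_budget:  # the test-time compute knob
                break
            candidates.extend(propose(traj))
            expanded += 1
        if not candidates:
            break
        candidates.sort(key=score, reverse=True)
        beam = candidates[:beam_width]
    return max(beam, key=score)
```

The `node_budget` parameter is the compute knob being scaled in these papers; distillation then trains a single-pass policy on the search's winning trajectories.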
Theme: Verifiers everywhere (execution-, simulator-, and harness-verified learning/eval)
- Why it matters: When rewards are sparse or correctness is hard to label, external verifiers (solvers, simulators, deterministic scripts) provide scalable supervision and more reliable evaluation.
- Representative papers: EVOM (2604.00442); SAIL (2603.08269); AEC-Bench (2603.29199); HippoCamp (2604.01221); ELT-Bench-Verified (2603.29399)
- Common approach:
- Execute candidate outputs in a sandbox and score outcomes (solver status/objective; simulator rollout success).
- Use structured task formats + automated verifiers to reduce subjective judging (JSONL outputs; harness scoring).
- Add step-wise annotations/trajectories to diagnose where systems fail (HippoCamp’s structured trajectories).
- Open questions / failure modes:
- Verifier brittleness and benchmark artifacts can dominate measured performance (motivating ELT-Bench-Verified).
- Execution feedback may not fix deep semantic errors (EVOM notes constraint errors remain a key residual).
- Multimodal grounding remains a bottleneck even with better parsing/retrieval tools (AEC-Bench, HippoCamp).
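A minimal outcome-only verifier in this spirit runs the candidate in a subprocess and returns a binary reward. This is a sketch, not any paper's harness: a production setup would add real sandboxing, resource limits, and solver- or simulator-specific scoring rather than a bare assertion.

```python
import os
import subprocess
import sys
import tempfile

def execution_reward(candidate_code: str, check_expr: str, timeout_s: float = 5.0) -> float:
    """Outcome-only reward: 1.0 if the candidate runs and the check passes, else 0.0."""
    program = candidate_code + f"\nassert {check_expr}\n"
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        # A bare subprocess is NOT a sandbox; it only isolates crashes and timeouts.
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=timeout_s)
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0
    finally:
        os.unlink(path)
```

For example, `execution_reward("x = 2 + 2", "x == 4")` returns 1.0, while a wrong or crashing candidate returns 0.0; the deterministic pass/fail signal is what makes the reward verifiable.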
Theme: Federated uncertainty & security under realistic heterogeneity
- Why it matters: Clinical FL needs valid uncertainty under heterogeneity; FL security needs defenses that hold against in-distribution semantic triggers and adaptive attackers.
- Representative papers: TrustFed (2603.21656); SABLE (2603.29328); FL-PBM (2603.28673)
- Common approach:
- Use representation space to cope with heterogeneity (TrustFed routes by embedding distance; SABLE separates features).
- Minimize shared information to preserve privacy (TrustFed exchanges scalar distances/thresholds).
- Evaluate across multiple aggregators / partitions (SABLE tests FedAvg, Trimmed Mean, MultiKrum, FLAME, FilterFL).
- Open questions / failure modes:
- Defense/attack mismatch: client-side sanitization (FL-PBM) vs semantic triggers that evade outlier filtering (SABLE).
- TrustFed’s neighborhood size selection is empirical; extensions beyond classification and single-modality are open.
- Operational assumptions (e.g., FL-PBM assumes a trusted execution environment on clients).
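The TrustFed-style recipe (per-client conformal thresholds, scalar-only exchange, conservative aggregation) can be sketched as follows. The split-conformal quantile rule is standard; the max-aggregation mirrors the brief's description, but the exact protocol details are assumptions for illustration.

```python
import math

def client_threshold(scores, alpha=0.1):
    """Per-client split-conformal threshold: the ceil((n+1)(1-alpha))-th order
    statistic of nonconformity scores (e.g. 1 - softmax prob of the true class)."""
    n = len(scores)
    k = min(n - 1, math.ceil((n + 1) * (1 - alpha)) - 1)
    return sorted(scores)[k]

def federated_threshold(per_client_scores, alpha=0.1):
    """Max-aggregation over clients' scalar thresholds: conservative coverage,
    and only a single scalar ever leaves each client."""
    return max(client_threshold(s, alpha) for s in per_client_scores)

def prediction_set(class_scores, tau):
    """Include every class whose nonconformity score falls within the threshold."""
    return [c for c, s in enumerate(class_scores) if s <= tau]
```

Sweeping `alpha` (or, in the paper's setup, the neighborhood size used for representation-aware routing) traces the coverage-versus-set-size frontier mentioned in the next steps.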
Theme: Privacy auditing & membership inference is diversifying (white-box, agentic, domain-specific)
- Why it matters: Auditing training-data leakage is becoming more powerful (white-box drift signals) and more scalable (agent-discovered attacks), but also more domain-tailored (code-specific signals).
- Representative papers: G-Drift (2604.00419); AutoMIA (2604.01014); SERSEM (2604.01147)
- Common approach:
- Go beyond output likelihoods: use gradients/representation drift (G-Drift) or internal probes (SERSEM).
- Automate metric discovery with closed-loop evaluation (AutoMIA strategy library + guidance agent).
- Tailor signals to domain structure (SERSEM downweights boilerplate, upweights identifiers/strings/lint anomalies).
- Open questions / failure modes:
- Threat-model constraints: G-Drift is white-box; AutoMIA is grey-box (logits/tokenizer access).
- Robustness to defenses like differential privacy (G-Drift expects DP to weaken separability).
- Generalization beyond benchmark splits and modalities.
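Whatever the signal (negative loss, a gradient/feature-drift statistic, a structure-weighted code score), the audit ultimately reduces to ranking members against non-members. A dependency-free AUC over per-example signals, as a sketch of the common evaluation step:

```python
def mia_auc(member_signals, nonmember_signals):
    """AUC of a score-based membership test: the probability that a random
    member scores higher than a random non-member (higher = 'more member-like').
    Plug in whichever per-example signal the attack produces."""
    wins = ties = 0
    for m in member_signals:
        for n in nonmember_signals:
            if m > n:
                wins += 1
            elif m == n:
                ties += 1
    total = len(member_signals) * len(nonmember_signals)
    return (wins + 0.5 * ties) / total
```

The O(n*m) pairwise form is fine for audit-sized sets and makes the tie-handling explicit; an uninformative signal yields 0.5, a perfectly separating one yields 1.0.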
Theme: Robustness evaluation pitfalls & repair with guarantees
- Why it matters: Some defenses exploit evaluation objectives (dummy-class “safe sink”), while post-hoc repair needs guarantees to avoid breaking previously-correct behavior.
- Representative papers: DAWA (2603.29182); WARP (2604.00938); AGFT (2603.29410)
- Common approach:
- Make objectives match the real success criterion (DAWA targets both true and paired dummy logits).
- Constrain updates to preserve prior behavior (WARP’s remain-set constraints; AGFT preserves CLIP alignment via soft targets).
- Provide stronger guarantees or broader generalization claims (WARP certificates; AGFT across 15 datasets).
- Open questions / failure modes:
- DAWA evaluated on CIFAR ℓ∞ only; broader threat models not shown.
- WARP relies on first-order linearization and conservative Lipschitz certificates.
- AGFT mainly targets zero-shot classification under ℓ∞; broader multimodal tasks/threats remain open.
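A DAWA-style objective can be illustrated with a loss that treats probability mass on the paired dummy class the same as mass on the true class, so an attack maximizing it no longer counts the "safe sink" as success. The exact weighting in the paper may differ; `dummy_of` is an illustrative mapping from each authentic class to its dummy.

```python
import math

def dummy_aware_loss(logits, true_class, dummy_of, w=1.0):
    """Attack objective: drive probability mass away from BOTH the authentic
    class and its paired dummy class (the defense's 'safe sink')."""
    z = [math.exp(v) for v in logits]
    total = sum(z)
    p_true = z[true_class] / total
    p_dummy = z[dummy_of[true_class]] / total
    # Large only when mass sits on some *other* class, i.e. a real misclassification.
    return -math.log(p_true + w * p_dummy + 1e-12)
```

Contrast with plain cross-entropy: when the model dumps mass on the dummy class, `-log(p_true)` is large (attack "succeeds") but the defense still classifies correctly; the dummy-aware loss stays near zero in that case, matching the real success criterion.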
Theme: Benchmarking & auditing agent capability in real workflows
- Why it matters: Reported agent performance can be dominated by harness design, verifier brittleness, and benchmark errors; auditing benchmarks can materially change conclusions.
- Representative papers: ELT-Bench-Verified (2603.29399); HippoCamp (2604.01221); PaperRecon (2604.01128); AEC-Bench (2603.29199)
- Common approach:
- Use automated verifiers + structured outputs to score end-to-end tasks.
- Add diagnostic layers: per-column audits (ELT), step-wise evidence units (HippoCamp), claim-level hallucination checks (PaperRecon).
- Compare harness/tooling variants to isolate bottlenecks (AEC-Bench H vs H+ parsing tools).
- Open questions / failure modes:
- LLM-as-judge confounds and cost (ELT-Bench-Verified notes multi-day, hundreds-$ runs; PaperRecon uses GPT-5.4 judges).
- Multimodal spatial grounding remains weak even when retrieval improves (AEC-Bench).
- Presentation quality can trade off with factuality (PaperRecon: higher rubric but more major contradictions for some agents).
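A deterministic JSONL harness in this spirit takes only a few lines. The field names (`task_id`, `answer`) are illustrative, not drawn from any of these benchmarks, and malformed outputs score zero by design since the verifier is part of the contract:

```python
import json

def score_jsonl(output_lines, gold):
    """Deterministic harness scoring: parse structured JSONL answers and
    compare against gold by task id; unparseable lines count as failures."""
    correct = 0
    for line in output_lines:
        try:
            rec = json.loads(line)
            if gold.get(rec["task_id"]) == rec["answer"]:
                correct += 1
        except (json.JSONDecodeError, KeyError):
            continue  # malformed output scores zero, by design
    return correct / max(len(gold), 1)
```

Treating this script as "part of the model" (as ELT-Bench-Verified suggests) means auditing it too: a bug here shifts every reported number.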
3) Technical synthesis
- Search + scoring is converging across modalities: SAIL uses MCTS with VLM-derived per-frame progress rewards; COvolve uses payoff matrices + MSNE to stabilize across an archive—both are “population/search over candidates guided by learned evaluators.”
- Representation space is the new routing layer: TrustFed assigns test samples to clients via embedding distances; SABLE explicitly separates triggered vs clean features; both treat embeddings as the operational interface under privacy/robustness constraints.
- Outcome-verifiable RL is spreading beyond math: EVOM uses solver execution as reward; similar “verifier loops” appear in SAIL (digital twin execution) and in benchmark harnesses (AEC-Bench/HippoCamp).
- Benchmark correctness is now a first-class variable: ELT-Bench-Verified shows 33% of column mismatches were benchmark-attributable and fixes raise SRDT 22.66%→32.51%, implying many “agent failures” are evaluation failures.
- Adversarial robustness needs defense-aware objectives: DAWA shows that if the attack objective doesn’t match the defense mechanism (dummy sink), robustness is overstated; this parallels broader concerns about objective mismatch in evaluations.
- Alignment/robustness fine-tuning is shifting to “structure-preserving” targets: AGFT uses pre-trained soft image→text distributions (plus calibration) to preserve CLIP alignment while improving robustness.
- Agent safety is increasingly “behavioral control-plane”: Silicon Mirror uses risk-based gating + generator-critic rewrites; Secure Forgetting uses a conversion model to generate unlearning prompts and memory edits—both are orchestration-level controls without changing base weights.
- Long-context pipelines may reduce safety margins: Reasoning Shift suggests that adding irrelevant context can reduce self-verification behavior, which can interact badly with agent systems that accumulate long histories.
- Systems bottlenecks are becoming infrastructure bottlenecks: Heddle shows rollout throughput can improve up to 2.5× via trajectory-centric scheduling/placement/resource allocation—critical if verifier-heavy loops become standard.
4) Top 5 papers (with “why now”)
1) SAIL: Test-Time Scaling for In-Context Imitation Learning with VLM
- Turns brittle one-shot VLM trajectory prediction into MCTS over full trajectories with VLM scoring + step-level feedback.
- Strong scaling with compute: 25%→73% avg success from 1 rollout to 45 nodes; real-world BlockIntoBowl 5/6 success.
- Distillation cuts execution time 644.72s→72.306s, making “think longer then compress” practical.
- Skepticism: depends on a trial-matched simulator/digital twin; sim-to-real gaps (pose/contact) still cause failures.
2) TrustFed: Enabling Trustworthy Medical AI under Data Privacy Constraints
- Practical federated conformal prediction with representation-aware client assignment and max-aggregated thresholds.
- Evaluated at scale (>430k images, six modalities) with empirical coverage close to nominal under heterogeneity/imbalance.
- Exchanges only scalar distances and thresholds, aligning with privacy constraints.
- Skepticism: neighborhood size selection is empirical; limited to classification and single-modality tasks.
3) Execution-Verified Reinforcement Learning for Optimization Modeling
- Uses solvers as deterministic verifiers for outcome-only RL (no process traces), with GRPO/DAPO updates.
- Matches/slightly beats process-SFT on some benchmarks (e.g., OptiBench 62.95% vs 60.96%) and shows zero-shot solver transfer advantages.
- Provides a concrete cold-start recipe (small SFT via cross-solver translation, then execution-RL).
- Skepticism: constraint/semantic errors remain a major residual failure mode; requires execution harness infrastructure.
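The group-based update EVOM relies on reduces, at its core, to GRPO-style normalization of outcome rewards within each sampled group; a minimal sketch of just that advantage computation (the surrounding PPO-style policy update is omitted):

```python
def group_advantages(rewards, eps=1e-8):
    """GRPO-style group-relative advantage: normalize each rollout's outcome
    reward by the mean/std of its sampled group, so a binary solver-verified
    reward (solved / not solved) becomes a signed per-rollout learning signal."""
    mu = sum(rewards) / len(rewards)
    var = sum((r - mu) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mu) / (std + eps) for r in rewards]
```

With a binary verifier reward, a group of [1, 1, 0, 0] yields symmetric positive/negative advantages; a group that is all-solved or all-failed yields near-zero advantages, which is why reward diversity within groups matters for these recipes.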
4) DAWA: Breaking the Safe Sink in Dummy Class Defenses
- Shows a systematic evaluation flaw: standard attacks fall into dummy “safe sink,” overstating robustness.
- DAWA’s objective targets both authentic and dummy logits; drops robust acc sharply (e.g., 58.61%→29.52% on CIFAR-10 for PGD-AT+DUCAT).
- Computationally efficient and easy to integrate into evaluation suites.
- Skepticism: demonstrated on CIFAR-10/100 under ℓ∞; broader datasets/threat models not shown.
5) Reasoning Shift: How Context Silently Shortens LLM Reasoning
- Finds up to ~50% shorter reasoning traces when the same problem is embedded in long irrelevant context / multi-turn / bundled subtasks.
- Sentence-level analysis suggests reduced self-verification and higher probability of stopping after first answer.
- Directly relevant to long-context agents and tool-using systems that embed subtasks in large histories.
- Skepticism: contexts are synthetic (e.g., long Shakespeare prefix) and deep trace analysis is focused on one model.
5) Practical next steps
- For embodied agents: prototype a “trajectory search + learned scorer” loop (MCTS/beam) and measure success vs node budget; then test distillation to recover latency (SAIL-style).
- For verifier-based RL: if you have a deterministic checker (solver, compiler, simulator), implement outcome-only RL with group-based updates and track which error types remain (EVOM highlights constraint errors).
- For FL deployments: evaluate backdoor defenses against semantic, in-distribution triggers (SABLE-style), not just corner patches; separately test client-side sanitization (FL-PBM) under adaptive attackers.
- For uncertainty in federated medical ML: add representation-aware routing + conservative threshold aggregation (TrustFed) and sweep neighborhood size to map the coverage–set-size frontier.
- For privacy audits: run at least one white-box MIA (G-Drift) where possible, and one automated strategy search (AutoMIA) in grey-box settings; compare against domain-specific MIAs for code (SERSEM) if auditing code models.
- For robustness evaluation: if using dummy-class defenses, incorporate dummy-aware success criteria and DAWA-like losses; otherwise you may be benchmarking the sink, not robustness.
- For long-context agents: add instrumentation to log “reasoning token budget used” and self-check behaviors across context lengths; test whether context compaction or subtask isolation restores verification (motivated by Reasoning Shift).
- For benchmarks/harnesses: budget time for benchmark auditing—ELT-Bench-Verified shows corrections can shift conclusions materially; treat verifier scripts as part of the model.
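For the long-context instrumentation point, a minimal per-run record might look like the following. The self-check marker list is an illustrative heuristic, not taken from the Reasoning Shift paper, and whitespace tokenization is a rough proxy for the model's actual token count:

```python
def reasoning_stats(trace: str, context_tokens: int) -> dict:
    """One log record per run: approximate reasoning length and count
    self-verification markers, to plot both against context length."""
    markers = ("let me verify", "double-check", "wait,", "check:")  # heuristic
    lower = trace.lower()
    return {
        "context_tokens": context_tokens,
        "reasoning_tokens": len(trace.split()),  # proxy; use the real tokenizer if available
        "self_checks": sum(lower.count(m) for m in markers),
    }
```

Logging these two curves across context lengths is enough to reproduce the paper's headline observation locally: if `reasoning_tokens` and `self_checks` both fall as `context_tokens` grows, your pipeline is exhibiting the shift.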
Generated from per-paper analyses; no external browsing.
