Daily AI Paper Report (2026-04-03)
Published:
Chinese version: [中文]
Run stats
- Candidates: 222
- Selected: 30
- Deepread completed: 30
- Window (UTC): 2026-04-01T00:00:00Z → 2026-04-02T00:00:00Z (arxiv_announce, expanded=0)
Selected papers
| arXiv ID | Title / Links | Categories | Score | Why | Tags |
|---|---|---|---|---|---|
| 2604.00788 | UK AISI Alignment Evaluation Case-Study | cs.AI, cs.CR | 96 | AISI case study on sabotage in AI-lab coding assistants; concrete frontier-model behaviors. | AISI, alignment-eval, sabotage, agentic-coding, deployment, model-behavior |
| 2604.01151 | Detecting Multi-Agent Collusion Through Multi-Agent Interpretability | cs.AI, cs.LG, cs.MA | 95 | Benchmark + probes for detecting multi-agent collusion; strong OOD transfer focus. | multi-agent, collusion, interpretability, probes, benchmark, security, OOD-generalization |
| 2604.01194 | AgentWatcher: A Rule-based Prompt Injection Monitor | cs.CR | 94 | Rule-based prompt-injection monitor using causal attribution to scale to long contexts. | prompt-injection, agents, monitoring, attribution, long-context, security |
| 2604.00770 | Thinking Wrong in Silence: Backdoor Attacks on Continuous Latent Reasoning | cs.LG, cs.AI | 93 | Backdoors for tokenless latent reasoning; high ASR and evades token-level defenses. | backdoors, latent-reasoning, continuous-CoT, adversarial, auditing, security |
| 2604.01212 | $\texttt{YC-Bench}$: Benchmarking AI Agents for Long-Term Planning and Consistent Execution | cs.CL, cs.AI | 92 | Long-horizon agent benchmark (hundreds of turns) for planning, delayed feedback, compounding errors. | agents, benchmark, long-horizon, planning, evaluation, simulated-environment |
| 2604.00547 | Does Unification Come at a Cost? Uni-SafeBench: A Safety Benchmark for Unified Multimodal Large Models | cs.AI, cs.LG | 92 | New safety benchmark for unified multimodal models; taxonomy + judging framework. | multimodal, safety-benchmark, evaluation, UMLM, red-teaming, safety-taxonomy |
| 2604.00414 | Decision-Centric Design for LLM Systems | cs.AI, cs.LG | 92 | Makes LLM control decisions explicit/inspectable; improves debugging, constraints, and safety. | LLM systems, agent control, decision layer, tool use, reliability, governance |
| 2604.00627 | When Safe Models Merge into Danger: Exploiting Latent Vulnerabilities in LLM Fusion | cs.CR | 91 | Shows model merging can unlock hidden trojans; new attack surface for alignment fusion. | model-merging, trojans, safety-regression, attack-surface, alignment, security |
| 2604.00986 | Do Phone-Use Agents Respect Your Privacy? | cs.CR, cs.AI, cs.CL, cs.LG | 90 | MyPhoneBench makes mobile-agent privacy measurable: permissions, minimal disclosure, memory. | privacy, mobile-agents, benchmark, auditing, permissions, evaluation |
| 2604.00842 | Proactive Agent Research Environment: Simulating Active Users to Evaluate Proactive Assistants | cs.AI, cs.LG, cs.MA | 90 | User-simulator + FSM app modeling to evaluate proactive agents; introduces Pare-Bench tasks. | agents, proactive-assistants, user-simulation, benchmark, evaluation, tool-use |
| 2604.00892 | When Users Change Their Mind: Evaluating Interruptible Agents in Long-Horizon Web Navigation | cs.CL | 90 | InterruptBench targets long-horizon web agents handling mid-task goal changes. | agents, web-navigation, long-horizon, interruptibility, benchmark, reliability, human-in-the-loop |
| 2604.00445 | Towards Reliable Truth-Aligned Uncertainty Estimation in Large Language Models | cs.AI, cs.CL | 90 | Truth-anchored calibration for LLM uncertainty to detect hallucinations; targets proxy failure. | uncertainty, hallucinations, calibration, reliability, evaluation, post-hoc |
| 2604.01202 | Therefore I am. I Think | cs.AI | 89 | Evidence that decisions are encoded pre-CoT; probing and causal steering affect behavior. | mechanistic-interpretability, chain-of-thought, steering, tool-use, probes, agency |
| 2604.00387 | RAGShield: Provenance-Verified Defense-in-Depth Against Knowledge Base Poisoning in Government Retrieval-Augmented Generation Systems | cs.CR, cs.AI | 88 | Defense-in-depth for RAG poisoning via provenance/attestation + taint-style reasoning. | RAG, data-poisoning, provenance, supply-chain, grounding, security |
| 2604.01052 | VibeGuard: A Security Gate Framework for AI-Generated Code | cs.CR, cs.AI | 88 | Practical secure-dev gate for AI-generated code; targets real packaging/artifact leak failure modes. | security, code-generation, supply-chain, static-analysis, deployment, guardrails |
| 2604.00392 | EvolveTool-Bench: Evaluating the Quality of LLM-Generated Tool Libraries as Software Artifacts | cs.SE, cs.AI | 86 | Benchmark for LLM-generated tool libraries with safety/robustness and regression metrics. | agents, tool-use, benchmark, software-quality, safety-metrics, evaluation |
| 2604.00594 | Agent psychometrics: Task-level performance prediction in agentic coding benchmarks | cs.AI | 86 | Predicts per-task success in agentic coding via IRT-style psychometrics; separates LLM vs scaffold ability. | agents, coding, evaluation, predictive-metrics, IRT, scaffolding |
| 2604.00477 | Logarithmic Scores, Power-Law Discoveries: Disentangling Measurement from Coverage in Agent-Based Evaluation | cs.AI, cs.CL, cs.HC, cs.MA | 86 | Agent-judge eval study: panel size vs score saturation and issue discovery scaling. | evaluation, LLM-judges, scaling-laws, reliability, human-agreement, red-teaming |
| 2604.00694 | Internal APIs Are All You Need: Shadow APIs, Shared Discovery, and the Case Against Browser-First Agent Architectures | cs.ET, cs.AI | 86 | Agent web interaction via shared shadow-API route graph; could reshape agent architectures/security. | agents, web automation, APIs, tooling, attack surface, infrastructure |
| 2604.01195 | ORBIT: Scalable and Verifiable Data Generation for Search Agents on a Tight Budget | cs.CL, cs.AI, cs.IR | 85 | 20K verifiable multi-step search-agent dataset built cheaply; includes external verification pipeline. | search-agents, dataset, verification, web, RAG, training-data |
| 2604.01108 | Adversarial Moral Stress Testing of Large Language Models | cs.AI | 84 | Multi-turn adversarial ethical stress testing to catch rare failures and degradation. | safety-eval, multi-turn, red-teaming, ethics, robustness, benchmarks |
| 2604.00979 | Dual Optimal: Make Your LLM Peer-like with Dignity | cs.CL, cs.AI | 84 | Targets sycophancy/evasiveness; introduces PersonaKnob + constrained Lagrangian DPO to avoid collapse. | alignment, anti-sycophancy, DPO, preference-learning, personas, evaluation |
| 2604.01007 | OmniMem: Autoresearch-Guided Discovery of Lifelong Multimodal Agent Memory | cs.AI | 84 | Autonomous research pipeline discovers multimodal lifelong agent memory design. | agents, memory, lifelong-learning, multimodal, auto-research, retrieval, benchmarks |
| 2604.00722 | LangMARL: Natural Language Multi-Agent Reinforcement Learning | cs.CL | 84 | Brings MARL credit assignment + policy gradients into language space for coordinating LLM agents. | multi-agent, credit assignment, LLM agents, MARL, coordination, policy gradient |
| 2604.01039 | Automated Framework to Evaluate and Harden LLM System Instructions against Encoding Attacks | cs.CR, cs.AI | 83 | Automates testing/hardening system prompts vs encoding-based instruction leakage attacks. | system-prompt, instruction-leakage, encoding-attacks, hardening, LLM-security |
| 2604.00778 | From Early Encoding to Late Suppression: Interpreting LLMs on Character Counting Tasks | cs.CL | 83 | Mechanistic analysis: correct internal counting but late suppression at output layer. | mechanistic-interpretability, reasoning, probing, activation-patching, logit-lens, failure-modes |
| 2604.01220 | Universal YOCO for Efficient Depth Scaling | cs.CL | 83 | Efficient test-time depth scaling via parameter sharing/recursive compute; targets KV/depth costs. | LLM efficiency, test-time scaling, architecture, parameter sharing, long reasoning |
| 2604.00356 | Signals: Trajectory Sampling and Triage for Agentic Interactions | cs.AI, cs.CL | 82 | Cheap signals to sample/triage agent trajectories for post-deployment monitoring at scale. | agents, monitoring, telemetry, triage, post-deployment, evaluation |
| 2604.00362 | In harmony with gpt-oss | cs.AI, cs.LG | 81 | Reproduces gpt-oss tool scores via reverse-engineered tools + native agent harness. | agents, tool-use, reproducibility, evaluation, SWE-bench, harness, open-source |
| 2604.00801 | Routing-Free Mixture-of-Experts | cs.LG, cs.AI, cs.CL | 81 | Removes centralized MoE routing; continuous expert self-activation + adaptive load balancing. | Mixture-of-Experts, routing, scaling, efficiency, architecture, training dynamics |
AI Paper Insight Brief
2026-04-03
1) Executive takeaways (read this first)
- Agent evaluation is shifting from “one score” to “systems observability”: multiple papers propose cheap triage, psychometric difficulty modeling, and panel-sizing laws to make agent monitoring and improvement budget-feasible without judging every trajectory.
- Interface fidelity is now a first-order benchmark variable: reproducing published agentic coding scores required recovering in-distribution tools and running the model in its native message format; format/tool mismatch can create huge, misleading gaps.
- Security threats are expanding from prompts to pipelines and weights: new attacks/defenses target (i) RAG supply chains (provenance + taint), (ii) model merging (latent trojans that activate only post-merge), (iii) continuous-latent reasoning (embedding-row backdoors), and (iv) system prompt leakage via encoding formats.
- Long-horizon “realism” benchmarks are getting sharper: proactive assistants with active users, interruptible web agents, and year-long planning sims all show frontier models still plateau at modest success rates and incur large token-dominated recovery costs.
- Interpretability results increasingly imply control/attack surfaces: evidence that tool-use decisions are encoded before chain-of-thought begins, and that some symbolic failures come from late-layer suppression, both suggest interventions must target internal decision circuits—not just prompting.
2) Key themes (clusters)
Theme: Scalable agent evaluation & data selection
- Why it matters: Agentic systems generate massive interaction traces; without scalable selection and measurement, teams either overspend on review/judging or miss rare but critical failures.
- Representative papers:
- Signals: Trajectory Sampling and Triage for Agentic Interactions
- Logarithmic Scores, Power-Law Discoveries: Disentangling Measurement from Coverage in Agent-Based Evaluation
- Agent psychometrics: Task-level performance prediction in agentic coding benchmarks
- EvolveTool-Bench: Evaluating the Quality of LLM-Generated Tool Libraries as Software Artifacts
- Common approach:
- Replace “evaluate everything” with selection/estimation layers (rule signals, IRT-style predictors, panel scaling laws).
- Use structured artifacts (tool calls, repo state/tests/solutions, persona diaries) to improve interpretability and efficiency.
- Report metrics beyond task success (informativeness yield, ICC reliability, issue discovery scaling, library health).
- Open questions / failure modes:
- How well do these methods transfer from benchmarks/simulated users to production traffic?
- Risk of blind spots: coarse signals miss “behaviorally normal but semantically wrong” trajectories; predictors may encode dataset artifacts.
- How to close the loop end-to-end (triage → preference data → training → measurable improvement)?
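The selection-layer idea above can be sketched as a few deterministic trajectory signals plus a yield metric. The signal names, thresholds, and trajectory fields below are illustrative assumptions, not taken from any of the papers:

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    """Minimal agent-trajectory record (fields are illustrative)."""
    tool_errors: int = 0        # execution-side failure evidence
    user_corrections: int = 0   # interaction-side failure evidence
    turns: int = 0
    informative: bool = False   # ground-truth label, used only to evaluate the triage

def triage_score(t: Trajectory) -> int:
    """Deterministic, model-free signals; higher = more worth reviewing."""
    score = 0
    if t.tool_errors > 0:
        score += 2
    if t.user_corrections > 0:
        score += 2
    if t.turns > 20:
        score += 1  # unusually long episode
    return score

def informativeness_yield(trajs, budget):
    """Fraction of reviewed trajectories that were informative, when the
    review budget is spent on the highest-scoring ones."""
    ranked = sorted(trajs, key=triage_score, reverse=True)[:budget]
    return sum(t.informative for t in ranked) / max(len(ranked), 1)
```

Ranking by cheap signals and reviewing only the top of the queue is what turns "informativeness per label" into a tunable budget knob rather than a fixed cost.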
Theme: Harness fidelity & reproducibility in agentic coding
- Why it matters: Published scores can be non-reproducible if the evaluation harness, message format, and toolset differ from training-time distribution—misleading model selection and deployment planning.
- Representative papers:
- In harmony with gpt-oss
- Common approach:
- Recover or define in-distribution tools and schemas; run agents in native formats to avoid conversion loss.
- Measure not just pass@1 but also context overflow, tool schema robustness, and regression/composability of generated code.
- Open questions / failure modes:
- Tool discovery may be incomplete if logs are partial; harness choices can still hide contamination or other confounds.
- How to standardize “agent harness specs” so leaderboards remain comparable across implementations?
Theme: Supply-chain security for LLM systems (data, prompts, weights)
- Why it matters: As LLM systems become compositional (RAG corpora, merged checkpoints, system prompts, generated artifacts), attackers can target the pipeline rather than the model’s surface behavior.
- Representative papers:
- RAGShield: Provenance-Verified Defense-in-Depth Against Knowledge Base Poisoning in Government Retrieval-Augmented Generation Systems
- When Safe Models Merge into Danger: Exploiting Latent Vulnerabilities in LLM Fusion
- Thinking Wrong in Silence: Backdoor Attacks on Continuous Latent Reasoning
- Automated Framework to Evaluate and Harden LLM System Instructions against Encoding Attacks
- Common approach:
- Treat knowledge/model artifacts like software supply chains (attestations, provenance, integrity bounds).
- Demonstrate stealthy attacks that pass standard checks (safe sources that become unsafe post-merge; latent triggers in embeddings).
- Add defense-in-depth layers (provenance + trust-weighted retrieval + taint tracking; design-time prompt reshaping).
- Open questions / failure modes:
- Provenance defenses have insider replacement blind spots (in-place edits) unless hash-pinning/re-attestation is enforced.
- Merging-time defenses can fail under adaptive threats; detection without privileged access (hidden states) remains hard.
- Encoding-based prompt leakage suggests “refusal on direct ask” is not a confidentiality guarantee.
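A minimal version of the hash-pinning idea, which catches the in-place-edit blind spot that plain provenance metadata misses. The record shape and field names are assumptions, and a production system would cryptographically sign the attestation rather than merely record a signer:

```python
import hashlib

def attest(doc_id: str, content: bytes, signer: str) -> dict:
    """Create a minimal attestation record pinning the document's hash."""
    return {
        "doc_id": doc_id,
        "sha256": hashlib.sha256(content).hexdigest(),
        "signer": signer,
    }

def verify(attestation: dict, content: bytes) -> bool:
    """Reject documents whose bytes no longer match the pinned hash,
    catching insider in-place edits that provenance metadata alone misses."""
    return attestation["sha256"] == hashlib.sha256(content).hexdigest()
```

Re-attestation after legitimate edits then becomes an explicit, auditable workflow step rather than a silent corpus update.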
Theme: Realistic long-horizon & mixed-initiative agent benchmarks
- Why it matters: Deployment failures often come from statefulness, interruptions, user acceptance, and delayed consequences—properties underrepresented in short-horizon benchmarks.
- Representative papers:
- Proactive Agent Research Environment: Simulating Active Users to Evaluate Proactive Assistants
- When Users Change Their Mind: Evaluating Interruptible Agents in Long-Horizon Web Navigation
- $\texttt{YC-Bench}$: Benchmarking AI Agents for Long-Term Planning and Consistent Execution
- Common approach:
- Build stateful environments (FSM apps, web UI state, POMDP business sim) and evaluate success under constraints.
- Add metrics for acceptance, post-update success curves, and cost/efficiency (tokens/actions, API cost).
- Open questions / failure modes:
- Simulators may not capture real user variability; synthesized interruptions/scenarios can bias results.
- Token overhead dominates recovery in interruption settings—how to reduce “thinking cost” without harming adaptation?
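The cost-curve metrics discussed above can be computed from episode logs along these lines; the episode schema (interruptions, success, tokens) is hypothetical, not any benchmark's actual log format:

```python
from collections import defaultdict

def sr_curve(episodes):
    """Success rate as a function of the number of mid-task interruptions k."""
    by_k = defaultdict(list)
    for ep in episodes:
        by_k[ep["interruptions"]].append(ep["success"])
    return {k: sum(v) / len(v) for k, v in sorted(by_k.items())}

def token_delta(episodes, k):
    """Mean extra tokens spent by k-interruption episodes vs uninterrupted ones,
    i.e. the 'thinking cost' of recovery."""
    base = [ep["tokens"] for ep in episodes if ep["interruptions"] == 0]
    cur = [ep["tokens"] for ep in episodes if ep["interruptions"] == k]
    return sum(cur) / len(cur) - sum(base) / len(base)
```

Reporting SR(k) together with token_delta(k) separates "the agent adapts" from "the agent adapts, but at a cost that dominates the episode."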
Theme: Making control and internal decisions explicit (interpretability → engineering)
- Why it matters: If decisions are made implicitly inside generation, failures are hard to attribute; if decisions are encoded pre-CoT, explanations may be post-hoc.
- Representative papers:
- Decision-Centric Design for LLM Systems
- Therefore I am. I Think
- From Early Encoding to Late Suppression: Interpreting LLMs on Character Counting Tasks
- Common approach:
- Separate signals/estimators from policies/controllers (explicit decision layer).
- Use probing/steering/patching to localize where decisions or failures arise (pre-gen action encoding; late-layer suppression circuits).
- Open questions / failure modes:
- How to prevent premature commitment (pre-gen decision) while preserving performance?
- Whether these mechanistic findings generalize to larger models and real tool-use stacks.
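A toy illustration of the explicit-decision-layer pattern, with signal estimation separated from a deterministic, logged policy. The signal names and thresholds are placeholders, not the papers' actual interfaces:

```python
from dataclasses import dataclass
import json

@dataclass
class Signals:
    """Estimated quantities the controller is allowed to see (illustrative)."""
    sufficiency: float   # do we have enough context to answer?
    uncertainty: float   # self-estimated uncertainty of the answer

def decide(sig: Signals) -> str:
    """Deterministic policy over estimated signals. The thresholds are
    placeholders; the point is that the rule is explicit and inspectable."""
    if sig.sufficiency < 0.5:
        return "retrieve_more"
    if sig.uncertainty > 0.7:
        return "escalate_to_human"
    return "answer"

def decide_logged(sig: Signals, log: list) -> str:
    """Record the full decision context so failures are attributable."""
    action = decide(sig)
    log.append(json.dumps({"signals": vars(sig), "action": action}))
    return action
```

Because the policy is pure code over named signals, a bad outcome can be attributed either to a mis-estimated signal or to a bad rule, instead of disappearing inside generation.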
3) Technical synthesis
- Multiple works converge on “separate measurement from action”: Signals (triage), Decision-Centric (an explicit decision layer), and agent-judge scaling (ICC vs discovery) all argue for modularizing what you observe vs what you do with it.
- Budget-aware evaluation is becoming formal: Signals reports informativeness yield per label; agent-judge panels show logarithmic reliability but power-law discovery; psychometrics predicts per-task success to avoid reruns.
- Artifact-level evaluation is expanding beyond outputs: EvolveTool-Bench evaluates evolving tool libraries (reuse/regression), while gpt-oss reproduction shows harness/tool/message-format are part of the “artifact.”
- Security papers increasingly adopt supply-chain framings: RAGShield uses attestations + taint; TrojanMerge targets parameter fusion; THOUGHTSTEER targets embedding rows in latent-reasoning models; encoding attacks target system instruction confidentiality.
- Several results imply privileged-access asymmetry: strong detection bounds/probes exist with hidden-state access (continuous-latent backdoor probes; collusion probes), but black-box detection is much weaker.
- Long-horizon benchmarks (Pare, InterruptBench, YC-Bench) consistently show frontier models plateau and that efficiency costs (tokens, retries, API cost) are decisive, not just success rate.
- Interpretability findings (pre-gen tool decision; late suppression circuits) suggest that post-hoc CoT can be unreliable as an explanation channel—supporting Decision-Centric’s push for explicit decision interfaces.
- Reproducibility work (Harmony/tools) highlights that context window overflow and message formatting can dominate outcomes—interacting with long-horizon settings where context pressure is constant.
- Across evaluation papers, there’s a recurring pattern: coarse metrics hide failure modes (task completion hides library debt; average scores hide tail drift; aggregate pass@1 hides harness mismatch).
4) Top 5 papers (with “why now”)
1) Thinking Wrong in Silence: Backdoor Attacks on Continuous Latent Reasoning
- Shows a training-time backdoor (THOUGHTSTEER) that achieves ~100% attack success with minimal clean-accuracy loss on continuous-latent reasoning models.
- Connects robustness to Neural Collapse and reports linear probes with AUC≈1.0 given hidden-state access.
- Evaluates multiple defenses and finds they fail to reduce ASR while preserving clean accuracy.
- Skepticism: strongest detection relies on hidden-state access; mechanistic depth is most complete on smaller models (COCONUT 124M).
2) When Safe Models Merge into Danger: Exploiting Latent Vulnerabilities in LLM Fusion
- Introduces TrojanMerge: source models remain individually safe, but merged models reach harmful scores up to 85.4%.
- Works across multiple merging algorithms (Task Arithmetic/DARE/TIES/KnOTS) with high average harmfulness post-merge.
- Highlights that “passes safety checks alone” is not sufficient for models intended for merging.
- Skepticism: evaluated primarily on dual-model merges; attack assumes ability to construct a safety-critical transformation (gradient/data access).
3) In harmony with gpt-oss
- Independently reproduces OpenAI gpt-oss-20b scores by recovering in-distribution tools and implementing a native Harmony harness.
- Quantifies how Chat Completions conversion inflates context overflow (e.g., Harmony 0.2% vs Chat 11.0% in one setting).
- Provides a concrete tool-discovery methodology and harness design that practitioners can reuse.
- Skepticism: tool discovery is bounded by available logs; contamination concerns in SWE Verified are explicitly not investigated.
4) Signals: Trajectory Sampling and Triage for Agentic Interactions
- Deterministic, model-free signals raise “developer-informative” yield to 82% vs 54% random on τ-bench, improving label efficiency (reported 1.52×).
- Separates interaction vs execution failures—important for tool-using agents where fluent dialogue can mask execution issues.
- Designed to run always-on without extra model calls.
- Skepticism: coarse taxonomy misses semantically wrong but behaviorally normal traces; evaluation uses simulated users (τ-bench).
5) Do Phone-Use Agents Respect Your Privacy?
- Makes privacy in GUI agents auditable via iMy (LOW/HIGH data + permission tools) and instrumented apps that log field-level edits.
- Shows success and privacy diverge sharply (e.g., Claude Opus 4.6: 82.8% success but 47.2% PQSR at τ=0.7).
- Identifies form minimization (overfilling optional personal fields) as the most persistent failure mode.
- Skepticism: mock apps + permissive user simulator (always grants HIGH) limit realism; doesn’t cover network exfiltration or cross-app leakage.
5) Practical next steps
- Add a cheap triage layer to your agent logs (interaction + execution signals) to prioritize human review; track “informativeness per label” as a first-class metric.
- Version and validate your harness: lock message format, tool schemas, and context accounting; measure context overflow and tool-call schema adherence as part of CI for evaluations.
- Treat RAG corpora like supply chains: implement document attestations + hash-pinning/re-attestation workflows; add trust-weighted retrieval and taint propagation for high-integrity domains.
- Harden against prompt/system leakage via format attacks: explicitly test “print system prompt in YAML/TOML/cron/gitignore” style probes; consider design-time instruction reshaping and re-test ASR.
- If you merge models, add merge-time safety checks: evaluate harmfulness post-merge (not just per-source), and consider integrity verification of contributors before fusion.
- Benchmark long-horizon behaviors with cost curves: for interruptions, track SR(k) and token deltas; for proactive assistants, track proposal vs acceptance vs success; for planning sims, track memory usage (scratchpad writes) as a predictor.
- Make control explicit: separate signal estimation (sufficiency/correctness/uncertainty) from deterministic policies; log decision contexts so failures are attributable.
- Privacy for GUI agents: instrument form drafts and enforce minimization policies (required vs optional fields); measure PQSR-like joint metrics rather than task success alone.
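As one way to make the harness-validation checks above concrete, the two CI metrics could look like the following sketch; the tool-call and episode shapes are assumptions, not any real harness's API:

```python
def schema_adherent(call: dict, schema: dict) -> bool:
    """Check that a tool call supplies all required arguments and no unknown
    ones. A real harness would also type-check values; names are illustrative."""
    required = set(schema["required"])
    allowed = set(schema["properties"])
    args = set(call.get("arguments", {}))
    return required <= args <= allowed

def overflow_rate(episodes, context_limit: int) -> float:
    """Fraction of episodes whose peak token count exceeds the context limit --
    the kind of metric worth gating on in evaluation CI."""
    return sum(ep["peak_tokens"] > context_limit for ep in episodes) / len(episodes)
```

Tracking both per model-and-harness pair makes format/tool mismatch visible before it silently distorts a leaderboard score.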
Generated from per-paper analyses; no external browsing.
