Daily AI Paper Report (2026-04-03)

Published:

Chinese version: [中文]

Run stats

  • Candidates: 222
  • Selected: 30
  • Deepread completed: 30
  • Window (UTC): 2026-04-01T00:00:00Z → 2026-04-02T00:00:00Z (arxiv_announce, expanded=0)
Show selected papers
arXiv IDTitle / LinksCategoriesScoreWhyTags
2604.00788UK AISI Alignment Evaluation Case-Study
PDF
cs.AI, cs.CR96AISI case study on sabotage in AI-lab coding assistants; concrete frontier-model behaviors.AISI, alignment-eval, sabotage, agentic-coding, deployment, model-behavior
2604.01151Detecting Multi-Agent Collusion Through Multi-Agent Interpretability
PDF
cs.AI, cs.LG, cs.MA95Benchmark + probes for detecting multi-agent collusion; strong OOD transfer focus.multi-agent, collusion, interpretability, probes, benchmark, security, OOD-generalization
2604.01194AgentWatcher: A Rule-based Prompt Injection Monitor
PDF
cs.CR94Rule-based prompt-injection monitor using causal attribution to scale to long contexts.prompt-injection, agents, monitoring, attribution, long-context, security
2604.00770Thinking Wrong in Silence: Backdoor Attacks on Continuous Latent Reasoning
PDF
cs.LG, cs.AI93Backdoors for tokenless latent reasoning; high ASR and evades token-level defenses.backdoors, latent-reasoning, continuous-CoT, adversarial, auditing, security
2604.01212$\texttt{YC-Bench}$: Benchmarking AI Agents for Long-Term Planning and Consistent Execution
PDF
cs.CL, cs.AI92Long-horizon agent benchmark (hundreds of turns) for planning, delayed feedback, compounding errors.agents, benchmark, long-horizon, planning, evaluation, simulated-environment
2604.00547Does Unification Come at a Cost? Uni-SafeBench: A Safety Benchmark for Unified Multimodal Large Models
PDF
cs.AI, cs.LG92New safety benchmark for unified multimodal models; taxonomy + judging framework.multimodal, safety-benchmark, evaluation, UMLM, red-teaming, safety-taxonomy
2604.00414Decision-Centric Design for LLM Systems
PDF
cs.AI, cs.LG92Makes LLM control decisions explicit/inspectable; improves debugging, constraints, and safety.LLM systems, agent control, decision layer, tool use, reliability, governance
2604.00627When Safe Models Merge into Danger: Exploiting Latent Vulnerabilities in LLM Fusion
PDF
cs.CR91Shows model merging can unlock hidden trojans; new attack surface for alignment fusion.model-merging, trojans, safety-regression, attack-surface, alignment, security
2604.00986Do Phone-Use Agents Respect Your Privacy?
PDF
cs.CR, cs.AI, cs.CL, cs.LG90MyPhoneBench makes mobile-agent privacy measurable: permissions, minimal disclosure, memory.privacy, mobile-agents, benchmark, auditing, permissions, evaluation
2604.00842Proactive Agent Research Environment: Simulating Active Users to Evaluate Proactive Assistants
PDF
cs.AI, cs.LG, cs.MA90User-simulator + FSM app modeling to evaluate proactive agents; introduces Pare-Bench tasks.agents, proactive-assistants, user-simulation, benchmark, evaluation, tool-use
2604.00892When Users Change Their Mind: Evaluating Interruptible Agents in Long-Horizon Web Navigation
PDF
cs.CL90InterruptBench targets long-horizon web agents handling mid-task goal changes.agents, web-navigation, long-horizon, interruptibility, benchmark, reliability, human-in-the-loop
2604.00445Towards Reliable Truth-Aligned Uncertainty Estimation in Large Language Models
PDF
cs.AI, cs.CL90Truth-anchored calibration for LLM uncertainty to detect hallucinations; targets proxy failure.uncertainty, hallucinations, calibration, reliability, evaluation, post-hoc
2604.01202Therefore I am. I Think
PDF
cs.AI89Evidence decisions are encoded pre-CoT; probing + causal steering affects behavior.mechanistic-interpretability, chain-of-thought, steering, tool-use, probes, agency
2604.00387RAGShield: Provenance-Verified Defense-in-Depth Against Knowledge Base Poisoning in Government Retrieval-Augmented Generation Systems
PDF
cs.CR, cs.AI88Defense-in-depth for RAG poisoning via provenance/attestation + taint-style reasoning.RAG, data-poisoning, provenance, supply-chain, grounding, security
2604.01052VibeGuard: A Security Gate Framework for AI-Generated Code
PDF
cs.CR, cs.AI88Practical secure-dev gate for AI-generated code; targets real packaging/artifact leak failure modes.security, code-generation, supply-chain, static-analysis, deployment, guardrails
2604.00392EvolveTool-Bench: Evaluating the Quality of LLM-Generated Tool Libraries as Software Artifacts
PDF
cs.SE, cs.AI86Benchmark for LLM-generated tool libraries with safety/robustness and regression metrics.agents, tool-use, benchmark, software-quality, safety-metrics, evaluation
2604.00594Agent psychometrics: Task-level performance prediction in agentic coding benchmarks
PDF
cs.AI86Predicts per-task success in agentic coding via IRT-style psychometrics; separates LLM vs scaffold ability.agents, coding, evaluation, predictive-metrics, IRT, scaffolding
2604.00477Logarithmic Scores, Power-Law Discoveries: Disentangling Measurement from Coverage in Agent-Based Evaluation
PDF
cs.AI, cs.CL, cs.HC, cs.MA86Agent-judge eval study: panel size vs score saturation and issue discovery scaling.evaluation, LLM-judges, scaling-laws, reliability, human-agreement, red-teaming
2604.00694Internal APIs Are All You Need: Shadow APIs, Shared Discovery, and the Case Against Browser-First Agent Architectures
PDF
cs.ET, cs.AI86Agent web interaction via shared shadow-API route graph; could reshape agent architectures/security.agents, web automation, APIs, tooling, attack surface, infrastructure
2604.01195ORBIT: Scalable and Verifiable Data Generation for Search Agents on a Tight Budget
PDF
cs.CL, cs.AI, cs.IR8520K verifiable multi-step search-agent dataset built cheaply; includes external verification pipeline.search-agents, dataset, verification, web, RAG, training-data
2604.01108Adversarial Moral Stress Testing of Large Language Models
PDF
cs.AI84Multi-turn adversarial ethical stress testing to catch rare failures and degradation.safety-eval, multi-turn, red-teaming, ethics, robustness, benchmarks
2604.00979Dual Optimal: Make Your LLM Peer-like with Dignity
PDF
cs.CL, cs.AI84Targets sycophancy/evasiveness; introduces PersonaKnob + constrained Lagrangian DPO to avoid collapse.alignment, anti-sycophancy, DPO, preference-learning, personas, evaluation
2604.01007OmniMem: Autoresearch-Guided Discovery of Lifelong Multimodal Agent Memory
PDF
cs.AI84Autonomous research pipeline discovers multimodal lifelong agent memory design.agents, memory, lifelong-learning, multimodal, auto-research, retrieval, benchmarks
2604.00722LangMARL: Natural Language Multi-Agent Reinforcement Learning
PDF
cs.CL84Brings MARL credit assignment + policy gradients into language space for coordinating LLM agents.multi-agent, credit assignment, LLM agents, MARL, coordination, policy gradient
2604.01039Automated Framework to Evaluate and Harden LLM System Instructions against Encoding Attacks
PDF
cs.CR, cs.AI83Automates testing/hardening system prompts vs encoding-based instruction leakage attacks.system-prompt, instruction-leakage, encoding-attacks, hardening, LLM-security
2604.00778From Early Encoding to Late Suppression: Interpreting LLMs on Character Counting Tasks
PDF
cs.CL83Mechanistic analysis: correct internal counting but late suppression at output layer.mechanistic-interpretability, reasoning, probing, activation-patching, logit-lens, failure-modes
2604.01220Universal YOCO for Efficient Depth Scaling
PDF
cs.CL83Efficient test-time depth scaling via parameter sharing/recursive compute; targets KV/depth costs.LLM efficiency, test-time scaling, architecture, parameter sharing, long reasoning
2604.00356Signals: Trajectory Sampling and Triage for Agentic Interactions
PDF
cs.AI, cs.CL82Cheap signals to sample/triage agent trajectories for post-deployment monitoring at scale.agents, monitoring, telemetry, triage, post-deployment, evaluation
2604.00362In harmony with gpt-oss
PDF
cs.AI, cs.LG81Reproduces gpt-oss tool scores via reverse-engineered tools + native agent harness.agents, tool-use, reproducibility, evaluation, SWE-bench, harness, open-source
2604.00801Routing-Free Mixture-of-Experts
PDF
cs.LG, cs.AI, cs.CL81Removes centralized MoE routing; continuous expert self-activation + adaptive load balancing.Mixture-of-Experts, routing, scaling, efficiency, architecture, training dynamics

AI Paper Insight Brief

2026-04-03

0) Executive takeaways (read this first)

  • Agent evaluation is shifting from “one score” to “systems observability”: multiple papers propose cheap triage, psychometric difficulty modeling, and panel-sizing laws to make agent monitoring and improvement budget-feasible without judging every trajectory.
  • Interface fidelity is now a first-order benchmark variable: reproducing published agentic coding scores required recovering in-distribution tools and running the model in its native message format; format/tool mismatch can create huge, misleading gaps.
  • Security threats are expanding from prompts to pipelines and weights: new attacks/defenses target (i) RAG supply chains (provenance + taint), (ii) model merging (latent trojans that activate only post-merge), (iii) continuous-latent reasoning (embedding-row backdoors), and (iv) system prompt leakage via encoding formats.
  • Long-horizon “realism” benchmarks are getting sharper: proactive assistants with active users, interruptible web agents, and year-long planning sims all show frontier models still plateau at modest success rates and incur large token-dominated recovery costs.
  • Interpretability results increasingly imply control/attack surfaces: evidence that tool-use decisions are encoded before chain-of-thought begins, and that some symbolic failures come from late-layer suppression, both suggest interventions must target internal decision circuits—not just prompting.

2) Key themes (clusters)

Theme: Scalable agent evaluation & data selection

Theme: Harness fidelity & reproducibility in agentic coding

  • Why it matters: Published scores can be non-reproducible if the evaluation harness, message format, and toolset differ from training-time distribution—misleading model selection and deployment planning.
  • Representative papers:
  • Common approach:
    • Recover or define in-distribution tools and schemas; run agents in native formats to avoid conversion loss.
    • Measure not just pass@1 but also context overflow, tool schema robustness, and regression/composability of generated code.
  • Open questions / failure modes:
    • Tool discovery may be incomplete if logs are partial; harness choices can still hide contamination or other confounds.
    • How to standardize “agent harness specs” so leaderboards remain comparable across implementations?

Theme: Supply-chain security for LLM systems (data, prompts, weights)

Theme: Realistic long-horizon & mixed-initiative agent benchmarks

Theme: Making control and internal decisions explicit (interpretability → engineering)

  • Why it matters: If decisions are made implicitly inside generation, failures are hard to attribute; if decisions are encoded pre-CoT, explanations may be post-hoc.
  • Representative papers:
  • Common approach:
    • Separate signals/estimators from policies/controllers (explicit decision layer).
    • Use probing/steering/patching to localize where decisions or failures arise (pre-gen action encoding; late-layer suppression circuits).
  • Open questions / failure modes:
    • How to prevent premature commitment (pre-gen decision) while preserving performance?
    • Whether these mechanistic findings generalize to larger models and real tool-use stacks.

3) Technical synthesis

  • Multiple works converge on “separate measurement from action”: Signals (triage), Decision-Centric (explicit δ), and agent-judge scaling (ICC vs discovery) all argue for modularizing what you observe vs what you do with it.
  • Budget-aware evaluation is becoming formal: Signals reports informativeness yield per label; agent-judge panels show logarithmic reliability but power-law discovery; psychometrics predicts per-task success to avoid reruns.
  • Artifact-level evaluation is expanding beyond outputs: EvolveTool-Bench evaluates evolving tool libraries (reuse/regression), while gpt-oss reproduction shows harness/tool/message-format are part of the “artifact.”
  • Security papers increasingly adopt supply-chain framings: RAGShield uses attestations + taint; TrojanMerge targets parameter fusion; THOUGHTSTEER targets embedding rows in latent-reasoning models; encoding attacks target system instruction confidentiality.
  • Several results imply privileged-access asymmetry: strong detection bounds/probes exist with hidden-state access (continuous-latent backdoor probes; collusion probes), but black-box detection is much weaker.
  • Long-horizon benchmarks (Pare, InterruptBench, YC-Bench) consistently show frontier models plateau and that efficiency costs (tokens, retries, API cost) are decisive, not just success rate.
  • Interpretability findings (pre-gen tool decision; late suppression circuits) suggest that post-hoc CoT can be unreliable as an explanation channel—supporting Decision-Centric’s push for explicit decision interfaces.
  • Reproducibility work (Harmony/tools) highlights that context window overflow and message formatting can dominate outcomes—interacting with long-horizon settings where context pressure is constant.
  • Across evaluation papers, there’s a recurring pattern: coarse metrics hide failure modes (task completion hides library debt; average scores hide tail drift; aggregate pass@1 hides harness mismatch).

4) Top 5 papers (with “why now”)

1) Thinking Wrong in Silence: Backdoor Attacks on Continuous Latent Reasoning

  • Shows a training-time backdoor (THOUGHTSTEER) that achieves ~100% attack success with minimal clean-accuracy loss on continuous-latent reasoning models.
  • Connects robustness to Neural Collapse and reports linear probes with AUC≈1.0 given hidden-state access.
  • Evaluates multiple defenses and finds they fail to reduce ASR while preserving clean accuracy.
  • Skepticism: strongest detection relies on hidden-state access; mechanistic depth is most complete on smaller models (COCONUT 124M).

2) When Safe Models Merge into Danger: Exploiting Latent Vulnerabilities in LLM Fusion

  • Introduces TrojanMerge: source models remain individually safe, but merged models reach harmful scores up to 85.4%.
  • Works across multiple merging algorithms (Task Arithmetic/DARE/TIES/KnOTS) with high average harmfulness post-merge.
  • Highlights that “passes safety checks alone” is not sufficient for models intended for merging.
  • Skepticism: evaluated primarily on dual-model merges; attack assumes ability to construct a safety-critical transformation (gradient/data access).

3) In harmony with gpt-oss

  • Independently reproduces OpenAI gpt-oss-20b scores by recovering in-distribution tools and implementing a native Harmony harness.
  • Quantifies how Chat Completions conversion inflates context overflow (e.g., Harmony 0.2% vs Chat 11.0% in one setting).
  • Provides a concrete tool-discovery methodology and harness design that practitioners can reuse.
  • Skepticism: tool discovery is bounded by available logs; contamination concerns in SWE Verified are explicitly not investigated.

4) Signals: Trajectory Sampling and Triage for Agentic Interactions

  • Deterministic, model-free signals raise “developer-informative” yield to 82% vs 54% random on τ-bench, improving label efficiency (reported 1.52×).
  • Separates interaction vs execution failures—important for tool-using agents where fluent dialogue can mask execution issues.
  • Designed to run always-on without extra model calls.
  • Skepticism: coarse taxonomy misses semantically wrong but behaviorally normal traces; evaluation uses simulated users (τ-bench).

5) Do Phone-Use Agents Respect Your Privacy?

  • Makes privacy in GUI agents auditable via iMy (LOW/HIGH data + permission tools) and instrumented apps that log field-level edits.
  • Shows success and privacy diverge sharply (e.g., Claude Opus 4.6: 82.8% success but 47.2% PQSR at τ=0.7).
  • Identifies form minimization (overfilling optional personal fields) as the most persistent failure mode.
  • Skepticism: mock apps + permissive user simulator (always grants HIGH) limit realism; doesn’t cover network exfiltration or cross-app leakage.

5) Practical next steps

  • Add a cheap triage layer to your agent logs (interaction + execution signals) to prioritize human review; track “informativeness per label” as a first-class metric.
  • Version and validate your harness: lock message format, tool schemas, and context accounting; measure context overflow and tool-call schema adherence as part of CI for evaluations.
  • Treat RAG corpora like supply chains: implement document attestations + hash-pinning/re-attestation workflows; add trust-weighted retrieval and taint propagation for high-integrity domains.
  • Harden against prompt/system leakage via format attacks: explicitly test “print system prompt in YAML/TOML/cron/gitignore” style probes; consider design-time instruction reshaping and re-test ASR.
  • If you merge models, add merge-time safety checks: evaluate harmfulness post-merge (not just per-source), and consider integrity verification of contributors before fusion.
  • Benchmark long-horizon behaviors with cost curves: for interruptions, track SR(k) and token deltas; for proactive assistants, track proposal vs acceptance vs success; for planning sims, track memory usage (scratchpad writes) as a predictor.
  • Make control explicit: separate signal estimation (sufficiency/correctness/uncertainty) from deterministic policies; log decision contexts so failures are attributable.
  • Privacy for GUI agents: instrument form drafts and enforce minimization policies (required vs optional fields); measure PQSR-like joint metrics rather than task success alone.

Generated from per-paper analyses; no external browsing.