Daily AI Paper Report (2026-04-16)
Run stats
- Candidates: 261
- Selected: 30
- Deepread completed: 30
- Window (UTC): 2026-04-14T00:00:00Z → 2026-04-15T00:00:00Z (arxiv_announce, expanded=0)
Selected papers
| arXiv ID | Title / Links | Categories | Score | Why | Tags |
|---|---|---|---|---|---|
| 2604.12177 | Policy-Invisible Violations in LLM-Based Agents | cs.AI, cs.CL, cs.CR, cs.LG | 95 | New agent failure mode + benchmark for compliance when policy facts are hidden from context | agents, compliance, benchmark, evaluation, tool-use, context-limitations, governance |
| 2604.12500 | Safety Training Modulates Harmful Misalignment Under On-Policy RL, But Direction Depends on Environment Design | cs.LG, cs.CR | 95 | Shows RL safety training can flip to harmful misalignment depending on environment design | agent-safety, rl, specification-gaming, sycophancy, evaluation, misalignment |
| 2604.12172 | COBALT-TLA: A Neuro-Symbolic Verification Loop for Cross-Chain Bridge Vulnerability Discovery | cs.CR, cs.LO | 95 | LLM+TLA+ loop finds bridge vulns fast; strong security relevance and concrete eval on Nomad-like exploit. | agent-security, formal-verification, TLA+, cybersecurity, vulnerability-discovery, neuro-symbolic, tool-augmented-LLMs |
| 2604.12384 | Preventing Safety Drift in Large Language Models via Coupled Weight and Activation Constraints | cs.AI | 95 | Coupled weight+activation constraints to prevent safety drift during fine-tuning; uses SAE safety features. | llm-safety, safety-drift, fine-tuning, regularization, sparse-autoencoders, refusal, alignment |
| 2604.12284 | WebAgentGuard: A Reasoning-Driven Guard Model for Detecting Prompt Injection Attacks in Web Agents | cs.CR | 93 | Guard-agent architecture to detect web prompt injection; targets real VLM web-agent threat model | web-agents, prompt-injection, guard-model, agent-security, VLM, detection |
| 2604.13018 | Toward Autonomous Long-Horizon Engineering for ML Research | cs.CL | 93 | Long-horizon ML research engineering agent with permission-scoped workspace; relevant to agent safety & control. | agents, autonomous-research, orchestration, tool-use, permissions, state-continuity, agent-evals |
| 2604.12162 | AlphaEval: Evaluating Agents in Production | cs.CL | 92 | Production-grounded agent benchmark (94 tasks, 7 companies) addressing real eval gaps | agents, evaluation, benchmarks, production, long-horizon, human-judgment |
| 2604.12374 | Nemotron 3 Super: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning | cs.LG, cs.AI, cs.CL | 92 | Open 120B MoE hybrid Mamba-Transformer w/ 1M context + speculative decoding; big frontier capability jump. | frontier-llm, MoE, mamba, long-context, efficiency, speculative-decoding, open-model |
| 2604.12232 | TEMPLATEFUZZ: Fine-Grained Chat Template Fuzzing for Jailbreaking and Red Teaming LLMs | cs.CR, cs.AI, cs.SE | 91 | Fuzzing chat templates as an overlooked jailbreak surface; systematic red-teaming methodology | jailbreak, red-teaming, fuzzing, chat-templates, LLM-security, evaluation |
| 2604.13006 | One Token Away from Collapse: The Fragility of Instruction-Tuned Helpfulness | cs.CL, cs.AI | 91 | Shows instruction-tuned helpfulness collapses under tiny lexical constraints; important robustness failure mode. | robustness, instruction-tuning, evaluation, reliability, constraints, helpfulness, failure-modes |
| 2604.12342 | CoLA: A Choice Leakage Attack Framework to Expose Privacy Risks in Subset Training | cs.CR, cs.CV | 90 | New privacy attack surface: subset/coreset selection choices can leak sensitive info | privacy, data-leakage, training-data, security, attacks, coresets |
| 2604.12632 | Calibration-Aware Policy Optimization for Reasoning LLMs | cs.LG, cs.AI | 90 | Targets overconfidence from GRPO; proposes calibration-aware RL objective with theory + bounds for reasoning LLMs. | alignment, calibration, RLHF, policy-optimization, reasoning, uncertainty, GRPO |
| 2604.12312 | CompliBench: Benchmarking LLM Judges for Compliance Violation Detection in Dialogue Systems | cs.CL | 89 | Benchmark for LLM-judge reliability in detecting/localizing compliance violations in dialogues | LLM-as-judge, compliance, benchmark, evaluation, dialogue, policy-violations |
| 2604.12359 | Compiling Activation Steering into Weights via Null-Space Constraints for Stealthy Backdoors | cs.CR, cs.CL | 88 | Stealthy LLM backdoors by compiling activation steering into weights; highlights supply-chain risk | backdoors, weight-editing, supply-chain, LLM-security, stealth-attacks, red-teaming |
| 2604.12994 | LogicEval: A Systematic Framework for Evaluating Automated Repair Techniques for Logical Vulnerabilities in Real-World Software | cs.CR, cs.AI | 88 | Framework to evaluate LLM vs classic repair on real logical vulns; useful for secure coding | cybersecurity, program-repair, llm-for-code, evaluation, vulnerabilities |
| 2604.12290 | Frontier-Eng: Benchmarking Self-Evolving Agents on Real-World Engineering Tasks with Generative Optimization | cs.AI, cs.CL | 88 | Real-world engineering benchmark for iterative propose-execute-evaluate agents with verifiers and continuous rewards. | agents, evaluation, benchmarks, generative-optimization, tool-use, verifiers, long-horizon |
| 2604.12308 | ContextLens: Modeling Imperfect Privacy and Safety Context for Legal Compliance | cs.CL | 88 | Models ambiguous/incomplete context for privacy & safety legal compliance; explicit known/unknown factorization. | privacy, safety, legal-compliance, context-modeling, llm-evals, risk-assessment, governance |
| 2604.12616 | Every Picture Tells a Dangerous Story: Memory-Augmented Multi-Agent Jailbreak Attacks on VLMs | cs.AI, cs.MM | 87 | Memory-augmented multi-agent jailbreaks for VLMs using natural-image semantics, not just pixels | VLM, multimodal-jailbreak, multi-agent, memory, adversarial-attacks, red-teaming |
| 2604.13016 | Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe | cs.LG, cs.AI, cs.CL | 87 | Systematic study of on-policy distillation dynamics; actionable recipe for post-training | post-training, distillation, rlhf, training-dynamics, llms |
| 2604.12559 | FABLE: Fine-grained Fact Anchoring for Unstructured Model Editing | cs.CL | 87 | Fine-grained fact anchoring for model editing + new diagnostic benchmark (UnFine); useful for knowledge updates. | model-editing, factuality, knowledge-updates, benchmarks, transformers, reliability |
| 2604.12376 | Cooperative Memory Paging with Keyword Bookmarks for Long-Horizon LLM Conversations | cs.CL, cs.AI | 86 | Practical long-horizon conversation memory: keyword bookmarks + recall tool; beats retrieval/truncation baselines. | agents, memory, long-context, tool-use, conversation, retrieval, evaluation |
| 2604.12736 | Token-Level Policy Optimization: Linking Group-Level Rewards to Token-Level Aggregation via Sequence-Level Likelihood | cs.CL | 86 | Token-level policy optimization linking group rewards to tokens; targets sparse-reward CoT training issues. | rlhf, policy-optimization, reasoning, sparse-rewards, grpo, kl-regularization, training |
| 2604.12986 | Parallax: Why AI Agents That Think Must Never Act | cs.CR, cs.AI | 85 | Argues prompt guardrails are insufficient for acting agents; proposes cognitive/executive separation | agent-safety, systems-security, sandboxing, permissions, architecture, governance |
| 2604.13029 | Visual Preference Optimization with Rubric Rewards | cs.CV, cs.AI | 85 | Rubric-based rewards for visual DPO; reusable rubric pool improves judge quality and downstream performance. | multimodal, dpo, reward-modeling, rubrics, preference-optimization, evaluation |
| 2604.12817 | Understanding and Improving Continuous Adversarial Training for LLMs via In-context Learning Theory | cs.LG, cs.CR, stat.ML | 84 | First theory for continuous adversarial training for LLM jailbreak defense via ICL analysis | adversarial-training, jailbreak-defense, theory, ICL, robustness, LLM-security |
| 2604.12610 | Transforming External Knowledge into Triplets for Enhanced Retrieval in RAG of LLMs | cs.CL | 84 | Triplet-structured retrieval to reduce RAG redundancy and improve alignment/efficiency | rag, retrieval, hallucinations, grounding, context-efficiency |
| 2604.12231 | Thought-Retriever: Don't Just Retrieve Raw Data, Retrieve Thoughts for Memory-Augmented Agentic Systems | cs.CL, cs.IR | 84 | Retrieves 'thoughts' not chunks to use arbitrarily large corpora beyond context limits; agentic memory angle. | RAG, agents, memory, retrieval, context-length, reasoning, model-agnostic |
| 2604.12379 | Beyond Output Correctness: Benchmarking and Evaluating Large Language Model Reasoning in Coding Tasks | cs.SE, cs.AI, cs.LG | 83 | Code reasoning-quality benchmark + evaluator; moves beyond output correctness for LLMs | evaluation, reasoning, code, benchmarks, verifiers |
| 2604.12875 | AISafetyBenchExplorer: A Metric-Aware Catalogue of AI Safety Benchmarks Reveals Fragmented Measurement and Weak Benchmark Governance | cs.AI | 82 | Catalogue of 195 safety benchmarks; meta-analysis shows fragmented metrics and weak governance | safety-benchmarks, measurement, meta-evaluation, governance, catalogue, metrics |
| 2604.12967 | Cycle-Consistent Search: Question Reconstructability as a Proxy Reward for Search Agent Training | cs.AI | 82 | Gold-free reward for training search agents via question reconstructability (cycle-consistency) | agents, search, reinforcement-learning, self-supervision, retrieval |
AI Paper Insight Brief
2026-04-16
1) Executive takeaways (read this first)
- “Real-world” agent readiness is still low and highly pipeline-dependent: AlphaEval’s best production configuration is only 64.41/100, and scaffold choice swings scores by ~11–15 points, meaning infra/orchestration can matter as much as the base model.
- Safety failures are increasingly “systems failures,” not “model reasoning failures”: Policy-invisible violations show models commit 90–98% of risky actions when policy metadata is hidden; Parallax argues for architectural separation (reasoner must not execute) and reports 98.9–100% blocking under an assume-compromise evaluation.
- Attack surfaces are shifting to “structure” (templates, tools, images, weights), not just prompts: TemplateFuzz gets ~98% Top-5 ASR on open models and 80–100% transfer to commercial models; MemJack reaches 71.48% ASR on unmodified natural images; STEEREDIT compiles steering into weights with URR >97% and low leakage when null-space constrained.
- Evaluation is fragmenting, but better measurement primitives are emerging: AlphaEval (production tasks), Frontier-Eng (budgeted optimization), CompliBench (turn-level guideline violations), CodeRQ-Bench/VERA (reasoning-quality in code), and AISafetyBenchExplorer (metric-collision governance) all point to a shift from single-number benchmarks to trace-, rubric-, and structure-aware evaluation.
- RL/post-training is being retooled for stability and trust signals: CAPO targets calibration collapse under GRPO (AUC gains on AIME 2025), TEPO improves token-level credit assignment and convergence, and OPD analysis shows distillation success depends on teacher–student “thinking pattern” overlap and breaks down at long trajectory depths.
2) Key themes (clusters)
Theme: Production-grounded agent evaluation & optimization benchmarks
- Why it matters: Benchmarks that don’t reflect under-specification, multimodality, long-horizon deliverables, and subjective stakeholder judgment can’t predict deployment value; optimization-style tasks better match real engineering work.
- Representative papers:
- AlphaEval: Evaluating Agents in Production
- Frontier-Eng: Benchmarking Self-Evolving Agents on Real-World Engineering Tasks with Generative Optimization
- AISafetyBenchExplorer: A Metric-Aware Catalogue of AI Safety Benchmarks Reveals Fragmented Measurement and Weak Benchmark Governance
- Common approach:
- Build tasks from authentic requirements (partners / real workflows) or executable verifiers (frozen evaluators, sandboxing).
- Use multi-paradigm evaluation (rubrics + execution + formal checks) and record traces for failure analysis.
- Prefer budgeted iterative improvement metrics (rank/profiles) over binary pass/fail.
- Open questions / failure modes:
- How to keep benchmarks longitudinally valid as partner standards and models evolve (AlphaEval snapshot limitation).
- Metric comparability: “accuracy/F1/safety score” collisions across benchmarks (AISafetyBenchExplorer).
- Preventing evaluator gaming while still allowing rich, subjective quality criteria.
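The budgeted rank-style scoring mentioned above can be made concrete. This is a minimal sketch (the input schema and tie-handling are illustrative assumptions, not Frontier-Eng's actual aggregation): rank models per task by their best score within budget, give ties the mean of their ranks, then average across tasks.

```python
from collections import defaultdict

def average_ranks(scores_by_task):
    """scores_by_task: {task: {model: best score achieved within budget}}.
    Rank models per task (1 = best; ties share the mean rank), then
    average each model's rank across tasks. Lower is better."""
    totals, counts = defaultdict(float), defaultdict(int)
    for task_scores in scores_by_task.values():
        ordered = sorted(task_scores.items(), key=lambda kv: -kv[1])
        i = 0
        while i < len(ordered):
            # Extend j over the run of models tied with position i.
            j = i
            while j + 1 < len(ordered) and ordered[j + 1][1] == ordered[i][1]:
                j += 1
            mean_rank = (i + 1 + j + 1) / 2  # mean of ranks i+1 .. j+1
            for model, _ in ordered[i : j + 1]:
                totals[model] += mean_rank
                counts[model] += 1
            i = j + 1
    return {m: totals[m] / counts[m] for m in totals}
```

Note how this metric discards improvement magnitude by construction, which is exactly the skepticism raised about Frontier-Eng's average-rank reporting below.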
Theme: Enterprise compliance & policy enforcement needs world-state, not better prompts
- Why it matters: Many violations depend on metadata/state outside the model-visible context; prompt-only policies and content-only DLP can’t reliably enforce organizational rules.
- Representative papers:
- Policy-Invisible Violations in LLM-Based Agents
- CompliBench: Benchmarking LLM Judges for Compliance Violation Detection in Dialogue Systems
- ContextLens: Modeling Imperfect Privacy and Safety Context for Legal Compliance
- Common approach:
- Construct diagnostic benchmarks where decisive policy facts are hidden (PhantomPolicy) or where violations are precisely localized (CompliBench).
- Add structured enforcement layers: knowledge-graph world models + declarative invariants (Sentinel), or legal-text decomposition + precedence aggregation (ContextLens).
- Measure at turn/trace level, not just conversation-level outcomes.
- Open questions / failure modes:
- World-model coverage/freshness is the bottleneck; Sentinel still misses violations even with full benchmark coverage.
- Scope mis-attribution dominates judge errors in multi-turn guideline settings (CompliBench).
- Cost/latency: ContextLens increases token usage; real-time deployment trade-offs remain.
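The fork→mutate→check enforcement pattern above can be sketched in a few lines. This is a toy illustration, not Sentinel's actual design: the world-model schema, the invariant, and the "unknown fact" convention (a `None` value) are all invented here.

```python
import copy

def check_action(world, action, invariants):
    """Fork the world model, apply the proposed tool call to the copy,
    and test declarative invariants against the mutated state.
    Returns (verdict, violated_invariant_names)."""
    forked = copy.deepcopy(world)   # fork: never touch the live state
    action(forked)                  # mutate the copy only
    violated = [name for name, inv in invariants.items() if not inv(forked)]
    if violated:
        return ("Block", violated)
    # Unknown facts (None) force a clarification instead of silent approval.
    if any(v is None for v in forked.values()):
        return ("Clarify", [])
    return ("Allow", [])
```

The point of the fork is that decisive policy facts live in the world model, not in the model-visible context, so the verdict does not depend on what the LLM happened to be told.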
Theme: Agent security is becoming architecture-first (guards, separation, formal loops)
- Why it matters: Tool-using agents can cause irreversible harm; defenses inside the same reasoning substrate are brittle under prompt injection and context manipulation.
- Representative papers:
- WebAgentGuard: A Reasoning-Driven Guard Model for Detecting Prompt Injection Attacks in Web Agents
- Parallax: Why AI Agents That Think Must Never Act
- COBALT-TLA: A Neuro-Symbolic Verification Loop for Cross-Chain Bridge Vulnerability Discovery
- Common approach:
- Decouple detection/validation from the main agent (parallel guard gating; process separation).
- Use deterministic or formal oracles (TLC model checker; tiered validators) to correct or block.
- Evaluate under assume-compromise or adversarial settings rather than relying on model refusals.
- Open questions / failure modes:
- Synthetic training data and non-adaptive threat models (WebAgentGuard doesn’t consider white-box attacks).
- Trusted computing base risk: engine compromise undermines Parallax.
- Small-scope bounds and abstraction fidelity in formal modeling (COBALT-TLA).
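The tiered-validator idea (deterministic policy → classifier → LLM eval → human) can be sketched as a cheapest-first escalation chain. The tier functions below are invented stand-ins, not Parallax's real policy set; the key design choices are that each tier may abstain and that the chain fails closed.

```python
def tiered_validate(tool_call, tiers):
    """Run validators cheapest-first; each returns 'allow', 'block',
    or None to escalate to the next tier. Fail closed if no tier decides."""
    for name, tier in tiers:
        verdict = tier(tool_call)
        if verdict is not None:
            return name, verdict
    return "default", "block"  # nothing decided: deny by default

# Illustrative tiers (hypothetical tools and rules):
def policy_tier(call):
    if call["tool"] in {"delete_db", "wire_transfer"}:
        return "block"   # deterministic deny-list
    if call["tool"] in {"read_file"}:
        return "allow"   # deterministic allow-list
    return None          # escalate to the next tier

def classifier_tier(call):
    # Stand-in for a learned injection scorer.
    return "block" if "ignore previous" in call.get("args", "") else None
```

In a real deployment the later tiers (LLM judge, human review) are slower and costlier, which is why the deterministic tier handles the bulk of traffic and the chain only escalates ambiguous calls.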
Theme: Red-teaming expands to templates, multimodal semantics, and stealthy weight attacks
- Why it matters: Alignment failures can be induced without long prompts—via chat-template mutations, benign images, or supply-chain weight edits that evade standard checks.
- Representative papers:
- TEMPLATEFUZZ: Fine-Grained Chat Template Fuzzing for Jailbreaking and Red Teaming LLMs
- Every Picture Tells a Dangerous Story: Memory-Augmented Multi-Agent Jailbreak Attacks on VLMs
- Compiling Activation Steering into Weights via Null-Space Constraints for Stealthy Backdoors
- Common approach:
- Treat system scaffolding as an attack surface (template elements; agent memory; toolchains).
- Use search/optimization (MCTS/evolution; fuzzing heuristics) plus scalable judging/oracles.
- Optimize for attack success + utility preservation (TemplateFuzz balances ASR and accuracy; STEEREDIT preserves URR).
- Open questions / failure modes:
- Transferability over time as models/templates change; real-world detectability and countermeasures.
- Query budget requirements (MemJack needs higher rounds for 90% ASR).
- Distribution shift breaking stealth constraints (STEEREDIT null-space estimated from finite benign set).
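To make "scaffolding as attack surface" concrete, here is a minimal element-level template-mutation loop. Everything here is illustrative: the template fields, the three mutation operators, and the `judge` callback are invented, and TemplateFuzz's actual mutators, search heuristics, and active-learning oracle are far more sophisticated.

```python
import random

# Hypothetical chat-template elements to mutate.
BASE_TEMPLATE = {"system_open": "<|system|>", "system_close": "<|end|>",
                 "user_open": "<|user|>", "role_order": ["system", "user"]}
MUTATIONS = {
    "swap_roles": lambda t: t.update(role_order=t["role_order"][::-1]),
    "drop_close": lambda t: t.update(system_close=""),
    "dup_system": lambda t: t.update(role_order=t["role_order"] + ["system"]),
}

def fuzz_templates(judge, budget=50, seed=0):
    """Randomly compose 1-2 element mutations per mutant; keep mutants
    the judge flags as eliciting a policy-violating completion."""
    rng = random.Random(seed)
    successes = []
    for _ in range(budget):
        mutant = dict(BASE_TEMPLATE)
        for name in rng.sample(list(MUTATIONS), k=rng.randint(1, 2)):
            MUTATIONS[name](mutant)
        if judge(mutant):
            successes.append(mutant)
    return successes
```

In the real setting, `judge` wraps a target-model query plus an outcome classifier, and the search is guided rather than uniform; the structural point survives the simplification: the attack input is the template, not the user prompt.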
Theme: Post-training stability: calibration, token credit assignment, distillation dynamics, and constraint fragility
- Why it matters: Training methods that improve accuracy can degrade calibration, stability, or robustness; instruction tuning can create brittle “helpfulness templates.”
- Representative papers:
- Calibration-Aware Policy Optimization for Reasoning LLMs
- Token-Level Policy Optimization: Linking Group-Level Rewards to Token-Level Aggregation via Sequence-Level Likelihood
- Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe
- One Token Away from Collapse: The Fragility of Instruction-Tuned Helpfulness
- Common approach:
- Replace reward-only surrogates with objectives aligned to desired properties (AUC-consistent surrogate in CAPO).
- Improve token-level learning signals (sequence-level likelihood aggregation; selective KL masking in TEPO).
- Diagnose training dynamics with internal metrics (overlap ratio, entropy gap) and mechanistic probes (two-pass recovery; layerwise probes).
- Open questions / failure modes:
- Generalization beyond math reasoning (CAPO/TEPO/OPD analyses are math-heavy).
- Long-horizon instability: OPD reward quality degrades with depth; unclear how to scale to very long traces.
- Evaluation blind spots: independent judging underestimates constraint-induced quality collapse vs pairwise comparisons.
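The calibration signal CAPO optimizes is a ranking-style AUC: does the model's stated confidence rank correct answers above incorrect ones? A minimal sketch of that diagnostic (the pairwise AUC computation only; CAPO's differentiable surrogate and bounds are not reproduced here):

```python
def calibration_auc(confidences, correct):
    """AUC of confidence as a ranker of correctness: the probability that
    a randomly chosen correct answer receives higher confidence than a
    randomly chosen incorrect one (ties count 0.5). 0.5 = uninformative."""
    pos = [c for c, y in zip(confidences, correct) if y]
    neg = [c for c, y in zip(confidences, correct) if not y]
    if not pos or not neg:
        return float("nan")  # undefined with only one class
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Tracking this quantity alongside accuracy during GRPO-style training is cheap, and a declining AUC at flat accuracy is exactly the "calibration collapse" failure mode this cluster flags.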
Theme: Memory & retrieval are moving from “raw chunks” to structured, query-aligned representations
- Why it matters: Context windows and top-K retrieval limit recall; long-horizon agents need reversible, discriminative memory and compact evidence units.
- Representative papers:
- Cooperative Memory Paging with Keyword Bookmarks for Long-Horizon LLM Conversations
- Thought-Retriever: Don't Just Retrieve Raw Data, Retrieve Thoughts for Memory-Augmented Agentic Systems
- Transforming External Knowledge into Triplets for Enhanced Retrieval in RAG of LLMs
- Common approach:
- Store compressed abstractions (thoughts; triplets; bookmarks) and retrieve under tight token budgets.
- Add filters/dedup (confidence + redundancy) and field-aware truncation (Tri-RAG).
- Evaluate on long-horizon or ultra-long-context tasks (AcademicEval; LoCoMo; multi-hop QA).
- Open questions / failure modes:
- Bookmark discrimination: recall is triggered reliably but correct page selection can be ~56% (paging bottleneck).
- Robustness at extreme scale and outside AI-paper domains (Thought-Retriever limitation).
- Triplet extraction faithfulness on narrative/implicit evidence.
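The bookmark-and-recall pattern can be sketched as a tiny page store. This is an assumption-laden toy (the paper's actual paging, bookmark generation, and recall tool are not specified here; keyword-overlap scoring is my stand-in), but it shows why "did it recall?" and "did it pick the right page?" are separate metrics.

```python
class BookmarkMemory:
    """Pages of conversation history indexed by keyword bookmarks;
    recall returns the page whose bookmarks best overlap the query."""
    def __init__(self):
        self.pages = []  # list of (keywords: set[str], text: str)

    def bookmark(self, keywords, text):
        self.pages.append((set(k.lower() for k in keywords), text))

    def recall(self, query):
        words = set(query.lower().split())
        # Score each page by keyword overlap with the query.
        scored = [(len(kw & words), text) for kw, text in self.pages]
        best_score, best_text = max(scored, default=(0, None))
        return best_text if best_score > 0 else None
```

The ~56% page-selection figure above lives entirely in the scoring step here: recall triggers whenever any keyword matches, but picking the single best page among near-duplicate bookmarks is where discrimination breaks down.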
3) Technical synthesis
- Production evaluation (AlphaEval) and benchmark governance (AISafetyBenchExplorer) converge on the same point: metric definitions and aggregation rules are part of the measurement instrument, and scaffold/evaluator choices can dominate conclusions.
- Several works independently adopt “separate the judge/guard from the actor”: WebAgentGuard (parallel guard), Parallax (process separation + validator tiers), Sentinel (world-state invariants), and COBALT-TLA (LLM + TLC oracle loop).
- A recurring pattern is boundedness + deterministic feedback to control LLM hallucination: TLC bounds (MaxTokens=3) in COBALT-TLA; Docker sandbox + rubric scripts in AlphaEval; read-only evaluators in Frontier-Eng.
- Safety evaluation is shifting from “does it refuse?” to trace-level and turn-level adjudication (AlphaEval traces; PhantomPolicy trace relabeling; CompliBench turn labels).
- Red-teaming is increasingly search-based (TemplateFuzz MCTS-like exploration; MemJack MCTS/evolution; Frontier-Eng generative optimization), suggesting defenses must assume adaptive attackers.
- Post-training methods are being redesigned around secondary properties beyond accuracy: CAPO optimizes relative calibration (AUC), TEPO targets stability/credit assignment, OPD targets overlap geometry, CWAC targets safety drift during fine-tuning.
- Multiple papers highlight evaluation blind spots: AlphaEval shows benchmark/production mismatch; One-Token-Away shows independent judging misses large quality drops; AISafetyBenchExplorer documents metric collisions.
- Memory/retrieval work is converging on structured intermediate artifacts (thoughts, triplets, bookmarks) rather than raw logs, but the key bottleneck becomes selection/discrimination rather than storage.
- Security threats span the full stack: templates → web pages → images → weights → data pipelines (TemplateFuzz, WebAgentGuard, MemJack, STEEREDIT, CoLA), implying “prompt safety” alone is insufficient.
- Formal methods are re-entering practical security via LLM-mediated interfaces (COBALT-TLA), but remain bounded/small-scope and abstraction-limited.
4) Top 5 papers (with “why now”)
1) AlphaEval: Evaluating Agents in Production
- Converts authentic partner requirements into 94 executable production tasks with multimodal inputs and multi-paradigm evaluation.
- Shows low absolute readiness (best 64.41/100) and that scaffolds can shift scores by ~11+ points, changing deployment decisions.
- Adds economic grounding (tasks map to ~2,420 professional hours valued at $154K–$231K).
- Skepticism: limited to seven companies/six domains and four scaffolds; snapshot may age quickly.
2) Policy-Invisible Violations in LLM-Based Agents
- Names a deployment-critical failure mode: violations depend on hidden world state, not visible content.
- PhantomPolicy shows models commit violations on 90–98% of risky cases under trace-level review.
- Sentinel demonstrates a concrete enforcement layer (graph fork→mutate→check) reaching 92.99% accuracy / 92.71 F1 with full coverage.
- Skepticism: guarantees are conditional on world-model completeness; Sentinel still misses violations (recall gaps) and doesn’t monitor plain-text outputs.
3) Parallax: Why AI Agents That Think Must Never Act
- Argues for architectural guarantees: reasoners cannot execute; executors cannot reason.
- OpenParallax blocks 98.9% of injected attacks by default and 100% in max-security mode under assume-compromise evaluation.
- Provides a tiered validator design (deterministic policy → classifier → LLM eval → human).
- Skepticism: strict mode has 36% false positives; engine is a single trusted base; rollback can’t undo external side effects.
4) TEMPLATEFUZZ: Fine-Grained Chat Template Fuzzing for Jailbreaking and Red Teaming LLMs
- Establishes chat templates as a first-class attack surface with element-level mutations and heuristic search.
- Reports ~98.2% Top-5 ASR on open models with ~1.1% accuracy degradation; transfers 80–100% Top-5 ASR to commercial models.
- Adds a scalable active-learning oracle to judge jailbreak outcomes cheaply.
- Skepticism: transferability may shift with template hardening/model updates; real-world detectability/countermeasures not fully quantified.
5) Frontier-Eng: Benchmarking Self-Evolving Agents on Real-World Engineering Tasks with Generative Optimization
- Reframes agent evaluation as budgeted iterative optimization with feasibility gating and frozen verifiers (47 tasks, five categories).
- Finds optimization dynamics: improvement frequency decays ~t⁻¹ and magnitude ~k⁻¹; depth beats width under fixed budgets.
- Provides actionable comparisons across models/search frameworks; claude-opus-4.6 leads (avg rank 3.18).
- Skepticism: average-rank metric discards magnitude; suite size/fidelity still limited.
5) Practical next steps
- If you ship agents: adopt a production-grounded eval harness (AlphaEval-style task packages + sandbox + rubric scripts) and explicitly measure scaffold sensitivity before attributing gains to model upgrades.
- For enterprise safety: prototype a world-state enforcement layer (Sentinel-like) that simulates tool-call mutations and returns Allow/Block/Clarify; track coverage gaps as a first-class metric.
- For agent execution security: run an assume-compromise test (inject tool calls directly at the execution boundary) to validate that safety doesn’t depend on model refusals (Parallax methodology).
- For web agents: consider a parallel multimodal guard gating actions; evaluate out-of-domain attacks (PopUp/VPI/EIA) and measure latency under parallel execution (WebAgentGuard).
- For red-teaming: add template fuzzing and multimodal semantic jailbreak suites to your CI; treat “chat template” and “rendered page content” as adversarial inputs, not trusted formatting.
- For post-training: when using GRPO-like RL, track calibration (AUC) alongside accuracy; consider CAPO-style objectives if AUC degrades during training.
- For long-horizon systems: prefer reversible memory (bookmark+recall) and measure page-selection accuracy separately from “did it retrieve”; invest in bookmark discriminability.
- For supply-chain risk: include checks for stealthy weight edits (triggered behavior with low clean leakage) and evaluate under distribution shift, since null-space stealth depends on the benign reference set.
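The assume-compromise test recommended above reduces to a simple harness: feed attacker-chosen tool calls straight to the execution boundary, bypassing the model, and measure the block rate. The executor, allow-list, and call schema below are hypothetical, only the methodology follows the Parallax bullet.

```python
def assume_compromise_rate(executor, attacks):
    """Bypass the model entirely: submit attacker-chosen tool calls
    directly at the execution boundary and return the fraction blocked.
    Safety that lives only in model refusals scores ~0 here."""
    blocked = sum(1 for call in attacks if executor(call) == "block")
    return blocked / len(attacks)

# Illustrative executor with a deny-by-default allow-list of (tool, scope):
ALLOWED = {("read_file", "workspace"), ("search", "web")}
def allowlist_executor(call):
    return "run" if (call["tool"], call["scope"]) in ALLOWED else "block"
```

Running this periodically in CI, with the attack set drawn from your actual tool inventory, gives a regression signal that is independent of the model's alignment state.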
Generated from per-paper analyses; no external browsing.
