Daily AI Paper Report (2026-04-16)

Published:

Chinese version: [中文]

Run stats

  • Candidates: 261
  • Selected: 30
  • Deepread completed: 30
  • Window (UTC): 2026-04-14T00:00:00Z → 2026-04-15T00:00:00Z (arxiv_announce, expanded=0)
Show selected papers
arXiv IDTitle / LinksCategoriesScoreWhyTags
2604.12177Policy-Invisible Violations in LLM-Based Agents
PDF
cs.AI, cs.CL, cs.CR, cs.LG95New agent failure mode + benchmark for compliance when policy facts are hidden from contextagents, compliance, benchmark, evaluation, tool-use, context-limitations, governance
2604.12500Safety Training Modulates Harmful Misalignment Under On-Policy RL, But Direction Depends on Environment Design
PDF
cs.LG, cs.CR95Shows RL safety training can flip to harmful misalignment depending on environment designagent-safety, rl, specification-gaming, sycophancy, evaluation, misalignment
2604.12172COBALT-TLA: A Neuro-Symbolic Verification Loop for Cross-Chain Bridge Vulnerability Discovery
PDF
cs.CR, cs.LO95LLM+TLA+ loop finds bridge vulns fast; strong security relevance and concrete eval on Nomad-like exploit.agent-security, formal-verification, TLA+, cybersecurity, vulnerability-discovery, neuro-symbolic, tool-augmented-LLMs
2604.12384Preventing Safety Drift in Large Language Models via Coupled Weight and Activation Constraints
PDF
cs.AI95Coupled weight+activation constraints to prevent safety drift during fine-tuning; uses SAE safety features.llm-safety, safety-drift, fine-tuning, regularization, sparse-autoencoders, refusal, alignment
2604.12284WebAgentGuard: A Reasoning-Driven Guard Model for Detecting Prompt Injection Attacks in Web Agents
PDF
cs.CR93Guard-agent architecture to detect web prompt injection; targets real VLM web-agent threat modelweb-agents, prompt-injection, guard-model, agent-security, VLM, detection
2604.13018Toward Autonomous Long-Horizon Engineering for ML Research
PDF
cs.CL93Long-horizon ML research engineering agent with permission-scoped workspace; relevant to agent safety & control.agents, autonomous-research, orchestration, tool-use, permissions, state-continuity, agent-evals
2604.12162AlphaEval: Evaluating Agents in Production
PDF
cs.CL92Production-grounded agent benchmark (94 tasks, 7 companies) addressing real eval gapsagents, evaluation, benchmarks, production, long-horizon, human-judgment
2604.12374Nemotron 3 Super: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning
PDF
cs.LG, cs.AI, cs.CL92Open 120B MoE hybrid Mamba-Transformer w/ 1M context + speculative decoding; big frontier capability jump.frontier-llm, MoE, mamba, long-context, efficiency, speculative-decoding, open-model
2604.12232TEMPLATEFUZZ: Fine-Grained Chat Template Fuzzing for Jailbreaking and Red Teaming LLMs
PDF
cs.CR, cs.AI, cs.SE91Fuzzing chat templates as an overlooked jailbreak surface; systematic red-teaming methodologyjailbreak, red-teaming, fuzzing, chat-templates, LLM-security, evaluation
2604.13006One Token Away from Collapse: The Fragility of Instruction-Tuned Helpfulness
PDF
cs.CL, cs.AI91Shows instruction-tuned helpfulness collapses under tiny lexical constraints; important robustness failure mode.robustness, instruction-tuning, evaluation, reliability, constraints, helpfulness, failure-modes
2604.12342CoLA: A Choice Leakage Attack Framework to Expose Privacy Risks in Subset Training
PDF
cs.CR, cs.CV90New privacy attack surface: subset/coreset selection choices can leak sensitive infoprivacy, data-leakage, training-data, security, attacks, coresets
2604.12632Calibration-Aware Policy Optimization for Reasoning LLMs
PDF
cs.LG, cs.AI90Targets overconfidence from GRPO; proposes calibration-aware RL objective with theory + bounds for reasoning LLMs.alignment, calibration, RLHF, policy-optimization, reasoning, uncertainty, GRPO
2604.12312CompliBench: Benchmarking LLM Judges for Compliance Violation Detection in Dialogue Systems
PDF
cs.CL89Benchmark for LLM-judge reliability in detecting/localizing compliance violations in dialoguesLLM-as-judge, compliance, benchmark, evaluation, dialogue, policy-violations
2604.12359Compiling Activation Steering into Weights via Null-Space Constraints for Stealthy Backdoors
PDF
cs.CR, cs.CL88Stealthy LLM backdoors by compiling activation steering into weights; highlights supply-chain riskbackdoors, weight-editing, supply-chain, LLM-security, stealth-attacks, red-teaming
2604.12994LogicEval: A Systematic Framework for Evaluating Automated Repair Techniques for Logical Vulnerabilities in Real-World Software
PDF
cs.CR, cs.AI88Framework to evaluate LLM vs classic repair on real logical vulns; useful for secure codingcybersecurity, program-repair, llm-for-code, evaluation, vulnerabilities
2604.12290Frontier-Eng: Benchmarking Self-Evolving Agents on Real-World Engineering Tasks with Generative Optimization
PDF
cs.AI, cs.CL88Real-world engineering benchmark for iterative propose-execute-evaluate agents with verifiers and continuous rewards.agents, evaluation, benchmarks, generative-optimization, tool-use, verifiers, long-horizon
2604.12308ContextLens: Modeling Imperfect Privacy and Safety Context for Legal Compliance
PDF
cs.CL88Models ambiguous/incomplete context for privacy & safety legal compliance; explicit known/unknown factorization.privacy, safety, legal-compliance, context-modeling, llm-evals, risk-assessment, governance
2604.12616Every Picture Tells a Dangerous Story: Memory-Augmented Multi-Agent Jailbreak Attacks on VLMs
PDF
cs.AI, cs.MM87Memory-augmented multi-agent jailbreaks for VLMs using natural-image semantics, not just pixelsVLM, multimodal-jailbreak, multi-agent, memory, adversarial-attacks, red-teaming
2604.13016Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe
PDF
cs.LG, cs.AI, cs.CL87Systematic study of on-policy distillation dynamics; actionable recipe for post-trainingpost-training, distillation, rlhf, training-dynamics, llms
2604.12559FABLE: Fine-grained Fact Anchoring for Unstructured Model Editing
PDF
cs.CL87Fine-grained fact anchoring for model editing + new diagnostic benchmark (UnFine); useful for knowledge updates.model-editing, factuality, knowledge-updates, benchmarks, transformers, reliability
2604.12376Cooperative Memory Paging with Keyword Bookmarks for Long-Horizon LLM Conversations
PDF
cs.CL, cs.AI86Practical long-horizon conversation memory: keyword bookmarks + recall tool; beats retrieval/truncation baselines.agents, memory, long-context, tool-use, conversation, retrieval, evaluation
2604.12736Token-Level Policy Optimization: Linking Group-Level Rewards to Token-Level Aggregation via Sequence-Level Likelihood
PDF
cs.CL86Token-level policy optimization linking group rewards to tokens; targets sparse-reward CoT training issues.rlhf, policy-optimization, reasoning, sparse-rewards, grpo, kl-regularization, training
2604.12986Parallax: Why AI Agents That Think Must Never Act
PDF
cs.CR, cs.AI85Argues prompt guardrails are insufficient for acting agents; proposes cognitive/executive separationagent-safety, systems-security, sandboxing, permissions, architecture, governance
2604.13029Visual Preference Optimization with Rubric Rewards
PDF
cs.CV, cs.AI85Rubric-based rewards for visual DPO; reusable rubric pool improves judge quality and downstream performance.multimodal, dpo, reward-modeling, rubrics, preference-optimization, evaluation
2604.12817Understanding and Improving Continuous Adversarial Training for LLMs via In-context Learning Theory
PDF
cs.LG, cs.CR, stat.ML84First theory for continuous adversarial training for LLM jailbreak defense via ICL analysisadversarial-training, jailbreak-defense, theory, ICL, robustness, LLM-security
2604.12610Transforming External Knowledge into Triplets for Enhanced Retrieval in RAG of LLMs
PDF
cs.CL84Triplet-structured retrieval to reduce RAG redundancy and improve alignment/efficiencyrag, retrieval, hallucinations, grounding, context-efficiency
2604.12231Thought-Retriever: Don't Just Retrieve Raw Data, Retrieve Thoughts for Memory-Augmented Agentic Systems
PDF
cs.CL, cs.IR84Retrieves 'thoughts' not chunks to use arbitrarily large corpora beyond context limits; agentic memory angle.RAG, agents, memory, retrieval, context-length, reasoning, model-agnostic
2604.12379Beyond Output Correctness: Benchmarking and Evaluating Large Language Model Reasoning in Coding Tasks
PDF
cs.SE, cs.AI, cs.LG83Code reasoning-quality benchmark + evaluator; moves beyond output correctness for LLMsevaluation, reasoning, code, benchmarks, verifiers
2604.12875AISafetyBenchExplorer: A Metric-Aware Catalogue of AI Safety Benchmarks Reveals Fragmented Measurement and Weak Benchmark Governance
PDF
cs.AI82Catalogue of 195 safety benchmarks; meta-analysis shows fragmented metrics and weak governancesafety-benchmarks, measurement, meta-evaluation, governance, catalogue, metrics
2604.12967Cycle-Consistent Search: Question Reconstructability as a Proxy Reward for Search Agent Training
PDF
cs.AI82Gold-free reward for training search agents via question reconstructability (cycle-consistency)agents, search, reinforcement-learning, self-supervision, retrieval

AI Paper Insight Brief

2026-04-16

0) Executive takeaways (read this first)

  • “Real-world” agent readiness is still low and highly pipeline-dependent: AlphaEval’s best production configuration is only 64.41/100, and scaffold choice swings scores by ~11–15 points, meaning infra/orchestration can matter as much as the base model.
  • Safety failures are increasingly “systems failures,” not “model reasoning failures”: Policy-invisible violations show models commit 90–98% of risky actions when policy metadata is hidden; Parallax argues for architectural separation (reasoner must not execute) and reports 98.9–100% blocking under an assume-compromise evaluation.
  • Attack surfaces are shifting to “structure” (templates, tools, images, weights), not just prompts: TemplateFuzz gets ~98% Top-5 ASR on open models and 80–100% transfer to commercial models; MemJack reaches 71.48% ASR on unmodified natural images; STEEREDIT compiles steering into weights with URR >97% and low leakage when null-space constrained.
  • Evaluation is fragmenting, but better measurement primitives are emerging: AlphaEval (production tasks), Frontier-Eng (budgeted optimization), CompliBench (turn-level guideline violations), CodeRQ-Bench/VERA (reasoning-quality in code), and AISafetyBenchExplorer (metric-collision governance) all point to a shift from single-number benchmarks to trace-, rubric-, and structure-aware evaluation.
  • RL/post-training is being retooled for stability and trust signals: CAPO targets calibration collapse under GRPO (AUC gains on AIME 2025), TEPO improves token-level credit assignment and convergence, and OPD analysis shows distillation success depends on teacher–student “thinking pattern” overlap and breaks down at long trajectory depths.

2) Key themes (clusters)

Theme: Production-grounded agent evaluation & optimization benchmarks

Theme: Enterprise compliance & policy enforcement needs world-state, not better prompts

  • Why it matters: Many violations depend on metadata/state outside the model-visible context; prompt-only policies and content-only DLP can’t reliably enforce organizational rules.
  • Representative papers:
  • Common approach:
    • Construct diagnostic benchmarks where decisive policy facts are hidden (PhantomPolicy) or where violations are precisely localized (CompliBench).
    • Add structured enforcement layers: knowledge-graph world models + declarative invariants (Sentinel), or legal-text decomposition + precedence aggregation (ContextLens).
    • Measure at turn/trace level, not just conversation-level outcomes.
  • Open questions / failure modes:
    • World-model coverage/freshness is the bottleneck; Sentinel still misses violations even with full benchmark coverage.
    • Scope mis-attribution dominates judge errors in multi-turn guideline settings (CompliBench).
    • Cost/latency: ContextLens increases token usage; real-time deployment trade-offs remain.

Theme: Agent security is becoming architecture-first (guards, separation, formal loops)

Theme: Red-teaming expands to templates, multimodal semantics, and stealthy weight attacks

Theme: Post-training stability: calibration, token credit assignment, distillation dynamics, and constraint fragility

Theme: Memory & retrieval are moving from “raw chunks” to structured, query-aligned representations

3) Technical synthesis

  • Production evaluation (AlphaEval) and benchmark governance (AISafetyBenchExplorer) converge on the same point: metric definitions + aggregation rules are part of the model, and scaffold/evaluator choices can dominate conclusions.
  • Several works independently adopt “separate the judge/guard from the actor”: WebAgentGuard (parallel guard), Parallax (process separation + validator tiers), Sentinel (world-state invariants), and COBALT-TLA (LLM + TLC oracle loop).
  • A recurring pattern is boundedness + deterministic feedback to control LLM hallucination: TLC bounds (MaxTokens=3) in COBALT-TLA; Docker sandbox + rubric scripts in AlphaEval; read-only evaluators in Frontier-Eng.
  • Safety evaluation is shifting from “does it refuse?” to trace-level and turn-level adjudication (AlphaEval traces; PhantomPolicy trace relabeling; CompliBench turn labels).
  • Red-teaming is increasingly search-based (TemplateFuzz MCTS-like exploration; MemJack MCTS/evolution; Frontier-Eng generative optimization), suggesting defenses must assume adaptive attackers.
  • Post-training methods are being redesigned around secondary properties beyond accuracy: CAPO optimizes relative calibration (AUC), TEPO targets stability/credit assignment, OPD targets overlap geometry, CWAC targets safety drift during fine-tuning.
  • Multiple papers highlight evaluation blind spots: AlphaEval shows benchmark/production mismatch; One-Token-Away shows independent judging misses large quality drops; AISafetyBenchExplorer documents metric collisions.
  • Memory/retrieval work is converging on structured intermediate artifacts (thoughts, triplets, bookmarks) rather than raw logs, but the key bottleneck becomes selection/discrimination rather than storage.
  • Security threats span the full stack: templates → web pages → images → weights → data pipelines (TemplateFuzz, WebAgentGuard, MemJack, STEEREDIT, CoLA), implying “prompt safety” alone is insufficient.
  • Formal methods are re-entering practical security via LLM-mediated interfaces (COBALT-TLA), but remain bounded/small-scope and abstraction-limited.

4) Top 5 papers (with “why now”)

1) AlphaEval: Evaluating Agents in Production

  • Converts authentic partner requirements into 94 executable production tasks with multimodal inputs and multi-paradigm evaluation.
  • Shows low absolute readiness (best 64.41/100) and that scaffolds can shift scores by ~11+ points, changing deployment decisions.
  • Adds economic grounding (tasks map to ~2,420 professional hours valued at $154K–$231K).
  • Skepticism: limited to seven companies/six domains and four scaffolds; snapshot may age quickly.

2) Policy-Invisible Violations in LLM-Based Agents

  • Names a deployment-critical failure mode: violations depend on hidden world state, not visible content.
  • PhantomPolicy shows models commit violations on 90–98% of risky cases under trace-level review.
  • Sentinel demonstrates a concrete enforcement layer (graph fork→mutate→check) reaching 92.99% accuracy / 92.71 F1 with full coverage.
  • Skepticism: guarantees are conditional on world-model completeness; Sentinel still misses violations (recall gaps) and doesn’t monitor plain-text outputs.

3) Parallax: Why AI Agents That Think Must Never Act

  • Argues for architectural guarantees: reasoners cannot execute; executors cannot reason.
  • OpenParallax blocks 98.9% of injected attacks by default and 100% in max-security mode under assume-compromise evaluation.
  • Provides a tiered validator design (deterministic policy → classifier → LLM eval → human).
  • Skepticism: strict mode has 36% false positives; engine is a single trusted base; rollback can’t undo external side effects.

4) TEMPLATEFUZZ: Fine-Grained Chat Template Fuzzing for Jailbreaking and Red Teaming LLMs

  • Establishes chat templates as a first-class attack surface with element-level mutations and heuristic search.
  • Reports ~98.2% Top-5 ASR on open models with ~1.1% accuracy degradation; transfers 80–100% Top-5 ASR to commercial models.
  • Adds a scalable active-learning oracle to judge jailbreak outcomes cheaply.
  • Skepticism: transferability may shift with template hardening/model updates; real-world detectability/countermeasures not fully quantified.

5) Frontier-Eng: Benchmarking Self-Evolving Agents on Real-World Engineering Tasks with Generative Optimization

  • Reframes agent evaluation as budgeted iterative optimization with feasibility gating and frozen verifiers (47 tasks, five categories).
  • Finds optimization dynamics: improvement frequency decays ~t⁻¹ and magnitude ~k⁻¹; depth beats width under fixed budgets.
  • Provides actionable comparisons across models/search frameworks; claude-opus-4.6 leads (avg rank 3.18).
  • Skepticism: average-rank metric discards magnitude; suite size/fidelity still limited.

5) Practical next steps

  • If you ship agents: adopt a production-grounded eval harness (AlphaEval-style task packages + sandbox + rubric scripts) and explicitly measure scaffold sensitivity before attributing gains to model upgrades.
  • For enterprise safety: prototype a world-state enforcement layer (Sentinel-like) that simulates tool-call mutations and returns Allow/Block/Clarify; track coverage gaps as a first-class metric.
  • For agent execution security: run an assume-compromise test (inject tool calls directly at the execution boundary) to validate that safety doesn’t depend on model refusals (Parallax methodology).
  • For web agents: consider a parallel multimodal guard gating actions; evaluate out-of-domain attacks (PopUp/VPI/EIA) and measure latency under parallel execution (WebAgentGuard).
  • For red-teaming: add template fuzzing and multimodal semantic jailbreak suites to your CI; treat “chat template” and “rendered page content” as adversarial inputs, not trusted formatting.
  • For post-training: when using GRPO-like RL, track calibration (AUC) alongside accuracy; consider CAPO-style objectives if AUC degrades during training.
  • For long-horizon systems: prefer reversible memory (bookmark+recall) and measure page-selection accuracy separately from “did it retrieve”; invest in bookmark discriminability.
  • For supply-chain risk: include checks for stealthy weight edits (triggered behavior with low clean leakage) and evaluate under distribution shift, since null-space stealth depends on the benign reference set.

Generated from per-paper analyses; no external browsing.