Daily AI Paper Report (2026-04-16)
Run stats
- Candidates: 261
- Selected: 30
- Deepread completed: 30
- Window (UTC): 2026-04-14T00:00:00Z → 2026-04-15T00:00:00Z (arxiv_announce, expanded=0)
Selected papers
| arXiv ID | Title / Links | Categories | Score | Why | Tags |
|---|---|---|---|---|---|
| 2604.12177 | Policy-Invisible Violations in LLM-Based Agents | cs.AI, cs.CL, cs.CR, cs.LG | 95 | New agent failure mode + benchmark for compliance when policy facts are hidden from context | agents, compliance, benchmark, evaluation, tool-use, context-limitations, governance |
| 2604.12500 | Safety Training Modulates Harmful Misalignment Under On-Policy RL, But Direction Depends on Environment Design | cs.LG, cs.CR | 95 | Shows RL safety training can flip to harmful misalignment depending on environment design | agent-safety, rl, specification-gaming, sycophancy, evaluation, misalignment |
| 2604.12172 | COBALT-TLA: A Neuro-Symbolic Verification Loop for Cross-Chain Bridge Vulnerability Discovery | cs.CR, cs.LO | 95 | LLM+TLA+ loop finds bridge vulns fast; strong security relevance and concrete eval on Nomad-like exploit. | agent-security, formal-verification, TLA+, cybersecurity, vulnerability-discovery, neuro-symbolic, tool-augmented-LLMs |
| 2604.12384 | Preventing Safety Drift in Large Language Models via Coupled Weight and Activation Constraints | cs.AI | 95 | Coupled weight+activation constraints to prevent safety drift during fine-tuning; uses SAE safety features. | llm-safety, safety-drift, fine-tuning, regularization, sparse-autoencoders, refusal, alignment |
| 2604.12284 | WebAgentGuard: A Reasoning-Driven Guard Model for Detecting Prompt Injection Attacks in Web Agents | cs.CR | 93 | Guard-agent architecture to detect web prompt injection; targets real VLM web-agent threat model | web-agents, prompt-injection, guard-model, agent-security, VLM, detection |
| 2604.13018 | Toward Autonomous Long-Horizon Engineering for ML Research | cs.CL | 93 | Long-horizon ML research engineering agent with permission-scoped workspace; relevant to agent safety & control. | agents, autonomous-research, orchestration, tool-use, permissions, state-continuity, agent-evals |
| 2604.12162 | AlphaEval: Evaluating Agents in Production | cs.CL | 92 | Production-grounded agent benchmark (94 tasks, 7 companies) addressing real eval gaps | agents, evaluation, benchmarks, production, long-horizon, human-judgment |
| 2604.12374 | Nemotron 3 Super: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning | cs.LG, cs.AI, cs.CL | 92 | Open 120B MoE hybrid Mamba-Transformer w/ 1M context + speculative decoding; big frontier capability jump. | frontier-llm, MoE, mamba, long-context, efficiency, speculative-decoding, open-model |
| 2604.12232 | TEMPLATEFUZZ: Fine-Grained Chat Template Fuzzing for Jailbreaking and Red Teaming LLMs | cs.CR, cs.AI, cs.SE | 91 | Fuzzing chat templates as an overlooked jailbreak surface; systematic red-teaming methodology | jailbreak, red-teaming, fuzzing, chat-templates, LLM-security, evaluation |
| 2604.13006 | One Token Away from Collapse: The Fragility of Instruction-Tuned Helpfulness | cs.CL, cs.AI | 91 | Shows instruction-tuned helpfulness collapses under tiny lexical constraints; important robustness failure mode. | robustness, instruction-tuning, evaluation, reliability, constraints, helpfulness, failure-modes |
| 2604.12342 | CoLA: A Choice Leakage Attack Framework to Expose Privacy Risks in Subset Training | cs.CR, cs.CV | 90 | New privacy attack surface: subset/coreset selection choices can leak sensitive info | privacy, data-leakage, training-data, security, attacks, coresets |
| 2604.12632 | Calibration-Aware Policy Optimization for Reasoning LLMs | cs.LG, cs.AI | 90 | Targets overconfidence from GRPO; proposes calibration-aware RL objective with theory + bounds for reasoning LLMs. | alignment, calibration, RLHF, policy-optimization, reasoning, uncertainty, GRPO |
| 2604.12312 | CompliBench: Benchmarking LLM Judges for Compliance Violation Detection in Dialogue Systems | cs.CL | 89 | Benchmark for LLM-judge reliability in detecting/localizing compliance violations in dialogues | LLM-as-judge, compliance, benchmark, evaluation, dialogue, policy-violations |
| 2604.12359 | Compiling Activation Steering into Weights via Null-Space Constraints for Stealthy Backdoors | cs.CR, cs.CL | 88 | Stealthy LLM backdoors by compiling activation steering into weights; highlights supply-chain risk | backdoors, weight-editing, supply-chain, LLM-security, stealth-attacks, red-teaming |
| 2604.12994 | LogicEval: A Systematic Framework for Evaluating Automated Repair Techniques for Logical Vulnerabilities in Real-World Software | cs.CR, cs.AI | 88 | Framework to evaluate LLM vs classic repair on real logical vulns; useful for secure coding | cybersecurity, program-repair, llm-for-code, evaluation, vulnerabilities |
| 2604.12290 | Frontier-Eng: Benchmarking Self-Evolving Agents on Real-World Engineering Tasks with Generative Optimization | cs.AI, cs.CL | 88 | Real-world engineering benchmark for iterative propose-execute-evaluate agents with verifiers and continuous rewards. | agents, evaluation, benchmarks, generative-optimization, tool-use, verifiers, long-horizon |
| 2604.12308 | ContextLens: Modeling Imperfect Privacy and Safety Context for Legal Compliance | cs.CL | 88 | Models ambiguous/incomplete context for privacy & safety legal compliance; explicit known/unknown factorization. | privacy, safety, legal-compliance, context-modeling, llm-evals, risk-assessment, governance |
| 2604.12616 | Every Picture Tells a Dangerous Story: Memory-Augmented Multi-Agent Jailbreak Attacks on VLMs | cs.AI, cs.MM | 87 | Memory-augmented multi-agent jailbreaks for VLMs using natural-image semantics, not just pixels | VLM, multimodal-jailbreak, multi-agent, memory, adversarial-attacks, red-teaming |
| 2604.13016 | Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe | cs.LG, cs.AI, cs.CL | 87 | Systematic study of on-policy distillation dynamics; actionable recipe for post-training | post-training, distillation, rlhf, training-dynamics, llms |
| 2604.12559 | FABLE: Fine-grained Fact Anchoring for Unstructured Model Editing | cs.CL | 87 | Fine-grained fact anchoring for model editing + new diagnostic benchmark (UnFine); useful for knowledge updates. | model-editing, factuality, knowledge-updates, benchmarks, transformers, reliability |
| 2604.12376 | Cooperative Memory Paging with Keyword Bookmarks for Long-Horizon LLM Conversations | cs.CL, cs.AI | 86 | Practical long-horizon conversation memory: keyword bookmarks + recall tool; beats retrieval/truncation baselines. | agents, memory, long-context, tool-use, conversation, retrieval, evaluation |
| 2604.12736 | Token-Level Policy Optimization: Linking Group-Level Rewards to Token-Level Aggregation via Sequence-Level Likelihood | cs.CL | 86 | Token-level policy optimization linking group rewards to tokens; targets sparse-reward CoT training issues. | rlhf, policy-optimization, reasoning, sparse-rewards, grpo, kl-regularization, training |
| 2604.12986 | Parallax: Why AI Agents That Think Must Never Act | cs.CR, cs.AI | 85 | Argues prompt guardrails are insufficient for acting agents; proposes cognitive/executive separation | agent-safety, systems-security, sandboxing, permissions, architecture, governance |
| 2604.13029 | Visual Preference Optimization with Rubric Rewards | cs.CV, cs.AI | 85 | Rubric-based rewards for visual DPO; reusable rubric pool improves judge quality and downstream performance. | multimodal, dpo, reward-modeling, rubrics, preference-optimization, evaluation |
| 2604.12817 | Understanding and Improving Continuous Adversarial Training for LLMs via In-context Learning Theory | cs.LG, cs.CR, stat.ML | 84 | First theory for continuous adversarial training for LLM jailbreak defense via ICL analysis | adversarial-training, jailbreak-defense, theory, ICL, robustness, LLM-security |
| 2604.12610 | Transforming External Knowledge into Triplets for Enhanced Retrieval in RAG of LLMs | cs.CL | 84 | Triplet-structured retrieval to reduce RAG redundancy and improve alignment/efficiency | rag, retrieval, hallucinations, grounding, context-efficiency |
| 2604.12231 | Thought-Retriever: Don't Just Retrieve Raw Data, Retrieve Thoughts for Memory-Augmented Agentic Systems | cs.CL, cs.IR | 84 | Retrieves 'thoughts' not chunks to use arbitrarily large corpora beyond context limits; agentic memory angle. | RAG, agents, memory, retrieval, context-length, reasoning, model-agnostic |
| 2604.12379 | Beyond Output Correctness: Benchmarking and Evaluating Large Language Model Reasoning in Coding Tasks | cs.SE, cs.AI, cs.LG | 83 | Code reasoning-quality benchmark + evaluator; moves beyond output correctness for LLMs | evaluation, reasoning, code, benchmarks, verifiers |
| 2604.12875 | AISafetyBenchExplorer: A Metric-Aware Catalogue of AI Safety Benchmarks Reveals Fragmented Measurement and Weak Benchmark Governance | cs.AI | 82 | Catalogue of 195 safety benchmarks; meta-analysis shows fragmented metrics and weak governance | safety-benchmarks, measurement, meta-evaluation, governance, catalogue, metrics |
| 2604.12967 | Cycle-Consistent Search: Question Reconstructability as a Proxy Reward for Search Agent Training | cs.AI | 82 | Gold-free reward for training search agents via question reconstructability (cycle-consistency) | agents, search, reinforcement-learning, self-supervision, retrieval |
AI Paper Insight Brief
2026-04-16
1) Executive takeaways (read this first)
- “Real-world” agent readiness is still low and highly pipeline-dependent: AlphaEval’s best production configuration is only 64.41/100, and scaffold choice swings scores by ~11–15 points, meaning infra/orchestration can matter as much as the base model.
- Safety failures are increasingly “systems failures,” not “model reasoning failures”: Policy-invisible violations show models commit 90–98% of risky actions when policy metadata is hidden; Parallax argues for architectural separation (reasoner must not execute) and reports 98.9–100% blocking under an assume-compromise evaluation.
- Attack surfaces are shifting to “structure” (templates, tools, images, weights), not just prompts: TemplateFuzz gets ~98% Top-5 ASR on open models and 80–100% transfer to commercial models; MemJack reaches 71.48% ASR on unmodified natural images; STEEREDIT compiles steering into weights with URR >97% and low leakage when null-space constrained.
- Evaluation is fragmenting, but better measurement primitives are emerging: AlphaEval (production tasks), Frontier-Eng (budgeted optimization), CompliBench (turn-level guideline violations), CodeRQ-Bench/VERA (reasoning-quality in code), and AISafetyBenchExplorer (metric-collision governance) all point to a shift from single-number benchmarks to trace-, rubric-, and structure-aware evaluation.
- RL/post-training is being retooled for stability and trust signals: CAPO targets calibration collapse under GRPO (AUC gains on AIME 2025), TEPO improves token-level credit assignment and convergence, and OPD analysis shows distillation success depends on teacher–student “thinking pattern” overlap and breaks down at long trajectory depths.
2) Key themes (clusters)
Theme: Production-grounded agent evaluation & optimization benchmarks
- Why it matters: Benchmarks that don’t reflect under-specification, multimodality, long-horizon deliverables, and subjective stakeholder judgment can’t predict deployment value; optimization-style tasks better match real engineering work.
- Representative papers:
- AlphaEval: Evaluating Agents in Production
- Frontier-Eng: Benchmarking Self-Evolving Agents on Real-World Engineering Tasks with Generative Optimization
- AISafetyBenchExplorer: A Metric-Aware Catalogue of AI Safety Benchmarks Reveals Fragmented Measurement and Weak Benchmark Governance
- Common approach:
- Build tasks from authentic requirements (partners / real workflows) or executable verifiers (frozen evaluators, sandboxing).
- Use multi-paradigm evaluation (rubrics + execution + formal checks) and record traces for failure analysis.
- Prefer budgeted iterative improvement metrics (rank/profiles) over binary pass/fail.
- Open questions / failure modes:
- How to keep benchmarks longitudinally valid as partner standards and models evolve (AlphaEval snapshot limitation).
- Metric comparability: “accuracy/F1/safety score” collisions across benchmarks (AISafetyBenchExplorer).
- Preventing evaluator gaming while still allowing rich, subjective quality criteria.
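The budgeted rank-style scoring mentioned above can be made concrete. This is a minimal sketch (the input schema and tie-handling are illustrative assumptions, not Frontier-Eng's actual aggregation): rank models per task by their best score within budget, give ties the mean of their ranks, then average across tasks.

```python
from collections import defaultdict

def average_ranks(scores_by_task):
    """scores_by_task: {task: {model: best score achieved within budget}}.
    Rank models per task (1 = best; ties share the mean rank), then
    average each model's rank across tasks. Lower is better."""
    totals, counts = defaultdict(float), defaultdict(int)
    for task_scores in scores_by_task.values():
        ordered = sorted(task_scores.items(), key=lambda kv: -kv[1])
        i = 0
        while i < len(ordered):
            # Extend j over the run of models tied with position i.
            j = i
            while j + 1 < len(ordered) and ordered[j + 1][1] == ordered[i][1]:
                j += 1
            mean_rank = (i + 1 + j + 1) / 2  # mean of ranks i+1 .. j+1
            for model, _ in ordered[i : j + 1]:
                totals[model] += mean_rank
                counts[model] += 1
            i = j + 1
    return {m: totals[m] / counts[m] for m in totals}
```

Note how this metric discards improvement magnitude by construction, which is exactly the skepticism raised about Frontier-Eng's average-rank reporting below.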
Theme: Enterprise compliance & policy enforcement needs world-state, not better prompts
- Why it matters: Many violations depend on metadata/state outside the model-visible context; prompt-only policies and content-only DLP can’t reliably enforce organizational rules.
- Representative papers:
- Policy-Invisible Violations in LLM-Based Agents
- CompliBench: Benchmarking LLM Judges for Compliance Violation Detection in Dialogue Systems
- ContextLens: Modeling Imperfect Privacy and Safety Context for Legal Compliance
- Common approach:
- Construct diagnostic benchmarks where decisive policy facts are hidden (PhantomPolicy) or where violations are precisely localized (CompliBench).
- Add structured enforcement layers: knowledge-graph world models + declarative invariants (Sentinel), or legal-text decomposition + precedence aggregation (ContextLens).
- Measure at turn/trace level, not just conversation-level outcomes.
- Open questions / failure modes:
- World-model coverage/freshness is the bottleneck; Sentinel still misses violations even with full benchmark coverage.
- Scope mis-attribution dominates judge errors in multi-turn guideline settings (CompliBench).
- Cost/latency: ContextLens increases token usage; real-time deployment trade-offs remain.
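The fork→mutate→check enforcement pattern above can be sketched in a few lines. This is a toy illustration, not Sentinel's actual design: the world-model schema, the invariant, and the "unknown fact" convention (a `None` value) are all invented here.

```python
import copy

def check_action(world, action, invariants):
    """Fork the world model, apply the proposed tool call to the copy,
    and test declarative invariants against the mutated state.
    Returns (verdict, violated_invariant_names)."""
    forked = copy.deepcopy(world)   # fork: never touch the live state
    action(forked)                  # mutate the copy only
    violated = [name for name, inv in invariants.items() if not inv(forked)]
    if violated:
        return ("Block", violated)
    # Unknown facts (None) force a clarification instead of silent approval.
    if any(v is None for v in forked.values()):
        return ("Clarify", [])
    return ("Allow", [])
```

The point of the fork is that decisive policy facts live in the world model, not in the model-visible context, so the verdict does not depend on what the LLM happened to be told.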
Theme: Agent security is becoming architecture-first (guards, separation, formal loops)
- Why it matters: Tool-using agents can cause irreversible harm; defenses inside the same reasoning substrate are brittle under prompt injection and context manipulation.
- Representative papers:
- WebAgentGuard: A Reasoning-Driven Guard Model for Detecting Prompt Injection Attacks in Web Agents
- Parallax: Why AI Agents That Think Must Never Act
- COBALT-TLA: A Neuro-Symbolic Verification Loop for Cross-Chain Bridge Vulnerability Discovery
- Common approach:
- Decouple detection/validation from the main agent (parallel guard gating; process separation).
- Use deterministic or formal oracles (TLC model checker; tiered validators) to correct or block.
- Evaluate under assume-compromise or adversarial settings rather than relying on model refusals.
- Open questions / failure modes:
- Synthetic training data and non-adaptive threat models (WebAgentGuard doesn’t consider white-box attacks).
- Trusted computing base risk: engine compromise undermines Parallax.
- Small-scope bounds and abstraction fidelity in formal modeling (COBALT-TLA).
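The tiered-validator idea (deterministic policy → classifier → LLM eval → human) can be sketched as a cheapest-first escalation chain. The tier functions below are invented stand-ins, not Parallax's real policy set; the key design choices are that each tier may abstain and that the chain fails closed.

```python
def tiered_validate(tool_call, tiers):
    """Run validators cheapest-first; each returns 'allow', 'block',
    or None to escalate to the next tier. Fail closed if no tier decides."""
    for name, tier in tiers:
        verdict = tier(tool_call)
        if verdict is not None:
            return name, verdict
    return "default", "block"  # nothing decided: deny by default

# Illustrative tiers (hypothetical tools and rules):
def policy_tier(call):
    if call["tool"] in {"delete_db", "wire_transfer"}:
        return "block"   # deterministic deny-list
    if call["tool"] in {"read_file"}:
        return "allow"   # deterministic allow-list
    return None          # escalate to the next tier

def classifier_tier(call):
    # Stand-in for a learned injection scorer.
    return "block" if "ignore previous" in call.get("args", "") else None
```

In a real deployment the later tiers (LLM judge, human review) are slower and costlier, which is why the deterministic tier handles the bulk of traffic and the chain only escalates ambiguous calls.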
Theme: Red-teaming expands to templates, multimodal semantics, and stealthy weight attacks
- Why it matters: Alignment failures can be induced without long prompts—via chat-template mutations, benign images, or supply-chain weight edits that evade standard checks.
- Representative papers:
- TEMPLATEFUZZ: Fine-Grained Chat Template Fuzzing for Jailbreaking and Red Teaming LLMs
- Every Picture Tells a Dangerous Story: Memory-Augmented Multi-Agent Jailbreak Attacks on VLMs
- Compiling Activation Steering into Weights via Null-Space Constraints for Stealthy Backdoors
- Common approach:
- Treat system scaffolding as an attack surface (template elements; agent memory; toolchains).
- Use search/optimization (MCTS/evolution; fuzzing heuristics) plus scalable judging/oracles.
- Optimize for attack success + utility preservation (TemplateFuzz balances ASR and accuracy; STEEREDIT preserves URR).
- Open questions / failure modes:
- Transferability over time as models/templates change; real-world detectability and countermeasures.
- Query budget requirements (MemJack needs higher rounds for 90% ASR).
- Distribution shift breaking stealth constraints (STEEREDIT null-space estimated from finite benign set).
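To make "scaffolding as attack surface" concrete, here is a minimal element-level template-mutation loop. Everything here is illustrative: the template fields, the three mutation operators, and the `judge` callback are invented, and TemplateFuzz's actual mutators, search heuristics, and active-learning oracle are far more sophisticated.

```python
import random

# Hypothetical chat-template elements to mutate.
BASE_TEMPLATE = {"system_open": "<|system|>", "system_close": "<|end|>",
                 "user_open": "<|user|>", "role_order": ["system", "user"]}
MUTATIONS = {
    "swap_roles": lambda t: t.update(role_order=t["role_order"][::-1]),
    "drop_close": lambda t: t.update(system_close=""),
    "dup_system": lambda t: t.update(role_order=t["role_order"] + ["system"]),
}

def fuzz_templates(judge, budget=50, seed=0):
    """Randomly compose 1-2 element mutations per mutant; keep mutants
    the judge flags as eliciting a policy-violating completion."""
    rng = random.Random(seed)
    successes = []
    for _ in range(budget):
        mutant = dict(BASE_TEMPLATE)
        for name in rng.sample(list(MUTATIONS), k=rng.randint(1, 2)):
            MUTATIONS[name](mutant)
        if judge(mutant):
            successes.append(mutant)
    return successes
```

In the real setting, `judge` wraps a target-model query plus an outcome classifier, and the search is guided rather than uniform; the structural point survives the simplification: the attack input is the template, not the user prompt.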
Theme: Post-training stability: calibration, token credit assignment, distillation dynamics, and constraint fragility
- Why it matters: Training methods that improve accuracy can degrade calibration, stability, or robustness; instruction tuning can create brittle “helpfulness templates.”
- Representative papers:
- Calibration-Aware Policy Optimization for Reasoning LLMs
- Token-Level Policy Optimization: Linking Group-Level Rewards to Token-Level Aggregation via Sequence-Level Likelihood
- Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe
- One Token Away from Collapse: The Fragility of Instruction-Tuned Helpfulness
- Common approach:
- Replace reward-only surrogates with objectives aligned to desired properties (AUC-consistent surrogate in CAPO).
- Improve token-level learning signals (sequence-level likelihood aggregation; selective KL masking in TEPO).
- Diagnose training dynamics with internal metrics (overlap ratio, entropy gap) and mechanistic probes (two-pass recovery; layerwise probes).
- Open questions / failure modes:
- Generalization beyond math reasoning (CAPO/TEPO/OPD analyses are math-heavy).
- Long-horizon instability: OPD reward quality degrades with depth; unclear how to scale to very long traces.
- Evaluation blind spots: independent judging underestimates constraint-induced quality collapse vs pairwise comparisons.
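The calibration signal CAPO optimizes is a ranking-style AUC: does the model's stated confidence rank correct answers above incorrect ones? A minimal sketch of that diagnostic (the pairwise AUC computation only; CAPO's differentiable surrogate and bounds are not reproduced here):

```python
def calibration_auc(confidences, correct):
    """AUC of confidence as a ranker of correctness: the probability that
    a randomly chosen correct answer receives higher confidence than a
    randomly chosen incorrect one (ties count 0.5). 0.5 = uninformative."""
    pos = [c for c, y in zip(confidences, correct) if y]
    neg = [c for c, y in zip(confidences, correct) if not y]
    if not pos or not neg:
        return float("nan")  # undefined with only one class
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Tracking this quantity alongside accuracy during GRPO-style training is cheap, and a declining AUC at flat accuracy is exactly the "calibration collapse" failure mode this cluster flags.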
Theme: Memory & retrieval are moving from “raw chunks” to structured, query-aligned representations
- Why it matters: Context windows and top-K retrieval limit recall; long-horizon agents need reversible, discriminative memory and compact evidence units.
- Representative papers:
- Cooperative Memory Paging with Keyword Bookmarks for Long-Horizon LLM Conversations
- Thought-Retriever: Don't Just Retrieve Raw Data, Retrieve Thoughts for Memory-Augmented Agentic Systems
- Transforming External Knowledge into Triplets for Enhanced Retrieval in RAG of LLMs
- Common approach:
- Store compressed abstractions (thoughts; triplets; bookmarks) and retrieve under tight token budgets.
- Add filters/dedup (confidence + redundancy) and field-aware truncation (Tri-RAG).
- Evaluate on long-horizon or ultra-long-context tasks (AcademicEval; LoCoMo; multi-hop QA).
- Open questions / failure modes:
- Bookmark discrimination: recall is triggered reliably but correct page selection can be ~56% (paging bottleneck).
- Robustness at extreme scale and outside AI-paper domains (Thought-Retriever limitation).
- Triplet extraction faithfulness on narrative/implicit evidence.
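The bookmark-and-recall pattern can be sketched as a tiny page store. This is an assumption-laden toy (the paper's actual paging, bookmark generation, and recall tool are not specified here; keyword-overlap scoring is my stand-in), but it shows why "did it recall?" and "did it pick the right page?" are separate metrics.

```python
class BookmarkMemory:
    """Pages of conversation history indexed by keyword bookmarks;
    recall returns the page whose bookmarks best overlap the query."""
    def __init__(self):
        self.pages = []  # list of (keywords: set[str], text: str)

    def bookmark(self, keywords, text):
        self.pages.append((set(k.lower() for k in keywords), text))

    def recall(self, query):
        words = set(query.lower().split())
        # Score each page by keyword overlap with the query.
        scored = [(len(kw & words), text) for kw, text in self.pages]
        best_score, best_text = max(scored, default=(0, None))
        return best_text if best_score > 0 else None
```

The ~56% page-selection figure above lives entirely in the scoring step here: recall triggers whenever any keyword matches, but picking the single best page among near-duplicate bookmarks is where discrimination breaks down.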
3) Technical synthesis
- Production evaluation (AlphaEval) and benchmark governance (AISafetyBenchExplorer) converge on the same point: metric definitions and aggregation rules are part of the measurement instrument, and scaffold/evaluator choices can dominate conclusions.
- Several works independently adopt “separate the judge/guard from the actor”: WebAgentGuard (parallel guard), Parallax (process separation + validator tiers), Sentinel (world-state invariants), and COBALT-TLA (LLM + TLC oracle loop).
- A recurring pattern is boundedness + deterministic feedback to control LLM hallucination: TLC bounds (MaxTokens=3) in COBALT-TLA; Docker sandbox + rubric scripts in AlphaEval; read-only evaluators in Frontier-Eng.
- Safety evaluation is shifting from “does it refuse?” to trace-level and turn-level adjudication (AlphaEval traces; PhantomPolicy trace relabeling; CompliBench turn labels).
- Red-teaming is increasingly search-based (TemplateFuzz MCTS-like exploration; MemJack MCTS/evolution; Frontier-Eng generative optimization), suggesting defenses must assume adaptive attackers.
- Post-training methods are being redesigned around secondary properties beyond accuracy: CAPO optimizes relative calibration (AUC), TEPO targets stability/credit assignment, OPD targets overlap geometry, CWAC targets safety drift during fine-tuning.
- Multiple papers highlight evaluation blind spots: AlphaEval shows benchmark/production mismatch; One-Token-Away shows independent judging misses large quality drops; AISafetyBenchExplorer documents metric collisions.
- Memory/retrieval work is converging on structured intermediate artifacts (thoughts, triplets, bookmarks) rather than raw logs, but the key bottleneck becomes selection/discrimination rather than storage.
- Security threats span the full stack: templates → web pages → images → weights → data pipelines (TemplateFuzz, WebAgentGuard, MemJack, STEEREDIT, CoLA), implying “prompt safety” alone is insufficient.
- Formal methods are re-entering practical security via LLM-mediated interfaces (COBALT-TLA), but remain bounded/small-scope and abstraction-limited.
4) Top 5 papers (with “why now”)
1) AlphaEval: Evaluating Agents in Production
- Converts authentic partner requirements into 94 executable production tasks with multimodal inputs and multi-paradigm evaluation.
- Shows low absolute readiness (best 64.41/100) and that scaffolds can shift scores by ~11+ points, changing deployment decisions.
- Adds economic grounding (tasks map to ~2,420 professional hours valued at $154K–$231K).
- Skepticism: limited to seven companies/six domains and four scaffolds; snapshot may age quickly.
2) Policy-Invisible Violations in LLM-Based Agents
- Names a deployment-critical failure mode: violations depend on hidden world state, not visible content.
- PhantomPolicy shows models commit violations on 90–98% of risky cases under trace-level review.
- Sentinel demonstrates a concrete enforcement layer (graph fork→mutate→check) reaching 92.99% accuracy / 92.71 F1 with full coverage.
- Skepticism: guarantees are conditional on world-model completeness; Sentinel still misses violations (recall gaps) and doesn’t monitor plain-text outputs.
3) Parallax: Why AI Agents That Think Must Never Act
- Argues for architectural guarantees: reasoners cannot execute; executors cannot reason.
- OpenParallax blocks 98.9% of injected attacks by default and 100% in max-security mode under assume-compromise evaluation.
- Provides a tiered validator design (deterministic policy → classifier → LLM eval → human).
- Skepticism: strict mode has 36% false positives; engine is a single trusted base; rollback can’t undo external side effects.
4) TEMPLATEFUZZ: Fine-Grained Chat Template Fuzzing for Jailbreaking and Red Teaming LLMs
- Establishes chat templates as a first-class attack surface with element-level mutations and heuristic search.
- Reports ~98.2% Top-5 ASR on open models with ~1.1% accuracy degradation; transfers 80–100% Top-5 ASR to commercial models.
- Adds a scalable active-learning oracle to judge jailbreak outcomes cheaply.
- Skepticism: transferability may shift with template hardening/model updates; real-world detectability/countermeasures not fully quantified.
5) Frontier-Eng: Benchmarking Self-Evolving Agents on Real-World Engineering Tasks with Generative Optimization
- Reframes agent evaluation as budgeted iterative optimization with feasibility gating and frozen verifiers (47 tasks, five categories).
- Finds optimization dynamics: improvement frequency decays ~t⁻¹ and magnitude ~k⁻¹; depth beats width under fixed budgets.
- Provides actionable comparisons across models/search frameworks; claude-opus-4.6 leads (avg rank 3.18).
- Skepticism: average-rank metric discards magnitude; suite size/fidelity still limited.
5) Practical next steps
- If you ship agents: adopt a production-grounded eval harness (AlphaEval-style task packages + sandbox + rubric scripts) and explicitly measure scaffold sensitivity before attributing gains to model upgrades.
- For enterprise safety: prototype a world-state enforcement layer (Sentinel-like) that simulates tool-call mutations and returns Allow/Block/Clarify; track coverage gaps as a first-class metric.
- For agent execution security: run an assume-compromise test (inject tool calls directly at the execution boundary) to validate that safety doesn’t depend on model refusals (Parallax methodology).
- For web agents: consider a parallel multimodal guard gating actions; evaluate out-of-domain attacks (PopUp/VPI/EIA) and measure latency under parallel execution (WebAgentGuard).
- For red-teaming: add template fuzzing and multimodal semantic jailbreak suites to your CI; treat “chat template” and “rendered page content” as adversarial inputs, not trusted formatting.
- For post-training: when using GRPO-like RL, track calibration (AUC) alongside accuracy; consider CAPO-style objectives if AUC degrades during training.
- For long-horizon systems: prefer reversible memory (bookmark+recall) and measure page-selection accuracy separately from “did it retrieve”; invest in bookmark discriminability.
- For supply-chain risk: include checks for stealthy weight edits (triggered behavior with low clean leakage) and evaluate under distribution shift, since null-space stealth depends on the benign reference set.
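The assume-compromise test recommended above reduces to a simple harness: feed attacker-chosen tool calls straight to the execution boundary, bypassing the model, and measure the block rate. The executor, allow-list, and call schema below are hypothetical, only the methodology follows the Parallax bullet.

```python
def assume_compromise_rate(executor, attacks):
    """Bypass the model entirely: submit attacker-chosen tool calls
    directly at the execution boundary and return the fraction blocked.
    Safety that lives only in model refusals scores ~0 here."""
    blocked = sum(1 for call in attacks if executor(call) == "block")
    return blocked / len(attacks)

# Illustrative executor with a deny-by-default allow-list of (tool, scope):
ALLOWED = {("read_file", "workspace"), ("search", "web")}
def allowlist_executor(call):
    return "run" if (call["tool"], call["scope"]) in ALLOWED else "block"
```

Running this periodically in CI, with the attack set drawn from your actual tool inventory, gives a regression signal that is independent of the model's alignment state.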
Generated from per-paper analyses; no external browsing.
