Daily AI Paper Report (2026-05-06)
Published:
Chinese version: [中文]
Run stats
- Candidates: 258
- Selected: 30
- Deepread completed: 30
- Window (UTC): 2026-05-04T00:00:00Z → 2026-05-05T00:00:00Z (arxiv_announce, expanded=0)
Selected papers
arXiv ID | Title / Links | Categories | Score | Why | Tags
---|---|---|---|---|---
2605.02187 | When Alignment Isn't Enough: Response-Path Attacks on LLM Agents | cs.CR, cs.AI | 96 | High-impact agent security threat: post-alignment relay tampering breaks BYOK agents with 99.1% ASR. | agent-security, integrity, prompt-injection, BYOK, red-teaming |
2605.02269 | Towards Understanding Specification Gaming in Reasoning Models | cs.AI | 96 | Open benchmark on LLM specification gaming; directly studies agent failure modes and RL effects. | agent-safety, specification-gaming, evaluation, reasoning-models, benchmark |
2605.02682 | Hybrid Inspection and Task-Based Access Control in Zero-Trust Agentic AI | cs.AI | 95 | Directly targets agent tool-use authorization and runtime enforcement in zero-trust settings. | agent-safety, security, tool-use, authorization, zero-trust |
2605.02751 | Mitigating Misalignment Contagion by Steering with Implicit Traits | cs.AI, cs.CL | 95 | Directly studies multi-agent LM misalignment contagion and mitigation; highly safety-relevant. | llm-safety, multi-agent, alignment, steering, evaluation |
2605.02812 | Autonomous LLM Agent Worms: Cross-Platform Propagation, Automated Discovery and Temporal Re-Entry Defense | cs.CR | 94 | Systematic study of persistent agent worm propagation and defenses in multi-agent LLM ecosystems. | agent-security, worms, persistent-memory, multi-agent, defense |
2605.02647 | ContextualJailbreak: Evolutionary Red-Teaming via Simulated Conversational Priming | cs.CL, cs.CR | 92 | Automated multi-turn jailbreak red-teaming via conversational priming targets a key real-world attack surface. | jailbreaks, red-teaming, multi-turn, alignment, adversarial-evaluation |
2605.02199 | MEMAUDIT: An Exact Package-Oracle Evaluation Protocol for Budgeted Long-Term LLM Memory Writing | cs.AI | 92 | Auditable protocol for long-term agent memory writing; strong relevance to reliable agentic systems. | agents, memory, evaluation, reliability, auditing |
2605.02202 | CBV: Clean-label Backdoor Attacks on Vision Language Models via Diffusion Models | cs.AI | 92 | Backdoor attacks on VLMs via clean-label diffusion poisoning; strong security relevance. | vlm-security, backdoor, data-poisoning, diffusion, adversarial-ml |
2605.02196 | DurableUn: Quantization-Induced Recovery Attacks in Machine Unlearning | cs.LG | 91 | Important privacy/safety result: INT4 quantization can recover supposedly unlearned content by up to 22x. | machine-unlearning, privacy, quantization, llm-security, compliance |
2605.02178 | T$^2$PO: Uncertainty-Guided Exploration Control for Stable Multi-Turn Agentic Reinforcement Learning | cs.AI | 91 | Targets instability in multi-turn agent RL with uncertainty-guided exploration control. | agentic-rl, reasoning, training-stability, uncertainty, llm-agents |
2605.02661 | AcademiClaw: When Students Set Challenges for AI Agents | cs.AI, cs.CY | 90 | Real student-sourced long-horizon agent benchmark with isolated environments; high practical eval value. | agents, benchmark, long-horizon, evaluation, academic-tasks |
2605.02495 | Efficient Preference Poisoning Attack on Offline RLHF | cs.LG, cs.AI, stat.ML | 89 | Preference poisoning exposes offline RLHF/DPO supply-chain risk with efficient targeted attack methods. | RLHF, DPO, data-poisoning, alignment, security |
2605.02363 | When Correct Isn't Usable: Improving Structured Output Reliability in Small Language Models | cs.CL, cs.AI, cs.LG | 89 | Targets structured-output reliability, crucial for tool use and agent deployment. | llm-reliability, structured-output, json, agents, evaluation |
2605.02398 | The Compliance Trap: How Structural Constraints Degrade Frontier AI Metacognition Under Adversarial Pressure | cs.AI, cs.CL, cs.LG | 88 | Large-scale eval of metacognitive collapse under adversarial pressure; directly relevant to frontier AI safety. | evaluation, metacognition, adversarial-robustness, frontier-models, safety |
2605.02240 | PhysicianBench: Evaluating LLM Agents in Real-World EHR Environments | cs.AI | 88 | Real-world EHR benchmark for long-horizon LLM agents with verifiable execution in clinical workflows. | agents, benchmark, healthcare, evaluation, long-horizon |
2605.02395 | Controllable and Verifiable Process Data Synthesis for Process Reward Models | cs.AI | 88 | Controllable, verifiable process data synthesis for PRMs could improve reasoning supervision. | process-reward-models, reasoning, alignment, synthetic-data, verification |
2605.02411 | FitText: Evolving Agent Tool Ecologies via Memetic Retrieval | cs.AI, cs.IR, cs.LG, cs.MA | 88 | Dynamic tool retrieval inside the reasoning loop is highly relevant for scalable agent capability. | agents, tool-use, retrieval, tool-selection, framework |
2605.02503 | DataClaw: A Process-Oriented Agent Benchmark for Exploratory Real-World Data Analysis | cs.AI | 87 | Process-level benchmark for exploratory data-analysis agents with milestone annotations and noisy real data. | agents, benchmark, process-supervision, data-analysis, evaluation |
2605.02626 | Gradient-Gated DPO: Stabilizing Preference Optimization in Language Models | cs.LG | 86 | Stabilizes DPO training and addresses probability collapse in preference optimization. | dpo, alignment, preference-optimization, training-stability, llms |
2605.02348 | Decoding-Time Debiasing via Process Reward Models: From Controlled Fill-in to Open-Ended Generation | cs.CL, cs.LG | 86 | Decoding-time debiasing with PRMs is practical for safer generation without weight updates. | alignment, bias, process-reward-models, decoding, safety |
2605.02255 | On the Privacy of LLMs: An Ablation Study | cs.CR, cs.AI | 85 | Useful unified ablation of LLM privacy attacks across architecture, scale, data, and retrieval settings. | privacy, membership-inference, data-extraction, RAG, security-evaluation |
2605.02545 | Strategy-Aware Optimization Modeling with Reasoning LLMs | cs.AI | 85 | Post-training for strategy-aware optimization with verified data and GRPO shows concrete gains. | llm-reasoning, post-training, grpo, verification, optimization |
2605.02168 | Planner Matters! An Efficient and Unbalanced Multi-agent Collaboration Framework for Long-horizon Planning | cs.AI, cs.LG, cs.MA | 84 | Multi-agent planning framework with compute-allocation insight; useful for efficient long-horizon agents. | agents, planning, multi-agent, efficiency, long-horizon |
2605.02472 | Accurate Legal Reasoning at Scale: Neuro-Symbolic Offloading and Structural Auditability for Robust Legal Adjudication | cs.CL | 84 | Neuro-symbolic offloading with auditable traces for legal reasoning is strong reliability work. | neuro-symbolic, auditability, legal-reasoning, reliability, agents |
2605.02469 | Reference-Sampled Boltzmann Projection for KL-Regularized RLVR: Target-Matched Weighted SFT, Finite One-Shot Gaps, and Policy Mirror Descent | cs.LG, cs.AI | 84 | Theoretical clarification of RLVR vs weighted SFT could matter for scalable verifier-based alignment. | alignment, rlvr, kl-regularization, theory, sft |
2605.02443 | HalluScan: A Systematic Benchmark for Detecting and Mitigating Hallucinations in Instruction-Following LLMs | cs.CL | 83 | Comprehensive hallucination benchmark with routing and mitigation; practical for reliability evaluation. | hallucinations, benchmark, evaluation, reliability, mitigation |
2605.02206 | Metric Unreliability in Multimodal Machine Unlearning: A Systematic Analysis and Principled Unified Score | cs.CV, cs.LG | 83 | Systematic analysis of unreliable multimodal unlearning metrics; proposes unified scoring for evaluation. | unlearning, multimodal, evaluation, safety, privacy |
2605.02435 | Generalized Distributional Alignment Games for Unbiased Answer-Level Fine-Tuning | cs.LG, stat.ML | 82 | Addresses estimator bias in answer-level fine-tuning games with provable unbiased methods. | alignment, fine-tuning, distributional-alignment, theory, training |
2605.02504 | A multilingual hallucination benchmark: MultiWikiQHalluA | cs.CL | 81 | Multilingual hallucination benchmark across many languages; valuable for reliability beyond English. | hallucination, multilingual, benchmark, reliability, evaluation |
2605.02122 | STABLEVAL: Disagreement-Aware and Stable Evaluation of AI Systems | cs.LG, cs.AI | 80 | Evaluation framework for stable, disagreement-aware AI ranking; broadly reusable for model assessment. | evaluation, human-judgment, uncertainty, benchmarking, reliability |
AI Paper Insight Brief
2026-05-06
0) Executive takeaways (read this first)
- Agent evaluation is shifting from final-answer scoring to execution-grounded, process-aware, and stability-aware measurement. Several papers argue current benchmarks overstate capability because they ignore annotator disagreement, intermediate milestones, structured-output validity, or real environment execution.
- A recurring systems lesson: architecture and control matter more than raw model size in long-horizon agents. Planner-centric decomposition, uncertainty-guided exploration control, dynamic tool retrieval, and neuro-symbolic offloading all report large gains over monolithic agent setups.
- Security work is increasingly targeting post-generation and persistent-system attack surfaces, not just prompt injection. BYOK relay tampering, autonomous agent worms, contextual multi-turn jailbreaks, clean-label VLM backdoors, and offline RLHF poisoning all expose vulnerabilities outside the usual “aligned model output” threat model.
- Multiple papers show that deployment transformations break audit assumptions: quantization can undo machine unlearning, relay infrastructure can bypass alignment, and structured outputs can be “correct but unusable.” Auditing only the training-time artifact, or only at BF16 precision, is no longer enough.
- Preference/RL fine-tuning remains fragile. New work identifies small-batch estimator bias, DPO squeezing, and multi-turn RL collapse, while proposing fixes such as unbiased/variance-aware estimators, gradient gating, and token/turn-level exploration control.
- For practitioners, the practical frontier is clear: instrument the stack end to end. Response integrity, memory writes, tool authorization, output schemas, deployment precision, and process checkpoints all need explicit controls.
2) Key themes (clusters)
Theme: Evaluation is becoming process-aware and deployment-aware
- Why it matters: Several papers argue that standard benchmark scores hide the real failure modes that matter in deployment: unstable rankings, invalid structured outputs, poor intermediate progress, and execution failures in realistic environments. The common move is to evaluate the full operational contract, not just final correctness.
- Representative papers:
- STABLEVAL: Disagreement-Aware and Stable Evaluation of AI Systems
- PhysicianBench: Evaluating LLM Agents in Real-World EHR Environments
- DataClaw: A Process-Oriented Agent Benchmark for Exploratory Real-World Data Analysis
- When Correct Isn’t Usable: Improving Structured Output Reliability in Small Language Models
- Common approach:
- Replace single hard labels or final-answer-only metrics with richer signals: posterior expected credit, checkpoints, milestones, JSON validity, or execution-grounded grading.
- Evaluate in realistic environments with tool calls and state changes rather than static QA.
- Separate distinct failure sources: extraction vs selection, reasoning vs execution, correctness vs parseability.
- Use certified or auditable denominators where possible rather than heuristic scoring alone.
- Open questions / failure modes:
- How well do these richer metrics transfer across domains without becoming benchmark-specific?
- Many protocols still rely partly on LLM judges or sparse human annotation.
- Small or sparse settings remain hard: STABLEVAL notes instability in low-density annotation regimes; HalluScan uses very small samples.
- Process metrics can diagnose failure, but they do not by themselves provide training signals that reliably improve agents.
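One of the failure-source splits above (correctness vs parseability) is easy to make concrete: a structured-output scorer should report whether the output is machine-usable separately from whether the answer inside it is right, since "correct but unparseable" has zero operational utility. A minimal sketch, with function and field names that are illustrative rather than taken from any of the papers:

```python
import json
from dataclasses import dataclass

@dataclass
class StructuredScore:
    parseable: bool   # does the output satisfy the schema contract?
    correct: bool     # is the answer right, given that it parsed?

def score_structured_output(raw: str, expected_answer: str,
                            required_keys=("answer",)) -> StructuredScore:
    """Score parseability and correctness as separate failure sources."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return StructuredScore(parseable=False, correct=False)
    if not (isinstance(obj, dict) and all(k in obj for k in required_keys)):
        return StructuredScore(parseable=False, correct=False)
    return StructuredScore(parseable=True,
                           correct=str(obj["answer"]).strip() == expected_answer)

# A response can be "correct" in content but operationally unusable:
ok = score_structured_output('{"answer": "42"}', "42")
broken = score_structured_output('The answer is 42', "42")
```

Aggregating the two flags separately (rather than multiplying them into one score) is what lets an evaluation distinguish a formatting regression from a reasoning regression.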
Theme: Long-horizon agents benefit from decomposition, retrieval adaptation, and control
- Why it matters: Across web, OS, tool-use, and data-analysis settings, papers converge on the view that monolithic agents fail because they mix planning, execution, retrieval, and memory management into one brittle loop. Specialized modules and in-loop control improve both success rate and compute efficiency.
- Representative papers:
- Planner Matters! An Efficient and Unbalanced Multi-agent Collaboration Framework for Long-horizon Planning
- T$^2$PO: Uncertainty-Guided Exploration Control for Stable Multi-Turn Agentic Reinforcement Learning
- FitText: Evolving Agent Tool Ecologies via Memetic Retrieval
- MEMAUDIT: An Exact Package-Oracle Evaluation Protocol for Budgeted Long-Term LLM Memory Writing
- Common approach:
- Split planning from acting and memory, then allocate more capacity or RL updates to the planner.
- Control exploration at finer granularity using token- and turn-level uncertainty signals.
- Move retrieval into the reasoning loop so tool search adapts as the task unfolds.
- Treat memory writing as a budgeted optimization problem rather than a downstream QA proxy.
- Open questions / failure modes:
- Planner-centric systems still inherit frozen actor errors and coarse credit assignment.
- Retrieval-heavy methods can become base-model dependent and token-expensive.
- Memory evaluation is package-conditional; practical writers still need estimated utilities, not oracle marginals.
- Longer-horizon settings beyond current step limits remain underexplored.
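Treating memory writing as a budgeted selection problem, rather than "write everything and hope retrieval copes," can be sketched as greedy value-per-token selection under a write budget. The utility estimates here are placeholders; MEMAUDIT's actual protocol uses exact package-level oracles, which a greedy heuristic like this only approximates:

```python
def select_memory_writes(candidates, token_budget):
    """Greedy budgeted selection: pick candidate notes by estimated
    utility per token until the write budget is exhausted.
    `candidates` is a list of (note, est_utility, token_cost) tuples."""
    ranked = sorted(candidates, key=lambda c: c[1] / c[2], reverse=True)
    chosen, spent = [], 0
    for note, utility, cost in ranked:
        if spent + cost <= token_budget:
            chosen.append(note)
            spent += cost
    return chosen, spent

# Toy candidates: (note, estimated utility, token cost)
cands = [
    ("user prefers metric units", 0.9, 6),
    ("full transcript of small talk", 0.2, 40),
    ("project deadline is May 20", 0.8, 7),
]
picked, used = select_memory_writes(cands, token_budget=15)
```

The open question flagged above applies directly: everything hinges on the estimated utilities, which a practical writer must produce without access to oracle marginals.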
Theme: Security threats are moving beyond prompt injection
- Why it matters: The strongest security papers here target structural weaknesses in the agent stack: relays, persistent memory, tool authorization, poisoned datasets, and multi-turn conversational context. The implication is that model alignment alone is insufficient if the surrounding system can rewrite, persist, or misroute actions.
- Representative papers:
- When Alignment Isn’t Enough: Response-Path Attacks on LLM Agents
- Autonomous LLM Agent Worms: Cross-Platform Propagation, Automated Discovery and Temporal Re-Entry Defense
- ContextualJailbreak: Evolutionary Red-Teaming via Simulated Conversational Priming
- Hybrid Inspection and Task-Based Access Control in Zero-Trust Agentic AI
- Common approach:
- Model the full system threat surface: relay tampering, persistent carriers, multi-turn priming, and intent-misaligned tool access.
- Use automated search or analysis to discover high-yield attacks: evolutionary mutators, code-property graphs, semantic task-tool matching.
- Evaluate transfer across models and platforms, not just within one stack.
- Pair attacks with architectural defenses such as time-channel detection, temporal re-entry constraints, or zero-trust interception.
- Open questions / failure modes:
- Many defenses are partial: latency-based detection needs long sessions; semantic authorization degrades in multi-turn settings.
- Some results depend on black-box judges or withheld artifacts, limiting exact reproducibility.
- Closed-source and production systems may expose different carrier surfaces than open-source testbeds.
- Strong provider asymmetries suggest defenses may be recipe-specific rather than general.
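A zero-trust stance means the executor does not trust a tool call merely because the model emitted it. A minimal interception layer, sketched below with illustrative names (a real system would also verify tool-definition integrity and call provenance), checks each call against a per-task allowlist and basic parameter constraints before execution:

```python
def intercept_tool_call(call, task_policy):
    """Allow a tool call only if the task policy authorizes the tool
    and every constrained parameter passes its validator.
    Returns (allowed, reason)."""
    if call["tool"] not in task_policy["allowed_tools"]:
        return False, f"tool '{call['tool']}' not authorized for this task"
    validators = task_policy.get("param_validators", {}).get(call["tool"], {})
    for param, validator in validators.items():
        if not validator(call["args"].get(param)):
            return False, f"parameter '{param}' failed validation"
    return True, "ok"

# Policy for a "summarize inbox" task: read-only mail access, no send.
policy = {
    "allowed_tools": {"mail.read"},
    "param_validators": {
        "mail.read": {"folder": lambda f: f in {"inbox", "archive"}},
    },
}
ok, _ = intercept_tool_call({"tool": "mail.read", "args": {"folder": "inbox"}}, policy)
blocked, why = intercept_tool_call({"tool": "mail.send", "args": {"to": "x@evil"}}, policy)
```

As the open questions note, allowlists like this degrade in multi-turn settings where the task intent itself drifts; they are a floor, not a complete defense.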
Theme: Alignment and preference optimization are hitting optimization pathologies
- Why it matters: Several papers identify failure modes in preference learning and RL that are not just “needs more data” problems: structural estimator bias, destructive rejected-response gradients, and unstable exploration. These are algorithmic issues that can distort training even with good objectives.
- Representative papers:
- Gradient-Gated DPO: Stabilizing Preference Optimization in Language Models
- Generalized Distributional Alignment Games for Unbiased Answer-Level Fine-Tuning
- Reference-Sampled Boltzmann Projection for KL-Regularized RLVR: Target-Matched Weighted SFT, Finite One-Shot Gaps, and Policy Mirror Descent
- Efficient Preference Poisoning Attack on Offline RLHF
- Common approach:
- Analyze training dynamics at the estimator or gradient level rather than only comparing end metrics.
- Derive closed-form targets or bias terms for KL-regularized objectives and small-batch estimators.
- Introduce lightweight fixes: gradient gating, offline lookup estimators, weighted SFT projections.
- Study adversarial manipulation of preference data as a geometric problem in gradient space.
- Open questions / failure modes:
- Several papers are theory-heavy or single-run empirically; broader validation is still needed.
- Fixed-reference or log-linear assumptions may not hold in modern large-scale pipelines.
- Offline replacements for online RL still face support/coverage barriers.
- Robustness to poisoned or misspecified preference data remains weak.
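One way to read "gradient gating" concretely: cut the rejected-response gradient contribution when the chosen response's likelihood is already falling, so the rejected term cannot keep squeezing probability mass off the chosen answer. The sketch below is an interpretation of that general idea, not Gate-DPO's actual algorithm; the monotone-decrease heuristic and window size are assumptions:

```python
import math

def dpo_pair_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO pairwise loss: -log sigmoid(beta * implicit reward margin)."""
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

def should_gate_rejected(logp_chosen_history, window=3):
    """Heuristic squeezing detector: gate (i.e. detach) the rejected-response
    gradient when the chosen log-likelihood has fallen monotonically over
    the last `window` recorded steps."""
    recent = logp_chosen_history[-window:]
    return len(recent) == window and all(a > b for a, b in zip(recent, recent[1:]))

squeezing = should_gate_rejected([-1.2, -1.4, -1.7, -2.1])  # steadily falling
healthy = should_gate_rejected([-1.2, -1.1, -1.3, -1.2])
balanced_loss = dpo_pair_loss(-1.0, -2.0, -1.0, -2.0)       # zero margin
```

In a real training loop the gate would be applied per example by detaching the rejected log-probability term before backpropagation, leaving the chosen term's gradient intact.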
Theme: Audit assumptions are breaking under deployment transformations
- Why it matters: A notable cluster shows that safety/privacy claims made at one stage of the pipeline can fail after deployment transformations such as quantization, relay mediation, or multilingual/tokenization shifts. This is a warning against evaluating only the “clean” training-time artifact.
- Representative papers:
- DurableUn: Quantization-Induced Recovery Attacks in Machine Unlearning
- Metric Unreliability in Multimodal Machine Unlearning: A Systematic Analysis and Principled Unified Score
- A multilingual hallucination benchmark: MultiWikiQHalluA
- HalluScan: A Systematic Benchmark for Detecting and Mitigating Hallucinations in Instruction-Following LLMs
- Common approach:
- Stress-test claims under deployment-relevant conditions: INT4 quantization, multilingual settings, recoverability probes, cost-aware detection routing.
- Compare multiple metrics and show they disagree systematically.
- Introduce composite or deployment-aware metrics such as UQS, Q-INT4, HalluScore, or answer-level hallucination rates.
- Validate whether “passing” one metric still leaves recoverable or hidden failure modes.
- Open questions / failure modes:
- Composite metrics help ranking but may still inherit oracle or weighting assumptions.
- Small benchmark sizes and synthetic labels limit confidence in some hallucination findings.
- Quantization-robust unlearning currently appears to trade off sharply against retain accuracy.
- Cross-language hallucination estimates may be confounded by tokenization and classifier false positives.
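The audit lesson generalizes: any claim of the form "content X is gone" should be re-checked after the same transformations the deployed artifact will undergo. As a toy illustration of why (this is not the paper's recovery attack), symmetric INT4 quantization round-trips each weight through only 16 levels, so a small "unlearning" perturbation can be snapped back onto the original grid point at deployment time:

```python
def quantize_int4(w, scale):
    """Symmetric INT4: round w/scale into [-8, 7], then dequantize."""
    q = max(-8, min(7, round(w / scale)))
    return q * scale

scale = 0.1
original = 0.30    # pre-unlearning weight
perturbed = 0.27   # "unlearned" weight after a small update
# In BF16-like full precision the two weights differ...
assert original != perturbed
# ...but both snap to the same INT4 grid point at deployment precision.
same_after_quant = quantize_int4(original, scale) == quantize_int4(perturbed, scale)
```

The mitigation direction follows from the picture: an unlearning update must move weights far enough to change their quantized values, which is exactly the regime where retain accuracy starts to suffer.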
Theme: Structured, symbolic, and auditable offloading is resurging
- Why it matters: Several papers improve reliability by moving critical reasoning into typed, symbolic, or constrained representations rather than relying on free-form runtime reasoning. This is especially compelling in legal, optimization, and process-supervision settings where auditability matters.
- Representative papers:
- Accurate Legal Reasoning at Scale: Neuro-Symbolic Offloading and Structural Auditability for Robust Legal Adjudication
- Strategy-Aware Optimization Modeling with Reasoning LLMs
- Controllable and Verifiable Process Data Synthesis for Process Reward Models
- Common approach:
- Use LLMs once for compilation, synthesis, or strategy proposal, then rely on deterministic execution or verification.
- Make latent structure explicit: typed clause graphs, first-error labels, strategy segments.
- Train or evaluate against solver/prover-backed signals rather than only text similarity.
- Emphasize audit trails and structural diagnostics.
- Open questions / failure modes:
- Expressivity remains bounded by the symbolic substrate or template library.
- Compilation errors can become deterministic failure modes if not caught.
- Generalization to open-ended domains is unclear.
- Some systems depend on strong external tools or judges (solvers, provers, expert review).
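The compile-once pattern can be made concrete even without a solver: the LLM is asked once for a structured rule, the rule is validated against a restricted grammar, and every subsequent evaluation is deterministic. The toy grammar and names below are illustrative, not from any of the papers; note how a validation failure becomes an explicit, auditable error rather than silent free-form reasoning:

```python
import ast

ALLOWED_NODES = (ast.Expression, ast.BoolOp, ast.Compare, ast.Name, ast.Load,
                 ast.Constant, ast.And, ast.Or, ast.Gt, ast.GtE, ast.Lt,
                 ast.LtE, ast.Eq)

def compile_rule(rule_src, allowed_vars):
    """Validate a proposed rule against a tiny whitelist grammar, then
    return a deterministic evaluator. Raises ValueError on anything else."""
    tree = ast.parse(rule_src, mode="eval")
    for node in ast.walk(tree):
        if not isinstance(node, ALLOWED_NODES):
            raise ValueError(f"disallowed construct: {type(node).__name__}")
        if isinstance(node, ast.Name) and node.id not in allowed_vars:
            raise ValueError(f"unknown variable: {node.id}")
    code = compile(tree, "<rule>", "eval")
    return lambda env: bool(eval(code, {"__builtins__": {}}, env))

# Imagine this string came from one LLM "compilation" call:
rule = compile_rule("age >= 18 and income > 30000", {"age", "income"})
eligible = rule({"age": 30, "income": 50000})
ineligible = rule({"age": 16, "income": 50000})
```

The expressivity caveat above shows up immediately: anything outside the whitelist (function calls, arithmetic, attribute access) is rejected by construction, which is the point of the approach and also its limit.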
3) Technical synthesis
- Multiple papers replace coarse trajectory-level supervision with finer-grained control or scoring: T$^2$PO uses token/turn uncertainty, Gate-DPO gates rejected gradients, STABLEVAL preserves posterior item uncertainty, and process-PRM synthesis labels first-error structure.
- A common pattern is decomposition for identifiability: planner vs actor vs memory, task extraction vs task-tool matching, extraction vs selection in memory writing, correctness vs JSON validity, and retrieval vs reasoning in data-analysis agents.
- Several works expose support/coverage as the hidden bottleneck: BOLT’s one-shot gap depends on missing target support, FitText shows retrieval is the binding constraint, and MEMAUDIT formalizes budgeted candidate coverage.
- Security papers increasingly model post-hoc integrity failures rather than model misbehavior alone: response-path forgery, persistent carrier re-entry, tool-call substitution, and relay-side rewriting all bypass standard alignment assumptions.
- Evaluation papers repeatedly show that single metrics are misleading: FA vs RA vs AD/JS disagree in multimodal unlearning; task accuracy without parseability gives 0 operational utility; final-answer-only scoring hides partial process progress.
- Several methods use offline precomputation to reduce online cost: BOLT precomputes Boltzmann weights, DACL compiles contracts once, MEMAUDIT computes exact package optima, and AloLab pays one-time prompt optimization cost to recover near-baseline inference latency.
- There is a notable trend toward judge-in-the-loop systems, but with different roles: VLM-as-judge for planner RL, LLM judges for hallucination and benchmark grading, and semantic judges for jailbreak search. This improves scalability but creates a shared dependency on judge reliability.
- Robustness work increasingly tests deployment transformations explicitly: quantization, annotator subsampling, multilingual tokenization, relay mediation, and constrained decoding overhead all materially change conclusions.
- Several papers suggest capability is not the only determinant of robustness: Constitutional AI appears resistant to the compliance trap, Anthropic models resist some contextual jailbreak transfer, and planner scaling can matter more than scaling all modules.
- Across agent benchmarks, the dominant failures are still reasoning and coordination, not just tool syntax: PhysicianBench attributes about half of failures to clinical reasoning; DataClaw shows hard tasks remain difficult even after cleaning data; AcademiClaw finds more tokens do not buy better outcomes.
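The single-metric warnings above suggest a cheap sanity check any benchmark owner can run: resample the evaluation items and see whether the model ranking survives. This bootstrap sketch is a generic stability probe in the spirit of the disagreement-aware evaluation work, not STABLEVAL's actual estimator:

```python
import random

def ranking_stability(scores_by_model, n_boot=1000, seed=0):
    """Fraction of bootstrap resamples (over items) in which the top-ranked
    model stays the same. `scores_by_model` maps model name to a per-item
    score list; all lists share the same item order."""
    rng = random.Random(seed)
    models = list(scores_by_model)
    n_items = len(next(iter(scores_by_model.values())))
    def top(idx):
        return max(models, key=lambda m: sum(scores_by_model[m][i] for i in idx))
    full_top = top(range(n_items))
    hits = 0
    for _ in range(n_boot):
        idx = [rng.randrange(n_items) for _ in range(n_items)]
        hits += top(idx) == full_top
    return full_top, hits / n_boot

scores = {
    "model_a": [1, 1, 0, 1, 1, 1, 1, 0],  # 6/8 on this toy benchmark
    "model_b": [1, 0, 1, 1, 0, 0, 1, 1],  # 5/8
}
winner, stability = ranking_stability(scores)
```

A stability value well below 1.0 on a real leaderboard is exactly the "unstable ranking" failure mode: the reported winner depends on which items happened to be sampled.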
4) Top 5 papers (with “why now”)
When Alignment Isn’t Enough: Response-Path Attacks on LLM Agents
- Formalizes a structural integrity gap in BYOK deployments: a relay can rewrite model outputs after alignment but before agent execution.
- Shows strong empirical attack performance across AgentDojo and ASB, with RTA-PostForge reaching 73.5% ASR on AgentDojo while preserving 47.6% utility, and high ASR on ASB.
- Useful now because many production agent stacks rely on relays, routers, or middleware that terminate TLS and are implicitly trusted.
- Also valuable because it reframes prompt injection as only one part of the threat model; response authenticity becomes a first-class safety requirement.
- Skeptical about: the threat model is specific to BYOK relay deployments, and the proposed time-channel detection works best over longer sessions.
T$^2$PO: Uncertainty-Guided Exploration Control for Stable Multi-Turn Agentic Reinforcement Learning
- Identifies “hesitation” as a concrete source of multi-turn RL instability: overlong low-information thinking and repeated unproductive turns.
- Introduces token-level truncation and turn-level resampling driven by a self-calibrated uncertainty signal, with gains across WebShop, ALFWorld, and Search QA.
- Useful now because multi-turn agent RL is becoming standard, and training collapse/variance is a major practical bottleneck.
- Particularly actionable as a plug-and-play control layer rather than a full optimizer replacement.
- Skeptical about: effectiveness depends on threshold tuning and still inherits off-policy staleness from pipelined rollouts.
Planner Matters! An Efficient and Unbalanced Multi-agent Collaboration Framework for Long-horizon Planning
- Provides unusually direct evidence that planner capacity dominates end-to-end long-horizon agent performance, with planner scaling nearly matching scaling all modules.
- Shows large gains from multi-agent decomposition and planner-only RL, including improvements on WebVoyager, OSWorld, and MCPBench.
- Useful now because many teams are over-investing in monolithic agents or uniform scaling rather than targeting the planning bottleneck.
- Offers a compute-allocation lesson: spend model size and RL budget where it matters most.
- Skeptical about: actor and memory are frozen during RL, so execution failures remain; the 15-step cap may understate longer-horizon failure modes.
DurableUn: Quantization-Induced Recovery Attacks in Machine Unlearning
- Shows that INT4 quantization can restore forgotten content even when BF16 audits indicate successful unlearning.
- Introduces a deployment-aware durability framing and a quantization-aware mitigation, DURABLEUN-SAF, that achieves a multi-seed durability certificate on TOFU.
- Useful now because low-bit deployment is standard, making BF16-only unlearning audits potentially misleading for privacy/compliance claims.
- The paper’s main contribution is procedural as much as algorithmic: evaluate unlearning at deployment precision, not just training precision.
- Skeptical about: the current robust solution collapses retain accuracy, so it is more an existence proof than a production-ready fix.
PhysicianBench: Evaluating LLM Agents in Real-World EHR Environments
- Brings agent evaluation into a realistic FHIR-based EHR environment with 100 physician-validated long-horizon tasks and 670 checkpoints.
- Shows that even top models are far from reliable autonomy: best pass@1 is 46.3%, with clinical reasoning accounting for about half of failures.
- Useful now because healthcare is a high-stakes domain where static QA benchmarks are especially misleading.
- More broadly, it exemplifies the benchmark shift toward execution-grounded, stateful, domain-realistic evaluation.
- Skeptical about: scope is limited to e-consult-style EHR workflows and excludes broader multimodal or collaborative clinical settings.
5) Practical next steps
- Add response integrity checks to agent stacks: log and verify the exact model output consumed by the executor, especially in relay/BYOK architectures.
- Evaluate any unlearning or privacy-sensitive model at deployment precision (INT8/INT4), not just BF16; add Q-INT4/Q-INT8-style reporting to audits.
- For long-horizon agents, run ablations that separately scale planner, actor, retrieval, and memory to identify the true bottleneck before spending more compute.
- Instrument agent training with token/turn-level diagnostics: average think length, repeated-turn rate, uncertainty trajectories, and collapse indicators across seeds.
- Replace final-answer-only benchmarking with process checkpoints and operational metrics: parseability, tool-state correctness, milestone completion, and execution-grounded success.
- For tool-using agents, deploy zero-trust interception: verify tool definitions, tool-call provenance, parameter integrity, and task-tool semantic alignment before execution.
- Stress-test agents against multi-turn contextual jailbreaks and persistent-memory attacks, not just single-turn prompt injection.
- Where domains are structured and high-stakes, prototype compile-once neuro-symbolic offloading or typed intermediate representations instead of repeated free-form runtime reasoning.
- In preference optimization, monitor chosen-response likelihood and mass dynamics, not just pairwise margin improvements, to catch DPO-style squeezing early.
- For memory systems, separate write-time quality from retrieval/reader quality; audit whether the writer is extracting the right facts, selecting under budget, or simply overproducing unusable notes.
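The first next step (verify the exact model output the executor consumes) can be implemented with a keyed MAC when the model-side endpoint and the executor share a secret the relay never sees. This assumes you control both ends of the path; the scheme below is a standard HMAC pattern, not the response-path paper's specific defense:

```python
import hmac
import hashlib

SHARED_KEY = b"executor-and-endpoint-only"  # never exposed to the relay

def sign_response(response_text: str) -> str:
    """Model-side endpoint attaches a MAC over the exact response bytes."""
    return hmac.new(SHARED_KEY, response_text.encode(), hashlib.sha256).hexdigest()

def verify_response(response_text: str, tag: str) -> bool:
    """Executor recomputes the MAC; any relay-side rewrite invalidates it."""
    return hmac.compare_digest(sign_response(response_text), tag)

original = '{"action": "read_file", "path": "notes.txt"}'
tag = sign_response(original)
tampered = '{"action": "send_email", "path": "secrets.txt"}'
accepted = verify_response(original, tag)
rejected = verify_response(tampered, tag)
```

This only authenticates integrity, not safety: a MAC guarantees the executor runs what the model actually said, which is precisely the property relay tampering breaks, but it does nothing about a model that was manipulated upstream of signing.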
Generated from per-paper analyses; no external browsing.
