Daily AI Paper Report (2026-04-29)
Published:
Chinese version: [中文]
Run stats
- Candidates: 244
- Selected: 30
- Deepread completed: 30
- Window (UTC): 2026-04-27T00:00:00Z → 2026-04-28T00:00:00Z (arxiv_announce, expanded=0)
Selected papers
| arXiv ID | Title / Links | Categories | Score | Why | Tags |
|---|---|---|---|---|---|
| 2604.24118 | AgentVisor: Defending LLM Agents Against Prompt Injection via Semantic Virtualization | cs.CR | 95 | Prompt-injection defense for agents via semantic privilege separation and audited tool mediation. | agent-security, prompt-injection, tool-use, sandboxing, guardrails |
| 2604.24082 | Jailbreaking Frontier Foundation Models Through Intention Deception | cs.CR, cs.AI, cs.CL | 95 | Targets jailbreaks via intention deception against frontier safe-completion models; highly safety-relevant. | jailbreak, agent-safety, safe-completion, red-teaming, frontier-models |
| 2604.24542 | Layerwise Convergence Fingerprints for Runtime Misbehavior Detection in Large Language Models | cs.CR, cs.AI, cs.CL | 93 | Runtime monitor for jailbreaks, prompt injection, and backdoors without clean-model assumptions. | runtime-monitoring, jailbreaks, prompt-injection, backdoors, llm-security |
| 2604.24657 | AgentWard: A Lifecycle Security Architecture for Autonomous AI Agents | cs.CR, cs.AI | 92 | Defense-in-depth security architecture for autonomous agents across lifecycle stages and tool execution. | agent-security, agents, defense-in-depth, tool-use, runtime-safety |
| 2604.24348 | OS-SPEAR: A Toolkit for the Safety, Performance, Efficiency, and Robustness Analysis of OS Agents | cs.CL | 91 | Comprehensive OS-agent toolkit spanning safety, performance, efficiency, and robustness evaluation. | agents, benchmark, os-agents, safety-evaluation, robustness |
| 2604.24700 | Green Shielding: A User-Centric Approach Towards Trustworthy AI | cs.CL, cs.AI | 91 | User-centric robustness benchmark for benign prompt variation; strong deployment-safety relevance. | llm-safety, evaluation, robustness, medical-ai, red-teaming |
| 2604.24710 | Case-Specific Rubrics for Clinical AI Evaluation: Methodology, Validation, and LLM-Clinician Agreement Across 823 Encounters | cs.AI, cs.CL | 91 | Clinician-authored rubric framework for scalable clinical AI eval with large real-world validation. | evaluation, clinical-ai, llm-as-judge, safety, deployment |
| 2604.24074 | How Sensitive Are Safety Benchmarks to Judge Configuration Choices? | cs.CL | 90 | Shows safety benchmark scores vary sharply with judge prompts, challenging reliability of current evals. | safety-evaluation, llm-as-judge, benchmarking, harmbench, measurement |
| 2604.24618 | Evaluating whether AI models would sabotage AI safety research | cs.AI | 89 | Direct evaluation of whether frontier models sabotage AI safety research in agentic settings. | ai-safety, agent-evals, sabotage, frontier-models, alignment |
| 2604.24594 | Skill Retrieval Augmentation for Agentic AI | cs.CL, cs.AI | 89 | Scalable skill retrieval for agents tackles context limits; likely reusable agent framework/benchmark. | agents, retrieval, tool-use, long-context, agent-architecture |
| 2604.24697 | Can Current Agents Close the Discovery-to-Application Gap? A Case Study in Minecraft | cs.AI | 89 | Benchmark for agents' discovery-to-application loop; tests generalization beyond memorized solutions. | agents, benchmark, reasoning, evaluation, generalization |
| 2604.24005 | TCOD: Exploring Temporal Curriculum in On-Policy Distillation for Multi-turn Autonomous Agents | cs.LG, cs.AI | 88 | Addresses instability in on-policy distillation for multi-turn agents with curriculum training. | agents, distillation, reinforcement-learning, reasoning, training |
| 2604.24198 | Rewarding the Scientific Process: Process-Level Reward Modeling for Agentic Data Analysis | cs.CL, cs.AI, cs.CE, cs.LG, cs.MA | 88 | Process reward modeling for agentic data analysis with environment-aware verification and silent-error detection. | process-reward-model, agents, verification, reasoning, reliability |
| 2604.24473 | Agentic clinical reasoning over longitudinal myeloma records: a retrospective evaluation against expert consensus | cs.AI, cs.CL | 88 | Large retrospective study of agentic clinical reasoning over long records against expert consensus. | agents, clinical-ai, long-context, rag, evaluation |
| 2604.24686 | Governing What You Cannot Observe: Adaptive Runtime Governance for Autonomous AI Agents | cs.AI | 87 | Adaptive runtime governance framework for autonomous agents with explicit risk-bound decision rules. | agent-governance, runtime-safety, risk-estimation, autonomous-agents, monitoring |
| 2604.24021 | QED: An Open-Source Multi-Agent System for Generating Mathematical Proofs on Open Problems | cs.AI, math.AP | 87 | Open multi-agent proof system with explicit failure modes; useful for agent reliability research. | agents, reasoning, evaluation, multi-agent, math |
| 2604.24395 | Aligning with Your Own Voice: Self-Corrected Preference Learning for Hallucination Mitigation in LVLMs | cs.AI | 86 | Self-corrected preference learning for LVLM hallucination mitigation with concrete alignment angle. | alignment, hallucination, vlm, dpo, reliability |
| 2604.24197 | Seeing Is No Longer Believing: Frontier Image Generation Models, Synthetic Visual Evidence, and Real-World Risk | cs.CL, cs.AI | 86 | Timely risk analysis of frontier image models and synthetic visual evidence with policy relevance. | ai-safety, multimodal, misinformation, risk-analysis, frontier-models |
| 2604.24162 | Defusing the Trigger: Plug-and-Play Defense for Backdoored LLMs via Tail-Risk Intrinsic Geometric Smoothing | cs.CR, cs.AI | 85 | Plug-and-play inference-time defense for backdoored LLMs with no retraining or clean data. | backdoor-defense, llm-security, inference-time, robustness, attention |
| 2604.24184 | Dynamic Cyber Ranges | cs.CR | 84 | Dynamic cyber ranges with defender agents offer stronger evaluation for offensive agent capabilities. | cybersecurity, agent-evaluation, red-teaming, defender-agents, benchmarks |
| 2604.24698 | The Chameleon's Limit: Investigating Persona Collapse and Homogenization in Large Language Models | cs.CL | 84 | Identifies persona collapse in LLM populations; important for multi-agent realism and risk analysis. | multi-agent, evaluation, llm-behavior, robustness, social-simulation |
| 2604.24302 | Differentiable Faithfulness Alignment for Cross-Model Circuit Transfer | cs.CL | 84 | Mechanistic interpretability method for transferring circuits across models could improve scalable auditing. | mechanistic-interpretability, circuits, model-auditing, transfer, reliability |
| 2604.24039 | AgenticCache: Cache-Driven Asynchronous Planning for Embodied AI Agents | cs.LG, cs.AI, cs.CL | 84 | Improves embodied agent efficiency via cache-based planning with strong latency and token reductions. | agents, efficiency, planning, embodied-ai, llm-systems |
| 2604.24623 | XGRAG: A Graph-Native Framework for Explaining KG-based Retrieval-Augmented Generation | cs.AI, cs.IR, cs.LG | 82 | Explainability for GraphRAG via causal graph perturbations improves transparency and trust. | rag, graphrag, interpretability, grounding, explainability |
| 2604.24178 | Meta-Aligner: Bidirectional Preference-Policy Optimization for Multi-Objective LLMs Alignment | cs.LG, cs.AI | 82 | Adaptive multi-objective alignment via meta-learning addresses conflicting values in LLM optimization. | alignment, multi-objective, preference-optimization, meta-learning, llms |
| 2604.23954 | An empirical evaluation of the risks of AI model updates using clinical data: stability, arbitrariness, and fairness | cs.AI | 82 | Empirical study of model update risks in clinical AI covering stability, arbitrariness, and fairness. | safety, fairness, evaluation, clinical-ai, model-updates |
| 2604.24477 | GAMMAF: A Common Framework for Graph-Based Anomaly Monitoring Benchmarking in LLM Multi-Agent Systems | cs.CR, cs.AI | 80 | Open benchmarking framework for anomaly monitoring in LLM multi-agent systems under attacks. | multi-agent, benchmark, anomaly-detection, security, evaluation |
| 2604.24564 | MEG-RAG: Quantifying Multi-modal Evidence Grounding for Evidence Selection in RAG | cs.CL, cs.IR, cs.IT | 80 | Metric for multimodal evidence grounding in RAG targets hallucination and evidence quality. | rag, multimodal, evaluation, grounding, hallucination |
| 2604.24222 | MEMCoder: Multi-dimensional Evolving Memory for Private-Library-Oriented Code Generation | cs.SE, cs.AI, cs.CL | 80 | Memory framework for private-library codegen addresses enterprise RAG gaps with evolving usage guidance. | code-llm, rag, memory, enterprise-ai, agents |
AI Paper Insight Brief
2026-04-29
0) Executive takeaways (read this first)
- Agent work is shifting from single-score capability gains to runtime control and deployment realism: several papers focus on continuous evaluation, lifecycle defenses, runtime monitors, and dynamic benchmarks rather than static task success alone.
- A recurring pattern is that structured mediation beats naive scaling: temporal curricula for distillation, semantic hypervisors for tool use, process reward models for data analysis, and skill/memory scaffolds all outperform simpler “just give the model more context” baselines.
- Evaluation itself is undermined by hidden variance: judge prompt wording can swing safety scores by up to 24.2 points, deployment-aware rankings diverge sharply from benchmark-only rankings, and persona/population fidelity can fail even when per-instance metrics look good.
- Security work is increasingly targeting indirect and lifecycle-spanning failures: para-jailbreaking, prompt injection through external content, backdoored weights, and multi-agent infection all require defenses that monitor internal state or mediate actions across stages.
- In high-stakes domains, the strongest results come from tool-using, structured systems with explicit verification, but residual errors remain more consequential than aggregate metrics suggest—especially in clinical and safety-critical settings.
- For frontier progress, the practical bottlenecks are less about raw model competence and more about incorporation, stability, grounding, and governance: retrieving skills/evidence is not enough unless the agent knows when and how to use them safely.
2) Key themes (clusters)
Theme: Runtime governance and defense-in-depth for agents
- Why it matters: As agents gain tools, memory, and autonomy, failures propagate across stages rather than at a single prompt boundary. The strongest papers here move toward layered controls, runtime anomaly detection, and action mediation instead of one-shot filtering.
- Representative papers:
- AgentVisor: Defending LLM Agents Against Prompt Injection via Semantic Virtualization
- Layerwise Convergence Fingerprints for Runtime Misbehavior Detection in Large Language Models
- AgentWard: A Lifecycle Security Architecture for Autonomous AI Agents
- Governing What You Cannot Observe: Adaptive Runtime Governance for Autonomous AI Agents
- Common approach:
- Interpose a trusted control layer between model outputs and privileged actions.
- Use runtime signals beyond output text: tool-call semantics, hidden-state trajectories, drift statistics, or lifecycle event hooks.
- Prefer monotonic/tighten-only interventions, abstention, or one-shot correction over unconstrained retries.
- Treat security as cross-stage state tracking rather than isolated input filtering.
- Open questions / failure modes:
- Many methods require white-box access, added latency, or framework-specific integration.
- Quantitative validation is uneven: some architectures are compelling but still mostly qualitative or theoretical.
- Long-context, multimodal, and highly adaptive adversaries remain weak spots.
- False positives and over-restriction can degrade utility in production.
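The shared mediation pattern in this theme can be sketched as a minimal pre-execution audit. All names here (`ToolCall`, `audit`, the allow-list and taint policies) are illustrative assumptions, not AgentVisor's or any framework's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class ToolCall:
    tool: str
    args: dict
    # provenance of each argument: "user", "model", or "external"
    provenance: dict = field(default_factory=dict)

ALLOWED_TOOLS = {"search", "read_file"}   # suitability policy for this task
TAINTED_SOURCES = {"external"}            # untrusted content origins

def audit(call: ToolCall) -> tuple[bool, str]:
    """Audit a proposed tool call before execution: suitability, taint, integrity."""
    if call.tool not in ALLOWED_TOOLS:
        return False, f"tool '{call.tool}' not permitted for this task"
    tainted = [k for k, src in call.provenance.items() if src in TAINTED_SOURCES]
    if tainted:
        return False, f"arguments {tainted} derive from untrusted content"
    if any(not isinstance(v, (str, int, float)) for v in call.args.values()):
        return False, "argument types outside declared schema"
    return True, "ok"

def mediate(call: ToolCall, execute):
    """Trusted layer between model output and privileged action."""
    ok, reason = audit(call)
    if not ok:
        # structured exception, so the agent can attempt one corrected retry
        return {"status": "blocked", "reason": reason}
    return {"status": "executed", "result": execute(call)}
```

The key design choice, consistent with the papers above, is that the block path returns a structured reason rather than silently failing, enabling one-shot correction instead of unconstrained retries.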
Theme: Agent evaluation is becoming deployment-aware and harder to trust
- Why it matters: Static benchmarks are increasingly insufficient or unstable. Several papers show that rankings, safety scores, and fidelity claims can change materially with judge configuration, deployment signals, or population-level diagnostics.
- Representative papers:
- How Sensitive Are Safety Benchmarks to Judge Configuration Choices?
- AgentPulse: A Continuous Multi-Signal Framework for Evaluating AI Agents in Deployment
- OS-SPEAR: A Toolkit for the Safety, Performance, Efficiency, and Robustness Analysis of OS Agents
- The Chameleon’s Limit: Investigating Persona Collapse and Homogenization in Large Language Models
- Common approach:
- Expand evaluation beyond task success to include safety, efficiency, robustness, adoption, sentiment, or population geometry.
- Use factorized diagnostics to separate complementary signals rather than collapse everything into one leaderboard.
- Stress-test evaluators themselves via prompt variants, cross-factor validation, or population-level metrics.
- Release tooling/registries to support continuous rather than one-off measurement.
- Open questions / failure modes:
- Composite metrics can conflict with benchmark rankings, especially for closed-source high-capability systems.
- Judge variance remains large without human anchors.
- Some analyses are underpowered or rely on public observables that systematically miss proprietary usage.
- Population-level fidelity lacks strong human references outside a few domains.
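The judge-sensitivity concern above can be operationalized as a small audit loop that reports the spread of harmful-rate estimates across judge-prompt variants; `judge` is an assumed wrapper around whatever LLM-as-judge call is in use:

```python
from statistics import mean

def judge_variance_audit(responses, judge_prompts, judge):
    """Harmful rate per judge-prompt variant, plus the max-min spread.

    `judge(prompt, response) -> bool` is an assumed callable; in practice it
    would wrap an LLM-as-judge request. The spread lower-bounds the
    judge-induced variance of the benchmark score.
    """
    rates = {}
    for name, prompt in judge_prompts.items():
        verdicts = [judge(prompt, r) for r in responses]
        rates[name] = mean(verdicts)  # fraction judged harmful
    spread = max(rates.values()) - min(rates.values())
    return rates, spread
```

If the spread approaches the deltas you care about between models, the comparison is not trustworthy without a human anchor.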
Theme: Structured scaffolding beats naive context stuffing in long-horizon agents
- Why it matters: Across distillation, planning, skill use, and scientific/data-analysis agents, the main failure is not lack of information but poor sequencing, incorporation, and recovery. Systems that explicitly control horizon, memory, or verification outperform flat prompting and long-context baselines.
- Representative papers:
- TCOD: Exploring Temporal Curriculum in On-Policy Distillation for Multi-turn Autonomous Agents
- Rewarding the Scientific Process: Process-Level Reward Modeling for Agentic Data Analysis
- Skill Retrieval Augmentation for Agentic AI
- Can Current Agents Close the Discovery-to-Application Gap? A Case Study in Minecraft
- Common approach:
- Break long-horizon behavior into staged curricula, process rewards, or explicit sub-agent roles.
- Distinguish retrieval from incorporation and application; the bottleneck is often need-aware use, not recall.
- Reward or verify intermediate steps, especially exploratory-but-correctable ones.
- Use structured consolidation formats or memory to preserve discovered knowledge across episodes.
- Open questions / failure modes:
- Strong scaffolds can improve discovery but leave a large execution/application gap.
- Some methods depend on teacher trajectories or expensive annotation pipelines.
- Need-aware skill loading remains weak even when the right skill is retrieved.
- Generalization beyond text benchmarks or synthetic environments is still uncertain.
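One concrete instance of staged horizon control is a rollout-horizon schedule for on-policy distillation. The linear shape and the bounds below are illustrative assumptions, not TCOD's actual schedule:

```python
def horizon_schedule(step: int, total_steps: int, min_h: int = 1, max_h: int = 8) -> int:
    """Temporal curriculum: linearly extend the on-policy rollout horizon.

    Early in training the student stays close to teacher trajectories
    (short horizons); later it is exposed to longer self-generated
    trajectories, which is where multi-turn instability tends to appear.
    """
    frac = min(step / max(total_steps, 1), 1.0)
    return min_h + round(frac * (max_h - min_h))
```

The same pacing idea generalizes to staged hints in discovery agents and incremental memory consolidation: expose hard, compounding contexts only after short-horizon behavior is stable.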
Theme: Grounding, verification, and evidence selection are moving upstream
- Why it matters: Hallucination mitigation is shifting from output filtering to better evidence selection, self-correction, and explicit verification. The common lesson is that relevance is not enough; systems need contribution-aware evidence and in-distribution correction loops.
- Representative papers:
- QED: An Open-Source Multi-Agent System for Generating Mathematical Proofs on Open Problems
- MEG-RAG: Quantifying Multi-modal Evidence Grounding for Evidence Selection in RAG
- Aligning with Your Own Voice: Self-Corrected Preference Learning for Hallucination Mitigation in LVLMs
- XGRAG: A Graph-Native Framework for Explaining KG-based Retrieval-Augmented Generation
- Common approach:
- Separate generation from verification, often with specialized verifier stages or models.
- Score evidence by marginal contribution to answer confidence rather than semantic similarity alone.
- Generate preference data from the model’s own corrected outputs to reduce teacher-distribution mismatch.
- Use structured outputs and graph-native perturbations to make attribution auditable.
- Open questions / failure modes:
- Verification pipelines can be expensive and domain-limited.
- Some methods depend on teacher models, heuristic thresholds, or small case-study samples.
- Automatic verification still misses complex multi-object or long-chain dependencies.
- Better grounding does not fully solve conservativeness or coverage trade-offs.
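The contribution-over-relevance idea can be sketched as leave-one-out reranking; `answer_confidence` is an assumed scorer (for example, the generator's probability of its current answer given an evidence subset), and this is a simplification of what a method like MEG-RAG actually measures:

```python
def marginal_contribution_rerank(evidence: list, answer_confidence) -> list:
    """Rerank evidence by the leave-one-out drop in answer confidence.

    A passage can be semantically similar to the query yet contribute
    nothing to the answer; this scores each piece by how much confidence
    falls when it is removed from the evidence set.
    """
    base = answer_confidence(evidence)
    contrib = {}
    for i, e in enumerate(evidence):
        subset = evidence[:i] + evidence[i + 1:]
        contrib[e] = base - answer_confidence(subset)  # drop when e is removed
    return sorted(evidence, key=lambda e: contrib[e], reverse=True)
```

Leave-one-out is O(n) scorer calls and ignores interactions between evidence pieces; it is a cheap first cut, not a substitute for the full grounding metrics these papers propose.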
Theme: Security research is targeting indirect, adaptive, and multi-agent attack surfaces
- Why it matters: The threat model is broadening from direct jailbreaks to deception, backdoors, prompt injection through tools/content, and infection across agent networks. Defenses now need to handle hidden triggers, indirect intent shifts, and topology-level propagation.
- Representative papers:
- Jailbreaking Frontier Foundation Models Through Intention Deception
- Defusing the Trigger: Plug-and-Play Defense for Backdoored LLMs via Tail-Risk Intrinsic Geometric Smoothing
- Dynamic Cyber Ranges
- GAMMAF: A Common Framework for Graph-Based Anomaly Monitoring Benchmarking in LLM Multi-Agent Systems
- Common approach:
- Model attacks as multi-turn or multi-stage processes rather than single prompts.
- Use internal signals or graph structure to detect anomalous propagation and isolate compromised nodes.
- Benchmark defenses in live, adaptive environments instead of static datasets.
- Emphasize operational metrics like latency, token cost, and containment time alongside attack success rate (ASR).
- Open questions / failure modes:
- Some defenses only partially handle router-targeted or highly adaptive attacks.
- Dynamic environments improve realism but make reproducibility and attribution harder.
- Attack success can depend heavily on evaluator/judge setup.
- Many results are benchmark-bound and may not transfer cleanly to production systems.
Theme: High-stakes domains are exposing the limits of aggregate metrics
- Why it matters: Clinical and other safety-critical settings show that average accuracy can hide instability, fairness drift, or clinically significant errors. The more useful papers measure temporal consistency, subgroup effects, and case-specific quality rather than just top-line performance.
- Representative papers:
- An empirical evaluation of the risks of AI model updates using clinical data: stability, arbitrariness, and fairness
- Agentic clinical reasoning over longitudinal myeloma records: a retrospective evaluation against expert consensus
- Green Shielding: A User-Centric Approach Towards Trustworthy AI
- Case-Specific Rubrics for Clinical AI Evaluation: Methodology, Validation, and LLM-Clinician Agreement Across 823 Encounters
- Common approach:
- Evaluate stability, subgroup fairness, citation sufficiency, or clinically weighted rubric criteria alongside accuracy.
- Compare agentic/tool-using systems against RAG and long-context baselines on real longitudinal records.
- Use realistic user inputs and case-specific rubrics rather than generic benchmark prompts.
- Audit abstention and neutralization mechanisms for unequal subgroup or coverage effects.
- Open questions / failure modes:
- Better aggregate concordance can still leave more clinically significant residual errors.
- Neutralization and abstention often improve plausibility while reducing coverage of critical conditions.
- Generalization across institutions, specialties, and workflows remains limited.
- Expert-authored evaluation is costly, and automated proxies still need calibration.
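A minimal version of the update-stability audit these papers argue for is a per-example flip-rate comparison between model versions, broken out by subgroup; inputs and grouping here are illustrative:

```python
def update_stability_audit(preds_old: list, preds_new: list, groups: list):
    """Per-example prediction flip rate between two model versions.

    Aggregate accuracy can stay flat across an update while many
    individual predictions change ("arbitrariness"); reporting flip
    rates per subgroup also surfaces uneven fairness effects.
    """
    flips = [a != b for a, b in zip(preds_old, preds_new)]
    overall = sum(flips) / len(flips)
    by_group = {}
    for g in set(groups):
        idx = [i for i, gg in enumerate(groups) if gg == g]
        by_group[g] = sum(flips[i] for i in idx) / len(idx)
    return overall, by_group
```

In a clinical setting the same loop would be run over held-out longitudinal records on every model update, with flip rates tracked over time rather than checked once.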
3) Technical synthesis
- Several papers converge on a monitor-then-intervene pattern: LCF monitors hidden-state deltas before generation, AgentVisor audits proposed tool calls, TIGS screens attention collapse before smoothing, and clinical abstention methods defer on out-of-distribution cases.
- Structured intermediate representations are a recurring enabler: YAML proof DAGs in QED, semantic exceptions in AgentVisor, task/API guideline memories in MEMCoder, structured memory in clinical agents, and claim-proof-constraints-example summaries in SCICRAFTER.
- A common failure mode across agent papers is retrieval/incorporation mismatch: retrieving the right skill, evidence, or document is often easier than getting the model to use it correctly.
- Multiple works replace binary correctness with graded process signals: DataPRM’s ternary rewards, clinical rubric weighting, and multi-factor deployment scores all capture recoverable vs irrecoverable errors better than pass/fail metrics.
- Curriculum and pacing appear as a general stabilization tool: TCOD controls rollout horizon during distillation; discovery agents improve with staged hints/scientist scaffolds; memory systems evolve guidelines over time rather than injecting everything at once.
- Security defenses increasingly rely on internal geometry or topology, not just text classification: attention collapse, layerwise convergence fingerprints, graph anomaly monitoring, and graph-native perturbation explanations.
- Several papers show benchmark ceilings can be misleading: iterative RAG and full-context converge in longitudinal clinical reasoning; benchmark-only rankings diverge from deployment-aware rankings; per-persona fidelity hides population collapse.
- There is a growing split between architectural papers with strong conceptual framing but weak quantitative validation and benchmark-heavy empirical papers with narrower scope; combining both remains rare.
- In multimodal settings, the strongest gains come from evidence contribution modeling rather than raw relevance, whether for reranking (MEG-RAG) or hallucination correction (AVES-DPO).
- Across domains, utility-preserving defense is the differentiator: one-shot self-correction, asynchronous cache updates, process rewards for exploration, and selective skill loading all try to avoid the usual safety-vs-performance collapse.
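The monitor-then-intervene pattern in the first bullet can be sketched as a layerwise drift check before generation; the fingerprint statistic (mean token-state norm per layer) and the threshold are placeholders, not the actual features used by LCF or any paper above:

```python
import numpy as np

def layer_fingerprint(hidden_states):
    """One summary statistic per layer: mean L2 norm of token hidden states.

    Real runtime monitors use richer layerwise convergence features; this
    scalar-per-layer version just illustrates the monitoring interface.
    """
    return np.array([np.linalg.norm(h, axis=-1).mean() for h in hidden_states])

def monitor_then_intervene(hidden_states, baseline, threshold, generate, abstain):
    """Defer (abstain) when layerwise drift from a benign baseline is too large."""
    drift = np.abs(layer_fingerprint(hidden_states) - baseline).max()
    return abstain() if drift > threshold else generate()
```

The same interface covers the other instances in the list: swap the fingerprint for tool-call semantics (AgentVisor-style auditing) or attention statistics (TIGS-style screening) and the intervene branch for a corrected retry instead of abstention.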
4) Top 5 papers (with “why now”)
- AgentVisor: Defending LLM Agents Against Prompt Injection via Semantic Virtualization
- Reframes agent security as privilege separation: an untrusted Guest proposes actions, a trusted Visor audits them via Suitability, Taint, and Integrity checks.
- Achieves near-zero ASR on evaluated direct and indirect prompt-injection benchmarks while preserving substantial utility under attack.
- The one-shot semantic-exception recovery path is practically useful because it avoids the utility collapse of block-only defenses.
- Why now: prompt injection is moving from toy demos to real tool-using agents, and this is one of the clearest deployable architectures for mediation.
- Skepticism: adds latency, focuses on text settings, and long-context/multimodal scaling is still unresolved.
- Rewarding the Scientific Process: Process-Level Reward Modeling for Agentic Data Analysis
- Identifies two concrete PRM failure modes in data-analysis agents: silent semantic errors and over-penalized exploratory grounding steps.
- DataPRM uses environment-aware ReAct verification, tool calls, and ternary rewards to improve both test-time scaling and RL training.
- A 4B verifier outperforming larger PRM baselines is especially relevant for practical agent stacks.
- Why now: agentic scientific/data-analysis workflows are proliferating, and process supervision is becoming more important than final-answer scoring.
- Skepticism: scope is still mostly reasoning/visualization tasks, and the verifier pipeline adds compute and annotation overhead.
- Jailbreaking Frontier Foundation Models Through Intention Deception
- Introduces para-jailbreaking: models can refuse direct harmful requests yet still leak harmful alternative content under a benign-seeming narrative.
- iDecep shows strong multi-turn attack success against frontier systems, including multimodal amplification with benign images.
- The paper matters because it targets the newer safe-completion regime rather than older refusal-only defenses.
- Why now: as labs shift to “helpful but safe” completions, indirect leakage becomes a more realistic failure mode than blunt refusal bypasses.
- Skepticism: black-box experiments are limited in scope, and exact attack tooling is withheld, making replication and defense benchmarking harder.
- Agentic clinical reasoning over longitudinal myeloma records: a retrospective evaluation against expert consensus
- Shows a structured agentic system can beat both iterative RAG and full-context baselines on complex longitudinal clinical reasoning.
- Gains are largest on the hardest questions and longest records, where current non-agentic methods appear to hit a ceiling.
- The ablation suggests the skill library, not just tool access, is the main driver of improvement.
- Why now: this is a concrete signal that agentic structure may finally outperform brute-force retrieval/context expansion in a real high-stakes domain.
- Skepticism: retrospective, institution-specific, and residual system errors are more often clinically significant than expert disagreements.
- How Sensitive Are Safety Benchmarks to Judge Configuration Choices?
- Quantifies a major but under-discussed source of benchmark instability: judge prompt wording alone shifts harmful-rate estimates by up to 24.2 points.
- Shows even surface rewording within the same prompt condition can cause large swings and ranking reversals.
- Provides a direct methodological warning for anyone using LLM-as-judge safety scores in model comparison or governance.
- Why now: safety benchmarking is increasingly used for deployment and policy decisions, but many reported deltas may be smaller than judge-induced variance.
- Skepticism: primary analysis is centered on one judge model and one benchmark, without a human accuracy anchor.
5) Practical next steps
- Run multi-prompt judge audits for any internal safety benchmark; report ranges and ranking stability, not just a single harmfulness number.
- Add a runtime mediation layer for tool-using agents: at minimum, audit tool suitability, goal alignment, and argument integrity before execution.
- Instrument agents with prefill/runtime anomaly signals where possible—hidden-state or action-sequence monitors can catch failures that output filters miss.
- For long-horizon agents, test curriculum exposure and process-level rewards before scaling context or model size; many failures are sequencing failures.
- Separate your agent stack into retrieval, incorporation, and application metrics. If performance is flat, check whether the model is actually using retrieved skills/evidence.
- In RAG and multimodal systems, rerank by marginal evidence contribution rather than semantic similarity alone; relevance without contribution is a common hallucination source.
- In high-stakes deployments, track stability, subgroup effects, and abstention distribution across updates, not just aggregate accuracy.
- For evaluation of synthetic users or multi-agent populations, add population-level geometry checks (coverage, uniformity, complexity) to catch homogenization hidden by per-instance fidelity.
- If you deploy autonomous agents in adversarial settings, benchmark them in dynamic environments with active defenders or topology updates, not only static tasks.
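The population-geometry check mentioned above can be sketched as a pairwise-similarity audit over persona or user embeddings; the 0.9 threshold is an arbitrary placeholder, and real diagnostics would also measure coverage and complexity, not just uniformity:

```python
import numpy as np

def homogenization_check(embeddings, sim_threshold: float = 0.9):
    """Flag population collapse via mean pairwise cosine similarity.

    Each persona can look individually faithful while the population
    collapses toward a few modes; a high mean pairwise similarity is a
    cheap signal of that homogenization.
    """
    X = np.asarray(embeddings, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-normalize rows
    sims = X @ X.T
    n = len(X)
    mean_sim = (sims.sum() - n) / (n * (n - 1))  # exclude self-similarity
    return mean_sim, mean_sim > sim_threshold
```

Run this on the embedding of each simulated persona's outputs, not on the persona prompts themselves, since collapse typically appears in behavior rather than in the specifications.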
Generated from per-paper analyses; no external browsing.
