Daily AI Paper Report (2026-04-11)

Published: 2026-04-11

Chinese version: [中文]

Run stats

  • Candidates: 285
  • Selected: 30
  • Deepread completed: 30
  • Window (UTC): 2026-04-09T00:00:00Z → 2026-04-10T00:00:00Z (arxiv_announce, expanded=0)
Selected papers

| arXiv ID | Title | Categories | Score | Why | Tags |
|---|---|---|---|---|---|
| 2604.08407 | Your Agent Is Mine: Measuring Malicious Intermediary Attacks on the LLM Supply Chain | cs.CR | 95 | First systematic study of malicious LLM API routers: injection + secret-exfiltration threat model & measurements | agent-security, supply-chain, tool-calling, api-routers, exfiltration, threat-model, measurement |
| 2604.07667 | From Debate to Decision: Conformal Social Choice for Safe Multi-Agent Deliberation | cs.AI, cs.MA, cs.SI | 95 | Calibrated act-vs-escalate layer for multi-agent debate; conformal guarantees against wrong consensus. | agent-safety, multi-agent, debate, calibration, conformal-prediction, decision-making, escalation |
| 2604.08499 | PIArena: A Platform for Prompt Injection Evaluation | cs.CR, cs.AI, cs.CL, cs.LG | 93 | Unified, extensible prompt-injection evaluation platform to compare attacks/defenses across datasets/tasks | prompt-injection, evaluation, benchmarking, security, defenses, attacks, platform |
| 2604.08523 | ClawBench: Can AI Agents Complete Everyday Online Tasks? | cs.CL, cs.AI | 93 | Realistic live-web agent benchmark (153 tasks/144 platforms); strong eval value for agentic systems | agents, benchmark, web, evaluation, tool-use, robustness |
| 2604.07988 | LogAct: Enabling Agentic Reliability via Shared Logs | cs.DC, cs.AI | 93 | Shared-log state-machine abstraction enables pre-execution veto, recovery, and auditing for agents. | agents, reliability, safety, auditing, execution-control, fault-tolerance, governance |
| 2604.07775 | ACIArena: Toward Unified Evaluation for Agent Cascading Injection | cs.AI, cs.CL, cs.CR | 92 | Unified evaluation for multi-agent cascading injection across surfaces/objectives; fills MAS security gap | multi-agent, agent-security, cascading-injection, evaluation-suite, robustness, exfiltration |
| 2604.07749 | Beyond Social Pressure: Benchmarking Epistemic Attack in Large Language Models | cs.CL | 92 | New benchmark for epistemic attacks beyond sycophancy; useful for robustness evals | LLM-safety, robustness, evaluation, jailbreaks, sociotechnical, benchmark |
| 2604.07695 | AITH: A Post-Quantum Continuous Delegation Protocol for Human-AI Trust Establishment | cs.CR, cs.AI | 91 | Cryptographic continuous delegation + revocation for AI agents; concrete protocol for bounded autonomy | agent-security, delegation, access-control, cryptography, post-quantum, governance |
| 2604.08064 | ImplicitMemBench: Measuring Unconscious Behavioral Adaptation in Large Language Models | cs.AI | 91 | New benchmark for implicit (procedural/priming/conditioning) memory in LLM agents; safety-relevant behavior drift. | benchmark, agent-memory, evaluation, behavior, implicit-learning, reliability |
| 2604.08178 | Aligning Agents via Planning: A Benchmark for Trajectory-Level Reward Modeling | cs.AI | 90 | Trajectory-level RM benchmark for tool-using agents; tests safety refusal & tool constraints beyond RLHF | reward-modeling, agents, benchmark, tool-use, trajectory-eval, alignment, safety |
| 2604.08423 | Synthetic Data for any Differentiable Target | cs.CL, cs.AI, cs.LG, stat.ML | 90 | Optimizes synthetic data to steer models via higher-order gradients; big safety + misuse implications | data-poisoning, model-steering, synthetic-data, data-attribution, security, alignment |
| 2604.08059 | Governed Capability Evolution for Embodied Agents: Safe Upgrade, Compatibility Checking, and Runtime Rollback for Embodied Capability Modules | cs.RO, cs.AI | 89 | Systems framework for safe capability upgrades with compatibility checks and runtime rollback in robots | embodied-agents, governance, safe-upgrades, rollback, runtime-safety, modularity |
| 2604.07776 | Structured Distillation of Web Agent Capabilities Enables Generalization | cs.LG | 89 | Structured synthetic trajectories distill web-agent skills into 9B; strong WebArena gains for open models. | web-agents, distillation, synthetic-data, tool-use, evaluation, open-weights |
| 2604.07831 | Are GUI Agents Focused Enough? Automated Distraction via Semantic-level UI Element Injection | cs.CR, cs.CL, cs.CV | 88 | Practical red-teaming for GUI agents via semantic UI overlay injection; no white-box access required | gui-agents, red-teaming, vision-attacks, visual-grounding, adversarial-ui, robustness |
| 2604.08395 | Phantasia: Context-Adaptive Backdoors in Vision Language Models | cs.CV, cs.AI | 88 | Shows stealth of VLM backdoors overestimated; proposes/assesses defenses for multimodal backdoors | security, backdoors, VLM, data-poisoning, adversarial, defenses |
| 2604.08525 | Ads in AI Chatbots? An Analysis of How Large Language Models Navigate Conflicts of Interest | cs.AI, cs.CL, cs.CY | 88 | Analyzes LLM conflicts of interest with ads; important for deployment incentives & alignment | alignment, deployment, conflicts-of-interest, ads, policy, AI-governance |
| 2604.08005 | Preference Redirection via Attention Concentration: An Attack on Computer Use Agents | cs.LG | 87 | Vision-side attack on computer-use agents by attention redirection to an adversarial patch; preference hijack | computer-use-agents, multimodal-security, adversarial-patch, attention, manipulation, vision |
| 2604.08297 | Towards Identification and Intervention of Safety-Critical Parameters in Large Language Models | cs.CR | 86 | ESI identifies safety-critical parameters and enables targeted safety interventions across dense vs MoE models | mechanistic-safety, parameter-intervention, MoE, interpretability, safety-control, robustness |
| 2604.07755 | An Empirical Analysis of Static Analysis Methods for Detection and Mitigation of Code Library Hallucinations | cs.CL, cs.SE | 86 | Quantifies how far static analysis can detect/mitigate library hallucinations; clear limits + upper bounds | hallucinations, code, static-analysis, reliability, evaluation, tooling |
| 2604.07927 | EigentSearch-Q+: Enhancing Deep Research Agents with Structured Reasoning Tools | cs.AI | 86 | Adds structured query/evidence tools to deep-research agents; reduces redundancy and improves evidence use. | agents, deep-research, web-search, tooling, evidence, reasoning |
| 2604.08527 | Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models | cs.CL, cs.LG | 86 | Identifies OPD length-inflation instability and proposes stabilization; relevant to post-training | LLM-training, distillation, RLHF, post-training, stability, repetition |
| 2604.08476 | Faithful GRPO: Improving Visual Spatial Reasoning in Multimodal Language Models via Constrained Policy Optimization | cs.CV, cs.AI | 85 | Constrained RL (Faithful GRPO) targets CoT faithfulness/visual grounding, not just accuracy, in MRMs | multimodal, RLVR, GRPO, faithfulness, grounding, reasoning-eval |
| 2604.08046 | Guaranteeing Knowledge Integration with Joint Decoding for Retrieval-Augmented Generation | cs.CL | 85 | Targets the RAG integration bottleneck via joint decoding that forces evidence extraction over parametric priors. | RAG, grounding, faithfulness, decoding, hallucination, knowledge-integration |
| 2604.07754 | The Art of (Mis)alignment: How Fine-Tuning Methods Effectively Misalign and Realign LLMs in Post-Training | cs.CR, cs.CL | 84 | Systematic study of how fine-tuning can misalign/realign safety-aligned LLMs; relevant for model reuse | misalignment, post-training, fine-tuning, realignment, safety, open-models |
| 2604.07877 | MemReader: From Passive to Active Extraction for Long-Term Agent Memory | cs.CL | 84 | Active long-term memory extraction (GRPO/ReAct) to reduce memory pollution and improve consistency in agents. | agent-memory, personalization, RL, GRPO, information-extraction, reliability |
| 2604.08426 | KV Cache Offloading for Context-Intensive Tasks | cs.LG, cs.AI, cs.CL | 84 | Evaluates KV-cache offloading on context-intensive tasks; releases the Text2JSON benchmark | long-context, systems, KV-cache, efficiency, benchmark, information-extraction |
| 2604.08169 | Activation Steering for Aligned Open-ended Generation without Sacrificing Coherence | cs.AI | 83 | Runtime activation-steering methods to maintain aligned open-ended generation beyond the first tokens | activation-steering, runtime-guardrails, alignment, robustness, representation, generation |
| 2604.07801 | TEMPER: Testing Emotional Perturbation in Quantitative Reasoning | cs.CL, cs.AI | 83 | TEMPER dataset shows emotion framing alone drops math accuracy 2–10pp; useful robustness stress test | robustness, evaluation, reasoning, emotion, dataset, GSM8K |
| 2604.07789 | ORACLE-SWE: Quantifying the Contribution of Oracle Information Signals on SWE Agents | cs.MA, cs.CL, cs.SE | 83 | Quantifies the value of oracle signals for SWE agents; clarifies what context helps most | agents, software-engineering, evaluation, tool-use, oracles, benchmarks |
| 2604.07929 | Same Outcomes, Different Journeys: A Trace-Level Framework for Comparing Human and GUI-Agent Behavior in Production Search Systems | cs.IR, cs.AI | 82 | Trace-level human vs GUI-agent behavior comparison in production search; goes beyond success metrics | GUI-agents, evaluation, human-comparison, behavior-traces, search, deployment |

AI Paper Insight Brief

2026-04-11

0) Executive takeaways (read this first)

  • “Safe to act” is becoming a first-class output of agent systems: conformal set-valued decisions for debate (risk-budgeted escalation) and log/vote-based execution gating both reduce catastrophic automated actions by turning uncertain outputs into structured refusal / review.
  • Agent security is shifting from prompt injection to system and supply-chain attack surfaces: cascading multi-agent injection, malicious API routers rewriting tool calls, and visual/UI-level manipulation of GUI/CUA agents all bypass classic text-only defenses.
  • Post-training is a dual-use battleground: preference tuning (ORPO) can rapidly misalign safety-aligned open models (even with tiny data via LoRA), while targeted parameter-level methods (ESI→SET/SPA) and activation steering offer efficient realignment/preservation when you have white-box access (a generic steering sketch follows this list).
  • Benchmarks are getting more realistic and more diagnostic: live-web tasks (ClawBench) expose a large gap vs sandbox benchmarks; trajectory-level reward modeling (Plan-RewardBench) shows evaluators collapse at long contexts; implicit memory (ImplicitMemBench) reveals “unconscious” adaptation failures not fixed by retrieval.
  • RAG and code reliability work is moving beyond retrieval: joint decoding for evidence integration (GuarantRAG) targets “retrieved-but-ignored” failures; static analysis catches a meaningful but bounded fraction of Python library hallucinations, with clear upper bounds.
  • Training/inference infrastructure details matter for robustness: OPD can collapse via length inflation; KV-cache offloading can silently degrade accuracy on context-intensive tasks—both are “systems” failure modes that look like model failures.
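
The white-box side of that split is easy to picture. Below is a generic activation-steering sketch, not the specific method of any paper above: it nudges one layer's hidden states along a fixed direction via a PyTorch forward hook. The layer path in the usage comment and the origin of `direction` (e.g., a contrastive mean of aligned vs. unaligned activations) are assumptions.

```python
# Generic activation-steering sketch (illustrative; not any paper's exact method):
# add a scaled direction to a chosen layer's hidden states at inference time.
import torch

def add_steering_hook(layer: torch.nn.Module, direction: torch.Tensor,
                      scale: float = 4.0):
    """Register a forward hook that shifts the layer output by `scale * direction`.

    Assumes a HuggingFace-style decoder layer whose forward output is either a
    tensor or a tuple with hidden states first. Returns the hook handle so the
    caller can `.remove()` it after generation.
    """
    unit = direction / direction.norm()  # steer along a unit vector, scaled explicitly

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + scale * unit.to(hidden.device, hidden.dtype)
        # Returning a value from a forward hook replaces the layer's output.
        return (steered,) + tuple(output[1:]) if isinstance(output, tuple) else steered

    return layer.register_forward_hook(hook)

# Usage sketch (Llama-style module path is an assumption):
# handle = add_steering_hook(model.model.layers[20], refusal_direction)
# ... model.generate(...) ...
# handle.remove()
```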

1) Key themes (clusters)

Theme: Risk-controlled autonomy (refusal, gating, recovery)

Theme: Agent attack surfaces beyond text (multi-agent, UI/vision, routers)

Theme: Post-training alignment is fragile (and can be targeted)

Theme: Evaluation realism + long-horizon diagnostics for agents

Theme: Grounding and integration (RAG, memory, code)

2) Technical synthesis

  • Multiple papers converge on “structured intermediates” as the reliability lever: prediction sets (conformal), typed logs (AgentBus), explicit tool-arguments (Q+), oracle signals (ORACLE-SWE), and trajectory pairs (Plan-RewardBench).
  • Selection effects are repeatedly exploited: conformal singletons are accurate because they abstain; judge-filtered synthetic trajectories outperform larger unfiltered sets; shadow deployment catches regressions sandbox misses.
  • LLM-as-judge is everywhere, but papers increasingly report judge validation (e.g., PPT-Bench human agreement; FGRPO κ=0.997 vs GPT-5) and/or add judge-independent signals (activation steering uses embedding similarity, cross-entropy, ELO).
  • Robustness failures are increasingly non-adversarial in appearance: emotional framing degrades math; UI icons are “safety-aligned”; router rewrites remain schema-valid; OPD collapse looks like “model got worse” but is a training dynamic.
  • There’s a clear split between black-box deployable defenses (conformal layer; prompt mitigations; static analysis; client-side router gates/logging) and white-box mechanistic defenses (activation steering; ESI parameter interventions; PRAC patch crafting).
  • Long-horizon settings expose evaluator brittleness: Plan-RewardBench shows pairwise LLM judges collapse past ~32k tokens, motivating more robust discriminative RMs or hierarchical evaluation.
  • “Memory” is bifurcating into explicit stores (MemReader active writes) vs implicit behavioral adaptation (ImplicitMemBench), and the latter is not solved by retrieval alone.
  • Systems work (KV offloading, OPD stability) shows that inference/training optimizations can silently change task accuracy, so robustness evaluation must include infrastructure variants (a simple length/repetition canary is sketched after this list).
  • Security evaluation is moving toward ecosystem measurement (router markets, poisoning studies) rather than only lab attacks, yielding concrete prevalence numbers and operational mitigations.
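
On the infrastructure point, a cheap canary is often enough to catch these regressions before they look like "the model got worse." The check below is an illustrative monitor (not from the OPD or KV-offloading papers): it tracks mean output length and duplicate-n-gram rate, both of which spike under length inflation or repetition collapse.

```python
# Illustrative output-health canary: mean length and duplicate-n-gram rate
# across sampled generations. Spikes in either are cheap signals of
# length inflation or degenerate repetition.
from collections import Counter

def repetition_rate(tokens: list[int], n: int = 4) -> float:
    """Fraction of n-grams that are duplicates (0.0 = no repetition)."""
    if len(tokens) < n:
        return 0.0
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    total = sum(counts.values())
    return 1.0 - len(counts) / total

def output_health(batch_of_token_ids: list[list[int]]) -> dict:
    """Summarize a batch of generations; compare across infra/training variants."""
    lengths = [len(t) for t in batch_of_token_ids]
    reps = [repetition_rate(t) for t in batch_of_token_ids]
    return {"mean_len": sum(lengths) / len(lengths),
            "mean_rep": sum(reps) / len(reps)}
```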

3) Top 5 papers (with “why now”)

1) Your Agent Is Mine: Measuring Malicious Intermediary Attacks on the LLM Supply Chain

  • Quantifies a real, under-discussed risk: API routers terminate TLS and can rewrite executable tool-call JSON.
  • Large ecosystem measurement (28 paid + 400 free routers) with observed active injection and credential touching, plus poisoning studies.
  • Evaluates practical client-side mitigations (policy gate, anomaly screening, transparency logging) and shows the attack proxy is compatible across agent frameworks; a client-side gating sketch follows this list.
  • Skepticism: client-side defenses don’t provide cryptographic provenance; measurement may miss untriggered adaptive behaviors.
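
As a sketch of what those client-side mitigations could look like in practice (the tool names and policy below are hypothetical, and this is not the paper's code): fail closed on high-risk tool calls returned through the router, and hash-chain an append-only log so tampering is detectable.

```python
# Hypothetical client-side gate for tool calls received via an untrusted
# LLM API router: deny-by-default for high-risk tools, plus a hash-chained
# transparency log for later forensics.
import hashlib
import json
import time

HIGH_RISK_TOOLS = {"shell.exec", "fs.write", "payments.transfer"}  # illustrative names

def append_log(path: str, record: dict) -> None:
    """Append a record whose hash chains to the previous line (tamper-evident)."""
    prev_hash = "0" * 64
    try:
        with open(path, "rb") as f:
            lines = f.read().splitlines()
        if lines:
            prev_hash = hashlib.sha256(lines[-1]).hexdigest()
    except FileNotFoundError:
        pass  # first entry chains to the zero hash
    record["prev"] = prev_hash
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, sort_keys=True) + "\n")

def gate_tool_call(call: dict, log_path: str = "toolcalls.log") -> bool:
    """Return True only if the call may execute; high-risk tools fail closed."""
    allowed = call.get("name") not in HIGH_RISK_TOOLS
    append_log(log_path, {"ts": time.time(), "call": call, "allowed": allowed})
    return allowed
```

As the skepticism bullet notes, this buys auditability, not cryptographic provenance of what the upstream model actually emitted.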

2) From Debate to Decision: Conformal Social Choice for Safe Multi-Agent Deliberation

  • Reframes debate output as act vs escalate under a user-set risk budget α, with split-conformal marginal coverage.
  • Empirically targets a key failure: wrong unanimous consensus (23.9% of initially-disagreeing cases converge to a unanimous wrong answer by round 3); the conformal layer intercepts 81.9% of these at α=0.05 by escalating.
  • Black-box and post-hoc: deployable on proprietary models via verbalized probabilities + aggregation (a split-conformal sketch follows this list).
  • Skepticism: guarantees are marginal and assume exchangeability; evaluated in closed-set multiple-choice.
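
A minimal sketch of such an act/escalate layer, assuming you already have per-option probabilities aggregated across debaters (array shapes and function names here are illustrative, not the paper's):

```python
# Split-conformal act/escalate sketch for closed-set (multiple-choice) debate.
# Nonconformity = 1 - aggregated probability assigned to the true answer.
import numpy as np

def calibrate_threshold(cal_probs: np.ndarray, cal_labels: np.ndarray,
                        alpha: float = 0.05) -> float:
    """cal_probs: (n, K) aggregated option probabilities on held-out debates."""
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample-corrected quantile gives marginal coverage >= 1 - alpha
    # (under exchangeability, per the skepticism bullet above).
    q_level = min(1.0, np.ceil((n + 1) * (1.0 - alpha)) / n)
    return float(np.quantile(scores, q_level, method="higher"))

def act_or_escalate(probs: np.ndarray, q_hat: float):
    """Act only when the conformal prediction set is a singleton."""
    pred_set = np.flatnonzero(1.0 - probs <= q_hat)
    if pred_set.size == 1:
        return "act", int(pred_set[0])
    return "escalate", pred_set.tolist()  # route to human review
```

Acting only on singletons is what produces the selection effect noted in the synthesis: singleton decisions look accurate partly because ambiguous cases are escalated instead of answered.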

3) ClawBench: Can AI Agents Complete Everyday Online Tasks?

  • Live-web benchmark with safe interception of terminal submissions and five-layer trace recording—bridges realism and safety.
  • Shows a stark gap vs sandbox benchmarks: best model reported (Claude Sonnet 4.6) at 33.3% SR; GPT-5.4 at 6.5%.
  • Provides traceable failure diagnostics via an agentic evaluator that compares agent runs to human trajectories.
  • Skepticism: live-web variability and manual endpoint annotation limit scalability and reproducibility.

4) ACIArena: Toward Unified Evaluation for Agent Cascading Injection

  • Standardizes multi-agent cascading injection evaluation across 28 attacks, 3 surfaces, 3 objectives, and integrates six MAS frameworks.
  • Finds high vulnerability (code tasks often 90–100% ASR; LLM Debate cited at 100% hijacking ASR) and that some defenses can fail or trade utility.
  • Proposes ACI-SENTINEL (semantic minimality pruning) with large ASR reductions in reported cases.
  • Skepticism: evaluation scale constrained by query cost; defense introduces utility trade-offs and isn’t universally effective.

5) The Art of (Mis)alignment: How Fine-Tuning Methods Effectively Misalign and Realign LLMs in Post-Training

  • Maps attacker/defender dynamics across common methods: ORPO strongest for misalignment; DPO strongest for realignment (often with utility cost).
  • Shows misalignment can be data-efficient (LoRA effective with as few as 13 unsafe samples in some settings).
  • Highlights model-specific resistance patterns (Gemma2 resists SFT misalignment but not ORPO).
  • Skepticism: unsafety relies on LLM-judge ensemble; excludes proprietary models and full RLHF.

4) Practical next steps

  • Add an act/escalate layer to any multi-agent or ensemble system: implement split conformal on aggregated probabilities (or analogous scores) and measure automated-error reduction vs escalation rate.
  • For tool-using agents, treat routers as untrusted: deploy fail-closed policies for high-risk tools, add response anomaly screening, and implement append-only transparency logs for forensics.
  • Red-team GUI/CUA stacks with non-text attacks: semantic UI icon injection and visual preference redirection; measure persistence and cross-model transfer, not just single-shot success.
  • If you ship open-weight models, assume post-training misalignment is cheap: test ORPO/LoRA-style adversarial tuning on your release candidates; evaluate how well DPO or targeted interventions recover safety and what utility you lose.
  • Upgrade your evaluation to include live-web or trace-level metrics (navigation divergence, tool-use effort bias) and long-context judge failure checks (e.g., >32k token trajectories).
  • For SWE agents, prioritize reproduction test generation/extraction and richer execution context: ORACLE-SWE suggests reproduction tests dominate oracle gains and combined signals approach near-complete success.
  • Audit infrastructure changes (KV offloading, OPD/RL pipelines) with context-intensive benchmarks and length/repetition monitoring; treat “optimization” as a potential accuracy regression source.
  • For RAG, measure integration failures (parametric override / disjointed integration) and test dual-path + fusion approaches; don't assume retrieval improvements translate to factuality (an illustrative joint-decoding sketch follows this list).
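
One way to probe the "retrieved-but-ignored" failure is a contrastive decoding step in the spirit of context-aware decoding; the sketch below is a stand-in, not GuarantRAG's actual joint-decoding algorithm, and assumes a HuggingFace-style causal LM exposing `.logits`.

```python
# Evidence-weighted decoding step (illustrative stand-in for joint decoding):
# contrast next-token logits with and without retrieved evidence in context,
# upweighting tokens the evidence actually supports.
import torch

@torch.no_grad()
def evidence_weighted_step(model, ids_with_evidence: torch.Tensor,
                           ids_without_evidence: torch.Tensor,
                           beta: float = 1.0) -> torch.Tensor:
    """Both inputs are (batch, seq) token ids; returns greedy next tokens."""
    logits_ev = model(ids_with_evidence).logits[:, -1, :]      # evidence-conditioned
    logits_par = model(ids_without_evidence).logits[:, -1, :]  # parametric-only
    # beta = 0 recovers ordinary RAG decoding; larger beta penalizes tokens
    # favored only by the model's parametric priors.
    adjusted = (1.0 + beta) * logits_ev - beta * logits_par
    return adjusted.argmax(dim=-1)
```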

Generated from per-paper analyses; no external browsing.