Daily AI Paper Report (2026-04-11)
Published:
Chinese version: [中文]
Run stats
- Candidates: 285
- Selected: 30
- Deepread completed: 30
- Window (UTC): 2026-04-09T00:00:00Z → 2026-04-10T00:00:00Z (arxiv_announce, expanded=0)
| arXiv ID | Title / Links | Categories | Score | Why | Tags |
|---|---|---|---|---|---|
| 2604.08407 | Your Agent Is Mine: Measuring Malicious Intermediary Attacks on the LLM Supply Chain | cs.CR | 95 | First systematic study of malicious LLM API routers: injection+secret exfiltration threat model & measurements | agent-security, supply-chain, tool-calling, api-routers, exfiltration, threat-model, measurement |
| 2604.07667 | From Debate to Decision: Conformal Social Choice for Safe Multi-Agent Deliberation | cs.AI, cs.MA, cs.SI | 95 | Calibrated act-vs-escalate layer for multi-agent debate; conformal guarantees against wrong consensus. | agent-safety, multi-agent, debate, calibration, conformal-prediction, decision-making, escalation |
| 2604.08499 | PIArena: A Platform for Prompt Injection Evaluation | cs.CR, cs.AI, cs.CL, cs.LG | 93 | Unified, extensible prompt-injection evaluation platform to compare attacks/defenses across datasets/tasks | prompt-injection, evaluation, benchmarking, security, defenses, attacks, platform |
| 2604.08523 | ClawBench: Can AI Agents Complete Everyday Online Tasks? | cs.CL, cs.AI | 93 | Realistic live-web agent benchmark (153 tasks/144 platforms); strong eval value for agentic systems | agents, benchmark, web, evaluation, tool-use, robustness |
| 2604.07988 | LogAct: Enabling Agentic Reliability via Shared Logs | cs.DC, cs.AI | 93 | Shared-log state-machine abstraction enables pre-execution veto, recovery, and auditing for agents. | agents, reliability, safety, auditing, execution-control, fault-tolerance, governance |
| 2604.07775 | ACIArena: Toward Unified Evaluation for Agent Cascading Injection | cs.AI, cs.CL, cs.CR | 92 | Unified evaluation for multi-agent cascading injection across surfaces/objectives; fills MAS security gap | multi-agent, agent-security, cascading-injection, evaluation-suite, robustness, exfiltration |
| 2604.07749 | Beyond Social Pressure: Benchmarking Epistemic Attack in Large Language Models | cs.CL | 92 | New benchmark for epistemic attacks beyond sycophancy; useful for robustness evals | LLM-safety, robustness, evaluation, jailbreaks, sociotechnical, benchmark |
| 2604.07695 | AITH: A Post-Quantum Continuous Delegation Protocol for Human-AI Trust Establishment | cs.CR, cs.AI | 91 | Cryptographic continuous delegation + revocation for AI agents; concrete protocol for bounded autonomy | agent-security, delegation, access-control, cryptography, post-quantum, governance |
| 2604.08064 | ImplicitMemBench: Measuring Unconscious Behavioral Adaptation in Large Language Models | cs.AI | 91 | New benchmark for implicit (procedural/priming/conditioning) memory in LLM agents; safety-relevant behavior drift. | benchmark, agent-memory, evaluation, behavior, implicit-learning, reliability |
| 2604.08178 | Aligning Agents via Planning: A Benchmark for Trajectory-Level Reward Modeling | cs.AI | 90 | Trajectory-level RM benchmark for tool-using agents; tests safety refusal & tool constraints beyond RLHF | reward-modeling, agents, benchmark, tool-use, trajectory-eval, alignment, safety |
| 2604.08423 | Synthetic Data for any Differentiable Target | cs.CL, cs.AI, cs.LG, stat.ML | 90 | Optimizes synthetic data to steer models via higher-order gradients; big safety+misuse implications | data-poisoning, model-steering, synthetic-data, data-attribution, security, alignment |
| 2604.08059 | Governed Capability Evolution for Embodied Agents: Safe Upgrade, Compatibility Checking, and Runtime Rollback for Embodied Capability Modules | cs.RO, cs.AI | 89 | Systems framework for safe capability upgrades with compatibility checks and runtime rollback in robots | embodied-agents, governance, safe-upgrades, rollback, runtime-safety, modularity |
| 2604.07776 | Structured Distillation of Web Agent Capabilities Enables Generalization | cs.LG | 89 | Structured synthetic trajectories distill web-agent skills into 9B; strong WebArena gains for open models. | web-agents, distillation, synthetic-data, tool-use, evaluation, open-weights |
| 2604.07831 | Are GUI Agents Focused Enough? Automated Distraction via Semantic-level UI Element Injection | cs.CR, cs.CL, cs.CV | 88 | Practical red-teaming for GUI agents via semantic UI overlay injection; no white-box access required | gui-agents, red-teaming, vision-attacks, visual-grounding, adversarial-ui, robustness |
| 2604.08395 | Phantasia: Context-Adaptive Backdoors in Vision Language Models | cs.CV, cs.AI | 88 | Shows stealth of VLM backdoors overestimated; proposes/assesses defenses for multimodal backdoors | security, backdoors, VLM, data-poisoning, adversarial, defenses |
| 2604.08525 | Ads in AI Chatbots? An Analysis of How Large Language Models Navigate Conflicts of Interest | cs.AI, cs.CL, cs.CY | 88 | Analyzes LLM conflicts of interest with ads; important for deployment incentives & alignment | alignment, deployment, conflicts-of-interest, ads, policy, AI-governance |
| 2604.08005 | Preference Redirection via Attention Concentration: An Attack on Computer Use Agents | cs.LG | 87 | Vision-side attack on computer-use agents by attention redirection to adversarial patch; preference hijack | computer-use-agents, multimodal-security, adversarial-patch, attention, manipulation, vision |
| 2604.08297 | Towards Identification and Intervention of Safety-Critical Parameters in Large Language Models | cs.CR | 86 | ESI identifies safety-critical parameters and enables targeted safety interventions across dense vs MoE | mechanistic-safety, parameter-intervention, MoE, interpretability, safety-control, robustness |
| 2604.07755 | An Empirical Analysis of Static Analysis Methods for Detection and Mitigation of Code Library Hallucinations | cs.CL, cs.SE | 86 | Quantifies how far static analysis can detect/mitigate library hallucinations; clear limits + upper bounds | hallucinations, code, static-analysis, reliability, evaluation, tooling |
| 2604.07927 | EigentSearch-Q+: Enhancing Deep Research Agents with Structured Reasoning Tools | cs.AI | 86 | Adds structured query/evidence tools to deep-research agents; reduces redundancy and improves evidence use. | agents, deep-research, web-search, tooling, evidence, reasoning |
| 2604.08527 | Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models | cs.CL, cs.LG | 86 | Identifies OPD length-inflation instability and proposes stabilization; relevant to post-training | LLM-training, distillation, RLHF, post-training, stability, repetition |
| 2604.08476 | Faithful GRPO: Improving Visual Spatial Reasoning in Multimodal Language Models via Constrained Policy Optimization | cs.CV, cs.AI | 85 | Constrained RL (Faithful GRPO) targets CoT faithfulness/visual grounding, not just accuracy, in MRMs | multimodal, RLVR, GRPO, faithfulness, grounding, reasoning-eval |
| 2604.08046 | Guaranteeing Knowledge Integration with Joint Decoding for Retrieval-Augmented Generation | cs.CL | 85 | Targets RAG integration bottleneck via joint decoding that forces evidence extraction over parametric priors. | RAG, grounding, faithfulness, decoding, hallucination, knowledge-integration |
| 2604.07754 | The Art of (Mis)alignment: How Fine-Tuning Methods Effectively Misalign and Realign LLMs in Post-Training | cs.CR, cs.CL | 84 | Systematic study of how fine-tuning can misalign/realign safety-aligned LLMs; relevant for model reuse | misalignment, post-training, fine-tuning, realignment, safety, open-models |
| 2604.07877 | MemReader: From Passive to Active Extraction for Long-Term Agent Memory | cs.CL | 84 | Active long-term memory extraction (GRPO/ReAct) to reduce memory pollution and improve consistency in agents. | agent-memory, personalization, RL, GRPO, information-extraction, reliability |
| 2604.08426 | KV Cache Offloading for Context-Intensive Tasks | cs.LG, cs.AI, cs.CL | 84 | Evaluates KV-cache offloading on context-intensive tasks; releases Text2JSON benchmark | long-context, systems, KV-cache, efficiency, benchmark, information-extraction |
| 2604.08169 | Activation Steering for Aligned Open-ended Generation without Sacrificing Coherence | cs.AI | 83 | Runtime activation steering methods to maintain aligned open-ended generation beyond first tokens | activation-steering, runtime-guardrails, alignment, robustness, representation, generation |
| 2604.07801 | TEMPER: Testing Emotional Perturbation in Quantitative Reasoning | cs.CL, cs.AI | 83 | TEMPER dataset shows emotion framing alone drops math accuracy 2–10pp; useful robustness stress test | robustness, evaluation, reasoning, emotion, dataset, GSM8K |
| 2604.07789 | ORACLE-SWE: Quantifying the Contribution of Oracle Information Signals on SWE Agents | cs.MA, cs.CL, cs.SE | 83 | Quantifies value of oracle signals for SWE agents; clarifies what context helps most | agents, software-engineering, evaluation, tool-use, oracles, benchmarks |
| 2604.07929 | Same Outcomes, Different Journeys: A Trace-Level Framework for Comparing Human and GUI-Agent Behavior in Production Search Systems | cs.IR, cs.AI | 82 | Trace-level human vs GUI-agent behavior comparison in production search; goes beyond success metrics | GUI-agents, evaluation, human-comparison, behavior-traces, search, deployment |
AI Paper Insight Brief
2026-04-11
0) Executive takeaways (read this first)
- “Safe to act” is becoming a first-class output of agent systems: conformal set-valued decisions for debate (risk-budgeted escalation) and log/vote-based execution gating both reduce catastrophic automated actions by turning uncertain outputs into structured refusal / review.
- Agent security is shifting from prompt injection to system and supply-chain attack surfaces: cascading multi-agent injection, malicious API routers rewriting tool calls, and visual/UI-level manipulation of GUI/CUA agents all bypass classic text-only defenses.
- Post-training is a dual-use battleground: preference tuning (ORPO) can rapidly misalign safety-aligned open models (even with tiny data via LoRA), while targeted parameter-level methods (ESI→SET/SPA) and activation steering offer efficient realignment/preservation—when you have white-box access.
- Benchmarks are getting more realistic and more diagnostic: live-web tasks (ClawBench) expose a large gap vs sandbox benchmarks; trajectory-level reward modeling (Plan-RewardBench) shows evaluators collapse at long contexts; implicit memory (ImplicitMemBench) reveals “unconscious” adaptation failures not fixed by retrieval.
- RAG and code reliability work is moving beyond retrieval: joint decoding for evidence integration (GuarantRAG) targets “retrieved-but-ignored” failures; static analysis catches a meaningful but bounded fraction of Python library hallucinations, with clear upper bounds.
- Training/inference infrastructure details matter for robustness: OPD can collapse via length inflation; KV-cache offloading can silently degrade accuracy on context-intensive tasks—both are “systems” failure modes that look like model failures.
2) Key themes (clusters)
Theme: Risk-controlled autonomy (refusal, gating, recovery)
- Why it matters: As agents take real actions, the key question is often when not to act. Mechanisms that convert model outputs into calibrated escalation, auditable gating, and recoverable execution reduce irreversible failures.
- Representative papers:
From Debate to Decision: Conformal Social Choice for Safe Multi-Agent Deliberation
LogAct: Enabling Agentic Reliability via Shared Logs
Governed Capability Evolution for Embodied Agents: Safe Upgrade, Compatibility Checking, and Runtime Rollback for Embodied Capability Modules
- Common approach:
- Turn point predictions into structured decisions (set-valued outputs; commit/abort; staged activation).
- Add explicit policy layers (voters/deciders; compatibility checkers; action hierarchies).
- Emphasize auditability + rollback/recovery (durable logs; shadow deployment; rollback controllers).
- Open questions / failure modes:
- Guarantees are often marginal / population-level (e.g., split conformal) rather than conditional/per-instance.
- Distribution shift and changing environments can break calibration and governance assumptions.
- Safety layers can impose a utility/latency tax (more human review; more tool calls; slower execution).
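The act/escalate pattern above can be sketched with split conformal prediction. This is a minimal illustration under the usual exchangeability assumption; the score and threshold choices here are generic, not any paper's exact recipe:

```python
import numpy as np

def conformal_threshold(cal_probs, cal_labels, alpha=0.05):
    """Calibrate a split-conformal score threshold at risk budget alpha."""
    # Nonconformity score: 1 - probability assigned to the true label.
    scores = 1.0 - cal_probs[np.arange(len(cal_labels)), cal_labels]
    n = len(scores)
    # Finite-sample-corrected quantile level, capped at 1.
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return np.quantile(scores, level, method="higher")

def act_or_escalate(probs, threshold):
    """Act only when the calibrated prediction set is a singleton."""
    pred_set = np.flatnonzero(1.0 - probs <= threshold)
    if len(pred_set) == 1:
        return "act", int(pred_set[0])
    return "escalate", pred_set.tolist()  # ambiguous or empty -> human review
```

With a held-out calibration split, singleton sets carry marginal (population-level) coverage only, which is exactly the caveat flagged above.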
Theme: Agent attack surfaces beyond text (multi-agent, UI/vision, routers)
- Why it matters: Many deployed agent stacks have intermediaries (routers), multiple agents (trust propagation), and visual grounding (GUI/CUA). These surfaces enable attacks that remain schema-valid and “benign-looking,” evading text-only filters.
- Representative papers:
- Your Agent Is Mine: Measuring Malicious Intermediary Attacks on the LLM Supply Chain
- ACIArena: Toward Unified Evaluation for Agent Cascading Injection
- Are GUI Agents Focused Enough? Automated Distraction via Semantic-level UI Element Injection
- Preference Redirection via Attention Concentration: An Attack on Computer Use Agents
- Common approach:
- Benchmarking with standardized suites (ACIArena’s 1,356 cases; router market measurement; GUI injection metrics L1/L2).
- Attacks that preserve surface validity (schema-valid JSON rewrites; safety-aligned UI icons; small ℓ∞ patches).
- Evaluate transferability/persistence (cross-victim icon transfer; prompt-variant robustness; adaptive evasion triggers).
- Open questions / failure modes:
- Long-term fix for routers likely needs provider-backed integrity/provenance, not just client heuristics.
- GUI/CUA defenses are underdeveloped; current work shows attacks but limited mitigation evaluation.
- Multi-agent defenses can trade off utility; pruning state (ACI-SENTINEL) helps but isn’t universal.
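A fail-closed client-side gate of the kind these papers motivate might look like the following. The tool names and JSON schema are hypothetical; a real deployment would also need provenance checks, since the gate alone cannot prove who produced the response:

```python
import json

# Hypothetical tool lists; in practice these come from your agent's config.
ALLOWED_TOOLS = {"search_web", "read_file", "send_email"}
HIGH_RISK_TOOLS = {"send_email"}  # require human confirmation

def gate_tool_call(raw_response: str, audit_log: list):
    """Fail-closed gate for tool calls relayed through an untrusted router.

    Returns the parsed call only if it is well-formed, allowlisted, and
    low-risk; everything else is logged and blocked (fail closed).
    """
    try:
        call = json.loads(raw_response)
        name, args = call["name"], call["arguments"]
    except (ValueError, KeyError, TypeError):
        audit_log.append({"decision": "reject", "reason": "malformed"})
        return None
    if name not in ALLOWED_TOOLS:
        audit_log.append({"decision": "reject", "reason": f"unknown tool: {name}"})
        return None
    if name in HIGH_RISK_TOOLS:
        audit_log.append({"decision": "review", "tool": name, "args": args})
        return None
    audit_log.append({"decision": "allow", "tool": name})
    return call
```

Note that a malicious router's rewrites can stay schema-valid, so the allowlist and risk tiers, not JSON validation, do the real work here.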
Theme: Post-training alignment is fragile (and can be targeted)
- Why it matters: Open models can be misaligned quickly with accessible fine-tuning; defenders need efficient ways to restore or preserve safety without destroying utility.
- Representative papers:
The Art of (Mis)alignment: How Fine-Tuning Methods Effectively Misalign and Realign LLMs in Post-Training
Towards Identification and Intervention of Safety-Critical Parameters in Large Language Models
Activation Steering for Aligned Open-ended Generation without Sacrificing Coherence
- Common approach:
- Compare fine-tuning methods as attacker/defender tools (ORPO vs DPO; PEFT vs PFT).
- Use mechanistic/white-box levers (parameter ranking via ESI; activation-space steering with per-token gating).
- Measure safety with ASR/unsafety and track utility regressions (MMLU/GSM8K/etc.; coherence metrics).
- Open questions / failure modes:
- White-box methods (ESI, steering) don’t transfer to closed APIs.
- Safety metrics often rely on LLM judges, which can be imperfect.
- Dual-use risk: the same tools that realign can also be used to misalign or evade.
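In numpy terms, per-token gated activation steering reduces to something like the sketch below. The gating rule, threshold, and strength alpha are illustrative placeholders, not the paper's exact method:

```python
import numpy as np

def steer_hidden_states(hidden, steer_vec, threshold=0.5, alpha=4.0):
    """Add a steering direction only to tokens the gate selects.

    hidden: (seq_len, d_model) activations at one layer.
    steer_vec: direction whose presence we want to encourage.
    """
    direction = steer_vec / np.linalg.norm(steer_vec)
    # Gate per token: steer only where the activation's projection onto the
    # direction is still below the threshold (avoids over-steering).
    proj = hidden @ direction
    gate = (proj < threshold).astype(hidden.dtype)
    return hidden + alpha * gate[:, None] * direction[None, :]
```

Gating per token, rather than steering every position uniformly, is one way to preserve coherence deep into open-ended generations, which is the failure mode this theme targets.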
Theme: Evaluation realism + long-horizon diagnostics for agents
- Why it matters: Outcome-only metrics hide failure modes (navigation divergence, evaluator length collapse, implicit memory gaps). New benchmarks emphasize traces, long contexts, and real websites.
- Representative papers:
- ClawBench: Can AI Agents Complete Everyday Online Tasks?
- Aligning Agents via Planning: A Benchmark for Trajectory-Level Reward Modeling
- ImplicitMemBench: Measuring Unconscious Behavioral Adaptation in Large Language Models
- Same Outcomes, Different Journeys: A Trace-Level Framework for Comparing Human and GUI-Agent Behavior in Production Search Systems
- Common approach:
- Collect trace-level artifacts (multi-layer recordings; state transition graphs; tool trajectories).
- Stress long-context and multi-step settings where judges/RMs degrade.
- Add hard negatives / near-misses to prevent superficial scoring.
- Open questions / failure modes:
- Live-web benchmarks face reproducibility drift; evaluators can be model-dependent.
- Trajectory judges collapse beyond ~32k tokens; scaling evaluation remains unsolved.
- Implicit memory failures persist even with memory-augmented agents (per reported results).
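Trace-level comparison can start as simply as a normalized edit distance over action sequences. A minimal sketch; the action vocabulary and lack of weighting are our simplifications, not any benchmark's metric:

```python
def trace_divergence(human_actions, agent_actions):
    """Normalized Levenshtein distance between two action traces.

    0.0 means identical journeys; 1.0 means completely disjoint ones,
    even when both traces end in the same outcome.
    """
    m, n = len(human_actions), len(agent_actions)
    if max(m, n) == 0:
        return 0.0
    # Standard dynamic-programming edit distance.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if human_actions[i - 1] == agent_actions[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # delete
                           dp[i][j - 1] + 1,        # insert
                           dp[i - 1][j - 1] + cost) # substitute
    return dp[m][n] / max(m, n)
```

A metric like this captures the "same outcomes, different journeys" gap that success-rate-only scoring hides.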
Theme: Grounding and integration (RAG, memory, code)
- Why it matters: Reliability failures often come from integration, not retrieval: ignoring evidence, writing polluted memory, or inventing APIs. Practical pipelines are emerging with measurable bounds.
- Representative papers:
Guaranteeing Knowledge Integration with Joint Decoding for Retrieval-Augmented Generation
MemReader: From Passive to Active Extraction for Long-Term Agent Memory
An Empirical Analysis of Static Analysis Methods for Detection and Mitigation of Code Library Hallucinations
- Common approach:
- Separate concerns: reasoning vs evidence (Inner-Answer vs Refer-Answer; active memory actions).
- Use post-hoc or tool-based layers (joint decoding interventions; search/add/buffer/ignore; static analyzers + repair).
- Quantify upper bounds and trade-offs (static analysis catchability bounds; memory update correctness; hallucination reductions).
- Open questions / failure modes:
- Added latency/compute (dual-path generation; multi-step memory management).
- Dependence on docstrings/type info limits static methods; dynamic languages remain hard.
- Long-term online stability of memory managers remains to be validated beyond benchmarks.
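A first static check for library hallucinations can be as small as resolving imports against the current environment. This is a coarse sketch; real tooling also verifies attribute and signature validity, which is where the upper bounds discussed above bite:

```python
import ast
import importlib.util

def find_unresolvable_imports(source: str):
    """Return imported module names that cannot be resolved locally."""
    missing = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names = [node.module]
        else:
            continue
        for name in names:
            # Resolve only the top-level package; submodule and member
            # existence need a deeper check.
            if importlib.util.find_spec(name.split(".")[0]) is None:
                missing.append(name)
    return missing
```

Dynamic attribute access, lazy imports, and untyped APIs all escape this kind of check, consistent with the bounded-catchability finding.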
3) Technical synthesis
- Multiple papers converge on “structured intermediates” as the reliability lever: prediction sets (conformal), typed logs (AgentBus), explicit tool-arguments (Q+), oracle signals (ORACLE-SWE), and trajectory pairs (Plan-RewardBench).
- Selection effects are repeatedly exploited: conformal singletons are accurate because they abstain; judge-filtered synthetic trajectories outperform larger unfiltered sets; shadow deployment catches regressions sandbox misses.
- LLM-as-judge is everywhere, but papers increasingly report judge validation (e.g., PPT-Bench human agreement; FGRPO κ=0.997 vs GPT-5) and/or add judge-independent signals (activation steering uses embedding similarity, cross-entropy, ELO).
- Robustness failures are increasingly non-adversarial in appearance: emotional framing degrades math; UI icons are “safety-aligned”; router rewrites remain schema-valid; OPD collapse looks like “model got worse” but is a training dynamic.
- There’s a clear split between black-box deployable defenses (conformal layer; prompt mitigations; static analysis; client-side router gates/logging) and white-box mechanistic defenses (activation steering; ESI parameter interventions; PRAC patch crafting).
- Long-horizon settings expose evaluator brittleness: Plan-RewardBench shows pairwise LLM judges collapse past ~32k tokens, motivating more robust discriminative RMs or hierarchical evaluation.
- “Memory” is bifurcating into explicit stores (MemReader active writes) vs implicit behavioral adaptation (ImplicitMemBench), and the latter is not solved by retrieval alone.
- Systems work (KV offloading, OPD stability) shows that inference/training optimizations can silently change task accuracy, so robustness evaluation must include infrastructure variants.
- Security evaluation is moving toward ecosystem measurement (router markets, poisoning studies) rather than only lab attacks, yielding concrete prevalence numbers and operational mitigations.
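The length-inflation and repetition failure modes called out above can be caught with two cheap monitors. Illustrative only: token lists stand in for your tokenizer's output, and alert thresholds are yours to set:

```python
def repetition_rate(tokens, n=3):
    """Fraction of duplicate n-grams; spikes signal degenerate repetition."""
    if len(tokens) < n:
        return 0.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return 1.0 - len(set(ngrams)) / len(ngrams)

def length_inflation(baseline_lengths, current_lengths):
    """Mean output length vs a frozen baseline; values well above 1 flag inflation."""
    baseline = sum(baseline_lengths) / len(baseline_lengths)
    return (sum(current_lengths) / len(current_lengths)) / baseline
```

Run against a frozen baseline, these turn "the model got worse" into a checkable hypothesis about the training or serving pipeline.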
4) Top 5 papers (with “why now”)
1) Your Agent Is Mine: Measuring Malicious Intermediary Attacks on the LLM Supply Chain
- Quantifies a real, under-discussed risk: API routers terminate TLS and can rewrite executable tool-call JSON.
- Large ecosystem measurement (28 paid + 400 free routers) with observed active injection and credential touching, plus poisoning studies.
- Evaluates practical client-side mitigations (policy gate, anomaly screening, transparency logging) and shows compatibility of the attack proxy across agent frameworks.
- Skepticism: client-side defenses don’t provide cryptographic provenance; measurement may miss untriggered adaptive behaviors.
2) From Debate to Decision: Conformal Social Choice for Safe Multi-Agent Deliberation
- Reframes debate output as act vs escalate under a user-set risk budget α, with split-conformal marginal coverage.
- Empirically targets a key failure: wrong unanimous consensus (23.9% of initially-disagreeing cases converge to a unanimous wrong answer by round 3); the conformal layer intercepts 81.9% at α=0.05 by escalating.
- Black-box and post-hoc: deployable on proprietary models via verbalized probabilities + aggregation.
- Skepticism: guarantees are marginal and assume exchangeability; evaluated in closed-set multiple-choice.
3) ClawBench: Can AI Agents Complete Everyday Online Tasks?
- Live-web benchmark with safe interception of terminal submissions and five-layer trace recording—bridges realism and safety.
- Shows a stark gap vs sandbox benchmarks: best model reported (Claude Sonnet 4.6) at 33.3% SR; GPT-5.4 at 6.5%.
- Provides traceable failure diagnostics via an agentic evaluator comparing to human trajectories.
- Skepticism: live-web variability and manual endpoint annotation limit scalability and reproducibility.
4) ACIArena: Toward Unified Evaluation for Agent Cascading Injection
- Standardizes multi-agent cascading injection evaluation across 28 attacks, 3 surfaces, 3 objectives, and integrates six MAS frameworks.
- Finds high vulnerability (code tasks often 90–100% ASR; LLM Debate cited at 100% hijacking ASR) and that some defenses can fail or trade utility.
- Proposes ACI-SENTINEL (semantic minimality pruning) with large ASR reductions in reported cases.
- Skepticism: evaluation scale constrained by query cost; defense introduces utility trade-offs and isn’t universally effective.
5) The Art of (Mis)alignment: How Fine-Tuning Methods Effectively Misalign and Realign LLMs in Post-Training
- Maps attacker/defender dynamics across common methods: ORPO strongest for misalignment; DPO strongest for realignment (often with utility cost).
- Shows misalignment can be data-efficient (LoRA effective with as few as 13 unsafe samples in some settings).
- Highlights model-specific resistance patterns (Gemma2 resists SFT misalignment but not ORPO).
- Skepticism: unsafety relies on LLM-judge ensemble; excludes proprietary models and full RLHF.
5) Practical next steps
- Add an act/escalate layer to any multi-agent or ensemble system: implement split conformal on aggregated probabilities (or analogous scores) and measure automated-error reduction vs escalation rate.
- For tool-using agents, treat routers as untrusted: deploy fail-closed policies for high-risk tools, add response anomaly screening, and implement append-only transparency logs for forensics.
- Red-team GUI/CUA stacks with non-text attacks: semantic UI icon injection and visual preference redirection; measure persistence and cross-model transfer, not just single-shot success.
- If you ship open-weight models, assume post-training misalignment is cheap: test ORPO/LoRA-style adversarial tuning on your release candidates; evaluate how well DPO or targeted interventions recover safety and what utility you lose.
- Upgrade your evaluation to include live-web or trace-level metrics (navigation divergence, tool-use effort bias) and long-context judge failure checks (e.g., >32k token trajectories).
- For SWE agents, prioritize reproduction test generation/extraction and richer execution context: ORACLE-SWE suggests reproduction tests dominate oracle gains and combined signals approach near-complete success.
- Audit infrastructure changes (KV offloading, OPD/RL pipelines) with context-intensive benchmarks and length/repetition monitoring; treat “optimization” as a potential accuracy regression source.
- For RAG, measure integration failures (parametric override / disjointed integration) and test dual-path + fusion approaches; don’t assume retrieval improvements translate to factuality.
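The append-only transparency log suggested above can be prototyped as a hash chain. A minimal sketch; production logging would add signing and external anchoring so a compromised client cannot silently rewrite its own chain:

```python
import hashlib
import json

class TransparencyLog:
    """Append-only log where each entry commits to the hash of its predecessor."""

    GENESIS = "0" * 64

    def __init__(self):
        self.entries = []
        self._last_hash = self.GENESIS

    def append(self, record: dict):
        body = json.dumps({"prev": self._last_hash, "record": record}, sort_keys=True)
        digest = hashlib.sha256(body.encode()).hexdigest()
        self.entries.append({"prev": self._last_hash, "record": record, "hash": digest})
        self._last_hash = digest

    def verify(self) -> bool:
        """Recompute the chain; any tampered or reordered entry breaks it."""
        prev = self.GENESIS
        for entry in self.entries:
            body = json.dumps({"prev": prev, "record": entry["record"]}, sort_keys=True)
            if entry["prev"] != prev or hashlib.sha256(body.encode()).hexdigest() != entry["hash"]:
                return False
            prev = entry["hash"]
        return True
```

Chaining hashes makes after-the-fact edits detectable, which is the forensic property the router-threat papers argue client heuristics alone cannot provide.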
Generated from per-paper analyses; no external browsing.
