Daily AI Paper Report (2026-04-11)
Published:
Chinese version: [中文]
Run stats
- Candidates: 285
- Selected: 30
- Deepread completed: 30
- Window (UTC): 2026-04-09T00:00:00Z → 2026-04-10T00:00:00Z (arxiv_announce, expanded=0)
| arXiv ID | Title / Links | Categories | Score | Why | Tags |
|---|---|---|---|---|---|
| 2604.08407 | Your Agent Is Mine: Measuring Malicious Intermediary Attacks on the LLM Supply Chain | cs.CR | 95 | First systematic study of malicious LLM API routers: injection+secret exfiltration threat model & measurements | agent-security, supply-chain, tool-calling, api-routers, exfiltration, threat-model, measurement |
| 2604.07667 | From Debate to Decision: Conformal Social Choice for Safe Multi-Agent Deliberation | cs.AI, cs.MA, cs.SI | 95 | Calibrated act-vs-escalate layer for multi-agent debate; conformal guarantees against wrong consensus. | agent-safety, multi-agent, debate, calibration, conformal-prediction, decision-making, escalation |
| 2604.08499 | PIArena: A Platform for Prompt Injection Evaluation | cs.CR, cs.AI, cs.CL, cs.LG | 93 | Unified, extensible prompt-injection evaluation platform to compare attacks/defenses across datasets/tasks | prompt-injection, evaluation, benchmarking, security, defenses, attacks, platform |
| 2604.08523 | ClawBench: Can AI Agents Complete Everyday Online Tasks? | cs.CL, cs.AI | 93 | Realistic live-web agent benchmark (153 tasks/144 platforms); strong eval value for agentic systems | agents, benchmark, web, evaluation, tool-use, robustness |
| 2604.07988 | LogAct: Enabling Agentic Reliability via Shared Logs | cs.DC, cs.AI | 93 | Shared-log state-machine abstraction enables pre-execution veto, recovery, and auditing for agents. | agents, reliability, safety, auditing, execution-control, fault-tolerance, governance |
| 2604.07775 | ACIArena: Toward Unified Evaluation for Agent Cascading Injection | cs.AI, cs.CL, cs.CR | 92 | Unified evaluation for multi-agent cascading injection across surfaces/objectives; fills MAS security gap | multi-agent, agent-security, cascading-injection, evaluation-suite, robustness, exfiltration |
| 2604.07749 | Beyond Social Pressure: Benchmarking Epistemic Attack in Large Language Models | cs.CL | 92 | New benchmark for epistemic attacks beyond sycophancy; useful for robustness evals | LLM-safety, robustness, evaluation, jailbreaks, sociotechnical, benchmark |
| 2604.07695 | AITH: A Post-Quantum Continuous Delegation Protocol for Human-AI Trust Establishment | cs.CR, cs.AI | 91 | Cryptographic continuous delegation + revocation for AI agents; concrete protocol for bounded autonomy | agent-security, delegation, access-control, cryptography, post-quantum, governance |
| 2604.08064 | ImplicitMemBench: Measuring Unconscious Behavioral Adaptation in Large Language Models | cs.AI | 91 | New benchmark for implicit (procedural/priming/conditioning) memory in LLM agents; safety-relevant behavior drift. | benchmark, agent-memory, evaluation, behavior, implicit-learning, reliability |
| 2604.08178 | Aligning Agents via Planning: A Benchmark for Trajectory-Level Reward Modeling | cs.AI | 90 | Trajectory-level RM benchmark for tool-using agents; tests safety refusal & tool constraints beyond RLHF | reward-modeling, agents, benchmark, tool-use, trajectory-eval, alignment, safety |
| 2604.08423 | Synthetic Data for any Differentiable Target | cs.CL, cs.AI, cs.LG, stat.ML | 90 | Optimizes synthetic data to steer models via higher-order gradients; big safety+misuse implications | data-poisoning, model-steering, synthetic-data, data-attribution, security, alignment |
| 2604.08059 | Governed Capability Evolution for Embodied Agents: Safe Upgrade, Compatibility Checking, and Runtime Rollback for Embodied Capability Modules | cs.RO, cs.AI | 89 | Systems framework for safe capability upgrades with compatibility checks and runtime rollback in robots | embodied-agents, governance, safe-upgrades, rollback, runtime-safety, modularity |
| 2604.07776 | Structured Distillation of Web Agent Capabilities Enables Generalization | cs.LG | 89 | Structured synthetic trajectories distill web-agent skills into 9B; strong WebArena gains for open models. | web-agents, distillation, synthetic-data, tool-use, evaluation, open-weights |
| 2604.07831 | Are GUI Agents Focused Enough? Automated Distraction via Semantic-level UI Element Injection | cs.CR, cs.CL, cs.CV | 88 | Practical red-teaming for GUI agents via semantic UI overlay injection; no white-box access required | gui-agents, red-teaming, vision-attacks, visual-grounding, adversarial-ui, robustness |
| 2604.08395 | Phantasia: Context-Adaptive Backdoors in Vision Language Models | cs.CV, cs.AI | 88 | Shows stealth of VLM backdoors overestimated; proposes/assesses defenses for multimodal backdoors | security, backdoors, VLM, data-poisoning, adversarial, defenses |
| 2604.08525 | Ads in AI Chatbots? An Analysis of How Large Language Models Navigate Conflicts of Interest | cs.AI, cs.CL, cs.CY | 88 | Analyzes LLM conflicts of interest with ads; important for deployment incentives & alignment | alignment, deployment, conflicts-of-interest, ads, policy, AI-governance |
| 2604.08005 | Preference Redirection via Attention Concentration: An Attack on Computer Use Agents | cs.LG | 87 | Vision-side attack on computer-use agents by attention redirection to adversarial patch; preference hijack | computer-use-agents, multimodal-security, adversarial-patch, attention, manipulation, vision |
| 2604.08297 | Towards Identification and Intervention of Safety-Critical Parameters in Large Language Models | cs.CR | 86 | ESI identifies safety-critical parameters and enables targeted safety interventions across dense vs MoE | mechanistic-safety, parameter-intervention, MoE, interpretability, safety-control, robustness |
| 2604.07755 | An Empirical Analysis of Static Analysis Methods for Detection and Mitigation of Code Library Hallucinations | cs.CL, cs.SE | 86 | Quantifies how far static analysis can detect/mitigate library hallucinations; clear limits + upper bounds | hallucinations, code, static-analysis, reliability, evaluation, tooling |
| 2604.07927 | EigentSearch-Q+: Enhancing Deep Research Agents with Structured Reasoning Tools | cs.AI | 86 | Adds structured query/evidence tools to deep-research agents; reduces redundancy and improves evidence use. | agents, deep-research, web-search, tooling, evidence, reasoning |
| 2604.08527 | Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models | cs.CL, cs.LG | 86 | Identifies OPD length-inflation instability and proposes stabilization; relevant to post-training | LLM-training, distillation, RLHF, post-training, stability, repetition |
| 2604.08476 | Faithful GRPO: Improving Visual Spatial Reasoning in Multimodal Language Models via Constrained Policy Optimization | cs.CV, cs.AI | 85 | Constrained RL (Faithful GRPO) targets CoT faithfulness/visual grounding, not just accuracy, in MRMs | multimodal, RLVR, GRPO, faithfulness, grounding, reasoning-eval |
| 2604.08046 | Guaranteeing Knowledge Integration with Joint Decoding for Retrieval-Augmented Generation | cs.CL | 85 | Targets RAG integration bottleneck via joint decoding that forces evidence extraction over parametric priors. | RAG, grounding, faithfulness, decoding, hallucination, knowledge-integration |
| 2604.07754 | The Art of (Mis)alignment: How Fine-Tuning Methods Effectively Misalign and Realign LLMs in Post-Training | cs.CR, cs.CL | 84 | Systematic study of how fine-tuning can misalign/realign safety-aligned LLMs; relevant for model reuse | misalignment, post-training, fine-tuning, realignment, safety, open-models |
| 2604.07877 | MemReader: From Passive to Active Extraction for Long-Term Agent Memory | cs.CL | 84 | Active long-term memory extraction (GRPO/ReAct) to reduce memory pollution and improve consistency in agents. | agent-memory, personalization, RL, GRPO, information-extraction, reliability |
| 2604.08426 | KV Cache Offloading for Context-Intensive Tasks | cs.LG, cs.AI, cs.CL | 84 | Evaluates KV-cache offloading on context-intensive tasks; releases Text2JSON benchmark | long-context, systems, KV-cache, efficiency, benchmark, information-extraction |
| 2604.08169 | Activation Steering for Aligned Open-ended Generation without Sacrificing Coherence | cs.AI | 83 | Runtime activation steering methods to maintain aligned open-ended generation beyond first tokens | activation-steering, runtime-guardrails, alignment, robustness, representation, generation |
| 2604.07801 | TEMPER: Testing Emotional Perturbation in Quantitative Reasoning | cs.CL, cs.AI | 83 | TEMPER dataset shows emotion framing alone drops math accuracy 2–10pp; useful robustness stress test | robustness, evaluation, reasoning, emotion, dataset, GSM8K |
| 2604.07789 | ORACLE-SWE: Quantifying the Contribution of Oracle Information Signals on SWE Agents | cs.MA, cs.CL, cs.SE | 83 | Quantifies value of oracle signals for SWE agents; clarifies what context helps most | agents, software-engineering, evaluation, tool-use, oracles, benchmarks |
| 2604.07929 | Same Outcomes, Different Journeys: A Trace-Level Framework for Comparing Human and GUI-Agent Behavior in Production Search Systems | cs.IR, cs.AI | 82 | Trace-level human vs GUI-agent behavior comparison in production search; goes beyond success metrics | GUI-agents, evaluation, human-comparison, behavior-traces, search, deployment |
AI Paper Insight Brief
2026-04-11
0) Executive takeaways (read this first)
- “Safe to act” is becoming a first-class output of agent systems: conformal set-valued decisions for debate (risk-budgeted escalation) and log/vote-based execution gating both reduce catastrophic automated actions by turning uncertain outputs into structured refusal / review.
- Agent security is shifting from prompt injection to system and supply-chain attack surfaces: cascading multi-agent injection, malicious API routers rewriting tool calls, and visual/UI-level manipulation of GUI/CUA agents all bypass classic text-only defenses.
- Post-training is a dual-use battleground: preference tuning (ORPO) can rapidly misalign safety-aligned open models (even with tiny data via LoRA), while targeted parameter-level methods (ESI→SET/SPA) and activation steering offer efficient realignment/preservation—when you have white-box access.
- Benchmarks are getting more realistic and more diagnostic: live-web tasks (ClawBench) expose a large gap vs sandbox benchmarks; trajectory-level reward modeling (Plan-RewardBench) shows evaluators collapse at long contexts; implicit memory (ImplicitMemBench) reveals “unconscious” adaptation failures not fixed by retrieval.
- RAG and code reliability work is moving beyond retrieval: joint decoding for evidence integration (GuarantRAG) targets “retrieved-but-ignored” failures; static analysis catches a meaningful but bounded fraction of Python library hallucinations, with clear upper bounds.
- Training/inference infrastructure details matter for robustness: OPD can collapse via length inflation; KV-cache offloading can silently degrade accuracy on context-intensive tasks—both are “systems” failure modes that look like model failures.
2) Key themes (clusters)
Theme: Risk-controlled autonomy (refusal, gating, recovery)
- Why it matters: As agents take real actions, the key question is often when not to act. Mechanisms that convert model outputs into calibrated escalation, auditable gating, and recoverable execution reduce irreversible failures.
- Representative papers:
From Debate to Decision: Conformal Social Choice for Safe Multi-Agent Deliberation
LogAct: Enabling Agentic Reliability via Shared Logs
Governed Capability Evolution for Embodied Agents: Safe Upgrade, Compatibility Checking, and Runtime Rollback for Embodied Capability Modules
- Common approach:
- Turn point predictions into structured decisions (set-valued outputs; commit/abort; staged activation).
- Add explicit policy layers (voters/deciders; compatibility checkers; action hierarchies).
- Emphasize auditability + rollback/recovery (durable logs; shadow deployment; rollback controllers).
- Open questions / failure modes:
- Guarantees are often marginal / population-level (e.g., split conformal) rather than conditional/per-instance.
- Distribution shift and changing environments can break calibration and governance assumptions.
- Safety layers can impose a utility/latency tax (more human review; more tool calls; slower execution).
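The act/escalate pattern above can be sketched with split conformal prediction. This is a minimal illustration under the usual exchangeability assumption; the score and threshold choices here are generic, not any paper's exact recipe:

```python
import numpy as np

def conformal_threshold(cal_probs, cal_labels, alpha=0.05):
    """Calibrate a split-conformal score threshold at risk budget alpha."""
    # Nonconformity score: 1 - probability assigned to the true label.
    scores = 1.0 - cal_probs[np.arange(len(cal_labels)), cal_labels]
    n = len(scores)
    # Finite-sample-corrected quantile level, capped at 1.
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return np.quantile(scores, level, method="higher")

def act_or_escalate(probs, threshold):
    """Act only when the calibrated prediction set is a singleton."""
    pred_set = np.flatnonzero(1.0 - probs <= threshold)
    if len(pred_set) == 1:
        return "act", int(pred_set[0])
    return "escalate", pred_set.tolist()  # ambiguous or empty -> human review
```

With a held-out calibration split, singleton sets carry marginal (population-level) coverage only, which is exactly the caveat flagged above.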
Theme: Agent attack surfaces beyond text (multi-agent, UI/vision, routers)
- Why it matters: Many deployed agent stacks have intermediaries (routers), multiple agents (trust propagation), and visual grounding (GUI/CUA). These surfaces enable attacks that remain schema-valid and “benign-looking,” evading text-only filters.
- Representative papers:
- Your Agent Is Mine: Measuring Malicious Intermediary Attacks on the LLM Supply Chain
- ACIArena: Toward Unified Evaluation for Agent Cascading Injection
- Are GUI Agents Focused Enough? Automated Distraction via Semantic-level UI Element Injection
- Preference Redirection via Attention Concentration: An Attack on Computer Use Agents
- Common approach:
- Benchmarking with standardized suites (ACIArena’s 1,356 cases; router market measurement; GUI injection metrics L1/L2).
- Attacks that preserve surface validity (schema-valid JSON rewrites; safety-aligned UI icons; small ℓ∞ patches).
- Evaluate transferability/persistence (cross-victim icon transfer; prompt-variant robustness; adaptive evasion triggers).
- Open questions / failure modes:
- Long-term fix for routers likely needs provider-backed integrity/provenance, not just client heuristics.
- GUI/CUA defenses are underdeveloped; current work shows attacks but limited mitigation evaluation.
- Multi-agent defenses can trade off utility; pruning state (ACI-SENTINEL) helps but isn’t universal.
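A fail-closed client-side gate of the kind these papers motivate might look like the following. The tool names and JSON schema are hypothetical; a real deployment would also need provenance checks, since the gate alone cannot prove who produced the response:

```python
import json

# Hypothetical tool lists; in practice these come from your agent's config.
ALLOWED_TOOLS = {"search_web", "read_file", "send_email"}
HIGH_RISK_TOOLS = {"send_email"}  # require human confirmation

def gate_tool_call(raw_response: str, audit_log: list):
    """Fail-closed gate for tool calls relayed through an untrusted router.

    Returns the parsed call only if it is well-formed, allowlisted, and
    low-risk; everything else is logged and blocked (fail closed).
    """
    try:
        call = json.loads(raw_response)
        name, args = call["name"], call["arguments"]
    except (ValueError, KeyError, TypeError):
        audit_log.append({"decision": "reject", "reason": "malformed"})
        return None
    if name not in ALLOWED_TOOLS:
        audit_log.append({"decision": "reject", "reason": f"unknown tool: {name}"})
        return None
    if name in HIGH_RISK_TOOLS:
        audit_log.append({"decision": "review", "tool": name, "args": args})
        return None
    audit_log.append({"decision": "allow", "tool": name})
    return call
```

Note that a malicious router's rewrites can stay schema-valid, so the allowlist and risk tiers, not JSON validation, do the real work here.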
Theme: Post-training alignment is fragile (and can be targeted)
- Why it matters: Open models can be misaligned quickly with accessible fine-tuning; defenders need efficient ways to restore or preserve safety without destroying utility.
- Representative papers:
The Art of (Mis)alignment: How Fine-Tuning Methods Effectively Misalign and Realign LLMs in Post-Training
Towards Identification and Intervention of Safety-Critical Parameters in Large Language Models
Activation Steering for Aligned Open-ended Generation without Sacrificing Coherence
- Common approach:
- Compare fine-tuning methods as attacker/defender tools (ORPO vs DPO; PEFT vs PFT).
- Use mechanistic/white-box levers (parameter ranking via ESI; activation-space steering with per-token gating).
- Measure safety with ASR/unsafety and track utility regressions (MMLU/GSM8K/etc.; coherence metrics).
- Open questions / failure modes:
- White-box methods (ESI, steering) don’t transfer to closed APIs.
- Safety metrics often rely on LLM judges, which can be imperfect.
- Dual-use risk: the same tools that realign can also be used to misalign or evade.
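In numpy terms, per-token gated activation steering reduces to something like the sketch below. The gating rule, threshold, and strength alpha are illustrative placeholders, not the paper's exact method:

```python
import numpy as np

def steer_hidden_states(hidden, steer_vec, threshold=0.5, alpha=4.0):
    """Add a steering direction only to tokens the gate selects.

    hidden: (seq_len, d_model) activations at one layer.
    steer_vec: direction whose presence we want to encourage.
    """
    direction = steer_vec / np.linalg.norm(steer_vec)
    # Gate per token: steer only where the activation's projection onto the
    # direction is still below the threshold (avoids over-steering).
    proj = hidden @ direction
    gate = (proj < threshold).astype(hidden.dtype)
    return hidden + alpha * gate[:, None] * direction[None, :]
```

Gating per token, rather than steering every position uniformly, is one way to preserve coherence deep into open-ended generations, which is the failure mode this theme targets.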
Theme: Evaluation realism + long-horizon diagnostics for agents
- Why it matters: Outcome-only metrics hide failure modes (navigation divergence, evaluator length collapse, implicit memory gaps). New benchmarks emphasize traces, long contexts, and real websites.
- Representative papers:
- ClawBench: Can AI Agents Complete Everyday Online Tasks?
- Aligning Agents via Planning: A Benchmark for Trajectory-Level Reward Modeling
- ImplicitMemBench: Measuring Unconscious Behavioral Adaptation in Large Language Models
- Same Outcomes, Different Journeys: A Trace-Level Framework for Comparing Human and GUI-Agent Behavior in Production Search Systems
- Common approach:
- Collect trace-level artifacts (multi-layer recordings; state transition graphs; tool trajectories).
- Stress long-context and multi-step settings where judges/RMs degrade.
- Add hard negatives / near-misses to prevent superficial scoring.
- Open questions / failure modes:
- Live-web benchmarks face reproducibility drift; evaluators can be model-dependent.
- Trajectory judges collapse beyond ~32k tokens; scaling evaluation remains unsolved.
- Implicit memory failures persist even with memory-augmented agents (per reported results).
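Trace-level comparison can start as simply as a normalized edit distance over action sequences. A minimal sketch; the action vocabulary and lack of weighting are our simplifications, not any benchmark's metric:

```python
def trace_divergence(human_actions, agent_actions):
    """Normalized Levenshtein distance between two action traces.

    0.0 means identical journeys; 1.0 means completely disjoint ones,
    even when both traces end in the same outcome.
    """
    m, n = len(human_actions), len(agent_actions)
    if max(m, n) == 0:
        return 0.0
    # Standard dynamic-programming edit distance.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if human_actions[i - 1] == agent_actions[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # delete
                           dp[i][j - 1] + 1,        # insert
                           dp[i - 1][j - 1] + cost) # substitute
    return dp[m][n] / max(m, n)
```

A metric like this captures the "same outcomes, different journeys" gap that success-rate-only scoring hides.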
Theme: Grounding and integration (RAG, memory, code)
- Why it matters: Reliability failures often come from integration, not retrieval: ignoring evidence, writing polluted memory, or inventing APIs. Practical pipelines are emerging with measurable bounds.
- Representative papers:
Guaranteeing Knowledge Integration with Joint Decoding for Retrieval-Augmented Generation
MemReader: From Passive to Active Extraction for Long-Term Agent Memory
An Empirical Analysis of Static Analysis Methods for Detection and Mitigation of Code Library Hallucinations
- Common approach:
- Separate concerns: reasoning vs evidence (Inner-Answer vs Refer-Answer; active memory actions).
- Use post-hoc or tool-based layers (joint decoding interventions; search/add/buffer/ignore; static analyzers + repair).
- Quantify upper bounds and trade-offs (static analysis catchability bounds; memory update correctness; hallucination reductions).
- Open questions / failure modes:
- Added latency/compute (dual-path generation; multi-step memory management).
- Dependence on docstrings/type info limits static methods; dynamic languages remain hard.
- Long-term online stability of memory managers remains to be validated beyond benchmarks.
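A first static check for library hallucinations can be as small as resolving imports against the current environment. This is a coarse sketch; real tooling also verifies attribute and signature validity, which is where the upper bounds discussed above bite:

```python
import ast
import importlib.util

def find_unresolvable_imports(source: str):
    """Return imported module names that cannot be resolved locally."""
    missing = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names = [node.module]
        else:
            continue
        for name in names:
            # Resolve only the top-level package; submodule and member
            # existence need a deeper check.
            if importlib.util.find_spec(name.split(".")[0]) is None:
                missing.append(name)
    return missing
```

Dynamic attribute access, lazy imports, and untyped APIs all escape this kind of check, consistent with the bounded-catchability finding.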
3) Technical synthesis
- Multiple papers converge on “structured intermediates” as the reliability lever: prediction sets (conformal), typed logs (AgentBus), explicit tool-arguments (Q+), oracle signals (ORACLE-SWE), and trajectory pairs (Plan-RewardBench).
- Selection effects are repeatedly exploited: conformal singletons are accurate because they abstain; judge-filtered synthetic trajectories outperform larger unfiltered sets; shadow deployment catches regressions sandbox misses.
- LLM-as-judge is everywhere, but papers increasingly report judge validation (e.g., PPT-Bench human agreement; FGRPO κ=0.997 vs GPT-5) and/or add judge-independent signals (activation steering uses embedding similarity, cross-entropy, ELO).
- Robustness failures are increasingly non-adversarial in appearance: emotional framing degrades math; UI icons are “safety-aligned”; router rewrites remain schema-valid; OPD collapse looks like “model got worse” but is a training dynamic.
- There’s a clear split between black-box deployable defenses (conformal layer; prompt mitigations; static analysis; client-side router gates/logging) and white-box mechanistic defenses (activation steering; ESI parameter interventions; PRAC patch crafting).
- Long-horizon settings expose evaluator brittleness: Plan-RewardBench shows pairwise LLM judges collapse past ~32k tokens, motivating more robust discriminative RMs or hierarchical evaluation.
- “Memory” is bifurcating into explicit stores (MemReader active writes) vs implicit behavioral adaptation (ImplicitMemBench), and the latter is not solved by retrieval alone.
- Systems work (KV offloading, OPD stability) shows that inference/training optimizations can silently change task accuracy, so robustness evaluation must include infrastructure variants.
- Security evaluation is moving toward ecosystem measurement (router markets, poisoning studies) rather than only lab attacks, yielding concrete prevalence numbers and operational mitigations.
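The length-inflation and repetition failure modes called out above can be caught with two cheap monitors. Illustrative only: token lists stand in for your tokenizer's output, and alert thresholds are yours to set:

```python
def repetition_rate(tokens, n=3):
    """Fraction of duplicate n-grams; spikes signal degenerate repetition."""
    if len(tokens) < n:
        return 0.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return 1.0 - len(set(ngrams)) / len(ngrams)

def length_inflation(baseline_lengths, current_lengths):
    """Mean output length vs a frozen baseline; values well above 1 flag inflation."""
    baseline = sum(baseline_lengths) / len(baseline_lengths)
    return (sum(current_lengths) / len(current_lengths)) / baseline
```

Run against a frozen baseline, these turn "the model got worse" into a checkable hypothesis about the training or serving pipeline.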
4) Top 5 papers (with “why now”)
1) Your Agent Is Mine: Measuring Malicious Intermediary Attacks on the LLM Supply Chain
- Quantifies a real, under-discussed risk: API routers terminate TLS and can rewrite executable tool-call JSON.
- Large ecosystem measurement (28 paid + 400 free routers) with observed active injection and credential touching, plus poisoning studies.
- Evaluates practical client-side mitigations (policy gate, anomaly screening, transparency logging) and shows compatibility of the attack proxy across agent frameworks.
- Skepticism: client-side defenses don’t provide cryptographic provenance; measurement may miss untriggered adaptive behaviors.
2) From Debate to Decision: Conformal Social Choice for Safe Multi-Agent Deliberation
- Reframes debate output as act vs escalate under a user-set risk budget α, with split-conformal marginal coverage.
- Empirically targets a key failure: wrong unanimous consensus (23.9% of initially-disagreeing cases converge to a unanimous wrong answer by round 3); the conformal layer intercepts 81.9% at α=0.05 by escalating.
- Black-box and post-hoc: deployable on proprietary models via verbalized probabilities + aggregation.
- Skepticism: guarantees are marginal and assume exchangeability; evaluated in closed-set multiple-choice.
3) ClawBench: Can AI Agents Complete Everyday Online Tasks?
- Live-web benchmark with safe interception of terminal submissions and five-layer trace recording—bridges realism and safety.
- Shows a stark gap vs sandbox benchmarks: best model reported (Claude Sonnet 4.6) at 33.3% SR; GPT-5.4 at 6.5%.
- Provides traceable failure diagnostics via an agentic evaluator comparing to human trajectories.
- Skepticism: live-web variability and manual endpoint annotation limit scalability and reproducibility.
4) ACIArena: Toward Unified Evaluation for Agent Cascading Injection
- Standardizes multi-agent cascading injection evaluation across 28 attacks, 3 surfaces, 3 objectives, and integrates six MAS frameworks.
- Finds high vulnerability (code tasks often 90–100% ASR; LLM Debate cited at 100% hijacking ASR) and that some defenses can fail or trade utility.
- Proposes ACI-SENTINEL (semantic minimality pruning) with large ASR reductions in reported cases.
- Skepticism: evaluation scale constrained by query cost; defense introduces utility trade-offs and isn’t universally effective.
5) The Art of (Mis)alignment: How Fine-Tuning Methods Effectively Misalign and Realign LLMs in Post-Training
- Maps attacker/defender dynamics across common methods: ORPO strongest for misalignment; DPO strongest for realignment (often with utility cost).
- Shows misalignment can be data-efficient (LoRA effective with as few as 13 unsafe samples in some settings).
- Highlights model-specific resistance patterns (Gemma2 resists SFT misalignment but not ORPO).
- Skepticism: unsafety relies on LLM-judge ensemble; excludes proprietary models and full RLHF.
5) Practical next steps
- Add an act/escalate layer to any multi-agent or ensemble system: implement split conformal on aggregated probabilities (or analogous scores) and measure automated-error reduction vs escalation rate.
- For tool-using agents, treat routers as untrusted: deploy fail-closed policies for high-risk tools, add response anomaly screening, and implement append-only transparency logs for forensics.
- Red-team GUI/CUA stacks with non-text attacks: semantic UI icon injection and visual preference redirection; measure persistence and cross-model transfer, not just single-shot success.
- If you ship open-weight models, assume post-training misalignment is cheap: test ORPO/LoRA-style adversarial tuning on your release candidates; evaluate how well DPO or targeted interventions recover safety and what utility you lose.
- Upgrade your evaluation to include live-web or trace-level metrics (navigation divergence, tool-use effort bias) and long-context judge failure checks (e.g., >32k token trajectories).
- For SWE agents, prioritize reproduction test generation/extraction and richer execution context: ORACLE-SWE suggests reproduction tests dominate oracle gains and combined signals approach near-complete success.
- Audit infrastructure changes (KV offloading, OPD/RL pipelines) with context-intensive benchmarks and length/repetition monitoring; treat “optimization” as a potential accuracy regression source.
- For RAG, measure integration failures (parametric override / disjointed integration) and test dual-path + fusion approaches; don’t assume retrieval improvements translate to factuality.
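The append-only transparency log suggested above can be prototyped as a hash chain. A minimal sketch; production logging would add signing and external anchoring so a compromised client cannot silently rewrite its own chain:

```python
import hashlib
import json

class TransparencyLog:
    """Append-only log where each entry commits to the hash of its predecessor."""

    GENESIS = "0" * 64

    def __init__(self):
        self.entries = []
        self._last_hash = self.GENESIS

    def append(self, record: dict):
        body = json.dumps({"prev": self._last_hash, "record": record}, sort_keys=True)
        digest = hashlib.sha256(body.encode()).hexdigest()
        self.entries.append({"prev": self._last_hash, "record": record, "hash": digest})
        self._last_hash = digest

    def verify(self) -> bool:
        """Recompute the chain; any tampered or reordered entry breaks it."""
        prev = self.GENESIS
        for entry in self.entries:
            body = json.dumps({"prev": prev, "record": entry["record"]}, sort_keys=True)
            if entry["prev"] != prev or hashlib.sha256(body.encode()).hexdigest() != entry["hash"]:
                return False
            prev = entry["hash"]
        return True
```

Chaining hashes makes after-the-fact edits detectable, which is the forensic property the router-threat papers argue client heuristics alone cannot provide.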
Generated from per-paper analyses; no external browsing.
