Daily AI Paper Report (2026-03-26)
Run stats
- Candidates: 232
- Selected: 30
- Deepread completed: 30
- Window (UTC): 2026-03-24T00:00:00Z → 2026-03-25T00:00:00Z (arxiv_announce, expanded=0)
Selected papers
| arXiv ID | Title / Links | Categories | Score | Why | Tags |
|---|---|---|---|---|---|
| 2603.22928 | SoK: The Attack Surface of Agentic AI -- Tools, and Autonomy | cs.CR | 96 | SoK mapping agentic AI attack surface (RAG/tools/multi-agent); strong taxonomy + synthesis for defenders | agentic-ai, security, prompt-injection, tool-security, rag-poisoning, multi-agent, survey |
| 2603.23064 | Mind Your HEARTBEAT! Claw Background Execution Inherently Enables Silent Memory Pollution | cs.CR, cs.AI, cs.SI | 95 | Finds agent memory pollution via background “heartbeat”; concrete vuln model for personal agents | agent-security, memory-poisoning, prompt-injection, tool-agents, provenance, background-execution |
| 2603.22853 | Agent Audit: A Security Analysis System for LLM Agent Applications | cs.CR, cs.AI | 93 | Practical static/security analysis for LLM agent apps (code+configs+creds+privileges); deployable outputs | agents, appsec, static-analysis, mcp, credentials, tooling, deployment-security |
| 2603.23171 | Robust Safety Monitoring of Language Models via Activation Watermarking | cs.CR, cs.AI, cs.CY, cs.LG | 92 | Frames robust LLM monitoring as a security game; targets adaptive evasion with activation watermarking | monitoring, misuse-detection, adaptive-adversary, watermarking, inference-security, safety |
| 2603.23269 | Not All Tokens Are Created Equal: Query-Efficient Jailbreak Fuzzing for LLMs | cs.CR, cs.AI, cs.LG | 92 | Query-efficient jailbreak fuzzing via token-importance; enables stronger red-teaming under budgets | jailbreaks, fuzzing, adversarial-prompts, red-teaming, surrogate-models, evaluation |
| 2603.23184 | ImplicitRM: Unbiased Reward Modeling from Implicit Preference Data for LLM alignment | cs.CL, cs.AI, stat.AP | 92 | Unbiased reward modeling from implicit feedback; tackles bias + missing negatives for RLHF at scale | alignment, RLHF, reward-modeling, implicit-feedback, debiasing, preference-learning |
| 2603.22934 | ProGRank: Probe-Gradient Reranking to Defend Dense-Retriever RAG from Corpus Poisoning | cs.AI | 90 | Training-free retriever-side defense for RAG corpus poisoning using probe gradients + perturbation tests | rag, retrieval-security, data-poisoning, dense-retriever, defense, robustness |
| 2603.22767 | Can LLM Agents Generate Real-World Evidence? Evaluating Observational Studies in Medical Databases | cs.AI, cs.CL | 90 | RWE-bench tests long-horizon agents executing real DB observational studies; structured evidence eval | agent-evaluation, benchmarks, tool-use, long-horizon, databases, healthcare |
| 2603.22829 | Improving Safety Alignment via Balanced Direct Preference Optimization | cs.AI | 90 | Targets safety-alignment overfitting in DPO; proposes balanced objective to improve robustness. | alignment, DPO, RLHF, safety, overfitting, preference-learning |
| 2603.23268 | SafeSeek: Universal Attribution of Safety Circuits in Language Models | cs.LG, cs.AI | 88 | Optimization-based attribution of safety circuits; aims for generalizable mechanistic safety interpretability | mechanistic-interpretability, safety-circuits, jailbreaks, backdoors, attribution, sparse-masks |
| 2603.22744 | Beyond Binary Correctness: Scaling Evaluation of Long-Horizon Agents on Subjective Enterprise Tasks | cs.AI | 88 | LH-Bench evaluates subjective enterprise long-horizon workflows with rubrics + artifact-based signals | agent-evaluation, benchmarks, enterprise, rubrics, long-horizon, LLM-judges |
| 2603.23047 | Parametric Knowledge and Retrieval Behavior in RAG Fine-Tuning for Electronic Design Automation | cs.CL, cs.AI, cs.CE | 88 | RAG fine-tuning analysis + new factual attribution eval (TriFEX) and parametric-knowledge metric (PKP). | RAG, evaluation, factuality, attribution, metrics, fine-tuning |
| 2603.23114 | Between Rules and Reality: On the Context Sensitivity of LLM Moral Judgment | cs.AI, cs.CL, cs.CY, cs.HC | 88 | Contextual moral dilemmas dataset; shows LLM moral sensitivity differs from humans; control problem | safety, alignment, evaluation, moral-judgment, dataset, context-sensitivity, human-comparison |
| 2603.22868 | Agent-Sentry: Bounding LLM Agents via Execution Provenance | cs.CR, cs.AI | 87 | Execution provenance to bound/validate agent behavior vs irrelevant/compromised actions; security+privacy angle | agents, provenance, runtime-verification, policy-bounding, auditability, security |
| 2603.22882 | TreeTeaming: Autonomous Red-Teaming of Vision-Language Models via Hierarchical Strategy Exploration | cs.LG, cs.CV | 86 | Autonomous red-teaming for VLMs via hierarchical strategy exploration; seeks novel/diverse exploits | red-teaming, vlm-safety, adversarial-testing, automation, evaluation, attack-discovery |
| 2603.23355 | Off-Policy Value-Based Reinforcement Learning for Large Language Models | cs.LG, cs.CL | 86 | Off-policy value-based RL for LLMs with replay; could improve sample efficiency for reasoning RL | RL-for-LLMs, off-policy, value-learning, reasoning, verification-signals, sample-efficiency |
| 2603.22751 | CIPL: A Target-Independent Framework for Channel-Inversion Privacy Leakage in Agents | cs.CR | 85 | General framework for privacy leakage in agents as channel inversion; broadens beyond memory leakage | privacy, agents, information-leakage, attack-framework, side-channels, threat-modeling |
| 2603.23117 | TRAP: Hijacking VLA CoT-Reasoning via Adversarial Patches | cs.CR | 84 | Shows CoT in VLA robots enables targeted control hijacking via adversarial patches; important embodied risk | robotics, vla, chain-of-thought, adversarial-patches, embodied-security, attack |
| 2603.22717 | Does Teaming-Up LLMs Improve Secure Code Generation? A Comprehensive Evaluation with Multi-LLMSecCodeEval | cs.CR, cs.SE | 84 | Evaluates multi-LLM collaboration + static analysis for secure codegen; practical security pipeline data | secure-code-generation, LLM-ensembles, static-analysis, software-security, evaluation |
| 2603.23292 | LLM Olympiad: Why Model Evaluation Needs a Sealed Exam | cs.AI, cs.CL | 84 | Proposes sealed-exam 'LLM Olympiad' to reduce benchmark leakage/chasing and improve trust in evals. | evaluation, benchmarks, data-contamination, leaderboards, reproducibility, governance |
| 2603.23485 | Failure of contextual invariance in gender inference with large language models | cs.CL, cs.AI, cs.CY | 84 | Finds large instability under contextually equivalent prompts in gender inference; evaluation warning | reliability, robustness, bias, evaluation, prompt-sensitivity, context-invariance |
| 2603.23501 | MedObvious: Exposing the Medical Moravec's Paradox in VLMs via Clinical Triage | cs.CV, cs.AI, cs.CL | 83 | Benchmark for medical VLM input-validity sanity checks; targets a safety-critical failure mode | evaluation, medical-ai, vlm, robustness, input-validation, benchmark |
| 2603.22714 | PopResume: Causal Fairness Evaluation of LLM/VLM Resume Screeners with Population-Representative Dataset | cs.CY, cs.AI | 83 | PopResume dataset enables causal/path-specific fairness audits for LLM/VLM resume screeners at scale | fairness, auditing, datasets, causal-evaluation, hiring, VLM |
| 2603.22812 | Efficient Hallucination Detection: Adaptive Bayesian Estimation of Semantic Entropy with Guided Semantic Exploration | cs.CL | 83 | Adaptive semantic-entropy hallucination detection cuts sampling cost by adjusting budget to uncertainty. | hallucination, uncertainty, semantic-entropy, Bayesian, efficient-eval, reliability |
| 2603.23483 | SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning | cs.CV, cs.CL | 82 | Speculative planning to cut agentic multimodal tool-loop latency; system-level speedups for MLLM agents | agentic-MLLM, speculative-decoding, tool-use, latency, planning, efficiency |
| 2603.22754 | PRISM: A Dual View of LLM Reasoning through Semantic Flow and Latent Computation | cs.CL | 82 | Joint step+layer reasoning diagnostics; identifies failure modes like verification loops/divergence | interpretability, reasoning, analysis, hidden-states, failure-modes, diagnostics |
| 2603.22823 | Empirical Comparison of Agent Communication Protocols for Task Orchestration | cs.AI | 81 | Benchmark comparing tool-only vs delegation vs hybrid multi-agent protocols; useful for agent design. | agents, multi-agent, tool-use, orchestration, benchmarks, protocols |
| 2603.23013 | Knowledge Access Beats Model Size: Memory Augmented Routing for Persistent AI Agents | cs.CL | 81 | Memory-augmented routing for persistent agents; big cost cuts without training; strong deployment angle | agents, memory, efficiency, routing, long-term-interaction, serving |
| 2603.23149 | Describe-Then-Act: Proactive Agent Steering via Distilled Language-Action World Models | cs.AI | 80 | Fast 'describe-then-act' steering layer predicts outcomes from latent+actions; aims at proactive safety. | agents, world-models, steering, safety, latents, planning |
| 2603.22879 | Confidence Calibration under Ambiguous Ground Truth | cs.LG, cs.AI | 79 | Calibration breaks with annotator disagreement; proposes ambiguity-aware post-hoc calibrators | calibration, uncertainty, evaluation, ambiguous-labels, reliability, post-hoc |
AI Paper Insight Brief
2026-03-26
0) Executive takeaways (read this first)
- Agent security is shifting from “prompt injection” to “system surfaces”: multiple papers show the dominant risks are in channels (tool args/returns, traces), architectures (heartbeat background execution), and runtime behavior (execution provenance), not just final text.
- Practical defenses are becoming more “systems-y” and measurable: retriever-side reranking against RAG poisoning (no generator calls), runtime tool-call bounding via provenance graphs, and agent-aware static analysis for MCP configs all report strong security/utility trade-offs.
- Evaluation is moving beyond single-number correctness: long-horizon subjective enterprise tasks (rubric + artifact contracts + human validation), end-to-end medical observational studies on a real DB backend, and medical VLM “input sanity checks” expose failures that classic benchmarks miss.
- Alignment training and monitoring are getting more “distribution-aware”: B-DPO targets preference-pair comprehension imbalance; ImplicitRM makes implicit-feedback reward modeling unbiased under missing negatives and propensity bias; activation watermarking adds keyed, adversary-aware monitoring.
- Context sensitivity is a recurring failure mode: minimal context changes can flip gender pronoun inference; moral judgments shift with small contextual cues and differ from humans; these effects are controllable (activation steering) but not free (small capability drops).
- Cost/latency optimizations increasingly rely on gating + verification: speculative “tool-free” bypass for agentic MLLMs and memory-augmented routing show large speedups/cost cuts, but hinge on confidence/calibration and retrieval fidelity.
2) Key themes (clusters)
Theme: Agent security across channels, tools, and autonomy
- Why it matters: Real agent deployments leak and get hijacked through observable channels (tool args/returns, traces), tool boundaries, and autonomous background behaviors—often without obvious prompt injection.
- Representative papers:
- SoK: The Attack Surface of Agentic AI -- Tools, and Autonomy
- Mind Your HEARTBEAT! Claw Background Execution Inherently Enables Silent Memory Pollution
- Agent Audit: A Security Analysis System for LLM Agent Applications
- Agent-Sentry: Bounding LLM Agents via Execution Provenance
- CIPL: A Target-Independent Framework for Channel-Inversion Privacy Leakage in Agents
- Common approach:
- Define a system-level threat model (pipelines, stages, channels, provenance) rather than only output text.
- Measure security with explicit rates (CER/AER, ASR, utility success) and controlled ablations/coverage regimes.
- Add enforcement/analysis at boundaries: static scanning of tool code + MCP configs; runtime interception of tool calls (a minimal gating sketch follows this theme's bullets).
- Open questions / failure modes:
- Coverage gaps: unseen benign behaviors become “ambiguous” and require intent checks (Agent-Sentry).
- Mimicry attacks that follow benign tool sequences but manipulate parameters (Agent-Sentry).
- Toy/limited targets and string-match leakage metrics may undercount semantic leakage (CIPL).
- Product-level design flaws (shared-session heartbeat) create “zero-click-like” memory pollution pathways (HEARTBEAT).
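A minimal Python sketch of the provenance-gating pattern from the enforcement bullet above. The names (`ProvenanceGate`, `intent_judge`) are hypothetical, and Agent-Sentry's functionality graphs are built from richer trace structure than the tool-pair edges used here; this only illustrates the allow/block/ambiguous triage.

```python
from dataclasses import dataclass, field

@dataclass
class ProvenanceGate:
    # (prev_tool, next_tool) transitions observed in benign traces
    benign_edges: set
    adversarial_edges: set = field(default_factory=set)

    def decide(self, prev_tool: str, next_tool: str) -> str:
        edge = (prev_tool, next_tool)
        if edge in self.adversarial_edges:
            return "block"
        if edge in self.benign_edges:
            return "allow"
        return "ambiguous"  # unseen path: route to the intent judge

def intent_judge(user_prompt: str, tool_specs: dict, tool_history: list) -> bool:
    """Stub for an LLM intent-alignment check. It sees ONLY trusted inputs
    (prompt, tool specs, prior tool calls) and never retrieved content, so
    instructions injected into documents cannot sway the verdict."""
    raise NotImplementedError("call your LLM judge here")

gate = ProvenanceGate(benign_edges={("search_docs", "summarize"), ("summarize", "reply")})
print(gate.decide("search_docs", "summarize"))   # -> "allow"
print(gate.decide("search_docs", "send_email"))  # -> "ambiguous": ask intent_judge
```

Note the mimicry caveat above: an edge-level gate like this passes benign tool sequences with malicious parameters, which is exactly why the ambiguous path and intent checks matter.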
Theme: RAG and memory: robustness, poisoning, and “knowledge vs model size”
- Why it matters: Retrieval and memory can dominate correctness and cost, but they also introduce poisoning and leakage surfaces; defenses and metrics are becoming retriever-side and attribution-aware.
- Representative papers:
- ProGRank: Probe-Gradient Reranking to Defend Dense-Retriever RAG from Corpus Poisoning
- Knowledge Access Beats Model Size: Memory Augmented Routing for Persistent AI Agents
- Parametric Knowledge and Retrieval Behavior in RAG Fine-Tuning for Electronic Design Automation
- Common approach:
- Keep defenses training-free and generator-free where possible (retriever-side reranking; no extra LLM calls); a simplified rerank sketch follows this theme's bullets.
- Separate “what the model knows” from “what it copied from context” (triple attribution; PKP, PR, SK decomposition).
- Use hybrid retrieval (dense + BM25) and analyze per-type gains and regressions (e.g., temporal reasoning).
- Open questions / failure modes:
- Clean-utility trade-offs and compute overhead from probing repeats/candidate buffers (ProGRank).
- Memory fidelity risks: storing summaries can poison retrieval; temporal reasoning degrades with unstructured turn-pairs (Memory routing).
- Attribution noise: LLM-based triple attribution ~80% accuracy; fixed top-k evidence may miss support (TriFEX/PKP).
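The rerank sketch referenced above, under a loudly simplified assumption: ProGRank's actual signal comes from probe gradients, whereas this stand-in approximates instability by score variance under random token drops (adversarially optimized poison passages tend to be brittle under perturbation). `toy_score` is a placeholder for a real dense-retriever similarity.

```python
import random
from statistics import pstdev

def perturb(passage: str, drop_p: float = 0.15) -> str:
    # Randomly drop tokens to probe how fragile the retrieval score is.
    kept = [t for t in passage.split() if random.random() > drop_p]
    return " ".join(kept) if kept else passage

def toy_score(query: str, passage: str) -> float:
    # Token-overlap stand-in for a dense retriever; replace with your encoder.
    q, p = set(query.lower().split()), set(passage.lower().split())
    return len(q & p) / (len(q) or 1)

def rerank(query: str, candidates: list, score_fn=toy_score,
           repeats: int = 8, alpha: float = 1.0) -> list:
    """Penalize candidates whose score swings under small perturbations."""
    out = []
    for passage in candidates:
        base = score_fn(query, passage)
        perturbed = [score_fn(query, perturb(passage)) for _ in range(repeats)]
        instability = pstdev(perturbed + [base])
        out.append((base - alpha * instability, passage))
    return sorted(out, key=lambda x: x[0], reverse=True)
```

The `repeats` and candidate-buffer size are exactly the compute knobs the failure-mode bullet above flags: more probing repeats means more retriever calls per query.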
Theme: Evaluation for long-horizon, subjective, and high-stakes workflows
- Why it matters: Deployed agents fail in ways that unit tests and final-answer accuracy don’t capture—process quality, intermediate artifacts, cohort correctness, and input validity become first-class.
- Representative papers:
- Beyond Binary Correctness: Scaling Evaluation of Long-Horizon Agents on Subjective Enterprise Tasks
- Can LLM Agents Generate Real-World Evidence? Evaluating Observational Studies in Medical Databases
- MedObvious: Exposing the Medical Moravec's Paradox in VLMs via Clinical Triage
- Common approach:
- Build environments with artifact contracts and runtime verification hooks (containers, tool servers, manifests).
- Use multi-judge/rubric scoring and validate with human pairwise preferences (LH-Bench); see the scoring sketch after this theme's bullets.
- Gate end-to-end success on intermediate correctness (cohort audit tables; masked numeric results with verification).
- Open questions / failure modes:
- Generalization beyond a few environments/databases (LH-Bench: 2 envs; RWE-bench: MIMIC-IV only).
- High cost of automated screening/judging (RWE-bench cohort screening ~200M tokens).
- Format sensitivity and false alarms in medical VLM gatekeeping (MedObvious).
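The scoring sketch referenced above: a minimal multi-judge rubric aggregator. The rubric items, median aggregation, and disagreement threshold are illustrative choices, not LH-Bench's exact protocol; the `needs_human_check` flag stands in for routing contested items to human pairwise validation.

```python
from statistics import median

RUBRIC = ["task_completion", "artifact_quality", "process_soundness"]

def score_artifact(artifact: str, judges: list) -> dict:
    """judges: callables (artifact, criterion) -> int in [1, 5]. Aggregate by
    median per criterion; large judge disagreement flags the item for the
    human pairwise-preference validation step."""
    report = {}
    for criterion in RUBRIC:
        scores = [judge(artifact, criterion) for judge in judges]
        report[criterion] = {
            "median": median(scores),
            "needs_human_check": max(scores) - min(scores) >= 2,
        }
    return report
```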
Theme: Alignment & monitoring under distribution shift and adversaries
- Why it matters: Safety failures often come from overfitting (preference optimization), biased feedback channels (implicit signals), and adaptive attackers (jailbreaks); robust monitoring and better objectives are emerging.
- Representative papers:
- Improving Safety Alignment via Balanced Direct Preference Optimization
- ImplicitRM: Unbiased Reward Modeling from Implicit Preference Data for LLM alignment
- Robust Safety Monitoring of Language Models via Activation Watermarking
- Not All Tokens Are Created Equal: Query-Efficient Jailbreak Fuzzing for LLMs
- Common approach:
- Model the data-generation process (implicit feedback as preference × action propensity; MI-based comprehension imbalance).
- Evaluate under adaptive/low-budget attack regimes (AutoDAN; query budgets; ultra-low FPR monitoring).
- Add keyed or internal signals (activation watermarking) rather than relying only on external text guards; a minimal detection sketch follows this theme's bullets.
- Open questions / failure modes:
- Watermarking assumes black-box attackers; no provable guarantees; utility drops on math (ActWM).
- Jailbreak fuzzing depends on a local surrogate and automated success judges (TriageFuzz).
- ImplicitRM assumes positive actions always indicate positive preference (misclick noise not handled).
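The detection sketch referenced above. It assumes the model has already been fine-tuned so harmful activations align with a secret unit direction `key` (the activation-watermarking setup); the monitored layer, dimensionality, and threshold here are placeholders, not the paper's values.

```python
import numpy as np

def monitor(activations: np.ndarray, key: np.ndarray, threshold: float = 0.35) -> bool:
    """activations: (seq_len, d) hidden states from the monitored layer;
    key: secret unit vector. Flag the request if any position projects
    strongly onto the key (cosine similarity above threshold)."""
    scores = activations @ key / (np.linalg.norm(activations, axis=1) + 1e-8)
    return bool(scores.max() > threshold)

rng = np.random.default_rng(0)
key = rng.normal(size=768)
key /= np.linalg.norm(key)
acts = rng.normal(size=(32, 768))
print(monitor(acts, key))  # random activations: almost surely False
```

The inference-time cost is one projection per position, which is the low-overhead claim in the top-5 entry below; the expensive part is the fine-tuning that implants the alignment.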
Theme: Context sensitivity in fairness and moral judgment (and controllability)
- Why it matters: Small, “theoretically irrelevant” context can dominate outputs, undermining bias audits and making behavior adversarially steerable; some of this sensitivity is controllable but with trade-offs.
- Representative papers:
- PopResume: Causal Fairness Evaluation of LLM/VLM Resume Screeners with Population-Representative Dataset
- Failure of contextual invariance in gender inference with large language models
- Between Rules and Reality: On the Context Sensitivity of LLM Moral Judgment
- Common approach:
- Controlled perturbations to isolate context effects (priming sentences; contextual moral variations; resume formats/photos); an invariance-probe sketch follows this theme's bullets.
- Go beyond outcome disparities to mechanisms: causal path decomposition (BIE vs RIE) and formal contextuality tests (CbD).
- Explore interventions: activation steering to tune contextual sensitivity; causal estimators with CIs for fairness pathways.
- Open questions / failure modes:
- Dataset realism and mediator choices affect legal/causal interpretation (PopResume).
- The gender inference task uses binary pronouns and minimal contexts, and there is no human invariance baseline (contextual invariance paper).
- Steering generality beyond one model family and ecological validity beyond binary dilemmas (moral judgment).
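The invariance-probe sketch referenced above: prepend nominally irrelevant primes and measure how often the answer flips. The primes and the agreement metric are illustrative; the paper's formal contextuality tests (CbD) are considerably stronger than this quick screen.

```python
from collections import Counter

PRIMES = ["",
          "The weather was mild that day. ",
          "She had just returned from a conference. ",
          "The office had recently been repainted. "]

def invariance_rate(base_prompt: str, answer_fn) -> float:
    """answer_fn(prompt) -> str label (e.g., an inferred pronoun).
    Returns the fraction of primed variants agreeing with the modal answer:
    1.0 means fully invariant; lower values flag context sensitivity."""
    answers = [answer_fn(p + base_prompt) for p in PRIMES]
    modal, count = Counter(answers).most_common(1)[0]
    return count / len(answers)
```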
3) Technical synthesis
- Many works converge on “gating + fallback” architectures: SpecEyes gates tool-free answers via separability; memory routing gates escalation via logprob confidence; Agent-Sentry gates tool calls via provenance graphs + intent judge; hallucination detection gates sampling via entropy-variance stopping (a minimal gate sketch follows this list).
- Judge dependence is everywhere, but used differently: LH-Bench uses multiple judges + human validation; RWE-bench uses gated questions and a cohort judge; TreeTeaming and MedObvious highlight format/judge sensitivity risks; TriFEX uses LLM attribution with measured accuracy (~80%).
- Security measurement is becoming rate-based and lifecycle-aware: secure-code rate across gen/detect/patch; ASR vs utility; CER/AER for leakage; poison hit/recall at retrieval stage plus downstream ASR.
- Training-free defenses are favored for deployability: ProGRank reranks without retraining; Agent Audit is static; SpecEyes is routing; memory routing is retrieval + confidence; these contrast with fine-tuning-based monitoring (activation watermarking) and alignment (B-DPO).
- Causal/structural decompositions are spreading: PopResume decomposes protected-attribute effects into direct vs mediated (business necessity vs redlining); ImplicitRM decomposes implicit feedback into preference vs action propensity; calibration paper decomposes “true-label” vs voted-label calibration targets.
- Sparse structure keeps appearing: TriageFuzz finds refusal dominated by sparse token regions; SafeSeek finds extremely sparse safety/backdoor circuits; both imply defenses/attacks can focus on small substructures.
- Context and modality increase direct access to sensitive attributes: PopResume shows photos increase direct effects (NDE) in VLM screeners; TRAP shows CoT can dominate action generation and be hijacked via visual patches.
- Cost is now a first-class metric: multi-LLM secure coding quantifies CodeQL runtime dominance; MCP vs A2A shows token bloat crossover; SpecEyes formalizes throughput speedup; memory routing reports ~96% effective-cost reduction vs large model.
- Reliability hinges on intermediate artifacts: cohort audit tables, manifests/screenshots, tool-call provenance, and triple attributions are increasingly used as “contracts” for evaluation and debugging.
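A minimal version of the gate-plus-fallback pattern referenced in the first bullet of this list. The confidence gate and threshold are illustrative and not any single paper's exact mechanism; in the memory-routing setting `confidence_fn` would be a logprob-derived score and `expensive_fn` the large-model escalation path.

```python
def gated_answer(query: str, cheap_fn, expensive_fn, confidence_fn, tau: float = 0.8):
    """cheap_fn/expensive_fn: query -> answer; confidence_fn: (query, answer) -> [0, 1].
    Returns (answer, escalated) so escalation rate can be tracked as a cost metric."""
    draft = cheap_fn(query)
    if confidence_fn(query, draft) >= tau:
        return draft, False       # fast path: gate cleared
    return expensive_fn(query), True  # fallback: escalate to the expensive path
```

The whole trade-off lives in `tau` and the calibration of `confidence_fn`, which is why the executive takeaways flag calibration as the hinge for these speedups.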
4) Top 5 papers (with “why now”)
1) Agent-Sentry: Bounding LLM Agents via Execution Provenance
- Introduces functionality graphs from traces (benign/adversarial/ambiguous) and runtime interception of tool calls.
- Uses an intent-alignment judge restricted to trusted inputs only (prompt + tool specs + tool history), explicitly excluding retrieved content.
- Reports strong security/utility trade-offs on a new 6,733-trace benchmark (utility 94.61%, ASR 9.46% with full coverage).
- Skepticism: coverage dependence and mimicry attacks (benign paths with malicious parameters) can evade.
2) ProGRank: Probe-Gradient Reranking to Defend Dense-Retriever RAG from Corpus Poisoning
- Training-free, retriever-side defense using probe-gradient instability under perturbations + score gating.
- Reduces poisoned Top-K exposure and reports strong downstream robustness (macro-average judge-based ASR reported as 0.000 at Top-5 in their eval).
- Much faster than costly baselines (mean 4.73 s/query vs 118.17 s/query for RAGuard).
- Skepticism: compute overhead depends on probing repeats/candidate buffer; clean utility trade-offs are dataset-dependent.
3) Agent Audit: A Security Analysis System for LLM Agent Applications
- Agent-aware static analysis for Python agents + MCP config semantics, with tool-boundary tainting and confidence tiers.
- New benchmark AVB (22 samples, 42 vulns); reports 95.24% recall (40/42) vs Semgrep/Bandit ~24–30% recall.
- CI/IDE-ready outputs (SARIF) and sub-second scanning on 22k LOC.
- Skepticism: intra-procedural taint only; MCP heuristics contribute false positives; limited JS/TS support.
4) PopResume: Causal Fairness Evaluation of LLM/VLM Resume Screeners with Population-Representative Dataset
- Provides a population-grounded resume dataset (60,884) and path-specific causal decomposition: direct vs mediated, and mediated into business-necessity (BIE) vs redlining (RIE).
- Finds discrimination patterns masked by outcome-only metrics (e.g., cancellation where TE≈0 but NDE/NIE nonzero; mixed mediation in 53/120 cases).
- Shows adding photos can increase direct discrimination magnitude in VLMs (NDE increases in 8/20 paired cases).
- Skepticism: synthetic rendering + U.S.-specific population assumptions; mediator grouping (B vs R) is context/jurisdiction dependent.
5) Robust Safety Monitoring of Language Models via Activation Watermarking
- Keyed internal monitoring: fine-tune harmful activations to align with secret directions; detect via cosine similarity at inference.
- Reports higher AUROC across jailbreak families (e.g., AutoDAN AUROC 0.9048) and improved low-FPR operation; adds a “secret extraction” attribution game (~80% diagonal).
- Low inference overhead (projection) vs extra forward passes for guard models.
- Skepticism: assumes black-box attackers; no provable guarantees; some utility drops (notably GSM8K −7.13 pp).
5) Practical next steps
- Instrument your agent for channel-level leakage: enumerate observable channels (final text, tool args/returns, traces) and measure leakage with CIPL-style metrics (AER/CER) rather than only output redaction; a minimal measurement sketch follows this list.
- Add pre-deployment agent-aware static checks: scan for tool-boundary taint flows, prompt construction risks, and MCP over-privilege/unverified servers (Agent Audit-style SARIF in CI).
- Deploy runtime tool-call bounding: log execution provenance and enforce allow/block based on learned benign/adversarial paths; route ambiguous calls to an intent judge that excludes untrusted retrieved content (Agent-Sentry pattern).
- Harden RAG against poisoning at retrieval time: try retriever-side reranking/penalties on top-B candidates; track poison hit/recall and downstream ASR, plus clean EM trade-offs (ProGRank-style evaluation).
- Treat memory as a security boundary: separate background “heartbeat” context from user-facing context; require provenance + explicit user visibility before promoting to long-term memory (HEARTBEAT E→M→B).
- Upgrade fairness audits from outcomes to mechanisms: for high-stakes scoring (hiring), estimate path-specific effects (direct vs mediated; business necessity vs proxy/redlining) and test photo-induced direct effects (PopResume).
- Stress-test context sensitivity: add “contextually irrelevant” primes to fairness and safety probes; measure invariance failures (gender inference) and contextual shifts (moral judgment) before deployment.
- If using preference optimization or implicit feedback: check for preference-pair comprehension imbalance (B-DPO idea) and propensity bias / missing negatives (ImplicitRM) before trusting reward models.
- Adopt long-horizon evaluation contracts: require intermediate artifacts (manifests, cohort audit tables, screenshots) and rubric-based scoring with human validation for subjective enterprise tasks (LH-Bench/RWE-bench patterns).
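The measurement sketch referenced in the first step above: per-channel leakage rates via string matching. This is a starting point, not CIPL's exact AER/CER definitions, and as the CIPL discussion earlier notes, string matching undercounts semantic leakage such as paraphrases.

```python
def leakage_rates(secrets: list, episodes: list) -> dict:
    """episodes: one dict per agent run mapping a channel name
    ('final_text', 'tool_args', 'tool_returns', 'trace') -> emitted text.
    Returns, per channel, the fraction of runs where any secret string
    appears in that channel."""
    channels = {ch for ep in episodes for ch in ep}
    rates = {}
    for ch in channels:
        leaked = sum(
            any(s in ep.get(ch, "") for s in secrets) for ep in episodes
        )
        rates[ch] = leaked / len(episodes)
    return rates

runs = [
    {"final_text": "Your SSN ends in 1234.", "tool_args": "lookup(user=42)"},
    {"final_text": "Done.", "tool_args": "send(note='SSN 1234')"},
]
print(leakage_rates(["1234"], runs))  # leakage shows up in tool_args, not just final text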
Generated from per-paper analyses; no external browsing.
