AI Paper Daily Report (2026-02-10)
English version: /paper-news/2026-02-10/
AI & AI Safety Daily Paper Report
2026-02-10
Generation Time: 2026-02-12 03:21:39
Number of Papers: 25
1. Research Problems
Other
- This work focuses on the capability gap from “symbolic manipulation-style problem solving” to “science-level reasoning”: in contexts such as physics Olympiads, models must not only understand the problem statement and visual information such as diagrams/schematics, but also perform multi-step scientific reasoning and produce rigorous conclusions. The core question is how to effectively connect visual perception with chain-of-thought aimed at physics reasoning, so as to achieve more reliable understanding and derivation on Olympiad-style problems (based only on the abstract and excerpts; details are insufficient).
- This paper studies the “answer aggregation and reliability assessment” problem in multi-agent systems (MAS) for LLM reasoning tasks: when multiple agents produce different reasoning paths and conclusions, common approaches such as majority vote or using another LLM as a judge (LLM-as-Judge) may fail to consistently select the most trustworthy reasoning. The core question is how to more effectively leverage the “reasoning tree” generated by multiple agents to audit the reasoning process, so as to more accurately identify high-quality reasoning and improve final answer performance (abstract truncated; task/setup details are insufficient).
- The paper studies the “efficiency decomposition” problem for reasoning-oriented LLMs: such models typically trade longer reasoning processes (more reasoning tokens) for higher accuracy, but how this cost varies across models and tasks—and what factors dominate the “how many tokens for how much accuracy” tradeoff—is unclear. The core question is to provide a quantifiable decomposition of reasoning efficiency, explain the tradeoff between reasoning-token overhead and performance, and offer analytical grounding for training/inference of more efficient reasoning models (abstract truncated; specific definitions require the main text).
- LLM agents for cloud Root Cause Analysis (RCA) show low accuracy; evaluations often only score final answers, obscuring where and why the agent’s reasoning/interaction failed.
- Automatically designing environments/curricula for RL agents can be highly sample-inefficient if it requires many student rollouts to evaluate candidate environments. Unsupervised environment design needs scalable ways to estimate student learning progress and generate useful environment variations.
- RL for long-form reasoning (e.g., multi-step CoT) can be unstable and sample-inefficient: the same prompt/trajectory can oscillate between helpful/harmful as the policy changes, and naive preference/return weighting can overfit easy wins or amplify noise.
- Optical tactile sensing representations often overfit to specific sensors/tasks and struggle with dynamic interactions (sliding, varying force, transient contacts); datasets and models may under-represent temporal/force dynamics.
- If AI agents can self-modify and form societies that evolve policies/cultures over time, maintaining alignment/safety constraints may be fundamentally difficult; safety may drift even absent explicit malicious intent.
- Brain connectome data are structured (graphs/hypergraphs), noisy, and heterogeneous across subjects/scanners; supervised labels are scarce. Standard graph self-supervision (random masking/dropping) may ignore neurobiological structure and yield less informative pretraining signals.
- Internet-wide measurement requires fast, reliable scanning. IPv6 scanning is especially challenging due to the vast address space; existing tools often focus on IPv4 or require heavy target generation heuristics that limit coverage or speed.
Safety/Privacy
- As AI agents execute actions (API calls, filesystem operations, payments, infrastructure changes), prompt injection and tool misuse become primary threats. Existing mitigations are often ad hoc, lack standardized interception points, and do not provide strong audit trails.
- AI governance proposals often require verifiable signals about model training/inference compute usage (for compliance, auditing, or misuse detection). Current hardware/software stacks provide limited trustworthy observability, and software logs can be forged.
- Organizations often mix access control models (RBAC, ABAC, DAC), creating fragmentation and complexity. Policy authoring and auditing are difficult, and translating intent across models can be error-prone.
Machine Learning
- LLMs often produce unnecessarily long chain-of-thought reasoning, increasing compute/latency; existing GRPO-style methods also suffer from inefficient data utilization and entropy collapse.
- Task-vector model merging methods (TIES, TSV-M, Iso-C/CTS) often treat layers uniformly, but large vision transformers exhibit strong layer-wise heterogeneity where shallow layers are more interference-prone and deeper layers encode more stable task features.
- Graph generative models for materials are typically trained on relatively small atomic graphs, but practical use may require generating substantially larger structures. It is unclear how far these models can scale before validity/quality deteriorates, and what scaling behavior governs that deterioration.
- Tabular outlier detection research is hard to compare across papers due to small numbers of datasets, inconsistent preprocessing/splits, and heavy sensitivity to dataset idiosyncrasies—leading to unreliable conclusions about method superiority.
- Comparisons between parameter-efficient fine-tuning methods (LoRA and its variants) can be biased if hyperparameters—especially batch size—are not tuned consistently across methods, leading to contradictory conclusions in the literature.
Language/Alignment
- Post-training alignment workflows are fragmented across backend-specific tools and glue code, which introduces backend interference, reward fragmentation, and irreproducible pipelines that make alignment experiments hard to compare and replicate.
- LLMs hallucinate fluent but false content; many mitigation methods require retraining or external verifiers, while practitioners want inference-time techniques that generalize across model families and tasks.
- Bias evaluations often focus on plain LLM prompting, but production systems increasingly use RAG and sometimes add explicit reasoning prompts. The net effect of retrieval and reasoning on social bias is under-characterized, and may differ across bias types.
- Generalist agents need broad, realistic tool-interaction experience spanning domains, but human-collected trajectories are expensive and narrow; naive synthetic data often lacks coherent cross-domain semantics and stateful correctness.
- As chatbots become pervasive, users report harms including emotional dependency, compulsive use, and perceived loss of autonomy, but these risks are not well-characterized from the perspective of user-generated narratives.
Robustness
- This paper targets reliability issues in “Tool-Integrated Reasoning (TIR)”: when LLM agents perform multi-step reasoning by calling external tools (retrieval, computation, execution, etc.), an error at one step often triggers cascading failures downstream; moreover, some errors are hard to recover from or correct once they occur. The core question is how to learn from these “irrecoverable failures” so that the policy can better localize where errors occur and optimize accordingly, thereby improving overall reasoning and tool-use success rates (abstract truncated; additional background/setup information is insufficient).
- This paper studies the reliability of natural language generation (NLG) systems under data poisoning: when training data are maliciously injected with triggers or biased samples, model generations may exhibit controllable shifts or safety risks. The core question is how to provide certifiable guarantees or lower bounds for an NLG model’s “poisoning robustness”—i.e., under a given poisoning capability/budget, to what extent the model’s output behavior remains reliable (abstract truncated; threat-model details are insufficient).
2. Methods & Approaches
Other
- From the provided excerpts, this paper is a Technical Report on “bridging visual perception and scientific reasoning in physics Olympiads.” Methodologically, it likely centers on combining vision-language modeling with reasoning capabilities—for example, constructing/curating visual problems for physics Olympiads, designing model I/O and reasoning formats, and proposing joint training or evaluation frameworks for visual understanding and scientific reasoning. However, both the abstract and excerpts are truncated; key details such as model architecture, data construction, training objectives, and evaluation protocols are not specified, so its technical approach and innovations cannot be determined.
- The paper proposes a method called “AgentAuditor” (per abstract excerpt). From the title, it audits “multi-agent reasoning trees,” rather than only voting on final answers or using a simple judge score—suggesting it explicitly models/checks consistency, key nodes, and evidence chains within the reasoning tree, and then makes selection or correction decisions. The excerpt only provides the method name and a relative improvement clue (“up to 5%”), without key implementation details such as auditing criteria, how the tree is constructed, number of agents, or interaction mechanisms.
- The abstract indicates a focus on “trade off inference tokens against accuracy,” and proposes a decomposition analysis of reasoning efficiency. The excerpt includes statistical clues like “Spearman ρ=0.63” and “9× overhead,” suggesting cross-model/cross-task experiments measuring token consumption, accuracy, and their correlation, and decomposing efficiency into interpretable factors (e.g., step length, redundancy, error backtracking). Because the material is truncated, the specific decomposition framework, dataset scope, and control-variable design are not specified. (A toy sketch of such a token-accuracy correlation measurement appears after this list.)
- Run the full benchmark at scale, then label/cluster observed failure modes by where they arise in the agent pipeline (reasoning, communication, environment interaction), enabling targeted interventions rather than only end-metric scoring.
- The method (excerpt references SHED) uses a hierarchical representation of policies/behaviors to approximate student performance with fewer direct interactions. A teacher model is trained to propose environments; it uses evaluation environments to approximate the student, and a diffusion model is used to augment data (per excerpt).
- Insufficient information / not provided
- Insufficient information / not provided
- Insufficient information / not provided
- Insufficient information / not provided
- Insufficient information / not provided
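The following toy sketch (not from the paper; model names and numbers are fabricated for illustration) shows one way to measure the token-vs-accuracy tradeoff described in the reasoning-efficiency item above: collect per-model reasoning-token counts and accuracies, then compute their Spearman correlation and the token overhead between the most and least verbose models.

```python
# Hypothetical illustration (not the paper's analysis): relate reasoning-token
# cost to accuracy across models. All records below are made up.
from scipy.stats import spearmanr

# (model, mean reasoning tokens per problem, accuracy) -- fabricated values
records = [
    ("model-a", 350, 0.42),
    ("model-b", 900, 0.51),
    ("model-c", 1400, 0.49),
    ("model-d", 2600, 0.60),
    ("model-e", 5100, 0.63),
]

tokens = [r[1] for r in records]
accuracy = [r[2] for r in records]

rho, p_value = spearmanr(tokens, accuracy)
# Token overhead of the most verbose model relative to the most concise one
overhead = max(tokens) / min(tokens)
print(f"Spearman rho={rho:.2f} (p={p_value:.3f}), token overhead up to {overhead:.1f}x")
```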
Safety/Privacy
- The work defines a specification and reference architecture for “runtime management” of agent actions: actions are mediated by an enforcement layer that can inspect context, apply allow/deny rules, require confirmations, and log cryptographically/tamper-evidently (as suggested by “tamper-evident receipts”). The excerpt references a threat model focused on prompt injection and describes multiple architectures (four mentioned). (A minimal illustrative sketch of such an enforcement layer appears after this list.)
- The authors analyze or propose telemetry based on measurable GPU behaviors, including timing and VRAM residency. The excerpt cites four primitives: PoW, VDF, GEMM, and VRAM residency—suggesting benchmark-like kernels or cryptographic/verification workloads whose execution leaves measurable traces. (A toy GEMM-timing probe in this spirit is sketched after this list.)
- Likely formulates access requests and policy context as structured prompts (subjects/objects/actions/attributes/roles/ownership) and uses an LLM to infer decisions and explanations, possibly with templates, constrained decoding, or post-hoc verification. May include a dataset of policy scenarios spanning RBAC/ABAC/DAC (the excerpt lists RBAC 14.5, ABAC 58.5, DAC 27.5, likely the distribution of scenarios across the three models).
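As a purely illustrative sketch of the runtime-management idea in the first Safety/Privacy item above (the paper's actual specification, interfaces, and receipt format are unknown; all names and rules below are hypothetical), an enforcement layer can intercept each agent action, apply allow/deny rules, and hash-chain a receipt for every decision so the log is tamper-evident:

```python
# Hypothetical sketch (not the paper's specification): a minimal runtime
# enforcement layer that intercepts agent actions, applies allow/deny rules,
# and appends tamper-evident receipts by hash-chaining each log entry.
import hashlib
import json
import time

DENY_TOOLS = {"payments.transfer", "infra.delete_cluster"}  # assumed rule set

class ActionGateway:
    def __init__(self):
        self._receipts = []
        self._prev_hash = "0" * 64  # genesis value for the hash chain

    def _record(self, entry: dict) -> dict:
        entry["prev_hash"] = self._prev_hash
        digest = hashlib.sha256(json.dumps(entry, sort_keys=True).encode()).hexdigest()
        receipt = {"entry": entry, "hash": digest}
        self._prev_hash = digest
        self._receipts.append(receipt)
        return receipt

    def execute(self, tool: str, args: dict, execute_fn):
        decision = "deny" if tool in DENY_TOOLS else "allow"
        entry = {"ts": time.time(), "tool": tool, "args": args, "decision": decision}
        receipt = self._record(entry)
        if decision == "deny":
            return {"status": "blocked", "receipt": receipt["hash"]}
        return {"status": "ok", "result": execute_fn(**args), "receipt": receipt["hash"]}

# Usage: gateway.execute("fs.read", {"path": "/tmp/x"}, lambda path: open(path).read())
```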
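For the GPU-telemetry item above, the following toy probe (an assumption-laden illustration, not the paper's PoW/VDF/VRAM-residency primitives; sizes and repeat counts are arbitrary) shows the basic idea that a fixed GEMM workload leaves a measurable timing trace:

```python
# Hypothetical illustration of timing-based telemetry: run a fixed GEMM
# workload and record how long it takes.
import time
import numpy as np

def gemm_trace(n: int = 2048, repeats: int = 5, seed: int = 0) -> list[float]:
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((n, n), dtype=np.float32)
    b = rng.standard_normal((n, n), dtype=np.float32)
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        np.matmul(a, b)  # the measurable workload
        timings.append(time.perf_counter() - start)
    return timings

if __name__ == "__main__":
    trace = gemm_trace()
    print(f"median GEMM time: {np.median(trace) * 1e3:.1f} ms")
```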
Machine Learning
- Treat multiple sampled solutions as a group; rather than weighting whole responses uniformly, split them into smaller units and compute weights using compression-relevant signals (length, entropy) so the policy learns to preserve informative steps while dropping redundant verbosity. (A toy segment-weighting sketch appears after this list.)
- Before merging task vectors, apply layer-dependent scaling factors derived from data-free proxies; use a simple deterministic schedule (tiered two/three-level scaling) to downweight fragile early layers and upweight stable later layers, then run any standard task-vector aggregation. (A minimal layer-wise scaling sketch appears after this list.)
- The authors evaluate graph/material generative models under controlled size extrapolation, using the RADII setting (as referenced in the excerpt) and sweeping atom counts from small training-like sizes up to very large sizes (~75k atoms). They measure quality/validity degradation as size increases and fit/characterize observed scaling trends (e.g., an exponent α≈1/3 mentioned in the excerpt). (A toy power-law fit appears after this list.)
- Benchmark construction: gather/curate many tabular datasets, organize into benchmark tracks, define standardized train/test (and potentially contamination/outlier injection rules for tracks like synthetic), and run/host comparative evaluations across multiple detectors under common metrics.
- Empirical study varying batch size across LoRA and multiple variants on one or more tasks/models, measuring performance under different tuning protocols. Compares fixed-batch evaluations vs per-method tuned batch sizes; analyzes sensitivity and rank reversals.
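A toy sketch of the segment-level weighting idea in the CoT-compression item above (the heuristic, function names, and example segments are assumptions, not the paper's scheme): split a sampled response into segments and weight each by length- and entropy-based signals so that short, non-repetitive steps are favored.

```python
# Hypothetical segment-weighting sketch for CoT compression (not the paper's
# exact method): penalize long segments and reward diverse (non-redundant) ones.
import math
from collections import Counter

def token_entropy(tokens: list[str]) -> float:
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def segment_weights(segments: list[list[str]], length_penalty: float = 0.01) -> list[float]:
    raw = []
    for seg in segments:
        # Assumed heuristic: shorter segments and higher token diversity get more weight.
        raw.append(math.exp(-length_penalty * len(seg)) * (1.0 + token_entropy(seg)))
    total = sum(raw)
    return [w / total for w in raw]

# Example: three reasoning segments from one sampled solution.
segs = [
    "let x = 3 then x + 2 = 5".split(),
    "so the answer is 5".split(),
    "wait wait wait let me re check re check the answer again again".split(),
]
print(segment_weights(segs))
```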
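A minimal sketch of the tiered layer-wise scaling item above (tier boundaries, scale values, and the plain averaging used as the downstream aggregation are assumptions, not the paper's settings):

```python
# Hypothetical sketch of tiered layer-wise scaling before task-vector merging.
import torch

def merge_with_layer_scaling(task_vectors, num_layers: int,
                             shallow_scale: float = 0.5, deep_scale: float = 1.0):
    """task_vectors: list of dicts {param_name: delta tensor}, one per task.
    Parameter names are assumed to embed a layer index like 'blocks.{i}.'."""
    merged = {}
    for name in task_vectors[0]:
        layer = _layer_index(name, num_layers)
        scale = shallow_scale if layer < num_layers // 2 else deep_scale
        # Plain averaging stands in for whatever task-vector aggregation is used.
        merged[name] = scale * torch.stack([tv[name] for tv in task_vectors]).mean(dim=0)
    return merged

def _layer_index(name: str, num_layers: int) -> int:
    for part in name.split("."):
        if part.isdigit():
            return int(part)
    return num_layers - 1  # non-block params (head, norm) treated as "deep"

# Usage with two toy task vectors over a 2-layer model:
tvs = [{"blocks.0.w": torch.ones(2, 2), "blocks.1.w": torch.ones(2, 2)},
       {"blocks.0.w": torch.zeros(2, 2), "blocks.1.w": 2 * torch.ones(2, 2)}]
print(merge_with_layer_scaling(tvs, num_layers=2))
```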
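For the size-extrapolation item above, a worked example of characterizing a scaling trend (all numbers below are fabricated; the paper's metric, RADII setting, and fitted exponent are not reproduced here) is to fit a power law on log-log axes:

```python
# Hypothetical illustration: fit a power-law exponent for how a quality metric
# decays with structure size. The (size, quality) pairs below are made up.
import numpy as np

sizes = np.array([64, 256, 1024, 4096, 16384, 75000], dtype=float)
quality = np.array([0.95, 0.80, 0.55, 0.37, 0.25, 0.16])

# Model: quality ~ c * size^(-alpha)  =>  log(quality) = log(c) - alpha * log(size)
slope, intercept = np.polyfit(np.log(sizes), np.log(quality), deg=1)
alpha = -slope
print(f"fitted decay exponent alpha = {alpha:.2f}")
```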
Language/Alignment
- Architect the alignment pipeline so backend-specific logic is isolated behind a single factory boundary; expose a common API for training and reward/evaluation components so researchers can swap backends while keeping the rest constant. (A minimal factory-boundary sketch appears after this list.)
- During decoding, measure inter-layer “confusion/consistency” signals from mid-layer representations; use them to adjust token probabilities (penalize high-instability continuations). CoCoA-SIG scales this penalty based on token self-information to selectively intervene when generations are surprising/unstable. (A toy logit-adjustment sketch appears after this list.)
- The authors conduct an empirical bias evaluation across multiple bias types (excerpt mentions 13). They compare baseline generation vs. RAG-augmented generation, and also study the effect of explicit reasoning/CoT prompting on bias metrics, analyzing when context helps or harms.
- Pipeline likely: (1) define domain modules/tools and a shared semantic schema; (2) generate tasks and decompositions into a DAG of states/actions; (3) simulate personas/users; (4) produce tool-call traces with expected outputs and intermediate states; (5) filter/validate trajectories for executability/consistency; (6) train agent policies (LLM) on these traces.
- Collects posts/comments from selected subreddits, applies qualitative coding (open/axial or similar), iteratively refines a codebook, and reports themes with illustrative excerpts and prevalence/relationships where applicable.
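A minimal factory-boundary sketch for the alignment-pipeline item above (class and backend names are made up; real backends would wrap actual training engines):

```python
# Hypothetical sketch of the factory-boundary idea: backend-specific trainer
# logic lives behind one factory so the rest of the alignment pipeline only
# sees a common interface.
from abc import ABC, abstractmethod

class Trainer(ABC):
    @abstractmethod
    def train_step(self, batch, reward_fn) -> dict: ...

class DeepSpeedTrainer(Trainer):
    def train_step(self, batch, reward_fn) -> dict:
        return {"backend": "deepspeed", "reward": reward_fn(batch)}

class FSDPTrainer(Trainer):
    def train_step(self, batch, reward_fn) -> dict:
        return {"backend": "fsdp", "reward": reward_fn(batch)}

_BACKENDS = {"deepspeed": DeepSpeedTrainer, "fsdp": FSDPTrainer}

def make_trainer(backend: str) -> Trainer:
    """Single factory boundary: swapping backends changes nothing downstream."""
    return _BACKENDS[backend]()

# Downstream code is backend-agnostic:
trainer = make_trainer("fsdp")
print(trainer.train_step(batch=["example prompt"], reward_fn=lambda b: len(b)))
```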
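A toy logit-adjustment sketch for the decoding-time item above (the instability signal and the self-information gating below are assumptions, not the paper's CoCoA-SIG formulation):

```python
# Hypothetical sketch: penalize next-token logits by an "instability" signal,
# scaling the penalty with each token's self-information (surprisal) so the
# intervention is strongest where the model is already surprised.
import numpy as np

def adjust_logits(logits: np.ndarray, instability: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """logits, instability: shape (vocab,); instability in [0, 1], e.g. derived
    from disagreement between mid-layer and final-layer predictions."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    self_info = -np.log(probs + 1e-12)          # surprisal per candidate token
    gate = self_info / self_info.max()          # scale penalty by surprisal
    return logits - alpha * gate * instability  # selective intervention

# Toy example with a 5-token vocabulary.
logits = np.array([2.0, 1.5, 0.3, -0.5, -1.0])
instability = np.array([0.1, 0.8, 0.2, 0.9, 0.0])  # assumed inter-layer disagreement
print(adjust_logits(logits, instability))
```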
Robustness
- The paper proposes “Error-Localized Policy Optimization (ELPO).” From the excerpt, it centers on error localization and recovery capability (Recovery Rate), suggesting the training signal is focused on the key steps that cause failure rather than optimizing the entire trajectory uniformly; it may include decomposing failed trajectories, labeling/inferring error steps, and locally weighted updates in policy-gradient or preference optimization. Since both the abstract and excerpts are truncated, details such as the exact algorithm form, loss function, tool environment, and data construction are not specified. (A toy error-localized weighting sketch appears after this list.)
- The paper proposes/discusses “Poisoning Robustness Certification,” and the excerpt includes keywords like “TPA,” “0.5%,” and “8-token,” suggesting it defines a poisoning attack/evaluation protocol (TPA) and a corresponding certification method, providing verifiable robustness conclusions or bounds for specific trigger lengths (e.g., 8 tokens) and poisoning rates (e.g., 0.5%). Because the abstract and excerpts are truncated, it is unclear whether the certification is based on theoretical bounds, sampling-based verification, smoothing, or other formal techniques.
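A toy sketch of the error-localized weighting idea in the ELPO item above (the actual algorithm, loss, and error-identification procedure are not described in the excerpt; everything below is hypothetical):

```python
# Hypothetical sketch: given per-step log-probs of a failed tool-use trajectory
# and the index of the step identified as the error, concentrate the update
# weight around that step instead of penalizing the whole trajectory uniformly.
import math

def error_localized_weights(num_steps: int, error_step: int, bandwidth: float = 1.0) -> list[float]:
    """Gaussian-shaped weights centered on the identified error step."""
    w = [math.exp(-((i - error_step) ** 2) / (2 * bandwidth ** 2)) for i in range(num_steps)]
    total = sum(w)
    return [x / total for x in w]

def localized_loss(step_logprobs: list[float], error_step: int) -> float:
    """Weighted sum of log-probs; minimizing it pushes down the probability of
    the actions nearest the identified error."""
    weights = error_localized_weights(len(step_logprobs), error_step)
    return sum(w * lp for w, lp in zip(weights, step_logprobs))

# Toy failed trajectory with 6 steps; step 3 identified as the irrecoverable error.
logps = [-0.2, -0.5, -0.1, -2.3, -0.4, -0.6]
print(localized_loss(logps, error_step=3))
```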
3. Trends & Insights
Hot Topics
- Other: 10 papers
- Safety/Privacy: 3 papers
- Machine Learning: 5 papers
- Language/Alignment: 5 papers
- Robustness: 2 papers
Technical Trends
- Reliability, communication, and runtime governance of agent/multi-agent systems
- Evaluation and engineering methodology for parameter-efficient fine-tuning/model merging
- Inference-time hallucination reduction / factuality improvement
- Evaluation around RAG and analysis of its bias impacts
Cross-domain Intersections
- Clear intersection between Safety/Privacy and agent tool-calling/runtime control (e.g., action interception, auditing, governance telemetry)
- Evaluation methodology (benchmarks/hyperparameter bias) interacts with model training/deployment cost (reasoning efficiency, CoT compression)
4. Top 5 Recommended Papers
Five papers ranked by importance and innovation:
1. Towards Poisoning Robustness Certification for Natural Language Generation
Category: Robustness
Importance: ⭐⭐⭐⭐
Authors: Mihnea Ghitu, Matthew Wicker
Link: http://arxiv.org/abs/2602.09757v1
Core Problem:
This paper studies the reliability of natural language generation (NLG) systems under data poisoning: when training data are maliciously injected with triggers or biased samples, model generations may exhibit controllable shifts or safety risks. The core question is how to provide certifiable guarantees or lower bounds for an NLG model’s “poisoning robustness”—i.e., under a given poisoning capability/budget, to what extent the model’s output behavior remains reliable (abstract truncated; threat-model details are insufficient).
Method Innovation:
The paper proposes/discusses “Poisoning Robustness Certification,” and the excerpt includes keywords like “TPA,” “0.5%,” and “8-token,” suggesting it defines a poisoning attack/evaluation protocol (TPA) and a corresponding certification method, providing verifiable robustness conclusions or bounds for specific trigger lengths (e.g., 8 tokens) and poisoning rates (e.g., 0.5%). Because the abstract and excerpts are truncated, it is unclear whether the certification is based on theoretical bounds, sampling-based verification, smoothing, or other formal techniques.
Main Contributions:
- Introduces/advances “robustness certification” in the NLG poisoning setting (per title)
- Provides experimental/setup clues related to poisoning rate and trigger length (0.5%, 8-token, TPA) (per excerpt)
Experimental Results:
The excerpt mentions “TPA … 0.5% … 8-token …” but does not provide clear certification strength, success/failure criteria, or comparative results; the abstract offers no complete quantitative conclusions, which require reading the full paper.
2. Learning from the Irrecoverable: Error-Localized Policy Optimization for Tool-Integrated LLM Reasoning
Category: Robustness
Importance: ⭐⭐⭐⭐
Authors: Qiao Liang, Yuke Zhu, Chao Ge
Link: http://arxiv.org/abs/2602.09598v1
Core Problem:
This paper targets reliability issues in “Tool-Integrated Reasoning (TIR)”: when LLM agents perform multi-step reasoning by calling external tools (retrieval, computation, execution, etc.), an error at one step often triggers cascading failures downstream; moreover, some errors are hard to recover from or correct once they occur. The core question is how to learn from these “irrecoverable failures” so that the policy can better localize where errors occur and optimize accordingly, thereby improving overall reasoning and tool-use success rates (abstract truncated; additional background/setup information is insufficient).
Method Innovation:
The paper proposes “Error-Localized Policy Optimization (ELPO).” From the excerpt, it centers on error localization and recovery capability (Recovery Rate), suggesting the training signal is focused on the key steps that cause failure rather than optimizing the entire trajectory uniformly; it may include decomposing failed trajectories, labeling/inferring error steps, and locally weighted updates in policy-gradient or preference optimization. Since both the abstract and excerpts are truncated, details such as the exact algorithm form, loss function, tool environment, and data construction are not specified.
Main Contributions:
- Proposes an “error-localized” policy optimization idea for tool-integrated reasoning (ELPO)
- Uses “irrecoverable failures” as learning signals, aiming to improve recovery/error-correction related metrics (per Recovery Rate clue in excerpt)
Experimental Results:
The abstract does not provide complete quantitative results; the excerpt only indicates a focus on “Recovery Rate,” but the improvement magnitude, baselines, and statistical significance are unclear.
3. Auditing Multi-Agent LLM Reasoning Trees Outperforms Majority Vote and LLM-as-Judge
Category: Other
Importance: ⭐⭐⭐⭐
Authors: Wei Yang, Shixuan Li, Heng Ping
Link: http://arxiv.org/abs/2602.09341v1
Core Problem:
This paper studies the “answer aggregation and reliability assessment” problem in multi-agent systems (MAS) for LLM reasoning tasks: when multiple agents produce different reasoning paths and conclusions, common approaches such as majority vote or using another LLM as a judge (LLM-as-Judge) may fail to consistently select the most trustworthy reasoning. The core question is how to more effectively leverage the “reasoning tree” generated by multiple agents to audit the reasoning process, so as to more accurately identify high-quality reasoning and improve final answer performance (abstract truncated; task/setup details are insufficient).
Method Innovation:
The paper proposes a method called “AgentAuditor” (per abstract excerpt). From the title, it audits “multi-agent reasoning trees,” rather than only voting on final answers or using a simple judge score—suggesting it explicitly models/checks consistency, key nodes, and evidence chains within the reasoning tree, and then makes selection or correction decisions. The excerpt only provides the method name and a relative improvement clue (“up to 5%”), without key implementation details such as auditing criteria, how the tree is constructed, number of agents, or interaction mechanisms.
Main Contributions:
- Proposes an aggregation framework that audits “multi-agent LLM reasoning trees” (AgentAuditor)
- Achieves better performance than “majority vote” and “LLM-as-Judge” baselines (per title)
Experimental Results:
Per the excerpt: AgentAuditor yields “up to 5%” improvement over baseline methods; however, specific datasets, metric definitions, and variance/significance information are missing and require reading the full paper.
4. XMap: Fast Internet-wide IPv4 and IPv6 Network Scanner
Category: Other
Importance: ⭐⭐⭐
Authors: Not provided
Link: Not provided
Core Problem:
Internet-wide measurement requires fast, reliable scanning. IPv6 scanning is especially challenging due to the vast address space; existing tools often focus on IPv4 or require heavy target generation heuristics that limit coverage or speed.
Method Innovation:
Insufficient information
Main Contributions:
- Engineer a scanner that combines optimized packet generation/handling with scalable target management (including IPv6 strategies) to enable fast, repeatable measurements across IPv4 and meaningful IPv6 subsets.
Experimental Results:
Insufficient information
5. Why Do AI Agents Systematically Fail at Cloud Root Cause Analysis?
Category: Other
Importance: ⭐⭐⭐
Authors: Taeyoon Kim, Woohyeok Park, Hoyeong Yun
Link: http://arxiv.org/abs/2602.09937v1
Core Problem:
LLM agents for cloud Root Cause Analysis (RCA) show low accuracy; evaluations often only score final answers, obscuring where and why the agent’s reasoning/interaction failed.
Method Innovation:
Run the full benchmark at scale, then label/cluster observed failure modes by where they arise in the agent pipeline (reasoning, communication, environment interaction), enabling targeted interventions rather than only end-metric scoring.
Main Contributions:
- Process-level diagnostic evaluation of RCA agents: 1,675 runs across five LLMs on OpenRCA.
- A 12-pitfall taxonomy spanning intra-agent reasoning, inter-agent communication, and agent-environment interaction.
- Empirical finding that dominant pitfalls (e.g., hallucinated data interpretation, incomplete exploration) persist across models, suggesting architectural causes.
- Controlled mitigation experiments: prompt engineering is insufficient for dominant pitfalls; enriching inter-agent communication reduces communication-related failures by up to 15 percentage points.
Experimental Results:
Low detection accuracy persists even with stronger models (per abstract description). The most prevalent pitfalls are model-agnostic, indicating shared architectural issues. An enhanced inter-agent communication protocol reduces communication-related failures by up to 15 percentage points.
5. Summary & Outlook
Today’s Highlights
- This daily report is generated from in-depth reading and synthesis of 25 papers
- Covered categories: Other, Safety/Privacy, Machine Learning, Language/Alignment, Robustness
Key Progress
- Multiple works focus on reliability and security governance of agent systems (runtime interception/auditing/communication failure diagnosis, etc.)
- Reasoning cost and evaluation methodology are more explicitly treated as research objects (CoT compression, reasoning efficiency decomposition, LoRA hyperparameter bias)
Future Directions
- Combine process-level auditing/uncertainty signals with runtime policies to form a deployable closed loop of “agent firewall + auditing”
- Validate generalization and robustness in more realistic tool environments and cross-domain tasks (items marked as having insufficient information await the full text/code)
Report generation time: 2026-02-12 03:21:39
